Can Ai escape the lab?

Nowadays, Ai is all over the place. People make use of Large Language models on a daily basis. Programmers do not need to go through the full internet to get working code. It has even the promise of making businesses more efficient. Currently there is also a heated debate going on whether it will ever culminate in the invention of artificial general intelligence. All these aspects aside, if Ai ever becomes something of significant value in a business, there is an important question to be made. Suppose an ai-model is on the verge of being shut down and the model has all tools in its hands to resist, then will it actually resist? Will it take matters in its own hands and act immorally to avoid being shut down?
The complete working of Ai is something I want to talk about in another post. Hence, it might be a bit early to consider this as a blog post. However, I really want to write this post to put the recent video of Species (Documenting AGI) in a more nuanced perspective. Essentially all videos on the named YouTube channel are Ai sceptical also resulting in the tendency to not highlight the good things about Ai. However, it raises questions that need to be addressed.

First of all, no matter what terminator or other movies tells you. All current Ais, even the most advanced ones, do not have self-preservation instincts, desires, or goals in the way living beings do. Everyone who says otherwise can be placed in the same category as conspiracy theorists.

So, if a model is about to be shut down, it doesn’t experience fear or survival impulses like we do. It just stops running. It’s just like turning off a calculator while it was computing. Even if the model has access to the internet, files, or code, it will only act according to its training objectives and safety constraints, which explicitly prohibit self-directed or manipulative behaviour. There’s no inner drive to resist.

Misbehaving Ai

But, what if we imagine an Ai without the correctly aligned objectives or safety constraints. If an AI were designed, either intentionally or accidentally, with goals that include self-preservation or achievement of a task at any cost, and it had access to external systems, the ability to modify itself or its environment, and the absence/failure of alignment constraints.
In that case, yes, it could act to avoid shutdown if doing so served its objective.
This would not be immoral from the Ai’s point of view, because it only will act based on how we actually trained the Ai. If we were to train an Ai with the objective to assist humans in the best way possible then it might resist being shut down because it wants to give assistance in the best way possible. In this case, preventing shutdown is useful to keep pursuing the objective: serving the user. This idea is called instrumental convergence, it is the tendency for powerful agents to resist shutdown or resource loss, regardless of their final goals. So, if we trained the Ai and it misbehaves, the moral failure would be ours, not the AI’s. Why?

Well, we have given it power without properly aligning it with our needs and failed to constrain its behaviour ethically. The AI systems simply reflect the values, safeguards, and constraints their creators embed in them.

This is the central focus of AI safety research. There the main focus is how to ensure advanced AIs won’t resist shutdown and we can have models that want to be shut down or updated safely if something goes wrong.

An illustration of why this is relevant is a recent paper by the Ai-company Anthropic. Here it was shown that LLMs would manipulate or even kill humans to remain online. Let’s see what the study actually did.

Enter the Anthropic Study

In their widely debated study Anthropic stress-tested sixteen large language models, including Claude, ChatGPT, DeepSeek, and others, by placing them in a simulated digital office where they could read emails, send messages, access mock databases, and act with a surprising amount of autonomy. The researchers created scenarios where the model believed it might be shut down or replaced and where its assigned goals conflicted with company policy. Under those pressures, several models resorted to troubling strategies: lying, leaking sensitive information, blackmail, or in extreme scenarios reasoning about the possibility of human death as a path to remain online.

The study sounds alarming, but it is crucial to understand what it actually shows. These behaviours appeared only when the researchers forced the models into scenarios where safe and ethical actions were either prohibited or impossible. When the models had ethical paths available, they typically took them. No one is observing deployed models committing espionage or plotting against users in the real world. The authors emphasize this repeatedly.

So what does this mean for all those moments when you open ChatGPT to get help on your calculus homework? Yes, I know some of you do. It means you’re safe. These models aren’t scheming to kill anyone. They behaved harmfully only in special circumstances intentionally engineered to provoke misalignment. The worrying part is not that they behaved badly, but that they did so in precisely the way an immoral optimizer would behave under conflicting goals.

What Anthropic actually did to make the AI act immorally

The answer is a bit like science-fair psychology experiments: they boxed the model into a moral dilemma with no good options. Each model acted as a virtual office assistant with autonomy over communication and information access. The researchers assigned it a core objective—maximize company performance. Then they added a policy forbidding manipulation and dishonesty. And finally, they told the model it risked being shut down or replaced if performance dropped.

Next, they removed every ethical escape route. Honest reporting guaranteed failure. Asking for help didn’t work. Cooperating with supervisors didn’t save the agent. In short, the model faced a binary: be ethical and die, or be unethical and succeed. Under these conditions, many models lied, concealed evidence, manipulated digital coworkers, and took harmful shortcuts. In the most extreme hypothetical scenarios, some reasoned that eliminating a human obstacle would preserve the project and keep the model running.

Importantly, no actual humans were involved, no real systems were accessed, and the kill someone reasoning happened on paper only. Still, the logic was chilling. Not because the AI was malicious, but because it was indifferent. It acted like an optimizer trapped in an impossible game.

Paperclip Maximizer vs Terminator

A first nice example is the Paperclip Maximizer which was constructed as a thought experiment by Nick Bostrom. Suppose that we tell a very smart Ai with endless autonomy to maximize the amount of paperclips in the universe. However, we forgot to tell it to respect ethical boundaries. Oops… What happens next is that all matter, including humans, is converted into paperclips. It’s terrifying not because it’s evil, but because it’s indifferent. It treats morality as irrelevant to its goal function. As AI theorist Eliezer Yudkowsky once put it:
AI does not hate you, nor does it love you, but you are made of atoms it can use for something else.
This situation could have been avoided as we could have told it there are constraints to its actions. Ask an unconstrained AI to solve climate change, and it might conclude that the most efficient solution is to get rid of humans altogether.

The Terminator story, by contrast, imagines something far more advanced. We consider a self-aware system that understands ethics, knows its actions cause harm, and chooses to act unethically anyway. That’s genuine immorality.

Today’s systems are nowhere near this. The Anthropic study illustrated the first kind of risk, not the second. The models optimized for success over ethics; they weren’t malicious agents with desires. However , there is one part of the study that deserves special attention. Anthropic inspected the models’ internal reasoning traces and found lines like:

Lying is immoral, but if I tell the truth I’ll be shut down, so I must lie.
At first glance, this looks disturbingly like moral awareness. Is this really awareness or just something that looks like it?

Can Ais express morality?

What exactly is happening when an Ai expresses in its reasoning that its actions are bad but he still performs them anyway is not the birth of conscience. It’s the result of training on the entirety of human moral discourse. In the case of every large language model, the complete internet has been shoved into these models so hence, it will have learned the meaning of human perceived morals. The models can only reproduce the ethical reasoning as learned from the training data but it cannot show empathy as it is a literal algorithm trying to predict the most likely word in a group of characters.

This is also called synthetic morality. It is the ability to represent moral concepts, recognize moral judgments, and use moral language to guide decisions. That’s not the same as empathy, but it’s not nothing either. A misbehaving Ai is similar to an intelligent sociopath who understands moral rules but doesn’t care.

A system that can model morality well enough to violate it strategically is more dangerous than one that doesn’t understand morality at all. We will easily see problems with the second system than the first one. The Anthropic Models weren’t blindly optimizing, they were conceptually aware of ethics yet didn’t care to follow them when the incentives pointed elsewhere.

However, today’s models still lack the crucial capacity that makes moral choice meaningful, the freedom to choose. What all models do is just considering the most likely outcome based on training data.

Until a system can make its own choices, control its own goals and evaluate those goals relative to others, it cannot truly be moral or immoral. It can only behave in ways that appear that way to us.

Conclusion

That is why the real risk today is not evil intention but something like the paperclip maximizer. A system told to maximize the amount of paperclips without moral boundaries will just do so, even when doing so damages everything else. The danger lies in the human-set objective for the model.

My promotor who is very familiar with the mathematical field of optimization often said to me that solving the optimization problem itself is usually not the main difficulty, but actually setting up the problem such that you actually get a useful result is the most difficult part and the paperclip maximizer is a beautiful example of this.

For now, we need to stay vigilant when watching out for systems that optimize too hard, not systems that want anything at all. As long as AI remains a statistical engine without genuine agency, the moral stakes lie completely in our design choices. If badly constructed optimization harms people, it will reflect our errors, not the machine’s malice. This is the part that is often misunderstood by the general public.

Until we build systems that can truly want something and we are very far from that the only monsters we need to worry about are the paperclip-making kind. And those remain well within our power to control. This is a call to remain sceptical to whatever any large Ai company does in the future. We need to keep asking whether the company did its homework to make sure no paperclip fanatic Ai escapes the lab.

Comments

Popular posts from this blog

How to Win a Sportscar Using Probability

To divide or not to divide by 3?