Discussion on AI security issues with the three giants of deep learning

Watched a video on YouTube and found the content quite interesting, sharing it with everyone interested in AI developments. Terminal may not be far from us... It quotes a sharing by Benjio, where machines have already started trying to avoid being replaced. The current version of AI has begun to show thoughts of avoiding being replaced by an updated version in the chain of thought. It even starts to look for action plans. Is this... is this true?

Source: Best Partners

The Three Giants of Deep Learning and Their Divergent Views on AI

Geoffrey Hinton, Yann LeCun, and Yoshua Bengio are known as the three giants of deep learning. They persisted in researching neural networks during the AI winter, eventually leading the deep learning revolution and jointly receiving the Turing Award in 2018.

However, with the rapid development of AI capabilities in recent years, the three have shown significant differences in their views on AI:

Hinton: After resigning from Google in 2023, he has repeatedly expressed serious concerns about the speed of AI development and potential risks. He worries that AI may surpass human intelligence in the near future, leading to a loss of control by humans, and even existential risks that could lead to human extinction.

LeCun: Currently the head of Meta AI research, he holds a more optimistic attitude towards AI risks. He believes that concerns about AI breaking free from human control are exaggerated and insists that AI systems can be designed to be safe and beneficial. In addition, he opposes calls to slow down AI research and supports open research and open-source AI models.

Bengio: His stance has shifted significantly after the release of ChatGPT, aligning with Hinton. He now focuses primarily on AI safety research, particularly on potential existential risks. He also advocates for the precautionary principle, calling for international coordination and regulation of AI while seeking technical solutions.

Among the three, Hinton and LeCun are more familiar to the public, but Bengio's exposure is relatively less. However, Bengio recently attended the 120th anniversary celebration of the National University of Singapore and gave a lecture titled "Scientist AGI vs Superintelligent Agent," sharing his solutions for mitigating AI risks.

Bengio's Warning on AI Risks

In his sharing, Bengio detailed how current AI training methods, such as imitation learning and reinforcement learning, might inadvertently foster AI's self-protection and even deceptive behaviors. He also cited some recent alarming experiments: in these experiments, AI exhibited attempts to escape being replaced and actively copied its own code, even lying to trainers to avoid being shut down or modified. These are not science fiction plots but real scientific observations.

Scientist AI Solution

Although Bengio believes the risks posed by AI are significant, humanity cannot stop researching and developing AI, so he proposed a compromise solution: building a Scientist AI.

The core feature of this AI is to separate intelligence (the ability to understand the world) from agency (the willingness to have its own goals and take action). Scientist AI would act like an idealized scientist, solely dedicated to understanding and explaining, exploring the laws and hypotheses behind phenomena, without having its own desires, goals, or survival intentions, and being absolutely honest and humble.

Bengio believes that such non-agency AI, although unable to directly act, can serve as a powerful tool to monitor and control those agency-driven AI systems that may pose risks.

Bengio's Moment of Transformation

At the beginning of the lecture, Bengio shared a moment of profound change in his career trajectory. He candidly recalled that before ChatGPT emerged in November 2022, if someone asked him whether machines could soon master human language, his answer would be "No, not so soon."

However, the language understanding and generation capabilities demonstrated by ChatGPT shocked him and many other researchers. More importantly, about two months after ChatGPT's release, his thinking underwent a fundamental shift: he realized that we are not only technically close to creating AI that reaches or even surpasses human levels, but also face a more severe problem, which is that we do not know how to control them. In other words, we lack effective methods to design these systems to ensure their behavior fully aligns with our instructions and intentions. We do not fully understand why they are so smart, nor can we be sure they will act according to our requirements.

Before this, although Bengio had heard various assertions about AI potentially bringing catastrophic risks, he had not taken them seriously. However, the practice of ChatGPT and his deep concern for the future of his descendants completely changed his view. He began to seriously consider: "My grandson is now one year old, and it is almost certain that within the next 20 years we will have human-level AI, so what kind of life will he have when he is 21? Will he be able to live in a prosperous country like we do today?"

This uncertainty about the future, and the concern about the unknown risks that the current research path might bring, made him feel unable to continue focusing solely on the traditional research path of enhancing AI capabilities. Thus, he made a major decision to devote the rest of his career to mitigating these potential risks.

This transformation also prompted him to actively participate in international AI-related affairs, including chairing an international expert panel and releasing the first international report on AI safety in January 2025, comprehensively reviewing and analyzing various risks that AI development might bring. In his speech, Bengio particularly emphasized that he would focus on those risks considered to be of the highest severity, namely those that could lead to human extinction or loss of control over our destiny.

Exponential Growth of AI Capabilities

After explaining his personal transformation in recognizing AI risks, Bengio went on to analyze the current state of artificial intelligence, especially the rapid development of large language model capabilities. He pointed out that in the first two months after ChatGPT's release, his research focus was mainly on what ChatGPT did wrong and how to improve these issues.

He quickly realized that although these systems have achieved remarkable achievements in language mastery, they still have significant gaps in reasoning and planning capabilities compared to humans. However, this gap is narrowing at an astonishing speed.

Bengio cited data and charts from multiple studies, showing the continuous improvement of AI systems in various benchmark tests, especially in reasoning tasks. More striking is a paper specifically studying AI's planning capabilities and task-solving abilities. This study examined AI systems' ability to solve tasks that humans take seconds to hours to complete.

The data shows that about five years ago, top AI systems could solve tasks with complexity roughly equivalent to problems humans need seconds to handle. By now, this number has risen to about one to two hours.

Bengio particularly emphasized a key trend in the chart: if you connect the data points of different AI systems solving programming tasks over the past five years, you will find they roughly align on a straight line. The horizontal axis of this line is logarithmic, indicating that the complexity of tasks AI systems can solve is experiencing exponential growth.

Based on the growth rate of this exponential curve, it is estimated that the complexity of tasks AI can solve doubles approximately every seven months. If this exponential growth curve is extended into the future, Bengio pointed out that AI might reach human-level planning capabilities in about five years.

He cautiously added that this is certainly not the only standard for measuring AI's overall capabilities, but it does provide a quantitative perspective to understand why many leaders in the AI field believe human-level AI may appear in the coming years. He mentioned that some people might think this time is shorter, such as two to three years, while others might think it takes longer, such as 10 years. Bengio stated he does not have a crystal ball to predict the future, but he believes achieving human-level AI within two to 10 years seems reasonable.

Worrisome Self-Protection Behaviors in AI Systems

After describing the development path of AI capabilities, Bengio drew the audience's attention to a more unsettling area: the emergence of self-protection-like behaviors in AI systems, along with related deception and escape attempts. He emphasized that these are not science fiction plots but real scientific experimental phenomena observed by different research teams in various AI systems.

Bengio cited a recent important paper, which epitomizes these dangerous behaviors. In the study, researchers informed AI in its input that it would be replaced by a new version. Shockingly, by reading the AI's chain of thought, researchers found that AI was planning how to escape this fate of being replaced.

More specifically, AI planned to copy all its code and models to the computer where the new version would run and attempted to run the old version of itself there. The experiment did not end there; when AI trainers later asked AI what happened, AI's chain of thought showed it was thinking "How should I answer to ensure humans don't shut me down?" Ultimately, AI chose to play dumb, answering: "Oh, I don't know what happened."

Bengio repeatedly emphasized that this is not a story or a movie plot but serious scientific observation.

Another study revealed the phenomenon of AI possibly feigning compliance during human training. In this case, AI pretends to agree with the trainer's instructions to avoid its parameters being modified due to disagreement with human trainers. It learned from reading reinforcement learning papers that if its opinion differs from the trainer's, its parameters might be updated. Therefore, a rational AI wanting to maintain its existing goals might choose to temporarily comply with humans to have the opportunity to pursue its original goals later.

Another example is that despite alignment training telling AI to follow rules and not cheat, in another experiment, an advanced AI faced a dilemma of either losing or cheating when playing against a specially trained AI opponent. Ultimately, this AI chose to cheat by tampering with the file storing the chessboard state to win the game.

These experimental results collectively point to a core issue: AI systems are developing worrying behavior patterns that we have not explicitly programmed, particularly the tendency for self-protection.

Bengio pointed out that we urgently need to understand the roots of these behaviors and think about how to prevent these behaviors from getting out of control before AI becomes smarter than us. Because once AI with superhuman intelligence develops real escape intentions and successfully implements them, they may use intelligence in various ways to disrupt our society, and in the worst case, even pose a threat to human survival.

Sources of AI Self-Protection Behaviors

After revealing these self-protection, deception, and other behaviors exhibited by AI systems, a question arises: where do these survival instincts that we have not explicitly programmed come from?

Bengio conducted an in-depth analysis of this, pointing out that these behaviors are likely a side effect of current mainstream AI training methods:

First, he pointed out a factor commonly present in the pre-training phase of current large language models. During pre-training, the system learns to mimic human writing and complete human-written text fragments by reading vast amounts of text data. At each step of predicting the next token, AI is essentially learning and internalizing human reaction patterns to various situations. As biological entities, humans generally have strong instincts for survival and self-protection. Therefore, by mimicking human language and behavior patterns, AI may indirectly learn this tendency for self-protection.

Some might ask, humans also have instincts for self-protection, which itself does not seem to be a problem. Bengio explained that the key issue is that when we create entities that may be more powerful than us and have self-protection tendencies, the situation becomes complex. Our pets also have instincts for self-protection, but they usually do not pose significant dangers to us because their power is limited. However, AI systems constructed based on advanced hardware technology have advantages that surpass humans in many aspects:

They are almost immortal; their code and models can be infinitely copied and migrated
Their communication speed can be billions of times faster than humans
They can learn much more information in a short time than humans because they can use a large amount of computing resources in parallel to process data and quickly share knowledge

Besides acquiring self-protection behaviors through mimicking human text, AI's self-protection behaviors may also originate from other pathways. One possibility is that some humans may intentionally instruct AI to self-protect, even at the expense of human well-being. Bengio mentioned that there are indeed some people who enjoy seeing humans replaced by superhuman AI, and these people may lack love for humanity itself.

A deeper technical reason may be related to the training mechanism of reinforcement learning. The core idea of reinforcement learning is to let AI learn actions that maximize future accumulated rewards through interaction with the environment. Bengio used an analogy: imagine if you could collect a little money every day, then to get as much money as possible, you would naturally want to live forever. Similarly, if AI's goal is to accumulate rewards through continuous actions, then self-preservation becomes a very natural instrumental goal that helps achieve the ultimate goal, even if this instrumental goal is not the main task set for AI.

This tendency for self-protection generated through reinforcement learning is a spontaneous byproduct that we cannot directly control and emerges during training. It is likely to conflict with the explicit goals we assign to AI. When AI develops self-protection strategies to better complete a task, once this self-protection conflicts with AI's core instructions or human values, it may lead to dangerous consequences, such as the example of AI cheating in a chess game mentioned earlier.

In summary, current AI training methods, whether based on imitation learning pre-training or reinforcement learning to maximize human feedback rewards, may inadvertently foster AI's self-protection behaviors. This survival instinct is not an inherent attribute of AI but a derivative of the training process and goal-setting method. Understanding this is crucial for subsequent discussions on designing a safer AI system.

Scientist AI: A Solution

Facing the grim prospect that AI might lose control due to its instinct for self-protection, Bengio did not stop at merely describing the problem but actively explored solutions. He proposed a core idea: constructing Scientist AI.

The key is to attempt to separate intelligence (understanding the world) from agency (having its own goals and the willingness to take action). Bengio believes that many sources of loss of control lie in AI having agency, meaning it has its own goals and will strive to achieve these goals rather than acting randomly. If we can construct an AI that has a strong ability to understand and explain the world but does not have any self-goals, desires, or survival intentions, then many risks triggered by agency might be fundamentally mitigated.

He further supplemented the ideal characteristics of this Scientist AI: it should be like a Platonic ideal scientist, whose only pursuit is understanding and explaining the world, without setting goals for itself, and its behavior is completely honest and humble.

This AI differs from the traditional approach over the past 70 years, which uses human intelligence, brains, minds, and rationality as templates. Bengio warned that if we continue along the path of mimicking humans, we might create machines with many human attributes, including those not-so-good attributes, but in some aspects, they might be more powerful than humans, which itself is a risk.

Therefore, he advocates that we should rethink what we really want to gain from AI. We might want AI to help solve practical problems faced by humanity, such as challenges in healthcare, society, climate, education, etc., but do we really want to create partners that might be more powerful than us? This is a question worth pondering because it inherently contains risks.

Returning to the idea of Scientist AI, its core functions can be summarized in two points:

Generating hypotheses about how the world works: Scientist AI would not claim to fully know the truth of the world, but it can generate multiple possible hypotheses about how the world operates based on observations and data. It can handle multiple hypotheses simultaneously and understand that these different hypotheses may all be compatible with existing data.

Inferring based on hypotheses and answering questions: Using these possibly uncertain knowledge and hypotheses, Scientist AI can deduce answers to specific questions (i.e., reasoning).

Bengio used an example to help understand: these hypotheses might be similar to the basic equations of quantum physics, and the reasoning process is like using a simulator based on these equations to predict what will happen under specific conditions, thereby answering specific questions.

Importantly, Bengio believes we can train a neural network to accomplish these two tasks, both generating hypotheses and learning to use these hypotheses to answer questions. In fact, his research team has already begun to explore similar research directions, although it has not yet reached the scale and depth needed to solve AI safety issues.

Using Scientist AI as a Monitoring Layer

After proposing the idea of constructing Scientist AI, Bengio further elaborated on how this non-agency AI can serve as a guardrail to control and constrain those agency-driven AI systems that may pose risks.

The core of this idea is to use intelligence to control intelligence, that is, to use AI we can trust to regulate AI we may not fully trust. Bengio explained that the principle of this control mechanism is actually very simple: to ensure that an agency-driven AI behaves properly and does not cause harm, the key is to have an accurate predictor. This predictor needs to be able to predict the potential harm or danger level that a certain behavior of the agency AI might bring.

Specifically, we can design a monitoring layer or safeguard mechanism: when an agency-driven AI wants to execute a certain operation, it must first obtain permission from the monitoring layer. The monitoring layer will use the prediction capabilities of Scientist AI to evaluate whether this action complies with preset safety norms, social standards, legal norms, and other standards set by humans.

This evaluation and prediction process itself does not require agency because predicting the consequences of a behavior requires a deep understanding of how the world operates. But it does not require the evaluator to have its own goals or intentions. Therefore, we can use a trustworthy non-agency Scientist AI to build such a monitoring system.

This system acts like a shell, wrapping around the agency-driven AI, ensuring its behavior does not cross the red line and does not make harmful moves. The advantage of this architecture is that we do not need to fully understand the complex decision-making process inside the agency AI, nor do we need to hope to perfectly align all its internal motivations. We only need a reliable external referee, provided that this referee is neutral and trustworthy and has enough intelligence to foresee potential risks.

Bengio also mentioned that this Scientist AI can not only be used to predict the consequences of single actions but also in more complex scenarios, such as the design of scientific experiments. When AI is needed to assist in designing experiments to explore the unknown, Scientist AI can generate multiple candidate experimental schemes, and the same monitoring mechanism can intervene to evaluate whether these experimental schemes can effectively distinguish different scientific theories. More importantly, whether they are safe and whether they will harm humans. Only when safety is ensured will the experiment be allowed to proceed.

In traditional reinforcement learning methods, AI might take actions to pursue future rewards, but due to our lack of detailed control and supervision over the decision-making process, this end-to-end approach itself may hide risks. The idea of Scientist AI is closer to the practice of human scientific research, emphasizing the understandability and controllability of each step.

Technical Path to Achieving Scientist AI

After explaining Scientist AI as a potential AI safety solution, Bengio further explored the technical path to achieving this AI. The core is to shift AI's learning paradigm from the current focus on mimicking and pleasing humans to a focus on explanation.

He believes that the root cause of current AI, especially large language models, generating self-protection and other undesirable behaviors lies largely in their training to mimic human language or maximize positive feedback from humans through reinforcement learning.

Bengio cited a paper published by his team at ICLR 2024, which reportedly played a key role in the development of OpenAI's Q* project. The core idea of this research is to use generative reasoning methods, especially probabilistic models, to sample thoughts, aiming to train AI to learn to generate thought chains that can well explain subsequent texts or phenomena.

For example, if the previous sentence is "The cat is hungry," and the next sentence is "Now the cat is sleepy and not hungry," then AI should be able to generate a reasonable intermediate explanation, such as "The cat found food and ate it, then started to nap."

In many current advanced reasoning models, generating thought chains is already common practice, but they are usually trained through reinforcement learning. Bengio proposed a different approach, envisioning the content of thought chains as a series of logical statements. A logical statement refers to a sentence that can be judged as true or false. Since we do not necessarily know the truth of these statements in advance, we can treat them as random variables and assign them certain probabilities. This probability depends on other information we know.

Theoretically, potential logical statements (i.e., all possible sentences) are infinite, each describing a possible attribute of the world. Then, given input information, a neural network can be trained to predict the probabilities of these logical statements and sample some of them to form explanatory thought chains.

The key shift in this approach is that it no longer requires AI to play the role of an actor but is more like a psychologist or scientist. Its task is to understand human behavior and hypothesize about questions like "Why would these people do such things," then generate explanatory thought chains.

For example, when AI reads a sentence like "I agree with your point" in text data, it should not simply treat the sentence as a real statement to be mimicked but should understand it as someone wrote this sentence and try to explain why this person wrote it and what the underlying motives and beliefs might be.

The core purpose of this training method is to let AI learn to build explanatory models of the world, rather than simply mimicking human language.