How can Gemini turn the tide? Google's Chief AI Architect: Start by acknowledging the lag and find your own rhythm

Wallstreetcn
2025.11.27 13:15

Acknowledging setbacks is the first step for Google to restart. They have restructured their underlying architecture and built core advantages in multimodal understanding; they have made "usability" the main battlefield, achieving a qualitative change in the experience of large models; in addition, by reorganizing teams into "parallel systems" and reactivating infrastructure, this tech giant has ultimately found its rhythm again

"For a long time, this has been a chase."

When Google's Chief AI Architect and DeepMind CTO Koray Kavukcuoglu publicly admitted on camera that they had "been left behind," it was hard not to see that this tech giant, which once defined the golden age of deep learning, had gone through a genuine crisis: the explosion of ChatGPT shifted the entire industry's attention to OpenAI, while Google was cast as the "laggard."

But this period of headwinds has become a thing of the past.

With the full release of Gemini 3, which regained the lead on several key benchmarks and achieved "same-day deployment" across a product matrix including Search, YouTube, Maps, and Android, Google has shown through action that it has not only caught up but has also reshaped its organizational methodology and technological path, reclaiming its own rhythm.

In a recent, nearly hour-long in-depth conversation, Kavukcuoglu offered a rare look at the story behind this "technological renaissance": how did Google turn a lagging position into an industry-level, systematic lead in just two years?

Google's Chief AI Architect and DeepMind CTO Koray Kavukcuoglu

The Real Starting Point: Acknowledging Lag

Kavukcuoglu's candor is exceptionally rare. "When we started working on Gemini, we knew we were behind. But you have to be honest enough to acknowledge reality before you can innovate."

This actually marks a turning point in internal consensus: Relying solely on long-term research traditions can no longer keep pace with the speed of the times.

Past DeepMind was known for scientific breakthroughs: AlphaGo, AlphaFold, MuZero, each milestone achievement built the aura of a "technological leader." However, when models need to enter large-scale user scenarios, this research-driven pace has proven unable to directly translate into product capabilities.

Acknowledging this was the first step for Google to restart.

Multimodality is Not Just an Add-On, But an Inevitable Underlying Architecture

In the interview, Koray placed "multimodality" at the core more than once. His explanation carried no promotional flair, only pure engineering logic: the world is not linear, so the intelligence that understands the world cannot be linear either.

Text only describes one-dimensional logic, images represent spatial structures, audio contains temporal clues, while video is a combination of these dimensions. A model that can truly serve as a general intelligence system cannot rely solely on text input and output.

Google's choice is to unify at the architectural level, allowing different modalities to be understood and jointly trained within the same model. This is the most challenging route, as it requires not only changes to the model structure but also a complete overhaul of tokenization methods, training losses, optimizers, and even inference paths.

But it is precisely this underlying reconstruction that lets Gemini pull ahead quickly in areas such as chart analysis, document understanding, and cross-modal tasks. The outside world often judges image models by "how well they draw"; Google's approach is different. The significance of multimodality lies in enabling the model to better understand the world, not merely in generating beautiful images.
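As a rough illustration of what "jointly trained within the same model" implies, here is a toy Python sketch. It is entirely hypothetical (not Gemini's actual tokenizer or architecture) and shows only the one property such a design needs: text tokens and image-patch tokens flattened into a single shared sequence that one model can attend over.

```python
# Toy sketch (illustrative only, not Gemini's design): text tokens and
# image-patch tokens share one token stream, so a single model can be
# jointly trained across modalities.

def text_tokens(text, vocab):
    # Toy word-level tokenizer: map each word to an id in a shared vocabulary.
    return [("text", vocab.setdefault(w, len(vocab))) for w in text.split()]

def image_tokens(pixels, patch=2):
    # Toy "patchifier": split a square grayscale image (list of rows) into
    # patch x patch blocks; each block becomes one image token.
    n = len(pixels)
    tokens = []
    for r in range(0, n, patch):
        for c in range(0, n, patch):
            block = tuple(pixels[r + i][c + j]
                          for i in range(patch) for j in range(patch))
            tokens.append(("image", block))
    return tokens

def interleave(*segments):
    # One flat sequence with modality tags: every token, text or image,
    # lives in the same stream that the model trains on.
    seq = [("bos", None)]
    for seg in segments:
        seq.extend(seg)
    seq.append(("eos", None))
    return seq

vocab = {}
img = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
seq = interleave(text_tokens("describe this chart", vocab), image_tokens(img))
print(len(seq))  # 3 text tokens + 4 image patches + bos/eos = 9
```

Real systems use learned visual encoders and shared embedding spaces rather than raw pixel blocks; the point of the sketch is only that no modality lives in a separate pipeline.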

The Secret of Google's Acceleration: A Complete Rewrite of Organizational Structure

The true turnaround for Gemini comes from changes at the organizational level.

In the past, Google resembled a serial assembly line: research teams trained models, engineering teams were responsible for deployment, product teams took over user experience, and security teams ensured quality at the final stage. This structure was effective in the era of internet products, but in the era of large models, it magnified a fatal consequence—slow iteration and fragmented links.

Now, Google has reorganized all teams into a "parallel system." Koray particularly emphasizes that starting from Gemini 3: product managers participate in task design from day one of training; engineering teams optimize inference paths and implementation costs in sync; security policies are embedded in the training process rather than patched before launch; real user data is directly connected to the training pipeline, no longer separated by layers of organizational structure.

This change allows Gemini's iteration pace to finally catch up with competitors and makes the model more "product-like"—stable, better at understanding intent, and capable of executing real tasks, rather than just showcasing laboratory capabilities.

For a large company with over 200,000 employees, the difficulty of such organizational restructuring far exceeds that of a single model iteration.

The Leap in Gemini Experience: Intelligence Enhancement is Not the Main Reason

Over the past year, many users have noticed a significant improvement in the Gemini experience. However, Koray's explanation is not that "the model has become smarter," but that Google has finally made "usability" a core goal, including:

First, a significant improvement in instruction understanding capabilities. This is the most intuitive aspect for users and the starting point for the model's move towards execution intelligence.

Second, internationalization adaptation has entered the core capability set. Google has global users, and Gemini's training process has systematically incorporated multicultural and cross-scenario corpora for the first time, rather than just doing translation.

Third, a leap in toolchain and code execution capabilities. This lays the foundation for Gemini's transition to an Agent—from "being able to answer" to "being able to complete tasks."
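The shift from "being able to answer" to "being able to complete tasks" comes down to a tool-calling loop. The sketch below is a generic, hypothetical version of such a loop; the model stub, message format, and tool registry are invented for illustration and are not Gemini's actual API.

```python
# Hypothetical sketch of an agent's tool-calling loop. The model proposes a
# tool call, the runtime executes it, and the result is fed back until the
# model produces a final answer. All names here are illustrative.

TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def fake_model(messages):
    # Stand-in for a model: emits a tool call until a tool result is present,
    # then emits a final answer. A real model decides this from context.
    for m in messages:
        if m["role"] == "tool":
            return {"role": "assistant", "final": f"The result is {m['content']}"}
    return {"role": "assistant", "tool_call": {"name": "add", "args": (2, 3)}}

def run_agent(user_request, max_steps=5):
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["final"]                 # model answered directly
        result = TOOLS[call["name"]](*call["args"])  # execute the tool
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not converge")

print(run_agent("what is 2 + 3?"))  # The result is 5
```

The design choice worth noticing is the bounded loop: tool results re-enter the model's context, which is what turns a one-shot answerer into a multi-step task executor.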

Gemini's "improvement" is not a breakthrough at a single point, but an inevitable result of the maturity of system engineering.

Infrastructure Becomes Google's Confidence Again

When discussing competitive advantages, Koray emphasizes not model capabilities, but infrastructure—an aspect often overlooked by the outside world.

TPU, global data centers, cross-product distribution capabilities, a mature security system, and a vast entry point built on Search and Android... Once these capabilities are combined with a unified model, they create a network effect that is difficult to replicate.

The enhancement of Gemini is essentially the reactivation of infrastructure. This is also a major reason why Google could return so quickly from laggard to the center of the industry.

From the interview, one can sense a change: the success of Gemini is not the result of a sudden flash of genius, nor a sudden jump in some model parameter, but the inevitable product of a giant organization regaining a unified rhythm.

Google took two years, from acknowledging the problem to restructuring the system, and then to forming a new product logic. This kind of "system reversal" is often less eye-catching than a stunning demo, yet it holds more long-term value than any model leap.

At some point, defining new frontiers and new benchmarks is a good thing, and defining benchmarks is very important. There is a distinction between technological advancement and benchmark performance: ideally they would be 100% aligned, but they never are.

The Next AI War: From Language Intelligence to Action Intelligence

Koray's judgment about the future is clear and direct: the next stage of competition is not about who has a better conversational model, but rather who can better accomplish multi-step tasks.

This competition will occur in: workflow automation, developer toolchains, enterprise task intelligence, search and information organization methods, and system-level AI (Android, Chrome, Workspace).

AI is transitioning from language models to "task operating systems." The goal of Gemini is to become the underlying capability of such systems.

For the capital market, this is a key difference: conversational models are products, while action models are platforms. The commercial value of a platform far exceeds that of a product.

The following are the main points of Koray Kavukcuoglu:

The most important criterion we use to measure progress is the application of models in the real world. Scientists use it to advance research, students use it to assist learning, lawyers use it to analyze cases, engineers use it to write code—ranging from professional fields to daily life, from simple email writing to complex creative work, people are using this technology to accomplish a variety of tasks. The breadth of application that spans different fields and covers diverse scenarios is precisely the most important value metric.

If we are to achieve general artificial intelligence, we must do so through products and through deep connections with users and ecosystems. My core mission is to ensure that every Google product receives the most advanced technological support. We are not trying to create products ourselves—we are not product experts, but rather technology developers. We focus on technology research and model building.

We always iterate in sync with AI models and achieve coordinated releases with Gemini applications—this is no easy task. It is precisely because these teams have been deeply involved since the early stages of research and development that we can ensure all products are upgraded simultaneously at the moment the model is ready. This collaborative mechanism has become our standard process.

Whenever asked about the biggest risk facing Gemini, my answer remains the same: the depletion of innovation is our real concern. I never believe we have mastered the ultimate formula, nor do I believe that merely executing mechanically can lead us to the finish line.

(Do you also feel that sense of a comeback?) Indeed, there is such a feeling, and it began even earlier. When LLMs truly demonstrated their power, I honestly felt that we had once been at the forefront among AI laboratories, at DeepMind. At the same time, I also realized that our investment in certain areas was still not enough... This is a race to catch up. For a long time, we have been striving to catch up.

I have never agreed with the viewpoint that "Google is too large and too difficult to push forward." I believe we can turn this into an advantage because we have unique resources and capabilities.

We are now clearly moving towards a multimodal direction—integrating multimodal inputs and outputs. With technological advancements, architectural concepts across different fields are increasingly permeating each other. These architectures, which were originally quite different, are becoming more and more compatible, not through forced assembly, but through the natural convergence of technological development. When everyone realizes the paths to efficiency improvement and the direction of conceptual evolution, the technological routes will naturally merge.

The following is the full text of the conversation (translated with AI assistance)

Logan Kilpatrick (Host):

Hello everyone, welcome back to Release Notes. I’m Logan Kilpatrick, working with the DeepMind team. Today, I’m honored to invite Koray Kavukcuoglu, who is the CTO of DeepMind and the new Chief AI Architect of Google Core. Thank you for being here. I’m excited to chat.

Koray Kavukcuoglu:

Yes, very excited. Thank you for the invitation.

Logan Kilpatrick:

Of course, Gemini 3, we’re sitting here, and the model has been released. The response seems very positive. I think when we released it, we clearly had a sense of how well the model would perform. The leaderboard looks great, but I think the real test is getting the model into the hands of users and actually releasing it.

Koray Kavukcuoglu:

That’s always the test, right? I mean, benchmarking is the first step. Then we did testing. We had tested with trusted testers on previous versions, etc. So you get a sense that, yes, this is a good model, it has strong capabilities. It’s not perfect, right? But I’m quite satisfied with the feedback. People seem to really like the model, and the aspects we find interesting seem to interest them too. So, that’s good. So far, so good.

Logan Kilpatrick:

We were chatting yesterday, and the main theme of the conversation was appreciating that this progress hasn’t slowed down, which resonated with me. Looking back to the last time I interviewed you, we were at the I/O conference when we released 2.5, listening to Demis and Sergey talk about AI, etc. I feel like the progress hasn’t slowed down, which is really interesting. When we released 2.5, it felt like a top-tier model, and in fact we pushed the frontier on multiple dimensions. I feel like 3.0 has done that again. Yes, I am very curious how the discussion on scaling continues. What are your thoughts now?

Koray Kavukcuoglu:

Yes, I mean, I am very excited about the progress. I am excited about the research. When you are really in the research, there are many exciting things in all these areas, right? I mean, from data, pre-training, post-training, to everywhere, we see a lot of excitement, a lot of progress, and many new ideas.

Ultimately, this whole thing really relies on innovation, relies on ideas, right? The more impactful things we do that people use in the real world, the more ideas you actually get because your exposure increases, and the types of signals you receive also increase. I think the problems will become harder, and the problems will become more diverse. Along with that, I think we will face challenges, and those challenges are a good thing. Yes, I think that is also the driving force behind our building intelligence, right? That’s how it happens.

I feel that sometimes if you only look at one or two benchmarks, you might see bottlenecks. But I think that’s normal because benchmarks are defined when a task or challenge is at hand. You define that benchmark, and of course, as technology advances, that benchmark is no longer cutting-edge. It no longer defines the frontier. Then what happens is you define a new benchmark. This is very normal in machine learning, right? Benchmarks and model development always complement each other. You need benchmarks to guide model development, but only when you are close to the frontier do you know what the next frontier is so that you can define it. With new benchmarks, yes.

Logan Kilpatrick:

I agree. There are several benchmarks where all models perform poorly, probably only 1% or 2%. I think the latest DeepThink model can achieve over 40%, yes, that’s crazy. On ARC-AGI-2, initially models could hardly solve any problems; now over 40% is achievable. So yes, that’s interesting. It’s also interesting that those static benchmarks, which have stood the test of time, still exist even though we might only improve by about 1%.

Koray Kavukcuoglu:

There are really difficult problems there. I mean, there are difficult things we still cannot do. Yes, that’s right. They are still testing certain things. But if you think about where we are on GPQA, hey, it’s not like, oh, you are now in the 20s and need to reach the 90s, right? So the number of unsolved problems it defines is certainly decreasing.

So at some point, defining new frontiers and new benchmarks is a good thing, and defining benchmarks is very important. Because if we treat benchmarks as the definition of progress, the two are not always completely consistent, right? There is a distinction between technological advancement and benchmarks. Ideally they are 100% aligned, but they never are.

For me, the most important measure of progress is how our models are used in the real world. Scientists use them, students use them, lawyers use them, engineers use them, and then people use them for all sorts of things, writing, creativity, emailing, whether simple or complex. The scope is important, different subjects, different fields. If you can continuously provide greater value there, I think that is progress, and these benchmarks help you quantify that.

Logan Kilpatrick:

How do you view, and perhaps even give a specific example from 2.5 to 3, or any model version change? Where are we doing local optimizations? In a world now filled with countless benchmarks, you can choose which direction to optimize, how are you thinking about what we should optimize for Gemini overall, and perhaps specifically for the Pro model?

Koray Kavukcuoglu:

There are several important areas, right? One of them is instruction following, which is crucial. Instruction following means the model needs to be able to understand user requests and be able to follow them, right? You don't want the model to just answer what it thinks it should answer, right? So this ability to follow instructions is important. This is what we have been working on. Then, for us, internationalization is important. Google is very internationalized, and we want to reach everyone in the world. So that part is important.

Logan Kilpatrick:

I feel that 3.0 Pro at least... I was chatting with Toldsee this morning, and she commented that this model is excellent for languages we historically haven't handled well, and it's great to see that.

Koray Kavukcuoglu:

You have to continuously focus on some of these areas, right? They may not seem like, oh, this is the cutting edge of knowledge, but they are indeed very important, because you want to be able to interact with users out there, because as I said, it’s all about getting signals from users.

Then, moving to more technical areas: function calls, tool calls, agent actions, and code are all very important, right? Function calls and tool calls matter because I think they are a completely different multiplier of intelligence, both from the model naturally using all the tools and functions we create, and using them in its own reasoning, and from the model writing its own tools, right? You can think of the model itself as a tool to some extent. So this is a big deal. Obviously, code is important, not just because we are also targeting engineers, but because we know that through code you can actually build anything that happens on your laptop. And what happens on your laptop is not just software engineering; it is turning any idea into reality, right? So much of what we do now happens in the digital world, and code is the foundation of all of it, able to integrate with anything happening in your life. Not everything, but a lot. This is why I believe the combination of the two provides significant coverage for users.

Let me give you an example of vibe coding, which I really like. Why? Because many people are creative. They have ideas, and suddenly you make them productive, right? From being creative to being productive, in a way where you just write it down, and then you can see the application presenting itself in front of you. I mean, most of the time it works, and when it works, it's fantastic, right?

I love that process we call the loop; I think it's great. So suddenly, enabling more people to become builders, to build something, I mean, that's amazing.

Logan Kilpatrick:

I love it. Thank you, this is a promotion for AI Studio. We'll clip this out and put it online. An interesting thread you mentioned is the importance of having this product scaffolding to help optimize quality from a model perspective, obviously. Yes, tool invocation and coding.

Koray Kavukcuoglu:

This is very important to me, I think. Like Antigravity as a product itself, yes, it's exciting, but from a model perspective, if you think about it, it's two-sided, right? Let's talk from the model perspective first. From the model perspective, being able to integrate with the end users (in this case, software engineers) and learn directly from them where the model needs to improve is indeed crucial for us. I mean, in areas like the Gemini app, for the same reasons, this is also important, right? Understanding the users directly is very important. The same goes for Antigravity. The same goes for AI Studio.

Having these products we work closely with, understanding and learning, getting those user signals, I think is tremendous. Antigravity has been a very important release partner. They haven't been on board for long, right? But in the last two or three weeks of our release process, their feedback really played a key role. The same goes for AI in Search, right? I mean, AI Overviews; even from there we got a lot of feedback. So for me, this integration with products and getting signals is the main driver of our understanding. Of course, we have benchmarks, so we know how to drive intelligence in science, technology, mathematics, etc. But what's really important is that we actually understand the real-world use cases, because this has to be useful in the real world.

Logan Kilpatrick:

In your new role as Chief AI Architect, you are now also responsible for ensuring that we not only have good models but that products across Google can actually adopt these models and build excellent product experiences. From DeepMind's perspective, how much complexity does this add? Obviously, I think applying Gemini 3 to all products and services from day one is the right thing for users, and an incredible achievement for Google. I hope there will be more products and services in the future. Sometimes... life was simpler a year and a half ago.

Koray Kavukcuoglu:

But just like we are building intelligence. I play both roles, and essentially their goals are aligned. If we are to achieve general artificial intelligence, it must be done through products and through deep connections with users and the ecosystem. My core mission is to ensure that every Google product is supported by cutting-edge technology. We are not here to create products ourselves—we are not product experts, but technology developers. We focus on technology research and model building.

Of course, just as all creators adhere to their own philosophies, we also have our own technical propositions. But for me, the most important thing is to provide models and technologies in the best way possible, and then collaborate with product teams to create the best product experience in this AI era.

Because this is indeed a whole new world. This emerging technology is reshaping user expectations, defining product interaction logic, determining how information is presented, and giving rise to unprecedented application scenarios. My responsibility is to drive the implementation of this technology across the entire Google product matrix, working closely with all product teams.

This deep integration excites me—not only because of the sense of achievement brought by product innovation but also because it achieves our most important goal: direct connection with users. Being able to perceive user needs in real-time and obtain feedback from real scenarios is crucial for us. That is why I firmly believe this is the only way to general artificial intelligence: achieving intelligent evolution through productization. Yes, this is the path we have chosen.

Logan Kilpatrick:

This is a great tweet for you to put out at some point, because I find it very interesting. I share this view that, in a sense, we are building AGI together with customers and other PAs (product areas). This is not purely lab research; it is a collective effort between us and the world.

Koray Kavukcuoglu:

I think that is actually a very credible way to put it, and I believe we are increasingly adopting an engineering-oriented mindset, which matters here: when something is well engineered, you know it is robust and safe to use. So in what we do in the real world, we are adapting all the trusted, tested ideas about how to build things.

I think this is reflected in how we think about safety and security, right? We try to think from an engineering perspective, starting from the basics, considering it from the beginning rather than at the end, right? We don’t do that. So when we are doing post-training on models, when we are doing pre-training, when we look at data, we always make everyone think about this issue. Do we have a safety team? Obviously we do, and they bring all the relevant techniques, but we also make sure that everyone working on Gemini can actively participate in that development process, making this a top priority. And these teams are part of our post-training team, right? So when we are iterating and releasing candidate versions, just as we look at benchmarks like GPQA and HLE, we also look at their safety and security measures. I think this engineering mindset is very important.

Logan Kilpatrick:

I completely agree with you. I think this is also very natural for Google, and it helps a lot given the scale and scope of collaboration in this work. Yes, releasing the Gemini model. I mean.

Koray Kavukcuoglu:

For Gemini 3, I think we are just reflecting on this. One of the important things for me is that this model is very much a Team Google model.

Logan Kilpatrick:

We should check the data. I mean, maybe the Apollo program at NASA had a lot of people involved, but I think it's the massive global team at Google, the combined efforts of all our teams, that made this happen, which is crazy. Every single one.

Koray Kavukcuoglu:

Each generation of Gemini embodies the hard work of teams across continents: Europe, Asia, and around the world. Our R&D network spans the globe, which includes not only the DeepMind team but the collaborative effort of the entire Google ecosystem. Yes, this can be described as collaborative innovation at global scale.

We are always synchronously iterating with the AI model and achieving coordinated releases with Gemini applications—this is no easy task. It is precisely because these teams have been deeply involved since the early stages of R&D that we can ensure a synchronized upgrade of all products as soon as the model is ready. This collaborative mechanism has become our standard process.

When we say "full Google collaboration," we are referring not only to the core R&D personnel but also to the contributions of all product teams in their respective fields—from search to office suites, from cloud services to mobile ecosystems, every team plays a key role in this common goal.

Logan Kilpatrick:

I have a question, and maybe this is not a controversial one, but you know, Gemini 3 is state-of-the-art on many benchmarks, and we are releasing simultaneously on many surfaces, you know, across Google product surfaces and our partner ecosystem.

The feedback has been very positive. The atmosphere around the model is great. If you look to the future, I don’t know, hopefully smoothly, if we look to Google’s next major model release, is there anything still on your checklist that you hope we are doing, X or Y? So, how can it be better than Gemini 3? Or should we just enjoy the moment of Gemini 3?

Koray Kavukcuoglu:

We should certainly strive for more breakthroughs—but at this moment, what we need more is to enjoy this milestone. The release day is worth celebrating, and seeing users recognize the model is something the team can be proud of.

However, while we celebrate, we remain clear-headed: there is still room for improvement in every area. Our writing capabilities have not yet reached perfection, and there is room for improvement in programming support—especially in the areas of agent behavior and code generation, where the most exciting development potential lies.

We must acknowledge that we have made significant progress. It can be said that this model meets the needs of 90% to 95% of developers, whether professional engineers or creative makers; it is undoubtedly one of the best tools available today. But precisely because of this, we must stay focused on the remaining 5% to 10%, continuously improving in the areas that still need breakthroughs.

Logan Kilpatrick:

I have a sharp question regarding coding and tool usage. What do you think? If you look back at the history of Gemini, it is clear that we were very focused on multimodality in 1.0, and I think in 2.0 we started doing some, yes, agent infrastructure work. Please tell us why we did that. To clarify, I think the pace of progress looks very strong, but why didn’t we maintain a consistent and stable focus on agent tool usage from the very beginning, as we did with multimodality? For example, in terms of multimodality, we were cutting-edge from Gemini 1 and maintained that for a long time.

Koray Kavukcuoglu:

I like this, and I don’t think it was intentional. I think it’s like, honestly, when I look back, I closely associate it with using models and developing environments tied to the real world; the closer we connect, the better we can understand these real needs.

I think in our journey with Gemini, we started from a certain point, and of course, I mean, Google has a long history of AI research, right? The number of amazing researchers we have and the incredible history of AI research at Google is fantastic.

But Gemini is also a journey from that research environment to this one, like the engineering mindset we talk about, and entering a space where we are truly connected to the product, right? When I look at the team, I have to say I feel very proud because this team is still primarily made up of people like me, right? Just like five years ago, we were still writing papers. We were researching AI. And now we are actually at the forefront of that technology. And that technology, you are developing it through products, with users; that’s a completely different mindset. We build a model every six months and then update it every month and a half. It’s an amazing transformation. I think we have gone through that transformation.

Logan Kilpatrick:

I like that. The progress of Gemini is fantastic. Another thing to consider is how we view generative media models overall. Historically, I don't think they have been a huge focus. I don't mean they haven't received attention; they have always been interesting. But I feel that with Veo 3 and Veo 3.1, as well as the Nano and Ada models, we have achieved such great success from the perspective of product externalization. I'm curious how you see all of this intertwining as we pursue the AGI we want to build. Yes, I think sometimes I can convince myself that a video model is not part of that story. I don't think that's true. I think overall, you should understand the world, physics, and other things. So I'm curious how you see all these things coming together, if you do.

Koray Kavukcuoglu:

Actually, looking back 10 to 15 years ago, generative models were mainly focused on the image domain, because at that time we could test a model's effectiveness more intuitively, and understanding the world and physical laws was the main driving force behind image and sound generation models. About ten years ago, which now feels like twenty, we were still focusing on image models. I was researching generative image models during my PhD, and that field was very active then.

We went through a complete development cycle, with representative works like PixelCNN, which are essentially image generation models.
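As a rough sketch of the idea behind PixelCNN-style models mentioned here (my notation, not from the conversation): the image is treated as a sequence of pixels, and the model generates each pixel conditioned on all the pixels before it, so the joint distribution over an image $x$ with $n$ pixels factorizes autoregressively:

```latex
p(x) \;=\; \prod_{i=1}^{n} p\!\left(x_i \,\middle|\, x_1, \ldots, x_{i-1}\right)
```

This is the same factorization that later text models apply to tokens, which is part of why, as Kavukcuoglu goes on to say, the architectures of the two domains have been converging.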

Later, we realized that the text domain could actually bring faster technological breakthroughs. But now the return of image models completely aligns with the development trend. DeepMind has long accumulated deep technical expertise in image, video, and audio models. This is exactly what I want to illustrate: the fusion of these modalities is a natural result of evolution.

We are now clearly moving towards a multimodal direction— including the multimodal fusion of inputs and outputs. With technological advancements, architectural concepts across different fields are increasingly permeating each other. These architectures, which were originally significantly different, are becoming more and more compatible. This is not a forced patchwork, but a natural convergence of technological development. When everyone realizes the paths for efficiency improvement and the direction of conceptual evolution, the technological routes will naturally merge.

The birth of Nano is a typical example of this fusion process: you can both iteratively process images and directly converse with the model. The text model builds a cognitive understanding of the world through language data, while the image model forms another understanding of the world from a visual dimension. When the two are combined, it produces surprising effects—users can clearly feel that the model truly understands those subtle intentions that are difficult to articulate.

Logan Kilpatrick:

I have another question about Nano. Vanessa, do you think we should give all our models some silly names? Do you think that would help?

Koray Kavukcuoglu:

Not entirely. Listen, I mean, I don't think we are being intentional about it. Gemini 3

Logan Kilpatrick:

If we don't call it Gemini 3, what should we call it? Some ridiculous names.

Koray Kavukcuoglu:

I don't know. I'm not good at naming, right? I mean, this is an update, right? Those are codenames, and we actually use the Gemini model to come up with them. But Nano Banana is not one of those; we didn't use Gemini for it. There's a story about that, I think it's been published somewhere. As long as these things are natural and organic, I'm happy, because I think it's good for the team building the model to have that kind of connection. And then when we release them... that happened because we tested the model on LMArena under a codename, and people loved it. I'd like to think it was so organic that it just became popular. I'm not sure you can create a process to generate that kind of name.

Logan Kilpatrick:

I agree with you. That's my feeling.

Koray Kavukcuoglu:

So if we have a name like that, we should use it. If not, having something standard is good.

Logan Kilpatrick:

We should talk about Nano Banana Pro, our next-generation state-of-the-art image generation model built on Gemini 3 Pro. I think even while the team was finishing Nano Banana, there were already early signals that doing this in Pro form would buy more performance in more nuanced use cases, like text rendering and world understanding. There's something particularly worth noting there; I know there's a lot going on, but I feel like...

Koray Kavukcuoglu:

This might be where we start to see these different technologies align, right? For the Gemini models, we've always said that each model version is a model family: we have Pro, Flash, Flash-Lite. Because at different scales, you get different trade-offs in speed, accuracy, cost, and so on. As these things converge, of course we want the same range in images. So I think the team naturally thought: there's a 3.0 Pro architecture, let's tune this model further, leverage everything we learned in the first version, and scale it up into a generative image model.
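The speed/accuracy/cost trade-off across a model family can be sketched as a simple routing rule. Everything below is illustrative, not a real Google API: the tier names echo the conversation, but the `Tier` type, the cost and latency numbers, and the `pick_tier` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One member of a hypothetical model family."""
    name: str
    relative_cost: float     # cost per request, normalized to the cheapest tier
    relative_latency: float  # latency, normalized to the fastest tier
    quality: int             # rough capability rank (higher = more capable)


# Illustrative numbers only -- not real pricing or benchmarks.
TIERS = [
    Tier("flash-lite", relative_cost=1.0,  relative_latency=1.0, quality=1),
    Tier("flash",      relative_cost=4.0,  relative_latency=2.0, quality=2),
    Tier("pro",        relative_cost=20.0, relative_latency=5.0, quality=3),
]


def pick_tier(min_quality: int, max_cost: float) -> Tier:
    """Pick the cheapest tier that meets a quality floor and a cost ceiling."""
    candidates = [
        t for t in TIERS
        if t.quality >= min_quality and t.relative_cost <= max_cost
    ]
    if not candidates:
        raise ValueError("no tier satisfies the constraints")
    return min(candidates, key=lambda t: t.relative_cost)


# A batch job that needs top quality routes to the largest tier;
# an interactive feature with baseline needs stays on the cheapest one.
print(pick_tier(min_quality=3, max_cost=25.0).name)
print(pick_tier(min_quality=1, max_cost=2.0).name)
```

The point of the sketch is the shape of the decision, not the numbers: once every modality ships as a family, callers choose a point on the speed/accuracy/cost frontier rather than a single model.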

I think what we ultimately get is a much more capable model. One of the most exciting use cases: you have a very complex set of documents you can feed in, and we rely on these models to answer questions about them. You can also ask it to generate an infographic about that content, and it works, right? So this is where the natural input and output modalities come into play. It's great.

Logan Kilpatrick:

Yes, it feels like magic. I don't know. I hope by the time this video is released, everyone has seen examples. But yes, just seeing so many internal examples being shared is so cool. It's crazy.

Koray Kavukcuoglu:

Yes, I agree. It's like, when you suddenly see it: oh my god. Such a large amount of text, concepts, and complex things, explained so wonderfully in a single image. When you see that, you realize the model is capable.

Logan Kilpatrick:

And yes, there are so many nuances in there, which is really interesting. I have a parallel question about this. Back in December 2024, Tulsee promised us unified Gemini model checkpoints. I think what you're describing means we are actually very close to that goal now, whereas historically the architectures were built separately.

Koray Kavukcuoglu:

You mean unifying image generation into the model. I see.

Logan Kilpatrick:

I'm curious, do you think of it that way? I assume that's a goal, that we want these things to actually merge into the model, unless something naturally prevents it from happening. I'm curious if there's any background or high-level reasoning.

Koray Kavukcuoglu:

Listen, as I said, the technology and architecture are aligning, right? We see this happening in regular iterations. People are trying, but it's still a hypothesis; you can't be ideological about it. The scientific method is the scientific method: you try things, you have a hypothesis, then you look at the results. Sometimes it works, sometimes it doesn't. That's the process we go through. It's getting closer. I'm sure in the near future we'll see some of these things come together, and I think gradually it will become more like a single model.

But it does require a lot of innovation, right? It's hard. If you think about it, the output space is crucial for the model, because that's where your learning signals come from. Right now, our learning signals come from code and text; that's the main driver of the output space, and that's why models are getting good at it. Generating images is harder: our demands for image quality are so high. Producing something truly pixel-perfect is difficult, and the image also has to be conceptually coherent. Every pixel's quality matters, but so does how it fits into the overall concept of the picture. Training something that can do all of that is harder. But the way I look at it: I believe it is absolutely possible. It will become possible. It's just about finding the right innovations in the model to achieve it.

Logan Kilpatrick:

I love it. I'm excited. Hopefully, this will also make our serving situation a bit easier. If...

Koray Kavukcuoglu:

I used to say I don't know.

Logan Kilpatrick:

A single model checkpoint.

Koray Kavukcuoglu:

That's impossible to say.

Logan Kilpatrick:

It's impossible. I agree with you. An interesting thread as we sit here: DeepMind now has a bunch of the best AI products in the world, coding tools, AI Studio, the Gemini app, Antigravity, and it's all happening within Google. We have a great state-of-the-art model in Gemini 3; we have Nano Banana, we have Veo, all these models at the forefront. The world looked very different 10, even 15 years ago. I'm curious about your personal journey to this point. When we were chatting yesterday you mentioned something I had no idea about, and when I mentioned it to another person, they didn't know either: you were the first deep learning researcher at DeepMind. Tracing that thread to where we are now feels like a crazy leap, from people not being enthusiastic about this technology at all. I don't know how long ago you joined DeepMind, probably 10 years?

Koray Kavukcuoglu:

12 years.

Logan Kilpatrick:

12, 13 years. Yes, that's crazy. 13 years ago, people were not enthusiastic about this technology, and now it is actually powering all these products and is a major thing. I'm curious, when you reflect on this, what do you think? Is it surprising, or does it feel obvious?

Koray Kavukcuoglu:

I mean, I think in our case this is the hoped-for positive outcome, right? I say this because when I was doing my PhD... I think every PhD student is the same: you believe that what you are doing is important or will become important. You are genuinely interested in the subject, and you think it will have a huge impact. I had that same mindset. That's why, when Demis and Shane reached out to me and we talked, I was very excited about DeepMind. I was really excited to learn that there was a place truly focused on building intelligence, with deep learning at its core. My friend Karol Gregor and I were both in Yann LeCun's lab at NYU, and we joined DeepMind at the same time. At that time, having an AI-first startup focused on deep learning was quite unusual. It was very visionary and an amazing place; it was really exciting there. Then I started the deep learning team, and it grew. One of the things I liked: my approach to deep learning has always been a mindset for how you tackle problems, from first principles, always based on learning. That's what DeepMind is about: everything gets better through learning. Starting from where we were then, it has been an exciting journey.

Then there was reinforcement learning, agents, and everything we did along the way. You get into these things, at least that's how I got into them, hoping a positive outcome will happen. Reflecting on it, I'd say we were lucky, right? Lucky to live in this era. Many people study AI, or whatever subject they are truly passionate about, and believe theirs is the era when it will succeed; for AI, it is actually happening now. And we have to realize it's happening not just because of machine learning and deep learning, but because the evolution of hardware has reached a certain state, and the internet and data have reached a certain state, right? Many things came together, and I feel very fortunate to have been truly engaged in AI, working all the way to this moment. When I reflect on this: yes, these were all choices, I made the specific choice to study AI, but at the same time I also feel very lucky to be in this position in this era. It's very exciting.

Logan Kilpatrick:

Yes, I feel the same way. I'm curious... I was watching "The Thinking Game" videos. I wasn't involved in AlphaFold, so the only background I have is reading about it and listening to people talk about it. Having lived through many of those projects, how does today differ from before? I'll give an example you mentioned off-camera just now; these aren't exactly your words, but the gist was that we've figured out how to make these models and bring them to the world. That's basically what you meant, and I agree. I'm curious how this compares to some of the previous iterations.

Koray Kavukcuoglu:

The organizational and cultural traits around what matters tend to determine whether hard scientific and technical problems turn into successful outcomes. I think we learned this through many of the projects we did: DQN, AlphaGo, AlphaZero, AlphaFold. All of these were quite impactful in their own ways, and we learned a lot about how to organize around specific goals, specific missions, and how to organize as a large team, right?

I remember in the early days at DeepMind, we had 25 people working together on a project, and we would write papers with those 25 people. Everyone would tell us it couldn't possibly have been done by 25 people together, and I would say: no, they really did, right? We organized ourselves, and that's not common in science and research. I think that kind of knowledge, that mindset, is key.

We evolved through that process, and I think that's very important. At the same time, in the last two or three years, as we've discussed, we've consolidated around what is now more of an engineering mindset: we have a mainline model being developed, and we've learned how to explore on that backbone, how to explore with these models.

A good example, one that makes me happy every time I see or think about it, is the models we took to the IMO, the ICPC, and other competitions. That's a very cool example, because you choose these big goals; the competitions matter, right? Those are very hard problems, and fielding a model there is also a tribute to all the students competing; it's really great. Of course, you feel the impulse to custom-build something for a given competition. What we try to do instead is take that opportunity to evolve what we already have, or propose new ideas compatible with our existing models, because we believe in the generality of our technology. Then we create something and make it available to everyone, right? So everyone can use a model that was actually used in the IMO.

Logan Kilpatrick:

Yes, to extend your analogy about the 25-person paper: the current version is the contributor list for Gemini 3. I'm sure people will look at it and say, conservatively, that it's impossible everyone listed really contributed, but they really did, which is crazy. It's fascinating to see the scale of these problems.

Koray Kavukcuoglu:

Now, yes, the field... I think that's important for us. It's also one of the great things about Google: there are so many people who are amazing experts in their fields, and we benefit from that. Google has this full-stack approach, right? You have experts at every layer, from data centers to chips to networks to running these things at scale. All of this, again with that engineering mindset, reaches a state where these things are inseparable: when we design a model, we design it knowing what hardware it will run on, and when we design the next hardware, we know where the model might go.

But it's wonderful, right? I mean, coordinating this, yes, of course, you have thousands of people working together and contributing. I think we need to recognize that. It's a beautiful thing. Amazing.

Logan Kilpatrick:

It's not easy to do. An interesting thread goes back to this legacy of DeepMind: pursuing all these different scientific approaches, trying to solve very interesting problems. Today, we know this technology works across a bunch of areas, and we really just need to keep scaling it up, though obviously innovation is needed to keep doing that. I'm curious how DeepMind, in today's era, balances pure scientific exploration with just scaling up Gemini. Maybe we can use my favorite example, Gemini Diffusion, as a concrete embodiment of that decision-making.

Koray Kavukcuoglu:

This is indeed the most critical proposition—how to find a balance between exploration and implementation. Whenever I'm asked about the biggest risks facing Gemini, my answer is always the same: the exhaustion of innovation is our real concern.

I never believe we have mastered the ultimate formula, nor do I believe that merely executing mechanically will get us to the finish line. The road to building general artificial intelligence is fraught with thorns, and users and products will bring us endless challenges. That ultimate goal remains distant, and I firmly believe there is no such thing as a "standardized solution." It is a dangerous illusion to think that breakthroughs can be achieved simply by scaling up or optimizing processes.

Real breakthroughs always come from innovation—it may arise from a deep dive into existing technologies or burst forth from entirely different technological paths. Maintaining this multidimensional exploratory capability is our core competitiveness.

Of course, we have the Gemini model, and within the Gemini project we explore a lot: new architectures, new ideas, different ways of doing things. We have to do this, and we keep doing it; this is the source of all innovation. But at the same time, Google DeepMind as a whole is doing even more exploration, and I think that is very critical for us. We have to, because the Gemini project itself may constrain certain kinds of exploration too much. So the best thing we can do is explore a wide range of ideas across Google DeepMind and Google Research, and then bring those ideas in. Because ultimately, Gemini is not a particular architecture, right? Gemini is the goal you want to achieve: the intelligence you want to realize through your products, so that all of Google truly runs on this AI engine. To some extent, the specific architecture doesn't matter. We currently have something, we have a way to evolve through it, and we will. And its engine will be innovation; it will always be innovation. So finding that balance, finding opportunities to do things differently, I think is crucial.

Logan Kilpatrick:

There is a parallel question to this. At I/O, I sat down and talked with Sergey, and I commented to him that, at least for me, gathering everyone together to release these models and innovate makes you feel the warmth of humanity, which is really interesting. I mention this because I was also sitting next to you listening to them speak, and I felt your warmth. I say this very personally because I think it reflects how DeepMind operates as a whole, right? I feel Demis also has this quality: a deep scientific foundation, but at the same time, people are just very kind and friendly. I don't know how much people appreciate how important this culture is and how it manifests. I'm curious, as you think about helping shape and run this, how that shows up.

Koray Kavukcuoglu:

First of all, thank you very much; you're making me blush, but I think it's important. I believe in the team we have, I believe in trusting people, in giving people opportunities. Teamwork is important. At least for me, I can say I learned that working at DeepMind. We were a small team, so of course you build that trust there; then as you grow, how do you maintain it?

I think having an environment like that is important: making people feel that we really care about solving challenging technical and scientific problems that can have a significant impact on the real world. That's still what we're doing, right? Gemini, as I said, is about building intelligence, which is a highly technical and challenging scientific problem. We have to treat it that way, and approach it with humility: we must always question ourselves. Hopefully, the team feels the same way. That's why I've always said I'm very proud of how well this team works together.

Just like when we were chatting in the micro-kitchen upstairs today, right? I told them: yes, it's exhausting. Yes, it's hard. Yes, we're all worn out. But that's just how it is. We don't have a perfect structure for this; everyone comes together, works hard, supports each other. It's tough, but making it fun and enjoyable while tackling real challenges largely comes down to bringing the right team together. In my view, the burden is more about clearly understanding the potential of the technology we have. I can absolutely say that 20 years from now, it will not be the same LLM architecture at all; I'm sure it won't, right? So pushing for new exploration is the right thing to do. We talked about how GDM as a whole, together with Google Research, must advance many different directions in collaboration with the academic research community. I think that's perfect. Defining what is right and what is wrong... I don't think that's the important conversation. I believe that capability, and the demonstration of those capabilities in the real world, is what truly matters.

Logan Kilpatrick:

My last question, and I'm curious about your reflections on this. Personally, during my first year and a half at Google, it felt (and I actually quite liked this feeling) somewhat like a comeback story for Google, despite all the infrastructure advantages and so on. For me personally, it was about showing...

Koray Kavukcuoglu:

April.

Logan Kilpatrick:

April 2024. So that's the backdrop for me, and for AI Studio as well. We were building this product and, right, now I remember, we had essentially no users at that time; or let's say we had 30,000 users. We had no revenue. We were in the early stages of the Gemini model lifecycle. Fast forward to today: when the model started rolling out, I received a ton of messages from friends across the ecosystem, and I believe you received a lot too. I think people finally, really realize this is happening. But I'm curious, from your perspective, did you also feel that sense of comeback? I'll emphasize again: I had faith we would reach this point, which is why I joined Google. But did you feel it too? And how do you think that feeling will manifest in the team now that we've turned this corner?

Koray Kavukcuoglu:

There was indeed that feeling, even before that. When LLMs truly showed their powerful capabilities, I honestly felt that DeepMind had once been at the forefront of AI labs, but I also realized that our investment in certain areas was insufficient. That was an important lesson for me, and it is why I have always been careful that we maintain a broad portfolio. Exploration is important; don't fixate on one specific architecture.

I have always been completely transparent with the team. About two and a half years ago, when we started taking LLMs more seriously and launched the Gemini project, I made it clear to the team: we are still far from the industry's top level.

There are many things we don't know how to do. Even the things we know how to do, we haven't done to the best of our ability. This is a race to catch up, and for a long time we have been striving to catch up. Now I feel we have entered the leading ranks. Yes, I am satisfied and positive about our operational pace. We have formed a good working rhythm and team dynamics.

Yes, we have been catching up. You have to be honest with yourself, right? In the process of catching up, you have to pay attention to what others are doing and learn what you can. But ultimately, you have to walk your own path of innovation. That's what we have done.

That's why I feel this is essentially a great comeback story, right? We insist on independent innovation and have found our own solutions - whether in technology, models, processes, or operational methods, right? This is unique to us, right? We collaborate with the entire Alphabet, and what we are doing now is on a completely different scale.

I never agree with the notion some people voice that "Google is too big and too hard to move." I think we can turn that into an advantage, because we have unique resources and capabilities. So I am quite satisfied with where we are now, but we got here through continuous learning and innovation; that is the right way to achieve what we have achieved. And we still have a long way to go, right? I mean, we have only just caught up; we have only just arrived at this position. Although there will always be comparisons, our goal is to build true intelligence, and we want to achieve it, and achieve it the right way. That is the direction in which we invest all our wisdom and innovation.

Logan Kilpatrick:

I think the next six months are likely to be as exciting as the past six months. Looking back at that interview six months ago, it feels like a long time ago. Thank you very much for taking the time for this conversation today; it has been a very enjoyable exchange.

I hope to have a deeper conversation with you again before next year's I/O conference. Although it feels like there is still a long time to go, time always flies. And I guess we might start preparing for the planning meeting for I/O 2026 next week.

Sincere thanks for taking the time to participate in this conversation. At the same time, I want to congratulate you, the DeepMind team, and all the model researchers for making innovations like Gemini 3 and Nano Banana Pro possible. Thank you, everyone.

Koray Kavukcuoglu:

Thank you very much. It has been great to have this conversation. Thank you for inviting me.