Google launches the next-generation AI model Gemini 2.0 Flash, supporting image generation

Wallstreetcn
2024.12.11 21:22

Google has launched its next-generation AI model, Gemini 2.0 Flash, which supports the generation of images, audio, and text. The model is available through the Gemini API and Google's AI development platforms, though the audio and image generation features are currently open only to early access partners, with a full launch planned for January next year. 2.0 Flash delivers significant gains in speed and performance, running twice as fast as Gemini 1.5 Pro, and has replaced it as the flagship Gemini model.

Author: Zhao Yuhe

In response to the string of new products launched by OpenAI, Google on Wednesday introduced its next major artificial intelligence model, Gemini 2.0 Flash, which can natively generate images and audio in addition to text. 2.0 Flash can also use third-party applications and services, enabling it to access Google Search, execute code, and more.

Starting Wednesday, an experimental version of 2.0 Flash is available through the Gemini API and Google's AI development platforms (AI Studio and Vertex AI). However, the audio and image generation features are open only to "early access partners," with a full launch planned for January next year.
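For developers who want to try it, the call pattern looks roughly like the sketch below, a minimal example assuming the google-generativeai Python SDK and the experimental model identifier "gemini-2.0-flash-exp" (both are assumptions here; check AI Studio for the exact model name available to your account):

```python
# Minimal sketch: text generation with the experimental 2.0 Flash model
# via the Gemini API. The model ID "gemini-2.0-flash-exp" is an assumption.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # API key from Google AI Studio

model = genai.GenerativeModel("gemini-2.0-flash-exp")
response = model.generate_content("Summarize what Gemini 2.0 Flash can do.")
print(response.text)
```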

Google said that in the coming months it will release versions of 2.0 Flash tailored to products such as Android Studio, Chrome DevTools, Firebase, and Gemini Code Assist.

Upgrades to Flash

The first-generation Flash (1.5 Flash) could only generate text and was not designed for particularly demanding workloads. According to Google, the new 2.0 Flash model is more versatile, in part because it can invoke tools (such as Search) and interact with external APIs, as in the sketch below.
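As an illustration of what tool invocation looks like in practice, here is a hedged sketch using the google-generativeai SDK's automatic function calling; get_stock_price is a hypothetical stub, not a real data source:

```python
# Sketch of tool use: the model may decide to call a developer-supplied
# function before answering. get_stock_price is a hypothetical stub.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def get_stock_price(ticker: str) -> float:
    """Return the latest price for a ticker (stubbed for illustration)."""
    return 175.0  # placeholder value, not real market data

model = genai.GenerativeModel("gemini-2.0-flash-exp", tools=[get_stock_price])
chat = model.start_chat(enable_automatic_function_calling=True)
reply = chat.send_message("What is GOOG trading at right now?")
print(reply.text)  # the SDK runs get_stock_price and feeds the result back
```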

Tulsee Doshi, head of product for Google's Gemini models, said:

"We know that Flash is favored by developers for its good balance of speed and performance. In 2.0 Flash, it still maintains its speed advantage but is now even more powerful."

Google claims that, in its internal testing, 2.0 Flash ran twice as fast as the Gemini 1.5 Pro model on certain benchmarks, with "significant" improvements in areas such as coding and image analysis. In fact, the company said 2.0 Flash has replaced 1.5 Pro as the flagship Gemini model, thanks to its better mathematical performance and "factuality."

As mentioned earlier, 2.0 Flash can generate and modify images in addition to text. The model can also read photos, videos, and audio recordings and answer questions about their content (e.g., "What did he say?").
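A question like "What did he say?" against a recording would look roughly like this sketch, assuming the google-generativeai SDK's file upload API; the file name is illustrative:

```python
# Sketch of multimodal Q&A: upload a recording and ask about its content.
# "interview.mp3" is a placeholder path; upload_file also accepts images
# and video.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

audio = genai.upload_file("interview.mp3")
model = genai.GenerativeModel("gemini-2.0-flash-exp")
response = model.generate_content([audio, "What did he say?"])
print(response.text)
```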

Audio generation is another key feature of 2.0 Flash, one that Doshi described as "controllable" and "customizable." For example, the model can read text aloud using eight voices optimized for different accents and languages.

However, Google has not provided samples of images or audio generated by 2.0 Flash, so its output quality cannot yet be compared with that of other models.

Google stated that it is using its SynthID technology to watermark all audio and images generated by 2.0 Flash. On software and platforms that support SynthID (i.e., some Google products), the model's output will be flagged as synthetic content. The move is intended to ease concerns about abuse: "deepfakes" are a growing threat, and according to identity verification service Sumsub, the number of deepfakes detected worldwide quadrupled from 2023 to 2024.

Multimodal API

A production version of 2.0 Flash will launch in January. In the meantime, Google has introduced an API called the Multimodal Live API to help developers build applications with real-time audio and video streaming capabilities.

Through the Multimodal Live API, Google says, developers can build real-time multimodal applications that take audio and video input from a camera or screen. The API supports tool integration for completing tasks and can handle "natural conversation patterns" such as interruptions, similar to the capabilities of OpenAI's Realtime API.

The Multimodal Live API has been generally available since Wednesday morning.
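For a sense of how a real-time session is wired up, here is a minimal sketch assuming the google-genai Python SDK's live.connect interface as documented at launch; the model ID, config keys, and method names are assumptions and may have changed since:

```python
# Sketch of a Multimodal Live API session: open a bidirectional connection,
# send a turn, and stream the model's response. Names here follow the
# google-genai SDK as documented at launch and are best-effort assumptions.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
config = {"response_modalities": ["TEXT"]}  # "AUDIO" selects spoken replies

async def main():
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        await session.send(input="Hello, can you hear me?", end_of_turn=True)
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```

In a full application, the single text turn above would be replaced with streamed microphone or camera frames, and the interruption handling the article describes happens over the same bidirectional connection.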


Risk warning and disclaimer

Markets carry risk, and investment requires caution. This article does not constitute personal investment advice, nor does it take into account the specific investment objectives, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article are appropriate to their particular circumstances. Investment on this basis is at one's own risk.