Google 2026-05-20

Google Launches Gemini Omni: The Multimodal AI Model Unifying All Inputs and Outputs

As anticipated, the recently leaked Gemini Omni has been officially announced. However, unlike the simple "video generation model" that many expected, Google defines Gemini Omni as a model that "accepts any input and generates any output." Video is just one facet of its capabilities.

DeepMind CEO Demis Hassabis showed several demonstrations at the announcement. For instance, given an uploaded personal photo, Omni can instantly change the subject's real surroundings and restyle the scene. A hand-drawn circle can be turned into a black hole, and a sunset stroll scene can be re-rendered in a range of artistic styles. Any material becomes a canvas for constructing new realities through Omni.

Gemini Omni's core capability lies in integrating text, video, images, and interactive simulations into a single generative framework.

Specifically, it combines the capabilities of Google's most advanced generative media models: the image model "Nano Banana," the video generation model "Veo," and the world model "Genie." For example, given the request "create an explanatory animation of protein folding," Omni does not just return text; it directly outputs an educational video showing α-helices and β-sheets.

Prompt: claymation explainer of protein folding, everything is made out of clay, no hands, stop motion, accurate

Some users immediately ran detailed side-by-side comparisons of Omni and Seedance 2.0, checking output quality, motion, and consistency. By these early accounts, Seedance 2.0 performs more stably overall, while Omni is more expressive in certain scenarios.

According to the official blog, Omni's capabilities are particularly focused on "video editing" and "physical simulation."

Interactive Video Editing: Entering the Era of "Controllable" AI Videos

Beyond creating educational videos, video editing is a key application scenario for Omni. Users can upload selfie footage or any other material and use natural language as if chatting with a human video editor, iteratively refining the video through dialogue by adjusting styles or adding elements. This interaction model builds on the approach used earlier for Nano Banana image editing.

The official demos showed some distinctive abilities. For example, a video of a hand touching a mirror was uploaded with the instruction to "make the mirror spread beautiful ripples like liquid when touched, and transform the person's arm into reflective material." Rather than redrawing the entire video, Omni preserved the person's movements and replaced only the mirror's physical behavior and the arm's material.

Notably, the "multi-turn conversation ability" means each new instruction is applied to the previous result rather than starting over, which keeps characters, environment, physical effects, and scene context as consistent as possible.
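The multi-turn behavior described above can be pictured as a session that remembers every earlier instruction and uses the latest result as the starting point for the next edit. The following is a minimal Python sketch of that bookkeeping only. `EditSession`, `EditTurn`, and `apply_edit` are hypothetical names invented for illustration and are not part of any Google API; the "edit" here just produces a string label standing in for a real video.

```python
from dataclasses import dataclass, field


@dataclass
class EditTurn:
    instruction: str  # natural-language request for this turn
    source: str       # label of the clip the edit started from
    result: str       # label of the clip the edit produced


@dataclass
class EditSession:
    """Hypothetical bookkeeping for conversational editing (not a real API)."""
    original: str
    turns: list = field(default_factory=list)

    @property
    def current(self) -> str:
        # The latest result, or the untouched upload if nothing was edited yet.
        return self.turns[-1].result if self.turns else self.original

    def apply_edit(self, instruction: str) -> str:
        # Each edit starts from the previous result, not from the original,
        # which is what lets earlier changes carry over to later turns.
        source = self.current
        result = f"{self.original}@v{len(self.turns) + 1}"
        self.turns.append(EditTurn(instruction, source, result))
        return result

    def history(self) -> list:
        # All instructions so far, in order of application.
        return [turn.instruction for turn in self.turns]


if __name__ == "__main__":
    session = EditSession("mirror.mp4")
    session.apply_edit("make the mirror ripple like liquid when touched")
    session.apply_edit("turn the arm into a reflective material")
    print(session.current)
    print(session.history())
```

The design point is that the second instruction takes the first edit's output as its source, so the ripple effect does not have to be restated when the arm material is changed.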

Understanding the World's Physics, Not Just Pixels

Physical simulation is the most technically advanced aspect of Gemini Omni. Google highlights a "qualitative leap" in simulating phenomena like kinetic energy and gravity, enabling more realistic video, image, and interactive simulation content.

For instance, a request for "marbles rolling at high speed along a chain-reaction track" demonstrates Omni's accurate grasp of gravity and kinetic energy. A more complex example is the "alphabet object video," in which an uncommon matching object was generated for each of the 26 English letters (e.g., C for capybara, D for disco ball, L for lava lamp). Omni simultaneously handled the letter-object correspondence, on-screen pacing, subtitle format, frame count, music style, and the video's ending, showcasing a deep integration of language, image, and meaning.
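To make concrete what a correct handling of gravity and kinetic energy implies in the marble example, one can work out how fast a marble should be moving after descending a given height. The sketch below uses textbook energy conservation and ignores air resistance and rolling friction. It is an independent plausibility check a viewer could apply to such footage, not a description of how Omni works, and the 0.5 m drop is an arbitrary illustrative value.

```python
import math

G = 9.81  # standard gravitational acceleration in m/s^2


def sliding_speed(drop_height_m: float) -> float:
    """Speed after a frictionless drop: m*g*h = (1/2)*m*v^2."""
    return math.sqrt(2 * G * drop_height_m)


def rolling_speed(drop_height_m: float) -> float:
    """Speed of a solid sphere rolling without slipping.

    Part of the energy goes into rotation (I = (2/5)*m*r^2), so
    m*g*h = (7/10)*m*v^2 and the marble ends up slower than a
    frictionless sliding object dropped from the same height.
    """
    return math.sqrt(10 * G * drop_height_m / 7)


if __name__ == "__main__":
    h = 0.5  # illustrative drop height in metres (an assumption)
    print(f"sliding: {sliding_speed(h):.2f} m/s")
    print(f"rolling: {rolling_speed(h):.2f} m/s")
```

For a 0.5 m drop this gives roughly 3.1 m/s for a sliding object and 2.6 m/s for a rolling marble; the mass cancels out, so marbles of different sizes should reach the bottom at the same speed.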

Gemini Omni Flash is now rolling out across Google's products and is available worldwide to Google AI Plus, Pro, and Ultra subscribers through the Gemini app and Google Flow. In the Gemini web and mobile apps, selecting "Video Generation" gives access to Omni's features.

Gemini offers 18 preset styles, including "young and fashionable," "montage," "American comic," "talking pet," "party invitation," "moon," "transformation emoji," "graffiti effect," and "pixel adventure." Pro accounts can generate up to three videos per day. In practice, a prompt like "a male automotive YouTuber dressed in JK uniform with twin tails, standing in front of a car" (JK being Japanese shorthand for a high-school girl) combined with the "80s MV style" (music video) preset produced a striking video.

Google also announced that YouTube Shorts and YouTube Create app users can access these features for free starting this week. In the coming weeks, Gemini Omni will be opened to developers and enterprise customers via the API.

Omni can read images, text, video, and audio as reference materials and integrate them into a coherent output. To address concerns about AI forgery, all Omni-generated videos carry an invisible "SynthID" digital watermark so that their origin can be verified. In addition, an "Avatar" (digital twin) feature can clone a real person's appearance and voice.

Over the past year, Google has advanced Gemini's multimodal capabilities into image generation and editing with Nano Banana. Now, Gemini Omni extends this approach to the video domain, aiming to create a "Nano Banana moment" for video generation.

For video creators, the immediate effect is a further lowering of production barriers: smartphone footage, a single reference image, or a music track all become interactively editable material. The broader change is that a video can be rewritten again and again with a single command, which reshapes content production speed, authenticity verification, copyright boundaries, and platform governance.
