BAAI Launches OmniGen2: A Unified Diffusion and Transformer Model for Multimodal AI

Beijing Academy of Artificial Intelligence (BAAI) introduces OmniGen2, a next-generation, open-source multimodal generative model. Expanding on its predecessor OmniGen, the new architecture unifies text-to-image generation, image editing, and subject-driven generation within a single transformer framework. It innovates by decoupling the modeling of text and image generation, incorporating a reflective training mechanism, and implementing a purpose-built benchmark—OmniContext—to evaluate contextual consistency.

A Decoupled Multimodal Architecture

Unlike prior models that share parameters across text and image modalities, OmniGen2 introduces two distinct pathways: an autoregressive transformer for text generation and a diffusion-based transformer for image synthesis. It also employs a novel positional embedding scheme, Omni-RoPE, which flexibly encodes sequence order, spatial coordinates, and modality distinctions, enabling high-fidelity image generation and editing.
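
The paper's exact rotation layout is not reproduced here, but a minimal sketch of a multi-axis rotary embedding in the spirit of Omni-RoPE could look like the following, where the head dimension is split into three groups rotated by a sequence/modality id, an image row, and an image column (the split and frequency schedule are assumptions for illustration):

```python
import torch

def multi_axis_rope(x, seq_id, row, col, base=10000.0):
    """Illustrative multi-axis rotary embedding: split the head dimension into
    three groups and rotate each by a different position component
    (sequence/modality id, image row, image column). The real Omni-RoPE
    layout may differ; this only conveys the general idea."""
    d = x.shape[-1]
    assert d % 6 == 0, "head dim must split into three even-sized groups"
    group, half = d // 3, d // 6
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)   # per-group frequencies
    out = []
    for g, pos in enumerate((seq_id, row, col)):
        xg = x[..., g * group:(g + 1) * group]
        angles = pos.unsqueeze(-1).to(x.dtype) * freqs            # (..., half)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = xg[..., :half], xg[..., half:]
        out.append(torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1))
    return torch.cat(out, dim=-1)
```

One plausible use of such a scheme is that text tokens advance only the sequence component while image tokens also carry their 2D coordinates, which is how spatial structure and modality identity can be kept distinct.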

To preserve the pretrained text generation ability of the underlying MLLM (based on Qwen2.5-VL-3B), OmniGen2 feeds VAE-derived features only to the diffusion pathway. This avoids compromising the model’s text understanding and generation capabilities while maintaining rich visual representation for the image synthesis module.
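
As a rough sketch of that routing (the module and argument names below are invented for illustration, not OmniGen2's actual interfaces), the frozen MLLM supplies semantic conditioning while VAE latents enter only the diffusion transformer:

```python
import torch
import torch.nn as nn

class DecoupledGenerator(nn.Module):
    """Schematic of the decoupled design: the MLLM never consumes VAE latents,
    while the diffusion transformer conditions on both the MLLM hidden states
    and VAE-encoded reference images. Names and shapes are illustrative."""

    def __init__(self, mllm, vae_encoder, diffusion_transformer):
        super().__init__()
        self.mllm = mllm        # e.g. a Qwen2.5-VL-style backbone, kept frozen
        self.vae = vae_encoder  # provides fine-grained visual latents
        self.dit = diffusion_transformer

    def forward(self, text_ids, input_images, noisy_latents, timestep):
        # Understanding path: text plus the MLLM's own image features, no VAE latents.
        with torch.no_grad():
            cond_hidden = self.mllm(text_ids, input_images)
        # Synthesis path: VAE latents are consumed only by the diffusion transformer.
        ref_latents = self.vae(input_images)
        return self.dit(noisy_latents, timestep,
                        cond=torch.cat([cond_hidden, ref_latents], dim=1))
```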

Reflection Mechanism for Iterative Generation

One of the standout features in OmniGen2 is the reflection mechanism. By integrating feedback loops during training, the model is capable of analyzing its generated outputs, identifying inconsistencies, and proposing refinements. This process mimics test-time self-correction and significantly enhances instruction-following accuracy and visual coherence, especially for nuanced tasks like modifying color, object count, or positioning.
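
In pseudocode form, a test-time loop of this kind could look as follows (the method names are hypothetical placeholders, not OmniGen2's API):

```python
def generate_with_reflection(model, prompt, max_rounds=3):
    """Sketch of a reflect-and-revise loop: generate, let the model critique
    its own output against the prompt, and regenerate until it decides the
    result is satisfactory or the round budget is exhausted."""
    image = model.generate_image(prompt)
    for _ in range(max_rounds):
        critique = model.reflect(prompt, image)   # e.g. "the mug should be red, not blue"
        if critique.satisfied:                    # model chooses to terminate
            break
        image = model.generate_image(prompt, feedback=critique.text, previous=image)
    return image
```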

The reflection dataset was constructed using multi-turn feedback, enabling the model to learn how to revise and terminate generation based on content evaluation. This mechanism is particularly useful in bridging the quality gap between open-source and commercial models.
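
A multi-turn reflection sample might be organized roughly like this (the schema below is a guess for illustration; the released dataset's exact fields may differ):

```python
# Hypothetical layout of one multi-turn reflection training sample.
reflection_sample = {
    "instruction": "Add a third candle to the birthday cake.",
    "turns": [
        {"image": "generation_round_0.png",
         "reflection": "Only two candles are visible; one more is required.",
         "action": "revise"},
        {"image": "generation_round_1.png",
         "reflection": "Three candles are present and the instruction is satisfied.",
         "action": "terminate"},
    ],
}
```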

OmniContext Benchmark: Evaluating Contextual Consistency

To rigorously assess in-context generation, the team introduces OmniContext, a benchmark comprising three primary task types: SINGLE, MULTIPLE, and SCENE, across Character, Object, and Scene categories. OmniGen2 demonstrates state-of-the-art performance among open-source models in this domain, scoring 7.18 overall—outperforming other leading models like BAGEL and UniWorld-V1.

The evaluation uses three core metrics: Prompt Following (PF), Subject Consistency (SC), and an Overall Score computed as the geometric mean of the two, with PF and SC judged via GPT-4.1-based reasoning. This benchmarking framework emphasizes not just visual realism but also semantic alignment with prompts and cross-image consistency.
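
Since the Overall Score is the geometric mean of the two judge-assigned metrics, the aggregation reduces to a one-liner (the scores below are made-up numbers on a 0-10 scale, not figures from the paper):

```python
import math

def overall_score(pf, sc):
    """Overall score as the geometric mean of Prompt Following (PF) and
    Subject Consistency (SC), per the metric definition above."""
    return math.sqrt(pf * sc)

# Illustrative judge scores:
print(round(overall_score(8.0, 7.0), 2))  # 7.48
```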

Data Pipeline and Training Corpus

OmniGen2 was trained on 140M T2I samples and 10M proprietary images, supplemented by meticulously curated datasets for in-context generation and editing. These datasets were constructed using a video-based pipeline that extracts semantically consistent frame pairs and automatically generates instructions using Qwen2.5-VL models. The resulting annotations cover fine-grained image manipulations, motion variations, and compositional changes.
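
A stripped-down sketch of such a pipeline might look like this (the similarity thresholds and the `describe_difference` helper are assumptions, not the project's actual tooling):

```python
def mine_edit_pair(frames, vlm, similarity, min_sim=0.70, max_sim=0.95):
    """Sketch of the video-based data pipeline: pick two frames that clearly
    share a scene but differ in some attribute, then ask a VLM (e.g. a
    Qwen2.5-VL model) to write the instruction that maps one to the other."""
    for i in range(len(frames) - 1):
        for j in range(i + 1, len(frames)):
            sim = similarity(frames[i], frames[j])
            if min_sim < sim < max_sim:              # same scene, visible change
                instruction = vlm.describe_difference(frames[i], frames[j])
                return {"source": frames[i],
                        "target": frames[j],
                        "instruction": instruction}
    return None                                      # no usable pair in this clip
```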

For training, the MLLM parameters remain largely frozen to retain general understanding, while the diffusion module is trained from scratch and optimized for joint visual-textual attention. A special token “<|img|>” triggers image generation within output sequences, streamlining the multimodal synthesis process.
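
In code, that division of labor amounts to freezing one module and training the other, with the special token acting as a hand-off signal; the snippet below is a schematic under assumed attribute names (`mllm`, `dit`), not the release's training script:

```python
IMG_TRIGGER = "<|img|>"

def prepare_for_training(model):
    """Freeze the MLLM to retain its pretrained understanding and train only
    the diffusion transformer (attribute names are assumed)."""
    for p in model.mllm.parameters():
        p.requires_grad = False
    for p in model.dit.parameters():
        p.requires_grad = True

def route_output(model, output_text, cond_state):
    """If the MLLM emits the image trigger token, hand control to the
    diffusion pathway; otherwise return the text response as-is (sketch)."""
    if IMG_TRIGGER in output_text:
        return model.dit.sample(cond=cond_state)
    return output_text
```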

Performance Across Tasks

OmniGen2 delivers strong results across multiple domains:

  • Text-to-Image (T2I): Achieves a score of 0.86 on GenEval and 83.57 on DPG-Bench.
  • Image Editing: Outperforms open-source baselines with high semantic consistency (SC=7.16).
  • In-Context Generation: Sets new benchmarks in OmniContext with 7.81 (SINGLE), 7.23 (MULTIPLE), and 6.71 (SCENE) task scores.
  • Reflection: Demonstrates effective revision of failed generations, with promising correction accuracy and termination behavior.

Conclusion

OmniGen2 is a robust and efficient multimodal generative system that advances unified modeling through architectural separation, high-quality data pipelines, and an integrated reflection mechanism. By open-sourcing its models, datasets, and code, the project lays a solid foundation for future research in controllable, consistent image-text generation. Upcoming improvements may focus on reinforcement learning for reflection refinement, broader multilingual support, and robustness to low-quality inputs.


Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project.
