Multimodal AI models are powerful tools capable of both understanding and generating visual content. However, existing approaches typically use a single visual encoder for both tasks, leading to suboptimal performance because comprehension and generation have fundamentally different requirements. Comprehension requires high-level semantic abstraction, while generation depends on local detail and global coherence. This mismatch creates conflicts that limit the model's overall efficiency and accuracy.
Researchers from DeepSeek-AI, the University of Hong Kong, and Peking University propose Janus, a novel autoregressive framework that unifies multimodal comprehension and generation by employing two distinct visual encoding pathways. Unlike previous models that rely on a single encoder, Janus introduces a specialized pathway for each task, with both processed by a unified transformer. This design alleviates the conflicts inherent in previous models and provides greater flexibility, allowing each pathway to use the encoding method best suited to its task. The name "Janus" aptly captures this duality: like the two-faced Roman god, it looks in two directions at once.
The Janus architecture consists of two main components, a comprehension encoder and a generation encoder, each handling visual inputs differently. For multimodal understanding, Janus extracts high-dimensional semantic features via SigLIP and transforms them into a sequence compatible with the language model. For visual generation, Janus uses a VQ tokenizer that converts visual data into discrete representations, enabling fine-grained image synthesis. Both token streams are processed by a shared transformer, allowing the model to operate in an autoregressive manner. This approach decouples the requirements of each visual task, simplifying implementation and improving scalability.
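The decoupled design described above can be sketched in a few lines of toy Python. This is a minimal illustration, not the authors' implementation: the class names are hypothetical, the "encoders" are placeholders standing in for SigLIP and the VQ tokenizer, and the shared backbone is a stub.

```python
# Toy sketch of Janus's decoupled visual pathways (hypothetical names).
# Two task-specific encoders feed one shared autoregressive backbone.

class UnderstandingEncoder:
    """Placeholder for the SigLIP semantic encoder used for comprehension."""
    def encode(self, image):
        # Map the image to a sequence of high-level semantic features.
        return [("sem", pixel) for pixel in image]

class GenerationEncoder:
    """Placeholder for the VQ tokenizer used for image generation."""
    def encode(self, image):
        # Map the image to discrete codebook indices (toy 16k codebook).
        return [hash(pixel) % 16384 for pixel in image]

class SharedTransformer:
    """One autoregressive backbone processes tokens from either pathway."""
    def forward(self, tokens):
        # Placeholder: a real model would predict the next token here.
        return len(tokens)

class Janus:
    def __init__(self):
        self.und_enc = UnderstandingEncoder()
        self.gen_enc = GenerationEncoder()
        self.backbone = SharedTransformer()

    def comprehend(self, image):
        # Understanding path: semantic features -> shared transformer.
        return self.backbone.forward(self.und_enc.encode(image))

    def generate(self, image):
        # Generation path: discrete VQ tokens -> shared transformer.
        return self.backbone.forward(self.gen_enc.encode(image))
```

The key point the sketch captures is that only the encoders differ; the transformer weights are shared across both tasks, which is what keeps the framework unified while removing the single-encoder bottleneck.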
Training is divided into three stages: adapter training, unified pre-training, and supervised fine-tuning, each of which improves the model's multimodal capabilities while maintaining consistency across tasks.
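A staged schedule like this is typically implemented by controlling which parameter groups receive gradients at each stage. The sketch below illustrates the idea only; the stage names follow the text, but the exact frozen/trainable split is an assumption for illustration, not the paper's precise recipe.

```python
# Hypothetical sketch of a three-stage training schedule. Which groups
# are trainable per stage is an illustrative assumption, not the
# authors' exact configuration.

STAGES = {
    "1_train_adapters":    {"adapters"},
    "2_unified_pretrain":  {"adapters", "transformer"},
    "3_supervised_tuning": {"adapters", "transformer", "heads"},
}

ALL_GROUPS = ("encoders", "adapters", "transformer", "heads")

def trainable_params(stage):
    """Return which parameter groups receive gradients in a given stage."""
    allowed = STAGES[stage]
    return [g for g in ALL_GROUPS if g in allowed]
```

For example, `trainable_params("1_train_adapters")` yields only the adapter group, reflecting a first stage that aligns the visual pathways to the language model before unfreezing the backbone.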
Experimental results demonstrate that Janus significantly outperforms previous models on several benchmarks. In multimodal understanding, Janus surpasses LLaVA-v1.5 and other unified models, and even matches or exceeds task-specific models in certain cases. Specifically, Janus scored 69.4, 63.7, and 87.0 on the MMBench, SEED-Bench, and POPE benchmarks, respectively, outperforming larger models such as Qwen-VL-Chat (7B). In visual generation, Janus also performed strongly, achieving a Fréchet inception distance (FID) of 8.53 on MSCOCO-30K and demonstrating better consistency with user prompts than competing models such as DALL-E 2 and SDXL. Together, these results show that Janus offers a balanced ability to understand and generate visual content while being more parameter-efficient.
In conclusion, Janus represents a major step forward in the development of unified multimodal AI models by resolving the conflict between understanding and generation. Its decoupling approach proves both effective and efficient, enabling high-quality semantic understanding alongside detailed visual generation. This flexibility makes Janus a promising candidate for future developments in multimodal AI, with potential applications extending to additional modalities such as point clouds or audio. Janus's extensibility, flexibility, and robust performance position it as inspiration for the next generation of unified multimodal models.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and accessible to a wide audience. The platform draws more than 2 million monthly visits, illustrating its popularity among readers.