Researchers from Datategy SAS in France and the Math & AI Institute in Turkey propose a possible direction for recently emerging multimodal architectures. The core idea of their study is that the well-studied formulation of Named Entity Recognition (NER) can be incorporated into a multimodal large language model (LLM) environment.
Multimodal architectures such as LLaVA, Kosmos, or AnyMAL have been gaining ground recently and have demonstrated their capabilities in practice. These models tokenize data from non-text modalities, such as images, and use modality-specific external encoders to embed them in a joint linguistic space. This design lets such architectures process multimodal data interleaved with text.
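To make the pattern concrete, here is a minimal PyTorch sketch of this encoder-plus-projection recipe: a frozen modality encoder produces feature vectors, and a small trainable projector maps them into the LLM's token-embedding space so they can be spliced between ordinary text embeddings. The projector design and all dimensions are illustrative assumptions, not the exact implementation of any of the cited models.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        # A small trainable MLP maps encoder features into the LLM's
        # token-embedding space; the frozen encoder itself is not shown.
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, encoder_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(features)

# Interleaving: splice projected "image tokens" between text embeddings.
text_emb = torch.randn(1, 16, 4096)    # embeddings of the surrounding text
img_feats = torch.randn(1, 256, 1024)  # output of a frozen image encoder
projector = ModalityProjector(encoder_dim=1024, llm_dim=4096)
sequence = torch.cat(
    [text_emb[:, :8], projector(img_feats), text_emb[:, 8:]], dim=1
)
```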
The authors propose that this generic architectural preference can be extended to a much more ambitious setting in the near future, which they refer to as an “omnimodal era.” “Entities,” in the sense familiar from NER, can themselves be treated as modalities for these types of architectures.
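In practice, the entity-as-modality idea could plausibly look like the sketch below: an NER tagger marks spans, and each entity type is dispatched to its own encoder whose output replaces the span's surface tokens in the input sequence. Every name and interface here is a hypothetical illustration of the paper's direction, not code from the paper.

```python
import torch
import torch.nn as nn

LLM_DIM = 4096

class PlaceholderEntityEncoder(nn.Module):
    """Stand-in for a trained, entity-type-specific modality encoder."""
    def __init__(self, dim: int = LLM_DIM):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, span_text: str) -> torch.Tensor:
        # A real encoder would parse the span (a number, a date, a place);
        # here we just hash it into a deterministic pseudo-embedding.
        g = torch.Generator().manual_seed(abs(hash(span_text)) % (2**31))
        return self.proj(torch.randn(LLM_DIM, generator=g))

# One dedicated encoder per entity type, as the paper envisions.
ENTITY_ENCODERS = {
    label: PlaceholderEntityEncoder()
    for label in ("CARDINAL", "DATE", "GPE", "PERSON", "ORG")
}

def splice_entity(token_emb: torch.Tensor, start: int, end: int,
                  label: str, span_text: str) -> torch.Tensor:
    """Replace an entity span's token embeddings with one entity token."""
    entity_vec = ENTITY_ENCODERS[label](span_text).unsqueeze(0)
    return torch.cat([token_emb[:start], entity_vec, token_emb[end:]], dim=0)

# Usage: swap the tokens of "May 5, 2024" (positions 3..7) for a DATE token.
seq = torch.randn(12, LLM_DIM)
seq = splice_entity(seq, 3, 7, "DATE", "May 5, 2024")
```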
For example, current LLMs are known to have difficulty carrying out complete algebraic reasoning. Although research is being done to develop dedicated “mathematics-friendly” models or to call external tools, one horizon for this problem could be to define quantitative values as a modality in this framework. Another example would be implicit and explicit date and time entities, which could be processed by a dedicated temporal-cognition modality encoder.
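As a sketch of what a quantitative-value encoder might look like, the toy module below maps a number to sinusoidal features of its signed log magnitude, so numerically close values land close together in embedding space. The feature scheme and dimensions are our assumptions for illustration; a temporal encoder could analogously map dates and times onto cyclical features.

```python
import math
import torch
import torch.nn as nn

class NumericEncoder(nn.Module):
    def __init__(self, num_freqs: int = 16, llm_dim: int = 4096):
        super().__init__()
        # Fixed geometric frequency ladder, in the spirit of positional
        # encodings; the projection into LLM space is the learned part.
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs).float())
        self.proj = nn.Linear(2 * num_freqs, llm_dim)

    def forward(self, value: float) -> torch.Tensor:
        # Signed log compresses the huge dynamic range of real-world numbers.
        x = math.copysign(math.log1p(abs(value)), value)
        phases = x * self.freqs
        feats = torch.cat([torch.sin(phases), torch.cos(phases)])
        return self.proj(feats)

encoder = NumericEncoder()
emb = encoder(1234.5)  # one "number token" ready to splice into a sequence
```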
LLMs also have a very difficult time with geospatial understanding, where they are far from being considered “geospatially aware.” Numerical global coordinates need to be processed so that notions of proximity and adjacency are accurately reflected in the linguistic embedding space. Incorporating locations as a dedicated geospatial modality, with a specifically designed encoder and joint training, could therefore address this problem. Beyond these examples, the most immediate candidates for entities incorporated as modalities are people, institutions, and so on.
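One plausible way such a geospatial encoder could respect proximity is to map latitude and longitude onto the unit sphere before projecting into the linguistic space, as in the minimal sketch below. This encoding choice is our illustrative assumption, not a design from the paper.

```python
import math
import torch
import torch.nn as nn

class GeoEncoder(nn.Module):
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(3, llm_dim)

    def forward(self, lat_deg: float, lon_deg: float) -> torch.Tensor:
        lat, lon = math.radians(lat_deg), math.radians(lon_deg)
        # Points close on Earth are close in this 3-D representation, which
        # a jointly trained projection can carry into the linguistic space.
        xyz = torch.tensor([math.cos(lat) * math.cos(lon),
                            math.cos(lat) * math.sin(lon),
                            math.sin(lat)])
        return self.proj(xyz)

paris = GeoEncoder()(48.8566, 2.3522)  # a "location token" for Paris
```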
The authors argue that this type of approach promises to ease the parametric/non-parametric knowledge trade-off and the context-length limitation, since complexity and information can be distributed across numerous modality encoders. It could also simplify injecting updated information through individual modalities. The researchers outline the contours of such a potential framework and discuss the promises and challenges of developing an entity-driven language model.
Check out the Paper. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understandable to a wide audience. The platform has more than 2 million monthly visits, which illustrates its popularity among readers.