Google Cloud AI researchers have introduced LANISTR to address the challenge of handling structured and unstructured data effectively and efficiently within a single framework. In machine learning, handling multimodal data (comprising language, images, and structured data) is increasingly crucial. The key challenge is missing modalities in large-scale, unlabeled, structured data such as tables and time series. Traditional methods often struggle when one or more types of data are absent, leading to suboptimal model performance.
Current methods for pre-training multimodal data typically rely on the availability of all modalities during training and inference, which is often not feasible in real-world scenarios. These methods include various forms of early and late fusion techniques, where data from different modalities are combined at either the feature or decision level. However, these approaches are not suitable for situations where some modalities may be completely missing or incomplete.
Google's LANISTR (Language, Image, and Structured Data Transformer), a novel pre-training framework, leverages unimodal and multimodal masking strategies to create a robust pre-training objective that can handle missing modalities effectively. The framework is built around an innovative similarity-based multimodal masking objective, which allows it to learn from available data while making educated guesses about missing modalities. The framework aims to improve the adaptability and generalization of multimodal models, particularly in scenarios with limited labeled data.
The LANISTR framework employs unimodal masking, where portions of the data within each modality are masked during training. This forces the model to learn contextual relationships within the modality. For example, in text data, certain words may be masked and the model learns to predict them based on the surrounding words. In images, certain patches can be masked and the model learns to infer them from the visible parts.
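The text case described above can be sketched with a small masking routine. This is a minimal, BERT-style illustration of unimodal masking, not LANISTR's actual implementation; the token rate, `[MASK]` symbol, and helper name are assumptions for demonstration:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace a fraction of tokens with [MASK] (hypothetical sketch).

    Returns the masked sequence plus a map from masked position to the
    original token, which a model would be trained to reconstruct from
    the surrounding context.
    """
    rng = rng or random.Random(0)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels[i] = tok  # prediction target for this position
        else:
            masked.append(tok)
    return masked, labels

tokens = "the model learns context from surrounding words".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
```

The same idea carries over to images, where `tokens` would be patch embeddings rather than words, and the model reconstructs masked patches from the visible ones.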
Multimodal masking expands on this concept by masking entire modalities. For example, in a data set containing text, images, and structured data, one or two modalities may be completely masked at random during training. The model is then trained to predict the masked modalities from the available ones. This is where similarity-based targeting comes into play. The model is guided by a similarity measure, ensuring that the representations generated for the missing modalities are consistent with the available data. The effectiveness of LANISTR was evaluated on two real-world data sets: the Amazon Product Review data set from the retail sector and the MIMIC-IV data set from the healthcare sector.
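The multimodal masking step and the similarity-based objective can be sketched as follows. This is a simplified illustration under stated assumptions: the modality names, the random drop of one or two modalities, and the use of cosine similarity between a "full" and a "masked" embedding are our own stand-ins for the paper's actual formulation:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_loss(full_embedding, masked_embedding):
    """Similarity-based objective (sketch): the loss shrinks as the
    embedding of the modality-masked input approaches the embedding
    of the full, unmasked input."""
    return 1.0 - cosine(full_embedding, masked_embedding)

def mask_modalities(sample, rng=None):
    """Randomly mask out one or two entire modalities (hypothetical helper)."""
    rng = rng or random.Random(0)
    masked = dict(sample)
    n_drop = rng.choice([1, 2])
    for name in rng.sample(sorted(sample), n_drop):
        masked[name] = None  # this modality is entirely missing
    return masked

sample = {"text": [0.2, 0.9], "image": [0.5, 0.1], "table": [0.7, 0.3]}
partial = mask_modalities(sample)
```

Training then minimizes `similarity_loss` between the fused representation of `partial` and that of `sample`, which is what pushes the model to produce consistent representations even when modalities are absent.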
LANISTR showed effectiveness in out-of-distribution scenarios, where the model encountered data distributions that were not observed during training. This robustness is crucial in real-world applications, where data variability is a common challenge. LANISTR achieved significant improvements in accuracy and generalization even with limited labeled data.
In conclusion, LANISTR addresses a critical problem in the field of multimodal machine learning: the challenge of missing modalities in large-scale unlabeled data sets. By employing a novel combination of unimodal and multimodal masking strategies, along with a similarity-based multimodal masking objective, LANISTR enables robust and efficient pre-training. The evaluation experiment demonstrates that LANISTR can effectively learn from incomplete data and generalize well to new and unseen data distributions, making it a valuable tool for promoting multimodal learning.
Review the Paper and Blog. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter. Join our Telegram channel, Discord channel, and LinkedIn group.
If you like our work, you will love our Newsletter.
Don't forget to join our 42k+ ML SubReddit
Pragati Jhunjhunwala is a Consulting Intern at MarktechPost. She is currently pursuing a B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a technology enthusiast with a keen interest in data science software and applications, and is always reading about advancements in different fields of AI and ML.