Multi-Modal Learning in Artificial Intelligence
The Multi-Product Line of Google’s Gemini and the Future of AI
Image by Alphabet Inc.
This week’s launch of Google’s Gemini is to Machine Learning what the Olympics is to sport: a collection of many different events under one banner. The analogy is not perfect, but it gets the point across that the impact of Artificial Intelligence will continue to feel all-encompassing in our lives. The individual segments of Machine Learning (ML), such as Natural Language Processing (NLP) working on text in isolation, ignore the images and other data types that could improve their accuracy and effectiveness. By combining different kinds of machine learning, as the Olympics combines sports, artificial intelligence will come closer to natural intelligence. The gap between the non-biological and the biological will narrow.
Few people know about Multi-Modal Learning in Artificial Intelligence, so here is a description of what it is, its use cases, and its limitations. There are many ways to approach this, but let’s keep it short . . . for Substack.
Video by Alphabet Inc.
Multi-Modal Learning (MML) aims to combine inputs such as images, text, speech, audio, video, and more into a comprehensive view of an environment. Like human beings, MML processes multiple inputs to inform its behavior. The same applies to outputs, such as NLP text paired with the numerical data of a predictive graph, among other combinations. Overall reasoning and accuracy improve by integrating and correlating multiple signals, aligning more closely with human perception.
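To make that concrete, here is a minimal sketch of one common MML pattern, often called late fusion, where each modality keeps its own encoder and the embeddings are concatenated before a shared prediction head. The framework (PyTorch), the dimensions, and the module names are illustrative assumptions, not a description of Gemini’s architecture.

```python
# A minimal late-fusion sketch: each modality gets its own encoder, and the
# projected embeddings are concatenated before a shared prediction head.
# All dimensions and names here are illustrative.
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, hidden_dim=256, num_classes=10):
        super().__init__()
        # One encoder per modality, so each signal keeps its own representation
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fusion head operates on the concatenated embeddings
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, image_features, text_features):
        img = self.image_encoder(image_features)
        txt = self.text_encoder(text_features)
        fused = torch.cat([img, txt], dim=-1)  # integrate the two signals
        return self.classifier(fused)

# Toy usage with random features standing in for real encoder outputs
model = LateFusionModel()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```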
Image by Anonymous
Real-world use cases for MML are still few, but here are three of them.
Next Generation Recommendation Engines
Unify a user’s browsing history and purchasing patterns with preferences extracted via NLP from the comments they write and the articles they read, then blend those signals with collaborative filtering.
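As a toy illustration of how those signals might be blended, the sketch below mixes a collaborative-filtering score with a text-similarity score from NLP embeddings. All of the data, weights, and names are hypothetical stand-ins, not any vendor’s actual recommendation pipeline.

```python
# Hypothetical sketch of blending two recommendation signals:
# a collaborative-filtering score and a text-similarity score
# from (pretend) NLP embeddings of comments/articles.
import numpy as np

rng = np.random.default_rng(0)
n_items = 5

# Signal 1: collaborative filtering -- predicted scores for one user
cf_scores = rng.uniform(0, 1, n_items)

# Signal 2: cosine similarity between the user's "reading profile" embedding
# and each item's description embedding (both stand-ins for real NLP outputs)
user_profile = rng.normal(size=64)
item_embeddings = rng.normal(size=(n_items, 64))
text_scores = item_embeddings @ user_profile / (
    np.linalg.norm(item_embeddings, axis=1) * np.linalg.norm(user_profile)
)

# Fuse: rescale each signal to [0, 1], then take a weighted blend
def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

blended = 0.6 * minmax(cf_scores) + 0.4 * minmax(text_scores)
print("Recommended item order:", np.argsort(blended)[::-1])
```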
Comprehensive Medical Diagnosis
Radiology, dermatology, and pathology experts combine insights from medical imaging such as X-rays, MRIs, and microscopy with health record text and genomic data.
Robot Perception
Combine images, video, depth sensor data, motion dynamics, and positional tracking of an environment for enhanced robot navigation decisions.
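Here is a deliberately tiny example of the navigation idea: fusing a noisy range sensor with a position estimate integrated from motion dynamics using a complementary filter. The values and the filter weight are arbitrary assumptions for illustration only, not a real robot stack.

```python
# Toy 1D sensor fusion for navigation: a noisy range sensor reading is
# blended with a position estimate integrated from motion dynamics
# (dead reckoning). The weight alpha is an arbitrary illustrative value.
import random

random.seed(0)
true_position = 0.0
velocity = 0.5          # assumed constant forward velocity (m/s)
dt = 0.1                # timestep (s)
estimate = 0.0
alpha = 0.8             # trust in dead reckoning vs. the noisy sensor

for step in range(20):
    true_position += velocity * dt
    # Motion-dynamics prediction: integrate velocity over the timestep
    predicted = estimate + velocity * dt
    # Range sensor: the true position plus noise
    sensor = true_position + random.gauss(0, 0.05)
    # Complementary filter: blend prediction and measurement
    estimate = alpha * predicted + (1 - alpha) * sensor

print(f"true={true_position:.2f} m, fused estimate={estimate:.2f} m")
```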
Image by Anonymous
There are several significant limitations that only a few people are discussing, but here are three to start considering.
Operational Complexity of Data Fusion
Sophisticated neural architecture engineering is required to handle multiple input types while teams preserve what each modality has learned, from text to images and beyond. More importantly, a management framework for this does not yet exist. How do you manage the different teams and find common themes so they can work together? We also know the operational model of ML, but not of MML.
Financial Costs
The increased computational load of integrating data from text, images, and other modalities imposes significantly higher processing, data storage, and memory costs, hampering scalable deployment. MML can be very expensive, so what is the cost model for MML?
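As a rough illustration of why the costs climb, the back-of-envelope sketch below adds up embedding storage as modalities are stacked. Every number in it (catalog size, embedding dimensions, data type) is a made-up assumption, not a real cost model.

```python
# Back-of-envelope sketch of how storage grows as modalities are added.
# All numbers below are hypothetical, purely to illustrate the scaling.
BYTES_PER_FLOAT32 = 4
N_ITEMS = 10_000_000  # hypothetical catalog size

embedding_dims = {
    "text": 768,
    "image": 1024,
    "audio": 512,
    "video": 2048,
}

running_total = 0.0
for modality, dim in embedding_dims.items():
    gb = N_ITEMS * dim * BYTES_PER_FLOAT32 / 1e9
    running_total += gb
    print(f"+ {modality:<5} ({dim:>4}-dim): {gb:7.1f} GB -> cumulative {running_total:7.1f} GB")
```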
Evaluation Complexity
Reconciling the different types of evaluation metrics is difficult, given the distinct measures needed to validate outputs across image, text, audio, and other data types. There is no aggregating framework, and fitting to multiple datasets risks overfitting and capturing meaningless noise in your use-case outcomes. How do we find robust, generalized concepts that carry from one data type to the next without undergeneralizing or overgeneralizing?
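One naive way to think about the missing aggregation framework is sketched below: map each modality’s metric onto a common “higher is better” scale and take a weighted average. The metrics, weights, and normalization choices are all assumptions, which is precisely the gap described above.

```python
# A naive sketch of reconciling heterogeneous evaluation metrics:
# map each modality's metric onto a common "higher is better" 0-1 scale,
# then take a weighted average. Metrics, weights, and normalization are
# all illustrative assumptions -- no standard aggregation framework exists.
per_modality = {
    # (raw score, higher_is_better)
    "image_accuracy": (0.91, True),
    "text_bleu":      (0.34, True),
    "audio_wer":      (0.12, False),   # word error rate: lower is better
}
weights = {"image_accuracy": 0.4, "text_bleu": 0.3, "audio_wer": 0.3}

def normalize(score, higher_is_better):
    # Assumes every raw metric already lives in [0, 1]
    return score if higher_is_better else 1.0 - score

composite = sum(
    weights[name] * normalize(score, hib)
    for name, (score, hib) in per_modality.items()
)
print(f"Composite score: {composite:.3f}")
```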
Image by Anonymous
Nonetheless, the small number of use cases and the current limitations do not detract from the crucial step Google’s Gemini represents. One day we will look back at it as a critical milestone in the advancement of artificial intelligence, even though no one can seriously estimate the size of the MML market with any accuracy today. This reminds me of one of my favorite quotes, as a veteran data scientist of many years.
Quote by Rumi
To learn more, click on https://www.learngids.com






