From RAGs to Riches: Multimodal Retrieval

Introduction to RAG

Imagine an AI assistant that not only understands natural language but also has instant access to the most up-to-date information from your company's databases and beyond. By retrieving relevant information from external sources and integrating it into a Large Language Model's (LLM) output, Retrieval Augmented Generation (RAG) ensures that generated text is not only coherent but also accurate and applicable to the user's specific needs.

In this series, we will explore some of the most groundbreaking advancements in RAG, as presented in top 2024 publications from leading venues in the field of Natural Language Processing (ACL, LREC-COLING, NeurIPS). Each of these papers tackles a critical aspect of RAG deployment, from resolving knowledge conflicts and fact-checking to domain-specific retrieval and language model personalization. By understanding these developments, businesses can leverage the power of RAG to thrive in an increasingly AI-driven world.


Multimodal Retrieval

While large language models (LLMs) have made remarkable strides in processing and generating text, they often struggle with visual information. This limitation has led to a text-only bias in many AI systems, making it difficult to effectively incorporate or generate visual content. This bias can be a significant hurdle for businesses looking to create more engaging and informative AI-powered experiences. Multimodal retrieval is a cutting-edge approach that aims to bridge this gap by enabling AI systems to understand and retrieve both textual and visual content.

UNIMUR: Teaching AI to Think with Image and Text

Wang et al. (2024) introduce Unified Embeddings for Multimodal Retrieval (UNIMUR), which tackles the text-only bias by learning a single embedding space for both textual and visual information, acting almost like a universal translator between text and images. Here's how it works:

  1. Image-to-Text Mapping: UNIMUR first learns to map images into the LLM's input space, allowing the model to "understand" visual information in a language-like format.

  2. Dual Alignment Training: The heart of UNIMUR lies in its innovative training strategy. It aligns the LLM's output embeddings with both visual and textual semantics, creating a unified multimodal embedding space.

  3. Joint Retrieval: During inference, UNIMUR uses this unified embedding space to retrieve both visual and textual outputs simultaneously, ensuring better cross-modal consistency.
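The joint-retrieval step above can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's actual architecture: the random vectors stand in for embeddings that UNIMUR would produce with its learned image-to-text mapping, and the corpus keys are hypothetical. The point is that once text and images share one embedding space, a single nearest-neighbor search ranks both modalities together.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # dimensionality of the shared (unified) embedding space

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy corpus: each item is either a caption or an image, but after the
# learned mapping all items live in the same unified embedding space.
corpus = {
    "caption: a red bicycle":  normalize(rng.normal(size=DIM)),
    "image: bicycle.jpg":      normalize(rng.normal(size=DIM)),
    "caption: a bowl of soup": normalize(rng.normal(size=DIM)),
    "image: soup.png":         normalize(rng.normal(size=DIM)),
}

def retrieve(query_vec, corpus, k=2):
    """Rank text and image items together by cosine similarity."""
    scores = {key: float(vec @ query_vec) for key, vec in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

query = normalize(rng.normal(size=DIM))  # stand-in for an encoded user query
print(retrieve(query, corpus))
```

Because both modalities are scored by the same similarity function, the model cannot systematically ignore images in favor of text, which is exactly the bias UNIMUR is designed to reduce.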

The Key Benefits of UNIMUR:

  • Reduces the AI's tendency to ignore images and focus only on text.

  • Ensures that when the AI talks about images, it actually matches what's in the picture.

  • Improves overall performance in tasks involving both images and text.

  • Doesn't require a complete overhaul of existing AI systems, making it cost-effective to implement.

HOSA: Teaching AI to Pay Attention to Details

Introduced by Gao et al. in their 2024 paper, the High-Order Semantic Alignment (HOSA) framework presents a different approach to multimodal retrieval, focusing on capturing the fine-grained relationships between images and text. The process starts by breaking down both images and text into smaller, meaningful parts. For images, it identifies important regions or objects. For text, it looks at individual words and their contexts. This detailed breakdown allows HOSA to match specific parts of an image with specific words or phrases, rather than just matching whole images with whole texts.

The core idea behind HOSA is that each part of an image can be described using a combination of words from the text, and vice versa. To achieve this, HOSA uses a mathematical technique called the tensor product. This allows the model to efficiently compute how well different parts of the image match with different parts of the text, considering multiple levels of correspondence simultaneously.
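The "every part of the image interacts with every word" idea can be illustrated with a pairwise similarity matrix. This is a minimal sketch, assuming cosine similarity between randomly generated stand-in embeddings; HOSA's actual tensor-product formulation is richer, but the matrix below shows the underlying structure: one score for each (word, region) pair.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 32                                # embedding dimension (illustrative)
regions = rng.normal(size=(5, D))     # 5 detected image regions
words   = rng.normal(size=(7, D))     # 7 word embeddings from the text

# L2-normalize so dot products become cosine similarities
regions /= np.linalg.norm(regions, axis=1, keepdims=True)
words   /= np.linalg.norm(words,   axis=1, keepdims=True)

# Pairwise word-region similarity matrix: entry (i, j) scores how well
# word i matches region j. Every word interacts with every region,
# which is the outer-product structure the tensor product generalizes.
sim = words @ regions.T               # shape: (7 words, 5 regions)
print(sim.shape)
```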

Multi-Level Relationship Modeling

HOSA doesn't just look at one-to-one matches between image parts and words; it considers relationships at multiple levels: between specific image regions and words (local), between the whole image and entire text (global), and how local elements relate to the overall context (local-global). This multi-level approach allows HOSA to capture complex, nuanced relationships between images and text.

When measuring how well an image matches a piece of text, HOSA scores each word against every image region, keeps the best-matching region for each word, and then aggregates these best matches into an overall similarity score. This focuses the comparison on the most important connections between the image and text, resulting in more accurate and meaningful retrieval results.
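The scoring strategy above can be sketched as follows. This is a hedged approximation, not HOSA's exact scoring function: the embeddings are random stand-ins, the local score uses a simple max-then-mean aggregation, the global score mean-pools each modality, and the 0.7/0.3 fusion weights are invented for illustration (the real model learns its fusion).

```python
import numpy as np

rng = np.random.default_rng(2)
D = 32
regions = rng.normal(size=(5, D))     # stand-in region embeddings
words   = rng.normal(size=(7, D))     # stand-in word embeddings
regions /= np.linalg.norm(regions, axis=1, keepdims=True)
words   /= np.linalg.norm(words,   axis=1, keepdims=True)

sim = words @ regions.T               # (n_words, n_regions) cosine similarities

# Local score: each word keeps only its single best-matching region,
# and the per-word best matches are averaged into one image-text score.
local_score = sim.max(axis=1).mean()

# Global score: compare pooled "whole text" and "whole image" vectors.
text_vec  = words.mean(axis=0)
image_vec = regions.mean(axis=0)
global_score = float(text_vec @ image_vec) / (
    np.linalg.norm(text_vec) * np.linalg.norm(image_vec))

# A fixed weighted sum stands in for the model's learned fusion.
score = 0.7 * float(local_score) + 0.3 * global_score
print(score)
```

Note how the max over regions lets irrelevant parts of the image drop out of the score entirely, which is what makes this style of matching robust to cluttered scenes.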

The key strength of HOSA lies in its ability to capture these detailed, multi-level relationships while remaining computationally efficient. This makes it particularly well-suited for tasks that require a deep understanding of how images and text relate to each other, such as detailed product searches or complex visual question-answering systems.

The Key Benefits of HOSA:

  • Focuses on fine-grained relationships between images and text

  • Breaks down images into regions and text into words/phrases

  • Matches specific image parts with specific textual elements

Conclusion

While both UNIMUR and HOSA aim to improve multimodal retrieval, they approach the problem from different angles. UNIMUR focuses on creating a unified embedding space for both modalities, while HOSA emphasizes fine-grained alignment through high-order semantic relationships. If you have any questions or would like to discuss how RAG can be implemented to optimize your business processes, we invite you to contact us and schedule a consultation. We're here to help you navigate the exciting world of RAG and unlock new possibilities for your organization. For more AI news, don’t forget to sign up for future blog posts!
