[#22] NVLM: NVIDIA's new family of multimodal LLMs

Plus: new models for human-like conversations and better RAG, evidence that chain-of-thought mainly helps with math tasks, and a new health AI newsletter

Hello readers, in this issue we cover:

  • NVLM is a new family of open-source frontier models released by NVIDIA

  • Moshi is an open-source model for more natural, human-like conversations

  • Smart glasses can be used to predict how engaged a person is during a conversation

  • Meeting summary evaluations without the original meeting transcript

  • Chain-of-thought reasoning is mainly helpful for math and symbolic reasoning tasks

  • SFR-RAG is a small LLM that outperforms large models at RAG tasks

Also, I’m launching a new weekly newsletter dedicated to health AI. There is so much progress with AI in medicine and healthcare that I couldn’t fit it all in this one. I’d appreciate it if you took a moment to subscribe!

🌅 NVLM: a new open-source frontier LLM from NVIDIA

NVIDIA announced NVLM 1.0, a set of cutting-edge multimodal large language models that perform at state-of-the-art levels on vision-language tasks, matching or surpassing both proprietary models like GPT-4o and open-source models like Llama 3-V 405B and InternVL 2.

The architecture draws on both decoder-only multimodal models (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo), a hybrid design that balances training efficiency with multimodal reasoning ability.

They also introduced a 1-D tile-tagging system for handling dynamic high-resolution images, which improves performance on multimodal reasoning and OCR tasks.
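
To make the idea concrete, here is a minimal sketch of 1-D tile tagging; the tag format, tile size, and helper names are illustrative assumptions, not NVIDIA's actual implementation:

```python
# Minimal sketch of 1-D tile tagging for dynamic high-resolution images.
# Tag strings, tile size, and helper names are illustrative, not NVLM's code.

def split_into_tiles(image, tile_size=448):
    """Split a high-res image array (H, W, C) into a flat list of tiles."""
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h, tile_size):
        for left in range(0, w, tile_size):
            tiles.append(image[top:top + tile_size, left:left + tile_size])
    return tiles

def build_multimodal_sequence(image, vision_encoder, tokenizer):
    """Interleave 1-D tile tags (plain text tokens) with each tile's visual tokens."""
    sequence = []
    for i, tile in enumerate(split_into_tiles(image), start=1):
        sequence += tokenizer.encode(f"<tile_{i}>")  # 1-D positional tag as text
        sequence += vision_encoder(tile)             # visual tokens for this tile
    return sequence
```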

NVIDIA also built production-grade multimodality into the NVLM 1.0 models, enabling them to excel at vision-language tasks while improving text-only performance. This was achieved by incorporating a high-quality text dataset into multimodal training, along with a large amount of math and reasoning data, yielding better math and coding capabilities across both text and image modalities.

🗣️ Moshi is an open-source model for more natural, human-like conversations

Moshi merges several components to create more fluid conversations with AI

Moshi is a new speech-to-speech dialogue model designed to create more natural and fluid conversations. Unlike traditional spoken dialogue systems that rely on separate components such as voice activity detection, speech recognition, and text-to-speech, Moshi integrates these functions into a single model, reducing delays and retaining non-verbal cues such as emotion and interruptions that pipeline systems typically lose.

Moshi achieves this by framing dialogue as direct speech-to-speech generation, bypassing the text-based intermediate step. Using a neural audio codec, it models both user and system speech in parallel streams, eliminating the need for strict turn-taking and allowing for dynamic, real-time interactions.

The framework also extends previous models by predicting time-aligned text tokens before generating audio, enhancing the linguistic quality of the speech output. This approach enables Moshi to perform real-time speech recognition and text-to-speech, achieving practical latency as low as 200ms. It is the first full-duplex spoken large language model designed for real-time conversations.
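
Conceptually, each generation step looks something like the sketch below: the model first predicts a time-aligned text token (the linguistic scaffold described above), then audio tokens for the system stream, while continuously consuming the user stream. All names and interfaces here are illustrative assumptions, not Moshi's actual API.

```python
# Conceptual sketch of full-duplex, dual-stream generation (illustrative only).
# `model` jointly predicts time-aligned text tokens and audio codec tokens for
# two parallel streams: the system's voice and the user's voice.

def duplex_step(model, state, user_audio_frame, codec):
    # Encode the incoming user audio frame into discrete codec tokens.
    user_tokens = codec.encode(user_audio_frame)

    # Predict the time-aligned text token first; per the paper, this
    # text stream improves the linguistic quality of the speech output.
    text_token, state = model.predict_text(state, user_tokens)

    # Then predict the system's audio codec tokens for the same time step.
    system_tokens, state = model.predict_audio(state, text_token, user_tokens)

    # Both streams advance together every frame, so there is no explicit
    # turn-taking: the system can speak, listen, or be interrupted at any time.
    return codec.decode(system_tokens), state
```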

👓 CMU researchers predict conversation engagement with smart glasses

Data from smart glasses, combined with LLMs, can predict how engaged a person is

This paper predicts engagement in two-person conversations by analyzing verbal and non-verbal cues captured with camera-equipped smart glasses. Data were collected from 34 participants, each providing self-reported engagement ratings. A novel fusion strategy integrates multiple behavioral modalities into a "multimodal transcript" that an LLM then processes for behavioral reasoning tasks. Results show this approach performs comparably to established fusion techniques, and it represents one of the first attempts to use LLMs to reason about real-world human behavior, with applications in communication, collaboration, mental health support, and accessibility.
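
The "multimodal transcript" idea is easy to sketch: non-verbal signals are rendered as timestamped text annotations and interleaved with speech, so a plain LLM can reason over both. The event format and prompt below are my own assumptions, not the paper's exact scheme.

```python
# Sketch of fusing behavior modalities into one "multimodal transcript" for an
# LLM to reason over. The event format is illustrative, not the paper's schema.

events = [
    {"t": 0.0, "modality": "speech",  "who": "partner", "value": "How was your weekend?"},
    {"t": 0.4, "modality": "gaze",    "who": "wearer",  "value": "looking at partner"},
    {"t": 1.2, "modality": "speech",  "who": "wearer",  "value": "Pretty good, went hiking."},
    {"t": 1.3, "modality": "gesture", "who": "partner", "value": "nodding"},
]

def to_multimodal_transcript(events):
    """Render time-ordered multimodal events as plain text for an LLM prompt."""
    lines = []
    for e in sorted(events, key=lambda ev: ev["t"]):
        if e["modality"] == "speech":
            lines.append(f"[{e['t']:.1f}s] {e['who']} says: \"{e['value']}\"")
        else:
            lines.append(f"[{e['t']:.1f}s] ({e['modality']}) {e['who']} is {e['value']}")
    return "\n".join(lines)

prompt = ("Rate the wearer's engagement in this conversation from 1 to 5:\n\n"
          + to_multimodal_transcript(events))
```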

📄 CREAM, by JPMorgan, evaluates quality of dialogue summarization without transcripts

CREAM is a new framework designed to evaluate dialogue and meeting summaries without relying on reference summaries. It combines chain-of-thought reasoning with key-fact alignment to assess how concise and complete a summary is. Using an Elo rating system, CREAM compares the performance of different models or prompt setups, offering a more accurate and scalable way to rank summaries, especially for complex tasks like long-context or dialogue-based summarization.

The CREAM framework evaluates two summaries without needing a reference or the original document. It uses a two-step process: first, the model extracts concise, non-redundant key facts from a combined version of both summaries. It then checks each fact against each summary, determining whether the fact is accurately reflected and identifying supporting sentences. The results are used to score the completeness and conciseness of each summary.
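
A simplified sketch of that two-step loop, with `llm` as a stand-in for any text-completion call and prompts that only approximate CREAM's actual chain-of-thought prompting:

```python
# Simplified sketch of CREAM-style reference-free comparison of two summaries.
# `llm` stands in for any chat-completion call; prompts are illustrative.

def extract_key_facts(llm, summary_a, summary_b):
    """Step 1: pull concise, non-redundant key facts from both summaries combined."""
    prompt = ("List the distinct, non-redundant key facts stated in these two "
              f"summaries, one per line:\n\nSummary A:\n{summary_a}\n\n"
              f"Summary B:\n{summary_b}")
    return [f for f in llm(prompt).splitlines() if f.strip()]

def fact_coverage(llm, facts, summary):
    """Step 2: check which key facts the summary accurately reflects."""
    covered = 0
    for fact in facts:
        verdict = llm("Does the summary accurately state the fact below? "
                      f"Answer yes or no.\nFact: {fact}\nSummary:\n{summary}")
        covered += verdict.strip().lower().startswith("yes")
    return covered / max(len(facts), 1)  # completeness proxy in [0, 1]

def compare(llm, summary_a, summary_b):
    facts = extract_key_facts(llm, summary_a, summary_b)
    return fact_coverage(llm, facts, summary_a), fact_coverage(llm, facts, summary_b)
```

Wins and losses from pairwise comparisons like this can then drive a standard Elo update to rank models or prompt configurations.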

🧮 Chain-of-thought is mainly useful for math and symbolic reasoning

Chain of thought mostly improves math and algorithmic tasks, but falls short on Q&A

Chain-of-thought (CoT) prompting is used to enhance reasoning in large language models (LLMs), but its effectiveness varies by task. To investigate this, a meta-analysis of over 100 papers was conducted, along with evaluations on 20 datasets across 14 models. The findings reveal that CoT significantly improves performance on tasks involving math and logic but offers minimal benefits on other types of tasks.

Generating answers directly without CoT achieves nearly the same accuracy as CoT, except when questions involve symbolic reasoning, indicated by the presence of equals signs.

These results suggest that CoT should be applied selectively, optimizing performance while reducing inference costs.
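
Since the paper ties CoT's benefit to symbolic questions signaled by an equals sign, a simple prompt router could exploit that finding directly. The equals-sign heuristic below follows the paper's observation; everything else is an illustrative assumption:

```python
import re

# Route a question to CoT prompting only when it looks like math or symbolic
# reasoning; answer directly otherwise. The equals-sign check follows the
# paper's finding; the extra arithmetic pattern is an illustrative addition.
MATH_PATTERN = re.compile(r"=|\d+\s*[+\-*/^]\s*\d+")

def build_prompt(question: str) -> str:
    if MATH_PATTERN.search(question):
        # Symbolic/math question: spend tokens on step-by-step reasoning.
        return f"{question}\n\nLet's think step by step."
    # Everything else: direct answering is nearly as accurate and cheaper.
    return f"{question}\n\nAnswer concisely."

print(build_prompt("If 3x + 5 = 20, what is x?"))           # CoT path
print(build_prompt("Who wrote The Master and Margarita?"))  # direct path
```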

🤖 SFR-RAG: A small LLM that outperforms larger models at RAG tasks

SFR-RAG outperforms Command R+ and GPT-4o at RAG tasks

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external context to improve factual accuracy and relevance. To do this well, RAG models need to accurately understand context, minimize hallucinations, manage irrelevant or low-quality information, handle complex reasoning, and provide reliable citations.
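
As a rough illustration of what context-grounded generation with citations demands, here is a minimal prompt-construction sketch; the template is my own, not SFR-RAG's training format:

```python
# Minimal sketch of a context-grounded RAG prompt with citation instructions.
# The template is illustrative; it is not SFR-RAG's actual training format.

def build_rag_prompt(question: str, passages: list[str]) -> str:
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return ("Answer using ONLY the passages below. Cite passage numbers "
            "like [1]. If the passages do not contain the answer, say so "
            "instead of guessing.\n\n"
            f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:")

prompt = build_rag_prompt(
    "When was the Eiffel Tower completed?",
    ["The Eiffel Tower was completed in 1889 for the World's Fair.",
     "The Louvre is the world's most-visited museum."],
)
```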

This paper introduces SFR-RAG, a small LLM designed for context-grounded generation and minimizing hallucinations. It also presents ContextualBench, an evaluation framework that standardizes multiple popular RAG benchmarks, such as HotpotQA and TriviaQA, for consistent model evaluation. Experiments show that the SFR-RAG-9B model outperforms larger models like Command-R+ (104B) and GPT-4o, achieving top results in 3 out of 7 benchmarks with significantly fewer parameters. The model is robust to changes in contextual information and performs well even when relevant context is removed.

🤯 Today I Learned

Every issue, we highlight new AI concepts and terminology to help educate our readers. This issue we learned about:

Chain of Verification

Chain of Verification (CoV) is a method used in large language models (LLMs) to improve the accuracy and reliability of generated responses by breaking down complex tasks into smaller, verifiable steps. Unlike Chain-of-Thought (CoT) prompting, which focuses on reasoning through intermediate steps to arrive at an answer, CoV emphasizes validating each step of the reasoning process to ensure the final output is correct.
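
A rough sketch of the pattern, with `llm` as a stand-in for any text-completion call and prompts that are illustrative only:

```python
# Sketch of Chain of Verification: draft, verify each claim, then revise.
# `llm` stands in for any text-completion call; prompts are illustrative.

def chain_of_verification(llm, question: str) -> str:
    draft = llm(f"Answer the question: {question}")

    # Plan short verification questions that probe each factual claim.
    checks = llm("List short questions that would verify each factual claim "
                 f"in this answer, one per line:\n{draft}").splitlines()

    # Answer each verification question independently of the draft, so the
    # model cannot simply repeat its own mistakes.
    findings = [f"Q: {q}\nA: {llm(q)}" for q in checks if q.strip()]

    # Produce a final answer that is consistent with the verified findings.
    return llm(f"Question: {question}\nDraft answer: {draft}\n"
               "Verification results:\n" + "\n".join(findings) +
               "\nRewrite the answer so it agrees with the verification results.")
```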

Decoder-only Models

A decoder-only model is a type of neural network architecture used in large language models (LLMs) that focuses solely on generating or predicting the next sequence of data based on input. Unlike encoder-decoder models, which process input data (encoding) and then generate output (decoding), decoder-only models work by directly predicting the next token in a sequence.

Decoder-only models are popular for their simplicity and effectiveness in tasks where generating coherent and contextually appropriate sequences is crucial.
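
In code, the defining behavior is a simple autoregressive loop: each new token is predicted from everything to its left, appended, and fed back in. A minimal sketch, assuming `model` returns next-token logits with causal masking applied internally:

```python
import torch

# Minimal autoregressive loop for a decoder-only model: each new token is
# predicted from everything to its left, then appended and fed back in.
# `model` is assumed to map (1, seq_len) token ids to (1, seq_len, vocab) logits.

@torch.no_grad()
def generate(model, token_ids: list[int], max_new_tokens: int = 20) -> list[int]:
    for _ in range(max_new_tokens):
        x = torch.tensor([token_ids])            # (1, seq_len)
        logits = model(x)                        # causal mask applied inside
        next_id = int(logits[0, -1].argmax())    # greedy pick of next token
        token_ids.append(next_id)
    return token_ids
```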

Parameter Efficient Fine-Tuning

Parameter Efficient Fine-Tuning (PEFT) is a method that updates only a small part of a large AI model to adapt it to specific tasks, rather than fine-tuning the entire model. This approach saves time, reduces costs, and requires less computing power. PEFT keeps most of the model’s original knowledge intact while making it easier to customize for new applications. It’s a practical way to fine-tune big AI models without the heavy resource demands.
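
One widely used PEFT technique is LoRA (low-rank adaptation). The sketch below freezes a pretrained linear layer and trains only a small low-rank update on top of it; the rank, scaling, and initialization are illustrative defaults, not prescriptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a small trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze the pretrained weights
        # Only these two small matrices are trained: rank * (in + out) params.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # start at no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Original output plus the scaled low-rank correction (x @ A.T) @ B.T.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Wrapping a layer: only the two small LoRA matrices are trainable here.
layer = LoRALinear(nn.Linear(4096, 4096))
```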