Issue #5: LLMs can finally understand spreadsheets

Plus video-to-audio technology, new Nemo model from Mistral, closing the performance gap between open source and proprietary models


Hello readers! In this issue, we cover:

  • Google DeepMind is working on video-to-audio technology to automatically generate music, dialogue, or sound effects for a given video

  • Microsoft helps LLMs understand spreadsheets

  • Apple researchers improve efficiency of LLMs and accelerate time-to-first-token

  • Mistral and Nvidia announce Nemo, a small multilingual model with a 128K context window

  • GPT-4 can classify the political bias of news sources

🎥 Google DeepMind is Working On Video-to-Audio (V2A) Technology

DeepMind's new video-to-audio (V2A) technology automatically generates synchronized soundtracks for silent videos, without users having to create or source the audio themselves. It produces not only music but also realistic dialogue and sound effects. Using a diffusion-based approach, V2A iteratively refines audio to match the visual input, achieving high-quality, synchronized output.

Read the post on Google’s blog and listen to the stunning audio samples
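DeepMind hasn't released code, but the core loop of a diffusion sampler is easy to sketch. Below is a minimal, hypothetical illustration: `denoise_step` stands in for the trained model and `video_features` for the visual conditioning; this is a conceptual sketch, not DeepMind's implementation.

```python
import numpy as np

def generate_audio(video_features, denoise_step, steps=50, num_samples=16000):
    """Conceptual diffusion sampler: start from noise, refine iteratively."""
    audio = np.random.randn(num_samples)  # pure Gaussian noise
    for t in reversed(range(steps)):
        # each step removes a little noise, guided by the video conditioning
        audio = denoise_step(audio, t, video_features)  # hypothetical model call
    return audio
```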

💱 Microsoft releases SpreadSheetLLM to encode spreadsheets for large language models

Spreadsheets are challenging for LLMs because of their two-dimensional grid structure, flexible layouts, and numerous formatting options; their expansive grids often exceed the token limits of popular LLMs. SpreadsheetLLM, developed by Microsoft, aims to enhance LLMs' understanding of and reasoning over spreadsheets such as Excel or Google Sheets.

The team also proposes SheetCompressor, the encoding framework behind SpreadsheetLLM, which improves the storage efficiency of tabular data by deduplicating repeated values, cutting token usage by 96%. arxiv
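Deduplication is one source of those savings: instead of serializing every cell, identical values can be stored once alongside the addresses where they occur. A rough sketch of the idea (not Microsoft's code):

```python
from collections import defaultdict

def inverted_index(cells):
    """Map each distinct cell value to the addresses that contain it."""
    index = defaultdict(list)
    for address, value in cells.items():
        index[value].append(address)
    return dict(index)

# a column where "2024" repeats thousands of times costs one entry, not thousands
print(inverted_index({"A1": "Year", "A2": "2024", "A3": "2024", "A4": "2024"}))
# {'Year': ['A1'], '2024': ['A2', 'A3', 'A4']}
```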

🍎 Apple researchers introduce LazyLLM to speed up time-to-first-token

Apple researchers introduce a technique to speed up LLMs like Llama3 and GPT4. Normally, these models must process every token of a long prompt before producing any output, which can be slow. LazyLLM selectively defers computation for less important tokens and focuses on the crucial ones. This greatly improves efficiency, especially on tasks involving long documents, and the technique integrates with existing transformer-based LLMs. arxiv

LLM vs LazyLLM performance
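Conceptually, LazyLLM prunes prompt tokens whose attention scores suggest they matter little for the next prediction. A simplified sketch of that selection step (the paper's actual method works layer-wise and can revive pruned tokens later):

```python
import numpy as np

def select_tokens(hidden_states, attn_to_last, keep_frac=0.5):
    """Keep the tokens that receive the most attention from the final position."""
    k = max(1, int(len(attn_to_last) * keep_frac))
    keep = np.sort(np.argsort(attn_to_last)[-k:])  # preserve original order
    return hidden_states[keep], keep

hidden = np.random.randn(8, 16)  # 8 tokens, 16-dim hidden states
scores = np.random.rand(8)       # attention from the last token to each prior token
pruned, kept_idx = select_tokens(hidden, scores)
print(kept_idx)  # indices of tokens passed on to the next layer
```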

🌎 Mistral and Nvidia release new model designed for global, multilingual applications

Nemo, Mistral and Nvidia’s new 12B-parameter model with a 128K-token context window, outperforms recent open-source models such as Gemma 2 and Llama 3 on some benchmarks.

The model is designed for global, multilingual applications. It’s trained on function calling, has a large context window, and is particularly strong in English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.
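If the weights follow Mistral's usual release pattern, the instruct checkpoint should be loadable with Hugging Face transformers roughly like this (checkpoint id assumed from Mistral's announcement; verify it on their Hugging Face page, and note a 12B model needs a sizable GPU):

```python
from transformers import pipeline

# assumed checkpoint id; confirm against Mistral's Hugging Face organization
pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-Nemo-Instruct-2407",
    device_map="auto",
)
out = pipe("Translate to French: The weather is lovely today.", max_new_tokens=40)
print(out[0]["generated_text"])
```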

👨🏼‍💻 Nvidia makes open-source LLMs as performant as proprietary LLMs for RAG and long-context understanding

Nvidia introduces ChatQA 2, a model based on Llama3, aimed at matching the performance of proprietary models like GPT-4-Turbo in long-context understanding and retrieval-augmented generation (RAG). Researchers expanded the Llama3-70B base model's context window from 8K to 128K tokens and used a three-stage instruction-tuning process to boost instruction-following, RAG performance, and long-context understanding.

Results show that the Llama3-ChatQA-2-70B model performs on par with GPT-4-Turbo-2024-04-09 on many long-context tasks and even outperforms it on the RAG benchmark. The study also finds that a long-context retriever can mitigate context fragmentation in RAG, improving long-context task performance. arxiv

Nvidia’s new model, ChatQA 2, performs well on the needle-in-a-haystack test.
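The fragmentation point is easy to picture: small chunks slice facts across boundaries, while a long-context retriever can return fewer, larger pieces. A toy sketch, with `embed` as a placeholder for any real embedding model:

```python
import numpy as np

def embed(text):
    # placeholder embedding; swap in a real model (e.g. a sentence encoder)
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(384)

def retrieve(query, document, chunk_size=1200, top_k=3):
    """Split into large chunks, score by cosine similarity, keep document order."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    q = embed(query)
    def score(chunk):
        v = embed(chunk)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    best = sorted(range(len(chunks)), key=lambda i: score(chunks[i]))[-top_k:]
    return "\n\n".join(chunks[i] for i in sorted(best))
```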

🏛️ GPT-4 can label the political bias of news websites

Researchers at the University of Cambridge investigate GPT-4's ability to classify the political bias of news sources from their URLs alone. They compare GPT-4's classifications with those from Media Bias/Fact Check (MBFC) and find a high correlation. GPT-4 abstained from classifying many sources, especially less popular ones, and showed a slight leftward skew. The study suggests GPT-4 could be a useful bias-classification tool, complementing human judgment to mitigate biases. arxiv
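Reproducing the setup is straightforward with the OpenAI API; the prompt below is an illustrative stand-in, not the paper's exact wording, and includes the abstain option the authors observed:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_bias(url: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Classify the political bias of the news source at {url} as one of: "
                "left, center-left, center, center-right, right. "
                "If you are unsure, reply 'abstain'. Answer with the label only."
            ),
        }],
    )
    return response.choices[0].message.content.strip()

print(classify_bias("https://www.example-news.com"))
```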

🤯 Today I Learned

Every issue, we highlight new AI concepts and terminology to help educate our readers. This issue we learned about:

Needle In A Haystack Test

The "needle in a haystack" test in the context of Large Language Models (LLMs) refers to a type of evaluation designed to assess an LLM's ability to retrieve specific, rare, or obscure information from its vast knowledge base.

This test typically involves asking the model questions about very specific facts or details that are not commonly known or frequently discussed. The idea is to see if the model can accurately recall and present this information, much like finding a needle in a haystack.
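A minimal version of the test is just string surgery plus a query; `ask_model` is a placeholder for whatever LLM you are evaluating:

```python
FILLER = "The sky was clear and the afternoon passed quietly. " * 2000
NEEDLE = "The secret passphrase is 'blue-giraffe-42'."

def build_haystack(depth: float = 0.5) -> str:
    """Insert the needle at a relative depth within the filler text."""
    pos = int(len(FILLER) * depth)
    return FILLER[:pos] + NEEDLE + " " + FILLER[pos:]

prompt = build_haystack(0.25) + "\n\nWhat is the secret passphrase?"
# answer = ask_model(prompt)            # placeholder for the model under test
# assert "blue-giraffe-42" in answer    # pass/fail, repeated across depths/lengths
```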

Time-to-first-token (TTFT)

Time-to-first-token (TTFT), in the context of Large Language Models (LLMs), refers to the time it takes for the model to generate its first output token (a word or subword) after receiving an input prompt.

Key aspects of TTFT include:

  1. Latency measure: It's a critical metric for assessing the responsiveness of an LLM.

  2. User experience: TTFT significantly impacts how fast a user perceives the model's response, especially in interactive applications.

  3. Technical factors: It can be influenced by model size, hardware, optimization techniques, and the complexity of the input prompt.

  4. Trade-offs: There's often a balance between TTFT and overall response quality or length.

  5. Streaming: Many LLM applications use token streaming to improve perceived responsiveness, where tokens are displayed as they're generated rather than waiting for the full response.

TTFT is particularly important in real-time or interactive applications of LLMs, such as chatbots or AI assistants, where quick initial responses can greatly enhance user experience.
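Measuring TTFT is simple when a provider supports streaming; here is a sketch using the OpenAI Python client (any streaming-capable chat model works):

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4",  # any streaming-capable chat model
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # first content-bearing chunk marks the time-to-first-token
        print(f"TTFT: {time.perf_counter() - start:.3f}s")
        break
```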

Diffusion Model

Diffusion models are a type of generative AI primarily used for image creation and manipulation. They work by learning to reverse a process of gradually adding noise to images. To generate new images, these models start with random noise and progressively denoise it to form a coherent picture. Diffusion models are known for producing high-quality, diverse images and can be used for tasks like text-to-image generation, image editing, and style transfer. They have gained popularity in recent years, with examples including Stable Diffusion and DALL-E 2. The same idea extends beyond images: DeepMind's V2A, covered above, iteratively denoises audio conditioned on video.
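The "gradually adding noise" part has a convenient closed form that training relies on; here is a minimal NumPy sketch of the forward step (generation runs this process in reverse, as in the V2A sketch above):

```python
import numpy as np

steps = 1000
betas = np.linspace(1e-4, 0.02, steps)  # noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal retention

def add_noise(x0, t):
    """Forward process: blend a clean sample with Gaussian noise at step t."""
    eps = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # the model is trained to predict eps from (xt, t)
```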

Instruction Tuning

Instruction tuning is a fine-tuning technique for large language models that aims to improve their ability to follow specific instructions or prompts. It involves training the model on a dataset of instruction-output pairs, teaching it to generate appropriate responses to a wide variety of tasks and commands. This process enhances the model's versatility and ability to understand and execute diverse user instructions, making it more suitable for general-purpose use in applications like chatbots, virtual assistants, and task-specific AI tools.
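In practice, the instruction-output pairs are rendered into a prompt template before fine-tuning; templates vary by model, but a common shape looks like this:

```python
def format_example(instruction: str, output: str) -> str:
    # illustrative template; real models each define their own chat format
    return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"

pairs = [
    ("Summarize in one sentence: The cat sat on the mat all day.",
     "A cat spent the whole day sitting on a mat."),
]
train_texts = [format_example(i, o) for i, o in pairs]
print(train_texts[0])
```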