[#24] Meta releases Llama 3.2, models for mobile & edge devices

Plus detecting cheating during exams with AI, AI2 releases open-source language models, mitigating hallucinations in vision language models, and zero-shot detection of AI images

New: I’m writing a new newsletter dedicated to AI in health and medicine. There is so much new research, it deserves a dedicated newsletter. Subscribe here.

Hello readers, in this issue we cover

  • Meta releases Llama 3.2, which includes lightweight models for mobile and edge devices

  • Duolingo researchers can detect cheating during online exams through AI-powered gaze-tracking

  • Allen Institute for AI (AI2) releases new open-source vision language models

  • Mitigating hallucinations in vision language models

  • Zero-shot detection of AI-generated images

🦙 Meta releases Llama 3.2 and Llama Stack, with models for mobile and edge devices

Meta released Llama 3.2, which includes small and medium-sized vision models as well as lightweight text-only models that fit on edge and mobile devices. These models support context lengths of 128K tokens and are intended for on-device use cases such as summarization, instruction following, and rewriting.
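To make the on-device angle concrete, here is a minimal sketch of local summarization with one of the lightweight checkpoints, assuming the meta-llama/Llama-3.2-1B-Instruct model on Hugging Face (gated behind Meta's license) and a recent version of the transformers library; treat it as an illustration rather than Meta's reference code.

```python
# Minimal sketch: on-device style summarization with a lightweight Llama 3.2
# checkpoint. The model ID is an assumption; access requires accepting
# Meta's license on Hugging Face.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",  # assumed 1B instruct checkpoint
    device_map="auto",                          # falls back to CPU on small devices
)

messages = [
    {"role": "system", "content": "Summarize the user's note in one sentence."},
    {"role": "user", "content": "Met with the design team; we agreed to ship the new "
                                "onboarding flow next sprint and revisit pricing later."},
]

result = generator(messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])  # the assistant's summary
```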

Meta also announced Llama Stack, a standardized interface for canonical toolchain components that makes it easier to customize Llama models and build agentic applications.

👀 Duolingo detects cheating during online exams

Each frame of the test session is shown as a point in the gaze plot, and the position of each point represents the gaze direction in that frame.

Researchers at Duolingo developed an AI-assisted gaze-tracking system to detect when a test-taker is looking away from the screen, behavior that can indicate consulting external resources. Their system allows proctors to identify potentially suspicious frames in a recording. For example, if a test-taker frequently looks away from the screen, a proctor can play back the video to that point in time and review the behavior.
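The underlying gaze model is not described in detail here, but the flagging step can be sketched simply: given a per-frame gaze estimate, find stretches of the recording where the gaze falls outside the screen and surface their timestamps to a proctor. The data layout and thresholds below are illustrative assumptions, not Duolingo's implementation.

```python
# Illustrative sketch (not Duolingo's system): flag stretches of a recording
# where the estimated gaze point falls outside the screen bounds.
from dataclasses import dataclass
from typing import Iterable, List, Tuple

@dataclass
class Frame:
    timestamp: float  # seconds into the recording
    gaze_x: float     # normalized gaze coordinates; [0, 1] means on screen
    gaze_y: float

def flag_suspicious_spans(frames: Iterable[Frame],
                          margin: float = 0.05,
                          min_run: int = 15) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps of runs of off-screen gaze.

    margin  -- tolerance outside the [0, 1] screen bounds before flagging
    min_run -- minimum consecutive off-screen frames (~0.5 s at 30 fps)
    """
    spans, current = [], []
    for f in frames:
        off_screen = not (-margin <= f.gaze_x <= 1 + margin
                          and -margin <= f.gaze_y <= 1 + margin)
        if off_screen:
            current.append(f)
            continue
        if len(current) >= min_run:
            spans.append((current[0].timestamp, current[-1].timestamp))
        current = []
    if len(current) >= min_run:
        spans.append((current[0].timestamp, current[-1].timestamp))
    return spans
```

A proctor could then jump straight to each returned span in the session video instead of watching the whole recording.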

🏔️ New open-source vision models from Allen Institute for AI

The Molmo architecture combines a language model with a vision encoder.

Researchers at the Allen Institute for AI have developed a new family of open-source vision language models called Molmo, along with an accompanying training dataset called PixMo. Unlike many top models that are proprietary or rely on data from closed systems, Molmo is built from scratch using:

  1. A new dataset of detailed image captions collected from human annotators describing images aloud

  2. A mix of training data including Q&A and image pointing tasks

  3. Careful design of the model architecture and training process

The best Molmo model, at 72 billion parameters, performs well against both open and proprietary models on a range of benchmarks. The team plans to release all of their work, including model code, training data, and instructions, for others to use and build upon.

This research aims to advance open-source AI capabilities in understanding and working with both images and text.
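The checkpoints are hosted on Hugging Face, so querying one looks roughly like the sketch below. It assumes the allenai/Molmo-7B-D-0924 repository and follows my reading of its model card (the processor.process and generate_from_batch calls come from the repo's remote code), so details may differ from the official example.

```python
# Rough sketch of querying a Molmo checkpoint; API details (processor.process,
# generate_from_batch) come from the repo's custom code and are assumptions here.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/Molmo-7B-D-0924"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True,
                                          torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

inputs = processor.process(images=[Image.open("example.jpg")],
                           text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}  # batch of 1

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
new_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
```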

👴🏼 Mitigating Hallucinations in Vision Language Models

This paper introduces Dentist, a new approach to reducing hallucinations in Large Vision-Language Models (LVLMs). Some key points include:

  1. Hallucination is a common issue in LVLMs, causing inconsistencies between generated text and image content.

  2. Current solutions don't effectively address various query types and their associated hallucinations.

  3. Dentist's approach:

    • Classifies queries (e.g., into perception or reasoning categories)

    • Applies different hallucination reduction methods based on the query type

  4. The process is compared to a dentist examining teeth before treatment.

  5. In experiments, Dentist improved accuracy on perception-based visual question answering tasks:

    • 13.44% improvement over InstructBLIP

    • 10.2% improvement over LLaVA

    • 15.8% improvement over VisualGLM

The goal is to provide a more effective and flexible way to reduce hallucinations in LVLMs across different types of queries.
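At a high level, the classify-then-treat loop the paper describes can be sketched as below; the classifier heuristic, verifier, and prompts are placeholders for illustration, not the authors' implementation.

```python
# High-level sketch of a classify-then-treat hallucination mitigation loop,
# loosely modeled on Dentist's description. Every component here is a placeholder.
from typing import Callable

def classify_query(query: str) -> str:
    """Placeholder router: send queries to a 'perception' or 'reasoning' branch."""
    reasoning_cues = ("why", "explain", "what would happen", "how many steps")
    return "reasoning" if any(cue in query.lower() for cue in reasoning_cues) else "perception"

def answer_with_mitigation(image, query: str,
                           lvlm: Callable, verifier: Callable) -> str:
    answer = lvlm(image, query)
    if classify_query(query) == "perception":
        # Perception branch: cross-check the claim against the image,
        # e.g. with a secondary visual question answering pass.
        if not verifier(image, f"Is the following consistent with the image: {answer}?"):
            answer = lvlm(image, query + " Answer only from clearly visible evidence.")
    else:
        # Reasoning branch: regenerate with step-by-step prompting grounded
        # in what is actually visible.
        answer = lvlm(image, "Reason step by step, citing visible evidence. " + query)
    return answer
```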

🏀 Zero-shot detection of AI-generated images

This paper introduces ZED (Zero-shot Entropy-based Detector), a novel method for detecting AI-generated images. As AI architectures rapidly advance, traditional detection methods struggle to keep up with the constant stream of new models and improved image quality.

ZED addresses this challenge by taking a fundamentally different approach. Unlike conventional detectors, ZED doesn't require training on AI-generated images or knowledge of specific generative architectures. Instead, it measures how "surprising" an image is compared to a model of real images, using a lossless image encoder to estimate pixel probabilities.

The detector employs a multi-resolution architecture for efficiency and relies on a single discriminative feature. This approach allows ZED to be independent of generator architectures and eliminates the need for synthetic training data. In tests across a wide variety of generative models, ZED achieved state-of-the-art performance, with an average improvement of over 3% in accuracy compared to previous methods. The creators have made the code publicly available on GitHub.
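In spirit, the scoring step can be sketched as a surprisal (negative log-likelihood) computation under a model of real images; the pixel_logprobs function and the decision rule below are stand-ins for illustration, not the authors' multi-resolution feature.

```python
# Conceptual sketch of an entropy/surprisal-based detector in the spirit of ZED.
# `pixel_logprobs(image)` stands in for a lossless-compression model of real
# images returning per-pixel log-probabilities; it is NOT the authors' model.
import numpy as np

def surprisal_bits_per_pixel(image: np.ndarray, pixel_logprobs) -> float:
    """Average bits per pixel the real-image model needs to encode `image`."""
    logp = pixel_logprobs(image)            # same shape as image, natural log
    return float(-logp.mean() / np.log(2))  # convert nats to bits

def looks_synthetic(image: np.ndarray, pixel_logprobs,
                    expected_bits: float, tolerance: float) -> bool:
    # Illustrative decision rule: flag images whose coding cost deviates
    # markedly from what is typical for real images. The threshold would be
    # calibrated on real images only, which is what makes the method zero-shot.
    return abs(surprisal_bits_per_pixel(image, pixel_logprobs) - expected_bits) > tolerance
```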

🤯 Today I Learned

Every issue, we highlight new AI concepts and terminology to help educate our readers. In this issue, we learned about:

Zero-shot

Zero-shot learning is a machine learning approach where a model can recognize or classify things it hasn't been explicitly trained on. It leverages knowledge from seen classes to generalize to unseen ones, often using semantic relationships or attributes. This allows AI systems to handle new tasks or categories without additional training.

CLIP

CLIP (Contrastive Language-Image Pre-training) is a neural network model developed by OpenAI. It's trained on a large dataset of image-text pairs to understand and connect visual and textual information. CLIP can perform various tasks like image classification and text-based image retrieval without specific training for each task, making it highly versatile for visual and language understanding.
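As a concrete illustration of both terms, the snippet below performs zero-shot image classification with CLIP through the transformers library, using the openai/clip-vit-base-patch32 checkpoint; the labels and image path are arbitrary examples.

```python
# Zero-shot image classification with CLIP: no task-specific training, just
# comparing one image against text descriptions of candidate labels.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
image = Image.open("example.jpg")  # any local image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-text similarity -> probabilities

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.1%}")
```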