[#24] Meta releases Llama 3.2, models for mobile & edge devices
Plus detecting cheating during exams with AI, AI2's new open-source vision language models, mitigating hallucinations in vision language models, and zero-shot detection of AI-generated images
New: I’m writing a new newsletter dedicated to AI in health and medicine. There is so much new research that it deserves its own newsletter. Subscribe here.
Hello readers, in this issue we cover:
Meta releases Llama 3.2, which includes lightweight models for mobile and edge devices
Duolingo researchers can detect cheating during online exams through AI-powered gaze tracking
Allen Institute for AI (AI2) releases new open-source vision language models
Mitigating hallucinations in vision language models
Zero-shot detection of AI-generated images
🦙 Meta releases Llama 3.2 and Llama Stack, with models for mobile and edge devices
Meta released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight text-only models (1B and 3B) that fit onto edge and mobile devices. These models support a context length of 128K tokens and are intended for on-device use cases such as summarization, instruction following, and rewriting tasks that run locally.
Meta also announced Llama Stack, a standardized interface for canonical toolchain components used to customize Llama models and build agentic applications.
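To get a feel for the lightweight models, here is a minimal sketch of running the 1B instruct variant locally with Hugging Face transformers. The model ID is the one published on Hugging Face, access is gated behind Meta's license, and the exact pipeline call and output structure may differ across transformers versions.

```python
# Minimal sketch (not from Meta's announcement): running the lightweight
# Llama 3.2 1B Instruct model locally with Hugging Face transformers.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",  # needs accelerate; runs on CPU if no GPU is available
)

messages = [{
    "role": "user",
    "content": "Summarize in one sentence: Llama 3.2 adds lightweight "
               "1B and 3B text-only models intended for on-device use.",
}]
result = generator(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])  # assistant reply
```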
👀 Duolingo detects cheating during online exams
Each frame of the test session is shown as a point in the gaze plot; the position of each point represents the gaze direction in that frame.
Researchers at Duolingo developed an AI-assisted gaze tracking system to detect when a user is looking away from the screen, since such behavior can indicate consulting external resources. Their system allows proctors to identify potentially suspicious frames in a recording. For example, if a test-taker looks away from the screen frequently, a proctor can play back the video at those points in time and review the behavior.
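Duolingo's system itself is not public, but the flagging step might conceptually look like the sketch below: per-frame gaze angles (from whatever gaze-estimation model is used) are turned into off-screen time spans for a proctor to review. The thresholds and field names are invented for illustration.

```python
# Hypothetical sketch of the flagging step only: given per-frame gaze estimates
# (yaw/pitch in degrees), mark frames whose gaze falls outside a plausible
# "on-screen" cone and surface the suspicious time ranges for a proctor.
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float  # seconds into the recording
    yaw: float        # horizontal gaze angle, degrees (0 = straight at screen)
    pitch: float      # vertical gaze angle, degrees

def off_screen_spans(frames, max_angle=25.0, min_duration=1.5):
    """Return (start, end) time spans where gaze stays off-screen long enough to flag."""
    spans, start = [], None
    for f in frames:
        off = abs(f.yaw) > max_angle or abs(f.pitch) > max_angle
        if off and start is None:
            start = f.timestamp
        elif not off and start is not None:
            if f.timestamp - start >= min_duration:
                spans.append((start, f.timestamp))
            start = None
    if start is not None and frames and frames[-1].timestamp - start >= min_duration:
        spans.append((start, frames[-1].timestamp))
    return spans

# Example: a proctor would jump to these spans in the session recording.
frames = [Frame(t * 0.5, yaw=40.0 if 10 <= t <= 16 else 2.0, pitch=0.0) for t in range(40)]
print(off_screen_spans(frames))  # -> [(5.0, 8.5)]
```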
🏔️ New open-source vision models from Allen Institute for AI
The Molmo architecture combines a language model with a vision encoder.
Researchers have developed a new set of open-source vision language models called Molmo, along with a training dataset called PixMo. Unlike many top models that are private or rely on data from closed systems, Molmo is built from scratch using:
A new dataset of detailed image captions, collected by having annotators describe images aloud
A mix of training data including Q&A and image pointing tasks
Careful design of the model architecture and training process
The best Molmo model, with 72 billion parameters, compares favorably with both open and proprietary models across a range of benchmarks. The team plans to release all of their work - including model code, training data, and instructions - for others to use and build upon.
This research aims to advance open-source AI capabilities in understanding and working with both images and text.
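As a rough illustration of the pattern in the architecture figure above (not Molmo's released code): a vision encoder produces patch features, a small connector projects them into the language model's embedding space, and the language model attends over the combined image-plus-text sequence. All module choices and sizes below are placeholder stand-ins.

```python
# Schematic sketch of the vision-encoder + connector + language-model pattern.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vision_dim=128, lm_dim=256, vocab_size=1000, n_layers=2):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stand-in for a ViT
        self.connector = nn.Sequential(                          # projects patches into LM space
            nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )
        self.text_embed = nn.Embedding(vocab_size, lm_dim)
        layer = nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=n_layers)  # stand-in for a decoder-only LM
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, patch_features, text_ids):
        img_tokens = self.connector(self.vision_encoder(patch_features))  # (B, P, lm_dim)
        txt_tokens = self.text_embed(text_ids)                            # (B, T, lm_dim)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)                  # image tokens prefix the text
        return self.lm_head(self.lm(seq))                                 # per-position vocabulary logits

# Smoke test with random inputs: one image of 16 patches, 8 text tokens.
model = ToyVLM()
logits = model(torch.randn(1, 16, 128), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000])
```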
👴🏼 Mitigating Hallucinations in Vision Language Models
This paper introduces Dentist, a new approach to reducing hallucinations in Large Vision-Language Models (LVLMs). Some key points:
Hallucination is a common issue in LVLMs, causing inconsistencies between generated text and image content.
Current solutions don't effectively address various query types and their associated hallucinations.
Dentist's approach:
Classifies queries (e.g., into perception or reasoning categories)
Applies different hallucination reduction methods based on the query type
The process is compared to a dentist examining teeth before treatment.
In experiments, Dentist improved accuracy on perception-based visual question answering tasks:
13.44% improvement over InstructBLIP
10.2% improvement over LLaVA
15.8% improvement over VisualGLM
The goal is to provide a more effective and flexible way to reduce hallucinations in LVLMs across different types of queries.
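As a rough sketch of the classify-then-verify routing described above (not the authors' implementation): the query is first categorized, a draft answer is produced, and a category-specific verification step is applied to it. The classifier heuristic, model calls, and verifiers below are all placeholders.

```python
# Hedged sketch of the routing idea: different query types get different
# hallucination-reduction treatments after a draft answer is produced.

def classify_query(question: str) -> str:
    """Toy classifier: queries about what is visible count as perception, else reasoning."""
    perception_cues = ("what color", "how many", "is there", "what is")
    return "perception" if question.lower().startswith(perception_cues) else "reasoning"

def mitigate(question, image, vlm_answer, verify_perception, verify_reasoning):
    """Route the query, get a draft answer, then run a category-specific check."""
    draft = vlm_answer(image, question)
    if classify_query(question) == "perception":
        return verify_perception(image, question, draft)
    return verify_reasoning(image, question, draft)

# Toy usage with stubbed model calls:
answer = mitigate(
    "How many dogs are in the picture?",
    image=None,
    vlm_answer=lambda img, q: "Three dogs.",
    verify_perception=lambda img, q, a: a,  # would re-query the VLM with targeted sub-questions
    verify_reasoning=lambda img, q, a: a,   # would cross-check reasoning steps against the image
)
print(answer)  # "Three dogs."
```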
🏀 Zero-shot detection of AI generated images
This paper introduces ZED (Zero-shot Entropy-based Detector), a novel method for detecting AI-generated images. As AI architectures rapidly advance, traditional detection methods struggle to keep up with the constant stream of new models and improved image quality.
ZED addresses this challenge by taking a fundamentally different approach. Unlike conventional detectors, ZED doesn't require training on AI-generated images or knowledge of specific generative architectures. Instead, it measures how "surprising" an image is compared to a model of real images, using a lossless image encoder to estimate pixel probabilities.
The detector employs a multi-resolution architecture for efficiency and relies on a single discriminative feature. This approach allows ZED to be independent of generator architectures and eliminates the need for synthetic training data. In tests across a wide variety of generative models, ZED achieved state-of-the-art performance, with an average improvement of over 3% in accuracy compared to previous methods. The creators have made the code publicly available on GitHub.
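Conceptually, the scoring step boils down to measuring the average coding cost of an image under a probability model of real images and flagging images that look "too easy" (or otherwise atypical) to code. The sketch below illustrates only that idea; the pixel-probability model, threshold, and decision direction are placeholders, not ZED's reference implementation.

```python
# Minimal sketch of a zero-shot "surprise" score for an image.
import numpy as np

def bits_per_pixel(image: np.ndarray, pixel_prob_model) -> float:
    """Average coding cost -log2 p(pixel) under a model trained on real images only."""
    probs = pixel_prob_model(image)        # per-pixel predicted probabilities
    probs = np.clip(probs, 1e-12, 1.0)
    return float(-np.log2(probs).mean())

def looks_generated(image, pixel_prob_model, threshold: float) -> bool:
    # Illustrative decision rule: ZED's actual feature is built from the gap
    # between the measured and the expected coding cost of the image.
    return bits_per_pixel(image, pixel_prob_model) < threshold

# Toy usage with a stub model that predicts a uniform distribution over 256 values:
stub_model = lambda img: np.full(img.shape, 1.0 / 256.0)
img = np.random.randint(0, 256, size=(64, 64, 3))
print(bits_per_pixel(img, stub_model))  # 8.0 bits/pixel for the uniform stub
```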
🤯 Today I Learned
Every issue, we highlight new AI concepts and terminology to help educate our readers. This issue we learned about:
Zero-shot
Zero-shot learning is a machine learning approach where a model can recognize or classify things it hasn't been explicitly trained on. It leverages knowledge from seen classes to generalize to unseen ones, often using semantic relationships or attributes. This allows AI systems to handle new tasks or categories without additional training.
CLIP
CLIP (Contrastive Language-Image Pre-training) is a neural network model developed by OpenAI. It's trained on a large dataset of image-text pairs to understand and connect visual and textual information. CLIP can perform various tasks like image classification and text-based image retrieval without specific training for each task, making it highly versatile for visual and language understanding.
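The two terms come together in CLIP's most common use: zero-shot image classification, where an image is simply compared against a set of candidate text prompts without training a classifier for those labels. A small example using the public OpenAI CLIP checkpoint on Hugging Face (the image URL and labels are arbitrary):

```python
# Zero-shot classification with CLIP: no training on these labels, just
# image-text similarity between the photo and each candidate prompt.
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats on a couch
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a llama"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```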