Issue #9: Meta researchers use prompts to "pick" items from images and video

Plus SAM2 on medical images, CharacterAI open sources their prompting framework, IKEA explores LLMs for product recommendations, and designing efficient inference clusters

Hello readers! In today’s issue, we cover:

  1. Meta announces their SAM2 foundation model

  2. Results when SAM2 gets applied to medical images

  3. CharacterAI open sources their prompt design framework

  4. How IKEA uses LLMs for product recommendations

  5. GPUs are expensive. There’s a new framework for managing efficient inference clusters

🎥 Meta Researchers Announce SAM2, a Foundation Model for Segmenting Videos

Using prompts to segment cats in an image

SAM (Segment Anything Model) is Meta's foundation model that can “cut out” any object in an image. The original SAM worked only on images; SAM2 extends it to both images and video, and it's also faster and more accurate than its predecessor.

With video, SAM2 can support many important applications in AR/VR, robotics, autonomous vehicles, and video editing.
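To make the “prompting” idea concrete, here is a minimal sketch of prompt-based segmentation using the original SAM's Python interface (the segment-anything package); SAM2 exposes a similar prompt-driven predictor that also handles video. The checkpoint path and input image below are placeholders.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (path is a placeholder) and wrap it in a predictor.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# image: an HxWx3 uint8 RGB array, e.g. loaded with PIL or OpenCV.
image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder image
predictor.set_image(image)

# The "prompt" is a single foreground click at pixel (x=320, y=240).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),   # 1 = foreground, 0 = background
    multimask_output=True,        # return several candidate masks
)
best_mask = masks[scores.argmax()]  # boolean HxW mask of the clicked object
```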

🩻 When SAM2 Is Applied to Medical Images

Medical images overlaid with annotation masks

Medical image segmentation is a critical task for multiple clinical applications such as disease diagnosis and clinical analysis.

Duke University researchers explore using the new SAM2 model to segment both 2D and 3D medical images, including CT, MRI, PET, X-ray, and ultrasound. The study evaluates the model's performance in two scenarios: multi-frame 3D segmentation and single-frame 2D segmentation. SAM2 performs similarly to its predecessor on 2D segmentation but shows variable performance on 3D segmentation.

✍️ CharacterAI Open Sources Their Prompt Design Framework

When CharacterAI struck a licensing deal with Google, they also released Prompt Poet, a tool that lets both developers and non-technical users efficiently design and manage their production prompts.

Using a mix of YAML and Jinja2, Prompt Poet allows for flexible, dynamic prompt creation, improving the efficiency and quality of interactions with AI models. It saves time otherwise spent on string-manipulation plumbing, letting teams focus on crafting the best prompts for their users.
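Prompt Poet's exact schema lives in its repo; the snippet below is only an illustrative sketch of the general pattern it describes, rendering a Jinja2 template and parsing the result as YAML into structured prompt parts. The field names and values here are made up for illustration.

```python
import yaml
from jinja2 import Template

# Illustrative template: YAML gives the prompt its structure,
# Jinja2 placeholders fill in the dynamic values at request time.
raw_template = """
- name: system_instructions
  role: system
  content: |
    Your name is {{ character_name }}. Be helpful and stay in character.
- name: user_query
  role: user
  content: |
    {{ username }}: {{ user_query }}
"""

rendered = Template(raw_template).render(
    character_name="Sage",
    username="alex",
    user_query="Recommend a book about space.",
)
parts = yaml.safe_load(rendered)

# parts is now a list of dicts, ready to be assembled into model messages.
messages = [{"role": p["role"], "content": p["content"].strip()} for p in parts]
```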

Read their blog post or see the code on GitHub.

🛋️ IKEA Fine-Tunes LLMs for Product Recommendations

IKEA search queries

Scientists at IKEA present a method to improve large language models (LLMs) for making product recommendations by teaching them about product details through synthetic search queries with product IDs. This helps the models understand product inventories better and respond more contextually to queries.
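The paper's exact data pipeline isn't reproduced here; the sketch below only illustrates the general idea of turning a product catalog into synthetic (query → product ID) training pairs for fine-tuning. The catalog entries, article numbers, and prompt format are all made up for illustration.

```python
import json

# Hypothetical catalog entries (fields and article numbers are made up).
catalog = [
    {"id": "704.893.95", "name": "MALM bed frame", "attributes": "oak veneer, queen, storage boxes"},
    {"id": "392.757.01", "name": "POÄNG armchair", "attributes": "birch, beige cushion"},
]

# Simple query templates; a real pipeline might use an LLM to paraphrase these.
templates = [
    "I'm looking for a {name} with {attributes}",
    "Do you have something like a {name}?",
]

def make_examples(product):
    """Turn one catalog entry into synthetic (query -> product ID) training pairs."""
    for t in templates:
        query = t.format(name=product["name"].lower(), attributes=product["attributes"])
        # Instruction-tuning style record: the model learns to answer with the article number.
        yield {"prompt": query, "completion": f'{product["name"]} (article {product["id"]})'}

with open("synthetic_queries.jsonl", "w", encoding="utf-8") as f:
    for product in catalog:
        for example in make_examples(product):
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
```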

However, there are some limitations. The system still produces hallucinations, and whenever new products are introduced, the LLM must be retrained. They suggest some enhancements that produce moderate performance improvements.

⚡️ Designing Inference Clusters for Performance and Efficiency

DynamoLLM architecture

Researchers at Microsoft Azure introduce DynamoLLM, a system that improves the energy efficiency of the LLM clusters used to serve queries. These clusters rely on power-hungry, expensive GPUs, so they can consume significant energy and produce high carbon emissions. DynamoLLM dynamically adjusts cluster configurations to balance energy use, performance, and cost, achieving a 53% reduction in energy use, a 38% reduction in carbon emissions, and a 61% reduction in customer costs while still meeting performance targets.
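DynamoLLM's actual scheduler is far more sophisticated, but the core idea of "pick the cheapest configuration that still meets the latency target" can be sketched in a few lines. Everything below, including the configuration options and numbers, is made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Config:
    """A hypothetical cluster configuration option (all values illustrative)."""
    name: str
    gpus: int
    gpu_freq_mhz: int
    p99_latency_ms: float   # estimated latency at the current request rate
    power_kw: float         # estimated cluster power draw

def pick_config(options: list[Config], latency_slo_ms: float) -> Config:
    """Choose the lowest-power configuration that still meets the latency SLO."""
    feasible = [c for c in options if c.p99_latency_ms <= latency_slo_ms]
    if not feasible:
        # No option meets the SLO: fall back to the fastest one.
        return min(options, key=lambda c: c.p99_latency_ms)
    return min(feasible, key=lambda c: c.power_kw)

options = [
    Config("8xGPU @ 1980MHz", 8, 1980, 120.0, 5.6),
    Config("8xGPU @ 1410MHz", 8, 1410, 180.0, 4.1),
    Config("4xGPU @ 1980MHz", 4, 1980, 260.0, 2.9),
]
print(pick_config(options, latency_slo_ms=200.0).name)  # -> "8xGPU @ 1410MHz"
```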

🤯 Today I learned

Pre-trained Models

Pre-training refers to training a model on a large dataset before fine-tuning it for a specific task. The goal is to build a base model with general knowledge and understanding that can then be adapted to a variety of specific tasks. BERT and GPT are well-known pre-trained models.
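For example, with the Hugging Face transformers library you can download a pre-trained BERT and attach a fresh classification head for fine-tuning on your own task; the model name below is the standard public checkpoint.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load pre-trained BERT and add a randomly initialized 2-class classification head.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The model body already "knows" general language from pre-training;
# fine-tuning only needs to adapt it (and train the new head) on task-specific data.
inputs = tokenizer("The sofa arrived damaged.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2); scores are arbitrary until fine-tuned
```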

Video Object Segmentation

Video object segmentation (VOS) is a computer vision task that aims to identify and isolate specific objects or regions of interest throughout a video sequence. It involves accurately delineating the boundaries of target objects in each frame, addressing challenges such as object movement, deformation, occlusions, and changes in lighting. VOS can be semi-supervised (with manual annotation in the first frame), unsupervised (automatic detection), or interactive (with occasional user guidance). It has numerous applications, including video editing, autonomous vehicles, augmented reality, and surveillance.

Memory Attention Module

A memory attention module is a component in neural network architectures designed to selectively access and utilize relevant information from past inputs or states. It allows the model to focus on important historical data when processing current inputs, improving performance on tasks requiring long-term dependencies.

In other words, think of a memory attention module as a smart notepad for a neural network. As the network processes data, it jots down important points in this notepad. Later, when the network needs to make a decision, it doesn't just look at the current input - it also checks its notes.

The clever part is how it uses these notes. Instead of reading everything, it has a system (the attention mechanism) that quickly picks out the most relevant bits. It's like having a great assistant who can instantly pull up the right file when you need it.

This is super useful for tasks where context matters. For example, in language translation, understanding a word often depends on words that came much earlier in the sentence. The memory attention module helps the network remember and use that earlier context.
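For readers who like code, here is a bare-bones sketch of the idea in PyTorch: the current input provides the query, the "notepad" of stored past states provides the keys and values, and the attention weights decide which notes to read. This is a generic cross-attention over a memory buffer, not any specific paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAttention(nn.Module):
    """Generic cross-attention over a memory bank of past hidden states."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, current: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # current: (batch, dim) hidden state for the current step
        # memory:  (batch, num_notes, dim) stored states from earlier steps
        q = self.q_proj(current).unsqueeze(1)                 # (batch, 1, dim)
        k = self.k_proj(memory)                               # (batch, num_notes, dim)
        v = self.v_proj(memory)                               # (batch, num_notes, dim)
        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5   # (batch, 1, num_notes)
        weights = F.softmax(scores, dim=-1)                   # how relevant each "note" is
        read = (weights @ v).squeeze(1)                       # (batch, dim) weighted summary
        return current + read                                 # blend memory back into the state

mem_attn = MemoryAttention(dim=64)
out = mem_attn(torch.randn(2, 64), torch.randn(2, 10, 64))   # -> shape (2, 64)
```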