
[#30] Meta releases better technique for more efficient video-language understanding

Plus AMD reduces the cost of training diffusion models, lung cancer detection, autonomous driving from aerial photos, and seamless AI voice conversation

Hello readers, in this issue we cover:

  • Meta releases LongVU for efficient video-language understanding

  • AMD introduces new method for reducing the cost of diffusion models

  • Medical AI for Early Detection of Lung Cancer

  • Learning autonomous driving from aerial imagery

  • Seamless voice conversation with a new GPT model

🎥 Meta releases LongVU for efficient video-language understanding

Multimodal LLMs struggle with long videos due to limitations in processing capacity. To tackle this, teams at Meta developed LongVU, a method that compresses video data while keeping important visual details intact.

LongVU works by identifying and removing similar frames to cut down on redundant information, using features from a model called DINOv2. It also uses text prompts to selectively reduce the amount of data kept from each frame based on its relevance to the overall video. This adaptive approach lets the model handle many frames without losing much visual quality.
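The frame-pruning idea can be sketched in a few lines. This is a toy illustration, not Meta's implementation: random vectors stand in for DINOv2 frame embeddings, and the similarity threshold is an arbitrary choice.

```python
import numpy as np

def prune_similar_frames(features: np.ndarray, threshold: float = 0.95) -> list:
    """Keep a frame only if its feature vector differs enough from the
    last kept frame (cosine similarity below `threshold`)."""
    # Normalize rows so dot products are cosine similarities.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    kept = [0]  # always keep the first frame
    for i in range(1, len(normed)):
        if normed[i] @ normed[kept[-1]] < threshold:
            kept.append(i)
    return kept

# Toy example: six "frames" where 0-2 and 3-5 are near-duplicates.
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=64), rng.normal(size=64)
frames = np.stack([base_a, base_a + 0.01, base_a - 0.01,
                   base_b, base_b + 0.01, base_b - 0.01])
print(prune_similar_frames(frames))  # two representative frames survive
```

In the real system the pruned frame set would then be passed to the language model, with text-conditioned reduction applied on top.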

📸 AMD introduces new method for reducing the cost of diffusion models

Researchers at AMD introduce a new pruning method to reduce the cost of training diffusion models. They create a "SuperNet" that incorporates connections based on similar features, and design a pruner network that identifies and removes unnecessary computations. The method allows them to find the best simplified model with minimal adjustments.

AMD tested their approach on various diffusion models, including the Stable Diffusion series, and found that it can speed up processing by 4.4 times without losing accuracy, outperforming previous methods.
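AMD's pruner is itself a learned network, which is beyond a newsletter snippet; the sketch below instead shows the simplest form of the underlying idea, magnitude pruning, where the lowest-impact weights are zeroed so the corresponding multiply-adds can be skipped. All names and numbers here are illustrative.

```python
import numpy as np

def magnitude_prune(weight: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights, keeping `keep_ratio` of them."""
    flat = np.abs(weight).ravel()
    k = int(len(flat) * keep_ratio)
    threshold = np.sort(flat)[::-1][k - 1]  # k-th largest magnitude
    mask = np.abs(weight) >= threshold      # True only for the top-k weights
    return weight * mask

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 8))
pruned = magnitude_prune(w, keep_ratio=0.25)
print(np.count_nonzero(pruned))  # 16 of 64 weights remain
```

Learned pruners like AMD's aim to make this keep/drop decision jointly with training, rather than by a fixed magnitude rule.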

🫁 Medical AI for Early Detection of Lung Cancer

Statistical learning architectures vs deep learning

Lung cancer is a major health issue globally, making early diagnosis crucial for better treatment and outcomes. Computer-aided diagnosis (CAD) systems that analyze CT images have proven effective in detecting and classifying lung nodules, improving early-stage lung cancer detection rates.

This review highlights advancements in deep learning techniques such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Generative Adversarial Networks (GAN), as well as the use of ensemble models. These modern algorithms have greatly enhanced the accuracy and efficiency of analyzing lung nodules, especially in classification tasks.

🚁 Learning autonomous driving from aerial imagery

This paper addresses controlling ground vehicles using only aerial imagery. The authors use Neural Radiance Fields (NeRF) to efficiently synthesize novel viewpoints from the ground vehicle's perspective, which can then be used for tasks like autonomous navigation. They show how this method can train a system to control vehicles in a custom mini-city environment and even help the vehicle localize itself in the real world.

🎙️ Seamless voice conversation with a new GPT model

OmniFlatten is a new system designed to improve full-duplex spoken dialogue systems, which allow people to speak and listen at the same time, similar to natural human conversations. Full-duplex systems are more advanced than traditional turn-based systems but are challenging to implement because they need to handle interruptions, overlapping speech, and backchannel responses.

OmniFlatten, based on a GPT model, solves this by using a special training method that progressively teaches the model to handle speech and text together in real time. This makes conversations more natural and efficient without changing the core model structure. The approach unifies the training process across different tasks and modalities, paving the way for better full-duplex dialogue systems.
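The "flattening" in the name refers to serializing parallel streams (the user's audio tokens and the assistant's audio tokens) into one sequence that a decoder-only model can predict token by token. The sketch below shows one plausible chunk-wise interleaving; the chunk size and token names are made up, and this is not OmniFlatten's exact scheme.

```python
def flatten_streams(speech_in, speech_out, chunk=2):
    """Interleave fixed-size chunks from the two speech-token streams
    into one flat sequence, as a decoder-only LM expects."""
    flat = []
    for i in range(0, max(len(speech_in), len(speech_out)), chunk):
        flat.extend(speech_in[i:i + chunk])   # what the user just said
        flat.extend(speech_out[i:i + chunk])  # what the model says back
    return flat

# Hypothetical token ids for the user's audio and the assistant's audio.
user = ["u0", "u1", "u2", "u3"]
bot = ["b0", "b1", "b2", "b3"]
print(flatten_streams(user, bot))
# → ['u0', 'u1', 'b0', 'b1', 'u2', 'u3', 'b2', 'b3']
```

Because both sides advance in small alternating chunks, the model can react mid-utterance, which is what makes interruptions and backchannels possible.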

🤯 Today I Learned

Every issue, we highlight new AI concepts and terminology to help educate our readers. This issue we learned about:

DINOv2

DINOv2 is an advanced vision model trained with self-supervision, so it needs no labeled data. It learns to recognize and represent images by capturing their patterns and structures, making it useful for tasks like image classification, segmentation, and object detection. It is a follow-up to the original DINO, improving performance and versatility in visual recognition tasks.

Neural Radiance Field (NeRF)

Neural Radiance Field (NeRF) is a technique that uses deep learning to generate 3D scenes from 2D images. By modeling how light interacts with objects in a scene, NeRF creates realistic, high-quality 3D renderings, allowing users to view the scene from different angles. It's widely used in applications like virtual reality, 3D modeling, and visual effects.
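The "modeling how light interacts with objects" part is volume rendering: samples along each camera ray carry a density and a color, and they are alpha-composited front to back. Here is a minimal single-ray sketch of that compositing step, with made-up sample values:

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """NeRF-style volume rendering for one ray: alpha-composite color
    samples using per-sample densities and segment lengths."""
    alphas = 1.0 - np.exp(-densities * deltas)  # opacity of each segment
    # Transmittance: how much light survives to reach each sample.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * trans
    return weights @ colors  # final (3,) RGB for this ray

# Two samples along a ray: a faint red region, then a dense blue one.
densities = np.array([0.5, 5.0])
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
deltas = np.array([1.0, 1.0])
print(composite_ray(densities, colors, deltas))
```

A full NeRF learns `densities` and `colors` as a neural network of 3D position and view direction, then repeats this compositing for every pixel's ray.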