[#15] Meta releases Human Vision Models

Plus new models from AI21 Labs, generating sketches from an image, a database of AI risks, and answering questions about a first-person scene

Hello readers! In this issue we cover

  • Meta releases 4 Human Vision Models

  • AI21 Labs releases 2 LLMs with a 256K-token context length

  • MIT releases database of over 700 AI Risks

  • New model generating sketches from a given image

  • Meta releases a system that can recognize scenes and answer questions about them

🧍🏻‍♀️ Meta releases Sapiens, a family of human vision models

Sapiens is a family of models designed to handle four key human vision tasks: estimating 2D body poses, segmenting body parts, estimating depth, and predicting surface normals. The models natively support high-resolution inference and adapt easily to individual tasks by fine-tuning backbones pretrained on a dataset of over 300 million human images.

By using self-supervised learning on a specialized human image dataset, Sapiens significantly improves performance across tasks, even when labeled data is limited or synthetic. The models scale well, improving steadily as parameter counts grow from 0.3 to 2 billion, and they consistently outperform prior state-of-the-art models across benchmarks, showing substantial gains in pose estimation, body-part segmentation, depth estimation, and surface normal prediction.
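The self-supervised pretraining step can be pictured with a toy masked-reconstruction objective: hide random image patches and score a model only on how well it reconstructs the hidden ones. This is a minimal sketch of that idea (assuming an MAE-style objective; the patch size, mask ratio, and the trivial "model" below are illustrative, not Sapiens' actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(image, patch=4, mask_ratio=0.75, rng=rng):
    """Split an image into non-overlapping patches and hide a random subset.

    Returns the masked image and a boolean grid (True = hidden patch).
    The pretraining objective is to reconstruct the hidden patches.
    """
    h, w = image.shape
    gh, gw = h // patch, w // patch
    hidden = rng.random(gh * gw) < mask_ratio
    masked = image.copy()
    for idx in np.flatnonzero(hidden):
        r, c = divmod(idx, gw)
        masked[r*patch:(r+1)*patch, c*patch:(c+1)*patch] = 0.0
    return masked, hidden.reshape(gh, gw)

def reconstruction_loss(pred, target, hidden, patch=4):
    """Mean squared error computed only over the hidden patches."""
    err = (pred - target) ** 2
    gh, gw = hidden.shape
    losses = [err[r*patch:(r+1)*patch, c*patch:(c+1)*patch].mean()
              for r in range(gh) for c in range(gw) if hidden[r, c]]
    return float(np.mean(losses))

img = rng.random((16, 16))
masked, hidden = mask_patches(img)
# A trivial "model" that predicts the global mean everywhere:
pred = np.full_like(img, img.mean())
loss = reconstruction_loss(pred, img, hidden)
```

Because no labels appear anywhere in the objective, this style of pretraining can exploit the full 300-million-image corpus before any task-specific fine-tuning.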

🗣️ AI21 Labs releases 2 open source large language models

AI21 Labs has released two new large-scale, open source models, Jamba-1.5-Large and Jamba-1.5-Mini, which boast a 256K-token context length. Both are based on the hybrid Transformer-Mamba Jamba architecture, and they achieve excellent performance on academic benchmarks, chatbot evaluations, and long-context evaluations, while offering improved latency and throughput, especially for long contexts.

⚠️ MIT releases a database of over 700 AI risks

Researchers at MIT created a public database of 777 risks extracted from 43 taxonomies. This repository features a systematic review of AI risk classifications and expert input, organized into a high-level Causal Taxonomy and a mid-level Domain Taxonomy. The Causal Taxonomy categorizes risks by factors such as entity, intentionality, and timing, while the Domain Taxonomy divides risks into seven domains and 23 subdomains. This initiative aims to provide a structured, extensible framework for more effective discussion, research, and management of AI risks.
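One way to picture the two-axis structure is as a record that tags each risk along both taxonomies. The field names and example values below are illustrative, not the repository's actual schema:

```python
from dataclasses import dataclass

@dataclass
class AIRisk:
    """A single risk entry tagged along both taxonomies (illustrative schema)."""
    description: str
    # Causal Taxonomy: who or what causes the risk, whether it is intended, when it occurs
    entity: str          # e.g. "human" or "AI"
    intentionality: str  # e.g. "intentional" or "unintentional"
    timing: str          # e.g. "pre-deployment" or "post-deployment"
    # Domain Taxonomy: one of seven domains, subdivided into 23 subdomains
    domain: str
    subdomain: str

risk = AIRisk(
    description="Model leaks personal data memorized during training",
    entity="AI", intentionality="unintentional", timing="post-deployment",
    domain="Privacy & security", subdomain="Privacy violation",
)
```

Tagging every entry on both axes is what makes the database queryable: you can slice the same 777 risks by cause (the Causal Taxonomy) or by subject area (the Domain Taxonomy).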

✍️ Generating Multiple Face Sketches from Photos

Facial Sketch Synthesis (FSS) is the task of creating sketch portraits from photos, with applications in areas like face recognition, entertainment, and art. Producing high-quality sketches is tough, however, for three main reasons: too few artist-created examples, limited sketch styles, and weaknesses in how existing models handle input information.

To address these, researchers developed a new, efficient model that can turn photos into sketches in different styles without needing extra data like 3D shapes. They used techniques like semi-supervised learning and style guidance to improve the sketch quality.

💡Meta releases Lumos, a system that can recognize scenes and answer questions about them

Meta researchers created Lumos, a system that can extract text from images taken from a first-person perspective and answer questions about them.

The researchers addressed challenges such as latency (how do you quickly transfer an image from a device to the cloud for inference?), in-the-wild text, and poor image quality.

🤯 Today I Learned

Every issue, we highlight new AI concepts and terminology to help educate our readers. This issue we learned about:

Scene Text Recognition

Scene text recognition involves detecting and interpreting text in natural images, like signs or advertisements. It includes identifying text regions, recognizing the characters or words within those regions, and refining the results for accuracy. This technology is used in applications such as autonomous driving and augmented reality.

Semi-supervised Learning

Semi-supervised learning is a type of machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. This approach leverages the limited labeled data to guide the learning process, while the unlabeled data helps to improve the model's generalization and performance. It is particularly useful when labeling data is expensive or time-consuming, allowing the model to learn more effectively from the vast amounts of unlabeled data. Semi-supervised learning strikes a balance between supervised learning (using only labeled data) and unsupervised learning (using only unlabeled data).
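A minimal self-training (pseudo-labeling) loop shows the idea in one common form of semi-supervised learning: fit on the labeled points, pseudo-label only the unlabeled points the model is confident about, and refit on both. The 1-D data and nearest-centroid classifier here are toys chosen for readability:

```python
# Toy self-training on 1-D data with a nearest-centroid classifier.
labeled = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
unlabeled = [0.5, 1.5, 8.5, 9.5, 5.0]  # 5.0 sits ambiguously between the classes

def centroids(points):
    """Mean position per class label."""
    sums, counts = {}, {}
    for x, y in points:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    """Return (label, margin); margin is the gap between the two nearest centroids."""
    dists = sorted((abs(x - c), y) for y, c in cents.items())
    return dists[0][1], dists[1][0] - dists[0][0]

cents = centroids(labeled)
# Pseudo-label only high-confidence (large-margin) unlabeled points.
pseudo = [(x, predict(cents, x)[0]) for x in unlabeled if predict(cents, x)[1] > 2.0]
cents = centroids(labeled + pseudo)  # refit on labeled + pseudo-labeled data
```

Note that the ambiguous point (5.0) is never pseudo-labeled: the confidence threshold is what keeps label noise from the unlabeled pool out of the next round of training.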

Mamba Architecture

Mamba is a sequence-modeling architecture built on selective state-space models (SSMs) rather than attention. It is an alternative to the Transformer designed to improve efficiency and scalability, particularly for long sequences: its recurrent formulation scales linearly with sequence length, reducing computational complexity and memory usage while maintaining or enhancing performance. Mamba is suitable for tasks in natural language processing, computer vision, and other areas that involve large-scale sequence modeling.
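The heart of the architecture is a linear state-space recurrence, which can be sketched in a few lines. This is a scalar-state toy with fixed parameters; real Mamba uses learned, input-dependent (selective) parameters, vector states, and a hardware-efficient parallel scan:

```python
# Minimal linear state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
# Each step touches only the previous state, so cost is linear in sequence
# length and memory is constant -- unlike attention, which compares all pairs
# of tokens and scales quadratically.
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # update hidden state
        ys.append(c * h)    # emit output
    return ys

ys = ssm_scan([1.0, 0.0, 0.0, 0.0])  # an impulse decays geometrically through the state
```

That constant-memory, linear-time scan is why a Mamba-based hybrid like Jamba can serve 256K-token contexts with better latency and throughput than a pure-attention model.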