• General Intelligence
  • Posts
  • [#20] Are better deepfakes here? TikTok's parent company unveils enhanced dubbing framework.

[#20] Are better deepfakes here? TikTok's parent company unveils enhanced dubbing framework.

Plus research from DeepMind, indecision can be incorporated into recommender systems, a dataset for jailbreaking LLMs, text-to-3D models,

Hello readers, in this issue we cover

  • TikTok’s parent company, ByteDance, creates a framework that better lip syncs and matches facial expressions to target audio

  • DeepMind explores how machines can learn like humans in vision tasks

  • MIT and Harvard researchers show that response delay can be incorporated into a recommendation system

  • A dataset for multi-turn jailbreak attacks, an emerging vulnerability in frontier models

  • Better text to 3D models with DreamMapper

👯 Are Better Deepfakes Here? PersonaTalk, created by TikTok’s parent company, can sync a person’s lips and facial movements to audio

Taylor Swift speaking Chinese, with matching voice and facial expressions

PersonaTalk allows lip-sync visual dubbing while preserving individuals' talking style and facial details. In other words, it’s a framework that can make it look like someone is saying something they didn't actually say. It's like a really good deepfake, but it focuses on making the person's face move and talk like they're saying the new words. It's better than other methods at making it look natural and keeping the person's personality.

PersonaTalk could especially help in the realm of multi-lingual translations without the use of subtitles, such as translating content or educational lectures.

🧠 DeepMind explores how machines can learn like humans in vision tasks

DeepMind’s model better aligns human and machine performance

Deep neural networks have been successful in various applications, including modeling human behavior in vision tasks, but they often fail to generalize as robustly as humans. One key difference is that human conceptual knowledge is hierarchically organized, while neural networks do not fully capture these levels of abstraction. To address this, researchers train a "teacher model" that imitates human judgments and transfers this structure to state-of-the-art vision models. These human-aligned models more accurately reflect human behavior and improve generalization and robustness in machine learning tasks. Infusing neural networks with human-like knowledge can lead to more robust, interpretable, and human-like AI systems.

❓MIT and Harvard researchers show that response delay can be incorporated into a recommendation system

Indecision when selecting between two options provides extra insight

In interactive preference learning, we often rely on simple yes-or-no feedback to understand what people prefer, but this doesn’t tell us how strongly they feel about their choices. To get a better sense of preference strength, this research uses how quickly people make decisions, as faster responses tend to indicate stronger preferences. The study incorporates a model that uses both choices and response times to estimate preferences more effectively. By treating this as a regression problem, the researchers developed a new, efficient way to predict preferences. Their findings show that including response times provides extra insight, especially when people have strong preferences, making the learning process faster and more accurate. Tests on real-world data confirmed that this approach speeds up learning when budgets are tight.

👾 Multi-turn jailbreak attacks, an emerging vulnerability in frontier models

All frontier models could be jailbroken in single-turn and multi-turn attacks

Multi-turn jailbreak attacks involving having a conversation with an LLM to compromise it. This study introduces a dataset of jailbreak examples in both single-turn and multi-turn formats, revealing that even if the content is the same, the success of an attack can vary based on the input structure. It also shows that filters designed to block harmful content perform differently depending on the structure of the input, not just the content itself. Protecting against single-turn jailbreaks does not protect against multi-turn jailbreaks.

This dataset helps facilitate research to explore weaknesses in advanced models in both single-turn and multi-turn jailbreak attempts.

🎸 Introducing chords into AI generated music

New model architecture that incorporates chords into song generation

The Song Generation task aims to create music with vocals and accompaniment based on lyrics. Existing models like Jukebox, created by OpenAI, struggle to control music quality. To address this, the authors introduce chords, a key element in music composition that supports harmony and melidy, into song generation models. They develop a new model, called Chord-Conditioned Song Generator, or CSG, that using a cross-attention mechanism to better incorporate chord information and reduce errors. CSG, shows improved musical quality and precision compared to other methods.

🗿Text to 3D Models with DreamMapper

Score Distillation Sampling (SDS) is a popular method for creating 3D content from text descriptions, but it often produces overly smooth or color-saturated results. The researchers analyzed SDS and developed two new techniques to improve it:

  1. Variational Distribution Mapping (VDM): This method speeds up the process by treating rendered images as degraded versions of AI-generated images. It's more efficient because it avoids complex calculations in the AI model.

  2. Distribution Coefficient Annealing (DCA): This technique further improves the accuracy of transferring information from 2D to 3D.

Using these methods with a 3D representation called Gaussian Splatting, the researchers created a new text-to-3D generation system. Tests showed that VDM and DCA can produce more realistic and detailed 3D assets more efficiently than previous methods.

🤯 Today I Learned

Every issue, we highlight new AI concepts and terminology to help educate our readers. This issue we learned about:

Linear Bandit

Linear Bandit is a type of reinforcement learning problem where the agent learns to make sequential decisions in an environment with a linear reward function. This means that the reward associated with a particular action can be expressed as a linear combination of features of the current state.

Key characteristics of linear bandit problems:

  • Linear Reward Function: The reward is a linear function of the state features.

  • Exploration vs. Exploitation: The agent faces a trade-off between exploring new actions to learn their rewards and exploiting actions that have been shown to be rewarding.

  • Bandit Feedback: The agent only receives feedback about the reward of the chosen action, not the rewards of other possible actions.

Linear bandit problems have applications in various domains, including advertising, recommendation systems, and clinical trials.

Multi-turn Jailbreak Attacks

A multiturn jailbreak attack is a type of adversarial attack on a language model where the attacker engages in a multi-turn conversation with the model, gradually guiding it towards generating harmful or unsafe content.

Unlike single-turn attacks, which involve a single prompt, multiturn attacks exploit the model's ability to maintain context and build upon previous interactions. This allows the attacker to gradually escalate the conversation and manipulate the model's responses.

Score Distillation Sampling

Score Distillation Sampling (SDS) is a technique that uses a denoising model to optimize images based on text prompts. It's a popular method that's used in a variety of applications, including: 

  • Image editing: SDS can be used to introduce new content while preserving the original structure and motion of a video. 

  • Text-to-3D synthesis: SDS can be used to generate 3D objects from text prompts. 

  • Image translation: SDS can be used to train a network to translate images based on a specified task.