Issue #7: GPT-4 can predict earnings better than humans

Plus more on avoiding model collapse, measuring transparency in foundation models, accurately captioning videos, and identifying offensive language

Hello readers! In this issue we cover:

  • How researchers can avoid model collapse (hint: mix human-generated data with synthetic data)

  • How to measure transparency in foundation models

  • GPT-4 can predict earnings better than human financial analysts

  • Nvidia creates a framework to accurately caption videos

  • Measuring and identifying toxic language online

⛓️‍💥 Model Collapse Is Actually Not A Problem?

Models produce gibberish when successively trained on AI-generated synthetic data

Recently, this Nature article on model collapse got a lot of attention. It showed that if you train models on 100% AI-generated data, performance degrades with each generation until the model produces gibberish, a failure known as model collapse.

Now, researchers at Stanford say that model collapse can be avoided by training on a combination of human-generated and synthetic data, and that this mixture is arguably closer to how models are actually trained in practice.
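The key change is accumulating data across generations rather than replacing it. Here is a minimal, purely illustrative sketch of that idea in Python, where `train_model` and `generate_synthetic` are hypothetical stand-ins for whatever training and sampling code you already have:

```python
import random

def train_with_accumulated_data(human_data, train_model, generate_synthetic, generations=5):
    """Toy illustration: keep the human-written data in every generation's training mix
    instead of training each new model only on the previous model's outputs."""
    data = list(human_data)            # generation 0: purely human-written text
    model = train_model(data)
    for gen in range(1, generations + 1):
        synthetic = generate_synthetic(model, n_samples=len(human_data))
        data = data + synthetic        # accumulate: the original human data is never discarded
        random.shuffle(data)
        model = train_model(data)
    return model
```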

🪟 How to Measure Transparency in Foundation Models

Foundation Model Transparency Index scores

The Foundation Model Transparency Index (FMTI) measures how transparent 14 major foundation model developers are across 100 different indicators. One such indicator is “do they disclose wages paid for data labor?”

Almost all model developers, including Anthropic and Google, became more transparent about their practices over time. Microsoft and Meta are among the most transparent, while Google, Amazon, and OpenAI are among the least.

📈 GPT-4 is Better Than Human Financial Analysts

Charging Bull in New York City

Researchers from the Chicago Booth School of Business tested whether GPT-4 could analyze financial statements like a professional analyst. They found that even without extra context, GPT-4 predicted the direction of earnings changes better than human analysts, and its predictions were about as accurate as those of specialized machine learning models while providing valuable insights. Trading strategies based on GPT-4’s predictions also outperformed those based on other models.
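The paper uses its own carefully designed prompts; purely as a hedged sketch of the kind of setup involved (the prompt wording and the anonymized statement below are illustrative assumptions, not the authors’ materials), a query via the OpenAI Python client might look like this:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative, anonymized statement excerpt; the study strips company names and dates
# so the model cannot rely on memorized outcomes.
statement = """
Balance sheet (t-1 -> t): total assets 410 -> 455, total liabilities 230 -> 250
Income statement (t-1 -> t): revenue 120 -> 138, operating income 14 -> 19, net income 9 -> 13
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a financial analyst. Reason step by step "
                                      "about trends, ratios, and likely drivers."},
        {"role": "user", "content": f"Given these standardized financial statements:\n{statement}\n"
                                    "Will earnings increase or decrease next period? "
                                    "Answer 'increase' or 'decrease' and state your confidence."},
    ],
)
print(response.choices[0].message.content)
```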

🎥 NVIDIA creates automated framework to accurately caption videos

Wolf: A World Summarization Framework

Accurately captioning videos matters because good captions aid content understanding and the training of video and image foundation models, yet current captioning methods are inaccurate, time-consuming, and expensive. The Wolf framework, developed by researchers at Nvidia, combines multiple models to automatically caption videos and enhance content understanding, and it outperforms both GPT-4 and Gemini by 50%. The researchers also introduce CapScore, a metric for assessing caption quality, which they use to establish a captioning benchmark and leaderboard. Gemini Pro is better at video captioning than GPT-4, while Wolf beats both.
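To make the “take advantage of multiple models” idea concrete, here is a hedged sketch of fusing per-model captions with an LLM summarizer. The helper, prompt, and model name are our assumptions for illustration, not Wolf’s actual implementation:

```python
from openai import OpenAI

client = OpenAI()

def fuse_captions(captions_by_model: dict[str, str]) -> str:
    """Toy version of caption fusion: several models describe the same clip,
    and an LLM merges the descriptions into one consolidated caption."""
    listed = "\n".join(f"- {name}: {cap}" for name, cap in captions_by_model.items())
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # arbitrary choice for this sketch
        messages=[{
            "role": "user",
            "content": "Several models captioned the same video clip:\n"
                       f"{listed}\n"
                       "Write a single caption that keeps details the captions agree on "
                       "and drops anything that appears in only one of them.",
        }],
    )
    return response.choices[0].message.content
```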

🤬 Measuring Toxic Language with ML models

Macro F1 score of the publicly available offensive language detection models

Offensive content online, including hate speech and cyberbullying, is a pervasive problem. This paper studies how well various LLMs and ML approaches can detect offensive language across social media platforms such as X and Facebook. It introduces a new benchmark, GenOffense, that quantifies how well these models generalize across many data sources and platforms.
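Generalization here boils down to computing the same metric, macro F1, on each dataset and comparing the scores. A minimal sketch with scikit-learn, using made-up placeholder labels and predictions:

```python
from sklearn.metrics import f1_score

# Placeholder per-dataset labels and model predictions (1 = offensive, 0 = not offensive).
datasets = {
    "platform_a": {"y_true": [1, 0, 1, 1, 0], "y_pred": [1, 0, 0, 1, 0]},
    "platform_b": {"y_true": [0, 0, 1, 0, 1], "y_pred": [0, 1, 1, 0, 1]},
}

for name, d in datasets.items():
    # Macro F1 averages the per-class F1 scores equally, so the rarer
    # 'offensive' class counts as much as the majority class.
    score = f1_score(d["y_true"], d["y_pred"], average="macro")
    print(f"{name}: macro F1 = {score:.2f}")
```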

🤯 Today I Learned

Every issue, we highlight new AI concepts and terminology to help educate our readers. This issue we learned about:

Video Language Model

A video language model is a type of artificial intelligence model designed to understand and generate text descriptions or captions for video content. It combines techniques from natural language processing (NLP) and computer vision to analyze and interpret the visual and temporal information in videos. VLMs are used for video captioning, video retrieval (searching for videos based on text queries), video question answering, and video generation.
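Full video language models ingest frames (and often audio) directly, but a common lightweight approximation is to sample frames and run an image captioner over them. A rough sketch of that approximation, where the model choice and sampling rate are our assumptions:

```python
import cv2
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption_video(path: str, every_n_frames: int = 60) -> list[str]:
    """Sample one frame every `every_n_frames` and caption it with an image captioner."""
    cap = cv2.VideoCapture(path)
    captions, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV frames are BGR
            result = captioner(Image.fromarray(rgb))
            captions.append(result[0]["generated_text"])
        idx += 1
    cap.release()
    return captions
```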

ToxicBERT

ToxicBERT is a variant of the BERT (Bidirectional Encoder Representations from Transformers) model, fine-tuned specifically for detecting toxic or harmful content in text. It leverages the pre-trained BERT model's understanding of language and is further trained on datasets containing examples of toxic language, such as hate speech, abusive comments, and other forms of offensive content. By doing so, ToxicBERT can effectively identify and classify text as toxic or non-toxic, making it a valuable tool for moderating online content and combating cyberbullying.
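For illustration, one publicly available checkpoint in this family is unitary/toxic-bert on Hugging Face; a minimal usage sketch (the threshold and example comments are arbitrary choices):

```python
from transformers import pipeline

# Multi-label toxicity classifier fine-tuned from BERT on toxic-comment data.
classifier = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)

comments = [
    "Thanks for the helpful explanation!",
    "You are an idiot and nobody wants you here.",
]

for comment, scores in zip(comments, classifier(comments)):
    # Each result is a list of {label, score} dicts; keep labels above a threshold.
    flagged = [s["label"] for s in scores if s["score"] > 0.5]
    print(comment, "->", flagged or ["non-toxic"])
```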

F1 Score

The F1 score is a way to measure how well a model makes predictions. Think of it like grading a test where the score considers both how many of your answers were correct (precision) and how many of the real answers you found (recall). It's useful when you want to balance not missing true positives against not falsely labeling negatives as positives. A perfect F1 score is 1, meaning the model makes no false positives and no false negatives.
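As a concrete worked example with made-up counts:

```python
# Toy confusion counts: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)   # 8 / 10 = 0.80: how many flagged items were actually positive
recall = tp / (tp + fn)      # 8 / 12 ≈ 0.67: how many real positives were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")  # F1 ≈ 0.73
```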