Issue #13: Amazon creates new RAG evaluation framework

Plus automated scientific discovery, improving data quality, a flexible visual memory, and predicting future events

Hello readers! In this issue we cover

  • Automating scientific discovery for cheap

  • DeepMind researchers introduce visual memory

  • ScalingFilter, a novel method to improve the quality of pre-training data

  • Amazon researchers introduce a new RAG evaluation framework

  • LLMs used to predict open-ended events based on past and related events

👩‍🔬 Automated Scientific Discovery for less than $300

LLM, # of ideas generated, and cost to produce a scientific paper

This paper demonstrates an agent that can conduct automated scientific research and write high-quality papers that meet the standard of top conferences, at a cost of roughly $10 to $300 per paper.

The process includes idea generation, experiment design, execution, visualization, and writing the results up into a manuscript. Additionally, an automated reviewer achieves human-level performance across a variety of metrics.
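To make the pipeline concrete, here is a minimal sketch of such an idea-to-manuscript loop. The `llm` and `run_experiment` callables, stage prompts, and overall structure are our own illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of the idea -> experiment -> paper -> review loop. `llm` is any
# prompt-to-completion function and `run_experiment` executes a plan and returns
# metrics; both are illustrative stand-ins, not the paper's implementation.
def research_cycle(llm, run_experiment, topic: str, n_ideas: int = 3) -> list[dict]:
    ideas = llm(f"Propose {n_ideas} concise research ideas about {topic}, one per line.")
    drafts = []
    for idea in (line for line in ideas.splitlines() if line.strip()):
        plan = llm(f"Design a small experiment to test this idea: {idea}")
        results = run_experiment(plan)  # in the real system, generated code is executed here
        paper = llm(
            "Write a short conference-style paper.\n"
            f"Idea: {idea}\nPlan: {plan}\nResults: {results}"
        )
        review = llm(f"Act as a reviewer. Score this paper 1-10 with reasons:\n{paper}")
        drafts.append({"idea": idea, "paper": paper, "review": review})
    return drafts
```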

✏️ DeepMind Researchers Can Make Neural Networks Unlearn and Re-Learn

Interpretable decision-making. A retrieval-based visual memory enables a clear visual understanding of why a model makes a certain prediction

Once a neural network is trained, the knowledge it has learned is difficult to modify: it is distributed across the model’s weights, leaving the network rigid. DeepMind researchers explore an alternative approach that introduces the flexibility of a database.

In an image classification and search setting, they build a flexible memory that can add data across scales, remove data through unlearning and pruning, and expose a mechanism that lets researchers intervene to control its behavior.
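As a rough illustration of what a retrieval-based visual memory can look like, here is a hedged sketch: a frozen encoder plus a store of (embedding, label) pairs, where classification is a k-nearest-neighbor vote and unlearning is simply deleting entries. The class and method names are ours, not DeepMind's.

```python
# Sketch of a retrieval-based visual memory: predictions come from the stored
# neighbors, so adding, removing, and inspecting knowledge is a database operation.
import numpy as np

class VisualMemory:
    def __init__(self, encoder, k: int = 5):
        self.encoder = encoder          # maps an image to a 1-D embedding vector
        self.k = k
        self.keys, self.labels = [], []

    def add(self, images, labels):
        """Insert new examples without retraining the encoder."""
        for img, y in zip(images, labels):
            self.keys.append(self.encoder(img))
            self.labels.append(y)

    def remove(self, label):
        """'Unlearn' a concept by deleting its entries from the memory."""
        kept = [(k, y) for k, y in zip(self.keys, self.labels) if y != label]
        self.keys = [k for k, _ in kept]
        self.labels = [y for _, y in kept]

    def predict(self, image):
        """k-NN vote over cosine similarity; the retrieved neighbors explain the decision."""
        q = self.encoder(image)
        keys = np.stack(self.keys)
        sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(-sims)[: self.k]
        votes = [self.labels[i] for i in top]
        return max(set(votes), key=votes.count), [(self.labels[i], sims[i]) for i in top]
```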

💽 ScalingFilter improves data quality for pre-training

Data quality is crucial for effective pre-training of models; however, current methods rely on reference high-quality datasets, which may introduce bias and compromise diversity.

To overcome this, researchers created ScalingFilter, a novel approach that evaluates text quality based on the perplexity difference between two different-sized language models trained on the same data, eliminating the influence of a reference dataset from the filtering process.

The team found that their method improves the zero-shot performance of pre-trained models. They also introduce a new metric, semantic diversity, for measuring how varied a dataset is.
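As a rough sketch of the filtering idea (under our own simplifying assumptions; the paper's exact scoring formula may differ), one can score each document by the perplexity gap between a larger and a smaller model trained on the same data, then keep the highest-scoring documents:

```python
# ScalingFilter-style filtering sketch: documents the larger model handles
# disproportionately better than the smaller one are treated as higher quality.
# `ppl_small` / `ppl_large` stand for any functions returning a model's perplexity on a text.
import math

def quality_score(text: str, ppl_small, ppl_large) -> float:
    """Higher score = the larger model predicts the text disproportionately better."""
    return math.log(ppl_small(text)) - math.log(ppl_large(text))

def scaling_filter(docs, ppl_small, ppl_large, keep_ratio: float = 0.5):
    """Rank documents by quality score and keep the top fraction for pre-training."""
    scored = sorted(docs, key=lambda d: quality_score(d, ppl_small, ppl_large), reverse=True)
    return scored[: max(1, int(len(scored) * keep_ratio))]
```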

🔎 Amazon Researchers Evaluate RAG Performance with New Metrics

RAGChecker Overall, Retriever, and Generator Metrics

Researchers at Amazon AWS created a new framework called RAGChecker that evaluates both the retrieval and generation components of RAG systems. Using it, they evaluated 8 RAG systems across 10 domain-specific datasets; the best performer was E5-Mistral paired with GPT-4, owing to E5-Mistral's strong retrieval capability and GPT-4's adept comprehension abilities.
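RAGChecker's metrics work at the claim level: responses and ground-truth answers are broken into claims and checked with an entailment model. The sketch below shows the general shape of such an overall precision/recall computation; `extract_claims` and `entails` are stand-ins for a claim extractor and an entailment checker, and this is not Amazon's implementation.

```python
# Claim-level precision/recall sketch in the style of RAGChecker's overall metrics.
def overall_precision_recall(response: str, ground_truth: str, extract_claims, entails):
    resp_claims = extract_claims(response)
    gt_claims = extract_claims(ground_truth)
    # Precision: fraction of response claims supported by the ground-truth answer.
    correct = sum(entails(ground_truth, c) for c in resp_claims)
    # Recall: fraction of ground-truth claims covered by the response.
    covered = sum(entails(response, c) for c in gt_claims)
    precision = correct / len(resp_claims) if resp_claims else 0.0
    recall = covered / len(gt_claims) if gt_claims else 0.0
    return precision, recall
```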

🔮 Predicting the Future with AI

OpenEP (Open-Ended Future Event Prediction) generates diverse and flexible predictions for real-world scenarios. Previous research in event prediction relied on a discrete or fixed set of outcomes rather than open-ended ones. The researchers feed similar and related events to an LLM, which then predicts what may happen in the future.
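A minimal sketch of that setup, with an assumed retrieval step and prompt wording (not the OpenEP benchmark's code), might look like:

```python
# Open-ended forecasting sketch: pass related past events to an LLM and ask for
# free-form predictions. `llm` is any prompt-to-completion function.
def predict_future_events(llm, query_event: str, related_events: list[str], n: int = 3) -> str:
    context = "\n".join(f"- {e}" for e in related_events)
    prompt = (
        f"Past related events:\n{context}\n\n"
        f"Current event: {query_event}\n"
        f"Predict {n} plausible future developments, each with a brief rationale."
    )
    return llm(prompt)
```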

🤯 Today I Learned

Every issue, we highlight new AI concepts and terminology to help educate our readers. This issue we learned about:

Event Prediction

Event prediction involves using data and algorithms to forecast future occurrences based on patterns and trends. It can apply to a wide range of scenarios, such as predicting weather events, stock market changes, or social behaviors. By analyzing past data, event prediction models can estimate the likelihood and timing of specific events. This helps in planning, decision-making, and risk management across various fields.

Perplexity

Perplexity is a measurement used in natural language processing (NLP) to evaluate how well a language model predicts a sample of text. It quantifies the model's uncertainty when predicting the next word in a sequence. A lower perplexity indicates that the model is better at making predictions, meaning it is more confident and accurate. Essentially, perplexity can be thought of as the average number of possible next words the model considers likely, so a lower value means the model is more certain about its predictions.
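In code, perplexity is just the exponential of the average negative log-likelihood the model assigns to each token. A toy example with hand-picked token probabilities:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """token_probs[i] = probability the model gave to the i-th actual token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

print(perplexity([0.5, 0.25, 0.5]))   # ~2.52: model was fairly confident
print(perplexity([0.05, 0.02, 0.1]))  # ~21.5: model was much more uncertain
```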

Semantic Diversity

Semantic diversity refers to the range of different meanings or concepts represented in a set of words, sentences, or texts. In natural language processing and linguistics, it measures how varied the content is in terms of its meaning. High semantic diversity means the content covers a wide array of topics or ideas, while low semantic diversity indicates repetition or a focus on a narrow subject matter. It's often used to assess the richness and variability of language in communication or text generation.
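One common way to put a number on this (our illustration; the ScalingFilter paper defines its own metric) is to embed each text and measure how spread out the embeddings are, for example one minus the average pairwise cosine similarity:

```python
import numpy as np

def semantic_diversity(embeddings: np.ndarray) -> float:
    """embeddings: (n_texts, dim) array from any sentence encoder; higher = more diverse."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(sims)
    # Average similarity between distinct pairs; higher similarity means lower diversity.
    avg_sim = (sims.sum() - n) / (n * (n - 1))
    return 1.0 - avg_sim
```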