
Google DeepMind and Stanford have developed an AI fact-checking system - correct in 76% of disputed cases


One of the biggest drawbacks of AI-based chatbots is so-called "hallucination", when the AI invents false information; in other words, it lies. Some experts argue that hallucination is an interesting feature of AI and can even be useful for generative models that create images and videos. But it is not useful for language models that are expected to answer users' questions with accurate data.

Google DeepMind and Stanford University seem to have found a way to address the problem. Researchers have developed a verification system for large language models: the Search-Augmented Factuality Evaluator, or SAFE, which fact-checks long-form answers generated by AI chatbots. Their research is available as a preprint on arXiv, along with all experimental code and datasets.

The system checks the accuracy and relevance of responses in four steps. SAFE first breaks the response down into individual facts, then revises each fact so it is self-contained, checks whether it is relevant to the original query, and finally verifies it against Google Search results.
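
As a rough illustration of that four-step loop, here is a minimal Python sketch. The helpers `ask_llm` and `google_search` are hypothetical placeholders, not the authors' actual API; the real implementation is in the code released alongside the preprint.

```python
# Minimal sketch of a SAFE-style pipeline, under stated assumptions.
# `ask_llm` and `google_search` are hypothetical stand-ins for an LLM API
# and a Google Search wrapper.

def ask_llm(prompt: str) -> str:
    """Placeholder: call an LLM of your choice and return its text reply."""
    raise NotImplementedError

def google_search(query: str) -> list[str]:
    """Placeholder: return text snippets from Google Search results."""
    raise NotImplementedError

def safe_check(question: str, response: str) -> dict[str, str]:
    # Step 1: split the long-form response into individual facts.
    facts = ask_llm(
        f"List each standalone fact in this text, one per line:\n{response}"
    ).splitlines()

    verdicts: dict[str, str] = {}
    for fact in filter(None, (f.strip() for f in facts)):
        # Step 2: rewrite the fact so it is understandable without context.
        fact = ask_llm(f"Rewrite this so it is self-contained: {fact}")

        # Step 3: keep only facts relevant to the original question.
        relevant = ask_llm(f"Is '{fact}' relevant to answering '{question}'? yes/no")
        if relevant.strip().lower() != "yes":
            verdicts[fact] = "irrelevant"
            continue

        # Step 4: verify the fact against Google Search snippets.
        snippets = "\n".join(google_search(fact))
        verdicts[fact] = ask_llm(
            f"Given these search results:\n{snippets}\n"
            f"Is the claim '{fact}' supported? Answer 'supported' or 'not supported'."
        ).strip().lower()
    return verdicts
```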

To assess SAFE's performance, the researchers created LongFact, a benchmark of fact-seeking prompts, and tested the system on 13 large language models from four model families (Claude, Gemini, GPT, PaLM-2), covering roughly 16,000 individual facts. SAFE's verdicts matched human annotations in 72% of cases, and in a sample of cases where SAFE and the human raters disagreed, SAFE turned out to be correct 76% of the time.
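
To make those two numbers concrete, the sketch below shows one way such agreement figures can be computed. The function and its inputs are illustrative assumptions, not the paper's evaluation code.

```python
# Toy illustration of the two metrics reported above, using hypothetical label lists.
def agreement_metrics(safe_labels: list[str], human_labels: list[str],
                      ground_truth: list[str]) -> tuple[float, float]:
    pairs = list(zip(safe_labels, human_labels, ground_truth))
    # Agreement rate: how often SAFE matches the human annotation (72% in the paper).
    agree = sum(s == h for s, h, _ in pairs) / len(pairs)
    # On disagreements, how often SAFE matches ground truth (76% in the paper).
    disagreements = [(s, g) for s, h, g in pairs if s != h]
    safe_wins = (sum(s == g for s, g in disagreements) / len(disagreements)
                 if disagreements else 0.0)
    return agree, safe_wins
```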

The researchers claim that using SAFE is 20 times cheaper than human verification, which makes the solution economically viable and scalable. Existing approaches to assessing the factual accuracy of model-generated content usually rely on direct human evaluation, and despite its value, that process is limited by the subjectivity and variability of human judgment and by the difficulty of scaling human labor to large datasets.

Source: Marktechpost
