Thursday, September 19, 2024

AI Startup Benchmark Tests Top Language Models

Galileo Technologies released the Hallucination Index, which benchmarks LLMs on accuracy and cost-effectiveness. Claude 3.5 Sonnet leads in accuracy, while Gemini 1.5 Flash offers the best value.

Artificial intelligence startup Galileo Technologies Inc. released the results of a benchmark test that compared the accuracy of the industry’s most popular large language models.

The Hallucination Index, as the benchmark is called, evaluated 12 open-source and 10 proprietary LLMs. Galileo measured the models’ accuracy across three task collections. Several open-source and cost-optimized LLMs completed some task collections with perfect accuracy, demonstrating that such models can be a competitive alternative to frontier AI systems.

“Our goal wasn’t just to rank models but rather to give AI teams and leaders the real-world data they need to adopt the right model for the right task, at the right price,” said Galileo co-founder and Chief Executive Officer Vikram Chatterji.

San Francisco-based Galileo is backed by more than $20 million in venture funding. It provides a cloud-based platform for AI teams to measure the accuracy of their neural networks and debug technical issues. In May, the company updated the software with a tool for protecting LLMs from malicious input.

Galileo evaluated the models in the Hallucination Index using a feature of its platform called Context Adherence. According to the company, the feature works by providing an LLM with a test prompt and then measuring the quality of its response using a second LLM. Galileo used OpenAI’s flagship GPT-4o model to assess AI responses.

Each test prompt in the Hallucination Index comprised a question and a piece of text containing the answer. Galileo evaluated how well each of the 22 LLMs could deduce the answer from the provided text.
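The evaluation loop described above can be sketched as follows. This is only an illustration of the general LLM-as-judge pattern: the helper names, prompt wording, and the trivial containment check standing in for the GPT-4o judge are all assumptions, since Galileo has not published the Context Adherence implementation.

```python
def build_prompt(context: str, question: str) -> str:
    """Each test prompt pairs a question with text containing the answer."""
    return (
        "Answer using only the text below.\n\n"
        f"Text: {context}\n\nQuestion: {question}"
    )

def judge_score(question: str, context: str, answer: str) -> float:
    # Stand-in for the judge step: the real benchmark asks a second LLM
    # (GPT-4o) to grade how well the answer adheres to the context. Here a
    # simple containment check illustrates the shape of that scoring call.
    return 1.0 if answer in context else 0.0

def evaluate(model_answer_fn, dataset) -> float:
    """Average judge score for one model over (question, context) pairs.

    model_answer_fn is a hypothetical callable wrapping the model under
    test; in practice it would be an API call to the candidate LLM.
    """
    scores = [
        judge_score(question, context, model_answer_fn(build_prompt(context, question)))
        for question, context in dataset
    ]
    return sum(scores) / len(scores)
```

In this sketch a perfect run yields 1.0, matching how the Index reports scores on a 0-to-1 scale.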

The company ranked Anthropic PBC’s Claude 3.5 Sonnet as the most accurate LLM. It’s the midrange model in a planned LLM series that Anthropic began rolling out last month. Claude 3.5 Sonnet is a scaled-down, less expensive version of the most advanced model in the series, which has not yet been publicly released.

Each LLM that Galileo evaluated received three sets of questions as part of the test. The prompts in the first set had up to 5,000 tokens of data, while the second set comprised questions with between 5,000 and 25,000 tokens. The questions in the third set ranged from 40,000 to 100,000 tokens. Claude 3.5 Sonnet completed the second and third task collections with perfect accuracy, while its responses to the first set scored 0.97 out of 1.
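The three prompt collections can be expressed as simple token-count buckets. This is a sketch based only on the ranges reported above; note the article leaves a gap between 25,000 and 40,000 tokens, which the sketch preserves, and Galileo's actual tokenizer and bucketing code are not public.

```python
from typing import Optional

def bucket_for(token_count: int) -> Optional[str]:
    """Map a prompt's token count to one of the Index's three collections.

    Returns None for counts outside all three reported ranges (the article
    describes no set covering 25,000-40,000 tokens or beyond 100,000).
    """
    if token_count <= 5_000:
        return "short"
    if token_count <= 25_000:
        return "medium"
    if 40_000 <= token_count <= 100_000:
        return "long"
    return None
```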

Galileo ranked Google LLC’s Gemini 1.5 Flash as the language model that provides the best value for money. The lightweight LLM, which debuted in May, costs roughly a tenth of what Anthropic charges for Claude 3.5 Sonnet. Google’s model achieved accuracy scores of 0.94, 1, and 0.92 across the Hallucination Index’s short, medium, and long prompt collections, respectively.

An LLM called Qwen-2-72b-instruct from Alibaba Group Holding Ltd. achieved the highest score among the open-source models that Galileo tested. It answered medium-length prompts containing 5,000 to 25,000 tokens apiece with perfect accuracy. Galileo pointed out that Qwen-2-72b-instruct can process prompts with up to 128,000 tokens, a significantly larger context window than the other open-source LLMs the company evaluated support.
