Claude 3.5 Sonnet comes out on top in Galileo’s Hallucination Index

The AI company Galileo has just introduced its latest Hallucination Index, a framework that evaluates 22 leading generative AI models.

Models are tested using a metric called context adherence, which measures “closed-domain hallucinations: cases where your model said things that weren’t provided in the context.”

The best performing model overall for retrieval-augmented generation (RAG), according to the ranking, is Claude 3.5 Sonnet from Anthropic. Galileo said that this model and Anthropic’s other model, Claude 3 Opus, had near-perfect scores, beating out OpenAI’s models, which won last year.

From a cost perspective, the best performing model was Google’s Gemini 1.5 Flash. Alibaba’s Qwen2-72B-Instruct was the best performing open source model overall, although in short context RAG tests, Meta’s llama-3-70b-instruct was the best.

Broken down by context length, the best closed-source model in short context RAG was Claude 3.5 Sonnet; in medium context RAG it was Google’s Gemini-1.5-flash-001 (with cost as the tiebreaker against other models that also achieved a perfect score); and in long context RAG it was again Claude 3.5 Sonnet.

“In today’s rapidly evolving AI landscape, developers and enterprises face a critical challenge: how to harness the power of generative AI while balancing cost, accuracy, and reliability. Current benchmarks are often based on academic use-cases, rather than real-world applications. Our new Index seeks to address this by testing models in real-world use cases that require the LLMs to retrieve data, a common practice in enterprise AI implementations,” says Vikram Chatterji, CEO and co-founder of Galileo. “As hallucinations continue to be a major hurdle, our goal wasn’t to just rank models, but rather give AI teams and leaders the real-world data they need to adopt the right model, for the right task, at the right price.”
