HHEM | Flash Update: Anthropic Claude 3
See how often Anthropic’s new Claude 3 models hallucinate compared to other foundation models on the Hughes Hallucination Evaluation Model (HHEM) leaderboard
On March 4, 2024, Anthropic, a prominent competitor of OpenAI, unveiled the Claude 3 family of AI models. The family includes three models, Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus, each with its own positioning: Haiku is described as “light and fast,” Sonnet as “hard-working,” and Opus as “powerful.”
According to Anthropic’s published benchmarks, its “powerful” model, Claude 3 Opus, matches or surpasses OpenAI’s GPT-4, a model previously considered the pinnacle of AI technology.
Our team used the Hughes Hallucination Evaluation Model (HHEM) to assess how often the Claude 3 Opus and Sonnet models generate factually inconsistent summaries. We plan to extend this analysis to Claude 3 Haiku once it becomes available.
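As a rough illustration of how per-summary scores can be rolled up into a single leaderboard-style number, here is a minimal sketch. It assumes each summary has already been given a consistency score between 0 and 1 and that scores at or above 0.5 count as factually consistent. The function name, the threshold, and the example scores are illustrative assumptions, not the actual evaluation pipeline or real results.

```python
from typing import Sequence


def factual_consistency_rate(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Return the percentage of summaries scored at or above the threshold.

    `scores` are assumed to be per-summary consistency scores in [0, 1];
    the 0.5 default threshold is an assumption made for this sketch.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    consistent = sum(1 for score in scores if score >= threshold)
    return 100.0 * consistent / len(scores)


# Hypothetical scores for five summaries (made up for illustration only).
example_scores = [0.98, 0.91, 0.12, 0.75, 0.40]

rate = factual_consistency_rate(example_scores)
hallucination_rate = 100.0 - rate

print(f"Factual consistency rate: {rate:.1f}%")
print(f"Hallucination rate: {hallucination_rate:.1f}%")
```

With only a handful of summaries, a single borderline score can shift the rate noticeably, which is one reason small gaps between models deserve caution.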
On our updated leaderboard, both Claude 3 Opus and Claude 3 Sonnet show higher factual consistency rates than Google’s recently introduced Gemma model. Notably, although Claude 3 Opus is positioned as the more “powerful” model, it ranks slightly below Claude 3 Sonnet on our leaderboard. Because this ranking is based on a limited evaluation set, it should not be read as a definitive measure of which model hallucinates less.
The leaderboard as of March 6, 2024, is shown below, comparing the Claude 3 models with other foundation models:
The findings highlight notable improvements in performance when compared to Claude 2, shedding light on the overall progress within the industry – spanning both open and closed models – on our benchmark. Nevertheless, the claim that Claude 3 models surpass GPT-4 warrants a cautious examination, particularly with regard to factual consistency.
Unlike many recently released open-source models, Claude 3 models are not open source; they are accessed through the Anthropic API.