Comparing Meta, OpenAI, Anthropic, and Cohere AI Models on Their Ability to Fabricate Outputs

Researchers at Arthur AI compared AI models from Meta, OpenAI, Cohere, and Anthropic and found that some generate more false information, or "hallucinate," than others. Cohere's AI was the most prone to hallucination, while Meta's Llama 2 hallucinated more than GPT-4 and Claude 2. Overall, GPT-4 performed the most reliably: on math questions, it hallucinated between 33% and 50% less than its predecessor, GPT-3.5.

According to the report from Arthur AI, a machine learning monitoring platform, if the tech industry's top AI models had superlatives, Microsoft-backed OpenAI's GPT-4 would be best at math, Meta's Llama 2 would be most middle of the road, Anthropic's Claude 2 would be best at knowing its limits, and Cohere's AI would earn the titles of most hallucinations and most confident wrong answers. The research arrives at a time of heightened public concern about misinformation from AI systems, particularly ahead of the 2024 U.S. presidential election.

In one experiment, the researchers tested the models on combinatorial mathematics, U.S. presidents, and Moroccan political leaders. GPT-4 led on math questions, while Claude 2 performed best in the U.S. presidents category. On Moroccan politics, GPT-4 came in first, and Claude 2 and Llama 2 mostly declined to answer.

In another experiment, the researchers evaluated each model's propensity to hedge with warning phrases to avoid risk. GPT-4 showed a 50% relative increase in hedging compared to GPT-3.5, while Cohere's model did not hedge at all. Claude 2 proved the most reliable in terms of self-awareness, accurately gauging what it does and doesn't know.
Adam Wenchel, co-founder and CEO of Arthur, concluded that the key takeaway for users and businesses is to test the AI models on their exact workloads to understand how they perform in the real world.
