New research has exposed weaknesses in large language models such as ChatGPT, which struggle to correctly answer queries about Securities and Exchange Commission filings. The findings illustrate the difficulties companies face as they seek to integrate the technology into their operations, from customer service to research, particularly in regulated areas such as finance. Anand Kannappan, co-founder of Patronus AI, commented: "That level of performance is simply not acceptable. It has to be a lot higher to be put into operational use."
Researchers from startup Patronus AI found that large language models, including the one at the heart of ChatGPT, often struggle to answer questions drawn from Securities and Exchange Commission filings. Even when allowed to read almost an entire filing alongside the question, the best-performing configuration, OpenAI's GPT-4-Turbo, achieved only 79% accuracy on Patronus AI's new evaluation. In some cases, the models refused to answer or "hallucinated" facts and figures that do not appear in the filings. According to co-founder Anand Kannappan, that accuracy rate is unsatisfactory and must be substantially higher before the technology can be put to practical use.
The AI models' struggles underscore the challenges that major companies, especially in heavily regulated industries like finance, face in adopting the technology for customer service and research. When ChatGPT was released late last year, its ability to quickly parse financial narratives and extract key numbers was immediately seen as one of its most promising applications. Bloomberg LP has since built its own AI model for financial data, professors have studied whether ChatGPT can interpret financial headlines, and JPMorgan is developing an AI-powered automated investing tool, CNBC has reported. A McKinsey forecast estimated that generative AI could add hundreds of billions of dollars per year to the banking industry's profits.
Still, GPT's entry into the industry has been bumpy. When Microsoft launched Bing Chat, built on OpenAI's GPT, it demonstrated the chatbot quickly summarizing an earnings press release. Observers soon noticed that the numbers in Microsoft's example were wrong, and some were fabricated outright.
When it comes to incorporating LLMs into products, the Patronus AI co-founders note that because these models are nondeterministic, companies must test far more rigorously to ensure they stay on topic and deliver dependable results. Before founding the company, co-founders Anand Kannappan and Rebecca Qian worked on AI problems at Meta, Facebook's parent company, related to explaining how models reach their conclusions and making them more responsible. They have since raised seed funding from Lightspeed Venture Partners to build software that automates LLM testing, so businesses can be confident their AI will not return unexpected answers that are off topic or inaccurate. Today, Qian says, evaluation is largely manual; companies often call it "vibe checks."

Patronus AI assembled a set of more than 10,000 questions and answers drawn from SEC filings of large public companies. The dataset, called FinanceBench, includes not only the expected responses but also the exact location in each filing where the answer can be found. Some questions require light math, and answering them takes a degree of reasoning. The co-founders describe the dataset as a "minimum performance standard" for natural-language AI in the financial sector. Sample questions provided by Patronus AI include:

- Has CVS Health paid dividends to common shareholders in Q2 of FY2022?
- Did AMD report customer concentration in FY22?
- What is Coca Cola's FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.
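A question-and-answer set of this kind, where each entry carries an expected answer plus a pointer to the supporting evidence, can be sketched as follows. The field names and grading logic here are hypothetical illustrations, not Patronus AI's actual FinanceBench schema:

```python
# Hypothetical sketch of a FinanceBench-style record. Field names are
# illustrative assumptions, not Patronus AI's actual schema.
record = {
    "question": "Did AMD report customer concentration in FY22?",
    "expected_answer": "Yes",
    "source_filing": "AMD 10-K FY2022",            # which SEC filing the answer comes from
    "evidence_location": "Item 1A, Risk Factors",  # where in the filing the answer appears
}

def grade(model_answer: str, rec: dict) -> bool:
    """Mark a model response correct only if it matches the expected answer."""
    return model_answer.strip().lower() == rec["expected_answer"].lower()

print(grade("Yes", record))  # prints True
```

In practice, grading free-form model answers against a reference is far harder than this exact-match check; the sketch only shows why storing the expected answer and its evidence location together makes automated evaluation possible.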
Patronus AI tested four language models – OpenAI's GPT-4 and GPT-4-Turbo, Anthropic's Claude 2, and Meta's Llama 2 – on a subset of 150 of the questions. It tried several configurations and prompts, including an "Oracle" mode in which the OpenAI models were given the exact passage containing the answer. In other tests the models were told where the underlying SEC documents were stored, or given "long context," essentially the entire SEC filing alongside the question.

GPT-4-Turbo failed the "closed book" test, in which it had no access to the SEC source documents: it answered only 14 of the 150 questions correctly and gave an incorrect answer 88% of the time. When pointed to the exact text of the answer, however, GPT-4-Turbo answered correctly 85% of the time, though it still got 15% wrong. Llama 2 performed worst, giving the wrong answer as much as 70% of the time and the right answer only 19% of the time when given the underlying documents. Claude 2 did well with "long context," answering correctly 75% of the time, incorrectly 21% of the time, and failing to answer only 3% of the time. GPT-4-Turbo also fared well in that setting, answering 79% correctly and 17% incorrectly.

The co-founders were surprised by how poorly the models did, even when pointed to the exact answers, and said that even the best scores were not good enough. Still, they remain optimistic that language models like GPT can help people in the finance industry as the technology improves. OpenAI pointed to its usage guidelines, which require review by a qualified person, disclosure that AI is being used, and an understanding of its limitations. Meta and Anthropic did not comment.
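The closed-book figures above imply a three-way split: 14 correct answers out of 150 is roughly 9%, and with 88% incorrect, the small remainder corresponds to questions the model declined or failed to answer. A quick check of that arithmetic:

```python
# Reconstructing GPT-4-Turbo's "closed book" breakdown from the figures in
# the text: 14 correct answers out of 150 questions, 88% incorrect.
total = 150
correct = 14

correct_rate = correct / total                      # 14/150, roughly 9%
incorrect_rate = 0.88                               # stated in the evaluation
refused_rate = 1 - correct_rate - incorrect_rate    # remainder: non-answers

print(f"correct:   {correct_rate:.1%}")   # prints "correct:   9.3%"
print(f"incorrect: {incorrect_rate:.1%}")
print(f"refused:   {refused_rate:.1%}")
```

The residual refusal rate comes out to a few percent, consistent with the article's note that the models sometimes refused to answer rather than guessing.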