Large language models like the one behind ChatGPT frequently fail to answer questions derived from Securities and Exchange Commission (SEC) filings, researchers at Patronus AI have found. Even the best-performing configuration they tested, OpenAI's GPT-4-Turbo given nearly an entire filing alongside each question, was accurate only 79% of the time; otherwise, the models refused to answer or made up figures and facts that were not in the filings. The finding underscores the hurdles companies face, particularly in regulated fields such as financial services, as they try to fold the technology into products for customer service and research. Anand Kannappan, a co-founder of Patronus AI, called that performance rate "absolutely unacceptable" and said it must improve significantly before such models can be used automatically and at scale.
The use of AI models in regulated industries such as finance has been a key focus for many companies this year. AI that can reliably analyze financial data and narratives would be a major boon to the sector, and ChatGPT has been presented as a promising step in that direction since its launch in late 2022.
Nevertheless, the integration of GPT-style models into finance has had its hiccups. Earlier this year, Microsoft demonstrated Bing Chat supposedly summarizing an earnings press release with ease, but some of the numbers were wrong and others were simply made up. Still, the push continues: Bloomberg LP built its own AI model for financial data, academics have studied how well GPT parses financial headlines, and JPMorgan is developing an AI-powered automated investing tool. McKinsey has forecast that generative AI could save the banking industry trillions of dollars annually.
Part of the challenge of incorporating LLMs into real products, according to the Patronus AI co-founders, is that the models are not deterministic: they won't necessarily return the same output for the same input. That means companies must test extensively to make sure their AI bots operate correctly, stay on topic, and give reliable answers.

Founders Rebecca Qian and Kannappan met at Facebook parent Meta, where they worked on AI problems involving how models arrive at their answers and how to make them more responsible. They established Patronus AI, which has received seed funding from Lightspeed Venture Partners, to automate the testing of LLMs with software, so organizations can be confident their AI systems won't surprise or confuse customers or employees. Today, according to Qian, the evaluation process is almost entirely manual; one company characterized it as a "vibe check."

To create a benchmark for language AI in the financial industry, Patronus AI assembled FinanceBench, a collection of more than 10,000 question-and-answer pairs drawn from the SEC filings of major publicly traded companies. The dataset includes the correct answers as well as where in each filing to find them. Not every question can be answered directly from the text; some require light math and reasoning. For example, the dataset includes questions such as: “Has CVS Health paid dividends to common shareholders in Q2 of FY2022?”, “Did AMD report customer concentration in FY22?”, and “What is Coca Cola's FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.”
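The Coca-Cola question illustrates the kind of light arithmetic involved: COGS % margin is simply cost of goods sold divided by revenue. Here is a minimal sketch of that calculation in Python, using made-up placeholder figures rather than Coca-Cola's actual FY2021 line items:

```python
def cogs_pct_margin(cogs: float, revenue: float) -> float:
    """Cost of goods sold as a percentage of revenue."""
    if revenue == 0:
        raise ValueError("revenue must be nonzero")
    return cogs / revenue * 100

# Hypothetical income-statement line items, in $ millions
# (illustrative only, not Coca-Cola's reported FY2021 figures).
cogs = 15_000     # cost of goods sold
revenue = 38_000  # net operating revenues

print(f"COGS % margin: {cogs_pct_margin(cogs, revenue):.1f}%")  # -> 39.5%
```

A model answering this question correctly has to locate both line items in the filing and perform the division, which is exactly the reasoning step FinanceBench is designed to probe.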
Patronus AI tested four language models (OpenAI's GPT-4 and GPT-4-Turbo, Anthropic's Claude 2, and Meta's Llama 2) on a subset of 150 questions from FinanceBench. It experimented with different configurations and prompts, including an "Oracle" mode, in which the exact source text for the answer was included in the prompt, and a "long context" mode, in which an entire SEC filing was included alongside the question.
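To make the distinction between these settings concrete, here is a hypothetical sketch of how a test harness might assemble the prompt in each mode; the function and parameter names are illustrative assumptions, not Patronus AI's actual code:

```python
def build_prompt(question: str, mode: str,
                 source_text: str = "", full_filing: str = "") -> str:
    """Assemble the model prompt for one FinanceBench-style question.

    mode:
      "closed_book"  - question only, no SEC source material
      "oracle"       - question plus the exact passage containing the answer
      "long_context" - question plus (nearly) an entire SEC filing
    """
    if mode == "closed_book":
        return question
    if mode == "oracle":
        return f"Source:\n{source_text}\n\nQuestion: {question}"
    if mode == "long_context":
        return f"Filing:\n{full_filing}\n\nQuestion: {question}"
    raise ValueError(f"unknown mode: {mode}")
```

Closed book tests what the model has memorized, while the other two modes test how well it reads the material it is given.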
GPT-4-Turbo failed the startup's "closed book" test, in which it was given no access to any SEC source documents: it answered only 14 of the 150 questions correctly, about 9%, and answered 88% of them incorrectly. When given the exact relevant text in "Oracle" mode, however, GPT-4-Turbo answered correctly 85% of the time, though it still made mistakes on 15% of the questions.
The results from Llama 2, the open-source model developed by Meta, were particularly concerning: it answered 70% of the questions incorrectly and only 19% correctly. Claude 2 fared well in the "long context" setting, answering 75% of the questions correctly, 21% incorrectly, and failing to respond on only 3%. GPT-4-Turbo also performed well in this setting, answering 79% of the questions correctly and 17% incorrectly.
The Patronus AI co-founders said they were surprised by how poorly the models performed, and by how often they refused to answer even when the answer was in the supplied context. They do recognize the huge potential language models like GPT hold for the finance industry, but believe that, for now, humans still need to stay in the loop to support and guide the workflow. OpenAI, Meta, and Anthropic did not comment on the results of the study.