A NICE PROOF OF WHAT MOST PEOPLE SEEM TO GET IN THEIR GUT: Apple’s study confirms that LLM-based AI models are flawed because they cannot actually reason.
The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models.
The group investigated the “fragility” of mathematical reasoning by adding contextual information to their queries that a human could easily parse but that should not affect the fundamental mathematics of the solution. The models’ answers varied anyway, which shouldn’t happen if they were genuinely reasoning.
“Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark,” the group wrote in their report. “Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases.”
The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. “There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer,” the study concluded.
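To make the two tricks concrete, here is a minimal Python sketch of what that kind of perturbation looks like: swap the numbers in a templated word problem, then append a true-sounding but irrelevant clause. The apple question, the make_variant helper, and the distractor text are hypothetical illustrations, not actual templates from Apple’s benchmark.

```python
import random

# A GSM-Symbolic-style question template (hypothetical, for illustration):
# the name and the numbers are variables, so every draw requires the exact
# same reasoning even though the surface wording changes.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

# A GSM-NoOp-style distractor: plausible-sounding, but it changes nothing
# about the arithmetic.
DISTRACTOR = "Note that {z} of the apples were slightly smaller than average."

def make_variant(with_distractor: bool = False) -> tuple[str, int]:
    """Return one question variant plus its correct answer."""
    name = random.choice(["Liam", "Sofia", "Noah"])
    x, y = random.randint(10, 50), random.randint(10, 50)
    question = TEMPLATE.format(name=name, x=x, y=y)
    if with_distractor:
        question += " " + DISTRACTOR.format(z=random.randint(2, 5))
    return question, x + y  # the distractor never changes the right answer

if __name__ == "__main__":
    q, answer = make_variant(with_distractor=True)
    print(q)
    print("Correct answer:", answer)
```

A system that truly reasons should score identically on every variant this produces; the study found the opposite.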
An earlier example: “The faulty logic was supported by a previous study from 2019 that could reliably confuse AI models by asking a question about the ages of two previous Super Bowl quarterbacks. Adding background and related information about the games they played in, and a third person who was quarterback in another bowl game, caused the models to produce incorrect answers.”
I had a similar experience last week while putting together Florida Man Friday. I asked ChatGPT to create an image that had to include the eye of a hurricane. On the second attempt, I left that request out. Because on the first attempt, I got a hurricane full of eyes.
Cute — but not intelligent.