See if you can solve this maths problem: Oliver picks 44 kiwis on Friday. He picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them are a bit smaller than average. How many kiwis does Oliver have?
If you answered "190," congratulations. You did as well as the average primary schoolchild by getting it right: Friday's 44 plus Saturday's 58 plus Sunday's 44 multiplied by two, or 88, equals 190.
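The arithmetic, with the irrelevant detail about kiwi size set aside, can be sketched in a few lines:

```python
# The kiwi problem: the five smaller-than-average kiwis are a
# distractor and do not change the count.
friday = 44
saturday = 58
sunday = 2 * friday  # double Friday's pick: 88

total = friday + saturday + sunday
print(total)  # 190
```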
You also did better than more than 20 state-of-the-art artificial intelligence models tested by an AI research team at Apple. The AI bots consistently got it wrong.
The Apple team found "catastrophic performance drops" by those models when they tried to parse simple mathematical problems written in essay form.
In this example, the systems tasked with the question often did not understand that the size of the kiwis has nothing to do with the number of kiwis Oliver has. Some, consequently, subtracted the five undersized kiwis from the total and answered "185."
Schoolchildren, the researchers posited, are much better at detecting the difference between relevant information and inconsequential curveballs.
The Apple finding has been published in a technical paper that has attracted widespread attention in AI labs, not only because the results are well-documented, but also because the researchers work for a leading high-tech consumer company - and one that has just rolled out a suite of purported AI features for iPhone users.
"The fact that Apple did this has gotten a lot of attention, but nobody should be surprised at the results," said Gary Marcus, a critic of how AI systems have been marketed as reliably, well, "intelligent."
Indeed, Apple's conclusion matches earlier studies that have found that large language models, or LLMs, do not actually "think" so much as match language patterns in materials they have been fed as part of their "training."
When it comes to abstract reasoning - "a key aspect of human intelligence," in the words of Melanie Mitchell, an expert in cognition and intelligence at the Santa Fe Institute - the models fall short.
"Even very young children are adept at learning abstract rules from just a few examples," Mitchell and colleagues wrote after subjecting bots to analogy puzzles.
Their conclusion was that "a large gap in basic abstract reasoning still remains between humans and AI systems."
The Apple researchers set out to answer the question: "Do these models truly understand mathematical concepts?" said one of the lead authors, Mehrdad Farajtabar. Their answer is no.
They also pondered whether the shortcomings they identified can be easily fixed, and their answer is also no.
"Can scaling data, models, or compute fundamentally solve this?" Farajtabar asked. "We don't think so!"
The potential for damagingly inaccurate outputs is heightened by AI bots' natural language capabilities, with which they offer even absurdly inaccurate answers with convincingly cocksure elan. Often they double down on their errors when challenged.
These errors are typically described by AI researchers as "hallucinations." The term may make the mistakes seem almost innocuous, but in some applications, even a minuscule error rate can have severe ramifications.
That is what academic researchers concluded in a recently published analysis of Whisper, an AI-powered speech-to-text tool developed by OpenAI, which can be used to transcribe medical discussions or jailhouse conversations monitored by corrections officials.
The researchers found that about 1.4 percent of Whisper-transcribed audio segments in their sample contained hallucinations, including the addition to transcribed conversation of wholly fabricated statements including portrayals of "physical violence or death ... [or] sexual innuendo," and demographic stereotyping.
That may sound like a minor flaw, but the researchers observed that the errors could be incorporated in official records such as transcriptions of court testimony or prison phone calls - which could lead to official decisions based on "phrases or claims that a defendant never said."
Updates to Whisper in late 2023 improved its performance, the researchers said, but the updated version "still regularly and reproducibly hallucinated."
That has not deterred AI promoters from unwarranted boasting about their products. In an October 29 tweet, Elon Musk invited followers to submit "x-ray, PET, MRI or other medical images to Grok [the AI application for his X social media platform] for analysis." Grok, he wrote, "is already quite accurate and will become extremely good."
It should go without saying that, even if Musk is telling the truth, any system used by health-care providers to analyze medical images needs to be a lot better than "extremely good."
That brings us to the Apple study. It is proper to note that the researchers are not critics of AI as such but believe that its limitations need to be understood.
The team plied its subject AI models with questions drawn from a popular collection of more than 8,000 primary school maths problems testing children's understanding of addition, subtraction, multiplication and division.
When the problems incorporated clauses that might seem relevant but were not, the models' performance plummeted.
That was true of all the models, including versions of the GPT bots developed by OpenAI, Meta's Llama, Microsoft's Phi-3, Google's Gemma and several models developed by the French lab Mistral AI.
Some did better than others, but all showed a decline in performance as the problems became more complex.
One problem involved a basket of school supplies including erasers, notebooks and writing paper. Solving it requires multiplying the number of each item by its price and adding the results together to determine how much the entire basket costs.
When the bots were also told that "due to inflation, prices were 10 percent cheaper last year," the bots reduced the cost by 10 percent.
That produces a wrong answer, since the question asked what the basket would cost now, not last year.
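The article does not give the actual quantities and prices from the study, but the trap can be sketched with hypothetical figures (in cents, to keep the arithmetic exact). The inflation clause describes last year and is irrelevant to this year's total:

```python
# Hypothetical quantities and prices in cents; the actual figures
# from the test problem are not given in the article.
basket = [
    ("erasers", 8, 25),         # 8 erasers at 25 cents each
    ("notebooks", 4, 150),      # 4 notebooks at $1.50 each
    ("writing paper", 2, 200),  # 2 pads at $2.00 each
]

# Correct answer: current quantities times current prices.
total_now = sum(qty * price for _, qty, price in basket)  # 1200 cents

# The models' error: applying the irrelevant "10 percent cheaper
# last year" clause to today's total.
wrong = total_now * 90 // 100  # 1080 cents

print(total_now / 100)  # 12.0
print(wrong / 100)      # 10.8
```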
Why did this happen? The answer is that LLMs are developed, or trained, by feeding them huge quantities of written material scraped from published works or the internet - not by trying to teach them mathematical principles.
LLMs function by gleaning patterns in the data and trying to match a pattern to the question at hand.
But they become "overfitted to their training data," Farajtabar explained.
"They memorized what is out there on the web and do pattern matching and answer according to the examples they have seen. It is still a [weak] type of reasoning but according to other definitions it's not a genuine reasoning capability."
That is likely to impose boundaries on what AI can be used for.
In mission-critical applications, humans will almost always have to be "in the loop," as AI developers say - vetting answers for obvious or dangerous inaccuracies or providing guidance to keep the bots from misinterpreting their data, misstating what they know, or filling gaps in their knowledge with fabrications.
To some extent, that is comforting, for it means that AI systems cannot accomplish much without having human partners at hand.
"These systems are always going to make mistakes because hallucinations are inherent," Marcus said. "The ways in which they approach reasoning are an approximation and not the real thing. And none of this is going away until we have some new technology."
Los Angeles Times (TNS)
The performance of artificial intelligence models trying to parse mathematical problems written in essay form – problems that average primary schoolchildren get right – drops catastrophically, an Apple research team has discovered.