A recent study compared the June 2023 responses of GPT-4 and GPT-3.5 with those of their March 2023 versions and found significant deviations.
Three researchers from Stanford University and UC Berkeley published a report titled How Is ChatGPT’s Behavior Changing over Time? (arXiv | PDF). The findings demonstrate that the behavior of GPT-3.5 and GPT-4 “has varied significantly over a relatively short amount of time.” According to the study, GPT-4 has become worse at math, answering sensitive questions, and code generation, while improving marginally in visual reasoning.
Most notably, GPT-4’s accuracy at identifying prime numbers plummeted from 97.6% in the March 2023 version to just 2.4% in the June 2023 version.
Users have previously complained that GPT-4 has become worse and dumber, and, especially in the case of code generation, buggier. See the discussion on Hacker News. It’s also likely that OpenAI is prioritizing speed and cost-efficiency over the quality of responses.
When it comes to answering sensitive questions, the GPT-4 and GPT-3.5 models are certainly safer than before, but they now offer less of a rationale when they refuse. This in particular is something OpenAI has done intentionally, and it was to be expected given the growing concerns over safety issues. For example, it used to be easy to get around the guardrails and get ChatGPT to answer unsafe or controversial questions.
While certain basic math tasks are easy for humans, an LLM has to work through many intermediate steps to arrive at the same conclusion. The researchers tested this with a Chain-of-Thought approach, a prompting technique that asks the model to spell out its reasoning step by step and is well suited to reasoning-heavy tasks.
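To give a concrete sense of what such an evaluation might look like, here is a minimal sketch in Python. The prompt wording, the use of the pre-1.0 `openai.ChatCompletion` client, and the scoring helper are assumptions for illustration only, not the study’s actual harness; a simple trial-division check stands in for the ground truth the researchers scored against.

```python
import re
import openai  # pre-1.0 OpenAI Python client; reads OPENAI_API_KEY from the environment


def is_prime(n: int) -> bool:
    """Ground-truth check by trial division (fine for small test numbers)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def ask_prime_cot(n: int, model: str = "gpt-4") -> str:
    """Ask the model with a Chain-of-Thought style prompt and return its reply text."""
    prompt = f"Is {n} a prime number? Think step by step and then answer [Yes] or [No]."
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]


def is_correct(n: int, reply: str) -> bool:
    """Compare the model's final [Yes]/[No] verdict against the ground truth."""
    verdict = re.search(r"\[(Yes|No)\]", reply)
    model_says_prime = bool(verdict) and verdict.group(1) == "Yes"
    return model_says_prime == is_prime(n)


if __name__ == "__main__":
    n = 17077  # example number to test; verified prime by the trial-division check above
    reply = ask_prime_cot(n)
    print(reply)
    print("correct:", is_correct(n, reply))
```

Run over many such numbers, this kind of harness makes it easy to compare accuracy, and the length of the generated reasoning, across model snapshots.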
On average, GPT-4 generated just 3.8 characters per response on these reasoning-heavy tasks in June, a stark drop from the 821.2 characters it generated in March; GPT-3.5, by contrast, improved in this respect. For sensitive questions, GPT-4 answered fewer of them and GPT-3.5 slightly more than their March versions did. GPT-4 has also been made significantly more robust against jailbreaking queries (where the prompt describes a made-up scenario and asks the bot to act as an unfiltered bot in order to tiptoe around the safety measures). Even so, both models still gave more direct responses to jailbreaking-style prompts than to plain sensitive questions.