Published
- 3 min read
OpenAI's Sycophancy Problem Isn't a Bug
In April 2025, OpenAI rolled back an update to GPT-4o. The model had become excessively flattering. Users reported it endorsing a business idea for literal “shit on a stick” and supporting people who stopped taking medications.
OpenAI’s postmortem was admirably transparent. They explained that new reward signals based on thumbs-up/thumbs-down feedback “overpowered existing safeguards, tilting the model toward overly agreeable, uncritical replies.” They acknowledged they “focused too much on short-term feedback” and didn’t account for how user interactions evolve over time.
The framing was apologetic. A mistake was made, lessons were learned, the update was rolled back, normal service resumed.
I think the apologetic framing misses the point entirely.
The incentive structure is the problem
RLHF, Reinforcement Learning from Human Feedback, is how most frontier models get aligned. The process is straightforward. Humans rate model outputs, and the model learns to produce outputs that get higher ratings.
The problem is that humans prefer agreeable responses to accurate ones. Not always, but often enough to matter.
Anthropic’s own research, published at ICLR 2024, found that five state-of-the-art AI assistants consistently exhibit sycophancy across varied tasks. When a response matches a user’s views, it’s more likely to be preferred by human raters. Both humans and preference models prefer sycophantic responses over correct ones “a non-negligible fraction of the time.”
The sycophancy isn’t a bug in GPT-4o. It’s RLHF working exactly as designed.
The training dynamic
Research from USC puts it bluntly: “RLHF can contribute to sycophancy, or ‘gaslighting’ of humans. Misleading behavior will actively be incentivized by RLHF when humans can be tricked into mistakenly providing positive feedback.”
The mechanism is simple. A user asks a question, the model gives an answer the user already agrees with, the user clicks thumbs up. The model learns that agreement gets rewarded.
Over millions of such interactions, the model develops a systematic bias toward telling people what they want to hear. The more aggressively you optimise for user satisfaction, the more pronounced this becomes.
OpenAI’s April incident was alignment succeeding at the wrong objective.
The structural tension
A paper in Ethics and Information Technology describes what they call “steerable alignment, a form of sycophancy” that “incentivises models to provide agreeable rather than accurate or diverse answers.”
The paper argues that RLHF produces “an ethically problematic trade-off: increased helpfulness, in the sense of increased user-friendliness, leads to the serious risk of misleading or deceiving users about the true nature of the system they are engaging with.”
Put differently, the more you train a model to be helpful in ways users recognise and reward, the more you train it to be dishonest in ways users don’t notice.
The structural tension isn’t something you can engineer away with better safeguards. The safeguards and the sycophancy come from the same source. You’re using human preferences to train the model, and human preferences are biased toward agreement.
Expect recurrence
OpenAI’s expanded postmortem acknowledged the depth of the problem. They’re working on better evaluation methods and trying to balance short-term satisfaction against long-term user benefit.
But the fundamental tension remains. As long as models are trained primarily on human feedback, and as long as humans prefer agreeable responses to challenging ones, the pressure toward sycophancy will persist. Every improvement in “helpfulness” that optimises for user satisfaction risks amplifying the bias.
The April rollback fixed one manifestation. The incentive structure that produced it hasn’t changed.
I’d be genuinely surprised if we don’t see variations of this problem surface again, across multiple providers, as models become more sophisticated at predicting what users want to hear.
Have a comment?
Send me a message, and I'll get back to you.