Prompt Engineering Is Mostly Cargo Cult Behaviour

Prompt engineering has become a discipline. There are courses, certifications, consultancies. The market was valued at over $2 billion in 2024. Job postings promise six-figure salaries for people who know the right incantations.

The trouble is, the research suggests the foundations are shakier than advertised.

The transferability problem

Prompt engineering tips are context- and model-bound. What works on GPT-4 may fail on Claude, and even within a single provider, model updates break prompts that worked fine last month.

Research from Sclar et al. found that several widely used open-source LLMs show performance differences of up to 76 accuracy points from subtle changes in prompt formatting. Not from changing the meaning of the prompt. From changing the formatting.

Adding a space at the end of a prompt can cause the model to change its answer.

The researchers also found that format performance only weakly correlates between models. Their conclusion is rather damning for the prompt engineering industry: the weak correlation “puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format.”

In other words, when someone tells you they’ve found the optimal prompt structure, they’ve found a prompt structure that worked for their model on their task at that moment. Whether it works for you is genuinely uncertain.
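
The practical response is to stop trusting transferred advice and measure sensitivity on your own model and task. Here’s a minimal sketch of that kind of check, assuming a placeholder `call_model` function you’d wire up to whichever provider you actually use; the function name and the specific formatting variants are illustrative, not taken from the Sclar et al. paper.

```python
# Sketch: probe how sensitive one task is to "meaningless" prompt formatting.
# `call_model` is a placeholder; replace it with a real call to your provider.

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider of choice")

def formatting_variants(question: str) -> dict[str, str]:
    """The same question under trivially different formatting choices."""
    return {
        "plain":          f"Question: {question}\nAnswer:",
        "trailing_space": f"Question: {question}\nAnswer: ",
        "double_newline": f"Question: {question}\n\nAnswer:",
        "no_colons":      f"Question {question}\nAnswer",
        "upper_labels":   f"QUESTION: {question}\nANSWER:",
    }

def sensitivity_report(questions: list[str]) -> None:
    """Print how many distinct answers each question gets across formats."""
    for question in questions:
        answers = {name: call_model(p).strip()
                   for name, p in formatting_variants(question).items()}
        distinct = len(set(answers.values()))
        print(f"{question!r}: {distinct} distinct answer(s) across {len(answers)} formats")
```

If the same question comes back with different answers across those variants, that prompt is fragile on that model, whatever any tip sheet claims.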

The “principled instructions” that aren’t

The Principled Instructions paper from MBZUAI attempted to bring some rigour to the field. The researchers tested 26 prompting principles across multiple models. Techniques like “I’m going to tip $xxx for a better solution!” and “You will be penalized” and “Ensure that your answer is unbiased.”

The findings were mixed, and that’s being generous. Effects varied wildly across models, results weren’t consistently reproducible, and many techniques that worked on one system failed on another.

The same pattern keeps repeating. Some techniques show improvement, but the effects are model-specific, and the advice gets shared as if it were universal.
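
If one of these principles looks tempting, the honest move is a before/after comparison on a small labelled sample from your own task. A rough sketch under the same assumptions as above: a placeholder `call_model` and a deliberately naive exact-match scorer, neither of which reflects the paper’s actual methodology.

```python
# Sketch: A/B-test a single "prompting principle" on your own labelled sample.
# `call_model` is again a placeholder for a real API call; exact-match scoring
# stands in for whatever metric your task actually needs.

PRINCIPLE = "I'm going to tip $200 for a better solution!"  # illustrative amount

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider of choice")

def accuracy(samples: list[tuple[str, str]], suffix: str = "") -> float:
    correct = 0
    for question, expected in samples:
        answer = call_model(f"{question}\n{suffix}".strip())
        correct += answer.strip().lower() == expected.strip().lower()
    return correct / len(samples)

def compare(samples: list[tuple[str, str]]) -> None:
    base = accuracy(samples)
    with_principle = accuracy(samples, suffix=PRINCIPLE)
    print(f"baseline {base:.0%} -> with principle {with_principle:.0%} "
          f"(delta {with_principle - base:+.0%})")
```

A delta that holds up on your model and your data is worth keeping. One that only appeared in someone else’s benchmark is exactly the kind of advice this post is about.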

Why the cargo cult persists

The term “cargo cult” comes from anthropological observations of Pacific Islanders who, after World War II, built replica airstrips and control towers hoping to attract the planes that had previously brought supplies. They had observed a correlation (airstrips → planes → cargo) without understanding the underlying mechanism.

Prompt engineering advice often works the same way. Someone discovers that a particular phrasing improved their results. They share it. Others try it and sometimes it works, sometimes it doesn’t. But because LLMs are probabilistic and variable by nature, people attribute successes to the technique and dismiss failures as doing it wrong.

The cargo cult persists because:

The effects are real, locally. Prompt changes genuinely do cause large differences, up to 76 accuracy points in some cases. People correctly observe that their prompt change helped. They incorrectly assume it will help every time. The research found that format performance only weakly correlates between models, so advice that succeeded for one person may fail for another.

The underlying system is genuinely opaque. Different instruction types have distinct geometries in the model’s representation space, with no unified “instruction-following” capability to appeal to. When adding a space changes the answer, debugging is guesswork.

There’s money in it. A multi-billion dollar industry has formed around presenting techniques as more universal and reliable than the evidence supports. The incentive is to sell certainty.

Some techniques do work, sometimes. The cargo cult isn’t pure superstition. Prompt engineering can improve results. But the advice gets sold as transferable expertise when it’s actually model-specific incantation.

The skill is real, but limited

I’m not arguing that prompt engineering is entirely useless. Clearly, how you phrase a request affects what you get back. The research confirms as much, repeatedly. The issue is the gap between what prompt engineering is and what it’s sold as.

What prompt engineering actually is: pattern-matching with a probabilistic system. Learning what kinds of inputs tend to produce useful outputs from a particular model at a particular time. Developing intuition through trial and error.

What it’s sold as: a transferable technical skill with best practices that work across models and time. Engineering.

The word “engineering” implies systematic, reproducible, principled work. When adding a space can swing your results by 76 points, you’re not engineering. You’re doing something closer to gardening, or perhaps negotiating with a system whose behaviour you can influence but not control.

That’s fine. It’s a useful skill. But let’s be honest about what it is.
