The Real Bottleneck in AI-Augmented Work

GenAI has shifted the bottleneck in professional knowledge work from producing text to verifying it. The scarce resource is no longer the ability to write – it is the judgment to evaluate. And this is neither easy nor straightforward.

A stylised report page with cracks and errors.
Image: AI-generated (Nano Banana 2), prompted and selected by the author.

For expert, high-level professional work, the bottleneck has long been production: it is difficult and slow to put high-quality thoughts on paper. With recent developments in Generative AI (GenAI), text generation has quickly become easy. The bottleneck has shifted – it is now about sifting through heaps of generated text and verifying what it says. This is where the new value lies.

I know this. We all know this. Activity reports, meeting notes, desk research – these are quickly generated and skimmed through, but seldom carefully verified. At best, when errors are found, your colleagues think you have been sloppy – that you did not put in the attention that was required. At worst, meaningful errors slip through. This is not a matter of individual carelessness or negligence. It happens to all of us. Catching errors is hard work.

From slop to "looks good"

The real problem is that generated text has moved from being obviously wrong to looking good.

A couple of years ago, AI slop was just that: slop. Merriam-Webster named it their 2025 Word of the Year – but by then, the word was already describing yesterday's problem. The improvements in state-of-the-art GenAI models have been nothing short of transformative. While I prefer doing my own, admittedly anecdotal, testing of models and find established benchmark results often lacking in real-world value, the trajectory has been unmistakable.

OpenAI's GDPval benchmark, published in late 2025, is one of the few that tries to measure what actually matters to (qualitative) knowledge workers. Unlike most benchmarks, which test abstract reasoning or factual recall, GDPval asks AI models to produce real work deliverables – financial analyses, consulting reports, legal reviews, presentations – and then has experienced professionals, averaging 14 years of domain experience, judge, without knowing the source, whether the AI output is better than, equal to, or worse than the work of a human expert. The current best-performing models are judged as good as or better than the human expert in roughly 83% of comparisons (ChatGPT 5.5-Thinking, see e.g. Mashable).

Sounds great, right? But skimming that output with the verdict "it looks good, so it must be correct" is fundamentally more dangerous than dealing with slop. When you work on interoperability and digital governance, as I do, an 83%-right document on designing an interoperable architecture, translating regulatory requirements into actionable governance measures, or simply summarising an expert call with partners or clients can lead to serious financial and reputational damage – and make you look like a fool.

Harvard Business Review called this dynamic "workslop" in a September 2025 article: content that appears polished but lacks real substance, effectively offloading the cognitive labour onto whoever has to work with it.

The counterarguments

"Just read it carefully yourself": True – but this will often take almost as much time and effort as writing the document from scratch. The text does not follow your thought process, and adding a single sentence to make a point in an otherwise seemingly well-written paragraph can take longer than rewriting the whole paragraph in your own words. Why? You are not improving your own reasoning; you are retrofitting substance into text that was generated, not thought through – and that is a fundamentally different task.

"Just let your junior colleague proofread it": Not a good idea. This is where the difference between junior and senior work becomes most visible. Properly assessing polished text that is light on substance requires the kind of domain and institutional knowledge that is difficult to put in words – it is your gut instinct built over years of practice, not something you can hand over in a briefing note. Juniors simply have not had the time to develop this layer of intuition, which is precisely what makes them juniors. Seniors will still need to validate. And validating AI output is often harder than validating junior work, because it does not carry the traces of a human thought process that you are practiced in following and correcting.

What helps – and what does not

You could try to put these seniors' gut feelings into words by adding them to system prompts, project instructions, or similar frameworks. You can approximate this to an extent, and many people around the world are trying to do it well, but human sense and intuition are not going to be mirrored or fully codified anytime soon. If ever.

You could ignore AI and keep working without it. Recent studies suggest that GenAI is not increasing productivity much across the corporate world – an MIT report from 2025 found that 95% of enterprise AI pilot programmes delivered no measurable return (see Fortune, 2025). But this is fundamentally the wrong response. GenAI is a multi-layered tool that requires training and curiosity to get the most out of it. It will take time and effort to skill up a workforce – and individuals – to use it well. But once they do, the efficiency gains are real, including for qualitative knowledge work.

What does help is tight prompting, i.e. being precise and specific about what you want, in what structure, and under which constraints. Context engineering helps further: designing the full information environment that an AI model works within, from reference documents to conversation history, so the model has the richest possible basis for its output. These are real skill areas and they make a measurable difference to output quality.
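To make tight prompting and context engineering concrete, here is a minimal sketch in Python. Everything in it – the PromptSpec fields, the build_prompt helper, the file name – is an illustrative assumption rather than an established API; the point is that task, structure, constraints, and reference material get assembled deliberately instead of typed ad hoc.

```python
# Illustrative sketch only: names and structure are assumptions,
# not an established library or the author's actual workflow.
from dataclasses import dataclass


@dataclass
class PromptSpec:
    task: str                    # what you want, stated precisely
    output_structure: list[str]  # the sections the answer must follow
    constraints: list[str]       # hard limits: length, tone, sourcing
    references: dict[str, str]   # name -> verbatim reference material


def build_prompt(spec: PromptSpec) -> str:
    """Render the spec into a single, tightly scoped prompt string."""
    parts = [f"Task: {spec.task}", "", "Answer using exactly these sections:"]
    parts += [f"{i}. {section}"
              for i, section in enumerate(spec.output_structure, 1)]
    parts += ["", "Constraints:"]
    parts += [f"- {constraint}" for constraint in spec.constraints]
    parts += ["", "Use only the reference material below. If it does not "
                  "cover a point, say so explicitly instead of guessing."]
    for name, text in spec.references.items():
        parts += ["", f"--- Reference: {name} ---", text]
    return "\n".join(parts)


prompt = build_prompt(PromptSpec(
    task="Summarise the interoperability requirements in the governance note.",
    output_structure=["Key requirements", "Open questions", "Risks"],
    constraints=["Maximum 300 words", "Cite the reference by section number"],
    references={"governance_note.md": "(verbatim text of the source note)"},
))
print(prompt)
```

The payoff is less the code than the discipline: a spec like this can be reviewed, reused, and versioned, which is where the "engineering" in context engineering begins.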

But ultimately, knowledge workers of all levels will have to keep developing their ability to spot GenAI errors – that is, noticing when coherent, seemingly well-written text does not actually make much sense, or presents well-argued but fundamentally flawed facts or reasoning (including outright hallucinations). As frustrating and uncomfortable as it may be, for now – and likely for the foreseeable future – we will all simply have to pay heightened attention to the output LLMs generate.
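Spotting errors stays a human task, but even simple tooling can direct that heightened attention. Below is a deliberately crude, heuristic sketch – the pattern and function names are my own assumptions, not a recommended method – that flags sentences carrying checkable specifics (numbers, percentages, attributions) so a reviewer verifies those first instead of skimming everything with equal confidence.

```python
import re

# Heuristic for "checkable specifics": digits, percentages, currency
# amounts, or attribution phrases. Purely illustrative, not exhaustive.
CHECKABLE = re.compile(
    r"\d|%|€|\$|\b(according to|reported|study|benchmark)\b", re.IGNORECASE
)


def flag_for_review(text: str) -> list[str]:
    """Return the sentences a human reviewer should verify first."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if CHECKABLE.search(s)]


draft = ("The models improved steadily across releases. "
         "GDPval reports roughly 83% parity with human experts. "
         "According to the vendor, costs fell by 40% in 2025.")
for sentence in flag_for_review(draft):
    print("VERIFY:", sentence)
```

A filter like this decides nothing; it only makes the reviewing pass systematic rather than dependent on whatever happens to catch your eye.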

Judgment as the bottleneck

The bottleneck has thus moved from production to verification – from generating text to governing what it says. I do not have a perfect answer for how to do this well. I am not sure anyone does. But the direction is clear.

The scarce resource in AI-augmented knowledge work is no longer the ability to write. It is the judgment to evaluate what has been written for you – judgment that is institutional, contextual, and often instinctive. This is not something you can automate, outsource, or skip. It is what makes expert work expert work. And success in the AI era will be determined by the rigour we bring to exercising that judgment – for which we have yet to build the habits, the processes, and the awareness, both individually and institutionally.


Dr Gert Hilgers writes about AI, digital governance, and geopolitics at riskymatters.com and consults through Interveo and AIffect.