Harness Engineering: Humans Steer, Agents Execute
Why the discipline of engineering is shifting from writing functions to building the scaffolding that controls the AI.

No matter how powerful or fast your horse is, imagine riding it without a harness.
Can you?
Without reins, you can’t steer it. You can’t slow it down. You can’t guide it through a narrow path. Power alone doesn’t make it useful. But add a harness, and suddenly that same horse becomes controllable. Directional. Reliable.
That’s exactly what OpenAI experimented with and wrote about in their recent post on Harness Engineering.
They’ve spent the last five months running an experiment that feels like a glimpse into the next decade of software engineering: building a product with zero lines of manually written code. Everything, from the application logic, tests, and CI configuration to the observability setup and even the documentation, was authored by Codex agents. The result was a million-line codebase built in roughly one-tenth the time it would have taken a traditional human team.
Impressive? Yes.
But the real story isn’t the speed. It’s the harness.
Humans Steer. Agents Execute.
The core philosophy they landed on is simple: humans steer, agents execute.
In this world, the engineer’s primary job isn’t typing functions. It’s designing the environment, specifying intent clearly, and building the feedback loops that keep those agents on track.
Codex is the horse: powerful, fast, capable of running for hours.
The harness is everything around it: the repository structure, strict architectural boundaries, mechanical enforcement via linters, observability wiring, structured documentation, and review loops.
Without that harness, you don’t get a product.
You get chaos at scale.
The Repository as the System of Record
One of the most interesting shifts they described was moving away from “manuals” toward “maps.”
They tried the giant instruction-file approach. It failed. Large instruction blobs rot quickly. They crowd out context. They become stale. When everything is important, nothing is.
Instead, they treated AGENTS.md as a lightweight table of contents, pointing to a structured docs/ directory that acted as the real system of record.
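The post doesn’t reproduce the file itself, but the “map, not manual” idea can be sketched as a short table of contents; every path and topic below is hypothetical:

```markdown
# AGENTS.md — a map, not a manual

Read only the doc that matches your task; don't load everything.

- Architecture and layering rules: docs/architecture.md
- Running tests and linters: docs/testing.md
- Observability (logs, metrics, traces): docs/observability.md
- Decision records: docs/decisions/
```

The point is that the entry file stays small and stable, while the detailed, versioned knowledge lives in docs/ where it can be updated alongside the code.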
That’s a subtle but powerful change.
In many enterprises, critical decisions live in Slack threads, Google Docs, hallway conversations or someone’s memory. In an agent-first system, if knowledge isn’t versioned inside the repository, it effectively doesn’t exist.
That forces clarity. It also forces discipline.
Making the System Legible to the Agent
For an agent to be effective, the codebase has to be legible to it.
OpenAI went further than just structuring documentation. They wired Codex directly into an ephemeral observability stack. Agents could query logs using LogQL and metrics using PromQL. They could inspect traces, reproduce bugs, and validate UI flows automatically.
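The post names the query languages but not the queries. As a flavor of what an agent might run, here are one LogQL and one PromQL query; the job label `checkout` and the log field `msg` are made up for illustration:

```
# LogQL: recent error lines for a hypothetical "checkout" service
{job="checkout"} |= "error" | json | line_format "{{.msg}}"

# PromQL: p95 request latency for the same service over 5 minutes
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{job="checkout"}[5m])) by (le))
```

An agent that can issue queries like these can confirm for itself whether a fix actually reduced errors, instead of asking a human to look.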
Think about that.
Instead of a developer manually checking logs and stepping through failures, the agent could “see” what was happening and iterate.
That’s not just code generation. That’s operational awareness.
The horse wasn’t just running. It had vision.
Enforcing Taste and Architecture
In my 25+ years in tech, I’ve seen how quickly architectural purity falls apart under delivery pressure. In this experiment, they didn’t treat architecture as aspirational. They made it mandatory. They enforced a rigid layering model:
Types → Config → Repo → Service → Runtime → UI
Dependencies could only flow in approved directions. Custom linters enforced boundaries mechanically. Structural tests ensured drift didn’t creep in.
Instead of arguing over style in code reviews, they encoded “human taste” directly into tooling.
And when the agent produced what they openly called “AI slop,” they didn’t just clean it up manually. They built recurring cleanup processes: background agents that scan the repository, update quality grades, and open targeted refactoring pull requests.
It’s essentially garbage collection for a codebase.
Human judgment gets encoded once, then enforced continuously.
That’s the harness tightening itself.
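The post gives no implementation details for these cleanup agents. Purely to illustrate the scan-grade-refactor loop, here is a toy scanner; the grading heuristic and the two-letter grade scale are invented:

```python
from pathlib import Path

def grade_file(text: str) -> str:
    """Invented heuristic: very long files or files with TODOs get a low grade."""
    flagged = len(text.splitlines()) > 500 or "TODO" in text
    return "C" if flagged else "A"

def scan(repo: Path) -> dict[str, str]:
    """Grade every Python file under `repo`. A real background agent would
    open targeted refactoring pull requests for the low-graded files;
    this sketch only reports them."""
    return {str(f): grade_file(f.read_text()) for f in sorted(repo.rglob("*.py"))}
```

The interesting part isn’t the heuristic, it’s the loop: grades are recomputed continuously, so quality debt surfaces as a work queue instead of a surprise.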
Will This Work for the Rest of Us?
This isn’t magic.
It worked because they invested heavily in scaffolding. They designed the harness first.
Where this model will likely shine:
Greenfield products
Teams with strong platform discipline
Clear domain boundaries
Organizations willing to encode governance into tooling
Where it will struggle:
Legacy systems full of tribal knowledge
Fragmented architectures
Weak repository hygiene
Environments where “the real decision” lives outside version control
Most enterprises aren’t ready to just “let the horse run.”
They need to build the harness first.
The Optimization Target Has Shifted
For decades, we optimized for better engineers writing better code.
Now the optimization target is different.
We are optimizing for better repository designers. Better constraint designers. Better feedback-loop architects. As OpenAI put it, the discipline of software engineering is moving out of the code and into the scaffolding.
The code becomes a byproduct.
The environment becomes the product.
If we can build a better harness, we can stop fighting the horse and start deciding where we want to go.
And that might be the real evolution here.
-Suren
