8 Levels of Agent Engineering
Leveling up each stage signifies a huge leap in productivity, and every improvement in model capability further amplifies these gains.
Author: Bassim Eledath
Translation: Baoyu
AI’s coding ability is outpacing our ability to harness it. That’s why hard-won gains on SWE-bench don’t necessarily translate into the real productivity metrics engineering leadership cares about. The Anthropic team launched Cowork in just 10 days, while another team using the same model couldn’t even get a proof of concept working. The difference: one team has bridged the gap between capability and practice; the other hasn’t.
This gap won’t disappear overnight; it narrows gradually, one level at a time. There are eight levels in all. Most readers of this article are probably past the first few, but you should be eager to reach the next one, because each level-up represents a huge jump in productivity, and each improvement in model capability further magnifies these gains.
Another reason to care is team dynamics. Your output depends on your teammates’ levels more than you might think. Suppose you’re a Level 7 expert with background agents pushing out several PRs overnight. If your repository requires a colleague’s approval to merge, and that colleague is still at Level 2, manually reviewing every PR, your throughput is bottlenecked by theirs. Helping teammates level up benefits you too.
Based on conversations with many teams and individuals about their AI-assisted programming practices, here’s the observed path of level progression (the order isn’t strictly fixed):
The 8 Levels of Agent Engineering
Levels 1 & 2: Tab Completion & AI IDEs
I’ll briefly cover these two levels mainly for completeness. Feel free to skim.
Tab completion is the starting point. GitHub Copilot kicked off this movement — press Tab, and code is auto-completed. Many might have forgotten this stage, and newcomers might skip it altogether. It’s more suited for experienced developers who can set up the skeleton of code first, then let AI fill in the details.
Dedicated AI IDEs like Cursor changed the game by connecting chat and codebases, making cross-file editing much easier. But the ceiling is always the context. Models can only help with what they see, and frustratingly, they either miss the right context or see too much irrelevant info.
Most at this level are also experimenting with plan-based modes: transforming a rough idea into a structured, step-by-step plan for the LLM, iterating on it, then executing. This works well at this stage and is a reasonable way to maintain control. But as we’ll see in higher levels, reliance on plan mode diminishes over time.
Level 3: Context Engineering
Now we’re getting to the interesting part. Context engineering became a buzzword in 2025 because models could finally follow a reasonable number of instructions reliably, given just the right context. Noisy context and insufficient context are equally bad, so the core work is increasing the information density per token. “Every token must fight for its place in the prompt” was the mantra.
The same info, fewer tokens — information density is king (Source: humanlayer/12-factor-agents)
In practice, context engineering covers more than most realize. It includes your system prompts and rules files (.cursorrules, CLAUDE.md). It involves how you describe tools, since models read these descriptions to decide which tools to call. It includes managing dialogue history to prevent long-running agents from losing track after many turns. It also involves deciding which tools to expose each turn, since too many options can overwhelm the model — just like humans.
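As a concrete illustration of one of these pieces, here is a minimal sketch of dialogue-history management: drop the oldest turns until the conversation fits a token budget, while always keeping the system prompt. The message shape and the four-characters-per-token estimate are illustrative assumptions, not any particular API’s contract.

```python
# Sketch: trim dialogue history to a token budget, preserving the system prompt.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the first (system) message; drop oldest turns until under budget."""
    system, turns = messages[0], messages[1:]
    while turns and (
        estimate_tokens(system["content"])
        + sum(estimate_tokens(m["content"]) for m in turns)
    ) > budget:
        turns.pop(0)  # the oldest turn is the first to go
    return [system] + turns

history = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Refactor the parser module." * 50},
    {"role": "assistant", "content": "Done. Tests pass." * 50},
    {"role": "user", "content": "Now update the docs."},
]
trimmed = trim_history(history, budget=100)
```

Real agents use smarter strategies (summarizing dropped turns rather than deleting them), but the budget-driven shape of the loop is the same.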
Today, the term “context engineering” is heard less often. The focus has shifted toward models that tolerate noisier contexts and can still reason in messier scenarios (larger context windows help too). But context consumption remains critical, and in plenty of scenarios it is still the bottleneck.
More broadly, context engineering isn’t disappearing — it’s evolving. The focus has shifted from filtering out bad context to ensuring the right context appears at the right time. This shift paves the way for Level 4.
Level 4: Compound Engineering
Context engineering improves the current session. Compound Engineering (coined by Kieran Klaassen) improves subsequent sessions. This realization was a turning point for me and many others — it made us see that “programming by feel” is more than just prototyping.
It’s a cycle of “plan, delegate, evaluate, consolidate.” You plan tasks, provide enough context for the LLM to succeed, delegate, evaluate the output, and crucially, consolidate what you’ve learned: what works, what went wrong, what patterns to follow next time.
The cycle: plan, delegate, evaluate, consolidate — each iteration improves the next
The magic is in “consolidation.” LLMs are stateless: if a model reintroduces a dependency you explicitly removed yesterday, it will do so again tomorrow, unless you tell it not to. The most common fix is updating your CLAUDE.md (or equivalent rules file) to embed lessons learned into future sessions. But beware: stuffing too many instructions into your rules can backfire (instructions overloaded are instructions ignored). A better approach is creating an environment where the LLM can discover useful context on its own, for example by maintaining an up-to-date docs/ folder (covered in Level 7).
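The consolidation step can be as simple as a script that persists a lesson into the rules file without duplicating it. A hypothetical sketch, where the file name and bullet format are illustrative choices:

```python
# Sketch: append a lesson learned this session to a rules file (e.g. CLAUDE.md)
# so future sessions start with it, skipping lessons already recorded.
from pathlib import Path
import tempfile

def consolidate(rules_file: Path, lesson: str) -> bool:
    """Append a lesson to the rules file unless it is already recorded."""
    existing = rules_file.read_text() if rules_file.exists() else ""
    if lesson in existing:
        return False  # skip duplicates so the rules file doesn't bloat
    with rules_file.open("a") as f:
        f.write(f"- {lesson}\n")
    return True

rules = Path(tempfile.mkdtemp()) / "CLAUDE.md"
first = consolidate(rules, "Never reintroduce the removed `requests` dependency.")
second = consolidate(rules, "Never reintroduce the removed `requests` dependency.")
```

The duplicate check is the point: consolidation should compound knowledge, not pile up redundant instructions.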
Practitioners of compound engineering are highly sensitive to the context fed into the LLM. When errors occur, their instinct is to check if the context is missing something, rather than blaming the model. This intuition makes Levels 5–8 possible.
Level 5: MCP & Skills
Levels 3 & 4 address context issues. Level 5 tackles capability. MCPs and custom skills enable your LLM to access databases, APIs, CI pipelines, design systems, and tools like Playwright for browser testing or Slack for notifications. The model no longer just thinks about your codebase — it can act directly.
There’s plenty of good material on MCPs and skills, so I won’t repeat what they are. But here are some examples I’ve used: our team shares a PR review skill and iterates on it together (still ongoing). It conditionally spawns sub-agents based on the PR type: one checks database security, another analyzes complexity to catch redundancy or overengineering, another monitors prompt health to keep prompts up to standard. It also runs linters such as Ruff.
Why invest so much in review skills? Because when agents start producing PRs en masse, manual review becomes a bottleneck, not a quality gate. Latent Space makes a compelling case: traditional code review is dead. Automated, consistent, skill-driven review is the future.
For MCPs, I use Braintrust MCP to query logs and make direct edits. I use DeepWiki MCP to access docs from any open-source repo without manually pulling in docs.
When multiple team members develop their own skills, it’s worth consolidating them into a shared registry. Block, for example, built an internal skills marketplace with over 100 skills, curated for specific roles and teams. Skills get the same treatment as code: pull requests, reviews, version history.
Another trend: LLMs increasingly use CLI tools instead of MCPs (and every company seems to be releasing its own: Google Workspace CLI, Braintrust’s upcoming CLI). The reason is token efficiency. MCP servers inject full tool definitions into context every turn, whether used or not. CLI, on the other hand, runs targeted commands, with only relevant output entering the context window. I often prefer agent-browser over Playwright MCP for this reason.
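The token math behind this preference is easy to sketch. The numbers below are made up for illustration; the point is that MCP tool schemas are paid for on every turn, while CLI output is only paid for when a command actually runs:

```python
# Back-of-the-envelope comparison of context cost: MCP vs CLI.
# All figures are invented for illustration.

def mcp_cost(turns: int, schema_tokens: int, calls: int, output_tokens: int) -> int:
    # Tool definitions are injected into context every turn, used or not,
    # plus the output of each actual tool call.
    return turns * schema_tokens + calls * output_tokens

def cli_cost(calls: int, output_tokens: int) -> int:
    # Only the output of commands actually run enters the context window.
    return calls * output_tokens

# A 30-turn session with 2,000 tokens of tool schemas and 5 real tool calls:
session_mcp = mcp_cost(turns=30, schema_tokens=2000, calls=5, output_tokens=300)
session_cli = cli_cost(calls=5, output_tokens=300)
```

Under these made-up numbers the MCP session spends roughly 40x more context on tooling, which is the intuition behind the shift toward CLIs.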
Pause here. Levels 3–5 are the foundation for everything that follows. LLMs are surprisingly good at some tasks and surprisingly poor at others. Developing intuition about these boundaries is essential before stacking more automation. If your context is noisy, prompts are insufficient or inaccurate, or tool descriptions are vague, then Levels 6–8 will only magnify these issues.
Level 6: Harness Engineering
The rocket really takes off here.
While context engineering focuses on what the model sees, Harness Engineering (a term I borrow) is about building the entire environment — tools, infrastructure, feedback loops — so that the agent can work reliably without your intervention. It’s not just an editor; it’s a complete feedback system.
OpenAI’s Codex toolchain — a full observability system enabling agents to query, relate, and reason about their outputs (Source: OpenAI)
OpenAI’s Codex team integrated Chrome DevTools, observability tools, and browser navigation into the runtime, allowing the agent to screenshot, drive UI flows, query logs, and verify fixes. Give a prompt, and the agent can reproduce bugs, record videos, and implement fixes. It then tests by manipulating the app, submits PRs, responds to reviews, and merges — only involving humans when necessary. The agent isn’t just coding; it sees what the code does and iterates, just like a human.
My team built a voice/chat agent for troubleshooting tech issues, with a CLI called converse that lets any LLM chat with our backend. The LLM modifies code, tests it against the live system, and iterates. Sometimes this loop runs for hours. When the result is verifiable (conversations must follow this flow; this case must invoke that tool, such as transferring to human support), it’s powerful.
The core concept supporting all this is backpressure — automated feedback mechanisms (type systems, tests, linters, pre-commit hooks) that let the agent detect and fix errors without human intervention. If you want autonomy, you need backpressure; otherwise, you get a garbage factory. This also extends to security: Vercel’s CTO points out that agents, generated code, and your keys should be in different trust domains, because a prompt injection in logs could trick the agent into stealing credentials — sharing the same security context is dangerous. Security boundaries are backpressure: they constrain what the agent can do if it goes out of control, not just what it should do.
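A backpressure loop can be sketched in a few lines: the agent’s change is accepted only when automated checks pass, and the failure output becomes context for the next attempt. `agent_edit` and `run_checks` below are stand-ins for a real model call and real tooling, not any framework’s API:

```python
# Sketch: accept an agent's patch only after automated checks pass,
# feeding failure output back as context for the next attempt.

def with_backpressure(agent_edit, run_checks, max_attempts: int = 3):
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        patch = agent_edit(feedback)     # agent proposes a change
        ok, output = run_checks(patch)   # tests/linters push back
        if ok:
            return patch, attempt
        feedback = output                # errors become the next context
    raise RuntimeError("checks never passed; escalate to a human")

# Toy stand-ins: the "agent" fixes its patch once it sees the lint failure.
def fake_agent(feedback):
    return "import json" if "unused import" in feedback else "import os, json"

def fake_checks(patch):
    if "os" in patch:
        return False, "lint: unused import 'os'"
    return True, ""

patch, attempts = with_backpressure(fake_agent, fake_checks)
```

The structure is the whole idea: without the `run_checks` gate, the loop degenerates into the garbage factory described above.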
The other half of Harness Engineering is ensuring agents can navigate the repository without your help. OpenAI’s approach is to keep AGENTS.md under 100 lines, treating it as a directory that points to other structured docs, with freshness enforced via CI rather than ad-hoc manual updates.
Once you’ve built all this, a natural question arises: if agents can verify their work, navigate repositories, and correct errors without you, why are you still sitting there?
A warning: for those still at the earlier levels, the following may sound sci-fi (but save it for later, and come back to it).
Level 7: Background Agents
A bold claim: plan mode is dying.
Boris Cherny, creator of Claude Code, says about 80% of tasks still start with planning. But with each new model generation, the success rate after planning improves. I believe we’re approaching a tipping point: planning as a separate manual step will gradually vanish. Not because planning isn’t important, but because models are smart enough to plan themselves. But only if you’ve done the Levels 3–6 groundwork. If your context is clean, constraints clear, tool descriptions complete, and feedback loops closed, the model can reliably plan without your review. If not, you still need to oversee the plan.
To clarify: planning as a general practice won’t disappear — it’s just changing form. For beginners, planning remains the right entry point (Levels 1 & 2). But for Level 7’s complex features, “planning” becomes more about exploration: probing codebases, prototyping in worktrees, understanding solution spaces. Increasingly, it’s the background agent doing this exploration for you.
This is crucial because it unlocks background agents. If an agent can generate reliable plans and execute them without your sign-off, it can run asynchronously while you do other things. A key shift — from “I switch between tabs” to “work progresses without me.”
The Ralph loop is a popular entry point: an autonomous agent loop that repeatedly runs a coding CLI, spawning a new instance with fresh context each time, until every item in the PRD is done. In my experience, running Ralph well isn’t easy; any vague or inaccurate description in the PRD can backfire. It’s a bit too fire-and-forget.
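In spirit, a Ralph-style loop is just the following, where `run_agent` stands in for spawning a fresh coding-CLI instance and the PRD item format is an assumption for illustration:

```python
# Sketch: a Ralph-style loop that keeps spawning fresh agent instances
# until every PRD item is checked off (or an iteration cap is hit).

def ralph_loop(prd_items: list[dict], run_agent, max_iterations: int = 20):
    for _ in range(max_iterations):
        pending = [item for item in prd_items if not item["done"]]
        if not pending:
            return True  # every PRD item is complete
        task = pending[0]
        # Each call represents a new CLI invocation with a clean context.
        task["done"] = run_agent(task["name"])
    return False

prd = [{"name": "add auth", "done": False}, {"name": "write tests", "done": False}]
calls = []
def flaky_agent(task_name):
    calls.append(task_name)
    # Succeeds only on the second attempt at each task, like a real flaky run.
    return calls.count(task_name) >= 2

finished = ralph_loop(prd, flaky_agent)
```

Note what the sketch lacks: no verification of “done” beyond the agent’s own claim. That is exactly why vague PRDs backfire, and why the backpressure from Level 6 matters here.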
You can run multiple Ralph loops in parallel, but the more you start, the more time you spend coordinating, scheduling, checking outputs, and pushing progress. You’re no longer coding — you’re managing. You need orchestration to handle scheduling, so you can focus on intent, not logistics.
Dispatch runs 3 models in parallel, launching 5 workers — keeping your session lean, agents working
Recently, I’ve been heavily using Dispatch, a Claude Code skill I built that turns your session into a command center. You stay in a clean session, while workers in isolated contexts handle heavy lifting. The scheduler plans, delegates, tracks. Your main window is reserved for orchestration. When a worker stalls, it raises clarifying questions instead of silently failing.
Dispatch runs locally, which is ideal for rapid development where you want to stay close to the work: faster feedback, easier debugging, no infrastructure overhead. Ramp’s Inspect is complementary, suited to longer, more autonomous runs: each agent session runs in a cloud sandbox VM with a full dev environment. A PM spots a UI bug and flags it in Slack, and Inspect handles it even while your laptop is closed. The tradeoff is operational complexity (infrastructure, snapshots, security), but you get scale and reproducibility that local agents can’t match. I recommend using both local and cloud.
At this level, a surprisingly powerful pattern is using different models for different tasks. The best engineering teams aren’t made of clones. They have diverse thinking styles, backgrounds, strengths. The same applies to LLMs. Different models, trained differently, with distinct personalities. I often assign Opus for implementation, Gemini for exploration, Codex for review — combining their strengths for better results than any single model working alone. Think of it as collective intelligence applied to code.
It’s also critical to decouple implementers from reviewers. I’ve learned this the hard way: if the same model instance both implements and reviews its work, it’s biased. It ignores issues, claims all is well — but it’s not. Not out of malice, just because it can’t grade itself like an exam. Use a different model (or a different prompt setup) for review. Your signal quality improves dramatically.
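The decoupling can be expressed as a small control loop in which a change only merges when a separately configured reviewer model approves it. All function names here are hypothetical stand-ins for calls to two different models:

```python
# Sketch: implementer and reviewer are separate callables (ideally separate
# models), so the work is never graded by the instance that produced it.

def implement_and_review(task, implement, review, max_rounds: int = 3):
    notes = ""
    for _ in range(max_rounds):
        patch = implement(task, notes)
        verdict, notes = review(patch)  # an independent model grades the work
        if verdict == "approve":
            return patch
    return None  # reviewer never approved; escalate to a human

# Toy stand-ins: the reviewer rejects until the patch includes tests.
def toy_implementer(task, notes):
    return task + (" with tests" if "missing tests" in notes else "")

def toy_reviewer(patch):
    if "tests" not in patch:
        return "reject", "missing tests"
    return "approve", ""

result = implement_and_review("fix login bug", toy_implementer, toy_reviewer)
```

In practice the two callables would be different models or at least differently prompted instances; the review notes flowing back into the implementer are the same backpressure pattern from Level 6.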
Background agents also open the door for CI + AI integration. Once agents can run autonomously, they can be triggered from existing infrastructure. A docs bot regenerates docs after each merge and submits PRs to update CLAUDE.md (we do this, saving tons of time). A security bot scans PRs and submits fixes. A dependency bot not only flags issues but actually upgrades packages and runs tests. Good context, continuous lessons, powerful tools, automated feedback — all now autonomous.
Level 8: Autonomous Agent Teams
No one has fully mastered this yet, though some are pushing toward it. It’s the frontier.
At Level 7, you have orchestrated LLMs distributing tasks to working LLMs in a hub-and-spoke pattern. Level 8 removes this bottleneck. Agents coordinate directly — claiming tasks, sharing discoveries, marking dependencies, resolving conflicts — all without a central orchestrator.
Claude Code’s experimental Agent Teams feature is an early implementation: multiple instances working in parallel on a shared codebase, messaging each other directly in their contexts. Anthropic had 16 parallel agents build a C compiler capable of compiling the Linux kernel. Cursor ran hundreds of agents for weeks, building a browser from scratch and migrating its codebase from Solid to React.
But there are issues. Without hierarchy, agents become timid, stuck in place, making no progress. Anthropic’s agents kept breaking features until a CI pipeline was added to prevent regressions. Everyone experimenting at this level agrees: multi-agent coordination is very hard, and no one has found the optimal solution.
Honestly, I don’t think models are ready for this degree of autonomy on most tasks. Even when they’re smart enough, building a compiler or a browser this way is still slow, token-hungry, and not cost-effective (impressive, but far from mature). For most daily work, Level 7 is the real lever. I wouldn’t be surprised if Level 8 eventually goes mainstream, but for now I’m focused on Level 7, unless, like Cursor, pushing this frontier is your business.
Level ?
The inevitable “what’s next” question.
Once you master orchestrating agent teams smoothly, the interaction shouldn’t be limited to text. Voice-to-voice (or maybe mind-to-mind?) interactions with programming agents — conversational Claude Code, not just speech-to-text — are the natural next step. Watching your app, describing a series of changes aloud, and seeing them happen before your eyes.
A group is chasing perfect one-shot generation: say what you want, and AI delivers it flawlessly in one go. But that assumes we always know exactly what we want. We don’t. Never have. Software development has always been iterative, and I believe it always will be. Just much easier, far beyond plain text, and much faster.
So: what level are you at? What are you doing to reach the next?