Introducing a dev-lead agent: a coordinator forbidden from writing code, and why it has to be

by Roman Mühlfeldner and Leo Köninger | July 2, 2026 | Artificial Intelligence, Software Engineering, Tools & Frameworks

Roman Mühlfeldner

Senior Developer

Leo Köninger

Senior Developer

A blog post on what splitting a coding task across isolated sub-agents actually buys.

The reliability problem

Agentic coding tools demo well on greenfield work and fail predictably on real tickets. They mock the dependency they were meant to test. They declare a feature done while half the acceptance criteria still fail. As the diff grows past what the model can hold in working memory, scope disappears without anyone noticing. They ship the happy path for a third-party call and never write the branch that runs when the vendor returns a 500. The dev-lead skill treats this as a structural problem rather than something one more prompt can patch. It is a GitHub Copilot coordinator skill that splits a coding task across isolated sub-agents and forbids the coordinator itself from writing code. Everything below follows one rule: the agents that check each other stay independent, so nothing ever grades its own work. Forbidding the coordinator from writing code is just the most visible place that rule bites.

This article presents insights that Roman Mühlfeldner and Leo Köninger, both Senior Developers at Senacor, gained on a real multi-repo product: a couple of credit-sales web apps backed by several services, built contract-first, with many micro-frontends composed into the apps.

Why we use sub-agents, and what the resulting topology looks like

A single large context window does not beat a fleet of small ones. Frontier models lose accuracy well before the window fills, especially when distractors sit beside the relevant material. The bottleneck is signal-to-noise; raw window size matters less.

Therefore each sub-agent gets a fresh context holding only what it needs. This is why the scope problem from the opening goes away: scope only disappears once a diff outgrows a single window, and here no agent ever holds the whole diff. Research-shaped work benefits most: a sub-agent can spend its own context on exploration and return only what it found. Parallelism is a bonus, not the main benefit.

That logic produces the rule mentioned at the beginning:

A lead agent forbidden from writing code, tests, or reviews. Its job is to plan, dispatch specialists, and arbitrate the findings that come back.

dev-lead runs as a GitHub Copilot skill, and every role on the team is a Copilot sub-agent: a fresh context the lead spawns via @agent-name, runs to completion, and reads the result back from. Agent identity files (AGENTS.md) live under .github/agents/ and hold each role’s persona, file ownership, and validation commands, so the lead composes them with per-task context and never re-explains who the agent is. There is no agent-to-agent channel; all coordination flows through the lead, because the same logic that argues for fresh context per agent argues against peer back-channels.

The following graphic shows the resulting topology of agents. Depending on your project domain, more or fewer sub-agents may be present.

Dispatch Topology for an agent based development project

Picture 1: Dispatch topology. Fourteen sub-agents across Build, Test, and Review; arrows go only between the lead and each agent, never agent to agent. Each spoke carries its own context; agents share a filesystem but no conversation history. The dashed link to @fullstack-dev marks it as mutually exclusive with @frontend-dev / @backend-dev / @database-dev – never dispatched alongside them. Reviewers split into the core (blind, edge-case, acceptance) and risk-based observers dispatched only when the diff warrants them.
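
The mutual-exclusion rule in the picture can be checked mechanically before a wave of dispatches goes out. The following is a minimal sketch, not the skill's actual implementation: the Dispatch record, the validate_wave function, and the grouping constants are hypothetical names chosen for illustration. Only the role names come from the topology above.

```python
from dataclasses import dataclass

# Role names are taken from the topology; the grouping into two
# constants is an assumption made for this sketch.
SPECIALIST_BUILDERS = {"frontend-dev", "backend-dev", "database-dev"}
GENERALIST_BUILDER = "fullstack-dev"


@dataclass(frozen=True)
class Dispatch:
    """One lead-to-agent spoke: a role plus the per-task context it receives."""
    agent: str
    task_context: str


def validate_wave(wave: list[Dispatch]) -> list[str]:
    """Return rule violations for dispatches planned to run together."""
    agents = {d.agent for d in wave}
    problems = []
    clash = sorted(agents & SPECIALIST_BUILDERS)
    if GENERALIST_BUILDER in agents and clash:
        problems.append(f"{GENERALIST_BUILDER} is mutually exclusive with {clash}")
    return problems
```

Because every dispatch originates at the lead, a check like this only needs to run in one place; there is no agent-to-agent path it could miss.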

Which aspects of the topology actually buy the reliability

What buys the reliability is the independence rule from the top: keep the agents that check each other independent, so nothing ever grades its own work. It runs deeper than a fresh context. Fresh context per agent, the point of the section above, is only the first cut. The rest comes from holding that rule at every stage, so no two agents share a blind spot. The sections below trace it through five layers: the test intent, the reviewer perspective, the re-review of a fix, the reviewing model, and the contract between parallel builders.

Testers re-derive acceptance criteria from the source

It begins with the testers. The failure mode here is tests that mirror the implementation rather than check intent. @adversarial-tester and @integration-tester get the ticket or goal description directly. They see neither the lead’s plan nor the builder’s interpretation. They extract acceptance criteria themselves, from the same source the developer worked from. If the developer misreads the requirement, the testers get an independent chance to catch it. Their tests are first-class artifacts; developers cannot delete them. This is where two of the opening failures close: a feature called done with half its criteria failing, and a test that stubs out the very dependency it was meant to verify.

The tester can misread the requirement too. The design counts on it. Three independent readers go at the same ticket: the builder, the two testers, and the @acceptance-reviewer that runs in Phase 4 on a different model family from anyone else on the team. A misread only survives if every one of them makes it the same way. When they disagree on what the requirement means, the lead does not crown a winner; the finding classifies as intent_gap or bad_spec and escalates to the human, because requirements ambiguity is not the lead’s call to make.
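
The disagreement check can be pictured as a set comparison. The following is a minimal sketch under a strong assumption: each reader's criteria have already been normalized to shared identifiers such as "AC1". Real agents would phrase criteria differently, and matching them is the hard part that this sketch skips. The function name compare_readings and the result keys are hypothetical.

```python
def compare_readings(readings: dict[str, set[str]]) -> dict:
    """Compare acceptance criteria derived independently by each reader.

    `readings` maps a reader name to the set of criterion identifiers
    that reader extracted from the ticket.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    all_criteria = set().union(*readings.values())
    agreed = set.intersection(*readings.values())
    # For every criterion that some reader lacks, record who did find it.
    disputed = {
        criterion: sorted(name for name, found in readings.items() if criterion in found)
        for criterion in sorted(all_criteria - agreed)
    }
    return {
        "agreed": sorted(agreed),
        "disputed": disputed,
        # The lead does not pick a winner; any dispute goes to the human.
        "action": "escalate_to_human" if disputed else "proceed",
    }
```

The sketch only detects the disagreement. Deciding whether a disputed criterion is an intent_gap or a bad_spec stays with the lead's classification step, and resolving it stays with the human.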

Picture 2: Both testers run as independent background subagents, each deriving acceptance criteria from the ticket not from the builder’s interpretation of it.

Reviewers split by perspective, with a real triage scheme

A single reviewer rationalizes the diff in front of it. The lead dispatches @blind-reviewer, @edge-case-reviewer, and @acceptance-reviewer as the core, plus risk-based observers (@security-reviewer, @architect-reviewer, @code-quality-reviewer, @performance-reviewer, @accessibility-reviewer) only when the diff warrants them. Findings return to the lead, which deduplicates them by primary-ownership rules (input validation belongs to security, query efficiency to performance) and classifies each as intent_gap | bad_spec | patch | defer | reject. When reviewers contradict each other, a fixed priority applies: correctness > performance > readability.
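
The triage step can be sketched as two small functions: one for deduplication by primary ownership, one for the priority order. This is an illustration under assumptions, not the skill's code. The Finding fields (topic, location, concern) and the function names are hypothetical; only the two ownership examples and the priority order come from the text.

```python
from dataclasses import dataclass

# The two ownership examples from the text; further topics would be project-specific.
PRIMARY_OWNER = {
    "input-validation": "security-reviewer",
    "query-efficiency": "performance-reviewer",
}
# Fixed priority from the text: correctness > performance > readability.
PRIORITY = {"correctness": 0, "performance": 1, "readability": 2}


@dataclass(frozen=True)
class Finding:
    reviewer: str
    topic: str      # hypothetical tag, e.g. "input-validation"
    location: str   # hypothetical "file:line" key
    concern: str    # one of the PRIORITY keys
    message: str


def deduplicate(findings: list[Finding]) -> list[Finding]:
    """Keep one finding per (topic, location), preferring the topic's primary owner."""
    kept: dict[tuple[str, str], Finding] = {}
    for f in findings:
        key = (f.topic, f.location)
        current = kept.get(key)
        owner = PRIMARY_OWNER.get(f.topic)
        if current is None or (f.reviewer == owner and current.reviewer != owner):
            kept[key] = f
    return list(kept.values())


def resolve_conflict(a: Finding, b: Finding) -> Finding:
    """When two findings contradict each other, the higher-priority concern wins."""
    return min(a, b, key=lambda f: PRIORITY[f.concern])
```

Keeping the priority order in a single table means a contradiction is settled the same way every time, instead of by whichever reviewer argued most persuasively.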

Picture 3: Five reviewers run in parallel against a single diff.

Re-review is where reliability comes from

It does not stop at the first review. The failure mode for any fix loop is a reviewer biased toward approving the same diff it already approved. Every fix therefore triggers independent verification from someone who hasn’t seen that round.

Routing is proportional to scope. A small mechanical fix (under ~100 lines, touching one concern) goes back to a single re-reviewer, because a full review round costs more than the risk warrants. Anything larger gets fresh reviewers who have not seen the first pass. This proportionality is what makes the loop sustainable rather than a bottleneck.
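
As a rough sketch, the routing rule reduces to one comparison. The function name, the return labels, and the exact cut-off are assumptions: the text gives the threshold only as "under ~100 lines", so it is a parameter here, not a constant.

```python
def route_fix(lines_changed: int, concerns_touched: int, small_fix_limit: int = 100) -> str:
    """Pick the re-review route for a fix, proportional to its scope."""
    if lines_changed < small_fix_limit and concerns_touched == 1:
        # Small mechanical fix: one re-reviewer is enough.
        return "single_re_reviewer"
    # Anything larger goes to reviewers who have not seen the first pass.
    return "fresh_reviewers"
```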

Two shapes of finding recur often enough to be worth calling out:

The first is external system error handling. The product we built integrates many third-party services, and those services fail in shapes the builder rarely plans for. A resource 404s because it was deleted upstream, a call times out mid-transaction, or a library the builder treated as infallible raises an exception out of nowhere. First-pass generation ships the happy path, and re-review catches the missing branches with findings phrased as "this will throw in production the first time the vendor has an outage." This is the third-party call from the opening, the one that ships the happy path and never handles the 500 edge case, now caught before it reaches production.

The second is accessibility. In our experience, it is one of the easiest things to overlook, which is what makes a dedicated @accessibility-reviewer so useful. Typical findings are form controls without labels, icon buttons without accessible names, and error states that never reach a screen reader. The first pass misses these often enough that the re-review loop pays for itself on frontend work alone.

Picture 4: The fix loop in action: 13 numbered findings routed back to the owning builder, with re-test and targeted re-review already scheduled.

Cross-model review

The same rule reaches one level deeper, into the models themselves. Code generated by one model family is reviewed by a different one, because a reviewer trained on the same data as the builder is biased toward rationalizing the same patterns it would have produced. Diverse model families catch what same-family review misses. The assignment below spreads three model families across the roles, concentrating that distance where it matters most.

A note on Copilot context limits

GitHub Copilot now lets you pick, per model, between a default context window and an extended one of up to a million tokens, with a configurable reasoning level alongside it. The larger settings cost more credits per interaction, and GitHub itself recommends the default window for ordinary work. Every role on the team, the lead included, runs on that default window on purpose. This is not a limit to work around; it is the fleet-of-small-windows argument from the opening applied to billing. Each sub-agent holds only its slice of the task and has nothing to gain from a bigger window, and the lead fits the same default because its agents hand back compressed findings, not raw diffs. Paying a million-token window per dispatch would buy context none of them would use. What the assignment turns on, then, is model family per role, not window size.

 
Model            | Roles
GPT-5.3-Codex    | frontend-dev, backend-dev, database-dev, fullstack-dev
Claude Sonnet 5  | blind-reviewer, security-reviewer, performance-reviewer, edge-case-reviewer, adversarial-tester, accessibility-reviewer
GPT-5.4          | lead (coordinator), code-quality-reviewer, architect-reviewer
GPT-5.4 mini     | integration-tester
Gemini 3.5 Flash | acceptance-reviewer

Builder – GPT-5.3-Codex. It comes from a different model family than every reviewer on the correctness pipeline, which is the whole point of the layout. But it is not chosen only for that distance: it is also OpenAI’s code specialist line.

Reviewers and adversarial tester – Sonnet 5. Independent benchmarks (see Further reading) put Claude ahead on logic bugs, race conditions, and multi-file reasoning, and catching those is reviewer work. Same-family bias between builder and bug-hunters is gone now that the builder is OpenAI and the bug-hunters are Claude. adversarial-tester sits with them for the same reason: a tester sharing a family with the builder anticipates the same attack patterns the builder would have.

Coordinator and structural reviewers – GPT-5.4. The lead never writes code but sees more aggregate context than anyone on the team (every agent’s return value flows through it), so the choice is driven by Copilot prompt headroom and cross-family distance from the Sonnet reviewers it arbitrates between. architect-reviewer and code-quality-reviewer stay here for the same reason: structural review rewards broad reasoning across the whole change, and keeping it off the Sonnet family preserves the distance. GPT-5.4 has a different post-training lineage from Codex, so there is still cross-model distance from the builder, but the diversity is intra-OpenAI rather than across vendors. That is a softer guarantee than the Claude side gets, and worth flagging.

Acceptance reviewer – Gemini 3.5 Flash. A third family for the most spec-critical role, which is the point: builder, reviewers, and acceptance check sit on three different vendors. The default window is all the role needs, since it scopes to ticket plus diff, rather than the whole repo.
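
The assignment can be written down as data and checked for the cross-vendor distance described above. The following is a sketch under one simplifying assumption: "model family" is reduced to the vendor, which is coarser than the post-training lineage the text distinguishes. The dictionary and function names are hypothetical.

```python
# Model per role, copied from the assignment table.
MODEL_BY_ROLE = {
    "frontend-dev": "GPT-5.3-Codex",
    "backend-dev": "GPT-5.3-Codex",
    "database-dev": "GPT-5.3-Codex",
    "fullstack-dev": "GPT-5.3-Codex",
    "blind-reviewer": "Claude Sonnet 5",
    "security-reviewer": "Claude Sonnet 5",
    "performance-reviewer": "Claude Sonnet 5",
    "edge-case-reviewer": "Claude Sonnet 5",
    "adversarial-tester": "Claude Sonnet 5",
    "accessibility-reviewer": "Claude Sonnet 5",
    "lead": "GPT-5.4",
    "code-quality-reviewer": "GPT-5.4",
    "architect-reviewer": "GPT-5.4",
    "integration-tester": "GPT-5.4 mini",
    "acceptance-reviewer": "Gemini 3.5 Flash",
}

# Vendor per model. Treating the vendor as the family is an assumption of this sketch.
VENDOR_BY_MODEL = {
    "GPT-5.3-Codex": "OpenAI",
    "GPT-5.4": "OpenAI",
    "GPT-5.4 mini": "OpenAI",
    "Claude Sonnet 5": "Anthropic",
    "Gemini 3.5 Flash": "Google",
}


def checkers_sharing_vendor(builder_role: str) -> list[str]:
    """List reviewer and tester roles that run on the same vendor as the builder."""
    builder_vendor = VENDOR_BY_MODEL[MODEL_BY_ROLE[builder_role]]
    return sorted(
        role
        for role, model in MODEL_BY_ROLE.items()
        if role.endswith(("-reviewer", "-tester"))
        and VENDOR_BY_MODEL[model] == builder_vendor
    )
```

For any builder role, the check lists architect-reviewer, code-quality-reviewer, and integration-tester: the checking roles that the table places on OpenAI models alongside the builders. A list like this makes the softer intra-vendor guarantee visible instead of leaving it implicit in the table.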

Contract-first multi-repo orchestration

Parallelism is where the rule is hardest to keep. The failure mode of parallel builders is interface drift, where each builder finishes against its own assumption of the API and the pieces refuse to integrate. Phase 0 first discovers the workspace (multiple independent git repos under one root, each with its own build tooling) and builds a repo map the lead works against. From there, the interface contract is authored as part of the plan, committed first to the contract’s repo, and owned by the lead. The OpenAPI spec is a planning artifact, so authoring it sits on the planning side of the line that keeps the lead out of code, tests, and reviews. Only then are builders dispatched in parallel against the frozen contract. If a builder needs to deviate, it reports back and the lead updates the spec; builders never edit it themselves. Parallelism is safe because the contract is frozen before the first builder starts, and it stays that way because only the coordinator is allowed to touch it.
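
One way to picture the freeze is a fingerprint of the spec taken before the first builder starts. The sketch below is an assumption about how such a guard could look, not a description of how dev-lead enforces it: the FrozenContract class and its methods are hypothetical, and hashing is our choice of mechanism.

```python
import hashlib


class FrozenContract:
    """Holds the fingerprint of a spec and lets only its owner change it."""

    def __init__(self, spec_text: str, owner: str = "lead"):
        self._owner = owner
        self.digest = self._fingerprint(spec_text)

    @staticmethod
    def _fingerprint(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def matches(self, spec_on_disk: str) -> bool:
        """True if the spec a builder worked against is still the frozen one."""
        return self._fingerprint(spec_on_disk) == self.digest

    def update(self, requested_by: str, new_spec: str) -> None:
        """Re-freeze the contract; builders must report back instead of calling this."""
        if requested_by != self._owner:
            raise PermissionError(f"{requested_by} may not edit the contract")
        self.digest = self._fingerprint(new_spec)
```

Checking matches() when a builder returns would catch a spec that drifted during the parallel phase, and the owner check mirrors the rule that a deviation goes through the lead.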

Picture 5: Phase flow with feedback loops. Seven sequential phases with three bounded fix loops (test, review, pipeline – each capped at 3 rounds), the Human Review comment-address loop, and dashed escalations to the human for ambiguous requirements, plan approval, and any loop that exhausts its cap.

The boring guardrails

More independence means more agents, and the lead will spawn them if nothing stops it. These are the parts that make the workflow safe to run unattended for an hour.

  • Hard cap of 24 dispatches per session. Without it, the lead reaches for another dispatch every time something doesn’t converge, and a single ticket can burn through hundreds of agent calls before anyone notices.
  • The fix-and-retest loop gives up after 3 rounds, and any single observer gets at most 2 re-reviews before the lead escalates. When the model has missed something three times, attempt four rarely finds it. The cap exists because the next dispatch is always cheaper than admitting the lead is stuck. Without it, the lead retries forever and the loop quietly replaces real diagnosis with busywork.
  • Every test and build command runs under a timeout. On timeout the lead investigates instead of retrying. A stuck process would otherwise hold the session open until the budget is gone, and respawning the same hang would just count as progress.
  • File ownership is exclusive: defaults live in each AGENTS.md, and the lead arbitrates when two agents want the same file. Drop exclusivity and parallel builders race on the same file; the last writer silently wins, which is the kind of drift integration tests sometimes catch and sometimes do not.
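
The first three guardrails are simple counters and a timeout, which a few lines can illustrate. This is a minimal sketch with hypothetical names (SessionBudget, EscalateToHuman, run_with_timeout); the caps of 24 dispatches and 3 fix rounds come from the list above, while everything else is an assumption.

```python
import subprocess


class EscalateToHuman(Exception):
    """Raised when a cap is hit and the lead must stop instead of retrying."""


class SessionBudget:
    def __init__(self, max_dispatches: int = 24, max_fix_rounds: int = 3):
        self.max_dispatches = max_dispatches
        self.max_fix_rounds = max_fix_rounds
        self.dispatches = 0
        self.fix_rounds: dict[str, int] = {}

    def dispatch(self, agent: str) -> None:
        """Count one dispatch; refuse once the session cap is reached."""
        if self.dispatches >= self.max_dispatches:
            raise EscalateToHuman(f"dispatch cap reached before {agent}")
        self.dispatches += 1

    def start_fix_round(self, loop: str) -> int:
        """Count one round of a named fix loop (test, review, pipeline)."""
        used = self.fix_rounds.get(loop, 0)
        if used >= self.max_fix_rounds:
            raise EscalateToHuman(f"{loop} loop exhausted its {self.max_fix_rounds} rounds")
        self.fix_rounds[loop] = used + 1
        return used + 1


def run_with_timeout(cmd: list[str], seconds: float) -> str:
    """Run a test or build command; a timeout means investigate, not retry."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        return "investigate"
    return "passed" if result.returncode == 0 else "failed"
```

Raising an exception when a cap is hit, instead of returning a flag, makes it hard for the calling loop to ignore the cap and quietly start another round.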

Four things always escalate to the user rather than loop: security design, ambiguous requirements, scope change, and repeated failure. These are not the lead’s to decide. Cross that boundary and the lead substitutes its own judgment for the user’s, and the failure is invisible until it ships.

Where it fits, and where it does not

Independence is not free, and it only pays off when there is enough work for the agents to divide and cross-check. dev-lead is built for multi-repo workspaces: a contracts repo, a frontend, and one or more backend services. On a single monolith, Phase 0 discovery, the contracts-repo ordering, and file-ownership arbitration are all just in the way.

The dispatch overhead is also real. A one-line config tweak or a typo fix still spawns the lead, a builder, the testers, and a review round; the round-trip costs more than the change. Below some threshold (anything a senior could land in five minutes), the harness gets in its own way, and the right call is to skip it. The same applies to work that lives in tacit knowledge the codebase does not capture: a fix that turns on a half-remembered conversation about why a particular vendor returns 503 instead of 504 will not surface from any number of sub-agents. The lead can only arbitrate what the agents return, and the agents can only return what the repo and the ticket actually tell them.

It is also opinionated. The priority order, the rule keeping testers isolated from dev intent, the specific model-to-role mapping, and the contract-first ordering are all judgment calls we would argue for in code review.

Three things to steal

You do not need the whole harness to get something out of this. Three ideas travel on their own, and you can try them in a single editor session today:

  1. Forbid your planner from writing code. Whatever agent or prompt you use to scope a task, keep it out of the implementation. The moment the planner starts typing, it stops planning and starts rationalizing.
  2. Review across model families. If you generate the code with Claude, review it with GPT or Gemini, and vice versa. A reviewer from the same family rationalizes the same mistakes it would have made. This works at the single-prompt level; you do not need a harness.
  3. Freeze the contract before parallelizing anything. If two humans, two agents, or two tabs are going to touch related code at the same time, the interface they meet has to exist in writing first. Most “integration hell” is a contract that was never written down.

Try your next non-trivial agent task with the planner forbidden from touching code. Notice what breaks first. That’s where reliability was leaking.

Further reading

If you want to go deeper on context, feedback loops, or model diversity, this is where we would start.

Context degradation and why narrow, relevant context matters:

Feedback loops as a reliability mechanism:

Cross-model diversity as a reliability mechanism:

Coding model benchmarks (early 2026):