Building autonomous agents for technical tasks: 5 lessons learned
Graham Fuller
·
Apr 15, 2026
Designing a system to tackle the busywork
You send a Slack message describing a bug or a refactor. The system spins up a dedicated environment, assigns an agent, and gives it the full toolchain. The agent reads its own CI failures and applies suggestions from code review bots to ensure the PR is in pristine condition. Your next interaction with it is a human review.
We ran dozens of gRPC migrations across our monorepo, and the process got better with each one. Minimal engineer-hours spent relaying CI output. No one copying error logs into chat windows. No afternoon check-ins.
That’s not a demo; engineers at Brex are using this right now. This is how we got there.
Where engineering teams hit bottlenecks
AI coding agents today typically live in isolated environments: give them a repo, a task, and a sandbox. They perform impressively. Then you put them on real production work, such as a live monorepo, real CI, and real review bots, and a specific failure mode emerges: the agent finishes its changes, hits a wall of automated feedback it can't read, and stops. Or worse, it keeps going without that feedback and produces output that looks correct but isn't.
The issue is connectivity:
- Your CI system knows what failed and why.
- Your review bots flag style violations and security issues.
- Your test runner produces exact stack traces.
All of that information exists. It was built for engineers. But agents can't reach it, so it never gets back to them.
The standard workaround is a human in the loop, not to make decisions, but to relay messages. That's an expensive solution to a plumbing problem.
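Concretely, the "plumbing" is mostly collecting and reformatting feedback that already exists. A minimal sketch of that translation step; the data shapes and field names here are hypothetical, standing in for whatever your CI provider and review bots actually emit:

```python
def format_feedback(ci_failures, bot_comments):
    """Collapse automated feedback into one agent-readable message.

    Input shapes are hypothetical: in practice these would come from
    your CI provider's API and your review bots' PR comments.
    """
    sections = []
    if ci_failures:
        lines = ["CI failures:"]
        for f in ci_failures:
            lines.append(f"- {f['job']}: {f['error']}")
        sections.append("\n".join(lines))
    if bot_comments:
        lines = ["Review bot comments:"]
        for c in bot_comments:
            lines.append(f"- {c['file']}:{c['line']}: {c['body']}")
        sections.append("\n".join(lines))
    if not sections:
        return "All checks green."
    return "\n\n".join(sections)
```

The point is how little logic is involved: the human "messenger" role is a formatting function plus API calls.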
The job that made it real
We needed to migrate gRPC client factories across 400+ services in our monorepo. Teams had built their own implementations over the years. The result? Duplicated logic, inconsistent configs, dependency injection conflicts that surfaced at deploy time. The fix was clear: centralize the client libraries, replace every factory. The scope was large enough that doing it manually wasn't realistic.
We started pairing with AI agents to do the refactors. For simpler services it worked well — 100 to 200 lines of changes, one-shot, about 30 minutes each. The work was repetitive enough that you could describe the pattern once and let the agent run.
For a while, that felt like enough.
The broken toolchain
For simpler migrations the agents could do the work, but we were still running the process. For complex services with multiple client factories, they couldn't handle it at all.
Every migration followed the same loop:
- Spin up a remote developer environment
- Kick off the agent, wait for it to finish
- CI runs, wait for it to finish
- Check if review bots left comments
- Copy those comments back to the agent, ask it to fix things
- Wait again
- Repeat until green
- Finally review and merge
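The loop above is mechanical, which is exactly what makes it automatable. A sketch of the same loop as code, with the agent, CI, and review bots stubbed out as injected callables (all names here are hypothetical, not the actual platform's interfaces):

```python
def run_until_green(run_agent, run_ci, get_bot_comments, max_rounds=10):
    """Drive the migrate -> CI -> relay-feedback loop until checks pass.

    run_agent(feedback) applies changes; run_ci() returns (passed, logs);
    get_bot_comments() returns outstanding review-bot comments.
    """
    feedback = None
    for round_num in range(1, max_rounds + 1):
        run_agent(feedback)            # agent works, using prior feedback
        passed, logs = run_ci()        # wait for CI to finish
        comments = get_bot_comments()  # check for review-bot comments
        if passed and not comments:
            return round_num           # green: ready for human review
        # relay feedback to the agent instead of a human copying it over
        feedback = {"ci_logs": logs, "comments": comments}
    raise RuntimeError("did not converge; escalate to a human")
```

Everything a human was doing in that loop is either waiting or relaying; only the final review step needs a person.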
Every morning we'd kick off a handful of migrations. Every afternoon we'd come back and manually relay whatever automated systems had said back to the agents. We turned engineers into messengers.
All the information the agents needed already existed. Bots comment on PRs. CI returns detailed logs. Test runners tell you exactly what broke and where. We'd built that infrastructure for our human engineers. The agents just couldn't access it. So we stood in the middle, passing notes.
Closing the loop
We started simple with three Python scripts: one to forward Slack task requests to remote developer environments, one to feed PR bot comments back to agents, and one to handle CI failures. They ran continuously in the background.
The first time a migration ran start to finish without either of us touching it answered the question we'd been asking: would closing the feedback loop be enough? It was. Each piece handed off cleanly to the next.
The time unlock was magic.
You'd send a message in the morning and find a green PR waiting by the end of the day. No afternoon check-ins, no copying error logs into chat windows. The agents were doing what agents are supposed to do: running until they're done.
Beyond migrations: delegating engineering work to agents
What we built is a general-purpose template for delegating engineering work to agents: a trigger, a dedicated environment, a full toolchain, and closed feedback loops with every automated system that would normally surface a problem to a human engineer. Give agents that setup and they can own a wide class of well-scoped tasks without anyone managing them.
So we turned it into a platform:
Tasks come in from Slack, Linear, GitHub, or a cron job. The orchestrator routes them to the agent pool. Each task gets its own remote developer environment (RDE). The agent works, hits failures, reads feedback, iterates, and puts up a PR. We come back to something reviewable.
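In broad strokes, the routing layer reduces to: normalize a task from whatever source produced it, provision a fresh environment, and hand it to an agent. A sketch with hypothetical interfaces, not the platform's actual API:

```python
from dataclasses import dataclass

@dataclass
class Task:
    source: str       # "slack", "linear", "github", or "cron"
    description: str
    env_id: str = ""  # filled in when an environment is provisioned

class Orchestrator:
    """Routes incoming tasks to the agent pool, one environment per task."""

    def __init__(self, provision_env, assign_agent):
        self.provision_env = provision_env  # spins up a dedicated environment
        self.assign_agent = assign_agent    # hands the task to a free agent

    def handle(self, task: Task):
        task.env_id = self.provision_env(task)  # dedicated environment
        return self.assign_agent(task)          # agent iterates to a PR
```

The sources differ only at intake; once a `Task` exists, every trigger flows through the same path.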
We're continuing to extend what agents can access: internal tool verification, test workload deployment, browser-based checks for frontend changes. The scope of what's delegable keeps expanding, and the pattern scales cleanly because each agent runs in its own environment and manages its own feedback loop.
What this taught us about running agents on real work
The gap between "AI agents can do this" and "AI agents are doing this reliably" is mostly an infrastructure problem. A few things became clear:
Feedback loop closure is the core problem. Agents stall when they need information that's sitting in systems they can't reach. The integration work is less exciting than the model capability question, and it matters more.
Environment quality determines output quality. An agent without validation tooling is guessing. We gave agents the same toolchain our engineers have, and early PRs are bearing that hypothesis out.
Parallelization is a systems design problem. Running one agent on one task is easy. Running 50 in parallel, each iterating independently, requires thinking carefully about orchestration and environment isolation.
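One way to picture the isolation requirement; this is a local sketch using temp directories and threads, where the real platform uses remote developer environments:

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor

def run_task(task_name):
    """Each task gets its own workspace so parallel agents never share state."""
    with tempfile.TemporaryDirectory(prefix=f"{task_name}-") as workspace:
        # ...an agent would clone, edit, and validate inside `workspace`
        return (task_name, workspace)

tasks = [f"migration-{i}" for i in range(50)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_task, tasks))

# every task saw a distinct workspace: isolation holds under parallelism
assert len({ws for _, ws in results}) == len(tasks)
```

The hard part isn't spawning 50 workers; it's guaranteeing that no two of them can observe or corrupt each other's state while each runs its own feedback loop.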
Scope the work tightly. This is for tasks that are tedious, well-defined, and common enough that nobody wants to own them manually. That category is larger than you'd expect, and it grows as the platform does.
Governance is critical. Agents should only have access to the information they need to perform the task at hand. When agent work is put in front of a human to review, it needs to be attributed back to an actual human, not just a bot.
We ran dozens of migrations. None of them required an engineer to spend an afternoon relaying CI output. The platform that emerged from that work is what we're building on now.
Contributors: Graham Fuller, Jacky Chung, Sofhia de Souza Gonçalves