
6 Ways Agent-Driven Development Is Transforming Coding Agent Analysis

2026-05-03 13:14:07

As an AI researcher at GitHub, I recently discovered a way to automate what I once considered intellectual toil—analyzing vast numbers of coding agent trajectories. This journey led me to build a tool called eval-agents that not only speeds up my own workflow but also empowers my entire team. In this listicle, I’ll share six key insights from that process, showing how GitHub Copilot and agent-driven development can radically improve how we evaluate and iterate on coding agents.

1. The Challenge: Overwhelming Trajectory Data

My daily work involves assessing coding agent performance against benchmarks like Terminal-Bench 2.0 and SWE-bench Pro. Each agent run produces a detailed trajectory—a JSON file, often hundreds of lines long, capturing the agent's reasoning and actions. Multiply that by dozens of tasks in a benchmark set and repeat across multiple runs, and you're facing hundreds of thousands of lines of data. Manually analyzing this volume is impossible, even for the most dedicated researcher. Without automation, key performance patterns and failure modes get buried, slowing down progress.
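To make the scale concrete, here's a minimal sketch of how you might tally the volume of a single benchmark run. The directory layout and the "steps" field are my assumptions for illustration, not the actual benchmark schema:

```python
import json
from pathlib import Path

def trajectory_stats(run_dir: str) -> dict:
    """Tally how much text one benchmark run produces.

    Assumes one trajectory JSON per task, each with a "steps" list
    (an illustrative schema, not the real one).
    """
    files = list(Path(run_dir).glob("*.json"))
    total_lines = 0
    total_steps = 0
    for f in files:
        text = f.read_text()
        total_lines += text.count("\n") + 1
        total_steps += len(json.loads(text).get("steps", []))
    return {"tasks": len(files), "lines": total_lines, "steps": total_steps}
```

Run this over a few benchmark runs and the line counts climb into the hundreds of thousands quickly.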

(Image source: github.blog)

2. The First Step: Using GitHub Copilot to Surface Patterns

I turned to GitHub Copilot to help me navigate this sea of data. By leveraging its natural language and code generation capabilities, I could quickly identify common patterns in the trajectories. For example, I asked Copilot to find cases where agents repeatedly misread a terminal command or failed to handle an edge case. This reduced the lines I had to read from hundreds of thousands to a few hundred. Yet the process was still manual—I kept repeating the same queries across new benchmark runs. The engineer in me knew there had to be a better way.
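The kind of one-off scan Copilot would generate for a query like "find commands the agent re-ran unsuccessfully" looked roughly like this. The schema fields ("steps", "action", "exit_code") are illustrative assumptions:

```python
import json
from collections import Counter
from pathlib import Path

def repeated_failures(run_dir: str, threshold: int = 3) -> dict:
    """Find actions an agent retried unsuccessfully `threshold`+ times.

    Schema is assumed: each trajectory has "steps", each step an
    "action" string and an "exit_code".
    """
    failures = Counter()
    for path in Path(run_dir).glob("*.json"):
        traj = json.loads(path.read_text())
        for step in traj.get("steps", []):
            if step.get("exit_code", 0) != 0:
                failures[step.get("action", "")] += 1
    return {cmd: n for cmd, n in failures.items() if n >= threshold}
```

Each new benchmark run meant asking for a slight variant of the same script, which is exactly the repetition described next.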

3. Recognizing the Repetitive Loop

After several iterations of the same Copilot-driven analysis, I realized I was stuck in a repetitive loop: run a new benchmark, use Copilot to surface patterns, investigate manually, rinse, repeat. This wasn’t leveraging Copilot’s full potential—it was just a crutch. The true opportunity was to automate the entire analysis pipeline. I no longer wanted to be the human in the loop for every single investigation. That’s when the idea of building a specialized agent to do this work struck me. It was time to move from using AI as a tool to using AI as an autonomous collaborator.

4. Creating Eval-Agents to Automate Intellectual Toil

Thus eval-agents was born. This system automates the process of analyzing trajectory data, generating reports, and even suggesting fixes. I designed it to be self-contained so that any team member can run it on new benchmarks without my involvement. The agent reads the raw JSON files, applies learned heuristics, and outputs concise summaries of agent behavior—both good and bad. It doesn’t replace human judgment but dramatically reduces the cognitive load. Suddenly, what took me hours of manual scanning is done in minutes. The intellectual toil that once defined my day is now delegated to code.
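To show the shape of the idea (not the real implementation), here's a minimal sketch of the core loop: run a set of heuristics over every trajectory in a run and collect the findings into a report. The heuristics and schema are illustrative:

```python
import json
from pathlib import Path
from typing import Callable, Optional

# A heuristic inspects one trajectory and returns a finding, or None.
Heuristic = Callable[[dict], Optional[str]]

def flag_long_trajectories(traj: dict) -> Optional[str]:
    """Flag trajectories with an unusually high step count."""
    steps = traj.get("steps", [])
    if len(steps) > 50:
        return f"unusually long trajectory ({len(steps)} steps)"
    return None

def flag_repeated_actions(traj: dict) -> Optional[str]:
    """Flag an agent that repeats the same action back to back."""
    actions = [s.get("action") for s in traj.get("steps", [])]
    for a, b in zip(actions, actions[1:]):
        if a and a == b:
            return f"repeated action: {a!r}"
    return None

def analyze_run(run_dir: str, heuristics: list) -> dict:
    """Apply every heuristic to every trajectory; keep only findings."""
    report = {}
    for path in sorted(Path(run_dir).glob("*.json")):
        traj = json.loads(path.read_text())
        findings = [msg for h in heuristics if (msg := h(traj))]
        if findings:
            report[path.name] = findings
    return report
```

Because the heuristics are just functions, new failure modes can be encoded once and re-applied to every future run automatically.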


5. Designing for Collaboration and Ease of Use

From the outset, I had three guiding principles for eval-agents: agents should be easy to share, new agents should be easy to author, and coding agents should be the primary vehicle for contributions. Drawing on my experience as an OSS maintainer for the GitHub CLI, I baked these values into the architecture. Every agent is a self-contained package that can be version-controlled, reviewed, and reused across teams. We also built a simple API so that colleagues can write their own agents with minimal friction. This turns the entire team into contributors, not just consumers. Collaboration is no longer a bottleneck—it's a superpower.
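An "easy to author" API for this kind of system often boils down to a small registry plus a decorator. This is a sketch under my own naming assumptions (register, Agent, run_all are not eval-agents' real API), just to show how little friction a colleague would face:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    """A self-contained analysis agent: a name, a blurb, and a function."""
    name: str
    description: str
    run: Callable[[dict], dict]

_REGISTRY = {}

def register(name: str, description: str = ""):
    """Decorator so a new agent can be contributed in a few lines."""
    def wrap(fn):
        _REGISTRY[name] = Agent(name, description, fn)
        return fn
    return wrap

@register("step-counter", "Counts the steps in a trajectory")
def step_counter(traj: dict) -> dict:
    return {"steps": len(traj.get("steps", []))}

def run_all(traj: dict) -> dict:
    """Run every registered agent over one trajectory."""
    return {name: agent.run(traj) for name, agent in _REGISTRY.items()}
```

The registry pattern is what makes each agent reviewable and reusable on its own: a contribution is just one decorated function in one file.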

6. Unlocking a Faster Development Loop for the Team

The results have been transformative. My team now uses eval-agents to iterate on coding agents at unprecedented speed. Instead of waiting for me to manually analyze a run, they can launch their own analysis and get actionable insights in minutes. This has accelerated our research cycle, allowing us to test hypotheses and refine agent behaviors much faster. More importantly, it has freed us to focus on the creative parts of our work—designing new agent strategies, exploring novel architectures, and pushing the boundaries of what AI can do. The tools we built not only automate toil but also amplify human ingenuity.

Agent-driven development is not just a technical shift—it’s a cultural one. By automating the analysis of coding agents, we’ve made our entire team more efficient and more creative. I encourage you to look at your own repetitive analysis tasks and ask: “Can I automate this?” The answer, with GitHub Copilot and a little ingenuity, is almost always yes.
