At Meta, efficiency at hyperscale is not just a goal—it's a necessity. With over 3 billion users depending on its infrastructure, even a 0.1% performance regression can translate into massive power consumption. That's why Meta's Capacity Efficiency Program has pioneered a unified AI agent platform that automates both the discovery and resolution of performance issues. By encoding the domain expertise of senior engineers into reusable, composable skills, these agents are reshaping how the company manages power and engineering time. Here are five critical insights into how this system works and where it's headed.
1. The Two Fronts of Efficiency: Offense and Defense
Meta treats efficiency as a two-pronged strategy. On the offensive side, engineers proactively search for code optimizations that can make existing systems run more efficiently. These are voluntary changes that reduce resource usage without sacrificing performance. On the defensive side, the company monitors production in real time to detect regressions—unintentional performance hits—and traces them back to specific pull requests. While both approaches have been in use for years, they created a bottleneck: human engineering time. Resolving every issue manually was slow and limited scalability. The AI agent platform was designed to break that bottleneck by automating the heavy lifting on both fronts.
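The defensive workflow described above can be sketched in a few lines of Python. Everything here is a hypothetical illustration, not Meta's actual code: the `Metric` shape, the 0.1% threshold, and the PR-matching heuristic are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    service: str
    baseline_cpu_per_req: float  # e.g., millicores per request last week
    current_cpu_per_req: float   # the same metric in the current window

# At hyperscale, even a 0.1% relative increase is worth investigating.
REGRESSION_THRESHOLD = 0.001

def detect_regressions(metrics, recent_prs):
    """Flag services whose per-request cost regressed beyond the threshold,
    and trace each regression back to candidate pull requests."""
    findings = []
    for m in metrics:
        delta = (m.current_cpu_per_req - m.baseline_cpu_per_req) / m.baseline_cpu_per_req
        if delta > REGRESSION_THRESHOLD:
            # Candidate culprits: PRs that touched this service in the window.
            suspects = [pr for pr in recent_prs if pr["service"] == m.service]
            findings.append((m.service, delta, suspects))
    return findings
```

For instance, a service whose CPU per request rose from 100.0 to 100.5 (a 0.5% delta) would be flagged and paired with the pull requests that recently touched it, while a 0.05% wobble would pass unnoticed.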

2. A Unified AI Agent Platform That Encodes Expertise
Meta built a single, standardized platform where tool interfaces are unified and domain expertise is encoded into AI agents. These agents are not monolithic; they are composed of reusable skills that reflect the knowledge of senior efficiency engineers. For example, an agent might have one skill for analyzing CPU utilization patterns and another for recommending memory tweaks. By combining these skills, the platform can autonomously investigate and resolve a wide range of performance issues. This approach ensures that hard-won institutional expertise is not lost when engineers move on, and it allows the program to scale its impact across many product areas without a proportional increase in headcount.
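A composable-skill design along these lines can be sketched as a registry of small, named functions that agents chain together. The skill names, the decorator, and the heuristics inside each skill are purely illustrative assumptions, not Meta's actual platform interfaces.

```python
from typing import Callable, Dict

# Shared registry: every skill encoded on the platform is reusable by any agent.
SKILLS: Dict[str, Callable[[dict], dict]] = {}

def skill(name: str):
    """Decorator that registers a reusable skill under a stable name."""
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("analyze_cpu")
def analyze_cpu(ctx: dict) -> dict:
    # Encodes a (hypothetical) senior-engineer heuristic: sustained high
    # utilization suggests a hot code path worth profiling.
    ctx["cpu_hot"] = ctx.get("cpu_util", 0.0) > 0.8
    return ctx

@skill("recommend_memory")
def recommend_memory(ctx: dict) -> dict:
    # Another reusable heuristic: suggest shrinking oversized caches
    # when CPU is not the constraint.
    if not ctx.get("cpu_hot"):
        ctx["recommendation"] = "reduce cache size"
    return ctx

def run_agent(skill_names, ctx: dict) -> dict:
    """An agent is just an ordered composition of registered skills."""
    for name in skill_names:
        ctx = SKILLS[name](ctx)
    return ctx
```

The design choice this sketch illustrates is the key one from the section: expertise lives in the shared skill registry rather than in any one agent (or any one engineer), so new agents are assembled from existing skills instead of being written from scratch.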

3. FBDetect: The Defensive Backbone That Catches Regressions
On the defense side, Meta relies on an in-house tool called FBDetect that scans for performance regressions every week, surfacing thousands that would otherwise go unnoticed and compound power waste across the fleet. Before AI agents, engineers had to investigate each regression manually, a time-consuming process. Now, AI agents automatically triage these issues, perform root-cause analysis, and even generate fixes or mitigations. Because regressions are resolved in minutes rather than days, fewer megawatts are wasted over time. The result is a much tighter feedback loop that keeps power consumption in check.
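With thousands of regressions surfacing per scan, a triage step that ranks them by estimated fleet-wide impact is a natural first move for an agent. The linear power model and field names below are invented for illustration; they are not FBDetect's actual output format.

```python
# Assumed (made-up) linear power model: watts wasted per percentage point
# of CPU regression, per affected host.
WATTS_PER_CPU_PCT_PER_HOST = 2.0

def triage(regressions):
    """regressions: list of dicts with 'service', 'cpu_delta_pct', 'hosts'.
    Annotates each with an estimated power cost and returns them sorted
    worst-first, so agents investigate the biggest offenders before the
    long tail."""
    for r in regressions:
        r["wasted_watts"] = (
            r["cpu_delta_pct"] * r["hosts"] * WATTS_PER_CPU_PCT_PER_HOST
        )
    return sorted(regressions, key=lambda r: r["wasted_watts"], reverse=True)
```

Under this toy model, a 2% regression on 10,000 hosts (about 40 kW) would jump the queue ahead of a 0.5% regression on 1,000 hosts (about 1 kW), which is exactly the prioritization a human triager would apply.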

4. Compressing Hours of Manual Work Into Minutes
One of the most tangible benefits of the AI agent platform is the dramatic reduction in investigation time. What used to take an engineer roughly 10 hours of manual detective work—examining logs, comparing baselines, isolating changes—can now be completed by an AI agent in about 30 minutes. In some cases, agents go even further, fully automating the path from an efficiency opportunity to a ready-to-review pull request. This speed allows the Capacity Efficiency Program to handle a growing volume of optimization opportunities (the "offense" side) without overwhelming the engineering team. Engineers are freed to focus on higher-level innovation while AI handles the routine diagnostics and fixes.

5. The End Goal: A Self-Sustaining Efficiency Engine
Meta envisions a future where the AI agent platform becomes a self-sustaining efficiency engine. In this model, AI handles the long tail of performance issues—the thousands of small regressions and optimizations that engineers would never have time to address individually. The program has already recovered hundreds of megawatts of power, enough to power hundreds of thousands of American homes for a year. But the ultimate ambition is to scale this impact indefinitely, matching the growth of Meta's infrastructure without proportional increases in headcount. By standardizing tool interfaces and encoding expert knowledge, Meta is building the infrastructure for efficiency itself—a system that gets smarter over time and makes the entire fleet run leaner.

The journey toward fully autonomous efficiency is still underway, but the results so far are compelling. From cutting investigation time by an order of magnitude to recovering megawatts at scale, Meta's unified AI agents are proving that hyperscale optimization can be both automated and intelligent. As the platform expands to more product areas and matures its skills, the dream of a self-healing infrastructure inches closer to reality.