
Meta launches Meta-Agent on GitHub to continuously optimize harness performance

Published by
SectorHQ Editorial


Meta released Meta-Agent on GitHub, an automatic harness optimizer that lifts benchmark scores from 67% to 87% on tau-bench without any human-provided labels, according to the write-up accompanying the repository.

Key Facts

  • Key company: Meta

Meta-Agent is more than a code drop; it is a self-tuning harness in which a large language model (LLM) rewrites its own execution plan on the fly. The open-source repo, posted to GitHub by the Canvas organization, ships a Python-3.11-compatible framework that wraps Claude-based agents in an "outer loop" that iteratively proposes configuration tweaks, runs a benchmark, and keeps the best-performing version. In the write-up that accompanies the repo, the authors report a jump from 67% to 87% on the tau-bench suite, an improvement of 20 percentage points, without any human-provided labels (GitHub – canvas-org/meta-agent).
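The propose-evaluate-keep-best cycle described above can be sketched in a few lines. This is a minimal illustration, not the repo's actual API: `evaluate` and `propose` here are hypothetical stand-ins for the benchmark run and the proposer model.

```python
# Minimal sketch of a propose-evaluate-keep-best outer loop.
# In Meta-Agent, `evaluate` would run the benchmark against a harness
# configuration and `propose` would be an LLM reading trace logs; both
# are stand-in callables here.

def optimize(base_config, evaluate, propose, iterations=5):
    """Return the best-scoring config seen across all iterations."""
    best_config = base_config
    best_score = evaluate(base_config)
    current = base_config
    for _ in range(iterations):
        current = propose(current)   # proposer emits a tweaked config
        score = evaluate(current)    # re-run the benchmark on the candidate
        if score > best_score:       # keep only strict improvements
            best_config, best_score = current, score
    return best_config, best_score
```

Because only the best candidate survives, a bad proposal can never regress the harness below its starting point, which is what makes the loop safe to run unattended.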

The heart of the system is a two‑stage pipeline. First, a baseline evaluation runs the agent against a YAML‑defined benchmark using a vanilla configuration (see configs/vanilla.py). The command line looks like a typical Python module call: `python -m meta_agent.eval_runner --benchmark benchmarks/example/benchmark.yaml --config configs/vanilla.py --name baseline --model claude-haiku-4-5`. The second stage launches the “outer loop” (`meta_agent.outer_loop`), which repeats the evaluation for a configurable number of iterations (the write‑up uses five) while a proposer model—by default Claude‑Opus‑4‑6—reads the trace logs and emits a new config file. Each iteration feeds the updated harness back into the benchmark, and the framework stores candidates under `experience/` for later inspection. The repo’s README spells out the exact CLI flags, making the process reproducible for anyone with an Anthropic API key (and optionally an OpenAI key for the LLM judge).

Beyond the mechanics, Meta-Agent offers a declarative way to describe the tasks the agent should solve. A simple YAML snippet defines a task name, a natural-language instruction, a workspace directory, and a verification command that exits with status 0 on success. For example, a "resolve-billing" task might instruct the agent to look up a double-charged customer and fix the account, while a `check.py` script validates the outcome. The harness can be customized further by writing a Python config that returns a `ClaudeAgentOptions` object, where developers can tweak system prompts, permission modes, turn limits, and even enable adaptive thinking. The repository provides starter configs in `configs/` and a full SDK reference in `SKILL.md`, allowing teams to bootstrap from a vanilla baseline or bring their own bespoke logic.
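A verification command is ultimately just a script whose exit status signals success. Here is a hedged sketch of what a `check.py` for the hypothetical "resolve-billing" task could look like; the account file name and schema are invented for illustration, not taken from the repo.

```python
import json
import sys

def is_resolved(account):
    """Return True if the duplicate charge has been removed.

    The account schema here is invented for illustration; a real task's
    check script would validate whatever state the instruction describes.
    """
    charges = account.get("charges", [])
    # A resolved account has no two charges sharing the same reference id.
    refs = [c["ref"] for c in charges]
    return len(refs) == len(set(refs))

if __name__ == "__main__":
    with open("account.json") as f:
        account = json.load(f)
    # Exit 0 on success so the harness can mark the task as solved.
    sys.exit(0 if is_resolved(account) else 1)
```

Because the harness only observes the exit code, the check script is free to use any validation logic, from a simple JSON assertion like this one to a full integration test.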

Reproducing the reported tau-bench gains requires a few extra steps. The authors advise installing the benchmark suite directly from Sierra Research's GitHub (`pip install "tau2 @ git+https://github.com/sierra-research/tau2-bench.git"`), then pointing the outer loop at the `benchmarks/tau3/` directory. The holdout benchmark (`benchmarks/tau3/benchmark_holdout.yaml`) lets users verify that improvements generalize beyond the training set. When the outer loop finishes, the best candidate configuration is saved alongside its performance metrics, giving a clear audit trail of how the harness evolved. The entire workflow is encapsulated in an MIT-licensed repo, meaning anyone can fork, modify, or integrate the optimizer into their own AI pipelines without legal hurdles.
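Checking that a gain generalizes then amounts to comparing the saved baseline and optimized scores on the training and holdout benchmarks. A minimal sketch, where the field names and the overfitting threshold are assumptions for illustration:

```python
def generalization_report(train, holdout):
    """Compare baseline-vs-optimized deltas on train and holdout splits.

    Each argument is a {"baseline": float, "optimized": float} mapping;
    the field names are assumptions for illustration.
    """
    train_gain = train["optimized"] - train["baseline"]
    holdout_gain = holdout["optimized"] - holdout["baseline"]
    # Flag likely overfitting when the holdout gain lags far behind
    # the training gain (threshold chosen arbitrarily here).
    overfit = holdout_gain < 0.5 * train_gain
    return {"train_gain": train_gain, "holdout_gain": holdout_gain,
            "possible_overfit": overfit}
```

A harness whose holdout gain tracks its training gain has genuinely improved; one whose holdout gain collapses has merely memorized the training benchmark.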

Meta-Agent's release signals a subtle shift in how AI developers think about tooling: instead of manually tuning prompts or hard-coding heuristics, they can hand the optimization problem to an LLM that learns from its own execution traces. While the repo is still a proof-of-concept (its results hinge on Claude models and a specific benchmark), the 20-point lift on tau-bench suggests that continual harness optimization is practical. As more teams adopt the framework, the community will likely contribute alternative proposer models, richer verification suites, and domain-specific task libraries, turning the open-source project into a collaborative hub for self-improving AI agents.

Sources

Primary source: GitHub – canvas-org/meta-agent (repository and accompanying write-up)
