Autoresearch as a thing has shipped: Karpathy's autoresearch loop, an LLM agent iterating on a small script against a fast automatic loss, is the cleanest statement of the idea.
In an earlier post I described applying this loop to the Ecosystem Demography model (ED v3.0).
Fire forced us to adapt the autoresearch loop into a shape it does not have in the ML-benchmark setting. The search space covers functional form (grounded in fire ecology) and parameters simultaneously. The verifier is a multi-channel observational benchmark with per-region breakdowns and residual maps, not a loss curve. The post walks through the fire problem, the two-layer extension that fell out of the work, the result, and where we think the paradigm transfers and where it still breaks.
ED's fire submodel predicts burned area, the fraction of each grid cell that burns each month, validated against satellite-derived GFED4.1s burned-area maps.
Fire is one of the largest interannual contributors to the global carbon balance: savanna and tropical-deforestation fires alone emit on the order of 2 GtC/yr.
Most physical-science domains share this shape. The literature defines the form space, multiple partial observational verifiers each capture a different facet of the truth, and the goal is loosely specified at the start. The lessons here are meant as thought vectors for anyone working in a similar setting.
The Karpathy loop searches over code. That works when the script is small and the loss is fast. In our setting the form space and the parameter space have very different structures and need very different search methods.
The form space is vast. Fire ecology has decades of literature with hundreds of candidate mechanisms across at least four named functional regimes (productivity-limited, fuel-limited, ignition-limited, weather-driven).
The LLM agent handles form-layer search. It searches over combinations of mechanisms drawn from fire ecology, sometimes taking a published form as is, sometimes adapting one to fit the rest of the formula, sometimes inventing a new term. The constraint is ecological grounding: whatever shows up has to be explainable and verifiable against the literature. Pausas-Ribeiro fuel hump for fuel-limited regimes. van der Werf intermediate-productivity hypothesis for monthly GPP gating. Archibald air-temperature ignition sigmoid.
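To make that concrete, here is a purely illustrative sketch of the kind of multiplicative composition the form layer proposes. The gates, parameter names, and shapes are invented for this post; they are not Model A or Model C.

```python
import numpy as np

def burned_area(params, gpp, fuel, tair):
    """Illustrative multiplicative form: three literature-anchored gates.

    All arrays are monthly gridded fields; every name here is hypothetical.
    """
    # van der Werf-style intermediate-productivity gate on monthly GPP
    gpp_gate = np.exp(-((gpp - params["gpp_opt"]) / params["gpp_width"]) ** 2)
    # Pausas-Ribeiro-style fuel hump: rises with fuel load, then decays
    fuel_gate = (fuel / (fuel + params["fuel_half"])) * np.exp(-fuel / params["fuel_sat"])
    # Archibald-style air-temperature ignition sigmoid
    ign_gate = 1.0 / (1.0 + np.exp(-(tair - params["temp_mid"]) * params["temp_k"]))
    return params["scale"] * gpp_gate * fuel_gate * ign_gate
```

Each gate has a citation the agent can point to, which is the grounding constraint doing its job.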
Optuna handles parameter-layer search. Once a form is fixed, TPE, the Tree-structured Parzen Estimator that ships as Optuna's default sampler, tunes the form's parameters against the benchmark score.
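A minimal sketch of that inner loop, assuming the illustrative `burned_area` above, driver arrays (`gpp`, `fuel`, `tair`) already in memory, and a hypothetical `benchmark_score` that compares the prediction to GFED4.1s and returns an ILAMB-style Overall (higher is better):

```python
import optuna

def objective(trial):
    # illustrative parameter ranges for the hypothetical form above
    params = {
        "gpp_opt":   trial.suggest_float("gpp_opt", 0.05, 5.0, log=True),
        "gpp_width": trial.suggest_float("gpp_width", 0.01, 2.0, log=True),
        "fuel_half": trial.suggest_float("fuel_half", 0.01, 5.0, log=True),
        "fuel_sat":  trial.suggest_float("fuel_sat", 0.1, 20.0, log=True),
        "temp_mid":  trial.suggest_float("temp_mid", 270.0, 310.0),
        "temp_k":    trial.suggest_float("temp_k", 0.05, 2.0, log=True),
        "scale":     trial.suggest_float("scale", 1e-4, 1.0, log=True),
    }
    pred = burned_area(params, gpp, fuel, tair)   # offline forward run
    return benchmark_score(pred, gfed_obs)        # hypothetical scorer

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=500)
print(study.best_params)
```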
    +----------+     +-----------+     +-----------+     +---------+
    |   Form   |     | Parameter |     |           |     | Shapley |
    |  Layer   +---->|   Layer   +---->| Benchmark +---->|  Ablate |
    |  (LLM)   |     | (Optuna)  |     |           |     |         |
    +----------+     +-----------+     +-----------+     +----+----+
         ^                                                    |
         |                      feedback                      |
         '----------------------------------------------------'
The two layers then recurse. Each form candidate gets fit by Optuna, scored by the real benchmark, and the result feeds back to the agent for the next form iteration. After our 8-mechanism formula (Model A) hit 0.66 Overall, the natural next question was which of those eight mechanisms were actually carrying weight.
The simple version of this question is ablation: remove one piece of your model, refit, and measure how much the score drops. If nothing changes, it was not contributing. The problem is that in a multiplicative formula, contributions are conditional. Mechanism A might look useless when B is present, but critical when B is absent, because they encode overlapping signals. Removing one mechanism at a time misses these interactions entirely, and running every possible subset (full combinatorial ablation) scales as \(2^N\), which is intractable past about ten components.
The Shapley value, borrowed from cooperative game theory, handles exactly this: each mechanism is credited with its marginal contribution to the score, averaged over every possible subset of the other mechanisms, so overlapping signals get split between the pieces that carry them instead of being missed.
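Concretely, with \(M\) the set of mechanisms and \(v(S)\) the benchmark score after refitting the formula restricted to a subset \(S\) (one natural choice of value function), mechanism \(i\) receives

\[
\phi_i = \sum_{S \subseteq M \setminus \{i\}} \frac{|S|!\,\bigl(|M| - |S| - 1\bigr)!}{|M|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr).
\]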
What the Shapley decomposition revealed was that our explicit fuel-biomass hump contributed only 5 percent of the total, because monthly GPP was already carrying the productivity signal that fuel was supposed to carry. Without this step we would have shipped that redundancy and never known. We dropped five mechanisms, refit the remaining three, and the smaller 12-parameter model (Model C) scored higher than the 8-mechanism version. Any autoresearch loop that composes mechanisms needs something like this. The question is always “is this piece actually doing work, or is another piece already covering for it?”
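A runnable sketch of the exact computation, assuming a `score(subset)` callable that refits the formula restricted to that mechanism subset (e.g. an Optuna refit) and returns the benchmark Overall:

```python
from itertools import combinations
from math import factorial

def shapley_attribution(mechanisms, score):
    """Exact Shapley attribution over mechanisms.

    `score(subset)` is assumed to refit the formula using only the mechanisms
    in `subset` and return the benchmark Overall. Exact enumeration costs 2^N
    refits, which is fine up to roughly ten mechanisms.
    """
    n = len(mechanisms)
    cache = {}

    def s(subset):
        # memoise: each distinct subset is refit and scored exactly once
        key = frozenset(subset)
        if key not in cache:
            cache[key] = score(key)
        return cache[key]

    phi = {}
    for m in mechanisms:
        others = [x for x in mechanisms if x != m]
        total = 0.0
        for k in range(len(others) + 1):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                total += weight * (s(set(subset) | {m}) - s(set(subset)))
        phi[m] = total
    return phi
```

For the eight Model A mechanisms that is \(2^8 = 256\) refits; much beyond ten components you would swap the enumeration for Monte Carlo sampling over permutations.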
We reproduced the official ILAMB benchmark for the TRENDY v14 reference models locally, then scored our offline ED with the same harness, driven by inputs saved from a prior ED run. The full reproduction lives in the repository. Model C, the 3-mechanism, 12-parameter form, scored 0.6713 Overall, rank 1 out of 24.
| Rank | Model | Bias | RMSE | Seasonal | Spatial | Overall |
|---|---|---|---|---|---|---|
| 1 | ED Model C (ours) | 0.728 | 0.506 | 0.846 | 0.771 | 0.6713 |
| 2 | CLASSIC | 0.738 | 0.507 | 0.782 | 0.797 | 0.6665 |
| 3 | CLM6.0 | 0.759 | 0.474 | 0.758 | 0.838 | 0.6606 |
| 4 | CLM-FATES | 0.725 | 0.525 | 0.802 | 0.707 | 0.6568 |
| 11 | ED stock | 0.681 | 0.489 | 0.439 | 0.290 | 0.4774 |
Three things bit us during this work that I think generalize to any autoresearch project outside the ML-benchmark setting.
The first is that the optimizer will fit whatever you feed it, including your bugs. Our offline pipeline computed an “accumulated dryness” input through a Python function with two quiet bugs that zeroed it across most of the Sahel and savanna, where most fire actually happens. Optuna fit Model C to this broken input and scored rank 1 anyway. We caught it only when Lei plugged Model C into a coupled ED run and got values 2000x off. We fixed the input, retrained, and every parameter shifted by 5 to 10x, but Model C still won. The functional form is what generalizes across deployments. Parameters are slack the formula uses to absorb whatever input you hand it. Validate your inputs as seriously as you validate your outputs.
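The cheapest insurance we have found is a loud sanity check on every derived driver before the optimizer sees it. A sketch, assuming the driver has been reduced to a 2-D (lat, lon) time mean and that you have a mask of regions where the signal is known to exist (names hypothetical):

```python
import numpy as np

def check_driver(name, field, must_be_active_mask, min_nonzero_frac=0.5):
    """Fail loudly if a derived input is implausibly zero where it cannot be.

    `field` and `must_be_active_mask` are (lat, lon) arrays; the mask marks
    cells (e.g. Sahel and savanna) where the driver must carry signal.
    """
    frac = np.count_nonzero(field[must_be_active_mask]) / must_be_active_mask.sum()
    if frac < min_nonzero_frac:
        raise ValueError(
            f"{name}: only {frac:.0%} of cells are nonzero inside the active "
            "region; the input is probably broken upstream of the optimizer"
        )
```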
The second is that the benchmark is not the goal. Same incident, second lesson. Our offline pipeline applied a post-hoc rescale that the benchmark rewarded but the real coupled run could not absorb. The optimizer had learned to depend on the rescale. Removing it barely moved the offline score but the coupled-run mismatch disappeared. Every autoresearch result needs a validation channel the optimizer never sees. For us it was the coupled port into ED. For other domains it might be a deployment test, an out-of-distribution evaluation, or a downstream user who consumes your output differently than the benchmark does.
The third is that single-scalar metrics get gamed silently. ILAMB has multiple aggregation conventions for combining its component scores into one Overall number. Early on I picked the one where Model C beat CLM6.0 by 0.04. The native aggregation gave a margin of 0.005. Same components, different headline. Lei caught it by reminding me to look at the maps and the components, not just the aggregate. Track the full vector of component scores. Look at residual maps. Hold out years that the optimizer never saw.
A few practical notes for anyone setting up a similar loop.
Pin down what “better” means before you start. Every autoresearch loop optimizes toward something, and if that something is vague, you will re-litigate it mid-run. We did, three times. Write down the exact metric, the exact split, the exact baseline, and what counts as a win.
Use a stack of evaluations, not a single number. One scalar rewards gaming. Multiple independent views of the same result (component breakdowns, spatial or distributional checks, held-out subsets) make it much harder for the optimizer to find a shortcut that technically improves the score without improving the thing you care about.
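A sketch of what that stack can look like for monthly gridded fields. The four views loosely mirror the component columns in the table above, but the arithmetic here is deliberately simple and is not ILAMB's scoring:

```python
import numpy as np

def evaluate(pred, obs):
    """Return several independent views of a (time, lat, lon) prediction.

    `pred` and `obs` are monthly burned-fraction arrays with time a multiple
    of 12; these are raw diagnostics, not ILAMB's normalized scores.
    """
    views = {"bias": float(np.mean(pred - obs)),
             "rmse": float(np.sqrt(np.mean((pred - obs) ** 2)))}
    # seasonal: correlation of the global-mean annual cycles
    cyc_pred = pred.reshape(-1, 12, *pred.shape[1:]).mean(axis=(0, 2, 3))
    cyc_obs = obs.reshape(-1, 12, *obs.shape[1:]).mean(axis=(0, 2, 3))
    views["seasonal"] = float(np.corrcoef(cyc_pred, cyc_obs)[0, 1])
    # spatial: correlation of the time-mean maps
    views["spatial"] = float(np.corrcoef(pred.mean(axis=0).ravel(),
                                         obs.mean(axis=0).ravel())[0, 1])
    return views
```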
Everything above describes autoresearch given a frame. A literature-bounded form space, a multi-channel verifier, a clear goal. All of the current systems assume the frame exists.
The hard part of science is choosing the frame. Which mechanisms even belong in the search space? Which observational target captures what you actually care about? These questions do not have automatic verifiers and the search space for them is not bounded.
The field is converging on this bottleneck. A 25,000-run study by Schmidgall and collaborators
I am working on this problem with my friend. More on that when we have something to show.