
Abstract

An autoresearch system is a bounded experimental loop operated by one or more agents under explicit human-defined policy.


Framework

1. What an autoresearch system is

An autoresearch system is a bounded experimental loop operated by one or more agents under explicit human-defined policy.

Its purpose is not merely to generate ideas, but to:

  • turn ideas into comparable experiments
  • turn experiments into logged evidence
  • turn evidence into keep/discard decisions
  • turn decisions into cumulative improvement

In this sense, autoresearch is a research operating framework, not just a prompting technique.

2. Core system components

Every usable autoresearch system has at least these components:

2.1 Objective

A clearly defined optimization target.

Examples:

  • lower validation loss
  • improve benchmark score
  • reduce latency under fixed quality
  • reduce bug count without breaking tests

A system without a stable objective will drift.

2.2 Scope

A defined editable surface and a defined protected surface.

Examples:

  • editable: train.py
  • read-only: evaluation harness, datasets, infra, dependencies

A system without scope boundaries will optimize by cheating or destabilizing the environment.

2.3 Experiment loop

A repeatable process that turns hypotheses into results.

Minimal loop:

  1. inspect current best state
  2. form one hypothesis
  3. implement one focused change
  4. run experiment
  5. parse metrics
  6. log result
  7. keep or revert
  8. continue
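
The loop above can be sketched in a few lines of Python. Everything here is illustrative: `propose`, `apply_change`, `evaluate`, `log`, and `revert` are hypothetical stand-ins for whatever harness actually performs those steps, not a real API.

```python
def run_loop(baseline_score, propose, apply_change, evaluate, log, revert, budget):
    """Run up to `budget` experiments, keeping only strict improvements.

    All callables are caller-supplied stand-ins for the real harness.
    """
    best = baseline_score
    for i in range(budget):
        hypothesis = propose(best)        # 2. form one hypothesis
        apply_change(hypothesis)          # 3. implement one focused change
        score = evaluate()                # 4-5. run experiment, parse metrics
        kept = score > best
        log({"iter": i, "hypothesis": hypothesis,
             "score": score, "kept": kept})  # 6. log result
        if kept:
            best = score                  # 7. keep
        else:
            revert()                      # 7. revert
    return best                           # 8. continue until budget runs out
```

Note that the loop never trusts an unverified change: every iteration ends in either a logged keep or a logged revert, so the best-known state is always well defined.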

2.4 Evaluation

A stable way to compare runs.

Evaluation must be:

  • explicit
  • consistent
  • resistant to accidental drift
  • separated from the editable experiment surface when possible

2.5 Logging

A durable record of what was tried and what happened.

Without logging, the system cannot accumulate organizational memory.
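
One simple way to get a durable, restart-safe record is an append-only JSONL file. This is a sketch, not a prescribed schema; the field names and paths are assumptions.

```python
import json
import time
from pathlib import Path

def log_result(log_path, entry):
    """Append one experiment record to an append-only JSONL log.

    Field names are illustrative; the only hard requirement is
    one self-contained JSON object per line.
    """
    entry = {"ts": time.time(), **entry}
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def load_history(log_path):
    """Reload the full experiment history, e.g. after an interrupted run."""
    p = Path(log_path)
    if not p.exists():
        return []
    return [json.loads(line) for line in p.read_text().splitlines() if line]
```

Append-only files survive crashes mid-run: at worst the last partial line is lost, while every completed experiment remains readable.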

2.6 Reversion

A mechanism for safely discarding failed or non-beneficial changes.

Autoresearch requires many failed attempts. Reversion makes them survivable.
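
The smallest possible reversion mechanism is a snapshot-and-restore pair over the editable surface. In practice a git commit per kept change plays this role; the file-copy version below is a sketch of the same idea, with paths chosen by the caller.

```python
import shutil

def snapshot(editable, backup):
    """Record the current best-known version of the editable surface."""
    shutil.copy2(editable, backup)

def revert(editable, backup):
    """Discard a failed change by restoring the last good snapshot."""
    shutil.copy2(backup, editable)
```

The key property is that `revert` is cheap and unconditional: a failed experiment costs one copy, which is what makes many failed attempts survivable.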

2.7 Policy

The rules governing behavior.

Examples:

  • keep only if improvement exceeds threshold
  • prefer simpler implementations when performance is close
  • stop after repeated crashes in one search direction
  • do not change evaluation code
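
The first two policies above compose into a single decision rule. This sketch assumes higher scores are better and that some complexity measure (lines changed, parameter count) is available; the threshold values are arbitrary illustrations.

```python
def decide(best_score, new_score, new_complexity, best_complexity,
           min_gain=0.01, tie_margin=0.002):
    """Keep only clear improvements; on an effective tie, prefer the
    simpler implementation. Thresholds are illustrative, not canonical."""
    gain = new_score - best_score
    if gain > min_gain:
        return "keep"                 # improvement exceeds threshold
    if abs(gain) <= tie_margin and new_complexity < best_complexity:
        return "keep"                 # performance is close; simpler wins
    return "discard"
```

Making the rule a pure function of logged quantities keeps decisions auditable: anyone replaying the log can verify why each change was kept.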

2.8 Human control

A clearly defined override model.

Humans retain authority to:

  • stop the loop
  • change priorities
  • narrow or widen scope
  • redefine the objective
  • inspect intermediate results

3. The minimal autoresearch contract

A practical autoresearch setup should define at least the following:

  • Goal: what is being optimized
  • Metric: how success is measured
  • Budget: time / compute / iteration limits
  • Scope: what may and may not be changed
  • Loop: how each experiment proceeds
  • State: where best-known state is stored
  • Log: where outcomes are written
  • Decision rule: when to keep, discard, or retry
  • Stop / override: how humans intervene

If any of these are missing, the system is underspecified.
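
One way to make the contract checkable rather than implicit is to write it down as a single structured object. The field names below mirror the list above; every value shown is an illustrative placeholder, not a recommended configuration.

```python
from dataclasses import dataclass

@dataclass
class AutoresearchContract:
    """The minimal autoresearch contract as one explicit object.

    If a field cannot be filled in, the system is underspecified.
    """
    goal: str               # what is being optimized
    metric: str             # how success is measured
    budget_iterations: int  # iteration limit
    editable: list          # scope: what may be changed
    read_only: list         # scope: what may not be changed
    state_path: str         # where best-known state is stored
    log_path: str           # where outcomes are written
    min_gain: float         # decision rule: keep only above this threshold
    stop_signal_path: str   # human override: this file's presence stops the loop

# Hypothetical example instance for a training-loss task.
contract = AutoresearchContract(
    goal="lower validation loss",
    metric="val_loss",
    budget_iterations=50,
    editable=["train.py"],
    read_only=["eval/", "data/"],
    state_path="state/best.json",
    log_path="logs/experiments.jsonl",
    min_gain=0.01,
    stop_signal_path="STOP",
)
```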

4. System flow

A canonical autoresearch flow looks like this:

  1. Human defines objective and constraints
  2. Human or harness prepares initial baseline
  3. Agent reads current state and history
  4. Agent proposes next experiment
  5. Agent executes change within scope
  6. Agent runs evaluation
  7. Agent records outcome
  8. Policy decides keep / discard / retry
  9. Loop continues until stopped or budget exhausted

5. Information layers

Autoresearch benefits from separating information into layers.

Layer A: worldview

High-level intent and philosophy.

Examples:

  • manifesto
  • design principles
  • non-goals

Layer B: framework

Reusable system-level structure.

Examples:

  • loop definition
  • logging schema
  • rollback rules
  • search policy

Layer C: program

Task-specific operating instructions for an agent.

Examples:

  • optimize benchmark X
  • improve training metric Y
  • reduce test runtime in project Z

Layer D: runtime state

Mutable execution artifacts.

Examples:

  • results files
  • logs
  • best-known commits
  • crash records

6. Single-agent vs multi-agent

6.1 Single-agent

Best when:

  • the search space is narrow
  • coordination cost would dominate
  • interpretability is important

Advantages:

  • simpler state management
  • easier auditability
  • fewer race conditions

6.2 Multi-agent

Best when:

  • there are multiple semi-independent search directions
  • one agent can propose while another evaluates
  • one agent can summarize history while others explore

Possible roles:

  • explorer: proposes and runs experiments
  • critic: reviews results, rejects weak ideas
  • historian: maintains logs and synthesized lessons
  • manager: allocates search budget across directions

Multi-agent systems require stronger state discipline; otherwise they amplify chaos.

7. Design tensions

Every autoresearch system must navigate these tensions:

7.1 Exploration vs exploitation

  • explore novel directions
  • exploit proven gains

7.2 Speed vs rigor

  • run more experiments quickly
  • ensure experiments are interpretable and trustworthy

7.3 Simplicity vs local performance gain

  • keep the system evolvable
  • avoid complexity unless gains clearly justify it

7.4 Autonomy vs governance

  • allow uninterrupted iteration
  • preserve human authority and safety boundaries

8. Failure modes

Common failure modes include:

  • objective ambiguity
  • metric corruption or drift
  • uncontrolled scope expansion
  • poor logging discipline
  • repeated retries on low-quality ideas
  • no rollback path
  • excessive complexity accumulation
  • inability to resume from interrupted state

A mature framework assumes these failures will happen and designs around them.

9. Stop conditions

Autoresearch should not rely on “good vibes” to stop.

Typical stop conditions:

  • human interrupt
  • compute or time budget exhausted
  • repeated crash threshold reached
  • no meaningful improvement across N iterations
  • search direction declared saturated
  • external dependency or precondition missing
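
The first four conditions can be checked mechanically against the experiment log. This sketch assumes log records shaped like `{"kept": bool, "crashed": bool}`; the record shape and all thresholds are assumptions, not a fixed interface.

```python
def should_stop(log, *, budget, crash_limit=3, patience=10,
                stop_file_present=False):
    """Return the first triggered stop condition, or None to continue.

    `log` is a list of per-experiment records; thresholds are illustrative.
    """
    if stop_file_present:
        return "human interrupt"
    if len(log) >= budget:
        return "budget exhausted"
    recent = log[-crash_limit:]
    if len(recent) == crash_limit and all(e.get("crashed") for e in recent):
        return "repeated crashes"
    recent = log[-patience:]
    if len(recent) == patience and not any(e.get("kept") for e in recent):
        return "no improvement"
    return None
```

Evaluating this check once per iteration turns the stop conditions into policy the system enforces on itself, rather than vibes a human has to watch for.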

10. What the framework is optimizing for

A good autoresearch framework optimizes not only for better task results, but for better research throughput under control.

That means it should produce:

  • reliable comparisons
  • durable history
  • safe failure recovery
  • efficient iteration
  • understandable progress

11. Bottom line

The true artifact of autoresearch is not a single winning experiment.

It is a bounded, legible, repeatable research loop that can continue producing useful improvements over time.

Sources and references

Source file: autoresearch/FRAMEWORK.md

Source directory: /srv/project/harness-engineering
