SPIKE treats strategic reasoning as a budgeted resource. The Event Trigger decides when to deliberate, the Strategic Controller handles planning and recovery, the Reactive Controller executes low-cost local actions, and Hierarchical Memory retrieves controller-specific evidence.
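The division of labor above can be sketched as a small control loop. This is a minimal illustration, not the SPIKE implementation: the class and field names (`Observation`, `Agent`, `stuck`, `subgoal_done`, `llm_call_budget`) and the trigger rule are assumptions made for the example, the strategic step stands in for an expensive LLM call, and Hierarchical Memory is left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    step: int
    stuck: bool = False         # hypothetical signal: no progress detected
    subgoal_done: bool = False  # hypothetical signal: current subgoal finished


@dataclass
class Agent:
    llm_call_budget: int        # maximum number of strategic (LLM) calls
    llm_calls_used: int = 0
    plan: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def should_deliberate(self, obs):
        """Event Trigger: deliberate only on a salient event and while budget remains."""
        event = obs.stuck or obs.subgoal_done or not self.plan
        return event and self.llm_calls_used < self.llm_call_budget

    def strategic_step(self, obs):
        """Strategic Controller: plan or recover (placeholder for an LLM call)."""
        self.llm_calls_used += 1
        self.plan = ["recover"] if obs.stuck else ["move", "interact"]
        return self.plan.pop(0)

    def reactive_step(self, obs):
        """Reactive Controller: cheap local action that follows the current plan."""
        return self.plan.pop(0) if self.plan else "wait"

    def act(self, obs):
        if self.should_deliberate(obs):
            source, action = "strategic", self.strategic_step(obs)
        else:
            source, action = "reactive", self.reactive_step(obs)
        self.log.append((obs.step, source, action))
        return action


agent = Agent(llm_call_budget=2)
for obs in [Observation(0), Observation(1), Observation(2, stuck=True), Observation(3)]:
    agent.act(obs)
print(agent.log)
```

In this toy run the agent deliberates at step 0 (no plan yet) and at step 2 (stuck), and falls back to the reactive path at step 3 because the budget of two strategic calls is already spent.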
Lite-100 is the main evaluation split. It contains 100 long-horizon tasks from StarDojo across farming, crafting, exploration, combat, and social interaction, with a fixed difficulty split of 56 easy, 23 medium, and 21 hard tasks.
We use it to measure both task success and cost-aware behavior: Lite-100 SR follows the benchmark step-cap protocol, while Budgeted SR uses matched LLM-call budgets for easy, medium, and hard tasks. We also report tokens per task, latency per step, and Recovery/Stuck Ratio to evaluate whether SPIKE improves success without simply spending more compute.
| Category | Total | Easy | Medium | Hard |
|---|---|---|---|---|
| Farming | 21 | 14 | 3 | 4 |
| Crafting | 14 | 7 | 4 | 3 |
| Exploration | 28 | 15 | 8 | 5 |
| Combat | 12 | 3 | 6 | 3 |
| Social | 25 | 17 | 6 | 2 |
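
As a rough illustration of the Budgeted SR and tokens-per-task metrics described above, the sketch below scores a list of episode records. It assumes the episodes were run under the LLM-call budget protocol (separate from the step-capped Lite-100 SR runs) and that a task counts as solved only if it finishes within the budget for its difficulty. The `Episode` record and the budget values are placeholders invented for the example; the actual budgets are not stated here.

```python
from dataclasses import dataclass

# Placeholder budgets for illustration only; the real per-difficulty
# LLM-call budgets are not given in this document.
LLM_CALL_BUDGET = {"easy": 20, "medium": 40, "hard": 80}


@dataclass
class Episode:
    difficulty: str   # "easy", "medium", or "hard"
    succeeded: bool   # task goal reached before the episode ended
    llm_calls: int    # LLM calls consumed by the episode
    tokens: int       # total tokens consumed by the episode


def budgeted_success_rate(episodes, budgets=LLM_CALL_BUDGET):
    """Fraction of episodes that succeed within their difficulty's call budget."""
    if not episodes:
        return 0.0
    hits = sum(
        1 for e in episodes
        if e.succeeded and e.llm_calls <= budgets[e.difficulty]
    )
    return hits / len(episodes)


def tokens_per_task_k(episodes):
    """Mean tokens per task, in thousands."""
    return sum(e.tokens for e in episodes) / len(episodes) / 1000


runs = [
    Episode("easy", True, 10, 100_000),
    Episode("easy", True, 25, 200_000),    # succeeded, but over the easy budget
    Episode("hard", True, 80, 300_000),
    Episode("medium", False, 5, 200_000),
]
print(budgeted_success_rate(runs), tokens_per_task_k(runs))
```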

Red Dead Redemption 2 (RDR2) serves as a secondary cross-game transfer test. It keeps the same adaptive dual-controller design but changes the visual domain, action space, interaction pace, and task language.
Main comparison on StarDojo Lite-100 and cross-game transfer on Red Dead Redemption 2. Higher SR is better; lower token use and latency indicate better efficiency.
| # | Method | Lite-100 SR | Budgeted SR | Tokens / Task (k) | Latency / Step (s) | Recovery/Stuck Ratio |
|---|---|---|---|---|---|---|
| 1 | SPIKE (Default Qwen3.5-397B) | 18.0±1.7% | 21.6±1.7% | 168.1±5.7 | 43.5±4.7 | 2.75±0.12 |
| 2 | CRADLE | 13.0±1.7% | 10.7±0.6% | 372.5±8.8 | 73.5±8.2 | 2.49±0.18 |
| 3 | StarDojo baseline | 12.3±1.4% | 12.3±1.4% | 295.9±8.4 | 48.7±4.5 | 2.37±0.18 |
| 4 | Reflexion-like | 8.7±1.0% | 10.4±0.6% | 178.5±6.6 | 35.9±3.8 | 2.02±0.14 |
| 5 | Voyager-like | 7.3±0.6% | 8.1±1.0% | 238.3±6.7 | 56.5±6.0 | 1.73±0.15 |
| 6 | ReAct-like | 6.7±0.6% | 9.6±0.6% | 102.6±5.4 | 24.5±3.2 | 1.61±0.11 |

Cross-game transfer on Red Dead Redemption 2:

| # | Method | SR | Tokens / Step (k) | Latency / Step (s) |
|---|---|---|---|---|
| 1 | SPIKE | 56.4±7.7% | 3.4±0.7 | 54.1±4.6 |
| 2 | CRADLE | 43.6±7.7% | 6.6±0.8 | 102.2±8.4 |
| 3 | StarDojo baseline | 33.3±5.1% | 5.7±0.5 | 67.5±4.1 |
| 4 | Reflexion-like | 17.9±5.1% | 3.5±0.2 | 41.4±3.6 |
| 5 | ReAct-like | 15.4±2.6% | 2.1±0.2 | 28.2±2.7 |
| 6 | Voyager-like | 7.7±2.6% | 4.8±0.3 | 72.4±3.9 |
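
To make the efficiency comparison concrete, the sketch below computes SPIKE's relative savings against CRADLE from the mean values in the two tables. It ignores the ± spreads, so it is a back-of-the-envelope check rather than a significance test; the helper names `relative_reduction` and `compare` are made up for this example.

```python
# Mean values copied from the two results tables (± spreads are ignored).
LITE_100 = {
    "SPIKE": {"sr": 18.0, "tokens_per_task_k": 168.1, "latency_s": 43.5},
    "CRADLE": {"sr": 13.0, "tokens_per_task_k": 372.5, "latency_s": 73.5},
}
RDR2 = {
    "SPIKE": {"sr": 56.4, "tokens_per_step_k": 3.4, "latency_s": 54.1},
    "CRADLE": {"sr": 43.6, "tokens_per_step_k": 6.6, "latency_s": 102.2},
}


def relative_reduction(ours, baseline):
    """Fraction by which `ours` is lower than `baseline` (0.25 means 25% lower)."""
    return 1.0 - ours / baseline


def compare(table, ours="SPIKE", baseline="CRADLE"):
    """SR gain in percentage points plus percent reduction for each cost metric."""
    a, b = table[ours], table[baseline]
    report = {"sr_gain_points": round(a["sr"] - b["sr"], 1)}
    for key in a:
        if key != "sr":
            pct = 100 * relative_reduction(a[key], b[key])
            report[f"{key}_reduction_pct"] = round(pct, 1)
    return report


print(compare(LITE_100))
print(compare(RDR2))
```

On the reported means, SPIKE uses roughly 55% fewer tokens per task and has about 41% lower per-step latency than CRADLE on Lite-100 while scoring 5.0 points higher in SR; on RDR2 the reductions are about 48% (tokens per step) and 47% (latency) with a 12.8-point SR gain.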
Coming soon.