SPIKE

An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

¹Zhejiang University, ²National University of Singapore, ³Nanyang Technological University
[Figure: SPIKE teaser]

Highlights

  1. Event-triggered amortized deliberation. Strategic reasoning is reused across stable local segments and reinvoked at visual, progress, repetition, or failure boundaries.
  2. Adaptive dual-controller execution. A Strategic Controller handles planning and recovery, while a bounded Reactive Controller performs fast local execution and local override.
  3. Controller-specific hierarchical memory. SPIKE separates State-Action Memory Bank retrieval for routine execution from State-Action Knowledge Graph evidence for replanning.
  4. Better success-cost trade-off. On StarDojo Lite-100, SPIKE reaches 18.0% success versus 13.0% for CRADLE, the strongest baseline, while using 54.9% fewer tokens per task and 40.8% lower latency per step.

SPIKE Framework

SPIKE treats strategic reasoning as a budgeted resource. The Event Trigger decides when to deliberate, the Strategic Controller handles planning and recovery, the Reactive Controller executes low-cost local actions, and Hierarchical Memory retrieves controller-specific evidence.
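The routing idea described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the signal fields, the thresholds, and the names `StepSignals`, `should_escalate`, and `run_episode` are all hypothetical, and the rule for what happens once the budget runs out is a guess.

```python
from dataclasses import dataclass


@dataclass
class StepSignals:
    """Per-step observations. Every field is a hypothetical stand-in."""
    visual_change: float       # e.g. a frame-difference score in [0, 1]
    subtask_completed: bool    # a progress boundary was crossed
    repeated_actions: int      # consecutive identical actions so far
    last_action_failed: bool   # the previous action had no effect


def should_escalate(s: StepSignals,
                    visual_thresh: float = 0.5,
                    repeat_limit: int = 3) -> bool:
    """True at a visual, progress, repetition, or failure boundary."""
    return (
        s.visual_change > visual_thresh
        or s.subtask_completed
        or s.repeated_actions >= repeat_limit
        or s.last_action_failed
    )


def run_episode(signals, max_strategic_calls: int):
    """Route each step to a controller under a strategic-call budget."""
    have_plan = False
    strategic_calls = 0
    trace = []
    for s in signals:
        need_plan = (not have_plan) or should_escalate(s)
        if need_plan and strategic_calls < max_strategic_calls:
            # Deliberate: spend one unit of the strategic budget.
            strategic_calls += 1
            have_plan = True
            trace.append("strategic")
        else:
            # Stable segment, or budget exhausted (assumed fallback):
            # reuse the current plan via the reactive controller.
            trace.append("reactive")
    return trace, strategic_calls


if __name__ == "__main__":
    steps = [
        StepSignals(0.0, False, 0, False),  # no plan yet
        StepSignals(0.1, False, 1, False),  # stable
        StepSignals(0.9, False, 0, False),  # visual boundary
        StepSignals(0.0, False, 3, False),  # repetition boundary
        StepSignals(0.0, False, 0, True),   # failure, but budget is spent
    ]
    print(run_episode(steps, max_strategic_calls=3))
```

The design point the sketch tries to capture is that the plan is reused across stable steps, so the number of strategic calls scales with the number of boundaries rather than with episode length.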

Figure: SPIKE architecture. SPIKE uses event-triggered switching to reserve strategic deliberation for discontinuities while reactive execution handles stable local progress.

Figure: Strategic Controller workflow. When escalation is triggered, the Strategic Controller gathers state evidence, retrieves memory, reasons over subtasks, and proposes the next actions.
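To make the controller-specific retrieval concrete, here is a toy sketch of a two-tier memory. It assumes the State-Action Memory Bank can be approximated by tag-overlap lookup over successful actions and the State-Action Knowledge Graph by an adjacency list of recorded transitions. The class, its methods, and the scoring rule are hypothetical and are not taken from the SPIKE codebase.

```python
from collections import defaultdict


class HierarchicalMemory:
    """Toy two-tier memory; the structure and scoring are assumptions."""

    def __init__(self):
        # Memory bank: (state tags, action) pairs from successful steps.
        self.bank = []
        # Knowledge graph: state -> [(action, next_state, succeeded)].
        self.graph = defaultdict(list)

    def record(self, state, tags, action, next_state, succeeded):
        """Store one transition in the graph, and in the bank if it worked."""
        if succeeded:
            self.bank.append((frozenset(tags), action))
        self.graph[state].append((action, next_state, succeeded))

    def retrieve_for_reactive(self, tags, k=1):
        """Up to k past actions whose tags overlap most with `tags`."""
        query = set(tags)
        ranked = sorted(self.bank,
                        key=lambda entry: len(entry[0] & query),
                        reverse=True)
        return [action for stored, action in ranked[:k] if stored & query]

    def retrieve_for_strategic(self, state):
        """All recorded transitions out of `state`, failures included."""
        return list(self.graph.get(state, []))


if __name__ == "__main__":
    memory = HierarchicalMemory()
    memory.record("farm_gate", {"farm", "gate"}, "open_gate", "farm_inside", True)
    memory.record("farm_gate", {"farm", "gate"}, "walk_forward", "farm_gate", False)
    print(memory.retrieve_for_reactive({"gate"}))
    print(memory.retrieve_for_strategic("farm_gate"))
```

In this sketch the reactive path only sees actions that worked before, while the strategic path also sees failed transitions, which is the kind of evidence a replanning step would plausibly want.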

Experiments

Experimental Setup

StarDojo Lite-100

Lite-100 is the main evaluation split. It contains 100 long-horizon tasks from StarDojo across farming, crafting, exploration, combat, and social interaction, with a fixed difficulty split of 56 easy, 23 medium, and 21 hard tasks.

We use it to measure both task success and cost-aware behavior. Lite-100 success rate (SR) follows the benchmark step-cap protocol, while Budgeted SR uses matched LLM-call budgets for easy, medium, and hard tasks. We also report tokens per task, latency per step, and Recovery/Stuck Ratio to evaluate whether SPIKE improves success without simply spending more compute.
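As a rough illustration of how a "mean ± spread" success rate over repeated runs can be computed, the sketch below aggregates per-task outcomes from three runs. It assumes the reported spread is the sample standard deviation across runs, which the page does not state. The outcome lists are made-up placeholders, not experimental data.

```python
from statistics import mean, stdev


def success_rate(outcomes):
    """Percentage of tasks solved in one run (outcomes are booleans)."""
    return 100.0 * sum(outcomes) / len(outcomes)


def aggregate(runs):
    """Mean and sample standard deviation of SR across repeated runs."""
    rates = [success_rate(run) for run in runs]
    return mean(rates), stdev(rates)


if __name__ == "__main__":
    # Hypothetical outcomes: 100 tasks per run, 3 runs.
    runs = [
        [True] * 10 + [False] * 90,
        [True] * 12 + [False] * 88,
        [True] * 11 + [False] * 89,
    ]
    avg, spread = aggregate(runs)
    print(f"{avg:.1f}±{spread:.1f}%")
```

Budgeted SR could be computed with the same function after marking as failed any task whose run exceeded its LLM-call budget; the budget values themselves are not given here.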

100 tasks, 5 task families, 3 repeated runs

Category     Total  Easy  Medium  Hard
Farming         21    14       3     4
Crafting        14     7       4     3
Exploration     28    15       8     5
Combat          12     3       6     3
Social          25    17       6     2

RDR2 Transfer

Red Dead Redemption 2 (RDR2) serves as a secondary cross-game transfer test. It keeps the same adaptive dual-controller design but changes the visual domain, action space, interaction pace, and task language.

  • 13 selected long-horizon tasks
  • Three repeated runs per task
  • Navigation, search, following, protection, and combat-like objectives
  • Shared protocol across all compared methods

Figure: Pareto trade-off. Effectiveness-efficiency trade-off on Lite-100, showing that SPIKE improves success while using fewer tokens and lower latency than high-cost baselines.

Figure: Mechanistic analysis. Mechanistic analysis showing how controller allocation and hierarchical memory contribute to recovery, efficiency, and sustained task progress.

Figure: Qualitative analysis. Qualitative comparison of goal-reaching behavior, illustrating how SPIKE maintains progress over long-horizon interaction traces.

Demo 1: Representative successful trajectory under SPIKE.

Demo 2: Additional qualitative rollout showing long-horizon progress.

Main Result

Main comparison on StarDojo Lite-100 and cross-game transfer on Red Dead Redemption 2. Higher SR is better; lower token use and latency indicate better efficiency.

StarDojo Lite-100

#  Method                        Lite-100 SR  Budgeted SR  Tokens / Task (k)  Latency / Step (s)  Recovery/Stuck Ratio
1  SPIKE (Default Qwen3.5-397B)  18.0±1.7%    21.6±1.7%    168.1±5.7          43.5±4.7            2.75±0.12
2  CRADLE                        13.0±1.7%    10.7±0.6%    372.5±8.8          73.5±8.2            2.49±0.18
3  StarDojo baseline             12.3±1.4%    12.3±1.4%    295.9±8.4          48.7±4.5            2.37±0.18
4  Reflexion-like                8.7±1.0%     10.4±0.6%    178.5±6.6          35.9±3.8            2.02±0.14
5  Voyager-like                  7.3±0.6%     8.1±1.0%     238.3±6.7          56.5±6.0            1.73±0.15
6  ReAct-like                    6.7±0.6%     9.6±0.6%     102.6±5.4          24.5±3.2            1.61±0.11

RDR2 Transfer

#  Method             SR          Tokens / Step (k)  Latency / Step (s)
1  SPIKE              56.4±7.7%   3.4±0.7            54.1±4.6
2  CRADLE             43.6±7.7%   6.6±0.8            102.2±8.4
3  StarDojo baseline  33.3±5.1%   5.7±0.5            67.5±4.1
4  Reflexion-like     17.9±5.1%   3.5±0.2            41.4±3.6
5  ReAct-like         15.4±2.6%   2.1±0.2            28.2±2.7
6  Voyager-like       7.7±2.6%    4.8±0.3            72.4±3.9

Summary on StarDojo Lite-100:

+5.0 pts   Lite-100 SR over CRADLE
+9.3 pts   Budgeted SR over StarDojo baseline
-54.9%     Tokens per task versus CRADLE
-40.8%     Latency per step versus CRADLE
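The summary figures can be reproduced from the mean values in the Lite-100 table. The short check below is only arithmetic over those reported means; the dictionary layout and the helper names `point_gain` and `relative_reduction` are illustrative choices.

```python
# Mean values copied from the StarDojo Lite-100 table.
spike = {"sr": 18.0, "budgeted_sr": 21.6, "tokens_k": 168.1, "latency_s": 43.5}
cradle = {"sr": 13.0, "budgeted_sr": 10.7, "tokens_k": 372.5, "latency_s": 73.5}
stardojo = {"sr": 12.3, "budgeted_sr": 12.3, "tokens_k": 295.9, "latency_s": 48.7}


def point_gain(ours, theirs):
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(ours - theirs, 1)


def relative_reduction(ours, theirs):
    """Percentage reduction relative to the baseline value."""
    return round(100.0 * (theirs - ours) / theirs, 1)


if __name__ == "__main__":
    print(point_gain(spike["sr"], cradle["sr"]))                       # 5.0
    print(point_gain(spike["budgeted_sr"], stardojo["budgeted_sr"]))   # 9.3
    print(relative_reduction(spike["tokens_k"], cradle["tokens_k"]))   # 54.9
    print(relative_reduction(spike["latency_s"], cradle["latency_s"])) # 40.8
```

Note that the Budgeted SR gain is measured against the StarDojo baseline (12.3%), which is the strongest baseline on that metric, while the other three figures are measured against CRADLE.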

BibTeX

Coming soon.