SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

Experimental Setup

StarDojo Lite-100

Lite-100 is the main evaluation split. It contains 100 long-horizon tasks from StarDojo across farming, crafting, exploration, combat, and social interaction, with a fixed difficulty split of 56 easy, 23 medium, and 21 hard tasks.

We use it to measure both task success and cost-aware behavior. Lite-100 SR (success rate) follows the benchmark step-cap protocol, while Budgeted SR uses matched LLM-call budgets for easy, medium, and hard tasks. We also report tokens per task, latency per step, and the Recovery/Stuck Ratio to check whether SPIKE improves success without simply spending more compute.

100 tasks

5 task families

3 repeated runs

Category	Total	Easy	Medium	Hard
Farming	21	14	3	4
Crafting	14	7	4	3
Exploration	28	15	8	5
Combat	12	3	6	3
Social	25	17	6	2
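
The two success-rate metrics described above can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: the `Episode` record, the `STEP_CAP` and `CALL_BUDGET` values, and the toy log are all hypothetical, and both metrics are treated as post-hoc filters over logged episodes, which may differ from how the actual protocol terminates runs. The page does not say how the ± values are computed; the sample standard deviation over per-run success rates is used here purely as an assumption.

```python
from dataclasses import dataclass
from statistics import mean, stdev

# Hypothetical limits: the real step cap and per-difficulty call budgets
# are not stated on this page.
STEP_CAP = 200
CALL_BUDGET = {"easy": 40, "medium": 80, "hard": 120}


@dataclass
class Episode:
    task_id: str
    difficulty: str  # "easy", "medium", or "hard"
    run: int         # index of the repeated run
    succeeded: bool
    steps: int
    llm_calls: int


def within_step_cap(ep: Episode) -> bool:
    """Success under a step cap (stand-in for Lite-100 SR)."""
    return ep.succeeded and ep.steps <= STEP_CAP


def within_call_budget(ep: Episode) -> bool:
    """Success under a per-difficulty LLM-call budget (stand-in for Budgeted SR)."""
    return ep.succeeded and ep.llm_calls <= CALL_BUDGET[ep.difficulty]


def success_rate(episodes, counts_as_success):
    """Return (mean, spread) of per-run success rates, in percent."""
    flags_by_run = {}
    for ep in episodes:
        flags_by_run.setdefault(ep.run, []).append(counts_as_success(ep))
    per_run = [100.0 * sum(flags) / len(flags) for flags in flags_by_run.values()]
    # Assumption: spread is the sample standard deviation across runs.
    spread = stdev(per_run) if len(per_run) > 1 else 0.0
    return mean(per_run), spread


# Toy log: two tasks, two repeated runs (illustrative values only).
episodes = [
    Episode("farm-01", "easy", 0, True, 35, 20),
    Episode("mine-02", "hard", 0, False, 200, 150),
    Episode("farm-01", "easy", 1, True, 60, 55),
    Episode("mine-02", "hard", 1, True, 180, 110),
]

lite_sr = success_rate(episodes, within_step_cap)
budgeted_sr = success_rate(episodes, within_call_budget)
```

Keeping the success predicate as a parameter lets the same aggregation serve both metrics, so the two columns differ only in which limit is applied.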

RDR2 Transfer

RDR2 (Red Dead Redemption 2) serves as a secondary cross-game transfer test. It keeps the same adaptive dual-controller design but changes the visual domain, action space, interaction pace, and task language.

13 selected long-horizon tasks
Three repeated runs per task
Navigation, search, following, protection, and combat-like objectives
Shared protocol across all compared methods

Pareto trade-off: effectiveness-efficiency trade-off on Lite-100, showing that SPIKE improves success with fewer tokens and lower latency than high-cost baselines.

Mechanistic analysis: how controller allocation and hierarchical memory contribute to recovery, efficiency, and sustained task progress.

Qualitative analysis: comparison of goal-reaching behavior, illustrating how SPIKE maintains progress over long-horizon interaction traces.

Demo 1: representative successful trajectory under SPIKE.

Demo 2: additional qualitative rollout showing long-horizon progress.

Main Result

Main comparison on StarDojo Lite-100 and cross-game transfer on Red Dead Redemption 2. Higher SR is better; lower token use and latency indicate better efficiency.

StarDojo Lite-100

#	Method	Lite-100 SR	Budgeted SR	Tokens / Task (k)	Latency / Step (s)	Recovery/Stuck Ratio
1	SPIKE (Default Qwen3.5-397B)	18.0±1.7%	21.6±1.7%	168.1±5.7	43.5±4.7	2.75±0.12
2	CRADLE	13.0±1.7%	10.7±0.6%	372.5±8.8	73.5±8.2	2.49±0.18
3	StarDojo baseline	12.3±1.4%	12.3±1.4%	295.9±8.4	48.7±4.5	2.37±0.18
4	Reflexion-like	8.7±1.0%	10.4±0.6%	178.5±6.6	35.9±3.8	2.02±0.14
5	Voyager-like	7.3±0.6%	8.1±1.0%	238.3±6.7	56.5±6.0	1.73±0.15
6	ReAct-like	6.7±0.6%	9.6±0.6%	102.6±5.4	24.5±3.2	1.61±0.11
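
One way to read the Lite-100 table in Pareto terms is to check which methods are not dominated on success rate and token cost together. The sketch below uses only the mean Lite-100 SR and Tokens / Task values from the table and ignores the ± terms and the other columns; the helper names `dominates` and `pareto_front` are illustrative, not part of SPIKE.

```python
# (Lite-100 SR in %, tokens per task in thousands), means copied from the table.
results = {
    "SPIKE": (18.0, 168.1),
    "CRADLE": (13.0, 372.5),
    "StarDojo baseline": (12.3, 295.9),
    "Reflexion-like": (8.7, 178.5),
    "Voyager-like": (7.3, 238.3),
    "ReAct-like": (6.7, 102.6),
}


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one.

    Higher success rate is better; lower token use is better.
    """
    sr_a, tok_a = a
    sr_b, tok_b = b
    no_worse = sr_a >= sr_b and tok_a <= tok_b
    strictly_better = sr_a > sr_b or tok_a < tok_b
    return no_worse and strictly_better


def pareto_front(points):
    """Return the names of methods that no other method dominates."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


front = pareto_front(results)
```

On these two axes the non-dominated methods are SPIKE and ReAct-like: ReAct-like spends the fewest tokens, while SPIKE has a higher success rate and lower token use than each of the other four baselines. Adding latency as a third axis would change the result, so this is a reading of one slice of the table rather than a full comparison.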

RDR2 Transfer

#	Method	SR	Tokens / Step (k)	Latency / Step (s)
1	SPIKE	56.4±7.7%	3.4±0.7	54.1±4.6
2	CRADLE	43.6±7.7%	6.6±0.8	102.2±8.4
3	StarDojo baseline	33.3±5.1%	5.7±0.5	67.5±4.1
4	Reflexion-like	17.9±5.1%	3.5±0.2	41.4±3.6
5	ReAct-like	15.4±2.6%	2.1±0.2	28.2±2.7
6	Voyager-like	7.7±2.6%	4.8±0.3	72.4±3.9

+5.0 pts Lite-100 SR over CRADLE

+9.3 pts Budgeted SR over StarDojo baseline

-54.9% Tokens versus CRADLE

-40.8% Latency versus CRADLE
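
These headline figures follow from the mean values in the Lite-100 table. A short check, assuming the percentage-point gains are plain differences and the token and latency figures are relative changes against the CRADLE mean (the function names are illustrative):

```python
# Mean values copied from the Lite-100 table; the ± terms are ignored here.
lite100 = {
    "SPIKE": {"sr": 18.0, "budgeted_sr": 21.6, "tokens_k": 168.1, "latency_s": 43.5},
    "CRADLE": {"sr": 13.0, "budgeted_sr": 10.7, "tokens_k": 372.5, "latency_s": 73.5},
    "StarDojo baseline": {"sr": 12.3, "budgeted_sr": 12.3, "tokens_k": 295.9, "latency_s": 48.7},
}


def point_gain(metric, method, baseline):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(lite100[method][metric] - lite100[baseline][metric], 1)


def relative_change(metric, method, baseline):
    """Relative change in percent against the baseline, rounded to one decimal."""
    new = lite100[method][metric]
    old = lite100[baseline][metric]
    return round(100.0 * (new - old) / old, 1)


headline = {
    "sr_vs_cradle_pts": point_gain("sr", "SPIKE", "CRADLE"),
    "budgeted_sr_vs_stardojo_pts": point_gain("budgeted_sr", "SPIKE", "StarDojo baseline"),
    "tokens_vs_cradle_pct": relative_change("tokens_k", "SPIKE", "CRADLE"),
    "latency_vs_cradle_pct": relative_change("latency_s", "SPIKE", "CRADLE"),
}
```

For example, the token figure is (168.1 - 372.5) / 372.5, which is about -54.9%, and the latency figure is (43.5 - 73.5) / 73.5, which is about -40.8%.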
