CRAFT

Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies.

Keyu Chen1 · Nanfei Ye2 · Yida Wang2 · Wenchao Sun1 · Danqi Zhao1 · Hao Cheng1 · Sifa Zheng1

1School of Vehicle and Mobility, Tsinghua University · 2Li Auto Inc

CRAFT result on a safety-critical Crossing Bicycle Flow scenario. The full gallery below compares pre-trained and CRAFT-fine-tuned behavior side by side.

Overview

Dense counterfactual proxy, grounded residual correction.

Open-loop supervised driving policies can look strong near expert states but fail once their own actions shift future observations. CRAFT closes this gap without changing the base architecture.

3 policy families: hierarchical planning, vision-language-action, and vocabulary scoring
24 representative before/after closed-loop video comparisons
A CRAFT score of 100 on every selected qualitative comparison
CRAFT proxy-residual motivation
CRAFT treats counterfactual feedback as a dense proxy and closed-loop interaction as grounded residual correction.

Method

A proxy-residual recipe for closed-loop post-training.

The framework assigns distinct statistical roles to local counterfactual evaluation and real interaction, then regularizes adaptation toward reliable pre-trained behavior.

01

Dense counterfactual proxy

For each on-policy state, candidate trajectories are scored for efficiency, safety, and rule compliance, then converted into group-normalized advantages.
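The group normalization step can be sketched in a few lines; the candidate scores below are illustrative placeholders for the combined efficiency/safety/rule-compliance evaluation, not values from the paper.

```python
import numpy as np

def group_normalized_advantages(scores, eps=1e-8):
    """Turn per-candidate trajectory scores into advantages by
    standardizing within the candidate group of one on-policy state."""
    scores = np.asarray(scores, dtype=np.float64)
    return (scores - scores.mean()) / (scores.std() + eps)

# Illustrative combined scores for four candidate trajectories
# at a single on-policy state:
adv = group_normalized_advantages([0.9, 0.4, 0.7, 0.2])
```

Because advantages are normalized within each group, only the relative ranking of candidates at a state matters, which keeps the proxy signal well-scaled across very different scenarios.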

02

Grounded residual correction

Executed rollouts supply event-driven rewards that correct proxy errors exposed by closed-loop interaction, using a value-free dual-clipped update.
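A value-free dual-clipped surrogate (in the spirit of dual-clip PPO) can be sketched as follows; `eps` and `c` are illustrative hyperparameters, not the paper's settings.

```python
import numpy as np

def dual_clipped_objective(ratio, adv, eps=0.2, c=3.0):
    """Dual-clipped surrogate with no learned value baseline (sketch).
    For adv >= 0 this reduces to the standard PPO clipped objective;
    for adv < 0 it is additionally bounded below by c * adv, so one
    large importance ratio cannot dominate the update."""
    surr = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
    return np.where(adv < 0, np.maximum(surr, c * adv), surr).mean()

# A large ratio paired with a negative event-driven advantage stays bounded:
obj = dual_clipped_objective(np.array([150.0]), np.array([-1.0]))  # obj == -3.0
```

The extra lower bound for negative advantages is what keeps sparse, event-driven penalties from destabilizing the residual correction.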

03

On-policy self-distillation

An exponential moving-average teacher initialized from the pre-trained checkpoint keeps adaptation close to reliable behavior while allowing corrective shifts.
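The teacher update itself is a standard exponential moving average; the decay value here is illustrative.

```python
import numpy as np

def ema_teacher_update(teacher, student, decay=0.999):
    """Move each teacher parameter a small step toward the student.
    Because the teacher is initialized from the pre-trained checkpoint,
    a decay close to 1 keeps it anchored to reliable behavior while
    still admitting gradual corrective shifts."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [np.zeros(3)]   # stands in for pre-trained weights
student = [np.ones(3)]    # stands in for the adapting policy
teacher = ema_teacher_update(teacher, student, decay=0.9)
```

The slow-moving teacher then serves as the distillation target that regularizes the student during fine-tuning.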

CRAFT framework overview
CRAFT combines trajectory-level counterfactual supervision, closed-loop residual feedback, and asymmetric KL self-distillation.

Visual Results

Visualization Before and After CRAFT Fine-Tuning.

Select an algorithm and scenario to compare the same ego-view rollout before and after CRAFT. Both videos auto-play; the shorter one holds its final frame until the other finishes, then the pair restarts together.


Driving score
Pre-trained
CRAFT fine-tuned

The two videos are stacked vertically.

Empirical Picture

CRAFT improves interaction-sensitive driving skills.

The paper reports consistent gains over the pre-trained policies on the full Bench2Drive closed-loop benchmark. The cards below summarize the main-table Driving Score and Success Rate improvements.

Fine-grained ability radar plots
Fine-grained ability profiles across representative driving policies.
Scaling and reward dynamics
Scaling behavior and reward dynamics show more consistent gains as fine-tuning data increases.
Training stability analysis
Training stability improves through broader proxy signal and dual-clipped residual correction.
Proxy + Residual

Division of labor

The counterfactual proxy provides broad optimization signal, while grounded residual feedback supplies the missing correction for interaction-dependent failures. Self-distillation stabilizes the shift around the pre-trained policy manifold.

Citation

Reference

Citation information will be updated when the public paper record is available.

@misc{craft2026,
  title     = {CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Closed-Loop Autonomous Driving},
  author    = {Keyu Chen and Nanfei Ye and Yida Wang and Wenchao Sun and Danqi Zhao and Hao Cheng and Sifa Zheng},
  year      = {2026}
}