2026-04-24

2-M16 model-selection rollback

Restored the March 19 deployment checkpoint as the public 2-M16 model after comparing where it wins and loses against the April 3 reference run on holdout, event-level, replay, and pseudo-live safety metrics.

Historical note: archived update posts preserve the figures published at that time. For the current verified run bundles, use the results page.

Decision

The public website should display 20260319_075520 as the main 2-M16 deployment model again.

The April 3 run, 20260403_grouptrial_rest050, remains a useful offline benchmark: it wins on the standard holdout and event-level scores. It should not be the displayed live-control model, however, because its pseudo-live actuation metrics are worse.

Model Roles

| Model | Wins | Loses | Justification |
|-------|------|-------|---------------|
| 20260319_075520 | Would-send precision, false REST actuation, REST true-positive rate, action calibration, cleaned pseudo-live committed joint | Offline holdout action/joint accuracy, event-level action/joint accuracy, would-send recall | Use this as the public live-control model because avoiding false actuation matters more than maximizing offline recall for a deployed robot-hand claim. |
| 20260403_grouptrial_rest050 | Offline holdout action/joint/finger accuracy, event-level action/joint/finger accuracy, would-send recall | Would-send precision, false REST actuation, REST true-positive rate, action calibration | Keep this as an offline research benchmark because it proves training can improve decoding, but it needs safer gating before it should control the public deployment story. |

Ranking

| Metric | March 19 | April 3 | Winner |
|--------|----------|---------|--------|
| Holdout action accuracy | 89.79% | 91.83% | April 3 |
| Holdout joint accuracy | 84.66% | 86.66% | April 3 |
| Holdout non-REST finger accuracy | 85.96% eval / 87.01% model card | 88.11% | April 3 |
| Event-level action accuracy | 92.56% | 95.87% | April 3 |
| Event-level joint accuracy | 87.60% | 93.39% | April 3 |
| Event-level non-REST finger accuracy | 90.68% | 94.92% | April 3 |
| Holdout REST TPR | 98.37% | 94.79% | March 19 |
| Action ECE (lower is better) | 2.32% | 3.98% | March 19 |
| Cleaned pseudo-live committed joint | 86.64% | 86.04% | March 19 |
| Cleaned pseudo-live would-send precision | 93.32% | 80.06% | March 19 |
| Cleaned pseudo-live would-send recall | 10.57% | 36.49% | April 3 |
| Cleaned pseudo-live false REST actuation | 0.12% | 6.74% | March 19 |

Diagnosis

The April 3 run is not simply "worse." It is more aggressive. It sends more true movement windows, which improves would-send recall from 10.57% to 36.49%, but that comes with a large precision and REST-safety regression.

The precision drop is 13.26 percentage points:

  • March 19: 93.32% would-send precision
  • April 3: 80.06% would-send precision

The REST-safety regression is larger in practical terms:

  • March 19: 0.12% false REST actuation on the cleaned pseudo-live corpus
  • April 3: 6.74% false REST actuation on the same corpus
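The size of both regressions follows directly from the published figures. A quick arithmetic sketch (the dictionary keys are illustrative names, not the real metrics schema):

```python
# Published cleaned pseudo-live figures from the comparison above (percent).
march19 = {"would_send_precision": 93.32, "false_rest_actuation": 0.12}
april3 = {"would_send_precision": 80.06, "false_rest_actuation": 6.74}

# Precision drop in percentage points.
precision_drop_pp = round(march19["would_send_precision"]
                          - april3["would_send_precision"], 2)

# False REST actuation grows by a factor of roughly 56 on the same corpus.
false_rest_ratio = april3["false_rest_actuation"] / march19["false_rest_actuation"]

print(precision_drop_pp)        # 13.26
print(round(false_rest_ratio))  # 56
```

A ~56x increase in false REST actuation is why the REST-safety regression is called the larger problem in practical terms, even though the raw percentage-point gap is smaller than the precision drop.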

Two things explain the regression:

  • The April 3 model is less conservative around REST. Holdout REST TPR drops from 98.37% to 94.79%.
  • The April 3 public replay used the raw-gated Step 7 path with postprocessing disabled, while the March 19 displayed metric set uses the tuned deployment family with EMA smoothing, low action/finger thresholds, applicability gating, stability, and cooldown.
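The tuned deployment family in the second bullet can be sketched as a small stateful gate. This is a hypothetical reconstruction for illustration only: the class name, parameter values, and the omission of applicability gating and per-finger thresholds are assumptions, not the real deployment code.

```python
from collections import deque

class DeploymentGate:
    """Illustrative sketch of a tuned gating path: EMA smoothing of the
    action probability, a probability threshold, a stability window over
    recent predictions, and a post-send cooldown."""

    def __init__(self, ema_alpha=0.3, action_thresh=0.6, stability_n=3, cooldown=5):
        self.ema_alpha = ema_alpha
        self.action_thresh = action_thresh
        self.stability_n = stability_n
        self.cooldown = cooldown
        self.ema = None                      # smoothed action probability
        self.recent = deque(maxlen=stability_n)
        self.cooldown_left = 0

    def step(self, action_prob, predicted_action):
        # EMA smoothing of the raw per-window action probability.
        self.ema = action_prob if self.ema is None else (
            self.ema_alpha * action_prob + (1 - self.ema_alpha) * self.ema)
        self.recent.append(predicted_action)

        if self.cooldown_left > 0:
            self.cooldown_left -= 1
            return None  # suppressed: still cooling down after a send

        # Require the same prediction across the whole stability window.
        stable = (len(self.recent) == self.stability_n
                  and len(set(self.recent)) == 1)
        if stable and self.ema >= self.action_thresh:
            self.cooldown_left = self.cooldown
            return predicted_action  # committed: would-send
        return None  # gated: stay at REST
```

Running the April 3 replay through a raw-gated path with no smoothing, stability, or cooldown would let single noisy windows through, which is consistent with its higher false REST actuation.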

Public Model Policy

For the public model on alphahand.org, deployment safety ranks ahead of offline split accuracy. A replacement for March 19 should beat it on:

  • cleaned-corpus would-send precision
  • cleaned-corpus false REST actuation
  • holdout REST TPR
  • zero invalid committed and sent action-finger pairs
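The promotion criteria above can be expressed as a single check. The metric key names below are illustrative, not the real metrics schema, and the comparisons assume "beat" means strictly better:

```python
def passes_promotion_gate(candidate: dict, baseline: dict) -> bool:
    """Return True if a candidate run beats the baseline on every
    safety criterion listed above (keys are illustrative names)."""
    return (candidate["would_send_precision"] > baseline["would_send_precision"]
            and candidate["false_rest_actuation"] < baseline["false_rest_actuation"]
            and candidate["holdout_rest_tpr"] > baseline["holdout_rest_tpr"]
            and candidate["invalid_action_finger_pairs"] == 0)

# The April 3 run fails this gate against the March 19 figures above.
march19 = {"would_send_precision": 93.32, "false_rest_actuation": 0.12,
           "holdout_rest_tpr": 98.37, "invalid_action_finger_pairs": 0}
april3 = {"would_send_precision": 80.06, "false_rest_actuation": 6.74,
          "holdout_rest_tpr": 94.79, "invalid_action_finger_pairs": 0}
print(passes_promotion_gate(april3, march19))  # False
```

Note the gate is one-sided by design: a candidate can lose offline holdout accuracy and still be promoted, matching the policy that deployment safety ranks ahead of offline split accuracy.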

The April 3 run remains useful for contributors because it shows where offline model training improved. The next target is to keep those offline gains while restoring March 19-level actuation precision and REST safety.

Website Changes

  • Restored the displayed 2-M16 metrics to 20260319_075520.
  • Added per-event accuracy for the March 19 holdout.
  • Kept the April 3 comparison in the public metrics JSON as a model-selection audit.
  • Reframed the April 3 reference-bundle update as historical context rather than the featured model claim.