## Decision
The public website should display 20260319_075520 as the main 2-M16 deployment model again.
The April 3 run, 20260403_grouptrial_rest050, remains a useful offline benchmark: it wins on the standard holdout and event-level scores. It should not be the displayed live-control model, though, because its pseudo-live actuation metrics are worse.
## Model Roles
| Model | Wins | Loses | Justification |
|---|---|---|---|
| 20260319_075520 | Would-send precision, false REST actuation, REST true-positive rate, action calibration, cleaned pseudo-live committed joint | Offline holdout action/joint accuracy, event-level action/joint accuracy, would-send recall | Use this as the public live-control model because avoiding false actuation matters more than maximizing offline recall for a deployed robot-hand claim. |
| 20260403_grouptrial_rest050 | Offline holdout action/joint/finger accuracy, event-level action/joint/finger accuracy, would-send recall | Would-send precision, false REST actuation, REST true-positive rate, action calibration | Keep this as an offline research benchmark because it proves training can improve decoding, but it needs safer gating before it should control the public deployment story. |
## Ranking
| Metric | March 19 | April 3 | Winner |
|---|---|---|---|
| Holdout action accuracy | 89.79% | 91.83% | April 3 |
| Holdout joint accuracy | 84.66% | 86.66% | April 3 |
| Holdout non-REST finger accuracy | 85.96% (eval) / 87.01% (model card) | 88.11% | April 3 |
| Event-level action accuracy | 92.56% | 95.87% | April 3 |
| Event-level joint accuracy | 87.60% | 93.39% | April 3 |
| Event-level non-REST finger accuracy | 90.68% | 94.92% | April 3 |
| Holdout REST TPR | 98.37% | 94.79% | March 19 |
| Action ECE, lower is better | 2.32% | 3.98% | March 19 |
| Cleaned pseudo-live committed joint | 86.64% | 86.04% | March 19 |
| Cleaned pseudo-live would-send precision | 93.32% | 80.06% | March 19 |
| Cleaned pseudo-live would-send recall | 10.57% | 36.49% | April 3 |
| Cleaned pseudo-live false REST actuation | 0.12% | 6.74% | March 19 |
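The cleaned pseudo-live rows above can be read as the following metric definitions. This is a sketch, not the actual evaluation code: the tuple layout, the `REST` label string, and the exact counting rules are assumptions made for illustration.

```python
REST = "REST"  # assumed label for the no-movement class

def actuation_metrics(windows):
    """Compute pseudo-live actuation metrics from a list of
    (true_label, predicted_label, sent) tuples, one per window.

    - would-send precision: of the windows the gate would send,
      the fraction whose prediction matches a true movement label
    - would-send recall: of all true movement windows, the fraction
      correctly sent
    - false REST actuation: of all true REST windows, the fraction
      the gate would have sent anyway
    """
    sent = [w for w in windows if w[2]]
    true_sends = sum(1 for t, p, s in sent if t != REST and p == t)
    movement = [w for w in windows if w[0] != REST]
    rest = [w for w in windows if w[0] == REST]

    precision = true_sends / len(sent) if sent else 0.0
    recall = true_sends / len(movement) if movement else 0.0
    false_rest = (sum(1 for t, p, s in rest if s) / len(rest)) if rest else 0.0
    return precision, recall, false_rest
```

Under these definitions, a model can raise recall simply by sending more windows, which is exactly the April 3 trade-off diagnosed below.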
## Diagnosis
The April 3 run is not simply "worse"; it is more aggressive. It sends more true movement windows, which lifts would-send recall from 10.57% to 36.49%, but at the cost of a large precision and REST-safety regression.
The precision drop is 13.26 percentage points:
- March 19: 93.32% would-send precision
- April 3: 80.06% would-send precision
The REST-safety regression is larger in practical terms:
- March 19: 0.12% false REST actuation on the cleaned pseudo-live corpus
- April 3: 6.74% false REST actuation on the same corpus
Two things explain the regression:
- The April 3 model is less conservative around REST. Holdout REST TPR drops from 98.37% to 94.79%.
- The April 3 public replay used the raw-gated Step 7 path with postprocessing disabled, whereas the March 19 displayed metric set uses the tuned deployment family: EMA smoothing, low action/finger thresholds, applicability gating, stability, and cooldown.
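The tuned deployment family mentioned above can be sketched as a small stateful gate. All class names, parameter names, and default values here are illustrative assumptions, not the production configuration:

```python
from collections import deque

class ActionGate:
    """Sketch of a deployment gate with EMA smoothing, a confidence
    threshold, a stability window, and a post-send cooldown.
    Parameter values are placeholders, not the tuned settings."""

    def __init__(self, alpha=0.3, threshold=0.6, stability=3, cooldown=5):
        self.alpha = alpha            # EMA weight on the newest window
        self.threshold = threshold    # minimum smoothed confidence to act
        self.cooldown = cooldown      # windows to hold after a send
        self.ema = None
        self.recent = deque(maxlen=stability)
        self.cooldown_left = 0

    def step(self, probs):
        """probs: dict label -> raw model probability for one window.
        Returns the label to send, or None to hold."""
        # EMA-smooth the raw per-window probabilities.
        if self.ema is None:
            self.ema = dict(probs)
        else:
            self.ema = {
                k: self.alpha * probs[k] + (1 - self.alpha) * self.ema.get(k, 0.0)
                for k in probs
            }
        top = max(self.ema, key=self.ema.get)
        confident = top != "REST" and self.ema[top] >= self.threshold
        self.recent.append(top if confident else None)

        # Hold everything while the cooldown is running.
        if self.cooldown_left > 0:
            self.cooldown_left -= 1
            return None
        # Require the same confident label across the full stability window.
        if (len(self.recent) == self.recent.maxlen
                and len(set(self.recent)) == 1
                and self.recent[0] is not None):
            self.cooldown_left = self.cooldown
            return top
        return None
```

A raw-gated path with postprocessing disabled corresponds to skipping all of this and sending whatever the model's argmax says, which is why the April 3 replay looks so much more trigger-happy on REST windows.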
## Public Model Policy
For the public model on alphahand.org, deployment safety ranks ahead of offline split accuracy. A replacement for March 19 should beat it on:
- cleaned-corpus would-send precision
- cleaned-corpus false REST actuation
- holdout REST TPR
- zero invalid committed and sent action-finger pairs
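The replacement criteria above amount to a simple promotion check. The baseline numbers come from the tables in this memo; the function name, dict keys, and the choice of strict inequalities are illustrative assumptions:

```python
# March 19 baseline values, taken from the ranking table above.
MARCH19 = {
    "would_send_precision": 0.9332,   # cleaned pseudo-live
    "false_rest_actuation": 0.0012,   # cleaned pseudo-live
    "holdout_rest_tpr": 0.9837,
    "invalid_pairs": 0,               # invalid committed/sent action-finger pairs
}

def can_replace_march19(candidate):
    """Return True only if the candidate beats March 19 on every
    safety-ranked metric in the public-model policy."""
    return (
        candidate["would_send_precision"] > MARCH19["would_send_precision"]
        and candidate["false_rest_actuation"] < MARCH19["false_rest_actuation"]
        and candidate["holdout_rest_tpr"] > MARCH19["holdout_rest_tpr"]
        and candidate["invalid_pairs"] == 0
    )
```

By these gates the April 3 run fails on all three safety metrics, which is the whole argument for keeping it as a benchmark rather than the displayed model.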
The April 3 run remains useful for contributors because it shows where offline model training improved. The next target is to keep those offline gains while restoring March 19-level actuation precision and REST safety.
## Website Changes
- Restored the displayed 2-M16 metrics to 20260319_075520.
- Added per-event accuracy for the March 19 holdout.
- Kept the April 3 comparison in the public metrics JSON as a model-selection audit.
- Reframed the April 3 reference-bundle update as historical context rather than the featured model claim.