What changed
This release marks a substantive architecture and evaluation update for 2-M16, not just another offline metric refresh.
The public 2-M16 bundle is now built around the March 18 deployment candidate:
- Session: `combined_20260317_211414`
- Run: `20260318_042115`
The current recommendation for live use is no longer the narrower split-fixed mixed-rest run. The main checkpoint is now trained on all currently available 2-M16 sessions: those are the only sessions we have, and the smaller holdout-only model was too weak to justify shipping as the default live candidate.
Why this is a breakthrough
- The active-finger head removes the old `OPEN/CLOSE + NONE` failure mode from non-REST decoding.
- The site now publishes pseudo-live replay metrics that reuse the Step 7 decision path offline, so the public bundle can report expected control behavior instead of only raw classifier accuracy.
- Step 7 now preserves the uncertainty-aware speed scalar all the way through the command shaper, so `modulate_actuation_speed` is honored end-to-end.
- The public run bundle now includes alpha-band topomaps so readers can see the signal structure behind the decoding story rather than only confusion matrices.
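The speed-modulation path can be sketched as follows. Only the name `modulate_actuation_speed` comes from the text above; the confidence mapping, the floor value, and the command shape are illustrative assumptions, not the actual Step 7 implementation.

```python
def modulate_actuation_speed(base_speed: float, confidence: float,
                             floor: float = 0.2) -> float:
    """Scale commanded speed by decoder confidence (illustrative mapping:
    linear in confidence, clamped to a floor so low-confidence commands
    still move, just slowly)."""
    scalar = max(floor, min(1.0, confidence))
    return base_speed * scalar


def shape_command(action: str, base_speed: float, confidence: float) -> dict:
    """Build the final actuation command, carrying the speed scalar
    through the shaper instead of silently discarding it."""
    return {"action": action,
            "speed": modulate_actuation_speed(base_speed, confidence)}
```

The point of the end-to-end fix is the second function: the scalar must survive into the shaped command rather than being computed and then dropped.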
Headline public results
- Primary holdout action accuracy: 78.39%
- Primary holdout finger accuracy on non-REST windows: 82.03%
- Primary holdout joint accuracy: 75.19%
- Primary holdout REST TPR: 57.11%
- Repeated-split action accuracy mean / std: 82.23% / 4.35%
- Repeated-split joint accuracy mean / std: 78.66% / 3.77%
These are still offline evaluation numbers, not a guarantee of live-control performance, but they are the most appropriate public baseline for the current deployment candidate.
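For clarity on how the repeated-split figures are summarized, a minimal sketch (the split machinery itself is omitted, and whether the published std is sample or population standard deviation is an assumption here):

```python
import statistics


def repeated_split_summary(split_accuracies: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of accuracy across repeated
    random train/holdout splits."""
    return statistics.mean(split_accuracies), statistics.stdev(split_accuracies)
```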
Replay benchmarks now published
The old site mostly stopped at holdout metrics. This update adds a fuller ladder:
- Auxiliary quiet-rest replay: 96.03% action accuracy, REST F1 0.980
- All-session chronological replay before actuation gating: 86.40% action, 82.70% joint
- Pseudo-live replay on the combined corpus: 84.77% committed joint accuracy, 91.53% would-send precision on non-REST windows
- Pseudo-live replay on the unseen March 17 mixed session: 70.44% committed action accuracy, 70.32% committed joint accuracy, 66.67% would-send precision
That unseen March 17 pseudo-live benchmark is currently the most informative public realism check because it is closer to the deployed decision path than a raw holdout split.
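For readers unfamiliar with the two pseudo-live metrics, here is one plausible way to compute them from replay logs. The event format and the exact gating semantics are assumptions for illustration, not the actual Step 7 decision path.

```python
def pseudo_live_metrics(events: list[tuple[str, str, bool]]) -> tuple[float, float]:
    """events: (predicted_label, true_label, committed) per replay window.

    - committed accuracy: accuracy over windows where the decision path
      actually committed a command
    - would-send precision: among committed non-REST predictions, the
      fraction whose prediction matched the ground truth
    """
    committed = [(p, t) for p, t, c in events if c]
    committed_acc = sum(p == t for p, t in committed) / len(committed)
    sends = [(p, t) for p, t in committed if p != "REST"]
    would_send_precision = sum(p == t for p, t in sends) / len(sends)
    return committed_acc, would_send_precision
```

Under this reading, would-send precision answers the practical question "when the gate lets a movement command through, how often is it the right one," which is why it is reported separately from raw accuracy.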
Validation benchmark vs deployment model
We also kept a smaller realism benchmark for configuration validation:
- Train on `2-M16_20260216_150056_01`
- Add `2-M16_20260315_145838_01` as quiet-rest auxiliary training only
- Replay on `2-M16_20260317_190134`
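The split above can be captured as a small configuration sketch. The session IDs come from the text; the keys and the disjointness check are illustrative.

```python
# Hypothetical benchmark configuration mirroring the split described above.
VALIDATION_BENCHMARK = {
    "train": ["2-M16_20260216_150056_01"],
    "aux_quiet_rest": ["2-M16_20260315_145838_01"],
    "replay": ["2-M16_20260317_190134"],
}


def roles_disjoint(cfg: dict) -> bool:
    """A session must not appear in more than one role, otherwise the
    replay benchmark would leak training data."""
    all_ids = [s for role in cfg.values() for s in role]
    return len(all_ids) == len(set(all_ids))
```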
That validation-only model performs much worse than the all-session deployment model:
- Committed action accuracy: 39.05%
- Committed joint accuracy: 36.31%
- Would-send precision on non-REST windows: 4.55%
That is why the website now distinguishes validation benchmarks from the recommended deployment checkpoint. With the current data volume, the deployment model should use all currently available subject sessions.
What the topomaps add
The new alpha-band topomaps help explain the decoding behavior:
- REST alpha is strongly dominated by
TP10versusTP9, so absolute maps mostly reflect scale. - Rest-relative action maps show OPEN and CLOSE both dominated by decreased
TP10power with smallerTP9increases. - Finger-level variation is concentrated on
TP10and thenTP9, whileAF7andAF8move much less. - Split-halves maps show drift, but not catastrophic collapse, which supports the all-session training strategy.
These topomaps do not replace the decoder metrics, but they do make the current model story more interpretable.
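One way to read "rest-relative" here, sketched under the assumption that the maps are log-ratios of band power relative to REST (the site's exact normalization is not stated); only the channel names come from the text:

```python
import math

CHANNELS = ["TP9", "AF7", "AF8", "TP10"]  # montage named in the text


def rest_relative_map(action_alpha: dict, rest_alpha: dict) -> dict:
    """Per-channel rest-relative alpha change as a log power ratio:
    negative means alpha power dropped relative to REST, positive
    means it increased, zero means no change."""
    return {ch: math.log(action_alpha[ch] / rest_alpha[ch]) for ch in CHANNELS}
```

With this convention, the "decreased `TP10` power with smaller `TP9` increases" pattern shows up as a strongly negative `TP10` value next to a mildly positive `TP9` value.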
Legacy comparison note
The published 1-M16 bundle remains on the website as a historical reference point. It was produced with an earlier model and legacy evaluation methods, so its numbers should not be treated as directly comparable to the current 2-M16 deployment candidate.
What changed in live control
Two practical live-control issues were resolved in this cycle:
- The active-finger decoding change removes non-REST `NONE` leakage from the published benchmark family.
- Step 7 speed modulation is now actually applied end-to-end, so action uncertainty can influence final hand speed instead of being silently discarded.
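The active-finger change can be illustrated with a minimal two-head decode. The head names and label sets are illustrative, but the key property matches the text: the finger head is only consulted on non-REST windows and its label set contains no `NONE` class, so `NONE` cannot leak into non-REST output.

```python
def decode_window(action_probs: dict, finger_probs: dict):
    """Two-head decode: the action head picks REST vs a movement; the
    active-finger head (no NONE class) is consulted only when the
    action head says the window is non-REST."""
    action = max(action_probs, key=action_probs.get)
    if action == "REST":
        return action, None  # finger head never runs on REST windows
    finger = max(finger_probs, key=finger_probs.get)
    return action, finger
```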
What to watch next
- The next hard gate is real live validation with synchronized prediction logs and video review.
- The public site now has the scaffolding to report those outcomes cleanly: holdout metrics, replay metrics, pseudo-live behavior, and signal evidence all live in one place.
- If the next live session holds up, the website is now structured to show a true control milestone instead of just another offline experiment.