2026-04-24

2-M16 offline reference bundle and event-level metrics

Documented the April 3 2-M16 offline reference model, published the curated clone-ready dataset path, refreshed reports, and added a quick per-event accuracy check alongside per-window metrics.

Historical note: archived update posts preserve the figures published at that time. For the current verified run bundles, use the results page.

What changed

Update, April 24, 2026: the April 3 run is no longer the featured public deployment model. The model-selection rollback restores March 19 for public deployment metrics because it wins on both would-send precision and false REST actuation rate. The April 3 numbers remain useful as an offline reference benchmark.

The featured 2-M16 site bundle now points at the reference checkpoint used by the published repo entrypoint:

  • Dataset: combined_20260319_081200_pruned_rest_events_0_1_2
  • Run: 20260403_grouptrial_rest050
  • Final dataset: 12,447 windows shaped 64 x 4 (samples x channels)
  • Published channels: TP9, AF7, AF8, TP10

This remains a clone-friendly reference path for people who want to validate the current process, train a comparable model, and start improving the system without collecting new data first.

Current headline metrics

Primary holdout, per window:

  • Action accuracy: 91.83%
  • Joint action+finger accuracy: 86.66%
  • Non-REST finger accuracy: 88.11%
  • REST TPR / precision: 94.79% / 84.59%
  • Action ECE / non-REST finger ECE: 3.98% / 1.60%
  • Committed OPEN/CLOSE + NONE: 0.00%
  • Committed REST + active-finger: 0.00%

Quick event-level check, majority vote over held-out windows grouped by session and event:

  • Events scored: 121 total, 118 non-REST, 3 REST
  • Event-level action accuracy: 95.87%
  • Event-level joint accuracy: 93.39%
  • Event-level non-REST finger accuracy: 94.92%

The event-level score is not a replacement for per-window reporting. It answers a different practical question: when a movement event produces several overlapping windows, does the event mostly land on the right action and finger?

Replay and runtime context

The April 3 model stays strong on the standard holdout and full-session replay, but the pseudo-live replay still shows where the runtime layer needs work:

  • Core full-session replay: 89.42% action, 85.73% joint, 88.33% non-REST finger.
  • Published-corpus pseudo-live replay: 86.04% committed joint, 80.06% would-send precision, 6.74% false REST actuation.
  • March 17 realism pseudo-live replay: 68.80% committed joint and 27.41% would-send precision.

The important invariant still holds: committed and sent invalid pairs remain zero in the published diagnostics. The harder realism replay remains intentionally visible because it shows the next improvement target instead of hiding it behind the best split.

Website and repo cleanup

  • Historical note: these changes were made before the later April 24 rollback audit. The current displayed website bundle now reflects March 19 again, while April 3 remains documented as an offline benchmark.
  • After rollback, the March 19 scatter and topomap assets are restored with the displayed deployment bundle.
  • The public metrics bundle now includes a machine-readable event_level block.
  • The report now includes event-level misses so contributors can inspect which movement events should be targeted first.
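The post does not show the schema of the machine-readable event_level block, but a plausible shape, using the event-level figures reported earlier and hypothetical field names, might look like:

```json
{
  "event_level": {
    "events_scored": 121,
    "non_rest_events": 118,
    "rest_events": 3,
    "action_accuracy": 0.9587,
    "joint_accuracy": 0.9339,
    "non_rest_finger_accuracy": 0.9492
  }
}
```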

Links