Run detail

Subject 2-M16 — Run 2026-03-19

Date: 2026-03-19

Run note

This March 19 bundle is the current 2-M16 winning deployment candidate: cleaned training data, active-finger decoding, explicit finger-applicability gating, and zero committed or sent pair leakage.

Highlights

  • The public 2-M16 bundle now tracks the cleaned March 19 combined corpus rather than the older March 18 combined session.
  • The active-finger head is now paired with a dedicated finger-applicability head, so REST-side gating is modeled directly instead of being inferred from active-finger logits.
  • Public holdout and replay bundles now publish applicability false-positive and false-negative rates together with the deployment pair invariant.

Changes in this bundle

  • Committed and sent OPEN/CLOSE + NONE pair rates remain zero across the published holdout and replay bundles.
  • Committed and sent REST + active-finger pair rates also remain zero, because REST is forced to NONE and applicability gates only actuation.
  • The tuned threshold_applicability = 0.4 setting is now reflected in deployment config, replay artifacts, report HTML, and the public website bundle.
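The two pair invariants above can be checked mechanically over committed (action, finger) pairs. A minimal Python sketch; the label names and the `commit` helper are illustrative assumptions, not the project's actual code:

```python
# Deployment pair invariant: OPEN/CLOSE must never pair with finger NONE,
# and REST must never pair with an active finger.
# Label names here are illustrative assumptions, not the project's enums.
REST, OPEN, CLOSE, NONE = "REST", "OPEN", "CLOSE", "NONE"

def pair_violations(pairs):
    """Count both invariant violations over (action, finger) pairs."""
    nonrest_none = sum(1 for a, f in pairs if a in (OPEN, CLOSE) and f == NONE)
    rest_active = sum(1 for a, f in pairs if a == REST and f != NONE)
    return nonrest_none, rest_active

def commit(action, finger):
    """Force REST to pair with NONE before committing, as described above."""
    return (action, NONE) if action == REST else (action, finger)

committed = [commit(a, f) for a, f in [(REST, "index"), (OPEN, "thumb")]]
print(pair_violations(committed))  # (0, 0) once REST is forced to NONE
```

With this forcing in place, the REST + active-finger violation count is zero by construction, which matches the bullet above.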

Deployment note

The March 19 website refresh replaces the March 18 public snapshot with the current winning-model bundle. The featured figure set now emphasizes current confusion, calibration, and replay diagnostics.

Frozen live defaults

Postprocess enabled, ema (5), finger_mode=raw

threshold_action=0.05, threshold_finger=0.2, threshold_applicability=0.4, actuation_min_prob=0.2

actuation_stability=3, cooldown_ms=250, speed_modulation=on
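A sketch of how these frozen defaults could compose into a single decision gate. The class and its control flow are assumptions about a typical EMA-plus-threshold-plus-stability-plus-cooldown stack, not the project's Step 7 implementation; the finger head is omitted for brevity:

```python
class ActuationGate:
    """Illustrative composition of the frozen live defaults above.
    The flow (smooth, threshold, stability streak, cooldown) is an
    assumption, not the project's actual Step 7 code."""

    def __init__(self, ema_n=5, threshold_action=0.05,
                 threshold_applicability=0.4, actuation_min_prob=0.2,
                 actuation_stability=3, cooldown_ms=250):
        self.alpha = 2.0 / (ema_n + 1)          # EMA(5) smoothing factor
        self.smoothed = None
        self.t_act, self.t_app = threshold_action, threshold_applicability
        self.min_prob = actuation_min_prob
        self.need_stable = actuation_stability
        self.cooldown_ms = cooldown_ms
        self.streak, self.last_action = 0, None
        self.last_sent_ms = -float("inf")

    def step(self, probs, applicability, now_ms):
        # EMA-smooth the action probabilities window by window.
        if self.smoothed is None:
            self.smoothed = dict(probs)
        else:
            self.smoothed = {k: self.alpha * v + (1 - self.alpha) * self.smoothed[k]
                             for k, v in probs.items()}
        action, p = max(self.smoothed.items(), key=lambda kv: kv[1])
        # Gate on action confidence and applicability.
        if p < max(self.t_act, self.min_prob) or applicability < self.t_app:
            self.streak = 0
            return None
        # Require the same action for `actuation_stability` consecutive windows.
        self.streak = self.streak + 1 if action == self.last_action else 1
        self.last_action = action
        if self.streak < self.need_stable:
            return None
        # Enforce the refractory cooldown between sends.
        if now_ms - self.last_sent_ms < self.cooldown_ms:
            return None
        self.last_sent_ms = now_ms
        return action

gate = ActuationGate()
sends = [gate.step({"OPEN": 0.9, "CLOSE": 0.05, "REST": 0.05}, 0.8, t)
         for t in range(0, 300, 50)]
print(sends)  # [None, None, 'OPEN', None, None, None]
```

The usage shows the intended behavior: the stability gate passes on the third consecutive window, and the 250 ms cooldown suppresses immediate repeat sends.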

Test action accuracy

89.79%

2,301 held-out windows

Finger accuracy on non-REST

87.01%

1,994 non-REST test windows

Primary holdout joint accuracy

84.66%

REST TPR 98.37% · applicability FN 2.26%

Pseudo-live committed joint

86.64%

Would-send precision 93.32% · false REST actuation 0.12%

Why this run won

The public March 19 bundle is the tip of a much larger tuning and validation iceberg.

The featured 2-M16 run was not chosen on a single accuracy number. It emerged from repeated retraining, postprocess ablations, holdout audits, chronological replay, and pseudo-live replay until the deployment pair invariants stayed clean while the broader replay ladder remained competitive.

2,595 configs

Postprocess ablation

The March 16, 2026 website update documents a 2,595-config ablation over thresholds, smoothing, hysteresis, adjacency, and finger-mode settings.

96 retained sweep runs

Archived Step 2 + Step 3 cycle

The preserved `logs/sweep/` CSVs retain 96 completed training-plus-evaluation runs from the broader 2-M16 tuning cycle.

100+ model variants

Documented in the older Feb 26 update

The February 26, 2026 2-M16 tuning update states that 100+ model variants were trained across full-dataset, non-REST event-gated, and REST event-gated regimes.

30+ hours

Continuous sweep time

That same February 26, 2026 update describes a 30+ hour sweep and highlights a 90-run block that spanned about 33.3 hours from February 25, 2026 07:49 to February 26, 2026 17:07.

How this run was chosen

  • The March 19 checkpoint replaced the March 18 deployment candidate after the cleaned training corpus, explicit finger-applicability head, and refreshed replay bundle all aligned better than the previous public snapshot.
  • Selection favored the combination of strong holdout metrics, stronger replay behavior on the cleaned deployment corpus, and zero committed or sent invalid action-finger pairs across the published holdout and replay bundles.
  • The model was chosen because it behaved coherently across saved split metrics, chronological replay, and pseudo-live replay, not because it won on one leaderboard number.
  • The harder March 17 realism replay is still conservative on applicability recall, but it remains part of the public selection story because it shows where the deployment stack is still weak.

How the tuning campaign evolved

  • The February 26, 2026 update documents the earlier large-scale weight and hyperparameter campaign: 100+ trained variants, a 30+ hour sweep, and a largest logged 90-run block.
  • The March 16, 2026 update documents the later deployment-facing postprocess ablation that froze the live default family after 2,595 evaluated configurations.
  • The March 18, 2026 update widened the selection criteria from holdout accuracy alone to include full-session replay, pseudo-live behavior, and the end-to-end Step 7 control path.
  • The March 19, 2026 update finalized the current winning bundle by moving to the cleaned corpus and publishing applicability diagnostics directly alongside the deployment pair invariant.

Training Recipe & Frozen Runtime

This is the deeper layer behind the public bundle: the training recipe, split policy, auxiliary data support, and frozen deployment defaults that carried the winning checkpoint into replay and pseudo-live evaluation.

Training stack

Architecture

CNNLSTMFingerActionNet

The March 19 winning run combines an action head, an active-finger head, and a dedicated finger-applicability head.

Optimization

60 epochs · batch 64 · lr 0.001 · seed 43

These values come from the winning run's training config and match the published March 19 metrics bundle.

Split policy

group_trial · test_size 0.2 · calibration_size 0.1

The holdout bundle stays tied to a fixed split while calibration is separated from the main train/test partition.
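A group-level split of this shape can be sketched without any project code. The `group_split` helper below is illustrative only; it demonstrates the property that whole trials, never individual windows, are held out, with a second carve-out for calibration:

```python
import random

def group_split(groups, holdout_frac, seed):
    """Hold out whole groups (trials), never individual windows.
    Illustrative sketch; the project's group_trial policy may differ."""
    uniq = sorted(set(groups))
    random.Random(seed).shuffle(uniq)
    held = set(uniq[:max(1, round(holdout_frac * len(uniq)))])
    hold_idx = [i for i, g in enumerate(groups) if g in held]
    rest_idx = [i for i, g in enumerate(groups) if g not in held]
    return rest_idx, hold_idx

trial_ids = [t for t in range(50) for _ in range(20)]  # 50 trials x 20 windows
trainval, test = group_split(trial_ids, 0.2, seed=43)  # test_size 0.2
train_rel, cal_rel = group_split([trial_ids[i] for i in trainval], 0.1, seed=43)
train = [trainval[i] for i in train_rel]               # calibration_size 0.1
cal = [trainval[i] for i in cal_rel]

# No trial ever spans two partitions:
assert not {trial_ids[i] for i in test} & {trial_ids[i] for i in train}
assert not {trial_ids[i] for i in cal} & {trial_ids[i] for i in train}
```

Because the pipeline uses overlapping windows, holding out groups rather than windows is what keeps near-duplicate windows from straddling the train/test boundary.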

Input + preprocessing

64 x 4 windows · center_detrend

Per-window centering and detrending are frozen into the winning run's preprocessing and normalizer config.
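Per-window centering and linear detrending can be sketched as below. Whether the frozen `center_detrend` step uses exactly this least-squares linear fit is an assumption; the bundle only names the step:

```python
import numpy as np

def center_detrend(window):
    """Center each channel, then remove a least-squares linear trend.
    `window` is samples x channels (64 x 4 here). The linear-fit detrend
    is an assumption about what center_detrend does."""
    w = window - window.mean(axis=0, keepdims=True)   # center each channel
    t = np.arange(w.shape[0], dtype=float)
    t = (t - t.mean()) / t.std()                      # standardized time axis
    slope = (t[:, None] * w).mean(axis=0)             # per-channel LS slope
    return w - t[:, None] * slope[None, :]

rng = np.random.default_rng(43)
win = rng.normal(size=(64, 4)) + np.linspace(0, 5, 64)[:, None]  # drifting window
out = center_detrend(win)
print(np.allclose(out.mean(axis=0), 0))  # True: channels re-centered
```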

Sampler

core_event_equalized

Training equalizes the core REST-event mass while still keeping the auxiliary quiet-rest corpus train-only.
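Event-equalized sampling can be illustrated with per-event inverse-count weights, so each event contributes equal total sampling mass regardless of how many windows it spans. How `core_event_equalized` actually defines event mass is not public here, so this is a generic sketch:

```python
from collections import Counter

def equalized_weights(event_ids):
    """Per-window sampling weights that give every event equal total mass.
    Generic sketch; the project's core_event_equalized sampler may differ."""
    counts = Counter(event_ids)
    return [1.0 / counts[e] for e in event_ids]

# One 4-window REST event and one 2-window REST event:
weights = equalized_weights(["rest_0"] * 4 + ["rest_1"] * 2)
print(sum(weights[:4]), sum(weights[4:]))  # 1.0 1.0 — equal event mass
```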

REST support

1,059 auxiliary quiet-rest windows

The auxiliary quiet-rest session is used for train-only support while the core split contributes 11,388 windows and the public test split contributes 2,301 windows.

Replay and runtime stack

Frozen live defaults

EMA smoothing (5) · action 0.05 · finger 0.20 · applicability 0.40

The March 19 bundle freezes the same deployment-facing thresholds reflected in the replay artifacts, report HTML, and website.

Actuation gates

min_prob 0.2 · stability 3 · cooldown 250 ms

These are the saved Step 7 decision defaults for the current deployment candidate.

Replay cadence

0.25 s windows · 0.05 s hop · 10 MC passes

The pseudo-live replay runs the same checkpoint through the saved inference and actuation path at a replay cadence close to live use.
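The window/hop cadence above implies a simple replay schedule. A sketch; the 10 MC passes would average repeated stochastic forward passes per window, which is noted only as a comment here since the model call is project-specific:

```python
def replay_schedule(duration_s, win_s=0.25, hop_s=0.05):
    """Enumerate (start, end) spans for the 0.25 s window / 0.05 s hop cadence."""
    spans, t = [], 0.0
    while t + win_s <= duration_s + 1e-9:
        spans.append((round(t, 2), round(t + win_s, 2)))
        t += hop_s
    return spans

# Each span would then get 10 stochastic (MC) forward passes, with the
# per-window probabilities averaged before gating.
spans = replay_schedule(1.0)
print(len(spans), spans[0], spans[-1])  # 16 (0.0, 0.25) (0.75, 1.0)
```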

Replay latency

127.21 ms mean · 127.29 ms p95

The current cleaned-corpus pseudo-live replay logs stable prediction latency across 12,447 windows.

Would-send onset

0.083 s median · 0.317 s p95

These onset figures come from the current cleaned-corpus pseudo-live replay and are exposed in the public benchmark ladder.

Replay footprint

12,447 windows over 3,046.15 s

The cleaned deployment replay is long enough to expose transition behavior, actuation suppression reasons, and latency stability rather than only short held-out windows.

Key Metrics

The public headline metrics use the published holdout bundle. Extended evaluation cards below add replay and pseudo-live context so the reader can see how the model behaves beyond a single split.

Split | Metric | Value
Train | Action accuracy | 86.39%
Train | Finger accuracy | 86.80%
Train | Avg loss | 0.7714
Train | Config | epochs=60, batch=64, lr=0.001, seed=43
Test | Action accuracy | 89.79%
Test | Finger accuracy on non-REST windows | 87.01%
Test | Joint accuracy | 84.66%
Test | Joint accuracy on non-REST | 82.55%
Test | Finger accuracy overall | 87.61%
Test | REST TPR / precision / F1 | 98.37% / 80.11% / 0.883
Test | REST FPR | 3.76%
Test | Applicability FP / FN | 18.57% / 2.26%
Test | Action-applicability disagreement | 3.56%
Test | Raw valid / invalid pair rate | 83.62% / 16.38%
Test | Raw non-REST NONE / raw REST active-finger | 0.00% / 16.38%
Test | Committed non-REST NONE / committed REST active-finger | 0.00% / 0.00%
Test | Action ECE / finger ECE on non-REST | 2.32% / 2.73%
Test | Deployment pair invariant | passed
Test | Test windows | 2,301
Test | Non-REST test windows | 1,994
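The ECE rows can be reproduced from binned confidences. A minimal sketch; the 10-bin equal-width scheme is a common default and an assumption here, not a documented setting of this bundle:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-weighted mean |accuracy - confidence| over confidence bins.
    Equal-width bins over (0, 1]; the bin count is an assumption."""
    n, ece = len(confidences), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == lo)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece

# Three windows, one wrong but fairly confident prediction:
print(expected_calibration_error([0.95, 0.85, 0.65], [1, 1, 0]))
```

A perfectly calibrated model (confidence equals bin accuracy everywhere) would score 0; the table's 2.32% / 2.73% values are low enough for threshold-based gating to be meaningful.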

Artifacts

model=finger_action_model.pt, scaler=scaler.npz, preds=test_predictions.npz

temperature scaling=temperature_scaling.json

Source identifiers: subject=2-M16, session=combined_20260319_081200_pruned_rest_events_0_1_2, run=20260319_075520

Created UTC: 2026-03-19T08:27:08+00:00

How to read this bundle

The test row is the saved split summary. Replay and pseudo-live cards below use the same checkpoint under different evaluation conditions.

Action train-test gap: 3.39%, with test accuracy slightly higher than training accuracy.

Extended Evaluation

This section groups repeated splits, quiet-rest replay, and chronological replay for the same run.

Auxiliary quiet-REST benchmark

Target: 2-M16_20260315_145838_01

Windows: 1,059

Action accuracy: 97.26%

REST TPR: 97.26%

REST precision: 100.00%

REST F1: 0.986

Applicability FP on true REST: 4.53%

Deployment pair invariant: passed

Dedicated quiet-rest replay used to measure REST-side applicability false positives on true REST windows.

Core full-session replay

Target: 2-M16_20260216_150056_01 + 2-M16_20260317_190134

Windows: 11,388

Action accuracy: 88.48%

Joint accuracy: 84.30%

Joint accuracy on non-REST: 82.73%

Finger accuracy on non-REST: 85.71%

REST TPR: 95.99%

REST precision: 69.15%

Applicability FP on true REST: 18.07%

Applicability FN on true non-REST: 3.68%

Committed non-REST + NONE rate: 0.00%

Committed REST + active-finger rate: 0.00%

Deployment pair invariant: passed

Chronological replay across the two core movement sessions with zero committed pair leakage and explicit applicability diagnostics.

Pseudo-Live Replay

Pseudo-live replay runs the saved EEG windows through the Step 7 decision path and records what the hand would have done without contacting hardware. This is the closest public benchmark on the site to live control behavior.

Pseudo-live replay on the cleaned deployment corpus

Target: combined_20260319_081200_pruned_rest_events_0_1_2

Training source: Winning March 19 deployment checkpoint

Windows: 12,447

Committed action accuracy: 91.75%

Committed joint accuracy: 86.64%

Committed finger accuracy on non-REST: 85.78%

Applicability FP on true REST: 12.10%

Applicability FN on true non-REST: 3.68%

Would-send precision on non-REST: 93.32%

Would-send recall on non-REST: 10.57%

False REST actuation rate: 0.12%

Non-REST NONE count: 0

Committed non-REST + NONE rate: 0.00%

Committed REST + active-finger rate: 0.00%

Sent non-REST + NONE rate: 0.00%

Sent REST + active-finger rate: 0.00%

Deployment pair invariant: passed

First-send latency median / p95: 0.083 s / 0.317 s

Threshold applicability is tuned to 0.4 for the current deployment bundle.

Pseudo-live replay on the legacy combined corpus

Target: combined_20260317_211414

Training source: Winning March 19 deployment checkpoint

Windows: 12,969

Committed action accuracy: 87.95%

Committed joint accuracy: 82.98%

Committed finger accuracy on non-REST: 85.90%

Applicability FP on true REST: 27.79%

Applicability FN on true non-REST: 3.68%

Would-send precision on non-REST: 89.62%

Would-send recall on non-REST: 10.57%

False REST actuation rate: 1.71%

Non-REST NONE count: 0

Committed non-REST + NONE rate: 0.00%

Committed REST + active-finger rate: 0.00%

Sent non-REST + NONE rate: 0.00%

Sent REST + active-finger rate: 0.00%

Deployment pair invariant: passed

First-send latency median / p95: 0.083 s / 0.317 s

Regression replay against the pre-pruned March 17 combined corpus.

Pseudo-live replay on the March 17 realism session

Target: 2-M16_20260317_190134

Training source: Winning March 19 deployment checkpoint

Windows: 1,644

Committed action accuracy: 72.87%

Committed joint accuracy: 71.96%

Committed finger accuracy on non-REST: 9.72%

Applicability FP on true REST: 17.89%

Applicability FN on true non-REST: 52.98%

Would-send precision on non-REST: 62.50%

Would-send recall on non-REST: 0.99%

False REST actuation rate: 0.09%

Non-REST NONE count: 0

Committed non-REST + NONE rate: 0.00%

Committed REST + active-finger rate: 0.00%

Sent non-REST + NONE rate: 0.00%

Sent REST + active-finger rate: 0.00%

Deployment pair invariant: passed

First-send latency median / p95: 0.381 s / 0.566 s

Hard realism check remains conservative: pair invariants hold, but applicability recall is still weak on this session.

Across published runs

Compare to other runs

Finger accuracy is reported on non-REST windows only.

Run | Date | Action accuracy | Finger accuracy on non-REST windows | Test windows
2-m16 | 2026-03-19 | 89.79% | 87.01% | 2,301
1-m16-500 | 2026-03-05 | 83.94% | 80.61% | 2,652

Plain-language highlights

  • Test action accuracy: 89.79%.
  • Test finger accuracy on non-REST windows: 87.01%.
  • Test windows: 2,301.

What this means

  • Action accuracy measures how often held-out EEG windows were assigned the correct REST, OPEN, or CLOSE label.
  • Finger accuracy on non-REST windows isolates finger classification after removing EEG windows labeled REST.
  • These are EEG-window-level metrics and should not be interpreted as direct trial-level or online-control performance.
  • Confusion matrices and confidence plots provide error structure that is not visible from accuracy alone.
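The first two definitions in the list can be written down directly. The record layout below is illustrative, not the bundle's actual schema:

```python
def action_and_finger_accuracy(records):
    """Window-level metrics as defined above: action accuracy over all
    windows; finger accuracy only where the true action is not REST.
    `records` holds (true_action, pred_action, true_finger, pred_finger);
    the tuple layout is an illustrative assumption."""
    act_hits = sum(ta == pa for ta, pa, _, _ in records)
    nonrest = [(tf, pf) for ta, _, tf, pf in records if ta != "REST"]
    fing_hits = sum(tf == pf for tf, pf in nonrest)
    return act_hits / len(records), fing_hits / len(nonrest)

recs = [("OPEN", "OPEN", "thumb", "thumb"),
        ("REST", "REST", "NONE", "NONE"),
        ("CLOSE", "OPEN", "index", "index")]
print(action_and_finger_accuracy(recs))  # (0.6666666666666666, 1.0)
```

Note how the third record counts against action accuracy but not against non-REST finger accuracy, which is why the two headline numbers move independently.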

Trust & Caveats

  • The public metrics bundle does not include full per-class counts, so class imbalance is not fully characterized on-page.
  • The pipeline uses overlapping windows; leakage control depends on split settings and metadata quality.
  • This public bundle does not expose run-specific purge settings for its overlapping windows, so residual leakage risk cannot be fully ruled out for this run.

Topomaps & Signal Evidence

These alpha-band topomaps are here to show what changed physiologically, not to replace the classifier metrics. They help explain why the current deployment strategy leans on lateral Muse 2 channels and all-session training rather than a narrow single-session fit.

Action alpha rest-delta topomap

Rest-relative alpha maps for REST, OPEN, and CLOSE in the March 19 winning session. OPEN and CLOSE both show the dominant TP10 decrease and smaller TP9 increase that characterize the current 2-M16 action story.

Finger alpha rest-delta topomap with NONE reference

Finger-level rest-delta maps, including NONE as the explicit REST reference. The strongest variation remains concentrated on TP10 and TP9, which helps explain why lateral Muse 2 channels carry most of the finger-separation load.

Interpretive notes

  • The strongest rest-relative separations remain concentrated on the lateral Muse 2 channels rather than a broad scalp-wide shift.
  • OPEN and CLOSE are highly similar in rest-relative alpha topography, so these figures are best read as signal-evidence context rather than a substitute for temporal decoding metrics.
  • Finger-level variation remains strongest on TP10 and then TP9, with AF7 and AF8 changing much less.

Figures

These figures carry the structure behind the headline metrics. The confusion matrices show where the decoder drifts, while the confidence panels show whether the model's probabilities are stable enough to support conservative actuation rules.

Note: in the finger confusion matrix, REST action misses are shown as NONE. Those cells reflect true movement windows that the action head labeled REST, not deployable OPEN/CLOSE plus NONE outputs.

Action confusion matrix

Confusion matrix for action classification across REST, OPEN, and CLOSE. Rows show the actual labels, columns show the predicted labels, and off-diagonal cells show where action boundaries remain unstable.

Finger confusion matrix on non-REST windows

Confusion matrix for finger classification on non-REST windows. The diagonal shows which active fingers remain separable after REST is removed from the task.

Action calibration

Action calibration helps show whether confidence tracks observed correctness tightly enough to support conservative actuation gates and replay analysis.

Confidence calibration

Calibration bars compare predicted confidence with observed accuracy across bins. Better alignment means the model's confidence is more usable for thresholding and safety gates.

Confidence and uncertainty scatter

The uncertainty scatter shows where action predictions stay compact and where they begin to loosen. High-confidence, low-uncertainty regions are the most stable part of the decoding space.

Source trail

Follow the selection path

These links document how the project moved from broad tuning and ablation work to the current public run.

This page reflects the currently published bundle for this run slug and may be revised if a newer validated bundle is released.