Model outputs

The v4 offense full-game baseline failed its walk-forward, and the 8-of-8 feature-test program is closed. On the 22 carried features, the per-team learned-model architecture is now the model of record on BOTH surfaces, each chosen by a 7-configuration GBM bake-off: CatBoost on full-game (commit eb7c46c) and CatBoost on F5 (commit 29f50d8). These are research-phase point estimates, not betting claims.

Models of record

Full-game total runs: CatBoost, full-game GBM bake-off winner (commit eb7c46c) · gr-18-clean offense matrix (22 features) · pooled 2021-2025 MAE 3.5249 vs LightGBM 3.5755 · 2026 OOS shadow 3.6067 vs 3.6673
F5 total runs: CatBoost, GBM bake-off winner (7 configs pre-registered) · F5 v2 matrix (22 features) · pooled 2021-2025 F5-total MAE 2.6419 vs LightGBM 2.6791 · 2026 OOS shadow 2.4816 vs 2.5346
Phase: research point estimates (not probabilities, not betting claims)
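
To make the size of the bake-off margins easier to read, the sketch below recomputes the CatBoost-over-LightGBM gap from the MAE figures listed above. The dictionary layout and the `mae_margin` helper are illustrative names, not part of the project's code; only the eight MAE values come from this page.

```python
# MAE values copied from the models-of-record summary (lower is better, units are runs).
RESULTS = {
    ("full-game", "pooled 2021-2025"): {"catboost": 3.5249, "lightgbm": 3.5755},
    ("full-game", "2026 OOS shadow"): {"catboost": 3.6067, "lightgbm": 3.6673},
    ("F5", "pooled 2021-2025"): {"catboost": 2.6419, "lightgbm": 2.6791},
    ("F5", "2026 OOS shadow"): {"catboost": 2.4816, "lightgbm": 2.5346},
}


def mae_margin(row):
    """Return (absolute gap in runs, gap as a fraction of the LightGBM MAE)."""
    gap = row["lightgbm"] - row["catboost"]
    return round(gap, 4), round(gap / row["lightgbm"], 4)


# Hypothetical helper output: one (gap, relative gap) pair per surface and window.
MARGINS = {key: mae_margin(row) for key, row in RESULTS.items()}

for (surface, window), (gap, rel) in MARGINS.items():
    print(f"{surface:<9} {window:<17} gap={gap:.4f} runs ({rel:.2%})")
```

On all four rows the gap falls between roughly 0.04 and 0.06 runs of MAE, which is the scale to keep in mind when reading the feature-test deltas further down.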

Live deployment: the daily page's predict_slate now uses CatBoost on BOTH surfaces — F5 from commit 9fbf52a and full-game from the 2026-05-19 wiring commit (model_version icecream-allcatboost-20260519).

Feature importance (Phase 1 — descriptive)

CatBoost is the model of record on both surfaces. Phase 1 of the carried-features research arc probes how the model weights each of the 22 carried features: a descriptive LossFunctionChange read per day across the walk-forward, with no pre-registration required. Phase 2 (a pre-registered ablation against the bottom-ranked features) is designed against this evidence. Research program: audit/research_programs/20260519_carried_features_next.md.
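
As a rough illustration of how a per-day read can be collapsed into the Mean and Std columns reported below, here is a minimal sketch. It assumes each trained day yields one importance value per feature (presumably from CatBoost's `get_feature_importance` with `type="LossFunctionChange"`), that Std is the sample standard deviation across days, and that rows are ranked by mean. The function name and the toy values are hypothetical and are not taken from the walk-forward.

```python
from statistics import mean, stdev


def summarize_importance(daily_importance):
    """Collapse per-day importances into rows of (rank, feature, mean, std).

    daily_importance: one {feature_name: loss_change} dict per trained day.
    """
    features = sorted({name for day in daily_importance for name in day})
    rows = []
    for name in features:
        values = [day[name] for day in daily_importance if name in day]
        # Assumption: sample standard deviation; a single day has no spread.
        spread = stdev(values) if len(values) > 1 else 0.0
        rows.append((name, mean(values), spread))
    # Rank 1 is the feature with the largest mean loss change.
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(rank, *row) for rank, row in enumerate(rows, start=1)]


# Toy values for three days and two features (NOT real walk-forward output).
TOY_DAYS = [
    {"park_factor": 0.020, "is_home": 0.004},
    {"park_factor": 0.024, "is_home": 0.006},
    {"park_factor": 0.022, "is_home": 0.005},
]

TABLE = summarize_importance(TOY_DAYS)
for rank, name, avg, spread in TABLE:
    print(f"{rank:>2} {name:<12} {avg:.4f} {spread:.4f}")
```

Because the read is descriptive, a large Std relative to the Mean is worth noting before any ranking is treated as stable.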

Full-game surface

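CatBoost LossFunctionChange, day-by-day walk-forward 2021-2025 (835 trained days). Higher mean = larger change in loss when the feature is removed from the model (CatBoost approximates removal within the trained trees; it does not permute the feature) = more load-bearing for the model.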

Rank  Feature                   Mean    Std
1     park_factor               0.0223  0.0086
2     bat_team_runs_pg_recent   0.0173  0.0053
3     opp_sp_xwoba_allowed      0.0171  0.0073
4     opp_bullpen_woba_tm       0.0164  0.0081
5     xwoba_team                0.0116  0.0086
6     ump_run_env               0.0111  0.0084
7     opp_bullpen_woba_lg       0.0111  0.0070
8     temp_f                    0.0110  0.0058
9     flat_woba                 0.0102  0.0053
10    opp_sp_depth_bf           0.0101  0.0067
11    opp_sp_woba_allowed       0.0097  0.0045
12    carry_index               0.0092  0.0062
13    opp_bullpen_ip_l5d        0.0085  0.0057
14    opp_bullpen_ip_l1d        0.0078  0.0054
15    opp_bullpen_ip_l3d        0.0077  0.0051
16    matchup_woba_v4           0.0075  0.0054
17    matchup_woba_pitchtype    0.0072  0.0048
18    opp_sp_velo_drop          0.0062  0.0039
19    is_home                   0.0047  0.0038
20    bat_team_miles_traveled   0.0040  0.0033
21    bat_team_rest_days        0.0037  0.0026
22    bat_team_tz_shift         0.0025  0.0011

F5 surface

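CatBoost LossFunctionChange, day-by-day walk-forward 2021-2025 (835 trained days). Higher mean = larger change in loss when the feature is removed from the model (CatBoost approximates removal within the trained trees; it does not permute the feature) = more load-bearing for the model.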

Rank  Feature                   Mean    Std
1     opp_sp_xwoba_allowed      0.0168  0.0072
2     opp_bullpen_woba_tm       0.0122  0.0064
3     park_factor               0.0117  0.0042
4     bat_team_runs_pg_recent   0.0103  0.0045
5     is_home                   0.0090  0.0052
6     opp_sp_depth_bf           0.0090  0.0064
7     ump_run_env               0.0085  0.0057
8     opp_bullpen_woba_lg       0.0083  0.0048
9     flat_woba                 0.0077  0.0068
10    xwoba_team                0.0073  0.0055
11    temp_f                    0.0072  0.0039
12    carry_index               0.0067  0.0046
13    opp_bullpen_ip_l3d        0.0066  0.0043
14    opp_sp_woba_allowed       0.0066  0.0041
15    matchup_woba_pitchtype    0.0063  0.0041
16    opp_bullpen_ip_l5d        0.0060  0.0043
17    matchup_woba_v4           0.0057  0.0035
18    opp_sp_velo_drop          0.0054  0.0027
19    opp_bullpen_ip_l1d        0.0053  0.0041
20    bat_team_miles_traveled   0.0041  0.0037
21    bat_team_tz_shift         0.0031  0.0015
22    bat_team_rest_days        0.0016  0.0006
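
Since Phase 2 targets the bottom-ranked features, one way to read the two rankings together is to intersect their tails and look at rank movement between surfaces. The sketch below does that with the feature orders copied from the two tables above. The helper names and the cutoff `k=3` are illustrative assumptions, not the research program's pre-registered choices.

```python
# Feature order (rank 1 first) copied from the full-game and F5 tables.
FULL_GAME = [
    "park_factor", "bat_team_runs_pg_recent", "opp_sp_xwoba_allowed",
    "opp_bullpen_woba_tm", "xwoba_team", "ump_run_env", "opp_bullpen_woba_lg",
    "temp_f", "flat_woba", "opp_sp_depth_bf", "opp_sp_woba_allowed",
    "carry_index", "opp_bullpen_ip_l5d", "opp_bullpen_ip_l1d",
    "opp_bullpen_ip_l3d", "matchup_woba_v4", "matchup_woba_pitchtype",
    "opp_sp_velo_drop", "is_home", "bat_team_miles_traveled",
    "bat_team_rest_days", "bat_team_tz_shift",
]
F5 = [
    "opp_sp_xwoba_allowed", "opp_bullpen_woba_tm", "park_factor",
    "bat_team_runs_pg_recent", "is_home", "opp_sp_depth_bf", "ump_run_env",
    "opp_bullpen_woba_lg", "flat_woba", "xwoba_team", "temp_f", "carry_index",
    "opp_bullpen_ip_l3d", "opp_sp_woba_allowed", "matchup_woba_pitchtype",
    "opp_bullpen_ip_l5d", "matchup_woba_v4", "opp_sp_velo_drop",
    "opp_bullpen_ip_l1d", "bat_team_miles_traveled", "bat_team_tz_shift",
    "bat_team_rest_days",
]


def bottom_k(ranked, k):
    """Return the k lowest-ranked features as a set."""
    return set(ranked[-k:])


def rank_shift(ranked_a, ranked_b):
    """Rank in A minus rank in B per feature (positive = ranks higher in B)."""
    return {name: ranked_a.index(name) - ranked_b.index(name) for name in ranked_a}


# k=3 is an illustrative cutoff only.
SHARED_BOTTOM = bottom_k(FULL_GAME, 3) & bottom_k(F5, 3)
SHIFTS = rank_shift(FULL_GAME, F5)
BIGGEST_MOVER = max(SHIFTS, key=lambda name: abs(SHIFTS[name]))

print(sorted(SHARED_BOTTOM))
print(BIGGEST_MOVER, SHIFTS[BIGGEST_MOVER])
```

With this cutoff the same three travel and rest features (miles traveled, rest days, time-zone shift) sit at the bottom of both surfaces, while is_home moves the most, from 19th on full-game to 5th on F5.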

Offense feature-test program

All 8 closed; FAIL/MARGINAL results carried (not killed), pending discussion.

Test                       Verdict   Commit   Result
#1 pitch-type matchup      FAIL      d916541  Pitch-type-level matchup degraded total-runs accuracy every season.
#2 expected wOBA           PASS      b2eccf5  Statcast expected wOBA beat actual wOBA -- better every season.
#3 team-specific bullpen   FAIL      41c53be  The fixed-formula team-bullpen swap raised MAE every season -- now shadowed.
#4 carry index             MARGINAL  31b4857  Weather carry index vs temperature -- a noise-level MAE change, neutral.
#5 umpire run environment  PASS      f2be26e  Plate-umpire run environment improved MAE 3.6691 vs 3.6722, 4 of 5 seasons.
#6 travel / rest           FAIL      26ac3c5  Per-team travel-fatigue adjustment raised MAE -- fitted slope near-zero.
#7 pitcher velo trend      FAIL      8e4bcd0  Velo-drop swaps (continuous + 2-mph flag) both raised MAE -- carried, not killed.
bullpen fatigue            MARGINAL  eee89a7  Recent bullpen-workload swap -- continuous noise-level, flag worse. Spun off from #3.
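
The results above are phrased in terms of a pooled MAE and a count of seasons in which the candidate beat the baseline. A minimal sketch of that bookkeeping follows, assuming per-game errors (prediction minus actual) are grouped by season. The function names and the toy errors are hypothetical, and the sketch deliberately stops short of assigning PASS, FAIL or MARGINAL because the verdict thresholds are not stated here.

```python
def mae(errors):
    """Mean absolute error of a list of (prediction - actual) values."""
    return sum(abs(e) for e in errors) / len(errors)


def compare_by_season(baseline, candidate):
    """Compare two feature sets given {season: [errors]} for each.

    Returns pooled MAE for both, their difference (candidate minus baseline,
    negative = candidate better), and the seasons the candidate won outright.
    """
    seasons = sorted(baseline)
    pooled_base = mae([e for s in seasons for e in baseline[s]])
    pooled_cand = mae([e for s in seasons for e in candidate[s]])
    wins = [s for s in seasons if mae(candidate[s]) < mae(baseline[s])]
    return {
        "pooled_baseline": pooled_base,
        "pooled_candidate": pooled_cand,
        "delta": pooled_cand - pooled_base,
        "seasons_won": wins,
    }


# Toy errors for two seasons (NOT real results).
TOY_BASELINE = {2021: [3.0, -4.0], 2022: [2.0, -2.0]}
TOY_CANDIDATE = {2021: [2.0, -4.0], 2022: [2.0, -2.0]}

RESULT = compare_by_season(TOY_BASELINE, TOY_CANDIDATE)
print(RESULT)
```

Reporting both numbers matters because a pooled improvement can come from a single season, which is why a line such as "4 of 5 seasons" carries information the pooled MAE alone does not.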

Team-bullpen shadow

Test #3 failed 2021-2025; the team-bullpen variant continues to run forward on out-of-sample games -- monitoring only, does NOT overturn the test #3 verdict.

Window: 2026-03-25 .. 2026-05-18
Games scored: 621
C0 league-flat bullpen MAE: 3.6866
C1 team-bullpen MAE: 3.6817
C1 minus C0: -0.0049 (team-bullpen ahead)
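
The sign convention in the last row is candidate minus baseline, so a negative number favours the team-bullpen variant. The sketch below reproduces that arithmetic from the two MAE values and the game count shown above, and converts the gap into total absolute error over the window to give a sense of scale. The `shadow_delta` helper is a hypothetical name; because the inputs are rounded MAEs, the total is approximate.

```python
def shadow_delta(mae_candidate, mae_baseline):
    """Return (candidate minus baseline, which side is ahead); lower MAE is better."""
    delta = round(mae_candidate - mae_baseline, 4)
    if delta < 0:
        leader = "candidate"
    elif delta > 0:
        leader = "baseline"
    else:
        leader = "tie"
    return delta, leader


GAMES_SCORED = 621
DELTA, LEADER = shadow_delta(mae_candidate=3.6817, mae_baseline=3.6866)

# Approximate total absolute-error gap across the window, in runs.
TOTAL_GAP = abs(DELTA) * GAMES_SCORED

print(DELTA, LEADER, round(TOTAL_GAP, 2))
```

Spread over 621 games, the gap amounts to about three runs of total absolute error, consistent with treating the shadow as monitoring only.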