Model outputs

The v4 offense full-game baseline failed its walk-forward and the 8-of-8 feature-test program is closed. On the 22 carried features the per-team learned-model architecture is now the model of record on BOTH surfaces, each chosen by a 7-configuration GBM bake-off: CatBoost on full-game (commit eb7c46c) and CatBoost on F5 (commit 29f50d8). Research-phase point estimates, not betting claims.

Models of record

Full-game total runs	CatBoost -- full-game GBM bake-off winner (commit `eb7c46c`) · gr-18-clean offense matrix (22 features) · pooled 2021-2025 MAE 3.5249 vs LightGBM 3.5755 · 2026 OOS shadow MAE 3.6067 vs 3.6673
F5 total runs	CatBoost -- F5 GBM bake-off winner (7 configs pre-registered; commit `29f50d8`) · F5 v2 matrix (22 features) · pooled 2021-2025 F5-total MAE 2.6419 vs LightGBM 2.6791 · 2026 OOS shadow MAE 2.4816 vs 2.5346
Phase	research point estimates — not probabilities, not betting claims
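
The margins behind the two bake-off rows can be read directly off the reported MAEs. A minimal sketch using only the figures in the table above; the dictionary layout and the `mae_gap` helper name are illustrative assumptions, not part of the project code.

```python
# Reported MAEs from the models-of-record table: (CatBoost, LightGBM).
REPORTED_MAE = {
    ("full-game", "pooled 2021-2025"): (3.5249, 3.5755),
    ("full-game", "2026 OOS shadow"): (3.6067, 3.6673),
    ("F5", "pooled 2021-2025"): (2.6419, 2.6791),
    ("F5", "2026 OOS shadow"): (2.4816, 2.5346),
}


def mae_gap(catboost_mae: float, lightgbm_mae: float) -> float:
    """LightGBM MAE minus CatBoost MAE; positive means CatBoost is ahead."""
    return round(lightgbm_mae - catboost_mae, 4)


GAPS = {key: mae_gap(*pair) for key, pair in REPORTED_MAE.items()}

if __name__ == "__main__":
    for (surface, window), gap in GAPS.items():
        print(f"{surface:9s} {window:17s} CatBoost ahead by {gap:.4f} runs")
```

Every gap is positive, so CatBoost leads LightGBM on both surfaces in both the pooled walk-forward and the 2026 shadow window. These remain point-estimate differences, not significance claims.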

Live deployment: the daily page's predict_slate now uses CatBoost on BOTH surfaces — F5 from commit 9fbf52a and full-game from the 2026-05-19 wiring commit (model_version icecream-allcatboost-20260519).

Feature importance (Phase 1 — descriptive)

CatBoost is the model of record on both surfaces. Phase 1 of the carried-features research arc probes how the model weights each of the 22 carried features: a descriptive LossFunctionChange reading per trained day across the walk-forward, with no pre-registration required. Phase 2, a pre-registered ablation of the bottom-ranked features, is designed against this evidence. Research program: audit/research_programs/20260519_carried_features_next.md.
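
For orientation, a day-by-day walk-forward can be sketched as below. This assumes an expanding window (train on every day strictly before the scored day) and a hypothetical `min_train_days` warm-up. The source does not state the window type, so both are assumptions, and `walk_forward_days` is an illustrative name rather than a project function.

```python
from datetime import date
from typing import Iterator, List, Sequence, Tuple


def walk_forward_days(
    game_days: Sequence[date], min_train_days: int = 1
) -> Iterator[Tuple[List[date], date]]:
    """Yield (training days, scored day) pairs in chronological order.

    Assumption: expanding window, so each scored day only sees earlier days.
    """
    ordered = sorted(set(game_days))
    for i in range(min_train_days, len(ordered)):
        yield ordered[:i], ordered[i]


if __name__ == "__main__":
    days = [date(2021, 4, d) for d in (1, 2, 3, 4)]
    for train, scored in walk_forward_days(days):
        print(f"train on {len(train)} day(s), score {scored.isoformat()}")
```

Under this layout, one model is fitted per scored day, which is where a per-day importance reading would be taken.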

Full-game surface

CatBoost LossFunctionChange, day-by-day walk-forward 2021-2025 (835 trained days). Higher mean = larger loss change when the feature is removed from the model = more load-bearing for the model.

Rank	Feature	Mean	Std
1	`park_factor`	0.0223	0.0086
2	`bat_team_runs_pg_recent`	0.0173	0.0053
3	`opp_sp_xwoba_allowed`	0.0171	0.0073
4	`opp_bullpen_woba_tm`	0.0164	0.0081
5	`xwoba_team`	0.0116	0.0086
6	`ump_run_env`	0.0111	0.0084
7	`opp_bullpen_woba_lg`	0.0111	0.0070
8	`temp_f`	0.0110	0.0058
9	`flat_woba`	0.0102	0.0053
10	`opp_sp_depth_bf`	0.0101	0.0067
11	`opp_sp_woba_allowed`	0.0097	0.0045
12	`carry_index`	0.0092	0.0062
13	`opp_bullpen_ip_l5d`	0.0085	0.0057
14	`opp_bullpen_ip_l1d`	0.0078	0.0054
15	`opp_bullpen_ip_l3d`	0.0077	0.0051
16	`matchup_woba_v4`	0.0075	0.0054
17	`matchup_woba_pitchtype`	0.0072	0.0048
18	`opp_sp_velo_drop`	0.0062	0.0039
19	`is_home`	0.0047	0.0038
20	`bat_team_miles_traveled`	0.0040	0.0033
21	`bat_team_rest_days`	0.0037	0.0026
22	`bat_team_tz_shift`	0.0025	0.0011
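
The Mean and Std columns summarize one importance reading per trained day. A minimal sketch of that aggregation, assuming each day's reading is a feature-to-value mapping (for example, taken from CatBoost's `get_feature_importance` with `type="LossFunctionChange"` on that day's fitted model). The function name, the input layout, and the use of population standard deviation are assumptions, and the toy values are illustrative rather than project data.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Tuple


def summarize_importance(
    daily: List[Dict[str, float]],
) -> List[Tuple[int, str, float, float]]:
    """Return (rank, feature, mean, std) rows, highest mean first."""
    if not daily:
        raise ValueError("need at least one daily reading")
    rows = []
    for name in daily[0]:
        values = [day[name] for day in daily]  # one reading per trained day
        rows.append((name, fmean(values), pstdev(values)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(rank, name, mean, std) for rank, (name, mean, std) in enumerate(rows, 1)]


if __name__ == "__main__":
    # Toy readings for two days; illustrative values, not project data.
    toy = [
        {"park_factor": 0.020, "is_home": 0.004},
        {"park_factor": 0.030, "is_home": 0.006},
    ]
    for rank, name, mean, std in summarize_importance(toy):
        print(f"{rank}\t{name}\t{mean:.4f}\t{std:.4f}")
```

A Std that is large relative to the Mean indicates a feature whose weight varies considerably from day to day, which is worth keeping in mind when reading the lower ranks.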

F5 surface

CatBoost LossFunctionChange, day-by-day walk-forward 2021-2025 (835 trained days). Higher mean = larger loss change when the feature is removed from the model = more load-bearing for the model.

Rank	Feature	Mean	Std
1	`opp_sp_xwoba_allowed`	0.0168	0.0072
2	`opp_bullpen_woba_tm`	0.0122	0.0064
3	`park_factor`	0.0117	0.0042
4	`bat_team_runs_pg_recent`	0.0103	0.0045
5	`is_home`	0.0090	0.0052
6	`opp_sp_depth_bf`	0.0090	0.0064
7	`ump_run_env`	0.0085	0.0057
8	`opp_bullpen_woba_lg`	0.0083	0.0048
9	`flat_woba`	0.0077	0.0068
10	`xwoba_team`	0.0073	0.0055
11	`temp_f`	0.0072	0.0039
12	`carry_index`	0.0067	0.0046
13	`opp_bullpen_ip_l3d`	0.0066	0.0043
14	`opp_sp_woba_allowed`	0.0066	0.0041
15	`matchup_woba_pitchtype`	0.0063	0.0041
16	`opp_bullpen_ip_l5d`	0.0060	0.0043
17	`matchup_woba_v4`	0.0057	0.0035
18	`opp_sp_velo_drop`	0.0054	0.0027
19	`opp_bullpen_ip_l1d`	0.0053	0.0041
20	`bat_team_miles_traveled`	0.0041	0.0037
21	`bat_team_tz_shift`	0.0031	0.0015
22	`bat_team_rest_days`	0.0016	0.0006
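
The two surfaces rank the same 22 features differently. A short sketch that lines up the rank columns of the full-game and F5 tables; the feature orderings are copied from those tables, while the `rank_shift` helper name is an illustrative assumption.

```python
# Features in rank order (rank 1 first), copied from the two tables.
FULL_GAME = [
    "park_factor", "bat_team_runs_pg_recent", "opp_sp_xwoba_allowed",
    "opp_bullpen_woba_tm", "xwoba_team", "ump_run_env", "opp_bullpen_woba_lg",
    "temp_f", "flat_woba", "opp_sp_depth_bf", "opp_sp_woba_allowed",
    "carry_index", "opp_bullpen_ip_l5d", "opp_bullpen_ip_l1d",
    "opp_bullpen_ip_l3d", "matchup_woba_v4", "matchup_woba_pitchtype",
    "opp_sp_velo_drop", "is_home", "bat_team_miles_traveled",
    "bat_team_rest_days", "bat_team_tz_shift",
]
F5 = [
    "opp_sp_xwoba_allowed", "opp_bullpen_woba_tm", "park_factor",
    "bat_team_runs_pg_recent", "is_home", "opp_sp_depth_bf", "ump_run_env",
    "opp_bullpen_woba_lg", "flat_woba", "xwoba_team", "temp_f", "carry_index",
    "opp_bullpen_ip_l3d", "opp_sp_woba_allowed", "matchup_woba_pitchtype",
    "opp_bullpen_ip_l5d", "matchup_woba_v4", "opp_sp_velo_drop",
    "opp_bullpen_ip_l1d", "bat_team_miles_traveled", "bat_team_tz_shift",
    "bat_team_rest_days",
]


def rank_shift(full_game: list, f5: list) -> dict:
    """Full-game rank minus F5 rank; positive means ranked higher on F5."""
    f5_rank = {name: i for i, name in enumerate(f5, 1)}
    return {name: i - f5_rank[name] for i, name in enumerate(full_game, 1)}


SHIFT = rank_shift(FULL_GAME, F5)

if __name__ == "__main__":
    for name, delta in sorted(SHIFT.items(), key=lambda kv: -abs(kv[1]))[:5]:
        print(f"{name}\t{delta:+d}")
```

The largest move is `is_home`, rank 19 on full-game and rank 5 on F5. Rank shifts are descriptive only; many adjacent means sit within one Std of each other, so small shifts should not be over-read.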

Offense feature-test program

All 8 closed; FAIL/MARGINAL results carried (not killed), pending discussion.

Test	Verdict	Commit	Result
#1 pitch-type matchup	FAIL	`d916541`	Pitch-type-level matchup degraded total-runs accuracy every season.
#2 expected wOBA	PASS	`b2eccf5`	Statcast expected wOBA beat actual wOBA -- better every season.
#3 team-specific bullpen	FAIL	`41c53be`	The fixed-formula team-bullpen swap raised MAE every season -- now shadowed.
#4 carry index	MARGINAL	`31b4857`	Weather carry index vs temperature -- a noise-level MAE change, neutral.
#5 umpire run environment	PASS	`f2be26e`	Plate-umpire run environment improved MAE (3.6691 vs 3.6722), better in 4 of 5 seasons.
#6 travel / rest	FAIL	`26ac3c5`	Per-team travel-fatigue adjustment raised MAE -- fitted slope near-zero.
#7 pitcher velo trend	FAIL	`8e4bcd0`	Velo-drop swaps (continuous + 2-mph flag) both raised MAE -- carried, not killed.
bullpen fatigue	MARGINAL	`eee89a7`	Recent bullpen-workload swap -- continuous variant at noise level, flag variant worse. Spun off from #3.

Team-bullpen shadow

Test #3 failed 2021-2025; the team-bullpen variant continues to run forward on out-of-sample games -- monitoring only, does NOT overturn the test #3 verdict.

Window	2026-03-25 .. 2026-05-18
Games scored	621
C0 league-flat bullpen MAE	3.6866
C1 team-bullpen MAE	3.6817
C1 minus C0	-0.0049 — team-bullpen ahead
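
The last row is a plain difference of the two shadow MAEs. A minimal sketch of that comparison, assuming per-game predicted and actual totals are available as parallel sequences; the `mae` and `shadow_delta` names are illustrative, and only the final line uses the reported figures.

```python
from typing import Sequence


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error over paired games."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need equal-length, non-empty inputs")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def shadow_delta(c1_mae: float, c0_mae: float) -> float:
    """C1 minus C0; negative means the team-bullpen variant (C1) is ahead."""
    return round(c1_mae - c0_mae, 4)


# The reported shadow-window MAEs reproduce the table's last row.
REPORTED_DELTA = shadow_delta(3.6817, 3.6866)

if __name__ == "__main__":
    print(f"C1 minus C0: {REPORTED_DELTA:+.4f}")
```

A gap of this size over 621 games is monitoring evidence only, consistent with the note that the shadow does not overturn the test #3 verdict.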