The v4 offense full-game baseline failed its walk-forward evaluation, and the feature-test program is closed with all 8 tests complete. The per-team learned-model architecture, built on the 22 carried features, is now the model of record on both surfaces. Each model was chosen by a 7-configuration GBM bake-off: CatBoost on full-game (commit eb7c46c) and CatBoost on F5 (commit 29f50d8). These are research-phase point estimates, not betting claims.
| Item | Detail |
|---|---|
| Full-game total runs | CatBoost, full-game GBM bake-off winner (commit eb7c46c) · gr-18-clean offense matrix (22 features) · pooled 2021-2025 MAE 3.5249 vs LightGBM 3.5755 · 2026 OOS shadow MAE 3.6067 vs 3.6673 |
| F5 total runs | CatBoost, GBM bake-off winner (7 configs pre-registered) · F5 v2 matrix (22 features) · pooled 2021-2025 F5-total MAE 2.6419 vs LightGBM 2.6791 · 2026 OOS shadow MAE 2.4816 vs 2.5346 |
| Phase | research point estimates: not probabilities, not betting claims |
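The size of each CatBoost-over-LightGBM margin can be read off the table with simple subtraction. The sketch below only re-derives the gaps from the MAEs reported above; the `reported` dictionary and the `mae_gap` helper are illustrative names, not part of the project code.

```python
# Reported MAEs copied from the table above, as (CatBoost, LightGBM) pairs.
reported = {
    ("full-game", "pooled 2021-2025"): (3.5249, 3.5755),
    ("full-game", "2026 OOS shadow"): (3.6067, 3.6673),
    ("F5", "pooled 2021-2025"): (2.6419, 2.6791),
    ("F5", "2026 OOS shadow"): (2.4816, 2.5346),
}


def mae_gap(catboost_mae, lightgbm_mae):
    """LightGBM MAE minus CatBoost MAE; positive means CatBoost is ahead."""
    return round(lightgbm_mae - catboost_mae, 4)


gaps = {key: mae_gap(*pair) for key, pair in reported.items()}
for (surface, window), gap in gaps.items():
    print(f"{surface:9s} {window:17s} gap = {gap:.4f} runs")
```

On both surfaces the gap is positive in the pooled window and in the 2026 shadow window, which is all this arithmetic shows; it says nothing about statistical significance.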
Live deployment: the daily page's predict_slate now uses CatBoost on both surfaces: F5 since commit 9fbf52a and full-game since the 2026-05-19 wiring commit (model_version icecream-allcatboost-20260519).
With CatBoost as the model of record on both surfaces, Phase 1 of the carried-features research arc probes how the model weights each of the 22 carried features. It is a descriptive read of LossFunctionChange, computed per day across the walk-forward, and requires no pre-registration. Phase 2, a pre-registered ablation of the bottom-ranked features, will be designed against this evidence. Research program: audit/research_programs/20260519_carried_features_next.md.
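A day-by-day walk-forward means each evaluated day is scored by a model that saw only earlier days. The source does not specify the windowing (expanding vs rolling, or the warm-up length), so the following is a minimal sketch that assumes an expanding window; `walk_forward_days` and `min_train_days` are hypothetical names.

```python
from datetime import date


def walk_forward_days(game_dates, min_train_days=1):
    """Yield (train_days, test_day) pairs with train_days strictly before test_day.

    Assumes an expanding window: every earlier day is available for training.
    """
    days = sorted(set(game_dates))
    for i in range(min_train_days, len(days)):
        yield days[:i], days[i]


# Example dates for illustration only (duplicates mimic several games per day).
example_dates = [date(2021, 4, 1), date(2021, 4, 2), date(2021, 4, 2), date(2021, 4, 3)]
splits = list(walk_forward_days(example_dates))
for train_days, test_day in splits:
    print(f"train on {len(train_days)} day(s), evaluate on {test_day}")
```

In a Phase 1 style read, a model would be fit on each `train_days` slice and its feature importances recorded for that day, giving one importance vector per trained day.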
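CatBoost LossFunctionChange, day-by-day walk-forward 2021-2025 (835 trained days). LossFunctionChange compares the loss of the model with a feature against the loss with that feature removed from the model, so a higher mean = larger loss change = more load-bearing for the model.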
| Rank | Feature | Mean | Std |
|---|---|---|---|
| 1 | park_factor | 0.0223 | 0.0086 |
| 2 | bat_team_runs_pg_recent | 0.0173 | 0.0053 |
| 3 | opp_sp_xwoba_allowed | 0.0171 | 0.0073 |
| 4 | opp_bullpen_woba_tm | 0.0164 | 0.0081 |
| 5 | xwoba_team | 0.0116 | 0.0086 |
| 6 | ump_run_env | 0.0111 | 0.0084 |
| 7 | opp_bullpen_woba_lg | 0.0111 | 0.0070 |
| 8 | temp_f | 0.0110 | 0.0058 |
| 9 | flat_woba | 0.0102 | 0.0053 |
| 10 | opp_sp_depth_bf | 0.0101 | 0.0067 |
| 11 | opp_sp_woba_allowed | 0.0097 | 0.0045 |
| 12 | carry_index | 0.0092 | 0.0062 |
| 13 | opp_bullpen_ip_l5d | 0.0085 | 0.0057 |
| 14 | opp_bullpen_ip_l1d | 0.0078 | 0.0054 |
| 15 | opp_bullpen_ip_l3d | 0.0077 | 0.0051 |
| 16 | matchup_woba_v4 | 0.0075 | 0.0054 |
| 17 | matchup_woba_pitchtype | 0.0072 | 0.0048 |
| 18 | opp_sp_velo_drop | 0.0062 | 0.0039 |
| 19 | is_home | 0.0047 | 0.0038 |
| 20 | bat_team_miles_traveled | 0.0040 | 0.0033 |
| 21 | bat_team_rest_days | 0.0037 | 0.0026 |
| 22 | bat_team_tz_shift | 0.0025 | 0.0011 |
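The Mean and Std columns summarize one importance value per feature per trained day. A minimal sketch of that aggregation follows, assuming the per-day values have already been collected (for example from CatBoost's `get_feature_importance` with `type="LossFunctionChange"`). The `rank_features` helper, the use of sample standard deviation, and the toy numbers are assumptions for illustration, not the project's code or data.

```python
from statistics import mean, stdev


def rank_features(daily_importances):
    """Rank features by mean importance across days.

    daily_importances: list of {feature_name: loss_change} dicts, one per trained day.
    Returns (rank, feature, mean, std) tuples, highest mean first.
    """
    rows = []
    for name in daily_importances[0]:
        values = [day[name] for day in daily_importances]
        # Sample standard deviation is an assumption; the source does not say which is used.
        rows.append((name, mean(values), stdev(values)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(rank, name, m, s) for rank, (name, m, s) in enumerate(rows, start=1)]


# Toy values for three days; these are NOT the real per-day readings.
toy_days = [
    {"park_factor": 0.020, "is_home": 0.004},
    {"park_factor": 0.024, "is_home": 0.006},
    {"park_factor": 0.022, "is_home": 0.005},
]
table = rank_features(toy_days)
for rank, name, m, s in table:
    print(f"| {rank} | {name} | {m:.4f} | {s:.4f} |")
```

Reporting Std alongside Mean matters here because a feature with a small mean but a large day-to-day spread is a different ablation candidate from one that is consistently near zero.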
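CatBoost LossFunctionChange, day-by-day walk-forward 2021-2025 (835 trained days). LossFunctionChange compares the loss of the model with a feature against the loss with that feature removed from the model, so a higher mean = larger loss change = more load-bearing for the model.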
| Rank | Feature | Mean | Std |
|---|---|---|---|
| 1 | opp_sp_xwoba_allowed | 0.0168 | 0.0072 |
| 2 | opp_bullpen_woba_tm | 0.0122 | 0.0064 |
| 3 | park_factor | 0.0117 | 0.0042 |
| 4 | bat_team_runs_pg_recent | 0.0103 | 0.0045 |
| 5 | is_home | 0.0090 | 0.0052 |
| 6 | opp_sp_depth_bf | 0.0090 | 0.0064 |
| 7 | ump_run_env | 0.0085 | 0.0057 |
| 8 | opp_bullpen_woba_lg | 0.0083 | 0.0048 |
| 9 | flat_woba | 0.0077 | 0.0068 |
| 10 | xwoba_team | 0.0073 | 0.0055 |
| 11 | temp_f | 0.0072 | 0.0039 |
| 12 | carry_index | 0.0067 | 0.0046 |
| 13 | opp_bullpen_ip_l3d | 0.0066 | 0.0043 |
| 14 | opp_sp_woba_allowed | 0.0066 | 0.0041 |
| 15 | matchup_woba_pitchtype | 0.0063 | 0.0041 |
| 16 | opp_bullpen_ip_l5d | 0.0060 | 0.0043 |
| 17 | matchup_woba_v4 | 0.0057 | 0.0035 |
| 18 | opp_sp_velo_drop | 0.0054 | 0.0027 |
| 19 | opp_bullpen_ip_l1d | 0.0053 | 0.0041 |
| 20 | bat_team_miles_traveled | 0.0041 | 0.0037 |
| 21 | bat_team_tz_shift | 0.0031 | 0.0015 |
| 22 | bat_team_rest_days | 0.0016 | 0.0006 |
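Since Phase 2 targets the bottom-ranked features, one way to read the two rankings together is to check which features sit at the bottom of both. The sketch below uses the feature orders exactly as listed in the two tables above; the `bottom_k` helper and the cutoff `k=3` are hypothetical choices, not a pre-registered rule.

```python
# Feature names in rank order (1..22), copied from the first table above.
first_table = [
    "park_factor", "bat_team_runs_pg_recent", "opp_sp_xwoba_allowed",
    "opp_bullpen_woba_tm", "xwoba_team", "ump_run_env", "opp_bullpen_woba_lg",
    "temp_f", "flat_woba", "opp_sp_depth_bf", "opp_sp_woba_allowed",
    "carry_index", "opp_bullpen_ip_l5d", "opp_bullpen_ip_l1d",
    "opp_bullpen_ip_l3d", "matchup_woba_v4", "matchup_woba_pitchtype",
    "opp_sp_velo_drop", "is_home", "bat_team_miles_traveled",
    "bat_team_rest_days", "bat_team_tz_shift",
]
# Feature names in rank order (1..22), copied from the second table above.
second_table = [
    "opp_sp_xwoba_allowed", "opp_bullpen_woba_tm", "park_factor",
    "bat_team_runs_pg_recent", "is_home", "opp_sp_depth_bf", "ump_run_env",
    "opp_bullpen_woba_lg", "flat_woba", "xwoba_team", "temp_f", "carry_index",
    "opp_bullpen_ip_l3d", "opp_sp_woba_allowed", "matchup_woba_pitchtype",
    "opp_bullpen_ip_l5d", "matchup_woba_v4", "opp_sp_velo_drop",
    "opp_bullpen_ip_l1d", "bat_team_miles_traveled", "bat_team_tz_shift",
    "bat_team_rest_days",
]


def bottom_k(ranked_features, k):
    """Return the set of the k lowest-ranked features."""
    return set(ranked_features[-k:])


k = 3  # hypothetical cutoff for illustration
shared_bottom = sorted(bottom_k(first_table, k) & bottom_k(second_table, k))
print(shared_bottom)
```

With this cutoff the shared set is the three travel and rest features, which occupy the last three ranks in both tables in slightly different order.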
All 8 feature tests are closed. Features with FAIL or MARGINAL verdicts are carried forward (not killed), pending discussion.
| Test | Verdict | Commit | Result |
|---|---|---|---|
| #1 pitch-type matchup | FAIL | d916541 | Pitch-type-level matchup degraded total-runs accuracy every season. |
| #2 expected wOBA | PASS | b2eccf5 | Statcast expected wOBA beat actual wOBA -- better every season. |
| #3 team-specific bullpen | FAIL | 41c53be | The fixed-formula team-bullpen swap raised MAE every season -- now shadowed. |
| #4 carry index | MARGINAL | 31b4857 | Weather carry index vs temperature -- a noise-level MAE change, neutral. |
| #5 umpire run environment | PASS | f2be26e | Plate-umpire run environment improved MAE 3.6691 vs 3.6722, 4 of 5 seasons. |
| #6 travel / rest | FAIL | 26ac3c5 | Per-team travel-fatigue adjustment raised MAE -- fitted slope near-zero. |
| #7 pitcher velo trend | FAIL | 8e4bcd0 | Velo-drop swaps (continuous + 2-mph flag) both raised MAE -- carried, not killed. |
| bullpen fatigue | MARGINAL | eee89a7 | Recent bullpen-workload swap -- continuous noise-level, flag worse. Spun off from #3. |
Test #3 failed on 2021-2025, but the team-bullpen variant continues to run forward on out-of-sample games. This is monitoring only and does not overturn the test #3 verdict.
| Metric | Value |
|---|---|
| Window | 2026-03-25 .. 2026-05-18 |
| Games scored | 621 |
| C0 league-flat bullpen MAE | 3.6866 |
| C1 team-bullpen MAE | 3.6817 |
| C1 minus C0 | -0.0049 (team-bullpen ahead) |
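The shadow comparison reduces to two mean absolute errors over the same games and their difference. A minimal sketch follows; the `mae` helper and the toy predictions are illustrative assumptions, while the two reported MAEs are taken from the table above.

```python
def mae(predictions, actuals):
    """Mean absolute error over paired predictions and actual run totals."""
    if len(predictions) != len(actuals) or not actuals:
        raise ValueError("need equal-length, non-empty sequences")
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)


# Toy numbers for illustration only; these are not real games.
toy_actuals = [7, 9, 4]
toy_c0 = [8.0, 8.5, 5.0]  # league-flat bullpen variant
toy_c1 = [7.5, 8.5, 5.0]  # team-bullpen variant
toy_delta = mae(toy_c1, toy_actuals) - mae(toy_c0, toy_actuals)

# Reported shadow MAEs from the table above.
c0_mae, c1_mae = 3.6866, 3.6817
reported_delta = round(c1_mae - c0_mae, 4)
print(f"C1 minus C0 = {reported_delta:.4f}")
```

A negative difference means the C1 variant has the lower error over the window; at this magnitude it is a monitoring signal rather than a reversal of the test verdict.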