Pinheiro and Ouarda (2025) Enhancing machine learning-based seasonal precipitation forecasting using CMIP6 simulations

Identification

Journal: Atmospheric Research
Year: 2025
Date: 2025-09-09
Authors: Enzo Pinheiro, Taha B. M. J. Ouarda
DOI: 10.1016/j.atmosres.2025.108463

Research Groups

Institut National de la Recherche Scientifique, Centre Eau-Terre-Environnement, Québec, Canada

Short Summary

This study demonstrates that training machine learning (ML) models for seasonal precipitation forecasting on a larger number of individual CMIP6 simulations significantly enhances their generalization ability and improves forecasts over South America. These CMIP6-trained ML models consistently outperform those trained with limited reanalysis data (ERA5) and match or outperform most state-of-the-art dynamical models.

Objective

To investigate the advantages and limitations of using CMIP6 data to train machine learning-based seasonal forecasting (MLSF) models for seasonal precipitation forecasting.
To assess how the number of CMIP6 models used during training affects the ML model's generalization ability.
To compare the performance of MLSF models trained on individual CMIP6 model outputs versus their ensemble mean.
To quantify the added value of CMIP6 simulations by comparing their performance against an MLSF model trained with ERA5 data.
To compare the performance of the CMIP6-based ML model with state-of-the-art dynamical models.

Study Configuration

Spatial Scale: South America, with data bilinearly interpolated to a common 1° × 1° grid. Original data resolutions include 0.25° for ERA5 and 2° for ERSSTv5.
Temporal Scale:
- CMIP6 historical simulations: 1850–2014.
- ERA5 reanalysis: 1940 to present.
- ERSSTv5: 1854 to present.
- Training period: 1851–1981.
- Validation period: 1982–2002.
- Test period: Bootstrapped years from 2003 to 2023.
- Forecasts: Seasonal precipitation (3-month periods) at various initialization months (February, May, August, November) and lead times (up to three leads).
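
The period split and the bootstrapped test years listed above can be illustrated with a minimal sketch. It assumes that "bootstrapped" means resampling test years with replacement, and that the resamples are used to put an interval around a mean skill score; this summary does not state the exact procedure. The function name, the number of resamples, and the per-year scores are hypothetical.

```python
import random

# Periods as listed in the study configuration (end years inclusive).
TRAIN = range(1851, 1982)
VALID = range(1982, 2003)
TEST = range(2003, 2024)

def bootstrap_mean_interval(score_by_year, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile interval for the mean score over years resampled with replacement."""
    rng = random.Random(seed)
    years = list(score_by_year)
    means = []
    for _ in range(n_resamples):
        # Draw as many years as the test period holds, allowing repeats.
        sample = [rng.choice(years) for _ in years]
        means.append(sum(score_by_year[y] for y in sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-year skill scores, for illustration only (not from the paper).
demo_rng = random.Random(1)
scores = {year: demo_rng.uniform(-0.1, 0.3) for year in TEST}
low, high = bootstrap_mean_interval(scores)
print(len(TRAIN), len(VALID), len(TEST))
```

With only 21 test years, an interval of this kind gives a sense of how much a mean score depends on which years happen to be in the sample.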

Methodology and Data

Models used:
- TelNet: A sequence-to-sequence machine learning model designed for seasonal climate forecasting.
- CMIP6 models: Historical simulations from 18 individual models (e.g., CanESM5-CanOE, MPI-ESM1-2-HR, ACCESS-CM2).
- SEAS5: ECMWF seasonal forecasting system.
- NMME4: North American Multi-Model Ensemble project.
Data sources:
- Monthly total precipitation, sea surface temperature (SST), and 10-meter wind components from CMIP6 historical simulations.
- Monthly atmospheric variables and total precipitation from ERA5 reanalysis.
- Extended Reconstructed SST version 5 (ERSSTv5).
- Seasonal precipitation forecasts from Copernicus Climate Change Service (C3S) for SEAS5 and the North American Multi-Model Ensemble (NMME) project for NMME4.
- Climate indices (e.g., ONI, ATN, IOBW) derived from SST and atmospheric variables.
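
As an illustration of how a climate index can be derived from SST, the sketch below computes an ONI-style series: monthly SST anomalies averaged over the Niño 3.4 region (5°S–5°N, 170°W–120°W), smoothed with a 3-month running mean. It assumes the area average has already been taken and uses the whole record as the climatological base period. The paper's exact index definitions and base periods are not given in this summary, and the function name and input values are hypothetical.

```python
def oni_like_index(nino34_sst):
    """ONI-style index from monthly Nino 3.4 area-mean SST (first value = January)."""
    if len(nino34_sst) < 12:
        raise ValueError("need at least one full year of monthly values")
    # Calendar-month climatology over the whole record (an assumed base period).
    climatology = []
    for month in range(12):
        values = nino34_sst[month::12]
        climatology.append(sum(values) / len(values))
    anomalies = [sst - climatology[i % 12] for i, sst in enumerate(nino34_sst)]
    # Centered 3-month running mean; the first and last months have no full window.
    return [sum(anomalies[i - 1:i + 2]) / 3 for i in range(1, len(anomalies) - 1)]

# Hypothetical SST values in degrees Celsius: one cool year followed by one warm year.
seasonal_cycle = [26.5, 26.7, 27.2, 27.7, 27.8, 27.6, 27.2, 26.8, 26.7, 26.7, 26.6, 26.5]
cool_year = [t - 1.0 for t in seasonal_cycle]
warm_year = [t + 1.0 for t in seasonal_cycle]
index = oni_like_index(cool_year + warm_year)
```

The running mean shortens the series by two months, so a 24-month input yields 22 index values.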

Main Results

Machine learning models trained with a small number of CMIP6 simulations perform worse than those trained with ERA5, a result attributed to instability during ML model tuning and reduced generalization ability.
As the number of CMIP6 models used for training increases, the performance of the ML models improves, surpassing both ERA5-based ML models and those trained with the CMIP6 ensemble mean.
Models trained with 9 or 18 CMIP6 simulations consistently outperform ERA5-TelNet across all seasons, with statistically significant improvements.
Performance gains show diminishing returns beyond nine CMIP6 models, as 9-TelNet and 18-TelNet exhibit nearly identical performance.
Reliability and sharpness diagrams indicate that ML models trained with more CMIP6 simulations yield more confident and better-calibrated forecasts.
CMIP6-based TelNet (e.g., 9-TelNet) consistently matches or outperforms most state-of-the-art dynamical models (SEAS5, NMME4) across different initialization months and lead times, particularly for December–January–February (DJF) forecasts initialized in November.
ML models incorporating the Oceanic Niño Index (ONI) as a covariate show better performance in the Amazon region, while those with tropical Atlantic indices (ATN-, ATS-, ATL-SST) perform better in northeastern Brazil.
All ML models generally assign probabilities between 20 % and 60 % for each forecast category, suggesting low confidence, though confidence slightly increases in regions with high predictability (e.g., Amazon basin).
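
For readers unfamiliar with the reliability and sharpness diagrams mentioned above, the following is a minimal sketch of how their underlying numbers are commonly computed for one forecast category. A reliability diagram plots the observed frequency of the category against the forecast probability, and the sharpness diagram is the histogram of issued probabilities. The function name, the bin count, and the sample data are hypothetical and are not taken from the paper.

```python
def reliability_curve(probs, outcomes, n_bins=5):
    """Return (mean forecast probability, observed frequency, count) per non-empty bin."""
    prob_sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, occurred in zip(probs, outcomes):
        # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        prob_sums[b] += p
        hits[b] += occurred
        counts[b] += 1
    return [
        (prob_sums[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b]
    ]

# Hypothetical forecast probabilities for one category and whether it occurred (1) or not (0).
probs = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9]
outcomes = [0, 0, 1, 0, 1, 1]
curve = reliability_curve(probs, outcomes)
```

A well-calibrated forecast has observed frequencies close to the forecast probabilities in every bin. The counts form the sharpness histogram: probabilities concentrated in a narrow middle range, such as the 20 % to 60 % range reported above, indicate low sharpness.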

Contributions

Demonstrates that leveraging a larger number of individual multi-model dynamical simulations from CMIP6 can significantly enhance the generalization ability and robustness of machine learning-based seasonal precipitation forecasting models.
Quantifies the impact of the number of CMIP6 models on ML model performance, identifying a threshold for diminishing returns in forecast skill improvement.
Provides a comprehensive comparison of CMIP6-trained ML models against both ERA5-trained ML models and current state-of-the-art dynamical seasonal forecasting systems.
Highlights the critical role of robust model and predictor selection processes in scenarios with limited training data from CMIP6 models.
Utilizes TelNet, a recently developed interpretable ML model, to conduct this assessment in a region with high seasonal predictability (South America).

Funding

Natural Sciences and Engineering Research Council of Canada (NSERC)
Canada Research Chairs Program
Canadian Research Knowledge Network (CRKN)
Copernicus Climate Change Service (C3S) (for ERA5 and SEAS5 data)
NOAA, NSF, NASA, and DOE (for supporting the NMME project)
Digital Research Alliance of Canada (for computational resources)

Citation

@article{Pinheiro2025Enhancing,
  author = {Pinheiro, Enzo and Ouarda, Taha B. M. J.},
  title = {Enhancing machine learning-based seasonal precipitation forecasting using CMIP6 simulations},
  journal = {Atmospheric Research},
  year = {2025},
  doi = {10.1016/j.atmosres.2025.108463},
  url = {https://doi.org/10.1016/j.atmosres.2025.108463}
}

Original Source: https://doi.org/10.1016/j.atmosres.2025.108463