Keller et al. (2025) Replicability in Earth System Models

Identification

Journal: Geoscientific Model Development
Year: 2025
Date: 2025-12-22
Authors: Kai Keller, Marta Alerany Solé, Mario Acosta
DOI: 10.5194/gmd-18-10221-2025

Research Groups

Barcelona Supercomputing Center, Barcelona, Spain
IBS Center for Climate Physics (ICCP), South Korea (for LENS2 data and supercomputing resources)

Short Summary

This paper introduces a novel methodology to test the replicability of Earth System Models (ESMs) across different computing environments, improving accuracy by approximately 60% over a recent state-of-the-art method (Massonnet et al., 2020). It also establishes an objective measure for statistically distinguishing between model climates using Cohen's effect size, finding that an effect size of d = 0.2 can serve as a threshold for statistical indistinguishability.

Objective

To develop a novel and robust methodology for testing the replicability of Earth System Models (ESMs) when run on different computing environments, ensuring that observed climate signal differences are exclusively attributable to scientific drivers rather than computational variations.
To establish an objective, quantitative measure for what constitutes a statistically different climate, based on Cohen's effect size.
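
The effect size named in the second objective can be illustrated with a short sketch. It assumes the textbook definition of Cohen's d (difference in means divided by the pooled sample standard deviation); the function name `cohens_d` and the sample values are hypothetical, and how the paper aggregates d across variables or grid points is not specified in this summary.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (textbook form)."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


THRESHOLD = 0.2  # threshold quoted in the summary

# Hypothetical ensemble means of one scalar quantity from two environments.
ensemble_a = [1.0, 2.0, 3.0, 4.0, 5.0]
ensemble_b = [2.0, 3.0, 4.0, 5.0, 6.0]

d = cohens_d(ensemble_a, ensemble_b)
exceeds_threshold = abs(d) > THRESHOLD
print(f"d = {d:.3f}, exceeds threshold: {exceeds_threshold}")
```

Because d is expressed in units of the pooled standard deviation, the same threshold can be compared across variables with different physical units.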

Study Configuration

Spatial Scale: Global, with a 1° horizontal resolution for the Community Earth System Model 2 (CESM2) Large Ensemble Community Project (LENS2) data. The Lorenz-96 model uses N = 36 cells (10° of longitude each).
Temporal Scale: Multi-decadal climate simulations (e.g., 1850-2100 for LENS2, with specific analysis periods like 1960-1989, 1990-2014, and 1850-1880). Lorenz-96 simulations span 10 years with monthly averages.
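
The Lorenz-96 configuration above can be sketched as follows. This is a minimal illustration, assuming the standard Lorenz-96 equations and a classical fourth-order Runge-Kutta integrator. Only N = 36 comes from the summary; the forcing F = 8, the time step, the initial perturbation, and all function names are assumptions made for the example.

```python
import numpy as np

N = 36      # number of cells, as stated above
F = 8.0     # forcing; assumed (a commonly used chaotic value)
DT = 0.01   # time step in model time units; assumed


def lorenz96_tendency(x, forcing=F):
    """dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, periodic in i."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing


def rk4_step(x, dt=DT):
    """Advance the state by one fourth-order Runge-Kutta step."""
    k1 = lorenz96_tendency(x)
    k2 = lorenz96_tendency(x + 0.5 * dt * k1)
    k3 = lorenz96_tendency(x + 0.5 * dt * k2)
    k4 = lorenz96_tendency(x + dt * k3)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0


def integrate(x0, n_steps):
    """Integrate the model for n_steps and return the final state."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = rk4_step(x)
    return x


def make_member(seed, n_steps=1000):
    """Build one ensemble member from a slightly perturbed rest state."""
    rng = np.random.default_rng(seed)
    x0 = F + 1e-3 * rng.standard_normal(N)
    return integrate(x0, n_steps)


members = np.stack([make_member(seed) for seed in range(5)])
print(members.shape)
```

Each seed produces a different trajectory from a tiny initial perturbation, which is what makes such a toy model useful for generating ensembles cheaply.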

Methodology and Data

Models used:
- Community Earth System Model 2 (CESM2) Large Ensemble Community Project (LENS2)
- Lorenz-96 model (chaotic toy model)
Data sources:
- LENS2 ensemble (100-member, multi-decadal climate simulation data at 1° horizontal resolution)
- Observational datasets for various climate variables (e.g., ERA5 for air temperature, sea level pressure, specific humidity; EN.4.2.2 for sea surface temperature).
- Synthetic data from Gaussian distributions.

Main Results

A novel replicability methodology was developed, incorporating four scores (Root-Mean-Square Z-score, adapted Reichler-Kim index, Bias score, Root-Mean-Square Error score) and four statistical tests (Kolmogorov-Smirnov, Welch's t-test, Mann-Whitney U-test, Bootstrap test).
The new methodology improves accuracy by approximately 60% compared to a recent state-of-the-art method (Massonnet et al., 2020). For 5-member ensembles, it resolves differences of 1 standard deviation, compared to 1.7 standard deviations with Massonnet's method.
Cohen's effect size (d) was established as an objective measure for statistical differences between model climates. An effect size of d = 0.2 is proposed as the threshold below which two model climates can be treated as statistically indistinguishable.
The CESM2 LENS2 100-member ensemble was split into two 50-member sub-ensembles that share the same configuration but use different biomass burning (BMB) emission forcings (original CMIP6 vs. smoothed). Their median effect sizes ranged from 0.15 to 0.38 in the reference period (1990-2014), where the forcing differed, and from 0.11 to 0.16 in the control period (1960-1989), where the forcing was identical.
The methodology successfully detected non-replicability for variables with median effect sizes greater than 0.3 when using 50-member ensembles (e.g., hus850, hus300, ta850, ta200, tas, tos).
For 20-member ensembles comparing LENS2 simulations initialized from different AMOC phases (strong vs. weak current), the methodology detected non-replicability for variables with effect sizes greater than 0.35, demonstrating its applicability even with smaller ensemble sizes.
Welch's t-test and the newly developed Bootstrap test generally showed the highest statistical power for detecting differences.
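
The four statistical tests listed above can be sketched on a single scalar quantity as follows. The first three use standard `scipy.stats` functions. The paper's own Bootstrap test is not described in this summary, so `bootstrap_mean_test` is a generic pooled-resampling variant and should be read as an assumption, not as the authors' method. The four scores are not reproduced because their definitions are not given here. All sample data are synthetic.

```python
import numpy as np
from scipy import stats


def bootstrap_mean_test(a, b, n_resamples=2000, seed=0):
    """Two-sided bootstrap p-value for a difference in means.

    Generic pooled-resampling variant (an assumption, not the paper's test).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    # Under the null hypothesis both samples share one distribution,
    # so both resamples are drawn from the pooled data.
    ra = rng.choice(pooled, size=(n_resamples, a.size), replace=True)
    rb = rng.choice(pooled, size=(n_resamples, b.size), replace=True)
    diffs = np.abs(ra.mean(axis=1) - rb.mean(axis=1))
    # The add-one correction keeps the p-value strictly positive.
    return (np.sum(diffs >= observed) + 1) / (n_resamples + 1)


def run_tests(a, b):
    """Return the p-value of each two-sample test for samples a and b."""
    return {
        "ks": stats.ks_2samp(a, b).pvalue,
        "welch_t": stats.ttest_ind(a, b, equal_var=False).pvalue,
        "mann_whitney_u": stats.mannwhitneyu(
            a, b, alternative="two-sided"
        ).pvalue,
        "bootstrap": bootstrap_mean_test(a, b),
    }


# Synthetic 50-member samples; the second is shifted by two standard deviations.
rng = np.random.default_rng(42)
control = rng.normal(0.0, 1.0, size=50)
shifted = rng.normal(2.0, 1.0, size=50)

pvalues = run_tests(control, shifted)
for name, p in pvalues.items():
    print(f"{name}: p = {p:.3g}")
```

With a shift this large every test rejects the null hypothesis at the 5% level; the tests differ mainly in how quickly they lose power as the shift shrinks.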

Contributions

Introduction of a novel, enhanced methodology for ensemble-based replicability testing in ESMs, significantly improving accuracy over existing methods.
Establishment of Cohen's effect size as an objective and quantitative metric to define and quantify statistical differences between climate model outputs, providing a clear threshold (d = 0.2) for statistical indistinguishability.
Comprehensive evaluation of the methodology's performance using both synthetic (Gaussian, Lorenz-96) and real-world (CESM2 LENS2) climate data, including analysis of False Positive Rate (FPR) and statistical power.
Quantification of the required ensemble sizes to confidently detect specific effect sizes, offering practical guidance for future climate model intercomparison projects like CMIP.
Demonstration that the developed methodology can effectively detect artificial climate signals introduced by differences in computing environments, thereby increasing confidence in climate change projections.
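
The False Positive Rate and power evaluation on Gaussian synthetic data, mentioned in the contributions above, can be approximated with a small Monte Carlo experiment. This sketch assumes a single scalar variable, unit-variance Gaussian samples, and Welch's t-test at a 5% significance level; the function name `rejection_rate` and all parameter values are hypothetical. The paper's methodology combines several scores and tests, so the numbers produced here are not comparable to its reported results.

```python
import numpy as np
from scipy import stats


def rejection_rate(effect_size, n_members, n_trials=2000, alpha=0.05, seed=0):
    """Fraction of trials in which Welch's t-test rejects at level alpha.

    Both samples have unit variance, so the mean shift equals the
    population value of Cohen's d.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_trials, n_members))
    b = rng.standard_normal((n_trials, n_members)) + effect_size
    pvalues = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue
    return float(np.mean(pvalues < alpha))


# With no true difference, the rejection rate estimates the FPR.
fpr = rejection_rate(effect_size=0.0, n_members=50)

# With a true difference, the rejection rate estimates the power.
power_small = rejection_rate(effect_size=0.2, n_members=50)
power_large = rejection_rate(effect_size=1.0, n_members=50)

print(f"FPR: {fpr:.3f}")
print(f"power at d = 0.2: {power_small:.3f}")
print(f"power at d = 1.0: {power_large:.3f}")
```

Repeating the call over a range of `n_members` values gives a rough picture of how ensemble size trades off against the smallest detectable effect size for one scalar test.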

Funding

Hpc AlliaNce for Applications and supercoMputing Innovation (HANAMI) project, funded by the European High Performance Computing Joint Undertaking (EuroHPC JU) under the European Union's Horizon Europe framework program for research and innovation (Grant Agreement No. 101136269).
Barcelona Supercomputing Center (BSC) internal resources on the MareNostrum 5 supercomputer.
IBS Center for Climate Physics (ICCP) for CESM2 LENS2 project supercomputing resources.

Citation

@article{Keller2025Replicability,
  author = {Keller, Kai and Alerany Solé, Marta and Acosta, Mario},
  title = {Replicability in Earth System Models},
  journal = {Geoscientific Model Development},
  year = {2025},
  doi = {10.5194/gmd-18-10221-2025},
  url = {https://doi.org/10.5194/gmd-18-10221-2025}
}

Original Source: https://doi.org/10.5194/gmd-18-10221-2025