Han et al. (2025) Climate science data can be compressed efficiently by dual-stage extreme compression with a variational auto-encoder transformer
Identification
- Journal: Communications Earth & Environment
- Year: 2025
- Date: 2025-11-24
- Authors: Tao Han, Zhenghao Chen, Song Guo, Wanghan Xu, Wanli Ouyang, Lei Bai
- DOI: 10.1038/s43247-025-02903-z
Research Groups
- The Hong Kong University of Science and Technology, Hong Kong, SAR, China
- Shanghai Artificial Intelligence Laboratory, Shanghai, China
- The University of Newcastle, Callaghan, NSW, Australia
Short Summary
This paper introduces Aeolus, a deep learning framework built on Variational Auto-Encoder Transformer (VAEFormer) modules, for extreme compression of large-scale atmospheric datasets. It compresses the 400-terabyte ERA5 reanalysis dataset by a factor of more than 470 into a 0.85-terabyte dataset (CRA5) while maintaining high numerical accuracy and preserving the climate patterns needed for scientific analysis and forecasting.
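The headline ratio follows directly from the two dataset sizes quoted above. A minimal arithmetic check, using only those rounded figures (the function and variable names are illustrative, not from the paper):

```python
# Dataset sizes as quoted in the summary, in terabytes (rounded figures).
ERA5_TB = 400.0
CRA5_TB = 0.85

def compression_ratio(original: float, compressed: float) -> float:
    """Ratio of original to compressed size; both must use the same unit."""
    if compressed <= 0:
        raise ValueError("compressed size must be positive")
    return original / compressed

ratio = compression_ratio(ERA5_TB, CRA5_TB)
retained_percent = 100.0 * CRA5_TB / ERA5_TB

print(f"ratio = {ratio:.1f}x, retained = {retained_percent:.4f}% of original volume")
# prints: ratio = 470.6x, retained = 0.2125% of original volume
```

Because both inputs are rounded, the result only confirms that the quoted sizes are consistent with a ratio slightly above 470.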
Objective
- To develop an efficient deep learning-based atmospheric data compression method (Aeolus) that combines very high compression ratios (e.g., over 470x) with strict numerical accuracy (e.g., temperature mean absolute error below 0.2 kelvin) and low computational complexity, preserves critical atmospheric information (including extreme values and frequency-domain characteristics), and serves as a compact, information-dense resource for data-driven operational analysis models.
Study Configuration
- Spatial Scale: Global, with a spatial resolution of 0.25 degrees latitude by 0.25 degrees longitude.
- Temporal Scale: Data from January 1, 1979, to December 31, 2023 (45 years), totaling 394,464 frames. Training data spanned 1979-2021, and test data covered 2022-2023.
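The frame count and grid size implied by this configuration can be checked with a short calculation. This sketch assumes hourly frames (the summary does not state the time step, but the quoted total matches an hourly count exactly) and a regular latitude-longitude grid that includes both poles, as in the standard 0.25-degree ERA5 product:

```python
from datetime import datetime, timedelta

ONE_HOUR = timedelta(hours=1)

def hourly_frames(start: datetime, end_exclusive: datetime) -> int:
    """Number of hourly time steps in the half-open interval [start, end_exclusive)."""
    return (end_exclusive - start) // ONE_HOUR

# Full record, training split (1979-2021), and test split (2022-2023).
total_frames = hourly_frames(datetime(1979, 1, 1), datetime(2024, 1, 1))
train_frames = hourly_frames(datetime(1979, 1, 1), datetime(2022, 1, 1))
test_frames = hourly_frames(datetime(2022, 1, 1), datetime(2024, 1, 1))

# Grid dimensions at 0.25 degrees (assumption: poles included, longitude periodic).
RESOLUTION_DEG = 0.25
n_lat = round(180 / RESOLUTION_DEG) + 1
n_lon = round(360 / RESOLUTION_DEG)

print(total_frames, train_frames, test_frames, n_lat, n_lon)
# prints: 394464 376944 17520 721 1440
```

Under these assumptions the test split is about 4.4% of all frames.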
Methodology and Data
- Models used:
  - Aeolus: A dual-stage lossy-to-lossless compression scheme.
  - VAEFormer (Variational Auto-Encoder Transformer): Core deep learning module for both lossy compression (VAEFormer I) and lossless compression (VAEFormer II as a neural entropy model).
  - ACT (Atmospheric Circulation Transformer) block: A core module within VAEFormer designed to model diverse atmospheric circulation patterns with linear computational complexity.
  - FastCast: A lightweight global numerical weather prediction (NWP) model used for downstream task evaluation.
- Data sources:
  - ERA5 (Fifth-generation global atmospheric reanalysis dataset) from the European Centre for Medium-Range Weather Forecasts (ECMWF).
  - Specific subsets included:
    - Pressure-level dataset: 37 isobaric surfaces (1000 hPa to 1 hPa), 7 variables (geopotential, relative humidity, specific humidity, zonal wind component, meridional wind component, air temperature, vertical velocity).
    - Surface-level dataset: 10 variables (2-meter temperature, 10-meter u/v-component of wind, 100-meter u/v-component of wind, mean sea level pressure, total cloud cover, hourly accumulated precipitation, surface pressure).
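The dual-stage idea listed under the models (a lossy stage followed by a lossless entropy-coding stage) can be illustrated with a toy pipeline. This is not the authors' method: uniform quantization stands in for the learned lossy transform (VAEFormer I), `zlib` stands in for the neural entropy model (VAEFormer II), and the synthetic temperature-like signal and all function names are hypothetical.

```python
import math
import struct
import zlib

def lossy_stage(values, step):
    """Stage 1 stand-in: uniform quantization to integers (information is discarded here)."""
    return [round(v / step) for v in values]

def lossless_stage(symbols):
    """Stage 2 stand-in: pack symbols as int16 and compress reversibly with zlib."""
    raw = struct.pack(f"<{len(symbols)}h", *symbols)
    return zlib.compress(raw, 9)

def decode(blob, step):
    """Invert stage 2 exactly, then map quantized symbols back to physical values."""
    raw = zlib.decompress(blob)
    symbols = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s * step for s in symbols]

# Synthetic, smooth "temperature" row in kelvin (illustrative data only).
field = [280.0 + 10.0 * math.sin(2 * math.pi * i / 360) for i in range(1440)]
step = 0.1  # quantization step in kelvin; bounds the error at step / 2

symbols = lossy_stage(field, step)
blob = lossless_stage(symbols)
restored = decode(blob, step)

max_err = max(abs(a - b) for a, b in zip(field, restored))
float32_bytes = 4 * len(field)
print(f"max error = {max_err:.3f} K, size = {len(blob)} of {float32_bytes} bytes")
```

The split matters for the design: all reconstruction error comes from the first stage, so accuracy can be tuned there, while the second stage shrinks the result further without adding error.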
Main Results
- Extreme Compression: Aeolus achieved an overall compression ratio of over 470x, reducing the 400-terabyte ERA5 dataset to 0.85 terabytes (CRA5).
- High Accuracy:
  - Mean absolute error (MAE) for temperature was approximately 0.17 kelvin.
  - MAE for meridional and zonal winds was approximately 0.33 meters per second.
  - MAE for mean sea-level pressure was less than 0.09 hectopascals.
  - MAE for geopotential was approximately 7 square meters per second squared.
  - Weighted Root Mean Square Error (WRMSE) for geopotential at 500 hPa was 6.2 square meters per second squared.
- Low Latency: Achieved a compression throughput of 2728.5 megabytes per second and a decompression throughput of 1734.3 megabytes per second on an NVIDIA GeForce RTX 4090 GPU.
- Preservation of Scientific Information:
  - CRA5 accurately reconstructed climatological mean and standard deviation patterns, with maximum normalized errors for climatological mean as low as 0.13% for geopotential at 500 hPa.
  - The power spectral density of CRA5 closely aligned with that of ERA5 at both large scales (>2000 kilometers) and mesoscales (2–2000 kilometers) for most variables, preserving critical meteorological information.
  - CRA5 preserved, and by the reported measure even enhanced, the characterization of extreme weather events such as hurricanes and heatwaves; near hurricane eyes, for winds exceeding 20 meters per second, its average maximum wind speeds were 0.3 meters per second higher than in ERA5.
- Downstream Utility: Global numerical weather prediction models (FastCast) trained on the compressed CRA5 dataset showed nearly identical forecasting skill to models trained on the original ERA5 dataset, with performance comparable to state-of-the-art models such as ECMWF-HRES and Pangu-Weather.
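The two accuracy metrics quoted in these results can be sketched as follows. The WRMSE here uses the latitude weighting common in weather-forecast benchmarks (each grid row weighted by the cosine of its latitude, normalized to a mean of 1); that the paper uses exactly this definition is an assumption, and the tiny 3x2 field is made up for illustration.

```python
import math

def mae(truth, recon):
    """Mean absolute error over a 2-D field stored as a list of latitude rows."""
    n = sum(len(row) for row in truth)
    total = sum(abs(t - r) for tr, rr in zip(truth, recon) for t, r in zip(tr, rr))
    return total / n

def weighted_rmse(truth, recon, lats_deg):
    """Latitude-weighted RMSE: rows weighted by cos(latitude), weights normalized to mean 1."""
    cos_lat = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_cos = sum(cos_lat) / len(cos_lat)
    total, count = 0.0, 0
    for w, tr, rr in zip(cos_lat, truth, recon):
        for t, r in zip(tr, rr):
            total += (w / mean_cos) * (t - r) ** 2
            count += 1
    return math.sqrt(total / count)

# Toy field: three latitude rows, two longitudes each (illustrative values only).
lats = [-60.0, 0.0, 60.0]
truth = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
uniform_error = [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]
equator_error = [[0.0, 0.0], [2.0, 2.0], [0.0, 0.0]]

print(mae(truth, uniform_error), weighted_rmse(truth, uniform_error, lats))
print(mae(truth, equator_error), weighted_rmse(truth, equator_error, lats))
```

A uniform error gives the same value with or without weighting, while an error confined to the equator row counts for more than it would in an unweighted RMSE, reflecting the larger surface area that low-latitude grid cells represent.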
Contributions
- Introduction of Aeolus, presented as the first deep learning-based framework to achieve practical, ultra-high compression (over 470x) of massive atmospheric datasets such as ERA5, addressing critical storage and transmission challenges.
- Development of the VAEFormer architecture, incorporating the Atmospheric Circulation Transformer (ACT) block, which captures diverse atmospheric patterns with linear computational complexity.
- Comprehensive validation demonstrating that the compressed CRA5 dataset maintains high numerical accuracy, preserves essential climatological characteristics, and accurately captures extreme weather phenomena.
- Demonstration that the compressed data (CRA5) remains usable for downstream scientific applications: data-driven weather forecasting models trained on it achieve skill comparable to those trained on uncompressed data, which improves data accessibility and can accelerate climate research.
Funding
- Shanghai Artificial Intelligence Laboratory
- Hong Kong RGC General Research Fund (152169/22E, 152228/23E, 162161/24E)
- Research Impact Fund (No. R5060-19, No. R5011-23)
- Collaborative Research Fund (No. C1042-23GF)
- NSFC/RGC Collaborative Research Scheme (Grant No. 62461160332 & CRS_HKUST602/24)
- Areas of Excellence Scheme (AoE/E-601/22-R)
- InnoHK (HKGAI)
Citation
@article{Han2025Climate,
author = {Han, Tao and Chen, Zhenghao and Guo, Song and Xu, Wanghan and Ouyang, Wanli and Bai, Lei},
title = {Climate science data can be compressed efficiently by dual-stage extreme compression with a variational auto-encoder transformer},
journal = {Communications Earth \& Environment},
year = {2025},
doi = {10.1038/s43247-025-02903-z},
url = {https://doi.org/10.1038/s43247-025-02903-z}
}
Original Source: https://doi.org/10.1038/s43247-025-02903-z