Liu et al. (2025) From RNNs to Transformers: benchmarking deep learning architectures for hydrologic prediction
Identification
- Journal: Hydrology and Earth System Sciences
- Year: 2025
- Date: 2025-12-01
- Authors: Jiangtao Liu, Chaopeng Shen, Fearghal O’Donncha, Yalan Song, Zhi Wei, Hylke E. Beck, Tadd Bindas, Nicholas Kraabel, Kathryn Lawson
- DOI: 10.5194/hess-29-6811-2025
Research Groups
- Civil and Environmental Engineering, The Pennsylvania State University, University Park, PA, USA
- IBM Research, Dublin, Ireland
- Hohai University, Nanjing, China
- King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Short Summary
This study introduces a unified deep learning framework for benchmarking 11 Transformer-based architectures against a baseline Long Short-Term Memory (LSTM) model across diverse hydrologic prediction tasks, and additionally evaluates pretrained Large Language Models (LLMs) and Time Series Attention Models (TSAMs). LSTM excels in regression, but attention-based models surpass it in more complex tasks such as autoregression and zero-shot forecasting.
Objective
- To develop a single deep learning framework capable of handling a wide range of hydrologic prediction tasks (soil moisture, streamflow, water chemistry, snow water equivalent), enabling direct comparisons among different model architectures.
- To evaluate the performance of various attention-based architectures compared to LSTM models across tasks with varying complexity.
- To assess the applicability of large, pre-trained models (LLMs or TSAMs) for hydrologic prediction in ungauged basins.
Study Configuration
- Spatial Scale:
- Catchment Attributes and Meteorology for Large-sample Studies (CAMELS): 531 basins across the conterminous United States.
- Global Streamflow: 3434 basins globally, with catchment areas between 50 and 5000 square kilometers.
- International Soil Moisture Network (ISMN): 1317 sites globally.
- Snow Water Equivalent (SWE): 525 sites across the western United States.
- Dissolved Oxygen (DO): 236 basins across the United States.
- Temporal Scale:
- All datasets have a daily temporal resolution.
- Data periods vary by dataset (e.g., CAMELS: 1980–2008; Global Streamflow: 1980–2016).
- Prediction horizons for forecasting and autoregression: 1 day, 7 days, 30 days, and 60 days.
- Input time series length for regression: 365 days (see the windowing sketch below).
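To make the window construction concrete, here is a minimal sketch of slicing daily series into 365-day regression inputs and multi-day forecast targets. The function name, array shapes, and layout are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def make_windows(forcings, target, input_len=365, horizon=0):
    """Slice daily arrays into (input, target) pairs.

    forcings: (T, n_features) daily forcings; target: (T,) daily target.
    horizon=0  -> regression: predict the target over the input window itself.
    horizon>0  -> forecasting/autoregression: predict `horizon` days ahead.
    """
    xs, ys = [], []
    for start in range(len(target) - input_len - horizon + 1):
        end = start + input_len
        xs.append(forcings[start:end])
        ys.append(target[start:end] if horizon == 0 else target[end:end + horizon])
    return np.stack(xs), np.stack(ys)

# Hypothetical daily data: 10 years, 5 forcing variables, 7-day forecasting task
rng = np.random.default_rng(0)
x, y = make_windows(rng.random((3650, 5)), rng.random(3650), horizon=7)
print(x.shape, y.shape)  # (3279, 365, 5) (3279, 7)
```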
Methodology and Data
- Models used:
- Baseline Models: Long Short-Term Memory (LSTM), DLinear (a minimal LSTM baseline is sketched after this section).
- Attention-based Architectures (11): CARDformer, Crossformer, ETSformer, Informer, iTransformer, Non-stationary Transformer, Pyraformer, Reformer, Vanilla Transformer, PatchTST, TimesNet.
- Pre-trained Models for Zero-Shot Forecasting: GPT-3.5, GPT-4-turbo, Gemini 1 Pro, Llama 3 8B, and Llama 3 70B (Large Language Models); TimeGPT, Lag-Llama, and Tiny Time Mixers (Time Series Attention Models).
- Data sources:
- Observed Hydrologic Data:
- CAMELS dataset (streamflow, conterminous United States).
- Global Streamflow dataset (Beck et al., 2020) (streamflow, global).
- International Soil Moisture Network (ISMN) (soil moisture, global).
- SNOTEL observations (snow water equivalent, western United States).
- CAMELS-Chem (dissolved oxygen, United States).
- Meteorological Forcing Data: Daymet, Maurer, North American Land Data Assimilation System (NLDAS) (precipitation, solar radiation, temperature, vapor pressure, wind speed, humidity, surface pressure, evaporation).
- Static Attributes: Elevation, slope, area, forest fraction, leaf area index, green vegetation fraction, soil depth, porosity, conductivity, water content, soil texture fractions (sand, silt, clay), carbonate rock fraction, permeability, climate indices, land surface temperature, albedo, land cover, Normalized Difference Vegetation Index (NDVI), profile curvature, roughness, and mean Soil Moisture Active Passive (SMAP) soil moisture.
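For orientation, below is a minimal PyTorch sketch of the kind of LSTM baseline common in large-sample hydrology, where static basin attributes are repeated along the time axis and concatenated with the dynamic forcings at every step. The hidden size, dropout rate, and input dimensions are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    """LSTM that ingests dynamic forcings plus repeated static attributes."""

    def __init__(self, n_dynamic, n_static, hidden_size=256, dropout=0.4):
        super().__init__()
        self.lstm = nn.LSTM(n_dynamic + n_static, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_size, 1)  # one target, e.g. streamflow

    def forward(self, forcings, statics):
        # forcings: (batch, time, n_dynamic); statics: (batch, n_static)
        statics = statics.unsqueeze(1).expand(-1, forcings.size(1), -1)
        x = torch.cat([forcings, statics], dim=-1)
        out, _ = self.lstm(x)
        return self.head(self.dropout(out)).squeeze(-1)  # (batch, time)

# Illustrative shapes: 365-day windows, 8 forcings, 27 static attributes
model = LSTMBaseline(n_dynamic=8, n_static=27)
y_hat = model(torch.randn(16, 365, 8), torch.randn(16, 27))
print(y_hat.shape)  # torch.Size([16, 365])
```

Concatenating statics at every time step lets a single model learn across basins with very different characteristics, which is what makes large-sample benchmarks like these possible.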
Main Results
- Regression Tasks: LSTM models generally outperformed attention-based models, achieving median Kling-Gupta Efficiency (KGE) values of 0.80 for CAMELS and 0.75 for global streamflow (KGE and the high-flow bias FHV are sketched in code after this list). The Non-stationary Transformer slightly surpassed LSTM for snow water equivalent prediction (KGE = 0.88 vs 0.87). Attention-based models showed advantages in capturing extreme values (e.g., the Non-stationary Transformer for high flows, Pyraformer and Reformer for low flows).
- Forecasting Tasks: LSTM performed best at short lead times (1 day), with a KGE of 0.89 for CAMELS streamflow. The performance gap between LSTM and attention-based models narrowed as the lead time increased, from a KGE difference of 0.08 at 1 day to 0.01 at 30 days.
- Autoregression Tasks: Attention-based models substantially outperformed LSTM at longer horizons (7, 30, and 60 days). For a 7-day horizon, Pyraformer, PatchTST, and Crossformer achieved KGE values nearly twice those of LSTM. Incorporating static attributes improved performance for all models, but LSTM's performance still declined significantly at longer horizons.
- Spatial Cross-Validation (Prediction in Ungauged Basins): All models experienced performance declines, but attention-based models showed relatively smaller decreases. Crossformer slightly outperformed LSTM (KGE = 0.63 vs 0.62) and demonstrated improved high-flow prediction (FHV = -6.76 vs -14.13).
- Zero-Shot Predictions: Pre-trained LLMs (GPT-3.5, Llama 3 8B) and TSAMs (TimeGPT) exhibited competitive predictive capabilities without domain-specific fine-tuning, with TimeGPT achieving a KGE of 0.68 for a 7-day horizon, surpassing supervised LSTM (KGE = 0.50). TimeGPT maintained robust performance at a 30-day horizon (KGE = 0.33).
- Computational Cost: Attention-based models generally incurred higher energy consumption and carbon dioxide emissions; for example, training Crossformer on the CAMELS dataset emitted approximately 2.25 kilograms of CO2 equivalent, about 13 times that of LSTM (0.17 kilograms of CO2 equivalent).
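The metrics quoted above follow standard definitions. Below is a minimal sketch of the Kling-Gupta Efficiency (Gupta et al., 2009) and a percent-bias high-flow metric in the style of FHV, written under the assumption that the paper uses the conventional formulations; the 2 % high-flow threshold is the usual choice but is an assumption here:

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta Efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    r = np.corrcoef(sim, obs)[0, 1]       # linear correlation
    alpha = np.std(sim) / np.std(obs)     # variability ratio
    beta = np.mean(sim) / np.mean(obs)    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def fhv(sim, obs, pct=0.02):
    """Percent bias over the high-flow segment (top `pct` of flows).

    Negative values indicate underestimated peaks, consistent with the
    FHV numbers reported above.
    """
    k = max(1, int(len(obs) * pct))
    sim_h = np.sort(sim)[-k:]
    obs_h = np.sort(obs)[-k:]
    return 100.0 * (sim_h.sum() - obs_h.sum()) / obs_h.sum()
```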
Contributions
- Developed a robust, automated deep learning framework for multi-source, multi-scale data integration and comprehensive benchmarking of diverse deep learning architectures in hydrology.
- Systematically compared 11 Transformer-based models against LSTM across five distinct hydrologic prediction tasks, providing a detailed understanding of their relative strengths and weaknesses.
- Demonstrated that while LSTM excels in regression and short-term forecasting, attention-based models show superior performance in more complex tasks, including long-term autoregression and capturing extreme hydrologic events.
- Pioneered the application of pre-trained Large Language Models (LLMs) and Time Series Attention Models (TSAMs) for zero-shot hydrologic forecasting (illustrated in the sketch after this list), highlighting their significant potential for predictions in data-limited regions without task-specific training.
- Provided a valuable benchmark and framework for future development and comparison of large-scale models in water resource modeling, forecasting, and management.
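To illustrate what zero-shot forecasting with a general-purpose LLM can look like, the sketch below serializes a discharge history into a plain-text continuation prompt. The serialization scheme, prompt wording, and units are hypothetical illustrations and do not reproduce the paper's protocol:

```python
def build_forecast_prompt(history, horizon=7, precision=2):
    """Serialize a daily discharge history into a continuation prompt.

    `history` is a list of floats; comma separation and fixed precision
    are illustrative assumptions, not the paper's formatting.
    """
    series = ", ".join(f"{v:.{precision}f}" for v in history)
    return (
        "The following is a daily streamflow series in mm/day:\n"
        f"{series}\n"
        f"Continue the series for the next {horizon} days, "
        "returning only comma-separated numbers."
    )

prompt = build_forecast_prompt([1.23, 1.10, 0.98, 2.45, 3.07], horizon=7)
# The prompt would then be sent to a pretrained model (e.g., GPT-3.5 or
# Llama 3 8B) and the returned numbers parsed as the zero-shot forecast.
print(prompt)
```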
Funding
- National Science Foundation Award (Award no. EAR-2221880)
- U.S. Department of Energy, Office of Biological and Environmental Research (contract DE-SC0016605)
- Cooperative Institute for Research to Operations in Hydrology (CIROH) through National Oceanic and Atmospheric Administration (NOAA) Cooperative Agreement (grant no. NA22NWS4320003)
Citation
@article{Liu2025From,
author = {Liu, Jiangtao and Shen, Chaopeng and O’Donncha, Fearghal and Song, Yalan and Wei, Zhi and Beck, Hylke E. and Bindas, Tadd and Kraabel, Nicholas and Lawson, Kathryn},
title = {From RNNs to Transformers: benchmarking deep learning architectures for hydrologic prediction},
journal = {Hydrology and Earth System Sciences},
volume = {29},
year = {2025},
doi = {10.5194/hess-29-6811-2025},
url = {https://doi.org/10.5194/hess-29-6811-2025}
}
Original Source: https://doi.org/10.5194/hess-29-6811-2025