Yu and Tolson (2026) How much historical data do we need? The role of data recency and training period length in LSTM-based rainfall-runoff modeling
Identification
- Journal: Journal of Hydrology
- Year: 2026
- Authors: Qiutong Yu, Bryan Tolson
- DOI: 10.1016/j.jhydrol.2026.135046
Research Groups
- Department of Civil and Environmental Engineering, University of Waterloo, Canada.
Short Summary
This study investigates the relative importance of training period length and data recency for LSTM-based rainfall-runoff models across 1374 North American watersheds. The findings demonstrate that recent data is more critical for predictive accuracy than long historical records, and the benefits of spatial diversity are significantly enhanced when training on recent observations.
Objective
- To determine whether decades of historical data are necessary for training large-scale LSTM streamflow models, and to evaluate whether data recency (temporal proximity to the prediction period) matters more than the total length of the training record.
Study Configuration
- Spatial Scale: 1374 gauged watersheds across North America (United States and Canada), covering diverse climatic and hydrological regimes.
- Temporal Scale: 1950–2023; training periods of 3 to 61 years within 1950–2010, with a fixed testing period of 2011–2023.
Methodology and Data
- Models used: Long Short-Term Memory (LSTM) networks implemented using the open-source NeuralHydrology library (v1.11.0).
- Data sources: The HYSETS dataset, providing harmonized daily hydrometeorological records.
- Input Variables: Dynamic meteorological forcings (daily precipitation, maximum temperature, and minimum temperature) and 16 static watershed attributes (e.g., drainage area, elevation, slope, land cover fractions).
- Experimental Design: Three designs: backward-expanding training periods (a fixed recent block extended with progressively older blocks), forward-expanding periods (a fixed old block extended with progressively newer blocks), and sliding-window periods (fixed-length windows moved forward through the record).
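The three experimental designs can be sketched as simple year-range generators. This is a minimal illustration, not the authors' code: the function names, the list of period lengths, and the sliding-window step are assumptions chosen so that the resulting ranges match the periods mentioned in this summary (for example 1995–2010 and 1950–2010). All year ranges are inclusive.

```python
from typing import List, Tuple

Period = Tuple[int, int]  # (first_year, last_year), both inclusive


def backward_expanding(end: int, lengths: List[int]) -> List[Period]:
    """Fix the last training year and push the start further into the past."""
    return [(end - n + 1, end) for n in lengths]


def forward_expanding(start: int, lengths: List[int]) -> List[Period]:
    """Fix the first training year and push the end toward the present."""
    return [(start, start + n - 1) for n in lengths]


def sliding_window(start: int, end: int, length: int, step: int) -> List[Period]:
    """Move a fixed-length window forward through the record."""
    return [(s, s + length - 1) for s in range(start, end - length + 2, step)]


# Hypothetical period lengths in years; the paper's exact set may differ.
LENGTHS = [3, 6, 11, 16, 31, 61]

backward = backward_expanding(2010, LENGTHS)   # all periods end in 2010
forward = forward_expanding(1950, LENGTHS)     # all periods start in 1950
windows = sliding_window(1950, 2010, length=16, step=15)  # assumed step

for name, periods in [("backward", backward), ("forward", forward), ("sliding", windows)]:
    print(name, periods)
```

Under these assumptions, the 16-year backward-expanding period is 1995–2010 and the 61-year period is the full 1950–2010 record, while every design is evaluated on the same 2011–2023 test period.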
Main Results
- Recency vs. Volume: Models trained on recent 16-year blocks (e.g., 1995–2010) performed as well as or better than models trained on the full 61-year record (1950–2010).
- Diminishing Returns of Old Data: Incorporating data from before 1980 added little to, or even degraded, model performance on the 2011–2023 test period.
- Spatial-Temporal Interaction: The benefit of increasing the number of watersheds (from 220 to 1100) is conditional on data recency; spatial diversity yields substantial gains only when recent data is included.
- Minimum Training Length: Roughly 6–11 years of training data were needed to capture sufficient interannual variability; performance dropped significantly with only 3 years of data.
- Peak Flow Performance: While longer records (back to 1980) slightly improved peak flow predictions, extending the record further back to 1950 provided negligible additional benefit.
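The spatial-diversity comparison above (220 versus 1100 training watersheds) needs a way to draw watershed subsets of different sizes. The sketch below shows one way to do that with nested random subsets, so that every smaller set is contained in the larger one. The summary does not say how the subsets were drawn; the nesting, the seed, the function name, and the placeholder basin identifiers are all assumptions for illustration.

```python
import random
from typing import Dict, List, Sequence


def nested_subsets(basin_ids: Sequence[str], sizes: Sequence[int], seed: int = 0) -> Dict[int, List[str]]:
    """Draw nested random subsets: each smaller set is a prefix of one shuffled list."""
    if max(sizes) > len(basin_ids):
        raise ValueError("requested subset is larger than the basin pool")
    rng = random.Random(seed)      # fixed seed so the draw is reproducible
    shuffled = list(basin_ids)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes)}


# Placeholder IDs standing in for the 1374 watersheds; not real gauge identifiers.
basins = [f"basin_{i:04d}" for i in range(1374)]
subsets = nested_subsets(basins, sizes=[220, 1100])

print({n: len(ids) for n, ids in subsets.items()})
```

Nesting the subsets keeps the comparison between sizes from being confounded by which watersheds happen to be sampled; independent draws would be an equally plausible design.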
Contributions
- Provides the first large-scale systematic evidence that data recency is a critical factor, often more so than temporal volume, for deep learning-based rainfall-runoff modeling.
- Demonstrates that spatial diversity can compensate for shorter training periods, particularly in Prediction in Ungauged Basins (PUB) contexts.
- Offers practical guidance for operational forecasting: prioritizing recent data and spatial diversity over long historical records can reduce computational costs without sacrificing accuracy.
Funding
- NSERC (Natural Sciences and Engineering Research Council of Canada) Discovery Grant (Grant No. 2022-03890).
Citation
@article{Yu2026How,
author = {Yu, Qiutong and Tolson, Bryan},
title = {How much historical data do we need? The role of data recency and training period length in LSTM-based rainfall-runoff modeling},
journal = {Journal of Hydrology},
year = {2026},
doi = {10.1016/j.jhydrol.2026.135046},
url = {https://doi.org/10.1016/j.jhydrol.2026.135046}
}
Original Source: https://doi.org/10.1016/j.jhydrol.2026.135046