Adnan et al. (2026) Assessing the transferability of LSTM-based streamflow models under varying source basin diversity and target data availability (Mangla Basin, Pakistan)
Identification
- Journal: Journal of Hydrology: Regional Studies
- Year: 2026
- Date: 2026-03-09
- Authors: Muhammad Adnan, Wenyu Ouyang, Lei Ye, Muhammad Adnan Khan, Yikai Chai, Haoran Ma
- DOI: 10.1016/j.ejrh.2026.103329
Research Groups
- School of Infrastructure Engineering, Dalian University of Technology, Dalian, China
- Ningbo Institute of Dalian University of Technology, Ningbo, China
- Institute of Hydraulic Engineering and Technical Hydromechanics, TUD Dresden University of Technology, Dresden, Germany
Short Summary
This study evaluates the transferability of LSTM-based streamflow models in the data-scarce Mangla Basin, Pakistan, demonstrating that transfer learning significantly improves predictions, especially with limited local data, though its advantage lessens as local data availability increases.
Objective
- To evaluate how source dataset size and target basin data availability impact the performance of transfer learning for streamflow prediction in the data-scarce Mangla Basin, Pakistan, compared to locally trained Long Short-Term Memory (LSTM) models.
Study Configuration
- Spatial Scale: Mangla Basin, Pakistan (33,500 km²). Source basins from CAMELS-US (531 basins) and Caravan (a global dataset; 5150 basins retained after filtering).
- Temporal Scale:
  - Target basin (Mangla): 2000–2014 (15 years). The training subset varied from 20% to 100% of the available training record (approximately 2–10 years), while the validation and testing periods were held fixed.
  - CAMELS-US source: 1980–2005.
  - Caravan source: 1980–2000.
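A minimal sketch of how the training-length scenarios could be constructed. It assumes a training window of roughly 10 years and that shorter scenarios keep the most recent part of that window. The window dates, the `keep` option, and the function name `training_window` are illustrative assumptions, not details taken from the study.

```python
from datetime import date, timedelta


def training_window(start, end, fraction, keep="latest"):
    """Return (sub_start, sub_end) covering `fraction` of the days in [start, end].

    keep="latest" retains the days closest to `end`; anything else retains
    the earliest days. Which end the study kept is not stated (assumption).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total_days = (end - start).days + 1
    n_days = max(1, round(total_days * fraction))
    if keep == "latest":
        return end - timedelta(days=n_days - 1), end
    return start, start + timedelta(days=n_days - 1)


# Hypothetical training window, chosen only so that 100% is about 10 years.
TRAIN_START = date(2000, 1, 1)
TRAIN_END = date(2009, 12, 31)

for frac in (0.2, 0.4, 0.6, 0.8, 1.0):
    sub_start, sub_end = training_window(TRAIN_START, TRAIN_END, frac)
    years = ((sub_end - sub_start).days + 1) / 365.25
    print(f"{frac:.0%}: {sub_start} to {sub_end} (~{years:.1f} years)")
```

Under these assumed dates, the 20% scenario spans about 2 years and the 100% scenario about 10 years, which matches the range quoted above.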
Methodology and Data
- Models used: Long Short-Term Memory (LSTM) networks. Both local LSTM and Transfer Learning (TL) LSTM models were implemented, with full model fine-tuning applied for TL.
- Data sources:
  - Source Datasets: CAMELS-US (531 basins) and Caravan (5150 basins). These datasets provide daily streamflow observations, meteorological forcing data (precipitation, radiation, temperature, vapor pressure for CAMELS-US; precipitation, max/min temperature, snow depth, solar radiation, pressure, potential evaporation for Caravan), and static catchment attributes.
  - Target Dataset (Mangla Basin):
    - Observed streamflow: Azad Pattan gauging station, provided by the Water and Power Development Authority (WAPDA), Pakistan. Converted from cubic meters per second (m³/s) to millimeters per day (mm/day).
    - Meteorological forcing: ERA5-Land reanalysis dataset (total precipitation, maximum and minimum air temperature, surface pressure, surface net solar radiation, snow depth water equivalent, potential evaporation) for 2000–2014.
- Evaluation Metrics: Root Mean Square Error (RMSE), Nash–Sutcliffe Efficiency (NSE), Kling–Gupta Efficiency (KGE), high-flow volume bias (FHV) over the top 2% of observed flows, and low-flow volume bias (FLV) over the lowest 30% of observed flows.
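The unit conversion and most of the metrics listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the KGE uses the common 2009 formulation (correlation, variability ratio, bias ratio), FHV follows the usual flow-duration-curve definition (observed and simulated flows each sorted independently), and the 33,500 km² basin area is used as the drainage area even though the contributing area at the gauge may differ. FLV is omitted because its usual definition works on log-transformed flows and needs extra handling of zero flows.

```python
import math


def m3s_to_mm_per_day(q_m3s, area_km2):
    """Convert discharge (m^3/s) to runoff depth (mm/day) over a drainage area."""
    # q * 86400 s/day / (area_km2 * 1e6 m^2) * 1000 mm/m = q * 86.4 / area_km2
    return q_m3s * 86.4 / area_km2


def _mean(values):
    return sum(values) / len(values)


def rmse(obs, sim):
    """Root mean square error between observed and simulated series."""
    return math.sqrt(_mean([(s - o) ** 2 for o, s in zip(obs, sim)]))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    mean_obs = _mean(obs)
    num = sum((s - o) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def kge(obs, sim):
    """Kling-Gupta efficiency (assumed 2009 form; undefined for constant series)."""
    mo, ms = _mean(obs), _mean(sim)
    so = math.sqrt(_mean([(o - mo) ** 2 for o in obs]))
    ss = math.sqrt(_mean([(s - ms) ** 2 for s in sim]))
    r = _mean([(o - mo) * (s - ms) for o, s in zip(obs, sim)]) / (so * ss)
    alpha, beta = ss / so, ms / mo
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def fhv(obs, sim, top=0.02):
    """Percent volume bias over the highest `top` fraction of sorted flows."""
    n_top = max(1, round(top * len(obs)))
    obs_top = sorted(obs, reverse=True)[:n_top]
    sim_top = sorted(sim, reverse=True)[:n_top]
    return 100.0 * (sum(sim_top) - sum(obs_top)) / sum(obs_top)


AREA_KM2 = 33_500  # assumption: basin area reported above, not the gauge area

obs_m3s = [400.0, 650.0, 1200.0, 2500.0, 900.0]   # made-up values for illustration
sim_m3s = [380.0, 700.0, 1100.0, 2100.0, 950.0]   # made-up values for illustration
obs = [m3s_to_mm_per_day(q, AREA_KM2) for q in obs_m3s]
sim = [m3s_to_mm_per_day(q, AREA_KM2) for q in sim_m3s]
print(f"RMSE={rmse(obs, sim):.3f} mm/day, NSE={nse(obs, sim):.3f}, "
      f"KGE={kge(obs, sim):.3f}, FHV={fhv(obs, sim):.1f}%")
```

A negative FHV, as in this toy series, means the simulated peak volume is smaller than the observed one, which is the sense in which high flows are underestimated.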
Main Results
- Under severely limited training data (20% of the target basin record), both local and transfer learning (TL) models performed poorly (validation NSE ≈ 0.1–0.3), indicating that this record length is too short for either approach to learn the basin's rainfall–runoff behavior.
- As training data increased, TL models substantially improved, achieving validation NSE values of 0.89 for CAMELS-US-based models and 0.87 for Caravan-based models at 80% training length, consistently outperforming the local LSTM model.
- At full training length (100%), TL performance slightly declined but remained marginally superior (NSE ≈ 0.84–0.87 for TL vs. 0.81 for local), suggesting local models adapt more to basin-specific dynamics with extensive data.
- Across all scenarios, FHV values were negative, indicating systematic underestimation of high flows; TL models generally reduced this underestimation relative to the local model.
- TL models trained on larger and more diverse Caravan subsets (e.g., 60% and 100% of 5150 basins) generally achieved lower RMSE and higher NSE values, particularly at shorter target training lengths, highlighting the benefit of larger source datasets.
- The advantage of transfer learning diminished as local data availability increased, indicating a shift in the dominant source of information from pretrained weights to local observations.
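The source-size experiments can be pictured as pretraining on subsets of the 5150 filtered Caravan basins. The sketch below draws such subsets at random with a fixed seed. How the study actually selected its subsets is not stated here, so the random sampling, the 20% level, the basin ID format, and the name `sample_source_basins` are all assumptions made for illustration.

```python
import random

N_CARAVAN = 5150  # filtered Caravan basins, as reported above


def sample_source_basins(basin_ids, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of the basins."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)          # fixed seed keeps the subset repeatable
    k = round(len(basin_ids) * fraction)
    return sorted(rng.sample(basin_ids, k))


# Hypothetical basin identifiers; real Caravan IDs look different.
basin_ids = [f"basin_{i:04d}" for i in range(N_CARAVAN)]

for frac in (0.2, 0.6, 1.0):
    subset = sample_source_basins(basin_ids, frac)
    print(f"{frac:.0%} of source basins -> {len(subset)} basins")
```

With these fractions the subsets hold 1030, 3090, and 5150 basins, consistent with the "approximately 1000 to 5000 basins" range described in this summary.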
Contributions
- Systematically evaluates transfer learning performance across varying levels of target-basin data availability (20% to 100% of the record) and source dataset sizes (CAMELS-US and Caravan subsets ranging from approximately 1000 to 5000 basins).
- Provides practical guidance on the conditions under which transfer learning is most effective and when its advantages diminish in data-scarce hydrological settings.
- Highlights the persistent challenge of accurately predicting extreme high- and low-flow events even with transfer learning, suggesting the need for extreme-aware loss functions and region-adaptive strategies in future research.
Funding
- National Natural Science Foundation of China (Nos. 52322901 and 52309010)
- Doctoral Research Start-up Project of the Liaoning Provincial Science and Technology Joint Program Fund (No. 2023-BSBA-075)
Citation
@article{Adnan2026Assessing,
author = {Adnan, Muhammad and Ouyang, Wenyu and Ye, Lei and Khan, Muhammad Adnan and Chai, Yikai and Ma, Haoran},
title = {Assessing the transferability of LSTM-based streamflow models under varying source basin diversity and target data availability (Mangla Basin, Pakistan)},
journal = {Journal of Hydrology: Regional Studies},
year = {2026},
doi = {10.1016/j.ejrh.2026.103329},
url = {https://doi.org/10.1016/j.ejrh.2026.103329}
}
Original Source: https://doi.org/10.1016/j.ejrh.2026.103329