Jeung et al. (2026) Sensitivity of hydrological machine learning prediction accuracy to information quantity and quality

Identification

Journal: Hydrology and Earth System Sciences
Year: 2026
Date: 2026-02-24
Authors: Minhyuk Jeung, Younggu Her, Sang-Soo Baek, Kwangsik Yoon
DOI: 10.5194/hess-30-1077-2026

Research Groups

Department of Rural & Biosystems Engineering (Brain Korea 21), Chonnam National University, Gwangju, Republic of Korea
Department of Agricultural and Biological Engineering/Tropical Research and Education Center, University of Florida, Homestead, Florida, USA
Department of Environmental Engineering, Yeungnam University, Gyeongsan, Republic of Korea

Short Summary

This study investigates how the quantity and quality of information in training data influence the prediction accuracy of hydrological machine learning (ML) models. It demonstrates that while the highest accuracy is achieved with all available data, incorporating high-quality outputs from calibrated mechanistic models most efficiently improves ML prediction accuracy.

Objective

To explore the connection between the amount of information contained in the data used to train an ML model and the model’s prediction accuracy, with the goal of understanding how to improve accuracy efficiently.
To determine how the quantity and quality of information in training datasets, as measured by marginal and transfer entropy, affect the prediction accuracy of hydrological ML models.
To test the hypothesis that higher information quantity and quality in training datasets, reflected in higher marginal and transfer entropy values, correlate positively with model prediction accuracy.

Study Configuration

Spatial Scale: Four nested watersheds (Wall-Jeong (WJ), Ha-Nam (HN), Jang-Su (JS), and Pung-Yeong-Jung (PYJ)) within the Pung-Yeong-Jung river watershed, Republic of Korea. Drainage areas vary, with land uses including agricultural fields (upland and rice paddy), forest, and urbanized areas.
Temporal Scale:
- Monitoring data collection: 12 July 2013 to 31 December 2017 (4 years and 6 months).
- SWAT model warm-up period: 1 January 2008 to 11 July 2013.
- SWAT model calibration and ML model training period: 12 July 2013 to 31 December 2015.
- SWAT model validation and ML model testing period: 1 January 2016 to 31 December 2017.
- Data resolution: Daily for weather and model outputs; water quality samples collected every one or two weeks, or hourly during rainfall events.

Methodology and Data

Models used:
- Machine Learning (ML) models: Random Forest (RF), Support Vector Machine (SVM), and Artificial Neural Network (ANN). Hyperparameters optimized using Bayesian optimization.
- Mechanistic (theory-driven) model: Soil and Water Assessment Tool (SWAT). Calibrated using the SUFI-2 algorithm.
- Information theory metrics: Shannon's marginal entropy and transfer entropy were used to quantify information quantity and quality, respectively.
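
The two entropy metrics can be illustrated with a minimal plug-in (histogram) estimator. This is a sketch under stated assumptions, not the authors' implementation: the equal-width binning, the bin count `n_bins`, and the lag-1, history-length-1 form of transfer entropy are choices made here for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def discretize(series, n_bins=10):
    """Map each value to an equal-width bin index (binning scheme is an assumption)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0] * len(series)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in series]


def marginal_entropy(symbols):
    """Shannon entropy H(X) = -sum p(x) * log2 p(x), in bits."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def transfer_entropy(source, target):
    """Plug-in estimate of TE(source -> target) in bits, lag 1, history length 1.

    TE = sum p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) )
    where y1 is the next target symbol, y0 the current target symbol,
    and x0 the current source symbol.
    """
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    c_y1y0x0 = Counter(triples)
    c_y0x0 = Counter((y0, x0) for _, y0, x0 in triples)
    c_y1y0 = Counter((y1, y0) for y1, y0, _ in triples)
    c_y0 = Counter(y0 for _, y0, _ in triples)
    te = 0.0
    for (y1, y0, x0), c in c_y1y0x0.items():
        p_joint = c / n
        p_given_both = c / c_y0x0[(y0, x0)]
        p_given_self = c_y1y0[(y1, y0)] / c_y0[y0]
        te += p_joint * math.log2(p_given_both / p_given_self)
    return te
```

Continuous series such as daily precipitation or discharge would first be passed through `discretize`. A source that carries no information beyond the target's own history yields a transfer entropy of zero, while a source that determines the next target value yields a positive value.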
Data sources:
- Observed weather data: Daily precipitation (P), average temperature (AT), wind speed (WS), relative humidity (RH), solar radiation (SR), and evaporation (E) from the Korean Meteorological Administration (KMA).
- Observed hydrological data: Daily streamflow discharge (m³ s⁻¹), suspended solid (SS) loads, total nitrogen (TN) loads, and total phosphorus (TP) loads/concentrations measured at watershed outlets.
- Simulated data: Outputs (flow discharge, SS, TN, TP loads) from uncalibrated and calibrated SWAT models.
Training data sets (four combinations):
  1. WDO: Weather data only.
  2. WD + UC: Weather data + Uncalibrated SWAT model outputs.
  3. WD + C: Weather data + Calibrated SWAT model outputs.
  4. All: Weather data + Uncalibrated SWAT model outputs + Calibrated SWAT model outputs.
Preprocessing and evaluation:
- Data normalization: Linear scaling to a range of 0 to 1.
- Accuracy evaluation: Kling-Gupta efficiency (KGE).
- Information Use Efficiency (IUE) was calculated as the ratio of prediction accuracy gain to the increase in marginal or transfer entropy.
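
The normalization, accuracy, and efficiency steps above can be sketched as follows. The KGE shown is the widely used original formulation, KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2); this summary does not state which KGE variant the authors used, so treat that as an assumption. The IUE function follows the ratio described above, with the choice of baseline case (for example WDO) left to the caller. All function names are hypothetical.

```python
import math


def min_max_scale(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def kge(sim, obs):
    """Kling-Gupta efficiency (original formulation, assumed here).

    r     : Pearson correlation between simulated and observed series
    alpha : ratio of standard deviations (sim / obs)
    beta  : ratio of means (sim / obs)
    Requires a non-constant series and a non-zero observed mean.
    """
    n = len(obs)
    mu_s, mu_o = sum(sim) / n, sum(obs) / n
    sd_s = math.sqrt(sum((s - mu_s) ** 2 for s in sim) / n)
    sd_o = math.sqrt(sum((o - mu_o) ** 2 for o in obs) / n)
    cov = sum((s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)) / n
    r = cov / (sd_s * sd_o)
    alpha = sd_s / sd_o
    beta = mu_s / mu_o
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def information_use_efficiency(kge_base, kge_new, entropy_base, entropy_new):
    """Accuracy gain per bit of added entropy (marginal or transfer)."""
    delta_entropy = entropy_new - entropy_base
    if delta_entropy == 0:
        raise ValueError("entropy did not change; IUE is undefined")
    return (kge_new - kge_base) / delta_entropy


# KGE values 0.67 and 0.91 are the RF flow results reported for PYJ (WDO vs. WD + C).
# The entropy values 6.0 and 8.0 bits are HYPOTHETICAL placeholders, not from the paper.
iue_example = information_use_efficiency(0.67, 0.91, 6.0, 8.0)
```

A negative IUE arises whenever the added data lower KGE while entropy still increases, which matches the behavior reported for some WD + UC cases.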

Main Results

Prediction Accuracy Improvement: ML model prediction accuracy consistently improved with the sequential addition of training data, from weather data only to including uncalibrated and then calibrated mechanistic model outputs. For example, in the PYJ watershed, RF flow prediction KGE increased from 0.67 (WDO) to 0.91 (WD + C).
Efficiency of Information: Prediction accuracy improved most efficiently, relative to the amount of information added (measured by Information Use Efficiency, IUE), when high-quality outputs from calibrated mechanistic models (WD + C case) were incorporated into the training data.
Impact of Information Quality: Augmenting training datasets with low-relevance or low-accuracy data (WD + UC case) did not always improve, and sometimes degraded, model performance, leading to negative IUE scores. This highlights the critical role of information quality over mere quantity.
Entropy Quantification: Marginal entropy of training data generally increased with additional data, with the "All" case showing the most substantial increases. Transfer entropy, which quantifies effective information transfer, did not always increase and could decrease, indicating information loss or less effective transfer depending on data types, prediction variables, and ML models.
Model Performance Variation: No single ML model consistently outperformed others across all variables and watersheds. However, the ANN model demonstrated robustness by avoiding negative IUE scores even when lower-quality data (WD + UC) were added, indicating its effectiveness in handling high-dimensional, non-linear data.
Variable and Watershed Influence: ML models predicted flow more accurately (average KGE 0.557–0.854) than water quality variables (SS, TN, TP; average KGE 0.093–0.607), consistent with the higher marginal entropy of flow data (average 8.582 bits) compared with water quality data (average 5.144–6.180 bits). Prediction accuracy was also generally higher for larger watersheds, whose responses had higher entropy.

Contributions

Provides a quantitative evaluation of the impact of both information quantity (marginal entropy) and quality (transfer entropy) in training data on hydrological ML model prediction accuracy.
Introduces Information Use Efficiency (IUE) as a novel metric to assess the effectiveness of additional information in improving prediction accuracy.
Demonstrates that integrating high-quality outputs from theory-driven (mechanistic) models is the most efficient strategy for enhancing data-driven ML model performance.
Highlights that simply increasing data volume with low-quality information can be detrimental to ML model accuracy, emphasizing the need for careful data selection.
Offers insights into the differential responses of various ML models (RF, SVM, ANN) to changes in data quantity and quality, identifying ANN as particularly robust.
Establishes a robust and interpretable framework for optimizing training data selection and structuring in hydrological ML modeling.

Funding

Yeongsan and Seomjin River Water Management Committee
Project: "A Long-term Monitoring for the Nonpoint Sources Discharge"

Citation

@article{Jeung2026Sensitivity,
  author = {Jeung, Minhyuk and Her, Younggu and Baek, Sang-Soo and Yoon, Kwangsik},
  title = {Sensitivity of hydrological machine learning prediction accuracy to information quantity and quality},
  journal = {Hydrology and Earth System Sciences},
  year = {2026},
  doi = {10.5194/hess-30-1077-2026},
  url = {https://doi.org/10.5194/hess-30-1077-2026}
}

Original Source: https://doi.org/10.5194/hess-30-1077-2026