Yin et al. (2025) Addressing Data Imbalance in Hydrological Machine Learning: Impact of Advanced Sampling Methods on Performance and Interpretability

⚠️ Warning: This summary was generated from the abstract only, as the full text was not available.

Identification

Journal: Water Resources Research
Year: 2025
Date: 2025-10-01
Authors: Xiaoran Yin, Longcang Shu, Zhe Wang, Long Zhou, Shuyao Niu, Huazhun Ren, Lei Zhu, Chengpeng Lu
DOI: 10.1029/2024wr039848

Research Groups

Not explicitly mentioned in the abstract.

Short Summary

This study evaluates advanced sampling methods, particularly feature space coverage sampling (FSCS), in hydrological machine learning applications to address data imbalance. It demonstrates that FSCS significantly enhances model accuracy, feature importance estimation, and interpretability for predicting forest cover types and saturated hydraulic conductivity, even with smaller training sets.

Objective

To evaluate the impact of advanced sampling methods, especially feature space coverage sampling (FSCS), on model performance in predicting forest cover types and saturated hydraulic conductivity (Ks).
To investigate the mechanism underlying the efficacy of FSCS.
To assess the impact of FSCS on model interpretability in hydrological machine learning applications.

Study Configuration

Spatial Scale: Roosevelt National Forest (forest cover types) and sites across the USA covered by the USKSAT database (soil properties).
Temporal Scale: Not explicitly mentioned in the abstract; data sets appear to represent static properties or aggregated states.

Methodology and Data

Models used: Random Forest (RF), LightGBM (LGB).
Data sources:
- A large multiclass forest cover type data set from Roosevelt National Forest (110,393 samples).
- A continuous-value data set of soil properties (saturated hydraulic conductivity, Ks) from the USKSAT database (18,729 samples).
Sampling methods evaluated: Feature space coverage sampling (FSCS), balanced sampling, conditioned Latin hypercube sampling, and simple random sampling.
Analysis: 1,720 models constructed and optimized; SHAP analysis for interpretability.
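
Illustrative sketch (not from the paper): the abstract does not describe how FSCS was implemented. A common formulation in the sampling literature runs k-means on the standardized features and keeps the sample nearest each cluster centre, so that the training set spreads across feature space instead of following the data density. The Python below is a minimal standard-library sketch under that assumption. The function names (`standardize`, `sq_dist`, `fscs_indices`) and the deterministic farthest-first initialization are hypothetical choices made for illustration; in practice a library routine such as scikit-learn's `KMeans` would usually do the clustering.

```python
import math


def standardize(rows):
    """Scale every feature to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fscs_indices(rows, k, n_iter=50):
    """Return k sample indices spread over feature space (assumed FSCS variant)."""
    if not 0 < k <= len(rows):
        raise ValueError("k must be between 1 and the number of samples")
    X = standardize(rows)
    n, d = len(X), len(X[0])

    # Deterministic farthest-first initialization (an illustrative choice):
    # start at the sample nearest the overall mean, then repeatedly add the
    # sample farthest from all centres chosen so far.
    first = min(range(n), key=lambda i: sq_dist(X[i], [0.0] * d))
    centroids = [X[first][:]]
    nearest = [sq_dist(x, centroids[0]) for x in X]
    while len(centroids) < k:
        far = max(range(n), key=lambda i: nearest[i])
        centroids.append(X[far][:])
        nearest = [min(nearest[i], sq_dist(X[i], centroids[-1]))
                   for i in range(n)]

    # Lloyd iterations: assign samples to the nearest centre, then recompute.
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(x, centroids[c]))
                  for x in X]
        updated = []
        for c in range(k):
            members = [X[i] for i in range(n) if labels[i] == c]
            if members:
                updated.append([sum(m[j] for m in members) / len(members)
                                for j in range(d)])
            else:
                updated.append(centroids[c])  # keep the centre of an empty cluster
        if updated == centroids:
            break
        centroids = updated

    # Keep the not-yet-chosen sample closest to each cluster centre.
    chosen = []
    for centre in centroids:
        best = min((i for i in range(n) if i not in chosen),
                   key=lambda i: sq_dist(X[i], centre))
        chosen.append(best)
    return sorted(chosen)
```

Calling `fscs_indices(features, k)` on a list of numeric feature rows returns `k` distinct row indices to use as the training set. Because one index comes from each cluster, sparsely populated regions of feature space are represented even when simple random sampling would most likely miss them.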

Main Results

FSCS significantly mitigated data imbalance, leading to enhanced model accuracy, feature importance estimation, and interpretability.
Balanced sampling, conditioned Latin hypercube sampling, and FSCS consistently outperformed simple random sampling across various training set sizes.
FSCS-trained models, even with smaller training sets and the simpler Random Forest algorithm, matched or surpassed models trained on larger data sets or built with the more complex LightGBM algorithm.
SHAP analysis revealed that FSCS improved the clarity of feature–target relationships, emphasized feature interactions, and enhanced overall model interpretability.

Contributions

Demonstrates the significant potential of advanced sampling methods, particularly FSCS, to effectively address data imbalance in hydrological machine learning applications.
Shows that FSCS can improve model accuracy, feature importance estimation, and interpretability, even with reduced training data requirements.
Provides insights into the mechanism by which FSCS enhances model interpretability through clearer feature–target relationships.
Shows that FSCS supplies superior prior information for model training, offering a pathway to more reliable, accurate, and interpretable machine learning models for hydrological applications.

Funding

Not explicitly mentioned in the abstract.

Citation

@article{Yin2025Addressing,
  author = {Yin, Xiaoran and Shu, Longcang and Wang, Zhe and Zhou, Long and Niu, Shuyao and Ren, Huazhun and Zhu, Lei and Lu, Chengpeng},
  title = {Addressing Data Imbalance in Hydrological Machine Learning: Impact of Advanced Sampling Methods on Performance and Interpretability},
  journal = {Water Resources Research},
  year = {2025},
  doi = {10.1029/2024wr039848},
  url = {https://doi.org/10.1029/2024wr039848}
}

Original Source: https://doi.org/10.1029/2024wr039848