Yin et al. (2025) Addressing Data Imbalance in Hydrological Machine Learning: Impact of Advanced Sampling Methods on Performance and Interpretability
⚠️ Warning: This summary was generated from the abstract only, as the full text was not available.
Identification
- Journal: Water Resources Research
- Year: 2025
- Date: 2025-10-01
- Authors: Xiaoran Yin, Longcang Shu, Zhe Wang, Long Zhou, Shuyao Niu, Huazhun Ren, Lei Zhu, Chengpeng Lu
- DOI: 10.1029/2024wr039848
Research Groups
Not explicitly mentioned in the abstract.
Short Summary
This study evaluates advanced sampling methods, particularly feature space coverage sampling (FSCS), in hydrological machine learning applications to address data imbalance. It demonstrates that FSCS significantly enhances model accuracy, feature importance estimation, and interpretability for predicting forest cover types and saturated hydraulic conductivity, even with smaller training sets.
Objective
- To evaluate the impact of advanced sampling methods, especially feature space coverage sampling (FSCS), on model performance in predicting forest cover types and saturated hydraulic conductivity (Ks).
- To investigate the mechanism underlying the efficacy of FSCS.
- To assess the impact of FSCS on model interpretability in hydrological machine learning applications.
Study Configuration
- Spatial Scale: Data from Roosevelt National Forest (for forest cover types) and the USKSAT database (for soil properties across the USA).
- Temporal Scale: Not explicitly mentioned in the abstract; data sets appear to represent static properties or aggregated states.
Methodology and Data
- Models used: Random Forest (RF), LightGBM (LGB).
- Data sources:
  - A large multiclass forest cover type data set from Roosevelt National Forest (110,393 samples).
  - A continuous-value data set of soil properties (saturated hydraulic conductivity, Ks) from the USKSAT database (18,729 samples).
- Sampling methods evaluated: Feature space coverage sampling (FSCS), balanced sampling, conditioned Latin hypercube sampling, and simple random sampling.
- Analysis: 1,720 models constructed and optimized; SHAP analysis for interpretability.
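The abstract does not say how FSCS was implemented. A common formulation clusters the standardized feature space with k-means and keeps the observation nearest each cluster centre, so that sparse regions of feature space still contribute training samples. The sketch below follows that formulation under stated assumptions: the function names, the toy two-feature data, and the plain-Python k-means are illustrative only and are not taken from the paper. In practice a library k-means would likely replace the hand-written loop.

```python
import math
import random


def standardize(rows):
    """Scale every feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fscs_indices(rows, n_samples, n_iter=50, seed=0):
    """Return sorted indices of n_samples rows spread across feature space.

    Assumed formulation (not confirmed by the source): k-means with
    k = n_samples, then the nearest observation to each centroid is kept.
    Requires n_samples <= len(rows).
    """
    rng = random.Random(seed)
    data = standardize(rows)
    centroids = [list(data[i]) for i in rng.sample(range(len(data)), n_samples)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in data:
            nearest = min(range(n_samples),
                          key=lambda k: sq_dist(point, centroids[k]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Selection step: keep the closest not-yet-chosen observation per centroid.
    taken = set()
    for centre in centroids:
        best = min((i for i in range(len(data)) if i not in taken),
                   key=lambda i: sq_dist(data[i], centre))
        taken.add(best)
    return sorted(taken)


# Hypothetical imbalanced toy data: 95 points in a dense region, 5 in a rare one.
demo_rng = random.Random(42)
dense = [[demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)] for _ in range(95)]
rare = [[demo_rng.gauss(10, 1), demo_rng.gauss(10, 1)] for _ in range(5)]
rows = dense + rare

picked = fscs_indices(rows, n_samples=10)
rare_picked = [i for i in picked if i >= len(dense)]
print("selected indices:", picked)
print("selected from the rare region:", len(rare_picked))
```

The selected indices would then define the training subset passed to a model such as Random Forest. Simple random sampling draws from the rare region only in proportion to its share of the data, whereas a coverage-based selection is driven by distance in feature space rather than by frequency.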
Main Results
- FSCS significantly mitigated data imbalance, leading to enhanced model accuracy, feature importance estimation, and interpretability.
- Balanced sampling, conditioned Latin hypercube sampling, and FSCS consistently outperformed simple random sampling across various training set sizes.
- FSCS-trained models, even when using smaller training sets and simpler Random Forest models, matched or surpassed the performance of models trained with larger data sets or more complex LightGBM models.
- SHAP analysis revealed that FSCS improved the clarity of feature–target relationships, emphasized feature interactions, and enhanced overall model interpretability.
Contributions
- Demonstrates the significant potential of advanced sampling methods, particularly FSCS, to effectively address data imbalance in hydrological machine learning applications.
- Shows that FSCS can improve model accuracy, feature importance estimation, and interpretability, even with reduced training data requirements.
- Provides insights into the mechanism by which FSCS enhances model interpretability through clearer feature–target relationships.
- Offers a pathway to develop more reliable, accurate, and interpretable machine learning models for hydrological applications by providing superior prior information for model training.
Funding
Not explicitly mentioned in the abstract.
Citation
@article{Yin2025Addressing,
author = {Yin, Xiaoran and Shu, Longcang and Wang, Zhe and Zhou, Long and Niu, Shuyao and Ren, Huazhun and Zhu, Lei and Lu, Chengpeng},
title = {Addressing Data Imbalance in Hydrological Machine Learning: Impact of Advanced Sampling Methods on Performance and Interpretability},
journal = {Water Resources Research},
year = {2025},
doi = {10.1029/2024wr039848},
url = {https://doi.org/10.1029/2024wr039848}
}
Original Source: https://doi.org/10.1029/2024wr039848