Kheimi et al. (2025) Multi-boosting and machine learning for soil substrate water content prediction

Identification

Journal: Soft Computing
Year: 2025
Date: 2025-12-17
Authors: Marwan Kheimi, Abdollah Ramezani‐Charmahineh, Mohammad Zounemat‐Kermani
DOI: 10.1007/s00500-025-10984-3

Research Groups

Department of Civil and Environmental Engineering, Faculty of Engineering—Rabigh Branch, King Abdulaziz University, Jeddah, Saudi Arabia
Department of Water Engineering, Shahrekord University, Shahrekord, Iran
Department of Civil Engineering, Shahid Bahonar University of Kerman, Kerman, Iran

Short Summary

This study evaluates seven machine learning algorithms and one conventional regression model (Multiple Linear Regression) for predicting Substrate Water Content (SWC) from volumetric water content, time since last irrigation, and porosity. The XGBoost ensemble model performed best, with the lowest Root Mean Square Error (0.009 m³·m⁻³) and the highest Nash-Sutcliffe coefficient (0.987).

Objective

To compare the efficiency of various machine learning (ML) models (Multi-Layer Perceptron (MLP), Extreme Learning Machine (ELM), Support Vector Regression (SVR), Random Forests (RF), Multi-Boosting (MB), AdaBoost, and XGBoost (XGB)) against the traditional Multiple Linear Regression (MLR) model for Substrate Water Content (SWC) prediction.
To evaluate the effectiveness of tree-based, vector-based, network-based, and ensemble ML models in capturing complex soil-water dynamics and simulating SWC based on collected data from diverse sources.

Study Configuration

Spatial Scale: Greenhouse experiments conducted in El Paso, Texas, USA (arid southwestern U.S.) and at Seoul National University Farm, Suwon, Korea (temperate East Asia).
Temporal Scale: Data collection periods ranged from 60 days to 42 weeks (294 days) across the source experiments.

Methodology and Data

Models used: Multiple Linear Regression (MLR), Multi-Layer Perceptron (MLP), Extreme Learning Machine (ELM), Support Vector Regression (SVR), Random Forests (RF), Multi-Boosting (MB), AdaBoost (AB), XGBoost (XGB).
Data sources: A dataset of 722 data points compiled from three independent experimental studies:
1. Liu et al. (2020): Investigated drought stress responses of Cornus alba seedlings.
2. Sun et al. (2015): Examined growth responses of interspecific cotton breeding lines.
3. An et al. (2021): Studied Cymbidium “Hoshino Shizuku” under automated irrigation.
- Input features: Volumetric Water Content (VWC), time elapsed from the last irrigation, and porosity.
- Output variable: Substrate Water Content (SWC).
- Data preparation: The data were randomly shuffled and split into training (70%), validation (15%), and testing (15%) subsets using a cross-validation technique.
- Feature selection: Mallows' Cp statistic was used to select the effective input parameters.
- Evaluation metrics: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Geometric Error (MAGE), Geometric Reliability Index (GRI), Pearson’s Correlation Coefficient (PCC), and Efficiency Factor of Nash-Sutcliffe (EFNS). Visual diagnostic tools included marginal-scatter plots, hybrid violin-box plots, filled error plots, and parallel coordinate diagrams. Bootstrap resampling with 1000 iterations was used for robustness assessment.
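
The data preparation and four of the evaluation metrics listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the fixed seed, and the rounding of subset sizes are assumptions, and EFNS is written in its standard Nash-Sutcliffe form. MAGE and GRI are omitted because this summary does not give their definitions.

```python
import numpy as np

def split_70_15_15(n_samples, seed=0):
    """Shuffle sample indices and split them 70/15/15 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(0.70 * n_samples)
    n_val = int(0.15 * n_samples)
    # Whatever remains after training and validation goes to testing.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def rmse(obs, pred):
    """Root Mean Square Error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def mae(obs, pred):
    """Mean Absolute Error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(np.abs(obs - pred)))

def pcc(obs, pred):
    """Pearson's Correlation Coefficient."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.corrcoef(obs, pred)[0, 1])

def efns(obs, pred):
    """Nash-Sutcliffe efficiency: 1 - SS_residual / SS_total about the observed mean."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Index split for a dataset the size of the one described above (722 points).
train_idx, val_idx, test_idx = split_70_15_15(722)
```

An EFNS of 1 indicates a perfect match, while a value of 0 means the model is no better than predicting the observed mean, which is why it complements RMSE when comparing models against the MLR baseline.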

Main Results

All seven machine learning models significantly outperformed the conventional Multiple Linear Regression (MLR) model in predicting SWC during the testing phase.
Ensemble machine learning models (XGBoost, Multi-Boosting, AdaBoost, Random Forests) demonstrated superior agreement with observational data compared to individual ML models (SVR, ELM, MLP) and MLR.
The XGBoost (XGB) ensemble model achieved the best overall performance, with the lowest RMSE of 0.009 m³·m⁻³, the highest Nash-Sutcliffe coefficient (EFNS) of 0.987, and a Pearson Correlation Coefficient (PCC) of 0.994.
The Multi-Boosting (MB) and XGBoost (XGB) models improved the RMSE and EFNS criteria by up to 67% and 73%, respectively, compared to the MLR model.
Bootstrap resampling analysis confirmed that ensemble methods (MB, AdaBoost, and XGB) exhibited smaller average differences between observed and predicted values and narrower confidence intervals, indicating higher stability and reduced bias.
The individual SVR model showed a negative bias and systematic underestimation of SWC values, particularly for extreme ranges.
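
The bootstrap check described above can be illustrated with a short sketch. It assumes a percentile confidence interval on the mean error and the sign convention error = predicted minus observed, so that a negative mean corresponds to underestimation; the summary does not state which interval type or convention the authors used, and the function name and defaults are hypothetical.

```python
import numpy as np

def bootstrap_mean_error(obs, pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean prediction error (assumed procedure)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    errors = pred - obs  # assumed convention: negative values mean underestimation
    rng = np.random.default_rng(seed)
    n = errors.size
    # Draw n_boot resamples of size n with replacement, one per row.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = errors[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(errors.mean()), float(lower), float(upper)

# Toy example: a model that underestimates every observation by 0.02 m³·m⁻³.
obs = np.linspace(0.10, 0.50, 50)
pred = obs - 0.02
mean_err, ci_low, ci_high = bootstrap_mean_error(obs, pred)
```

Under these assumptions, a mean error close to zero with a narrow interval would correspond to the stability reported for the boosting models, whereas an interval lying entirely below zero would indicate a systematic underestimation like that noted for SVR.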

Contributions

Provided a comprehensive comparison of various machine learning paradigms (regression-based, vector-based, network-based, and tree-based ensemble learners) for Substrate Water Content (SWC) prediction.
Demonstrated the superior predictive accuracy and robustness of ensemble machine learning models, especially XGBoost, in modeling complex, non-linear soil-water-plant interactions.
Utilized a diverse, multi-source dataset, enhancing the generalizability of the findings across different plant species, environmental conditions, and irrigation strategies.
Highlighted the importance of key input parameters (volumetric water content, time since last irrigation, and porosity) for accurate SWC prediction using data-driven approaches, which are often challenging for traditional empirical models.
Emphasized the potential of ML models for optimizing irrigation scheduling and improving water use efficiency in agriculture.

Funding

Institutional Fund Projects under grant No. (GPIP: 458-829-2024).
King Abdulaziz University, Deanship of Scientific Research (DSR), Jeddah, Saudi Arabia.

Citation

@article{Kheimi2025Multiboosting,
  author = {Kheimi, Marwan and Ramezani-Charmahineh, Abdollah and Zounemat-Kermani, Mohammad},
  title = {Multi-boosting and machine learning for soil substrate water content prediction},
  journal = {Soft Computing},
  year = {2025},
  doi = {10.1007/s00500-025-10984-3},
  url = {https://doi.org/10.1007/s00500-025-10984-3}
}

Original Source: https://doi.org/10.1007/s00500-025-10984-3