Gülcan et al. (2026) Unveiling the performance of pre-processing approaches in machine learning based flood susceptibility mapping

Identification

Journal: Natural Hazards
Year: 2026
Date: 2026-02-23
Authors: Nihal Gülcan, Ömer Ekmekcioğlu
DOI: 10.1007/s11069-026-08034-8

Research Groups

Civil Engineering Department, Faculty of Engineering, Gebze Technical University, Kocaeli, Turkey
Disaster and Emergency Management Department, Disaster Management Institute, Istanbul Technical University, Istanbul, Turkey

Short Summary

This study systematically evaluates pre-processing techniques for machine learning-based flood susceptibility mapping in the San Joaquin River Basin using the XGBoost algorithm. It finds that robust scaling with a 70/30 train-test split, combined with Random Under Sampling (RUS) at a 10:1 non-flood-to-flood ratio, yields the most accurate flood susceptibility predictions.

Objective

To systematically evaluate the impact of diverse pre-processing schemes, including data scaling, train-test splitting ratios, and class imbalance handling strategies, on the performance of an eXtreme Gradient Boosting (XGBoost) model for flood susceptibility mapping.
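The three pre-processing steps named above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the authors' implementation: robust scaling is taken to mean (x - median) / IQR, the split is stratified by class, the scaler is fitted on the training portion only, and under-sampling is applied to the training set's non-flood points. All function names and the toy data are hypothetical.

```python
import random
import statistics


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_robust_scaler(column):
    """Return (median, IQR) of one feature column; IQR falls back to 1.0 if zero."""
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    return statistics.median(column), (iqr if iqr != 0 else 1.0)


def robust_scale(column, median, iqr):
    """Centre on the median and divide by the interquartile range."""
    return [(x - median) / iqr for x in column]


def random_under_sample(indices, labels, ratio=10, seed=0):
    """Keep every flood (1) index and at most `ratio` times as many non-flood (0)."""
    rng = random.Random(seed)
    flood = [i for i in indices if labels[i] == 1]
    non_flood = [i for i in indices if labels[i] == 0]
    keep = min(len(non_flood), ratio * len(flood))
    return sorted(flood + rng.sample(non_flood, keep))


# Synthetic toy data (assumption): 100 flood and 5,000 non-flood points.
rng = random.Random(42)
labels = [1] * 100 + [0] * 5000
elevation = [rng.gauss(50, 20) if y == 1 else rng.gauss(400, 300) for y in labels]

train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
median, iqr = fit_robust_scaler([elevation[i] for i in train_idx])
scaled = robust_scale(elevation, median, iqr)  # same statistics reused for test rows
balanced_train = random_under_sample(train_idx, labels, ratio=10)

n_flood = sum(labels[i] for i in balanced_train)
print(len(train_idx), len(test_idx), n_flood, len(balanced_train) - n_flood)
# prints: 3570 1530 70 700
```

Fitting the scaler on the training rows and reusing its statistics for the test rows avoids leaking test information into training. Whether the study ordered the steps this way is not stated in this summary.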

Study Configuration

Spatial Scale: San Joaquin River Basin, California, US. The basin covers approximately 40,000 square kilometers (4 million hectares). Geospatial layers were resampled to a uniform cell size of 30 meters x 30 meters.
Temporal Scale: Flood inventory records were taken from the NOAA Storm Events Database, which covers events since 1950.
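As a quick check on the spatial figures above, the following arithmetic converts the approximate basin area to hectares and estimates how many 30 m cells such an area contains. The cell count is only an order-of-magnitude estimate derived from the rounded area; it is not a figure reported in the source.

```python
AREA_KM2 = 40_000   # approximate basin area from the text
CELL_SIZE_M = 30    # raster cell edge length from the text

area_m2 = AREA_KM2 * 1_000_000        # 1 km^2 = 1,000,000 m^2
area_ha = area_m2 / 10_000            # 1 ha = 10,000 m^2
cell_area_m2 = CELL_SIZE_M ** 2       # 900 m^2 per cell
n_cells = area_m2 / cell_area_m2

print(f"{area_ha:,.0f} ha")                    # prints: 4,000,000 ha
print(f"{n_cells / 1e6:.1f} million cells")    # prints: 44.4 million cells
```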

Methodology and Data

Models used: eXtreme Gradient Boosting (XGBoost) for flood susceptibility prediction; SHapley Additive exPlanation (SHAP) for model interpretability and feature importance analysis.
Data sources:
- Flood Inventory: 636 historical flood event locations from the NOAA Storm Events Database.
- Flood Conditioning Factors (22 predictors): Digital Elevation Model (DEM) from Shuttle Radar Topography Mission (SRTM) (30 m resolution), slope, aspect, plan curvature, profile curvature, curvature, distance from rivers, curve number, Normalized Difference Vegetation Index (NDVI), Topographic Position Index (TPI), Terrain Ruggedness Index (TRI), Topographic Wetness Index (TWI), Sediment Transport Index (STI), Stream Power Index (SPI), heavy rain likelihood, Land Use/Land Cover (LULC), geology, distance from roads, road density, distance from faults, fault density, and river density. All raster layers were resampled to 30 m resolution.
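Two of the terrain indices listed above have widely used closed-form definitions: TWI = ln(a / tan β) and SPI = a · tan β, where a is the specific catchment area and β the local slope. The sketch below applies those common definitions to two hypothetical cells. The exact formulations, units, and the small slope floor used here to avoid division by zero are assumptions, since this summary does not give the paper's equations.

```python
import math


def twi(specific_catchment_area, slope_deg, min_slope_deg=0.1):
    """Topographic Wetness Index: ln(a / tan(beta)), with a slope floor (assumption)."""
    beta = math.radians(max(slope_deg, min_slope_deg))
    return math.log(specific_catchment_area / math.tan(beta))


def spi(specific_catchment_area, slope_deg):
    """Stream Power Index: a * tan(beta)."""
    return specific_catchment_area * math.tan(math.radians(slope_deg))


# Hypothetical cells with the same upslope area: flat valley floor vs. steep hillside.
flat = (twi(500.0, 1.0), spi(500.0, 1.0))
steep = (twi(500.0, 30.0), spi(500.0, 30.0))

print(f"flat:  TWI={flat[0]:.2f}  SPI={flat[1]:.1f}")
print(f"steep: TWI={steep[0]:.2f}  SPI={steep[1]:.1f}")
```

For equal contributing area, the flatter cell gets the higher TWI and the lower SPI, which is why the two indices tend to pull in opposite directions in susceptibility models.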

Main Results

Optimal Pre-processing Configuration:
- Stage 1 (Scaling and Train-Test Split): The XGBoost model achieved the highest performance in detecting flooded regions with robust scaling and a 70/30 train-test split (AUROC of 0.851, F1-score of 0.764 for the testing set, and 81.32% recall for the flood class).
- Stage 2 (Class Imbalance Handling): Utilizing the optimal Stage 1 configuration, Random Under Sampling (RUS) with a 10x class imbalance ratio (10 non-flood points for each flood point) yielded the most accurate outcomes for flood detection (AUROC of 0.835 for the testing set, and 77.55% recall for the flood class).
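For readers unfamiliar with the metrics quoted in the two stages, here is a minimal standard-library sketch of recall, F1-score, and AUROC for a binary flood/non-flood problem. The labels and scores are invented toy values, and the 0.5 decision threshold is an assumption. The summary does not say whether the reported F1-score is class-specific or averaged, so this computes it for the flood class only.

```python
def recall_and_f1(y_true, y_pred):
    """Recall and F1-score for the positive (flood = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


def auroc(y_true, scores):
    """Probability that a random flood point outscores a random non-flood point.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (assumption): 4 flood and 6 non-flood points with model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

recall, f1 = recall_and_f1(y_true, y_pred)
auc = auroc(y_true, scores)
print(f"recall={recall:.3f} f1={f1:.3f} auroc={auc:.3f}")
# prints: recall=0.750 f1=0.750 auroc=0.875
```

Recall depends on the chosen threshold while AUROC does not, so a configuration can rank first on one metric and not the other.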
Flood Susceptibility Mapping: The generated map classifies roughly 21% of the San Joaquin River Basin as having high (11.4%) or very high (9.7%) flood susceptibility, concentrated in the southeastern portions and urban areas.
Model Interpretability (SHAP Analysis):
- Most Influential Factors: Distance from faults was identified as the most significant factor, followed by distance from roads and road density.
- Positive Correlation with Flood Susceptibility: Geology, road density, TWI, and heavy rain likelihood.
- Inverse Correlation with Flood Susceptibility: Elevation, slope, SPI, and TRI.
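To show what a SHAP attribution represents, the sketch below computes exact Shapley values for a tiny hand-written scoring function by enumerating every feature coalition. It is a toy illustration only: the scorer, feature values, and single-baseline value function are hypothetical, and practical SHAP analyses of tree ensembles such as XGBoost use faster tree-specific algorithms rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating all coalitions (exponential in features)."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Coalition features take the instance's values; the rest stay at baseline.
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return model(x)

    phi = {}
    for feature in names:
        others = [k for k in names if k != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {feature}) - value(s))
        phi[feature] = total
    return phi


def toy_model(x):
    """Hypothetical susceptibility score, not the paper's XGBoost model."""
    score = 0.5 - 0.002 * x["elevation_m"] + 0.03 * x["twi"]
    if x["twi"] > 8 and x["elevation_m"] < 100:
        score += 0.2  # interaction: wet and low-lying
    return score


baseline = {"elevation_m": 300.0, "twi": 6.0, "slope_deg": 10.0}
instance = {"elevation_m": 50.0, "twi": 10.0, "slope_deg": 2.0}

phi = shapley_values(toy_model, instance, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the instance's score and the baseline's score (0.82 here): elevation receives about +0.60, TWI about +0.22, and slope exactly 0 because the toy scorer ignores it. The positive elevation contribution arises because the instance lies lower than the baseline, which is how an inverse relationship appears in a SHAP analysis.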

Contributions

Provides a systematic and comprehensive assessment of various data pre-processing techniques (scaling, train-test splitting, and class imbalance handling) within an integrated evaluation framework for flood susceptibility mapping.
Addresses the often-overlooked issue of class imbalance in flood hazard analysis by establishing a holistic scheme that considers significant differences between flooded and non-flooded instances.
Represents the first application of data-driven techniques for flood susceptibility mapping in the San Joaquin River Basin, California, US, offering valuable insights for regional decision-makers.
Utilizes the SHAP algorithm to enhance model interpretability, elucidating the positive or negative influences and causal relationships of individual conditioning factors on flood susceptibility predictions.

Funding

Istanbul Technical University, Türkiye, Scientific Research Projects (Project No: MYLB-2023-45089)
Turkish Academy of Sciences (The Young Scientists Award Programme—GEBİP)
Scientific and Technological Research Council of Türkiye (TÜBİTAK) (Open access funding)

Citation

@article{Gulcan2026Unveiling,
  author = {Gülcan, Nihal and Ekmekcioğlu, Ömer},
  title = {Unveiling the performance of pre-processing approaches in machine learning based flood susceptibility mapping},
  journal = {Natural Hazards},
  year = {2026},
  doi = {10.1007/s11069-026-08034-8},
  url = {https://doi.org/10.1007/s11069-026-08034-8}
}

Original Source: https://doi.org/10.1007/s11069-026-08034-8