Scarpin et al. (2025) Peanut yield and grade prediction in Georgia, USA: integrating management, climate, and remote sensing data with explainable AI
Identification
- Journal: Computers and Electronics in Agriculture
- Year: 2025
- Date: 2025-12-04
- Authors: Gonzalo Joel Scarpin, Sara Beth Studstill, W. Scott Monfort, R. Scott Tubbs, Cristiane Pilon, Amrinder Jakhar, Anish Bhattarai, Amandeep Kaur Dhaliwal, Leonardo M. Bastos
- DOI: 10.1016/j.compag.2025.111270
Research Groups
- Department of Crop and Soil Sciences, University of Georgia, Athens, GA, USA
- Instituto Nacional de Tecnología Agropecuaria (INTA), Estación Experimental de Reconquista, Santa Fe, Argentina
- Bayer Crop Sciences, USA
- Department of Crop and Soil Sciences, University of Georgia, Tifton, GA, USA
Short Summary
This study integrates management, climate, soil, and remote sensing data with explainable AI to predict peanut yield and grade in Georgia, USA. A polynomial support vector machine (SVM_p) trained on management plus soil data gave the lowest yield error, and a Cubist rule-based model trained on management plus remote sensing data gave the lowest grade error. SHAP analysis identified irrigation and vegetation indices among the key drivers.
Objective
- Compare the performance of various machine learning (ML) models for predicting peanut yield and grade across diverse variable groups (GV).
- Select the most accurate ML and GV combination.
- Identify the most important factors driving outcomes using SHapley Additive exPlanations (SHAP).
- Assess the spatial generalization of the best models using a Leave-One-Site-Year-Out (LOSYO) cross-validation.
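The SHAP objective above rests on Shapley values, which split a prediction's deviation from a reference prediction among the input features. The sketch below computes exact Shapley values for a single sample by averaging marginal contributions over all feature orderings. It is a minimal illustration, not the study's procedure: `toy_yield_model`, its coefficients, the single `baseline` sample, and the feature values are all hypothetical, and SHAP implementations typically approximate these values against a background dataset rather than enumerating orderings.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    Averages each feature's marginal contribution over every feature
    ordering. Features not yet switched on keep their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for j in order:
            current[j] = x[j]          # switch feature j to the sample's value
            updated = predict(current)
            phi[j] += updated - previous
            previous = updated
    return [total / factorial(n) for total in phi]


def toy_yield_model(features):
    """Hypothetical yield model (kg/ha); NOT the fitted model from the study."""
    irrigated, planting_doy, latitude = features
    return (
        4000.0
        + 800.0 * irrigated
        - 10.0 * (planting_doy - 120)
        - 200.0 * (latitude - 31.0)
        + 20.0 * irrigated * (140 - planting_doy)  # irrigation x planting interaction
    )


# Hypothetical reference field and field of interest.
baseline = [0, 140, 32.0]   # dryland, late planting, northern location
sample = [1, 125, 31.0]     # irrigated, earlier planting, southern location

phi = shapley_values(toy_yield_model, sample, baseline)
for name, value in zip(["irrigated", "planting_doy", "latitude"], phi):
    print(f"{name}: {value:+.1f} kg/ha")
```

The contributions sum to the difference between the sample's prediction and the baseline prediction, and the interaction term is shared equally between the two features involved. Enumerating orderings scales factorially with the number of features, so this form is only practical for a handful of inputs.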
Study Configuration
- Spatial Scale: Field-level data from 51 different counties in the Georgia peanut belt, USA.
- Temporal Scale: Three growing seasons, from 2017 to 2019.
Methodology and Data
- Models used: 18 machine learning regression algorithms were compared, including: Extreme Gradient Boosting (XGBoost), Bayesian Additive Regression Trees (BART), Bagged Decision Trees (Bag-Tree), Random Forest (RF), RuleFit, Cubist Rule-Based Model (Cubist-rule), Decision Tree (CART), Geographically Weighted Random Forest Model (GRF), Linear Regression (LASSO), Partial Least Squares (PLS), Poisson Regression, k-Nearest Neighbors (kNN), Multivariate Adaptive Regression Splines (MARS), Bagged Multivariate Adaptive Regression Splines (Bag-MARS), Multilayer Perceptron (MLP), Bagged neural networks (bagMLP), Support Vector Machine polynomial (SVM_p), and Support Vector Machine linear (SVM_l).
- Data sources:
  - Management (M): Farmer surveys (2017-2019) from over 200 peanut farms, including planting/digging dates, growing season length, irrigation, variety, seed source, row pattern, and field location (latitude, longitude).
  - Weather (W): Daymet (1 km spatial resolution) daily data, including vapor pressure (Pa), minimum and maximum temperature (°C), snow water equivalent (kg m⁻²), solar radiation (W m⁻²), precipitation (mm), day length (s day⁻¹), and Growing Degree Days (GDD), summarized for three developmental stages.
  - Soil (S): POLARIS (30 m spatial resolution) database, providing 13 soil variables (e.g., total clay, sand, silt, organic matter percentages, pH, saturated water content (m³ m⁻³), saturated hydraulic conductivity (m s⁻¹)) at four depths (0-5 cm, 5-15 cm, 15-30 cm, 30-60 cm), along with river basin and soil series.
  - Remote Sensing (R): Sentinel-2 optical imagery and Digital Elevation Models (DEM) from Google Earth Engine, used to calculate median values of vegetation indices (NDVI, NDRE, GNDVI, EVI) and median elevation (m) for each field.
- Target variables: Peanut yield (kg ha⁻¹) and grade (%).
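The derived features named in the data sources above (per-field median vegetation indices and Growing Degree Days) can be sketched with their commonly used definitions. This is an illustration under stated assumptions rather than the authors' processing chain: the inputs are taken to be surface reflectance scaled to 0-1, the choice of red-edge band is not specified here, the EVI coefficients are the standard ones, and the GDD base temperature of 13.0 °C in the example is a placeholder, since the value used in the study is not given in this summary.

```python
from statistics import median


def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def ndre(nir, red_edge):
    """Normalized Difference Red Edge index."""
    return (nir - red_edge) / (nir + red_edge)


def gndvi(nir, green):
    """Green Normalized Difference Vegetation Index."""
    return (nir - green) / (nir + green)


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the standard coefficients."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def field_median_indices(pixels):
    """Median of each index over a field's pixels.

    `pixels` is a list of dicts with reflectance (0-1) under the keys
    blue, green, red, red_edge, nir (hypothetical key names).
    """
    return {
        "NDVI": median(ndvi(p["nir"], p["red"]) for p in pixels),
        "NDRE": median(ndre(p["nir"], p["red_edge"]) for p in pixels),
        "GNDVI": median(gndvi(p["nir"], p["green"]) for p in pixels),
        "EVI": median(evi(p["nir"], p["red"], p["blue"]) for p in pixels),
    }


def daily_gdd(tmin_c, tmax_c, base_c):
    """Growing degree days for one day: mean temperature above a base, floored at 0."""
    return max(0.0, (tmin_c + tmax_c) / 2.0 - base_c)


# Hypothetical pixels for one field (made-up reflectance values).
pixels = [
    {"blue": 0.04, "green": 0.08, "red": 0.06, "red_edge": 0.20, "nir": 0.45},
    {"blue": 0.05, "green": 0.09, "red": 0.10, "red_edge": 0.25, "nir": 0.50},
    {"blue": 0.05, "green": 0.10, "red": 0.12, "red_edge": 0.22, "nir": 0.40},
]
field_features = field_median_indices(pixels)

# Hypothetical daily (tmin, tmax) pairs in °C; base temperature is a placeholder.
daily_temps = [(20.0, 32.0), (18.0, 30.0), (5.0, 15.0)]
stage_gdd = sum(daily_gdd(tmin, tmax, base_c=13.0) for tmin, tmax in daily_temps)
print(field_features, stage_gdd)
```

Summing `daily_gdd` over the days in a developmental stage gives one stage-level weather feature, which mirrors the per-stage summaries described for the weather group.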
Main Results
- The average Root Mean Squared Error (RMSE) across all models and variable group combinations was 1045 kg ha⁻¹ for yield and 2.03 % for grade.
- For yield prediction, the SVM_p model with Management + Soil (M + S) data achieved the lowest RMSE of 816 kg ha⁻¹.
- For grade prediction, the Cubist-rule model with Management + Remote Sensing (M + R) data achieved the lowest RMSE of 1.52 %.
- Variable group combinations including management data (M + S, M + R, M + W) consistently outperformed single data sources or the full dataset (M + W + S + R), suggesting that adding more features did not always improve performance.
- Support Vector Machine and tree-based models generally performed better than other model types, while neural networks showed the poorest prediction accuracy.
- SHAP analysis revealed that irrigation, geographic location (latitude, longitude), and soil properties were key drivers for yield, with irrigated fields, early planting dates, and southern field locations positively impacting yield.
- For grade, vegetation indices (NDRE, GNDVI), irrigation, digging date, and elevation were the most important factors, with lower NDRE, higher GNDVI, irrigation, and later digging dates positively affecting grade.
- Leave-One-Site-Year-Out (LOSYO) cross-validation showed robust generalization for yield (R² = 0.52, RMSE = 922 kg ha⁻¹) but indicated performance heterogeneity for grade (R² = 0.28, RMSE = 1.94 %), with accurate predictions for high grades but overestimation for lower grades, possibly due to data imbalance.
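The LOSYO evaluation reported above can be sketched as a grouped cross-validation: each site-year is held out in turn, a model is fitted on the remaining records, and the held-out predictions are pooled to compute RMSE and R². The example below is a minimal sketch under assumptions: the site-year labels, irrigation flags, and yields are invented, the class-mean predictor stands in for the study's models, and R² is computed as 1 - SS_res / SS_tot, which may differ from the definition the authors used.

```python
from math import sqrt


def losyo_splits(site_years):
    """Yield (held_out_label, train_indices, test_indices) per unique site-year."""
    for held_out in sorted(set(site_years)):
        train = [i for i, g in enumerate(site_years) if g != held_out]
        test = [i for i, g in enumerate(site_years) if g == held_out]
        yield held_out, train, test


def rmse(observed, predicted):
    """Root mean squared error."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fit_class_means(rows):
    """Placeholder model: mean yield per irrigation class (not a model from the study)."""
    sums = {}
    for irrigated, y in rows:
        total, count = sums.get(irrigated, (0.0, 0))
        sums[irrigated] = (total + y, count + 1)
    return {k: total / count for k, (total, count) in sums.items()}


# Hypothetical records: site-year label, irrigation flag, yield in kg/ha.
site_years = ["site_A-2017", "site_A-2017", "site_B-2018",
              "site_B-2018", "site_C-2019", "site_C-2019"]
irrigated = [1, 0, 1, 0, 1, 0]
yields = [5200.0, 4100.0, 5600.0, 3900.0, 5000.0, 4300.0]

observed, predicted = [], []
for held_out, train_idx, test_idx in losyo_splits(site_years):
    model = fit_class_means([(irrigated[i], yields[i]) for i in train_idx])
    for i in test_idx:
        observed.append(yields[i])
        predicted.append(model[irrigated[i]])

print(f"RMSE = {rmse(observed, predicted):.1f} kg/ha, R2 = {r_squared(observed, predicted):.2f}")
```

Because no record from the held-out site-year is seen during fitting, the pooled metrics estimate performance on unseen locations and seasons rather than on new fields from site-years already in the training data.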
Contributions
- This is the most comprehensive study to date on peanut yield and grade prediction in Georgia, USA, integrating a diverse and extensive dataset including management practices, climate, soil, and remote sensing data.
- It provides a robust comparison of 18 different machine learning models and 15 variable group combinations to identify optimal predictive approaches.
- The study pioneers the application of the SHapley Additive exPlanations (SHAP) framework in peanut production, offering a transparent and interpretable understanding of complex variable interactions driving yield and quality, moving beyond "black-box" predictions.
- It successfully develops a model for accurate pre-harvest prediction of peanut grade, an area with limited prior research, which holds significant economic and logistical importance for the peanut supply chain.
- The use of real-world, farmer-surveyed data enhances the practical significance and representativeness of the findings for actual production environments.
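The count of 15 variable group combinations is consistent with taking every non-empty subset of the four data groups (M, W, S, R), since 2⁴ - 1 = 15. The sketch below enumerates them. Treating the 15 combinations as exactly these subsets, and assuming every one of the 18 models was fitted on every combination, are inferences from the numbers in this summary and not statements from the paper.

```python
from itertools import combinations

# The four data groups described in the summary.
GROUPS = ["M", "W", "S", "R"]


def variable_group_combinations(groups):
    """All non-empty subsets of the groups, smallest first, joined as 'A + B'."""
    combos = []
    for size in range(1, len(groups) + 1):
        for subset in combinations(groups, size):
            combos.append(" + ".join(subset))
    return combos


combos = variable_group_combinations(GROUPS)
n_models = 18                        # number of algorithms compared
n_fits = n_models * len(combos)      # assumes a full model x combination grid

for combo in combos:
    print(combo)
print(len(combos), "combinations;", n_fits, "model-combination pairs")
```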
Funding
- Instituto Nacional de Tecnología Agropecuaria (INTA) [P176905-BIRF] (human resources improvement program).
- Georgia Peanut Commission (financial support to Sara Beth Studstill).
- University of Georgia Department of Crop & Soil Sciences (financial support to Sara Beth Studstill).
- National Peanut Board (financial support to Sara Beth Studstill).
Citation
@article{Scarpin2025Peanut,
author = {Scarpin, Gonzalo Joel and Studstill, Sara Beth and Monfort, W. Scott and Tubbs, R. Scott and Pilon, Cristiane and Jakhar, Amrinder and Bhattarai, Anish and Dhaliwal, Amandeep Kaur and Bastos, Leonardo M.},
title = {Peanut yield and grade prediction in Georgia, USA: integrating management, climate, and remote sensing data with explainable AI},
journal = {Computers and Electronics in Agriculture},
year = {2025},
doi = {10.1016/j.compag.2025.111270},
url = {https://doi.org/10.1016/j.compag.2025.111270}
}
Original Source: https://doi.org/10.1016/j.compag.2025.111270