Wang et al. (2025) CitrusNet: A vision transformer-CNN approach for citrus detection from multi-source imagery with multi-scale feature integration
Identification
- Journal: Computers and Electronics in Agriculture
- Year: 2025
- Date: 2025-12-01
- Authors: Haochen Wang, Juan Shi, Hamed Karimian, Fei Wang, Faizan Javed, Bo Liu, Shengnan Shi, Ziwei Li, Tao Yang
- DOI: 10.1016/j.compag.2025.111260
Research Groups
- School of Marine Technology and Geomatics, Jiangsu Ocean University, Lianyungang, China
- Key Laboratory of Marine Meteorological Disaster Prevention and Mitigation of Jiangsu Province, Lianyungang, China
- School of Engineering, The University of Western Australia, Crawley, WA, Australia
- Centre for Water and Spatial Science, The University of Western Australia, Crawley, WA, Australia
- School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW, Australia
Short Summary
This paper introduces CitrusNet, a deep learning model that combines Vision Transformers and Convolutional Neural Networks with multi-scale feature integration to detect citrus fruits accurately across diverse multi-source imagery, outperforming state-of-the-art models.
Objective
- To develop a robust deep learning model capable of accurately detecting citrus fruits from multi-source imagery (UAV, mobile, AI-synthesized) despite variations in scale, resolution, and sensor types, for efficient crop monitoring and management.
Study Configuration
- Spatial Scale: Multi-scale; the model targets citrus fruits of varying sizes against complex backgrounds, using imagery from UAVs, mobile devices, and AI synthesis.
- Temporal Scale: Intended to support crop monitoring and yield forecasting, but no acquisition duration or frequency is specified.
Methodology and Data
- Models used: CitrusNet, a hybrid Vision Transformer-CNN model, incorporating:
  - An improved Residual Multi-Layer Perceptron (Res-MLP) based on the Swin Transformer.
  - A Convolutional Neural Network (CNN) with a plug-and-play Adaptive Feature Fusion Module (AFM); a sketch of such a fusion module follows this list.
  - A decoupled detection head with a Multi-Scale Depthwise Fusion Module (MSDM).
- Data sources: Self-created Citrus Multi-Source Detection Dataset (CMSDD), comprising UAV, mobile, and AI-synthesized imagery.
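
The AFM is described here only at a functional level (automatic adjustment of feature-channel weights for multi-scale fusion). Below is a minimal sketch of what such a plug-and-play channel-reweighting fusion module could look like, assuming a squeeze-and-excitation-style gate over concatenated local (CNN) and global (transformer) features. The class name `AdaptiveFusionModule`, the `reduction` parameter, and the layer layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a plug-and-play channel-reweighting fusion module,
# in the spirit of the AFM described above. Names and structure are
# illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class AdaptiveFusionModule(nn.Module):
    """Fuses two feature maps of equal spatial size by learning
    per-channel weights from their concatenation (SE-style gating)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        fused = 2 * channels
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global context
        self.gate = nn.Sequential(                   # excite: channel weights
            nn.Linear(fused, fused // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(fused // reduction, fused),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(fused, channels, kernel_size=1)

    def forward(self, x_local: torch.Tensor, x_global: torch.Tensor) -> torch.Tensor:
        x = torch.cat([x_local, x_global], dim=1)    # (B, 2C, H, W)
        b, c, _, _ = x.shape
        w = self.gate(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return self.project(x * w)                   # reweighted, projected back to C channels

# Example: fuse a CNN feature map with a same-resolution transformer feature map.
if __name__ == "__main__":
    afm = AdaptiveFusionModule(channels=256)
    local_feat = torch.randn(1, 256, 40, 40)
    global_feat = torch.randn(1, 256, 40, 40)
    print(afm(local_feat, global_feat).shape)        # torch.Size([1, 256, 40, 40])
```

Because the gate operates only on pooled channel statistics, a module of this kind adds little compute and can be dropped between existing stages, which is what "plug-and-play" usually implies.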
Main Results
- CitrusNet achieved high performance for citrus detection:
  - Precision: 91.20%
  - Recall: 87.16%
  - F1 score: 0.891 (consistent with the reported precision and recall; see the check after this list)
  - mAP50: 94.07%
  - mAP50:95: 84.25%
- The model demonstrated superior accuracy and robustness compared to state-of-the-art models, making it a promising solution for citrus crop monitoring.
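
As a quick consistency check (not from the paper), the reported F1 score follows from the precision and recall as their harmonic mean:

```python
# Consistency check: F1 is the harmonic mean of precision and recall.
precision, recall = 0.9120, 0.8716
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.891, matching the reported F1 score
```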
Contributions
- Introduction of CitrusNet, a novel hybrid Vision Transformer-CNN deep learning model designed for robust citrus detection in multi-source imagery.
- Enhancement of the Residual Multi-Layer Perceptron (Res-MLP) using Swin Transformer to improve perceptual capability across diverse scales by integrating local and global features.
- Proposal of a plug-and-play Adaptive Feature Fusion Module (AFM) within the CNN, which automatically adjusts feature channel weights to enhance multi-scale feature fusion.
- Development of a decoupled detection head incorporating a Multi-Scale Depthwise Fusion Module (MSDM) to improve the model's adaptability to complex backgrounds and targets at varying scales (a sketch of such a head follows this list).
- Creation of the Citrus Multi-Source Detection Dataset (CMSDD) to facilitate research and development in multi-source citrus detection.
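
Neither the decoupled head nor the MSDM is specified in detail in this summary. The sketch below is a hypothetical rendering that assumes the MSDM aggregates parallel depthwise convolutions at several kernel sizes and that the head separates classification and box-regression branches. All names (`MultiScaleDepthwiseFusion`, `DecoupledHead`, `num_classes`) and the exact layer layout are assumptions, not the paper's code.

```python
# Minimal sketch of a multi-scale depthwise fusion block feeding a
# decoupled (classification / regression) detection head. The structure
# is an assumption based on the description above, not the paper's code.
import torch
import torch.nn as nn

class MultiScaleDepthwiseFusion(nn.Module):
    """Aggregates depthwise convolutions at several kernel sizes,
    then mixes channels with a pointwise convolution."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = sum(branch(x) for branch in self.branches)  # multi-scale context
        return self.mix(out) + x                          # residual connection

class DecoupledHead(nn.Module):
    """Separate branches for class scores and box regression."""

    def __init__(self, channels: int, num_classes: int = 1):
        super().__init__()
        self.msdm = MultiScaleDepthwiseFusion(channels)
        self.cls_branch = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.reg_branch = nn.Conv2d(channels, 4, kernel_size=1)  # box offsets

    def forward(self, x: torch.Tensor):
        x = self.msdm(x)
        return self.cls_branch(x), self.reg_branch(x)

# Example: one detection level with 256 channels and a single "citrus" class.
if __name__ == "__main__":
    head = DecoupledHead(channels=256, num_classes=1)
    cls_out, reg_out = head(torch.randn(1, 256, 40, 40))
    print(cls_out.shape, reg_out.shape)  # (1, 1, 40, 40) and (1, 4, 40, 40)
```

Depthwise convolutions keep the per-branch cost low, which is why they are a common choice when several kernel sizes are run in parallel to capture targets at different scales.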
Funding
- Not specified in the provided paper text.
Citation
@article{Wang2025CitrusNet,
author = {Wang, Haochen and Shi, Juan and Karimian, Hamed and Wang, Fei and Javed, Faizan and Liu, Bo and Shi, Shengnan and Li, Ziwei and Yang, Tao},
title = {CitrusNet: A vision transformer-CNN approach for citrus detection from multi-source imagery with multi-scale feature integration},
journal = {Computers and Electronics in Agriculture},
year = {2025},
doi = {10.1016/j.compag.2025.111260},
url = {https://doi.org/10.1016/j.compag.2025.111260}
}
Original Source: https://doi.org/10.1016/j.compag.2025.111260