Page 2297
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
Explainable Deep Learning for Intelligent Plant Disease Detection
Dr. Pallavi Sharma¹, Ngah Hesly Kilofonyuy²
¹Assistant Professor, School of Engineering, Design and Automation- E, Department of ECE, GNA University, Phagwara, Punjab, India
²Undergraduate Student, School of Engineering, Design and Automation- E, Department of ECE, GNA University, Phagwara, Punjab, India
DOI: https://doi.org/10.51583/IJLTEMAS.2026.150500184
Received: 11 May 2026; Accepted: 16 May 2026; Published: 12 June 2026
ABSTRACT
Plant diseases cause an estimated 10–40% loss in global crop yields each year. This serious and growing threat endangers food security, rural livelihoods, and agricultural economies. Advances in deep learning, computer vision, and mobile technology have created a unique opportunity to recognize plant diseases automatically from leaf images. Published classification accuracies on benchmark datasets now exceed 97%, which is an important achievement, but high accuracy on a benchmark alone does not guarantee that these methods will work when deployed in the real world. All four stakeholder groups (farmers, agronomists, regulatory authorities, and extension agents) must be able to understand and interpret the output of automated plant disease recognition in a way that enhances human expertise rather than replacing it. In this chapter, we provide a compendium of deep learning architectures and Explainable Artificial Intelligence (XAI) methods for plant disease detection, covering convolutional networks, residual architectures, dense architectures, transformer networks, and hybrid models. We also systematically evaluate post-hoc and intrinsic explainability methods and assess their applicability across the imaging modalities used in agriculture, including RGB, multispectral, and hyperspectral imaging. The chapter characterizes the major benchmark datasets; discusses major deployment challenges, including class imbalance, domain shift, model size reduction, and human–AI trust calibration; and ends with potential research directions such as foundation models, causal explanations, federated learning, and continual learning to build resilience against an evolving pathogen landscape.
Keywords: Explainable Artificial Intelligence; Plant Disease Detection; LIME; Grad-CAM; SHAP; Federated Learning; Precision Agriculture; Hyperspectral Imaging.
INTRODUCTION
Agriculture is the backbone of human civilization. It directly provides the livelihood of approximately 570 million farm households around the globe and indirectly supports the food security of a global population projected to reach over 9.7 billion by 2050. Plant diseases, caused by a variety of pathogens including fungi, bacteria, viruses, and oomycetes as well as by abiotic stressors, are among the most widespread and economically damaging threats to agricultural production. According to the Food and Agriculture Organization of the United Nations, crop diseases, pests, and weeds account for the loss of 20% to 40% of worldwide agricultural crop production each year, an estimated $220 billion in annual economic losses. Smallholder farmers in low- and middle-income countries bear a disproportionate share of these consequences; a single failed harvest threatens their future food security or long-term financial viability [1].
Correctly identifying the pathogen is essential to effective disease management because the treatment depends on whether the pathogen is fungal, bacterial, or viral. Misidentifying a pathogen leads to misapplication of chemicals, waste of costly inputs, faster development of resistance, and delays in effective intervention. Traditional methods of diagnosing plant diseases are limited to the expert visual
assessment of a trained plant pathologist or agricultural extension officer. However, these experts are unevenly distributed: they are concentrated in research institutions and government agencies in high-income areas, with very limited availability in sub-Saharan Africa, South Asia, and other regions where smallholder farmers grow crops. In some parts of sub-Saharan Africa, a single extension officer serves more than 3,000 farm households, which makes expert assessment at that scale impossible.
The rise of inexpensive smartphones with high-quality cameras and good mobile connectivity has made it possible to put crop health diagnostic tools in the hands of all farmers. Two developments underlie this: the supporting infrastructure (high-resolution imaging and mobile connectivity) and advances in artificial intelligence (AI), specifically deep learning, that allow plant diseases to be identified from smartphone images. Deep learning is a subfield of machine learning, itself a branch of AI, that uses neural networks with multiple layers, trained end-to-end on very large datasets, to learn hierarchical features. In image recognition, deep learning has produced revolutionary results. Deep convolutional neural networks (CNNs) achieve classification accuracy above 97% on the PlantVillage benchmark, which indicates that deep learning systems can match expert human diagnosticians under controlled conditions [2]. The success of AI for crop protection has also attracted significant commercial and government investment in AI-based decision support systems, e.g. Plantix, Plant.id, PEAT, and various government-funded early-warning systems in Indonesia, Kenya, India, and the Netherlands.
One factor separates how well these systems work under ideal conditions from how well they work on the ground: interpretability. A farming operation cannot rely on a system that outputs a disease label without indicating what visual or spectral evidence supports the prediction. For example, an agronomist deciding whether to apply a systemic fungicide (an expensive intervention) wants assurance that the AI system has identified real signs of disease and is not responding to artifacts (i.e., things in the image that do not really exist). Similarly, smallholder farmers may need a way to link the location of an abnormality indicated by the AI with what they see on the leaf. A growing number of regulatory agencies in the EU and other parts of the world require that systems used to support important decisions be transparent about how their algorithms operate [8]. The field of Explainable Artificial Intelligence (XAI) has emerged in response to this gap, equipping predictive systems with explanations of their outputs.
This chapter provides a thorough and technically grounded account of deep learning techniques for identifying plant diseases, with a principal focus on the interpretability of such systems. Section 2 presents the evolution of deep learning architectures, from convolutional neural networks (CNNs) through residual networks and DenseNet to efficient architectures and vision transformers, and evaluates each architecture's accuracy and its compatibility with XAI (i.e., interpretability). Section 3 systematically surveys the categories of XAI technique applicable to deep learning models trained for plant disease identification: gradient-based saliency mapping; perturbation-based techniques; game theory-based attribution methods; attention mechanisms; and relevance propagation techniques based on backpropagation. Section 4 describes the primary benchmark datasets for deep learning in plant disease detection: laboratory-controlled datasets, field-collected datasets, and hyperspectral imaging datasets. Section 5 discusses the challenges of deploying a deep learning model trained to detect plant diseases, including class imbalance, domain shift, edge computing restrictions, and the interaction between humans and AI. Section 6 identifies emerging research directions, including foundation models, causal explainable artificial intelligence (XAI), federated learning, and continual learning in response to the changing plant pathogen landscape. Section 7 discusses our findings and presents research recommendations.
Deep Learning Architectures for Plant Disease Detection
Foundational Convolutional Neural Networks
The convolutional neural network (CNN) has been the preeminent neural network approach to classifying plant disease images. Three principal properties of the CNN (local receptive fields, weights shared across spatial positions, and hierarchical feature construction) make it well suited to visual recognition tasks.
A convolutional network uses small convolutional windows (usually 3x3 or 5x5 pixels) in its initial layers to extract local information such as edges, colour changes, and textures. Subsequent layers combine these basic components into higher-level feature representations such as venation, lesion borders, or spore patterns. Max-pooling introduces a limited amount of translation invariance while reducing the spatial dimensions of the feature maps and enlarging the effective receptive field of deeper layers. This dimensionality reduction is what allows the fully connected layers of the CNN to operate on a compact global feature vector. AlexNet won the ImageNet Large Scale Visual Recognition Challenge in 2012 and demonstrated how dramatically a deep CNN, combined with GPU-accelerated training, dropout to prevent overfitting, and data augmentation, can change the outcome of a visual recognition task. AlexNet and other shallow CNNs have struggled to differentiate closely related plant diseases, such as Septoria leaf spot versus early blight on tomato or angular leaf spot versus bacterial blight on soybean, because their feature hierarchies are not deep enough to make such distinctions. The VGG architecture family, built from stacks of 3x3 convolutional layers with 16 to 19 layers in total, increased representational capacity and was widely used as a baseline for plant disease modelling from 2015 to 2017, achieving approximately 91% to 93% accuracy with then-new techniques such as transfer learning from ImageNet [2].
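The convolution and max-pooling operations described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training-ready layer: the 6x6 patch, the Sobel-like edge kernel, and the function names are hypothetical choices made for this example, and "convolution" is implemented as cross-correlation, as is conventional in deep learning libraries.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Toy 6x6 patch: dark left half (0) and bright right half (1),
# standing in for the boundary between healthy tissue and a lesion.
patch = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hypothetical 3x3 vertical-edge filter (Sobel-like), chosen for illustration.
edge_kernel = [[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]]

response = conv2d_valid(patch, edge_kernel)  # 4x4 map, strong where the edge is
pooled = max_pool2d(response, size=2)        # 2x2 map after pooling
```

The response map is large only in the columns that straddle the dark-to-bright boundary, and pooling halves each spatial dimension while keeping the strongest response in every window.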
Residual Networks and Dense Connectivity
The vanishing gradient problem (backpropagated error signals attenuating at an exponential rate across many layers) had limited the practical depth of CNNs to roughly 20 layers. He et al. [3] addressed it with residual architectures, which add an identity shortcut connection to a typical convolutional block so that the error signal can be transmitted unattenuated to earlier blocks, making much deeper architectures trainable. With this added depth, ResNets have achieved around 94% to 96% accuracy on PlantVillage, a standard agricultural benchmark, and have become the most widely cited convolutional architecture in the agricultural deep learning literature. Variants known as "Wide ResNets", which increase channel width while changing depth only minimally, show improved feature representation without a large increase in depth.
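The effect of the identity shortcut on gradient flow can be illustrated with a deliberately simplified scalar model. The sketch below assumes every layer has the same small local derivative (0.05, a value chosen only for illustration); real networks have layer- and input-dependent Jacobians, so this shows the mechanism rather than a measurement.

```python
def gradient_through_stack(local_derivative, num_layers, residual):
    """Magnitude of d(output)/d(input) for a stack of identical scalar layers.

    Plain layer:     y = f(x)      so dy/dx = f'(x)
    Residual layer:  y = x + f(x)  so dy/dx = 1 + f'(x)
    """
    per_layer = (1.0 + local_derivative) if residual else local_derivative
    return per_layer ** num_layers


# Assumed toy setting: 20 layers, each with local derivative 0.05.
plain_gradient = gradient_through_stack(0.05, 20, residual=False)
residual_gradient = gradient_through_stack(0.05, 20, residual=True)

print(f"plain stack:    {plain_gradient:.3e}")
print(f"residual stack: {residual_gradient:.3f}")
```

In the plain stack the gradient shrinks by a factor of 0.05 per layer and is effectively zero after 20 layers, whereas the `1 +` term contributed by the shortcut keeps the gradient of the residual stack on the order of one.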
Huang et al. [4] developed DenseNet, which extends the residual idea by connecting every layer in a dense block to all subsequent layers, using feature map concatenation rather than addition. This gives a dense flow of information through the network, promotes feature reuse, reduces the number of redundant parameters, and provides direct gradient pathways that benefit backpropagation-based methods for explaining what the model has learned. DenseNet-121 has performed well both on the controlled PlantVillage dataset and on PlantDoc, a more difficult real-world dataset, because it integrates low-level features such as lesion colour and chlorosis patterns with high-level features such as lesion shape and spatial distribution into one unified feature representation. Furthermore, the extensive feature concatenation in DenseNet makes attribution maps derived from LRP (layer-wise relevance propagation) and similar methods more comprehensive and pathologically interpretable than those derived from earlier networks.
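Because dense blocks concatenate rather than add, the number of feature channels grows linearly within each block. The sketch below traces that growth. The configuration passed in (64 initial channels, growth rate 32, blocks of 6, 12, 24, and 16 layers, transition compression 0.5) is the standard DenseNet-121 setting from the original DenseNet paper and is an assumption brought in for illustration; it is not stated in this chapter.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Channels after a dense block: every layer concatenates `growth_rate` new maps."""
    return in_channels + growth_rate * num_layers


def densenet_channel_trace(init_channels, growth_rate, block_layers, compression=0.5):
    """Return the channel count at the output of each dense block."""
    channels = init_channels
    trace = []
    for index, num_layers in enumerate(block_layers):
        channels = dense_block_channels(channels, growth_rate, num_layers)
        trace.append(channels)
        # A transition layer between blocks compresses the channel count.
        if index < len(block_layers) - 1:
            channels = int(channels * compression)
    return trace


# Assumed DenseNet-121 configuration (see lead-in).
trace = densenet_channel_trace(64, 32, [6, 12, 24, 16])
print(trace)
```

With addition, as in a residual block, the channel count would stay fixed; with concatenation, each later layer sees every earlier feature map in its block, which is the feature reuse the text refers to.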
Transfer Learning, Domain Adaptation, and Data Augmentation
The ImageNet dataset, with 1.28 million images in 1,000 object categories, provides CNNs with a visual dictionary of hierarchical primitives (e.g., oriented edges and colour blobs) that transfers to agricultural leaf images through fine-tuning. When a pre-trained CNN backbone is fine-tuned on a plant disease dataset, the features acquired from ImageNet allow the model to generalise from orders of magnitude less labelled plant disease data than training from scratch would require. This is an immense advantage in agricultural applications, where producing enough expert-annotated images to train a CNN from random initialisation is prohibitively expensive given the cost of each expert-labelled image. Domain adaptation techniques address the gap between the training domain, typically the PlantVillage dataset, and deployment conditions, which differ in cameras and image characteristics such as lighting and in the intra-class variation introduced by different stages of leaf development. Adversarial domain adaptation (ADA) trains a domain discriminator alongside the supervised disease classifier and couples the two through gradient reversal, so that the learned features discriminate among disease classes while remaining invariant to the domain. Overall, ADA supports the development of CNN models that can be trained with minimal datasets, adapted to variable conditions, and implemented in a computationally efficient way. In multi-source domain generalization, data from different locations and conditions provide a broader training distribution that supports better generalization. Augmentation methods expand the apparent range of the training data and improve robustness to geometric and photometric changes. These methods include random cropping, horizontal and vertical flips, colour jitter, random erasing, CutMix, and MixUp.
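As one concrete instance of the augmentation methods listed above, MixUp blends two training examples and their labels with the same mixing coefficient. The following is a minimal sketch on flattened toy "images"; the function names, the two-class one-hot labels, and the Beta parameter 0.4 are illustrative assumptions rather than settings taken from any study cited here.

```python
import random


def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two examples: lam * a + (1 - lam) * b, applied to pixels and labels alike."""
    mixed_image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_image, mixed_label


def sample_lambda(alpha, rng):
    """Draw the mixing coefficient from a symmetric Beta(alpha, alpha) distribution."""
    return rng.betavariate(alpha, alpha)


# Toy flattened images and one-hot labels for two hypothetical disease classes.
healthy_pixels, healthy_label = [1.0, 0.0], [1.0, 0.0]
blight_pixels, blight_label = [0.0, 1.0], [0.0, 1.0]

mixed_pixels, mixed_label = mixup(
    healthy_pixels, healthy_label, blight_pixels, blight_label, lam=0.25
)

# In training, lam would be resampled for every pair, for example:
lam = sample_lambda(0.4, random.Random(0))
```

The mixed label is soft (here 25% of one class and 75% of the other), which is what discourages the classifier from making overconfident predictions between training examples.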
EfficientNet: Compound Scaling
Tan and Le introduced EfficientNet [5] in 2019 to address the interdependence of a network's depth, width, and input resolution, which had previously been scaled independently. Scaling one of these dimensions while keeping the others constant yields quickly diminishing returns. A neural architecture search under resource constraints produced the baseline network (B0), and the EfficientNet family from B1 to B7 was obtained by applying a compound scaling coefficient that scales all three dimensions at once in a balanced way. EfficientNet-B4 has approximately 19 million parameters, achieves roughly 97% classification accuracy on PlantVillage, and requires fewer floating-point operations than ResNet-101 or DenseNet-201, making it an excellent candidate for deployment on resource-constrained mobile devices without sacrificing classification performance. The modular MBConv block structure allows Grad-CAM and SHAP attributions to be computed with standard tools.
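The compound scaling rule can be written down directly. The sketch below assumes the base coefficients reported in the EfficientNet paper (depth 1.2, width 1.1, resolution 1.15), which are not given in this chapter; the released B1 to B7 models use rounded settings, so this reproduces the rule and not their exact configurations.

```python
# Assumed base coefficients from the EfficientNet paper (see lead-in).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi):
    """Multipliers applied to the B0 baseline for compound coefficient `phi`."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    scale = compound_scale(phi)
    print(
        f"phi={phi}: depth x{scale['depth']:.2f}, width x{scale['width']:.2f}, "
        f"resolution x{scale['resolution']:.2f}, FLOPs x{flops_multiplier(phi):.2f}"
    )
```

The coefficients are constrained so that `ALPHA * BETA**2 * GAMMA**2` is close to 2, which means each unit increase in `phi` roughly doubles the computational cost while growing all three dimensions together.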
Vision Transformers and Self-Attention Mechanisms
The Vision Transformer (ViT) of Dosovitskiy et al. [6] applies scaled dot-product self-attention, the core operation of Transformers in natural language processing, to sequences of non-overlapping image patches, typically 16 x 16 or 32 x 32 pixels in size. Each patch is linearly projected into an embedding space and combined with a positional encoding. A learned classification token collects information from all image patches through stacked layers of multi-head attention and feed-forward networks. Models pre-trained on large datasets such as ImageNet-21K (14 million images) and fine-tuned on plant disease datasets reach accuracies of 95–97% on PlantVillage. They also offer a different form of explainability from feature-importance methods: the attention weights indicate how much influence each patch has on the overall classification decision.
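The patch tokenisation and attention computation described above can be sketched without any deep learning library. This is a single-head illustration with no learned projections (queries, keys, and values are all the raw token vectors), and the 224-pixel input size and three toy tokens are assumptions made for the example.

```python
import math


def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) where weights[i][j] is how much token i attends to token j."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


# Assumed 224x224 input, a common choice that is not specified in the text.
patches_16 = num_patches(224, 16)
patches_32 = num_patches(224, 32)

# Three toy 2-dimensional patch embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to one, and it is these rows, taken from the trained model, that attention-based explanations visualise.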
Swin Transformer [7], an extension of ViT, uses hierarchical windowed attention with shifted windows, which allows efficient modelling of multi-scale features. This is particularly important for plant disease applications, where pathological evidence can be found at scales ranging from individual stomata to the pattern of symptoms exhibited on a single leaf. The Swin-Tiny and Swin-Small versions of the Swin
Transformer perform very well on fine-grained agricultural classification tasks at a computational cost competitive with that of ResNet-50.
Figure 1: Evolution of Deep Learning Architectures for Plant Disease Detection
Hybrid architectures that combine CNN-based feature extraction with transformer-based self-attention are being studied. For example, a ResNet or EfficientNet backbone can provide spatially localized features, which reduces the ViT's need for very large pretraining datasets. A transformer-type decoder can then model long-range contextual dependencies between spatially separated groups of lesions, for example to detect the classic distribution pattern of interveinal chlorosis caused by magnesium deficiency across an entire leaflet.
Figure 1 presents the data as a bar chart with a dashed trend line; colour coding groups the architectures by family.
Table 1: Summary of the various deep learning architectures that have been used for the classification of plant diseases.

| Architecture | Year | Depth (layers) | Plant Acc. (%) | XAI Compatibility |
|---|---|---|---|---|
| AlexNet | 2012 | 8 | 87.3 | Moderate: limited feature depth |
| VGG-16 | 2014 | 16 | 91.6 | High: uniform block structure |
| ResNet-50 | 2015 | 50 | 94.2 | High: skip connections aid gradients |
| InceptionV3 | 2016 | 48 | 93.7 | Moderate: multi-branch complexity |
| DenseNet-121 | 2017 | 121 | 95.8 | High: dense feature reuse |
| EfficientNet-B4 | 2019 | ~50 | 97.1 | Moderate: compound scaling |
| Vision Transformer (ViT) | 2020 | 12–24 | 96.4 | Emerging: intrinsic attention |
| Swin Transformer | 2021 | 12–24 | 97.3 | High: hierarchical attention |
Explainable Artificial Intelligence Methods
The Imperative for Explainability in Agricultural AI
Scholars and regulators are calling for algorithmic transparency in response to the deployment of opaque machine learning systems in areas such as medical diagnosis, credit assessment, and criminal justice. Agriculture faces similar issues, and they are just as serious. Misidentifying a disease can cause pesticides to be applied incorrectly, wasting valuable resources, promoting pesticide resistance, and delaying the ability to treat an issue
in a timely manner [8]. In organic or integrated pest management systems, a misidentified disease can also result in irreversible damage to a farmer's field within a single growing season. Farmers are doubly disadvantaged when they follow AI-generated recommendations without access to the supporting evidence: they suffer poor yields and lose confidence in technologies that could help them.
Explanation mechanisms for artificial intelligence models serve several audiences with different needs. Practitioners in agriculture and disease diagnostics need faithfulness (the explanation must show how the model actually made its decision) and precision (the highlighted features must correspond to diagnosable symptoms rather than random background items). Data scientists and model developers need diagnostic utility (the ability to inspect feature attributions on the test set to identify dataset biases, spurious correlations, or failures to generalise). Regulatory agencies need auditability (a traceable record of which features influenced a prediction, to support subsequent review of that process). Smallholder farmers with no formal education need visual accessibility (heat map overlays that relate directly to how they see the disease on a leaf, without requiring a technical background).
XAI methods can be classified along two main axes. The first is their relation to the training timeline: post-hoc methods are applied after training, to the model in its "fixed" form, whereas intrinsic methods build explainability into the design of the model. The second is their relation to the model architecture: model-specific methods inspect the internal representations of the model, such as gradients or attention weights, whereas model-agnostic methods treat the classifier as a black box and probe it by altering the input. In agricultural and horticultural disease studies, model-specific methods are the most commonly used, because they provide accurate results together with visual representations that help in evaluating those results.
Gradient-weighted Class Activation Mapping (Grad-CAM) and Variants
Grad-CAM was introduced by Selvaraju and colleagues in 2017. The method generates a spatial heatmap showing which regions of the input image influence the class predicted by the classifier. It first takes the gradient of the class score with respect to the activation maps of the final convolutional layer [9]. These gradients are global-average-pooled across the spatial dimensions to give one neuron importance weight per channel. The weights are then used to form a weighted linear combination of the final-layer activation maps. The resulting map is passed through a ReLU nonlinearity (retaining only positive contributions) and upsampled by bilinear interpolation, producing an overlay on the input in which the intensity of each region reflects its contribution to the classification decision [23]. The method builds on earlier work on Class Activation Mapping, which highlighted discriminative localisation in networks with global average pooling layers.
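The Grad-CAM steps above reduce to a small amount of arithmetic once the activations and gradients of the final convolutional layer are available. The sketch below takes both as plain nested lists; in practice they would be captured from a trained network with framework hooks, and the final bilinear upsampling to the input resolution is omitted. The toy values are hypothetical.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations, gradients: nested lists shaped [channels][height][width], holding the
    final conv layer's activation maps and the gradient of the class score with
    respect to them. Returns (heatmap [height][width], per-channel weights).
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Step 1: global-average-pool each channel's gradient to get its importance weight.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    # Step 2: weighted sum of activation maps, followed by ReLU.
    heatmap = [
        [
            max(0.0, sum(w * act[r][c] for w, act in zip(weights, activations)))
            for c in range(width)
        ]
        for r in range(height)
    ]
    return heatmap, weights


# Toy layer with two 2x2 channels. Channel 0 fires top-left and supports the class
# (positive gradient); channel 1 fires bottom-right and opposes it (negative gradient).
toy_activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
toy_gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]

heatmap, channel_weights = grad_cam(toy_activations, toy_gradients)
```

Only the top-left cell survives in the heatmap: the bottom-right activation is strong, but it counts against the class, so the ReLU removes it.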
When the model is accurately trained, Grad-CAM can localise lesions on plants. For example, it can correctly highlight necrotic spots, chlorotic halos, sporulating pustules, and water-soaked areas, provided those traits reflect genuine biological differences between classes. Conversely, when Grad-CAM does not highlight the appropriate lesions, the deviation may indicate a failure in model training. For example, if Grad-CAM highlights leaf margins rather than laminar lesions as the site of disease, the model is likely relying on a spurious correlation between disease type and leaf morphology that arose from how the dataset was collected. To overcome some limitations of the original method, Grad-CAM++ incorporates higher-order gradients to produce explanations that are smoother and more precisely localised on the lesions; this is especially effective when the image contains multiple lesions or lesions of varying size. Score-CAM dispenses with gradients altogether: it determines the importance of each channel from the change in the model's output when the input is masked by the corresponding activation map, which gives higher-quality results under gradient saturation.
Local Interpretable Model-agnostic Explanations (LIME)
To approximate the behaviour of a complex model locally around an individual prediction, LIME [11] fits an interpretable surrogate model in a neighbourhood of the input. For image classification, LIME partitions the input image into contiguous superpixels using a segmentation algorithm such as SLIC (Simple Linear Iterative Clustering) [10] or Felzenszwalb's graph-based method. It then generates a neighbourhood of perturbed images by randomly masking subsets of the superpixels, replacing them with either a background colour or the mean pixel value of the image, and queries the target model for the predicted class probabilities of every perturbed sample. The perturbed samples are weighted by their proximity to the original image, a sparse linear model (typically lasso regression) is fitted to the weighted samples, and the coefficients of the linear model represent the contribution of each superpixel to the predicted class. Because LIME is model-agnostic, it is exceptionally general: it applies equally well to CNN classifiers, random forests, disease detection services accessed through commercial APIs, and ensemble models that do not expose gradients. When LIME is applied to a correctly functioning model of foliar plant disease, the superpixel attributions it produces are consistent with the visible lesion boundaries, and so serve as a validation mechanism for users who are not experts in plant pathology. The main limitations of the method are instability due to stochastic sampling, which causes explanations for the same input to differ from one run to the next, and sensitivity to superpixel resolution: varying the superpixel granularity trades spatial accuracy against attribution reliability. Recent extensions such as BayLIME use Bayesian inference to quantify the uncertainty of an explanation and to establish confidence intervals for each superpixel's attribution, which adds significant value in critical, large-scale agricultural decision-making.
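The LIME procedure can be sketched end to end on binary superpixel masks. This is a simplified illustration under several stated assumptions: the black-box model is a hypothetical function of the mask rather than an image classifier, the surrogate is fitted by plain weighted least squares instead of lasso, and the exponential proximity kernel and its width are illustrative choices rather than the reference implementation's defaults.

```python
import math
import random


def solve_linear_system(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting (intended for small systems)."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def lime_explain(predict, num_superpixels, num_samples=200, kernel_width=0.75, seed=0):
    """Return one surrogate coefficient per superpixel for the black-box `predict`."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(num_samples):
        # 1 keeps a superpixel, 0 masks it out.
        mask = [rng.randint(0, 1) for _ in range(num_superpixels)]
        # Distance from the unperturbed image: fraction of superpixels hidden.
        distance = 1.0 - sum(mask) / num_superpixels
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
        rows.append([1.0] + [float(m) for m in mask])  # leading 1.0 is the intercept
        targets.append(predict(mask))
    dim = num_superpixels + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y
    xtwx = [
        [sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(dim)]
        for i in range(dim)
    ]
    xtwy = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets)) for i in range(dim)]
    return solve_linear_system(xtwx, xtwy)[1:]  # drop the intercept


def toy_disease_model(mask):
    """Hypothetical black box: superpixel 2 holds the lesion, superpixel 0 a faint halo."""
    return 0.1 + 0.6 * mask[2] + 0.2 * mask[0]


coefficients = lime_explain(toy_disease_model, num_superpixels=4)
```

Because the toy model is itself linear in the mask, the surrogate recovers its coefficients almost exactly; for a real classifier the fit is only a local approximation, and changing `seed` changes the sampled neighbourhood, which is the source of the instability noted in the text.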
SHapley Additive exPlanations (SHAP)
Feature contribution metrics that are derived from Cooperative Game Theory, are known as SHAP, which
establishes how particular features contribute towards a prediction by computing each feature's average marginal
contribution across every possible feature coalition, known as its Shapley value. The Shapley value has a number
of mathematically provable properties that make it a good measure of feature contribution: efficiency, where the
attributions sum to the difference between the prediction and the baseline expectation; symmetry, where features
that contribute identically receive identical attributions; dummy, where irrelevant features receive no attribution;
and linearity, where attributions add across combined models [12]. Because they are derived from these axioms,
SHAP attributions are theoretically more rigorous than those of gradient-based methods, although exact
computation has exponential time complexity in the number of features. DeepSHAP uses the backpropagation
rules of DeepLIFT to approximate Shapley values efficiently in deep learning models without enumerating all
possible coalitions. GradientSHAP combines gradient computation with averaging over samples drawn from a
reference background dataset, which allows it to estimate expected Shapley values while accounting for
uncertainty in the input. In plant disease classification with hyperspectral or multispectral imaging, applying
KernelSHAP to the spectral feature vector reveals which wavelength bands carry the most diagnostic information
for each pathogen class. Most notably, SHAP analyses find that the most important wavelengths for identifying
fungal and bacterial lesions lie in the near-infrared reflectance region of 700–900 nm, because this range is
sensitive to cell water content and to damage associated with disruption of mesophyll structure; the red-edge
range of 680–730 nm, which is associated with chlorophyll degradation, is also found to be diagnostically
informative for viral and nutritional diseases. This information can be used to design cost-efficient multispectral
sensors that target specific crop and pathogen combinations.
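The coalition-based definition and the efficiency and dummy properties described above can be illustrated with a minimal sketch. It enumerates every coalition exactly, which is only feasible for a handful of features; the three-band toy model, the reflectance values and the baseline are hypothetical and are not taken from any cited study.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    phi = np.zeros(n_features)
    players = range(n_features)
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight |S|! (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model over three spectral bands (illustrative numbers only).
x = np.array([0.8, 0.3, 0.5])          # observed reflectances
baseline = np.array([0.5, 0.5, 0.5])   # reference (background) reflectances


def model(v):
    # Bands 0 and 1 interact; band 2 is ignored by the model (a dummy feature).
    return 2.0 * v[0] + v[0] * v[1]


def value_fn(coalition):
    # Features outside the coalition are replaced by their baseline value.
    mixed = baseline.copy()
    for j in coalition:
        mixed[j] = x[j]
    return model(mixed)


phi = exact_shapley(value_fn, 3)
print("Shapley values:", phi)
print("Sum of attributions:", phi.sum())
print("Prediction minus baseline:", model(x) - model(baseline))
```

In this sketch the attributions sum to the difference between the prediction and the baseline output (efficiency), and the unused third band receives zero attribution (dummy). Approximations such as KernelSHAP exist precisely because this enumeration grows as 2^n with the number of features.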
Attention-based Intrinsic Explainability
Vision Transformer (ViT) models provide built-in explainability through their multi-head self-attention matrices,
which record how much attention each image patch pays to every other patch when its contextual representation
is computed. Attention Rollout, proposed by Abnar and Zuidema [13], generates a global attention map by
recursively multiplying the attention weight matrices of all transformer layers while accounting for the residual
connections. Overlaying the global attention map as a saliency map on the 2D spatial layout of the original image
helps users understand how important each patch is, relative to the others, for the final prediction on the whole
image. In plant disease ViT models, attention rollout maps have been used to provide high-quality spatial
localisation of lesions caused by pathogen infection, particularly where symptom patterns are diffuse or
distributed, as in downy mildew or mosaic virus disease. Such patterns require long-range contextual reasoning
over patches across the entire leaf surface, so that infected patches can be discriminated on the basis of their
long-range relationships. Beyond ViT architectures, attention mechanisms embedded in CNNs, such as the
Convolutional Block Attention Module (CBAM) [14] and the channel attention of Squeeze-and-Excitation (SE)
networks, also give CNN models some degree of intrinsic explainability without requiring a transformer
architecture. For example, the spatial attention maps learned by CBAM-augmented plant disease CNNs suppress
non-informative background areas (e.g. soil, sky, adjacent healthy tissue) while enhancing lesion-related features.
This simultaneous improvement in classification accuracy and explanation quality, a positive-sum rather than
trade-off relationship, makes attention-augmented CNNs particularly attractive for agricultural deployment
contexts where both performance and transparency are critical requirements.
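The rollout computation described in this section can be sketched in a few lines of NumPy. The sketch assumes that the per-layer attention tensors have already been extracted from a ViT; here they are replaced by random row-normalised matrices, and the layer, head and patch counts are hypothetical.

```python
import numpy as np


def attention_rollout(attentions):
    """Compose per-layer attention into one global map.

    attentions: list with one array per layer, each of shape
    (heads, tokens, tokens) and with rows that sum to 1.
    """
    n_tokens = attentions[0].shape[-1]
    rollout = np.eye(n_tokens)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)               # average over heads
        a = 0.5 * a + 0.5 * np.eye(n_tokens)      # account for the residual connection
        a = a / a.sum(axis=-1, keepdims=True)     # keep rows normalised
        rollout = a @ rollout                     # compose with earlier layers
    return rollout


# Stand-in attention weights (random); a real ViT would supply these.
rng = np.random.default_rng(0)
layers, heads, grid = 4, 3, 4
tokens = 1 + grid * grid                          # [CLS] token plus 4x4 patches
attn = []
for _ in range(layers):
    w = np.exp(rng.normal(size=(heads, tokens, tokens)))
    attn.append(w / w.sum(axis=-1, keepdims=True))

rollout = attention_rollout(attn)
cls_to_patches = rollout[0, 1:]                   # attention reaching [CLS] from each patch
heatmap = cls_to_patches.reshape(grid, grid)
heatmap = heatmap / heatmap.max()                 # scale to [0, 1] for overlay
print(np.round(heatmap, 3))
```

The resulting patch-level heatmap would normally be upsampled to the input resolution before being overlaid on the leaf image.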
Layer-wise Relevance Propagation (LRP)
LRP [15] decomposes the output of a neural network into per-pixel relevance scores by propagating the
classification score backward through the network according to conservation rules that preserve total relevance
at every layer. Unlike gradient methods, LRP does not compute derivatives; instead it uses propagation rules
(epsilon-LRP, alpha-beta-LRP, and composite variants) that redistribute the relevance of each neuron to its inputs
in proportion to how much each input contributed to that neuron's activation, so that total relevance is conserved
from layer to layer; the alpha-beta rule additionally treats positive and negative contributions separately so that
they do not cancel out. The result is a pixel-level attribution map with much higher spatial resolution than
Grad-CAM, able to resolve individual symptomatic structures such as acervuli (fungal fruiting bodies), the edges
of bacterial ooze, and necrotic tissue that forms along veins. LRP-based explanations have been reported to score
better than both Grad-CAM [9] and LIME [11] under pixel flipping, which measures the loss of classification
accuracy as the most relevant pixels are removed sequentially. LRP-based explanations also showed higher
faithfulness than vanilla gradient-based methods, meaning that they reflect the model's actual decision-making
process rather than merely producing visually plausible heatmaps, and they supported greater confidence in
predictions on new examples. As agricultural decision support increasingly adopts XAI models, practitioners
need to verify the reliability of the chosen explanation approach and how well its output correlates with actual
model behaviour. Composite LRP rules, in which one propagation rule (epsilon) is applied to the upper
convolutional layers and another (alpha-beta) to the lower layers, offer a favourable trade-off between stability
and faithfulness for CNNs used in agriculture [15].
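The conservation property can be checked on a small example. The following is a minimal sketch of the epsilon rule for a two-layer, bias-free ReLU network with random weights; it is an illustration under those assumptions, not the composite rule set used for full CNNs, and all array sizes are hypothetical.

```python
import numpy as np


def lrp_epsilon_dense(a_in, weights, relevance_out, eps=1e-9):
    """Redistribute relevance from a dense layer's outputs to its inputs (epsilon rule)."""
    z = a_in @ weights                                  # pre-activations, shape (out,)
    z_stab = z + eps * np.where(z >= 0, 1.0, -1.0)      # stabiliser keeps the division safe
    s = relevance_out / z_stab
    return a_in * (weights @ s)                         # R_i = a_i * sum_j w_ij * R_j / z_j


rng = np.random.default_rng(1)
x = rng.random(6)                       # hypothetical input features
w1 = rng.normal(size=(6, 4))            # first layer weights (no bias)
w2 = rng.normal(size=(4, 3))            # output layer weights (no bias)

h = np.maximum(0.0, x @ w1)             # ReLU hidden activations
logits = h @ w2
target = int(np.argmax(logits))

# Start from the score of the predicted class only.
r_out = np.zeros(3)
r_out[target] = logits[target]

r_hidden = lrp_epsilon_dense(h, w2, r_out)
r_input = lrp_epsilon_dense(x, w1, r_hidden)

print("Class score:        ", logits[target])
print("Sum of input relevance:", r_input.sum())
```

Up to the small stabiliser term, the input relevances sum to the class score, which is the layer-wise conservation described above. With bias terms or a larger epsilon, part of the relevance is absorbed and the sums no longer match exactly.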
Counterfactual Explanations and Example-based Methods
Counterfactual explanations address the question of what is the smallest change to an input that would lead the
model to predict a different class. For leaf disease classification, the question can be framed as the minimum
change that would be needed on a given leaf for the model to classify the
leaf as healthy. Several methods exist for generating counterfactuals, including DiCE (Diverse Counterfactual
Explanations), as well as methods that use image prototypes or feature modifications to visualise the model's
decision boundaries, which makes this information much easier for a layperson to understand. Counterfactual
explanations are especially helpful for describing disease severity thresholds, that is, the point at which the
features of a leaf cross the boundary from being classified as healthy to being classified as diseased; this is
especially relevant for deciding the timing of interventions.
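For a linear classifier the smallest such change has a closed form, which gives a compact illustration of the idea. The sketch below is an assumption-laden toy: the three leaf features, the weights and the bias are hypothetical, and image-level counterfactual methods such as DiCE work differently and impose additional plausibility and diversity constraints.

```python
import numpy as np


def minimal_counterfactual(x, w, b, margin=1e-3):
    """Smallest L2 change that moves x just across the boundary w.x + b = 0."""
    score = float(w @ x + b)
    # Step along the boundary normal, overshooting slightly so the label flips.
    step = -(score + np.sign(score) * margin) / float(w @ w)
    return x + step * w


# Hypothetical features: [lesion area fraction, yellowing index, edge necrosis score]
w = np.array([4.0, 2.5, 1.5])
b = -2.0
leaf = np.array([0.45, 0.30, 0.20])     # score > 0, i.e. classified as diseased

cf = minimal_counterfactual(leaf, w, b)
delta = cf - leaf

print("Original score:      ", float(w @ leaf + b))
print("Counterfactual score:", float(w @ cf + b))
print("Required change per feature:", np.round(delta, 4))
```

The per-feature change indicates how far each symptom measure would need to decrease before the model labelled the leaf healthy, which is one way to read the severity threshold discussed above.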
Figure 2: End-to-End Explainable AI Pipeline
Figure 2 shows the stages of the pipeline through which a data point passes, from raw leaf imagery through
pre-processing and model inference to decision-support output, with an example of how Explainable Artificial
Intelligence (XAI) is applied at each stage.
Table 2: Comparison of XAI methods for deep learning models to detect and classify diseases in plants.
| XAI Method | Model-Agnostic | Spatial Resolution | Plant Disease Application |
| --- | --- | --- | --- |
| Grad-CAM | No | Coarse | Lesion heatmap overlay for diagnosis |
| Grad-CAM++ | No | Coarse to medium | Multi-lesion localisation on leaves |
| Score-CAM | No | Medium | Gradient-free lesion attribution |
| LIME | Yes | Superpixel | Boundary of infected vs healthy tissue |
| SHAP (Deep/Kernel) | Yes | Feature | Spectral band ranking in hyperspectral |
| Attention Rollout | No | Fine patch | Early symptom pre-visual detection |
| LRP | No | Pixel-precise | Pixel-level symptom tracing |
| Counterfactual XAI | Yes | Image-level | Healthy vs diseased comparison |
Datasets for Plant Disease Deep Learning
PlantVillage: The Foundational Benchmark
Hughes and Salathé gathered the images onto a single platform in 2015 and made them publicly available as a
benchmark for plant disease deep learning research. The benchmark includes 54,306 images of healthy and
diseased plant leaves from 14 crop types (for example tomato, potato, grape, corn, apple and strawberry),
organised into 38 classes that cover 26 diseases. The dataset consists of high-quality pictures taken under
controlled studio lighting against a grey background, which provides a consistent basis for reproducible
benchmarking and allows researchers to compare their
results. The high-quality controlled conditions of the PlantVillage dataset enabled the rapid development of
high-performing CNN-based classifiers for plant disease detection and aided the growth of AI research in the
plant disease community. However, training models on PlantVillage carries significant limitations in ecological
validity. Models built only on PlantVillage exhibited consistent and systematic generalisation failure when
deployed in real-world environments and tested on the PlantDoc benchmark (which comprises field-acquired
photographs). The PlantVillage training photographs have an unchanging laboratory background, which serves
as a very strong visual cue that does not exist in agricultural production fields. Models that had learned to rely
on this laboratory cue through shortcut learning (as identified with Grad-CAM) therefore performed poorly under
realistic testing conditions [16]. As part of ongoing development efforts, several augmented versions of
PlantVillage have been produced, including segmented foregrounds, colour-adjusted variants, and simulated
field conditions created by background replacement or compositing; all of them still face the same fundamental
issue of a significant domain gap.
Field-Acquired and Open-World Datasets
The PlantDoc dataset created by Singh et al. [17] contains 2,569 openly licensed photographs collected from the
internet, representing 27 classes across 13 plant species in natural field settings. The dataset has highly variable
backgrounds, lighting, leaf orientations and other compositional aspects, which makes it a good benchmark for
assessing real-world generalisation. Many PlantDoc photos carry bounding-box annotations, which allow both
classification models and object detection frameworks to be evaluated on the same data. Because deep learning
models generally require large amounts of data and this dataset contains only 2,500-3,000 images, it cannot be
used to train deep learning models from scratch, but it offers an excellent opportunity for transfer learning
evaluation.
The Cassava Leaf Disease dataset, published as part of a Kaggle competition, contains more than 21,000 field
images of cassava leaves from sub-Saharan Africa; cassava is an important food source for over 800 million
people globally. As one of the largest collections of field-realistic images of a major food-security crop, this
dataset has become a primary benchmark for evaluating deep learning in agricultural applications. Another major
dataset is the 2018 AI Challenger dataset, with 53,744 field photographs of multiple crops together with semantic
segmentation labels, which allows classification and segmentation models to be evaluated and compared. The
FGVC8 Plant Pathology competition (held in 2021) introduced multi-label annotation, in which several disease
classes can be assigned to a single image, reflecting the way a crop commonly exhibits a number of overlapping
diseases. Multi-label classification complicates XAI attribution, since a separate attribution map must be created
and interpreted for each active label, and the spatial overlap between the attribution maps of the different labels
must also be measured and taken into account.
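One simple way to quantify the spatial overlap between per-label attribution maps is the intersection over union (IoU) of their most relevant regions. The sketch below assumes that one attribution map per active label is already available; the two synthetic maps, the label names and the 20% cut-off are hypothetical choices made for illustration.

```python
import numpy as np


def top_fraction_mask(attribution, fraction=0.2):
    """Boolean mask of the highest-scoring `fraction` of pixels in an attribution map."""
    k = max(1, int(round(fraction * attribution.size)))
    threshold = np.sort(attribution, axis=None)[-k]
    return attribution >= threshold


def mask_iou(mask_a, mask_b):
    """Intersection over union of two boolean masks (0 = disjoint, 1 = identical)."""
    union = np.logical_or(mask_a, mask_b).sum()
    return float(np.logical_and(mask_a, mask_b).sum() / union) if union else 0.0


# Synthetic per-label maps for one multi-label image (e.g. labels "scab" and "rust").
h, w = 8, 8
yy, xx = np.mgrid[0:h, 0:w]
scab_map = np.exp(-((yy - 2) ** 2 + (xx - 2) ** 2) / 4.0)   # hotspot in the upper left
rust_map = np.exp(-((yy - 5) ** 2 + (xx - 5) ** 2) / 4.0)   # hotspot in the lower right

overlap = mask_iou(top_fraction_mask(scab_map), top_fraction_mask(rust_map))
self_overlap = mask_iou(top_fraction_mask(scab_map), top_fraction_mask(scab_map))

print("IoU between the two labels:", overlap)
print("IoU of a map with itself:  ", self_overlap)
```

A low IoU suggests that the model attends to distinct regions for the two labels, whereas a high IoU may indicate either genuinely co-located symptoms or a model that is not separating the labels spatially.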
Hyperspectral and Multi-Modal Imaging Datasets
Morphological and colour indicators of disease visible in RGB images cannot capture the biochemical indicators
of disease that appear, before symptoms become visually identifiable, in spectral ranges recorded through
near-infrared reflectance, chlorophyll fluorescence emission or thermal emission. Identifying diseases while they
are pre-symptomatic, i.e. before visible lesions are present, is an important goal of disease management because
pathogen populations are smaller during the early stages of infection, the range of intervention options (e.g.
pesticide applications) tends to be wider, and treatment costs are lower. Hyperspectral datasets for plant disease,
acquired over field plots by UAV-mounted pushbroom spectrometers, contain hundreds of contiguous spectral
bands spanning the 400 to 2500 nanometre range, and Shapley-based attribution (SHAP) can be applied to them
to identify the wavelengths associated with specific biochemical processes of disease. Temperature changes
resulting from pathogen damage can be detected through thermal infrared imaging, which allows
pre-symptomatic identification of diseases that affect the plant vascular system (e.g. root diseases and vascular
wilts) by detecting temperature anomalies caused by reduced stomatal conductance [12]. Changes in
three-dimensional plant canopy architecture resulting from disease and damage can be obtained from lidar and
structured light scanning technologies. Integrating multiple data modalities, including spectral imaging, thermal
imaging and three-dimensional data, using deep learning architectures that combine modality-specific feature
extractors with attention mechanisms for cross-modal interaction represents a new area of research. In such
multi-modal applications, XAI must make clear how much each modality contributes to a prediction, to ensure
that the model is not relying inappropriately on a single modality when another modality could provide
complementary evidence.
Figure 3: Benchmark Datasets for Plant Disease Deep Learning
Figure 3 comprises two subplots showing the number of datasets and how they were acquired (lab, field,
UAV/hyperspectral).
Table 3: A summary of the benchmark datasets available for plant disease deep learning research.
| Dataset | Images | Classes | Setting | Crop Coverage | Annotation |
| --- | --- | --- | --- | --- | --- |
| PlantVillage | 54,306 | 38 | Laboratory | 14 crops | Disease label |
| PlantDoc | 2,569 | 27 | Field | 13 crops | BBox + label |
| Rice Disease DB | 10,845 | 10 | Field | Rice | Disease label |
| Cassava Leaf DB | 21,367 | 5 | Field | Cassava | Disease label |
| AI Challenger 2018 | 53,744 | 18 | Field | Multi-crop | Segmentation |
| FGVC8-Plant Path. | 31,985 | 6 (multi-label) | Field | Apple, grape… | Multi-label |
| DeepWeeds (ref.) | 15,007 | 9 | Field (UAV) | Weed species | Species label |
Practical Challenges in Real-World Deployment
Class Imbalance and Long-Tail Disease Distributions
Datasets for agricultural diseases are heavily imbalanced across classes and reflect the current epidemiological
situation of those diseases. Powdery mildew, early blight and downy mildew are among the most common
diseases, and thousands of images of them are available; however, most economically important pathogens that
are not widespread or that occur at low prevalence (for example newly invasive species, new forms of endemic
disease, and quarantine pathogens) typically have fewer than 50 verified field images available globally.
Consequently, when models are trained with a standard cross-entropy loss, optimisation is biased towards the
majority classes and the model is insensitive to the minority classes, which are the most relevant for AI detection
for food security and biosecurity purposes. The most common methods of dealing with
the class imbalance are loss function modifications, data resampling, and generative data augmentation.
Class-balanced focal loss [18] down-weights easy examples from the majority classes and concentrates the
learning signal on difficult examples from the minority classes. Oversampling the minority classes with
augmentation and undersampling the majority classes when batches are constructed provides an alternative form
of class rebalancing. Another very promising way to generate additional minority-class training data is to
synthesise photorealistic disease images with Generative Adversarial Networks (GANs) and, more recently,
diffusion models. However, synthetic samples produced by these models may lack the pathological realism
needed for sound XAI attribution analysis. The integrity of synthetic training data should therefore be evaluated
in terms of both classification performance and the plausibility of the resulting explanations.
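The loss modification described above can be sketched in NumPy. The sketch assumes the commonly used formulation in which class weights follow the inverse "effective number" of samples, (1 - β) / (1 - β^n), and the focal term is (1 - p_t)^γ; it uses a softmax variant for brevity, and the class counts, β and γ are illustrative values rather than settings taken from [18].

```python
import numpy as np


def class_balanced_weights(samples_per_class, beta=0.999):
    """Weights proportional to the inverse effective number of samples per class."""
    n = np.asarray(samples_per_class, dtype=float)
    weights = (1.0 - beta) / (1.0 - np.power(beta, n))
    return weights * len(n) / weights.sum()       # normalise to sum to the class count


def class_balanced_focal_loss(logits, labels, samples_per_class, beta=0.999, gamma=2.0):
    """Mean softmax focal loss with class-balanced weighting."""
    shifted = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_pt = log_probs[np.arange(len(labels)), labels]              # log-prob of true class
    pt = np.exp(log_pt)
    alpha = class_balanced_weights(samples_per_class, beta)[labels]
    # (1 - pt)^gamma shrinks the loss of confidently correct ("easy") examples.
    return float(np.mean(-alpha * (1.0 - pt) ** gamma * log_pt))


# Hypothetical class counts: two common diseases and one rare quarantine pathogen.
counts = [5000, 3000, 40]
weights = class_balanced_weights(counts)

logits = np.array([[4.0, 0.5, 0.1],     # confident, correct majority-class example
                   [1.0, 0.8, 0.9]])    # uncertain minority-class example
labels = np.array([0, 2])
loss = class_balanced_focal_loss(logits, labels, counts)

print("Class weights:", np.round(weights, 4))
print("Loss:", loss)
```

With these counts the rare class receives by far the largest weight, and the confidently classified majority example contributes almost nothing to the loss.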
Domain Shift and Generalisation to Unseen Environments
Domain shift between the conditions under which training data were collected and the field deployment context
represents the single most important barrier to the scalability of plant disease AI applications. Differences in
illumination across times of day, in sensor characteristics across smartphone models, in strain morphology and
severity across geographic regions, in leaf development from seedling to senescence, and in composition
between single-leaf close-ups and whole-plant views all alter the marginal distribution of the input data in ways
that degrade classifier performance. More troubling still, domain shift can affect XAI attributions without any
obvious degradation of classification accuracy on the training distribution: a model can exhibit high average
accuracy while relying on spurious correlations that shift between environments and produce misleading
explanations. Domain generalization evaluation, in which models are tested on data collected from geographic
regions, growing seasons or sensor types that were completely absent from training, has therefore become the
appropriate evaluation framework for agricultural AI systems. A model that performs well on standard
PlantVillage [16] test splits but poorly on PlantDoc [17], Cassava or cross-geographic field collections is not
considered ready to deploy. Beyond performance metrics, an additional diagnostic is XAI attribution stability
analysis, which compares where Grad-CAM [9] and/or SHAP [12] attributions focus across domains to check
that the model continues to attend to pathologically relevant features rather than background features.
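One possible form of attribution stability analysis is to measure, for each domain, the share of attribution that falls inside an annotated leaf or lesion mask and to compare the averages. The sketch below is a hypothetical metric written for illustration; the function names, the synthetic maps and the mask are assumptions, and real use would require segmentation masks for both domains.

```python
import numpy as np


def relevance_mass_in_mask(attribution, region_mask):
    """Share of positive attribution that falls inside the leaf/lesion mask (0 to 1)."""
    positive = np.clip(attribution, 0.0, None)
    total = positive.sum()
    return float(positive[region_mask].sum() / total) if total > 0 else 0.0


def stability_gap(maps_a, masks_a, maps_b, masks_b):
    """Difference in mean in-mask relevance between a source and a target domain."""
    score_a = np.mean([relevance_mass_in_mask(m, k) for m, k in zip(maps_a, masks_a)])
    score_b = np.mean([relevance_mass_in_mask(m, k) for m, k in zip(maps_b, masks_b)])
    return float(score_a - score_b)


# Synthetic example: an 8x8 image with the lesion in the central 4x4 block.
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True

lab_map = np.zeros((8, 8))
lab_map[2:6, 2:6] = 1.0            # source domain: attribution sits on the lesion
field_map = np.ones((8, 8))        # target domain: attribution spread over everything

gap = stability_gap([lab_map], [mask], [field_map], [mask])
print("In-mask relevance (lab):  ", relevance_mass_in_mask(lab_map, mask))
print("In-mask relevance (field):", relevance_mass_in_mask(field_map, mask))
print("Stability gap:", gap)
```

A large positive gap indicates that the model's focus drifts away from the annotated region in the target domain, even if accuracy has not yet visibly dropped.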
On-Device Inference and Model Compression
To operate plant disease AI in smallholder farming environments where Internet access is limited or sporadic,
models must run on consumer smartphones without relying on a network connection. This requires on-device
inference libraries, and it imposes tight resource constraints: a full-precision ResNet-50 [3] model has more than
25 million parameters (about 98 MB of storage), which can exceed the memory available on low-end Android
devices and leads to slow inference in the field when no GPU acceleration is available. Model compression
methods (such as knowledge distillation, structured filter pruning, and post-training quantisation to 8- or 4-bit
integer arithmetic) can reduce both model size and inference latency to levels suitable for mobile neural
processing units. Optimised inference engines such as TensorFlow Lite and ONNX Runtime for Android/iOS
exploit hardware-specific acceleration (ARM NEON SIMD, GPU compute shaders, and vendor-specific AI
accelerators such as Qualcomm’s Hexagon NPU). However, model compression usually entails some loss of
accuracy, and it can also degrade XAI attribution. Specifically, integer quantisation introduces rounding error
into the gradient computation, which can distort Grad-CAM visualisations; structured pruning may eliminate
convolutional channels that carry pathologically relevant features (even if their contribution to average accuracy
is small); and knowledge distillation does not reliably transfer the fidelity of the teacher model’s explanations to
the student model. Research into XAI-aware compression techniques that explicitly preserve the quality of
attribution maps while maintaining prediction accuracy is a developing area with immediate practical relevance
to agricultural use.
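The effect of quantisation on an attribution can be illustrated on a toy linear model. The sketch below simulates symmetric per-tensor INT8 quantisation in NumPy and compares a gradient-times-input attribution before and after; it does not use TensorFlow Lite or ONNX Runtime, the weights and inputs are random stand-ins, and the parameter count used for the storage arithmetic is an approximation of the ResNet-50 figure quoted above.

```python
import numpy as np


def quantise_int8(weights):
    """Symmetric per-tensor post-training quantisation to signed 8-bit integers."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantise(q, scale):
    """Map INT8 codes back to approximate floating-point weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=64).astype(np.float32)   # hypothetical layer weights
x = rng.random(64).astype(np.float32)                   # hypothetical input features

q, scale = quantise_int8(w)
w_hat = dequantise(q, scale)

# Gradient-times-input attribution for a linear score w.x (the gradient is w itself).
attr_fp32 = w * x
attr_int8 = w_hat * x

max_weight_error = float(np.abs(w - w_hat).max())       # bounded by scale / 2
cosine = float(attr_fp32 @ attr_int8
               / (np.linalg.norm(attr_fp32) * np.linalg.norm(attr_int8)))

# Storage arithmetic, assuming roughly 25.6 million parameters.
params = 25_600_000
size_fp32_mib = params * 4 / 2**20      # 4 bytes per FP32 weight
size_int8_mib = params * 1 / 2**20      # 1 byte per INT8 weight

print("Max weight error:", max_weight_error, "(half a step =", scale / 2, ")")
print("Cosine similarity of attributions:", cosine)
print("FP32 size (MiB):", size_fp32_mib, " INT8 size (MiB):", size_int8_mib)
```

For a single linear layer the attribution is almost unchanged, but in a deep network the rounding errors compound across layers, which is why the attribution maps of a compressed model should be revalidated rather than assumed to match those of the original.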
Figure 4: Multi-dimensional Comparison of XAI Methods
Figure 4 is a radar plot with five axes that compares the strengths and weaknesses of several data-agnostic and
data-sensitive explainability methods (e.g., LIME, SHAP, LRP).
Table 4: Effect of various model compression strategies on XAI in relation to their application for edge deployment.
| Strategy | Target Platform | Technique | XAI Impact |
| --- | --- | --- | --- |
| Quantisation | Mobile CPU/NPU | INT8 / INT4 post-training quantisation | Gradient noise may distort CAM maps |
| Pruning | Edge devices | Structured channel pruning + fine-tuning | May remove channels critical to attribution |
| Knowledge Distillation | Smartphone | Teacher–student training | Student explanations may differ from teacher |
| TFLite / ONNX RT | Android / iOS | Format conversion + hardware delegation | XAI must be revalidated post-conversion |
Human–AI Interaction and Trust Calibration
The effectiveness of AI in improving agriculture through disease detection and diagnostic systems depends more
on whether the explanations and outputs of these systems lead to better agricultural decision making than on
laboratory accuracy. Human-computer interaction research has established that the relationship between the
type of explanation a system presents and users' level of trust is variable and non-monotonic. When users receive
accurate explanations that are relevant to the task at hand, they develop appropriately calibrated trust in the AI
system and make proper use of its recommendations. Conversely, when users receive visually appealing but
inaccurate (i.e. technically unfaithful) explanations, they can develop overtrust, which suppresses proper human
oversight of the AI system and can lead to outcomes worse than those obtained with no AI system at all.
Similarly, when explanations are overly technical or provide excessive information, users can disengage from
the AI system completely. In user studies performed with
farmers from Kenya and India using early prototypes for AI-assisted plant disease prediction, key interaction
design guidelines emerged: (1) a spatial heat overlay on the captured picture of the leaf is easier for non-expert
users to understand than rankings of abstract feature importance values; (2) descriptions of the identified
symptoms in the farmer's local language build trust and motivate action to treat the plants; (3) explicit
indications of how confident the AI model is in its prediction reduce farmers' over-confidence in predictions that
carry low certainty; (4) showing the farmer's own image of the crop alongside a reference image of the symptoms
of the identified disease gives non-literate farmers a contextual anchor for visually validating the AI's
predictions. These results should inform the design of the human-AI interface, including which XAI mechanisms
to provide, how to format them, and how to express uncertainty, at least as much as the underlying methodology
used to produce the explanations does.
Translating Explanations for Non-Expert End-Users
Outputs of XAI techniques such as saliency maps, SHAP value plots, attention rollout overlays and LIME
superpixel attributions are designed for AI researchers and model developers. When these outputs are presented
without additional explanatory information to smallholder farmers, rural extension agents, or non-literate
end-users, they can cause confusion, create false confidence, or lead to complete disengagement. Connecting
machine-generated explanations effectively to the actions and decisions taken on a farm remains a major
unresolved challenge for the agriculture sector. This section presents concrete strategies for converting technical
XAI outputs into meaningful, actionable and trustworthy forms that enable non-expert agricultural stakeholders
to make sound decisions.
One foundational strategy for explaining model outputs to non-experts is to convert the model's output (in this
case the Grad-CAM heatmap) into a short natural-language description. Rather than presenting a Grad-CAM
heatmap on its own, a system might generate a simple sentence about the part of the crop that the heatmap
highlights, written in language farmers are familiar with, for example, "The AI has identified yellowing and
irregular brown spots near the leaf tips as being consistent with Early Blight caused by Alternaria solani." Large
vision-language foundation models such as GPT-4V and InstructBLIP can also be used to generate
natural-language descriptions of symptoms directly from leaf images. In addition, retrieval-augmented
generation (i.e. grounding model output in curated phytopathology reference literature) can improve both the
factual accuracy and the agronomic relevance of machine-generated content.
The second strategy is to create decision-support templates that map XAI attributions to structured,
action-oriented advice. Rather than exposing farmers directly to probability scores or attribution values, the
application interface can provide a tiered output based on a consistent template. The tiers are: (1) a statement
identifying the disease and expressing the level of confidence in layman's terms; (2) a brief explanation of the
visible evidence the AI used to reach its result, derived automatically from the XAI attribution map; (3) a specific
recommended course of action; and (4) a referral to the local extension agent (or agricultural helpline) when the
model's confidence is below a defined threshold. This template approach lets XAI attributions serve as an internal
verification mechanism while only meaningful agronomic information is shown to farmers. The referral
thresholds should be determined in partnership with agronomists for each crop-pathogen combination and
should reflect realistic timelines for managing the disease in question.
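The tiered template can be sketched as a small rendering function. Everything in the sketch is hypothetical: the `Diagnosis` fields, the wording of each tier, and the confidence cut-offs of 0.85 and 0.60 are placeholders that would need to be agreed with agronomists for each crop-pathogen combination, as noted above.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    disease: str
    confidence: float      # model probability in [0, 1]
    evidence: str          # short phrase derived from the attribution map
    action: str            # agronomist-approved recommendation


def confidence_phrase(confidence):
    """Translate a probability into plain wording (cut-offs are illustrative)."""
    if confidence >= 0.85:
        return "very likely"
    if confidence >= 0.60:
        return "probably"
    return "possibly"


def render_advice(d, referral_threshold=0.60):
    """Build the tiered message; tier 4 is added only below the referral threshold."""
    lines = [
        f"1. Your plant {confidence_phrase(d.confidence)} has {d.disease}.",
        f"2. What the app saw: {d.evidence}.",
        f"3. What to do: {d.action}.",
    ]
    if d.confidence < referral_threshold:
        lines.append("4. The app is not sure. Please contact your local "
                     "extension agent or agricultural helpline.")
    return "\n".join(lines)


high = render_advice(Diagnosis(
    disease="early blight",
    confidence=0.92,
    evidence="brown spots with rings near the leaf tips",
    action="remove the affected leaves",
))
low = render_advice(Diagnosis(
    disease="early blight",
    confidence=0.40,
    evidence="faint yellowing along the leaf edge",
    action="check the plant again in two days",
))
print(high)
print()
print(low)
```

Keeping the raw probability inside the function and exposing only a verbal confidence phrase follows the principle stated above that farmers should not be shown probability scores or attribution values directly.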
A third technique is example-based (analogical) explanation, which uses prototypes and counterfactuals as the
basis for explaining the model’s prediction. Instead of explaining the prediction in abstract terms, the interface
can show a reference image from a disease atlas, labelled or annotated to indicate how the user’s leaf image
exhibits similar symptoms. A side-by-side comparison of the user’s leaf image and the corresponding reference
creates an immediate visual similarity that can be interpreted without prior knowledge of neural networks or
attribution methods. For non-literate users, an image-to-image comparison is a more approachable anchor than
a text description. The reference library can contain images that have been taken in the
field and documented by crop, pathogen, and severity stage, and it can grow through federated contributions
from agronomists and extension agents.
Assessing the quality of explanations for non-experts (as opposed to reporting purely technical performance
metrics) requires distinct user-centred evaluation measures to be developed. Aspects that can be assessed
directly include the farmer's comprehension of the explanation; the appropriateness of the suggested action to
the actual agronomic conditions in the field at the time the explanation is used; and calibration (the relationship
between the farmer's confidence in the AI system's assessment of current field conditions and the system's own
stated uncertainty). User-centred evaluation can be carried out through participatory design, in which farmers,
extension officers and agronomists are directly involved in iteratively prototyping their preferred explanation
formats, to ensure that the resulting formats are contextually appropriate, functionally useful, and free of
unintended assumptions about literacy or access to external resources. Without a grounded evaluation
framework, even technically superior XAI systems may fail to benefit the smallholder farming communities that
stand to gain the most from intelligent plant disease detection.
Emerging Research Directions
Foundation Models and Vision-Language Models
Large vision-language foundation models such as CLIP [19] (Contrastive Language-Image Pre-training), Flamingo,
InstructBLIP and GPT-4V open approaches to plant disease detection that were not practical with task-specific
classifiers alone. These models support zero-shot inference: they can classify diseases that never appeared in
any fine-tuning set by drawing on the cross-modal semantic representations encoded in their weights. For
example, a CLIP-based tomato leaf classifier can match an image against a text prompt such as "a tomato leaf
with early blight lesions showing concentric rings and yellow halos" and reach useful accuracy without any
task-specific fine-tuning, which could fast-track deployment for newly emerging diseases that have not yet been
catalogued. Self-supervised pre-training techniques such as SimCLR [20] complement this by using contrastive
learning, in which augmented views of the same image are pulled together and other images in the batch act as
negatives, to learn robust visual representations from unlabelled leaf images in label-scarce agricultural
settings. Generative vision-language models can also produce natural language descriptions of the symptoms
observed on a leaf. Prompt engineering and retrieval-augmented generation (i.e. conditioning model outputs on
relevant phytopathology literature) help ground these descriptions in existing plant pathology knowledge, making
them more accurate and usable. Evaluating the factual correctness and agronomic relevance of generated
descriptions remains a significant interdisciplinary challenge that will require collaboration between AI
researchers and plant pathologists.
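The zero-shot mechanism described above can be illustrated with a minimal sketch. It assumes that an image
encoder and a text encoder (such as those of CLIP) have already produced embeddings in a shared space; the
three-dimensional vectors, the prompt list and the temperature value below are hypothetical stand-ins chosen
for illustration, not outputs of any real model.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and several text prompts."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical class prompts; wording is an assumption for illustration.
PROMPTS = [
    "a healthy tomato leaf",
    "a tomato leaf with early blight lesions showing concentric rings and yellow halos",
    "a tomato leaf with late blight",
]

# Stand-in embeddings (a real system would obtain these from the two encoders).
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.1, 0.9, 0.2])

probs = zero_shot_classify(image_emb, text_embs)
predicted_prompt = PROMPTS[int(np.argmax(probs))]
```

No disease-specific training occurs here: adding a new disease only requires adding a new prompt, which is why
prompt wording would need review by plant pathologists in practice.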
Causal Explainability
Most currently available XAI techniques are associational (they attribute a prediction to features that co-occur
with the output) rather than causal (attributing it to features that actually determine the output, interpreted
through the biology of the disease). The distinction between correlation and causation matters in plant
pathology. For example, a model may produce a Grad-CAM heatmap that correctly highlights the location of a
lesion [9], yet still rely on a spurious correlation between the lesion area and an image acquisition artifact,
yielding an explanation that is visually plausible but uninformative about what actually drives the prediction.
Causal XAI methods based on structural causal models, do-calculus interventions, and counterfactual reasoning
provide a theoretically principled framework for causal attribution that goes beyond correlation-based
approaches [15]. Plant disease applications require explicitly stated assumptions about the cause-and-effect
relations among imaging conditions, leaf appearance, pathogen attributes, and model predictions (the assumptions
will require verification by plant pathologists). Twin network architectures offer one operational route to
causal attribution: they produce counterfactual predictions by intervening on (manipulating) specific input
variables while holding the others constant. This methodology is more computationally intensive and more
theoretically complex than gradient-based approaches, but it is expected to be more reliable when deployment
data differ from the training data or when relationships between variables reflect spurious correlations. For
these reasons, causal XAI is a promising route to reliable explanations under the challenging deployment
conditions associated with agricultural AI.
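The idea of intervening on one input variable while holding the others constant can be sketched in a few lines.
This is a simplified input-level intervention, not a twin network or a full structural causal model; the toy
scoring function, the region masks and the replacement value are all hypothetical and serve only to show the
shape of the test.

```python
import numpy as np

def intervention_effect(predict, image, region_mask, replacement):
    """Change in the model score when only the masked region is replaced."""
    counterfactual = np.where(region_mask, replacement, image)
    return predict(image) - predict(counterfactual)

# Hypothetical toy "model": its score depends only on the top-left quadrant.
def toy_predict(img):
    return float(img[:4, :4].mean())

image = np.zeros((8, 8))
image[:4, :4] = 1.0    # stands in for a lesion region
image[4:, 4:] = 0.5    # stands in for an acquisition artifact

lesion_mask = np.zeros((8, 8), dtype=bool)
lesion_mask[:4, :4] = True
artifact_mask = np.zeros((8, 8), dtype=bool)
artifact_mask[4:, 4:] = True

# Intervene on each region separately, leaving all other pixels unchanged.
lesion_effect = intervention_effect(toy_predict, image, lesion_mask, 0.0)
artifact_effect = intervention_effect(toy_predict, image, artifact_mask, 0.0)
```

A large effect for the artifact region and a small one for the lesion region would suggest that the model relies
on the artifact. Which regions count as lesion or artifact is itself a causal assumption that needs expert
verification.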
Federated Learning for Decentralised Collaboration
Within any single geographic area, images of diseases are scarce, especially for rare diseases, newly emerging
pathogens, and under-studied crops; this scarcity is a structural impediment to building high-quality disease
detection models. Federated learning [21] addresses this impediment by allowing multiple organizations,
researchers and individual users to collaboratively train a model without aggregating the raw image data in one
central location. In a federated learning scheme, each participant independently trains a local model on its own
private dataset and sends only model updates (gradients or parameters) to a central aggregation server. The
server combines the contributions using federated averaging or a more sophisticated aggregation method. Data
privacy is better protected because the raw images never leave the site where they originated; this is
particularly important for economically sensitive disease outbreak data, since the premature release of
information about an outbreak could disrupt commodity markets and trigger trade restrictions. Federated learning
for plant disease detection faces a pronounced technical challenge: agricultural datasets tend to be highly
non-IID (not independent and identically distributed) across clients. Disease prevalence, pathogen strains, crop
varieties, and imaging conditions vary significantly across geographic regions, producing heterogeneous
client-level distributions and causing convergence problems. Personalised federated learning, in which
participants maintain personal model components alongside the shared global components, provides one mechanism
for addressing this heterogeneity. Distributional heterogeneity also complicates explanation: Grad-CAM [9] or
SHAP [12] results computed on an individual client reflect that client's own data distribution and cannot be
mapped directly onto the global aggregated model. Federated XAI therefore needs aggregation protocols that can
produce globally representative Grad-CAM or SHAP values without requiring data centralisation.
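The aggregation step of federated averaging can be sketched as a sample-size-weighted mean of client parameters.
The sketch below assumes each client returns a dictionary of NumPy arrays with identical keys and shapes; the
parameter names and the client sample counts are hypothetical, and communication, local training and privacy
mechanisms are omitted.

```python
import numpy as np

def federated_average(client_params, client_sizes):
    """Average client parameter dicts, weighting each client by its sample count."""
    total = float(sum(client_sizes))
    keys = client_params[0].keys()
    return {
        k: sum(p[k] * (n / total) for p, n in zip(client_params, client_sizes))
        for k in keys
    }

# Two hypothetical clients with different amounts of local leaf images.
clients = [
    {"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 4.0]), "b": np.array([1.0])},
]
sizes = [100, 300]

global_params = federated_average(clients, sizes)
# The client with 300 images contributes three quarters of each averaged value.
```

Under strongly non-IID client data this plain weighted mean can converge poorly, which is the motivation for the
personalised variants discussed above.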
Continual Learning for Evolving Pathogen Landscapes
Plant pathogens are dynamic systems under constant evolutionary pressure. Mutation, recombination and horizontal
gene transfer continue to produce new strains that differ in virulence, host range or symptom expression.
Climate variability is expanding the geographic distribution of many plant pathogens, creating new disease/crop
combinations as pathogens move into areas where they did not previously occur, so that existing training data
are not representative of the new combinations. The emergence of Xylella fastidiosa in the olive groves of
Europe, and of wheat blast (Magnaporthe oryzae pathotype Triticum) in the wheat-growing regions of South Asia,
illustrates the need for AI-based pathogen identification that can be updated on a timescale too short to permit
retraining from scratch on newly accumulated samples. Continual learning frameworks, also referred to as
lifelong or incremental learning, allow new disease classes to be learned incrementally while avoiding
catastrophic forgetting of previously learned disease classes. There are three categories of approaches to
continual learning: methods based on regularisation (such as Elastic Weight Consolidation and Synaptic
Intelligence), methods based on replay (such as generative replay and exemplar memory), and methods based
on expansion (such as Progressive Neural Networks and PackNet). Monitoring models through Explainable
Artificial Intelligence (XAI) benefits a continual learning system in two ways: it checks whether the model's
attribution maps for already-known disease classes remain consistent as new classes are learned, and it helps
distinguish catastrophic forgetting from cases where classification accuracy on held-out benchmarks is
maintained. EfficientNet [5] and ViT [6] backbones have shown notable potential as base architectures for
continual learning due to their modular and re-usable nature.
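Two of the ideas above can be sketched briefly: the quadratic penalty used by Elastic Weight Consolidation, and a
simple consistency score between attribution maps computed before and after learning a new class. The parameter
vectors, Fisher values, penalty strength and 2x2 attribution maps are hypothetical, and cosine similarity is only
one possible choice of consistency measure.

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """EWC regulariser: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * float(np.sum(fisher * (params - old_params) ** 2))

def attribution_similarity(map_before, map_after):
    """Cosine similarity between two attribution maps for the same image."""
    a, b = map_before.ravel(), map_after.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical parameters before and after training on a new disease class.
old_params = np.array([1.0, 2.0])
new_params = np.array([1.5, 2.0])
fisher = np.array([4.0, 0.1])      # larger value = more important for old classes
penalty = ewc_penalty(new_params, old_params, fisher, lam=2.0)

# Hypothetical attribution maps for one image of an already-known class.
before = np.array([[0.0, 1.0], [0.0, 0.0]])
unchanged = before.copy()
drifted = np.array([[0.0, 0.0], [1.0, 0.0]])
stable_score = attribution_similarity(before, unchanged)
drift_score = attribution_similarity(before, drifted)
```

In training, the penalty would be added to the loss on the new class; in monitoring, a falling similarity score
on known classes could flag forgetting even before benchmark accuracy drops.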
CONCLUSION
Deep learning for plant disease detection has reached very high technical performance: reported accuracy now
exceeds specialist accuracy in controlled environments, and there are substantial real-world field applications
as well. The progression from shallow CNNs, to residual and dense architectures, to vision transformers has
steadily advanced classification accuracy on plant images while also enriching the internal representations
available for attribution analysis. Transfer learning, domain adaptation, and data augmentation have partially
closed the gap between laboratory benchmarks and field conditions, while model compression has made deployment
on consumer-grade mobile devices more practical. This chapter's main point, however, is that accuracy is
essential to agricultural AI but is not sufficient on its own. Agricultural AI must be explainable if it is to be
trusted, held accountable for its decisions, and subject to the right level of oversight for responsible use.
The XAI landscape offers a wide array of methods, including gradient-based saliency mapping, perturbation-based
probing, game-theoretic attribution, intrinsic attention mechanisms, backpropagation-based relevance propagation
and counterfactual reasoning; each has its own strengths, weaknesses and use cases in agricultural AI for plant
disease.
Several research areas are urgent. First, faithfulness evaluation of XAI methods must be standardised and
reported alongside classification accuracy when plant disease AI systems are evaluated. Second, user studies
with farmers and agronomists must inform the human-centred interface design through which technical explanations
are delivered, so that they assist practical decision making. Third, incorporating causal reasoning into XAI
approaches can yield methods whose outputs are more consistent and generalizable under the distribution shifts
typical of agricultural deployment. Finally, foundation models, federated learning, and continual learning are
converging towards AI systems that can be powerful, collaborative, privacy-preserving, and adaptive to
constantly shifting threats from plant pathogens. Success for explainable plant disease AI should be measured
not just by benchmark scores but also by impacts on crop yield, farmer income, and the ability of farmers around
the world to produce enough food sustainably. To have that impact, the field must integrate multiple disciplines
(plant pathology, agriculture, computer vision, human-computer interaction and responsible AI) and commit to
using technology that is both effective and genuinely trustworthy.
REFERENCES
1. Mohanty, S. P., Hughes, D. P., & Salathé, M. (2016). Using deep learning for image-based plant disease
detection. Frontiers in Plant Science, 7, Article 1419.
https://doi.org/10.3389/fpls.2016.01419
2. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image
recognition. International Conference on Learning Representations (ICLR).
3. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
https://doi.org/10.1109/CVPR.2016.90
4. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional
networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp.
4700–4708).
https://doi.org/10.1109/CVPR.2017.243
5. Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks.
In Proceedings of the 36th International Conference on Machine Learning (pp. 6105–6114).
6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2020). An
image is worth 16×16 words: Transformers for image recognition at scale. arXiv.
https://arxiv.org/abs/2010.11929
7. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., et al. (2021). Swin Transformer: Hierarchical vision
transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on
Computer Vision (pp. 10012–10022). https://doi.org/10.1109/ICCV48922.2021.00986
8. Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv.
https://arxiv.org/abs/1702.08608
9. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual
explanations from deep networks via gradient-based localization. In Proceedings of the IEEE
International Conference on Computer Vision (pp. 618–626). https://doi.org/10.1109/ICCV.2017.74
10. Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2012). SLIC superpixels compared
to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence,
34(11), 2274–2282. https://doi.org/10.1109/TPAMI.2012.120
11. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions
of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (pp. 1135–1144).
https://doi.org/10.1145/2939672.2939778
12. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances
in Neural Information Processing Systems (Vol. 30, pp. 4765–4774).
13. Abnar, S., & Zuidema, W. (2020). Quantifying attention flow in transformers. In Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics (pp. 4190–4197).
https://doi.org/10.18653/v1/2020.acl-main.385
14. Woo, S., Park, J., Lee, J. Y., & Kweon, I. S. (2018). CBAM: Convolutional block attention module. In
Proceedings of the European Conference on Computer Vision (pp. 3–19).
https://doi.org/10.1007/978-3-030-01234-2_1
15. Samek, W., Montavon, G., Lapuschkin, S., Anders, C. J., & Müller, K. R. (2021). Explaining deep neural
networks and beyond: A review of methods and applications. Proceedings of the IEEE, 109(3), 247–278.
https://doi.org/10.1109/JPROC.2021.3060483
16. Hughes, D. P., & Salathé, M. (2015). An open access repository of images on plant health to enable the
development of mobile disease diagnostics. arXiv. https://arxiv.org/abs/1511.08060
17. Singh, D., Jain, N., Jain, P., Kayal, P., Kumawat, S., & Batra, N. (2020). PlantDoc: A dataset for visual
plant disease detection. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD (pp. 249–253).
https://doi.org/10.1145/3371158.3371196
18. Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In
Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988).
https://doi.org/10.1109/ICCV.2017.324
19. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning
transferable visual models from natural language supervision. In Proceedings of the 38th International
Conference on Machine Learning (pp. 8748–8763).
20. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning
of visual representations. In Proceedings of the 37th International Conference on Machine Learning (pp.
1597–1607).
21. McMahan, B., Moore, E., Ramage, D., Hampson, S., & Agüera y Arcas, B. (2017). Communication-
efficient learning of deep networks from decentralized data. In Proceedings of the 20th International
Conference on Artificial Intelligence and Statistics (pp. 1273–1282).
22. Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1251–1258).
https://doi.org/10.1109/CVPR.2017.195
23. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for
discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 2921–2929).
https://doi.org/10.1109/CVPR.2016.319