Page 2297

www.rsisinternational.org

INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,

MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)

ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026

Explainable Deep Learning for Intelligent Plant Disease Detection

Dr. Pallavi Sharma, Ngah Hesly Kilofonyuy

Assistant Professor, School of Engineering, Design and Automation- E, Department of ECE, GNA

University, Phagwara, Punjab, India

Undergraduate Student, School of Engineering, Design and Automation- E, Department of ECE, GNA

University, Phagwara, Punjab, India

DOI: https://doi.org/10.51583/IJLTEMAS.2026.150500184

Received: 11 May 2026; Accepted: 16 May 2026; Published: 12 June 2026

ABSTRACT

Plant diseases cause an estimated 10–40% loss in global crop yields each year. This serious and growing threat endangers food security, rural livelihoods, and agricultural economies. Advances in deep learning, computer vision, and mobile technology have created an opportunity to recognize plant diseases automatically from leaf images. Published classification accuracies on benchmark datasets now exceed 97%. This is an important achievement, but high benchmark accuracy alone does not indicate that these methods will work when deployed in the real world: all four stakeholder groups (farmers, agronomists, regulatory authorities, and extension agents) must be able to understand and interpret the output of automated disease recognition in a way that enhances human expertise rather than replacing it. In this chapter, we provide a compendium of deep learning architectures and Explainable Artificial Intelligence (XAI) methods for plant disease detection, covering convolutional networks, residual architectures, dense architectures, transformer networks, and hybrid models. We also systematically evaluate post-hoc and intrinsic explainability methods and assess their applicability across the imaging modalities used in agriculture, including RGB, multispectral, and hyperspectral imaging. The chapter characterizes major benchmark datasets; discusses major deployment challenges, including class imbalance, domain shift, model size reduction, and human–AI trust calibration; and ends with potential research directions such as foundation models, causal explanations, federated learning, and continual learning for resilience to an evolving pathogen landscape.

Keywords: Explainable Artificial Intelligence; Plant Disease Detection; LIME; Grad-CAM; SHAP; Federated Learning; Precision Agriculture; Hyperspectral Imaging.

INTRODUCTION

Agriculture is the backbone of human civilization. It directly provides the livelihood of approximately 570 million farm households around the globe and indirectly supports the food security of a global population projected to reach over 9.7 billion by 2050. Among the threats to agricultural production, plant diseases (caused by a variety of pathogens, including fungi, bacteria, viruses, and oomycetes, as well as by abiotic stressors) are among the most widespread and economically damaging. According to the Food and Agriculture Organization of the United Nations, crop diseases, pests, and weeds account for the loss of 20% to 40% of worldwide agricultural crop production each year, resulting in an estimated $220 billion in economic losses annually. Smallholder farmers in low- and middle-income countries bear these consequences disproportionately; a single failed harvest can threaten their food security and long-term financial viability [1].

Correctly identifying disease pathogens is essential to effectively manage diseases because the method of

treatment depends on whether the pathogen is fungal, bacterial, or viral. Incorrectly identifying a pathogen leads

to misapplication of chemicals, waste of costly inputs, rapid progression of resistance and delays in effective

intervention. Currently, traditional methodologies for diagnosing plant diseases are limited to the expert visual

assessment of a trained plant pathologist or agricultural extension officer. However, these experts are not evenly distributed throughout the world. They are concentrated in research institutions and government agencies in high-income areas, with very limited availability in sub-Saharan Africa, South Asia, and other areas where smallholder farmers grow crops. In some parts of sub-Saharan Africa there are more than 3,000 farm households per extension officer, which makes expert assessment at that scale impossible.

The rise of inexpensive smartphones with high-quality cameras and good mobile connectivity has made it possible to give all farmers tools for diagnosing the health of their crops. Two developments underpin this: the infrastructure itself (high-resolution imaging and mobile connectivity) and progress in artificial intelligence (AI), where deep learning can now identify plant diseases from smartphone images. Deep learning is a branch of machine learning, itself a subfield of AI, that uses neural networks with multiple layers, trained end-to-end on very large datasets, to learn hierarchical features. In image recognition, deep learning has produced revolutionary results. Deep convolutional neural networks (CNNs) achieve classification accuracy above 97% on the PlantVillage benchmark, which indicates that deep learning systems are comparable to expert human diagnosticians under controlled conditions [2]. The success of AI for crop protection has also attracted significant investment from commercial and government entities in AI-based decision support systems, e.g., Plantix, Plant.id, PEAT, and various government-funded early-warning systems in Indonesia, Kenya, India, and the Netherlands.

A gap remains between how well these systems work under ideal conditions and how well they will work on the ground, and that gap is interpretability. A farming operation cannot rely on a system that produces a disease label with no indication of the visual or spectral evidence behind the prediction. For example, an agronomist deciding whether to apply a systemic fungicide (an expensive intervention) wants to be sure that the AI system has identified real signs of disease rather than artifacts (i.e., patterns in the image that do not correspond to anything real on the plant). Additionally, smallholder farmers may need a way of linking the location of an abnormality indicated by the AI with what they see on the leaf. A growing number of regulatory agencies, in the EU and elsewhere, require that systems used to support important decisions be transparent about how their algorithms operate [8]. The field of Explainable Artificial Intelligence (XAI) has emerged in response to this gap, equipping predictive systems with explanations of their outputs.

This chapter provides a thorough and technically grounded treatment of deep learning techniques for identifying plant diseases, with a principal focus on the interpretability of such systems. Section 2 presents the evolution of deep learning architectures, from convolutional neural networks (CNNs) through residual networks and DenseNet to more efficient models and vision transformers, and discusses each architecture's accuracy and its compatibility with XAI. Section 3 systematically surveys the categories of XAI technique applicable to deep learning models trained for plant disease identification: gradient-based saliency mapping; perturbation-based techniques; game-theoretic attribution methods; attention mechanisms; and relevance propagation techniques based on backpropagation. Section 4 describes the primary benchmark datasets, comprising laboratory-controlled, field-collected, and hyperspectral imaging datasets. Section 5 discusses the challenges of deploying a plant disease detection model, including class imbalance, domain shift, edge computing restrictions, and the interaction between humans and AI. Section 6 identifies emerging research directions, including foundation models, causal XAI, federated learning, and continual learning in response to a changing pathogen landscape. Section 7 discusses our findings and presents research recommendations.

Deep Learning Architectures for Plant Disease Detection

Foundational Convolutional Neural Networks

The convolutional neural network (CNN) has been the preeminent neural network approach to classifying plant disease images. Three principal properties of the CNN (local receptive fields, weights shared across spatial positions, and hierarchical feature construction) make it well suited to visual recognition tasks.

A convolutional network uses small convolutional windows (usually 3x3 or 5x5 pixels) in its initial layers to extract local information such as edges, colour changes, and textures. Subsequent layers combine these basic components into higher-level feature representations such as venation, lesion borders, or spore patterns. Max-pooling introduces a limited amount of translation invariance while reducing the spatial dimensions of the feature maps and enlarging the receptive field of later layers. This reduction in dimensionality is what allows the fully connected layers to operate on a compact global feature vector.

AlexNet won the ImageNet Large Scale Visual Recognition Challenge in 2012 and demonstrated how dramatically a deep CNN with GPU-accelerated training, dropout to prevent overfitting, and data augmentation can change the outcome of a visual recognition task. AlexNet and other relatively shallow CNNs have nevertheless struggled to differentiate closely related plant diseases, such as Septoria leaf spot versus early blight on tomato or angular leaf spot versus bacterial blight on soybean, because their feature hierarchies are not deep enough to make these distinctions. The VGG architecture family, built from stacks of 3x3 convolutional layers with 16 to 19 layers in total, increased representational capacity and was widely used as a baseline for plant disease classification from 2015 to 2017, achieving approximately 91% to 93% accuracy with transfer learning from ImageNet [2].
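The convolution and max-pooling operations described above can be illustrated with a minimal pure-Python sketch. The function names, the toy edge kernel, and the tiny feature maps are illustrative assumptions and are not taken from any of the cited architectures; real networks learn their kernels and operate on multi-channel tensors.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    Both arguments are lists of lists of numbers. As in most deep learning
    libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in cols
        ]
        for r in rows
    ]


# Hypothetical 3x3 kernel that responds to a dark-to-bright vertical edge.
VERTICAL_EDGE = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

# Toy image: dark (0) on the left, bright (1) on the right.
edge_image = [[0, 0, 1, 1] for _ in range(3)]
edge_response = conv2d_valid(edge_image, VERTICAL_EDGE)

# The same activation at two positions inside one 2x2 pooling window.
activation_a = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
activation_b = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

In this toy case `activation_a` and `activation_b` pool to the same 2x2 output, which is the limited translation invariance referred to in the text.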

Residual Networks and Dense Connectivity

The success of residual networks (ResNets) is due in large part to their solution to the vanishing gradient problem, in which backpropagated error signals attenuate exponentially across many layers. This problem had limited the practical depth of CNNs to roughly 20 layers until He et al. [3] introduced residual architectures, which add an identity shortcut connection to a typical convolutional block. The shortcut allows the error signal to reach lower-level blocks unattenuated, making much deeper architectures trainable. ResNets have achieved around 94% to 96% accuracy on the PlantVillage benchmark and have become the most widely cited convolutional architecture in the agricultural deep learning literature. Wide ResNets, a modification in which channel width is increased while depth changes minimally, offer a further improvement in feature representation.

Huang et al. [4] developed DenseNet, which extends the residual idea by connecting every layer in a dense block to all subsequent layers, using feature-map concatenation rather than addition. The result is a dense flow of information through the network that promotes feature reuse, reduces the number of redundant parameters, and provides direct gradient pathways that benefit backpropagation-based methods for explaining what the model has learned. DenseNet-121 has performed well both on the laboratory-controlled PlantVillage dataset and on PlantDoc, the more difficult real-world dataset, because it integrates low-level features (such as lesion colour and chlorosis patterns) with high-level features (such as lesion shape and spatial distribution) into one unified feature representation. Furthermore, the extensive feature concatenation in DenseNet makes attribution maps derived from Layer-wise Relevance Propagation (LRP) and similar methods more comprehensive and pathologically interpretable than those derived from earlier networks.
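The contrast between additive shortcuts and dense concatenation can be made concrete with a toy sketch. It assumes scalar linear layers of the form f(x) = scale * x, for which the gradient through a stack is a simple product, and a hypothetical growth rate for the dense block; neither reflects the actual ResNet or DenseNet layer definitions.

```python
def plain_stack_gradient(scale, depth):
    """d(output)/d(input) for `depth` stacked layers f(x) = scale * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= scale          # chain rule: each layer multiplies by `scale`
    return grad


def residual_stack_gradient(scale, depth):
    """Same stack, but each block computes x + scale * x (identity shortcut)."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + scale    # the shortcut contributes the constant 1
    return grad


def dense_block_channels(in_channels, num_layers, growth_rate):
    """Channel counts in a dense block where outputs are concatenated.

    Layer k receives the block input plus the `growth_rate` new channels
    produced by each of the k earlier layers.
    """
    inputs_per_layer = [in_channels + k * growth_rate for k in range(num_layers)]
    output_channels = in_channels + num_layers * growth_rate
    return inputs_per_layer, output_channels
```

With `scale = 0.5` and `depth = 20`, the plain stack's gradient shrinks below one millionth while the residual stack's gradient stays above 1, which is the sense in which the shortcut prevents the signal from vanishing in this simplified setting.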

Transfer Learning, Domain Adaptation, and Data Augmentation

The ImageNet dataset, with 1.28 million images in 1,000 object categories, provides CNNs with a visual dictionary of hierarchical primitives (e.g., oriented edges and colour blobs) that transfers to agricultural leaf images through fine-tuning. When a pre-trained CNN backbone is fine-tuned on a plant disease dataset, the features acquired from ImageNet reduce the amount of labelled plant disease data required by multiple orders of magnitude. This is an immense advantage in agricultural applications, where collecting enough expert-annotated images to train a CNN from random initialisation is rarely cost-effective given the expense of each expert-labelled image.

Domain adaptation techniques address the gap between the training domain, typically the PlantVillage dataset, and deployment conditions, which differ in cameras, image characteristics such as lighting, and the stage of leaf development, all of which alter intra-class variation relative to inter-class variation. Adversarial domain adaptation (ADA) trains a domain discriminator alongside the supervised disease classifier and connects the two through gradient reversal, so that the learned features remain discriminative among disease classes while becoming indistinguishable across domains. Overall, ADA supports the development of CNN models that can be trained with minimal datasets, adapt to variable conditions, and remain computationally efficient. Multi-source domain generalisation uses data from different locations and conditions to cover a wider distribution and so support better generalisation.

Augmentation methods expand the apparent range of the training data and improve robustness to geometric and photometric changes. These methods include random cropping, horizontal and vertical flips, colour jitter, random erasing, CutMix, and MixUp.
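Two of the augmentations listed above, horizontal flipping and MixUp, are simple enough to sketch in a few lines. The sketch assumes images stored as nested lists and one-hot label vectors; the function names and the Beta parameter `alpha=0.2` are illustrative choices, not values reported in this chapter.

```python
import random


def horizontal_flip(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two images and their one-hot labels with mixing weight `lam`."""
    mixed_image = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    mixed_label = [
        lam * ya + (1.0 - lam) * yb for ya, yb in zip(label_a, label_b)
    ]
    return mixed_image, mixed_label


def sample_mixing_weight(alpha=0.2, rng=random):
    """Draw lam from Beta(alpha, alpha), the usual choice for MixUp."""
    return rng.betavariate(alpha, alpha)
```

Because the label is blended with the same weight as the pixels, the mixed label remains a valid probability vector, which is what lets MixUp be trained with an ordinary cross-entropy loss.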

EfficientNet: Compound Scaling

Tan and Le introduced EfficientNet [5] in 2019 to address the interdependence of a network's depth, width, and input resolution, which had previously been scaled independently. Scaling one of these dimensions while keeping the others constant quickly yields diminishing returns. A neural architecture search under resource constraints produced the baseline model (B0), and applying a compound scaling coefficient that scales all three dimensions together in a balanced way yielded the EfficientNet family from B1 to B7. EfficientNet-B4 has approximately 19 million parameters, achieves roughly 97% classification accuracy on PlantVillage, and requires fewer floating-point operations than ResNet-101 or DenseNet-201, making it an excellent candidate for deployment on resource-constrained mobile devices without sacrificing classification performance. The modular MBConv block structure allows Grad-CAM and SHAP attributions to be computed with standard tools.
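Compound scaling can be written down directly. The sketch below uses the base coefficients reported for EfficientNet by Tan and Le (1.2 for depth, 1.1 for width, 1.15 for resolution, per unit of the compound coefficient); the function names are illustrative, and the mapping from a given value of the compound coefficient to a particular B1 to B7 variant is deliberately not assumed here.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution bases


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient `phi`."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2
```

Since 1.2 × 1.1² × 1.15² is roughly 1.92, each unit increase in the compound coefficient approximately doubles the computational cost, which is what makes the scaling "balanced" rather than concentrated in one dimension.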

Vision Transformers and Self-Attention Mechanisms

The Vision Transformer (ViT) of Dosovitskiy et al. [6] applies scaled dot-product self-attention, the main operation of the Transformers used in natural language processing, to sequences of non-overlapping image patches, typically 16 x 16 or 32 x 32 pixels in size. Each patch is projected linearly into an embedding space and combined with a positional encoding. A learned classification token collects information from all image patches through stacked layers of multi-head attention and feed-forward networks. Models pre-trained on large datasets such as ImageNet-21K (14 million images) and fine-tuned on plant disease datasets reach accuracies of 95–97% on PlantVillage. They also offer a form of explainability distinct from feature-importance attribution: the attention weights indicate how much influence each patch has on the overall classification decision.
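The two ingredients described above, patch extraction and scaled dot-product self-attention, can be sketched in pure Python. This is a simplified single-head illustration: it assumes a single-channel image stored as nested lists and uses the token vectors directly as queries, keys, and values, whereas a real ViT applies learned linear projections, positional encodings, and multiple heads. All function names are illustrative.

```python
import math


def split_into_patches(image, patch):
    """Cut an H x W image into flattened, non-overlapping patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            patches.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    dim = len(keys[0])
    weights = [
        softmax([
            sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ])
        for q in queries
    ]
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights
```

Each row of the returned weight matrix sums to 1 and states how strongly one patch attends to every other patch; these are the quantities that attention-based explanations visualise. For scale, a 224 x 224 image with 16 x 16 patches yields (224 // 16) ** 2 = 196 patches.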

Swin Transformer [7], an extension of ViT, uses hierarchical windowed attention with shifted windows, allowing efficient modelling of multi-scale features. This is particularly important for plant disease applications, where pathological evidence can be found at scales ranging from individual stomata to the pattern of symptoms across a single leaf. The Swin-Tiny and Swin-Small versions of the Swin

Transformer perform very well on fine-grained agricultural classification tasks, at a computational cost competitive with that of ResNet-50.

Figure 1: Evolution of Deep Learning Architectures for Plant Disease Detection

Hybrid architectures that combine CNN-based feature extraction with transformer-based self-attention are also being studied. For example, a ResNet or EfficientNet backbone can supply spatially localized features, reducing the need of ViTs for very large pretraining datasets, while a transformer-style decoder models long-range contextual dependencies between spatially separated groups of lesions. Such a model could, for example, detect the characteristic distribution of interveinal chlorosis caused by magnesium deficiency across an entire leaflet. Figure 1 is a bar chart with a dashed trend line, in which colour coding groups the architectures by family.

Table 1: Summary of the various deep learning architectures that have been used for the classification of plant diseases.

| Architecture | Year | Depth | Plant Acc. (%) | XAI Compatibility |
|---|---|---|---|---|
| AlexNet | 2012 | | 87.3 | Moderate: limited feature depth |
| VGG-16 | 2014 | | 91.6 | High: uniform block structure |
| ResNet-50 | 2015 | | 94.2 | High: skip connections aid gradients |
| InceptionV3 | 2016 | | 93.7 | Moderate: multi-branch complexity |
| DenseNet-121 | 2017 | 121 | 95.8 | High: dense feature reuse |
| EfficientNet-B4 | 2019 | ~50 | 97.1 | Moderate: compound scaling |
| Vision Transformer (ViT) | 2020 | 12–24 L | 96.4 | Emerging: intrinsic attention |
| Swin Transformer | 2021 | 12–24 L | 97.3 | High: hierarchical attention |

Explainable Artificial Intelligence Methods

The Imperative for Explainability in Agricultural AI

Scholars and regulators are calling for algorithmic transparency in response to the deployment of opaque machine learning systems in areas such as medical diagnosis, credit assessment, and criminal justice. Agriculture faces similar issues, and they are just as serious. Misidentifying a disease can cause pesticides to be applied incorrectly, wasting valuable resources, promoting pesticide resistance, and delaying the ability to treat an issue

in a timely manner [8]. In organic or integrated pest management systems, a misidentified disease could also result in irreversible damage to a farmer's field within a single growing season. Farmers who follow AI-generated recommendations without knowing the evidence behind them are doubly disadvantaged: they suffer poor agricultural production, and they lose confidence in technologies that could help them.

Explanation mechanisms for AI models serve several audiences with different needs. Practitioners in agriculture and disease diagnostics need faithfulness (the explanation must show how the model actually made its decision) and precision (the highlighted features must correspond to diagnosable symptoms rather than arbitrary background items). Data scientists and model developers need diagnostic utility (the ability to inspect feature attributions across the test set to identify dataset biases, spurious correlations, or failures to generalise). Regulatory agencies need auditability (a traceable record of which features influenced a prediction, which also facilitates subsequent review). Smallholder farmers without formal education need visual accessibility (heat-map overlays that relate directly to what they see on the leaf, without requiring a technical background).

XAI methods can be classified along two main axes. The first is their relation to the training timeline: post-hoc methods are applied after training, to the model in its fixed form, whereas intrinsic methods build explainability into the design of the model. The second is their relation to the model architecture: model-specific methods inspect the internal representations of the model, such as gradients or attention weights, whereas model-agnostic methods treat the classifier as a black box and examine it by altering the input. In agricultural and horticultural plant disease studies, model-specific methods are the most commonly used, because they provide accurate results together with visual representations that help in evaluating those results.

Gradient-weighted Class Activation Mapping (Grad-CAM) and Variants

Grad-CAM was introduced by Selvaraju and colleagues in 2017. The method generates a spatial heatmap showing which regions of the input image influence the class predicted by the classifier. Importance is calculated by first taking the gradient of the class score with respect to the activation maps of the final convolutional layer [9]. These gradients are global-average-pooled across the spatial dimensions to give one importance weight per channel. The weights are then used to form a weighted linear combination of the final-layer activation maps. The resulting map is passed through a ReLU nonlinearity (retaining only positive contributions) and upsampled by bilinear interpolation to the input size, producing an overlay in which the intensity of each region reflects its contribution to the classification decision [23]. The method builds on earlier work on Class Activation Mapping, which achieved discriminative localisation in networks with global average pooling layers.

When the model is accurately trained, Grad-CAM can identify the locations of lesions on plants. For example, it can correctly indicate necrotic spots, chlorotic halos, sporulating pustules, and water-soaked areas, provided those traits are what biologically distinguishes the classes. Conversely, when Grad-CAM does not highlight the appropriate lesions, the deviation may indicate a failure in training. For example, if Grad-CAM highlights leaf margins rather than laminar lesions, the model is likely relying on a spurious correlation between disease type and leaf morphology that arose from how the dataset was collected. To address limitations of the original method, Grad-CAM++ added second-order gradients to produce explanations that are smoother and more precisely localised on the lesions, which is especially effective when the image contains multiple lesions or lesions of varying size. Score-CAM removes the gradient from the importance calculation altogether: it determines the importance of each channel from the change in the model's output when the corresponding activation map is used to mask the input, thereby producing higher-quality results under gradient saturation.
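The core Grad-CAM computation described above (global-average-pool the gradients, weight the activation maps, apply ReLU) can be sketched without a deep learning framework. The sketch assumes that the final-layer activation maps and the gradients of the class score with respect to them have already been extracted from a model, which is the framework-specific part; the function name is illustrative, the normalisation to [0, 1] is a common display convention rather than part of the method as described here, and the bilinear upsampling step is omitted.

```python
def grad_cam(activations, gradients):
    """Compute a coarse Grad-CAM map.

    activations, gradients: lists of K channels, each an H x W list of lists.
    Returns an H x W map scaled to [0, 1] (all zeros if nothing is positive).
    """
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: spatial mean of each channel's gradient.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]

    # Weighted sum of activation maps followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * act[r][c] for w, act in zip(weights, activations)))
            for c in range(width)
        ]
        for r in range(height)
    ]

    # Scale to [0, 1] for display as a heatmap overlay.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam
```

A channel whose gradient is negative on average lowers the class score, so its activations are suppressed by the ReLU; only regions that push the prediction towards the class remain in the heatmap.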

Local Interpretable Model-agnostic Explanations (LIME)

LIME [11] approximates the behaviour of a complex model around an individual prediction by fitting an interpretable surrogate model in a neighbourhood of the input. For image classification, LIME partitions the input image into contiguous superpixels using a segmentation algorithm such as SLIC (Simple Linear Iterative Clustering) [10] or Felzenszwalb's graph-based method. It then creates a neighbourhood of perturbed images by randomly masking out subsets of the superpixels, replacing them with either a background colour or the mean pixel value of the image, and queries the target model for the predicted class probabilities of every perturbed sample. The perturbed samples are weighted according to their distance from the original image, a sparse linear model (typically lasso regression) is fitted to the weighted samples by least squares, and the coefficients of the linear model represent the contribution of each superpixel to the predicted class. Because LIME is model-agnostic, it is exceptionally general: it applies equally well to CNN classifiers, random forests, disease detection services exposed through commercial APIs, and ensemble models that do not produce gradients. When LIME is applied to a correctly functioning model of foliar disease, its superpixel attributions are consistent with the visible boundaries of the lesions, and so serve as a validation mechanism for users who are not experts in plant pathology.

The main limitations of the method are instability due to stochastic sampling, which causes the explanation for the same input to differ from one run to the next, and sensitivity to superpixel resolution: varying the granularity of the superpixels trades spatial accuracy against the reliability of the attribution. Recent extensions such as BayLIME use Bayesian inference to quantify the uncertainty of an explanation and to establish confidence intervals for the attribution of each superpixel, which adds significant value in critical, large-scale agricultural decision-making.
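The LIME procedure can be sketched end to end under some simplifying assumptions. Segmentation is skipped: the image is represented only by a binary mask over its superpixels, and `predict` is a hypothetical stand-in for "render the masked image and query the classifier". The surrogate is fitted by ridge-regularised weighted least squares rather than the lasso named in the text, and the exponential kernel with `kernel_width=0.5` is an illustrative choice.

```python
import math
import random


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def lime_explain(predict, n_superpixels, n_samples=200,
                 kernel_width=0.5, ridge=1e-6, seed=0):
    """Return one attribution per superpixel for the black-box `predict`."""
    rng = random.Random(seed)
    masks = [[1] * n_superpixels]                      # the unperturbed image
    masks += [[rng.randint(0, 1) for _ in range(n_superpixels)]
              for _ in range(n_samples)]

    rows, targets, weights = [], [], []
    for mask in masks:
        distance = 1.0 - sum(mask) / n_superpixels     # fraction hidden
        weights.append(math.exp(-(distance / kernel_width) ** 2))
        rows.append([1.0] + [float(v) for v in mask])  # leading 1 = intercept
        targets.append(predict(mask))

    dim = n_superpixels + 1
    # Normal equations of weighted ridge regression: (X'WX + ridge*I) b = X'Wy
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             + (ridge if i == j else 0.0)
             for j in range(dim)] for i in range(dim)]
    xtwy = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets))
            for i in range(dim)]
    return solve(xtwx, xtwy)[1:]                       # drop the intercept
```

As a usage example, a toy model such as `lambda mask: 0.1 + 0.3 * mask[0] + 0.6 * mask[2]` should yield the largest attribution for superpixel 2 and near-zero attributions for superpixels 1 and 3. Changing `seed` changes the sampled neighbourhood, which is the source of the run-to-run instability noted above when the underlying model is not linear.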

SHapley Additive exPlanations (SHAP)

SHAP (SHapley Additive exPlanations) derives feature contribution metrics from cooperative game theory. It quantifies how each feature contributes to a prediction by computing the feature's average marginal contribution across every possible feature coalition, known as its Shapley value. The Shapley value has several mathematically provable properties that make it a principled measure of feature contribution: efficiency, where attributions sum to the total prediction; symmetry, where features that contribute identically receive identical attributions; dummy, where irrelevant features receive no attribution; and linearity, where attributions add across multiple models [12]. Because of these axioms, SHAP produces theoretically rigorous attributions in contrast to gradient-based methods, although exact computation of each Shapley value has exponential time complexity in the number of features. DeepSHAP uses the backpropagation rules of DeepLIFT to approximate Shapley values efficiently in deep learning models without enumerating all possible coalitions. GradientSHAP combines gradient computation with averaging over samples drawn from a reference background dataset to estimate expected Shapley values while accounting for uncertainty in the input. In plant disease classification with hyperspectral or multispectral imaging, applying KernelSHAP to the spectral feature vector reveals which wavelength bands carry the most diagnostic information for each pathogen class. Most notably, SHAP analyses find that the most important wavelengths for identifying fungal and bacterial lesions lie in the near-infrared reflectance region of 700–900 nm, because this range is sensitive to cellular water content and to damage from disruption of mesophyll structure; the red-edge range of 680–730 nm, which is associated with chlorophyll degradation, is found to be diagnostically informative for viral and nutritional diseases. This information can be used to design cost-efficient multispectral sensors that target specific crop and pathogen combinations.
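
To make the definition of the Shapley value concrete, the following minimal sketch computes exact Shapley values by enumerating every coalition, which is the exponential-cost procedure that DeepSHAP and KernelSHAP approximate. The function names and the three-band toy value function are hypothetical illustrations, not part of any SHAP library and not taken from the studies cited here.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Weight given to coalitions of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i when it joins coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(coalition):
    """Hypothetical model score given which spectral bands are 'present'.

    Bands 0 and 1 contribute individually and interact; band 2 is irrelevant.
    """
    score = 0.0
    if 0 in coalition:
        score += 2.0
    if 1 in coalition:
        score += 1.0
    if 0 in coalition and 1 in coalition:
        score += 1.0
    return score


phi = shapley_values(toy_value, 3)
print(phi)
```

In this toy case the attributions sum to the difference between the score with all bands and the score with none (efficiency), and the irrelevant band receives zero attribution (dummy).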

Attention-based Intrinsic Explainability

Vision Transformer (ViT) models provide built-in explainability through their multi-head self-attention matrices, which record how much attention each image patch pays to every other patch when its contextual representation is computed. Attention Rollout, proposed by Abnar and Zuidema [13], generates a global attention map by multiplying the attention weight matrices of all transformer layers while accounting for the residual connections. Overlaying this global attention map as a saliency map on the 2D spatial layout of the original image helps users understand how important the various patches are, and how they relate to one another, in the final prediction for the entire image. In plant disease ViT models, attention rollout maps have been used to provide high-quality spatial localisation of lesions caused by pathogen infection, particularly where symptom patterns are diffuse or distributed, as in downy mildew or mosaic virus disease. Such symptoms call for long-range contextual reasoning over patches across the entire leaf surface, so that infected patches can be discriminated on the basis of the long-range relationships between their positions. Beyond the intrinsic explainability of ViT architectures, attention mechanisms embedded in CNN architectures, such as the spatial attention of the Convolutional Block Attention Module (CBAM) [14] and the channel attention of Squeeze-and-Excitation (SE) networks, give CNN models some degree of intrinsic explainability without requiring a transformer architecture. For example, the spatial attention maps learned by plant disease CNNs augmented with CBAM suppress non-informative background areas (e.g. soil, sky, adjacent healthy tissue) while enhancing lesion-related features.

This simultaneous improvement in classification accuracy and explanation quality — a positive-sum rather than

trade-off relationship — makes attention-augmented CNNs particularly attractive for agricultural deployment

contexts where both performance and transparency are critical requirements.
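
The attention rollout computation discussed in this section can be sketched in a few lines. The sketch assumes that the attention heads of each layer have already been averaged into one row-stochastic matrix and that the residual connection is modelled by mixing each matrix equally with the identity; the three-token matrices are made-up toy values, not outputs of a real ViT.

```python
def matmul(a, b):
    """Plain-Python matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layers):
    """Combine per-layer attention into one global token-to-token map.

    layers: head-averaged attention matrices (each row sums to 1),
    ordered from the first transformer layer to the last.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Model the residual connection by mixing attention with the identity.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        # Later layers are multiplied on the left of the running product.
        rollout = matmul(mixed, rollout)
    return rollout


# Toy example: token 0 is the class token, tokens 1 and 2 are image patches.
layer1 = [[0.2, 0.5, 0.3], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
layer2 = [[0.1, 0.7, 0.2], [0.2, 0.6, 0.2], [0.25, 0.25, 0.5]]
rollout = attention_rollout([layer1, layer2])
cls_to_patches = rollout[0][1:]  # how strongly the class token draws on each patch
print(cls_to_patches)
```

Reshaping the class-token row onto the patch grid gives the saliency overlay; in this toy case patch 1 receives more rolled-out attention than patch 2.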

Layer-wise Relevance Propagation (LRP)

LRP [15] decomposes the output of a neural network into per-pixel relevance scores by propagating the classification score backward through the network according to conservation rules that preserve total relevance at every layer. Unlike gradient methods, LRP does not compute derivatives. Instead it applies propagation rules (epsilon-LRP, alpha-beta-LRP, and composite variants) that redistribute the relevance of each neuron to its inputs in proportion to how much each input contributed to that neuron's activation, so that total relevance is conserved; the alpha-beta rule additionally treats positive and negative contributions separately so that they do not cancel out. The result is a pixel-level attribution map with much higher spatial resolution than Grad-CAM, able to highlight specific symptomatic structures such as acervuli (fungal fruiting bodies), the edges of bacterial ooze, and necrotic tissue that forms along veins. LRP-based explanations have been reported to score better than both Grad-CAM [9] and LIME [11] under pixel flipping, which measures the loss of classification accuracy as the most relevant pixels are removed in sequence. LRP-based explanations have also been reported to be more faithful than vanilla gradient-based methods, that is, they reflect the model's actual decision-making process rather than merely producing "visually plausible" heatmaps, and to support greater confidence in predictions on new examples. As agricultural decision support increasingly adopts XAI models, practitioners need to verify the reliability of the chosen explanation approach and how well its explanations correspond to actual model behaviour. Composite LRP rules, in which one propagation rule (epsilon) is applied to the upper convolutional layers and another (alpha-beta) to the lower layers, offer a good trade-off between stability and faithfulness for CNNs used in agriculture [15].
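
As an illustration of how a propagation rule conserves relevance, the sketch below applies the epsilon rule to a single dense layer without bias. It is a simplified, assumption-laden example: the function name, the toy activations and the weights are hypothetical, and a real LRP implementation would chain such steps through every layer of the network.

```python
def lrp_epsilon(activations, weights, relevance_out, eps=1e-6):
    """Redistribute relevance from a dense layer's outputs to its inputs.

    activations: input activations a_i
    weights: weights[i][j] connects input i to output j (no bias term)
    relevance_out: relevance R_j assigned to each output neuron
    """
    n_in, n_out = len(activations), len(relevance_out)
    # Pre-activation of each output neuron.
    z = [sum(activations[i] * weights[i][j] for i in range(n_in))
         for j in range(n_out)]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # The small stabiliser keeps the division well defined when z_j is near 0.
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            # Input i receives relevance in proportion to its share a_i * w_ij of z_j.
            relevance_in[i] += activations[i] * weights[i][j] / denom * relevance_out[j]
    return relevance_in


activations = [1.0, 2.0, 0.5]
weights = [[0.5, -0.2], [0.3, 0.4], [-0.1, 0.6]]
relevance_out = [1.0, 0.5]
relevance_in = lrp_epsilon(activations, weights, relevance_out)
print(relevance_in, sum(relevance_in))
```

Up to the small stabiliser term, the relevance arriving at the inputs sums to the relevance that left the outputs, which is the conservation property described above.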

Counterfactual Explanations and Example-based Methods

Counterfactual explanations address the question: what is the smallest change to this input that would lead the model to predict a different class? For leaf disease classification, the question can be framed as the minimum change needed on this leaf for the model to classify it as healthy. Several methods exist for generating counterfactuals, including DiCE (Diverse Counterfactual Explanations) and approaches that use image prototypes or feature modifications to visualise the model's decision boundaries, which makes this information much easier for a layperson to understand. Counterfactual explanations are especially helpful for describing disease severity thresholds, that is, the point at which the features of a leaf cross the boundary from being classified as healthy to being classified as diseased; this is particularly relevant for deciding the timing of interventions.
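
The idea of a smallest change that flips the prediction can be illustrated with a deliberately simplified stand-in: a linear classifier over two hand-crafted leaf features, for which the closest counterfactual has a closed form. This is not DiCE and not an image-level method; the feature names, weights and bias below are hypothetical values chosen only for illustration.

```python
def score(x, w, b):
    """Linear decision score: positive means 'diseased', negative means 'healthy'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def closest_counterfactual(x, w, b, overshoot=1e-3):
    """Smallest (Euclidean) change to x that moves it just across the boundary."""
    norm_sq = sum(wi * wi for wi in w)
    # Step along the weight vector; the overshoot pushes slightly past the boundary.
    step = (1.0 + overshoot) * score(x, w, b) / norm_sq
    return [xi - step * wi for xi, wi in zip(x, w)]


# Hypothetical features: [lesion area fraction, yellowing index].
w, b = [4.0, 2.0], -1.0
leaf = [0.30, 0.20]
counterfactual = closest_counterfactual(leaf, w, b)
change = [c - x for c, x in zip(counterfactual, leaf)]
print(score(leaf, w, b), score(counterfactual, w, b), change)
```

The printed change vector indicates how much each feature would have to decrease for this toy model to call the leaf healthy, which is the kind of threshold information the text describes.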

Figure 2: End-to-End Explainable AI Pipeline

Figure 2 traces a data point through the stages of the pipeline, from raw leaf imagery through pre-processing and model inference to decision-support output, and gives an example of the use of Explainable Artificial Intelligence (XAI) at each stage.

Table 2: Comparison table for XAI methods for deep learning models to detect and classify diseases in plants.

| XAI Method | Category | Agnostic | Spatial Res. | Plant Disease Application |
| --- | --- | --- | --- | --- |
| Grad-CAM | Gradient / Saliency | | Coarse | Lesion heatmap overlay for diagnosis |
| Grad-CAM++ | Gradient / Saliency | | Coarse–Med. | Multi-lesion localisation on leaves |
| Score-CAM | Perturbation / CAM | | Medium | Gradient-free lesion attribution |
| LIME | Perturbation | Yes | Superpixel | Boundary of infected vs healthy tissue |
| SHAP (Deep/Kernel) | Game-theoretic | Yes | Feature | Spectral band ranking in hyperspectral |
| Attention Rollout | Intrinsic (ViT) | | Fine patch | Early symptom pre-visual detection |
| LRP | Backpropagation | | Pixel-precise | Pixel-level symptom tracing |
| Counterfactual XAI | Example-based | Yes | Image-level | Healthy vs diseased comparison |

Datasets for Plant Disease Deep Learning

PlantVillage: The Foundational Benchmark

Hughes and Salathé assembled the PlantVillage dataset on a single platform and released it publicly in 2015 to serve as a benchmark for plant disease deep learning research. The benchmark includes 54,306 images of healthy and diseased plant leaves from 14 crop species (for example tomato, potato, grape, corn, apple and strawberry), organised into 38 classes covering 26 diseases. The dataset consists of high-quality pictures taken under controlled studio lighting against a grey background, which gives a consistent basis for reproducible benchmarking and allows researchers to compare their results. Because of these controlled conditions, PlantVillage enabled the rapid development of high-performing CNN-based classifiers for plant disease detection and aided the growth of AI research in the plant disease community. However, training models on PlantVillage carries significant limitations in ecological validity. Models built only on PlantVillage exhibited consistent and systematic generalisation failure when deployed in real-world environments and tested on the PlantDoc benchmark, which consists of field-acquired photographs. The PlantVillage training photographs have an unchanging laboratory background, which serves as a very strong visual cue that does not exist in agricultural production fields. Models that had come to rely on this laboratory cue through shortcut learning (as identified through Grad-CAM) therefore failed when deployed under realistic testing conditions [16]. Several augmented versions of PlantVillage have since been developed, including segmented foregrounds, colour-adjusted variations, and simulated field conditions produced by background replacement or compositing, but they still face the same fundamental issue of a significant domain gap.

Field-Acquired and Open-World Datasets

The PlantDoc dataset created by Singh et al. [17] contains 2,569 openly licensed photos collected from the internet, representing at least 27 different diseases across 13 plant species in natural field settings. The dataset has highly variable backgrounds, lighting, leaf orientations and other compositional aspects, making it a good benchmark for assessing real-world generalisation. Many photos in PlantDoc carry bounding box annotations, which allow both classification models and object detection frameworks to be evaluated on the same data. Because most deep learning models require large amounts of data, a dataset of only 2,500 to 3,000 images cannot be used to train deep learning models from scratch, but it offers an excellent opportunity for transfer learning evaluations.

The Cassava Leaf Disease dataset, published as part of a Kaggle competition, contains more than 21,000 field images of cassava leaves from sub-Saharan Africa; cassava is an important food source for over 800 million people globally. As the largest dataset of field-realistic images of a crop that is central to food security, it has become a primary benchmark for evaluating deep learning in agricultural applications. Another major dataset is the 2018 AI Challenger dataset, with approximately 53,744 field photographs of multiple crops and their semantic segmentation labels, which allows both classification and segmentation models to be evaluated and compared. The FGVC8 Plant Pathology competition introduced multi-label classification, in which a single image can carry several disease labels at once, reflecting how a crop commonly exhibits a number of overlapping diseases. Multi-label classification complicates XAI attribution: a separate attribution map must be created and interpreted for each active label, and the potential spatial overlap between the attributions for the different labels must also be measured and accounted for.

Hyperspectral and Multi-Modal Imaging Datasets

Morphological and colour indicators of disease visible in RGB images cannot capture the biochemical indicators of disease that appear in other spectral ranges, such as near-infrared reflectance, chlorophyll fluorescence emission or thermal emission, before symptoms become visually identifiable. Identifying diseases when they are pre-symptomatic, i.e., before visible lesions are present, is an important goal of disease management: during the early stages of infection pathogen populations are smaller, more intervention options (e.g., pesticide applications) are available, and treatment costs are lower. Hyperspectral plant disease datasets acquired over field plots by UAV-mounted pushbroom spectrometers contain hundreds of contiguous spectral bands spanning the 400 to 2500 nm range, and Shapley-based attribution (SHAP) can be applied to them to identify the wavelengths associated with specific biochemical processes of disease. Temperature changes resulting from pathogen damage can be detected through thermal infrared imaging, allowing pre-symptomatic identification of diseases such as root diseases and vascular wilts from the temperature anomalies caused by reduced stomatal conductance [12]. Changes in three-dimensional plant canopy architecture caused by disease and damage can be captured with lidar and structured light scanning technologies. Integrating multiple data modalities, including spectral imaging, thermal imaging and three-dimensional data, using deep learning architectures that combine modality-specific feature extractors with attention mechanisms for cross-modal interaction is a new area of research. For such multi-modal applications, XAI must reveal how much each modality contributes to a prediction, to ensure that the model is not relying inappropriately on a single modality when another modality could supply complementary evidence.

Figure 3: Benchmark Datasets for Plant Disease Deep Learning

Figure 3 comprises two subplots that show the number of datasets and how they were acquired, by acquisition method (lab, field, UAV/hyperspectral).

Table 3: A summary of the benchmark datasets available for plant disease deep learning research.

| Dataset | Images | Classes | Setting | Crop Coverage | Annotation |
| --- | --- | --- | --- | --- | --- |
| PlantVillage | 54,306 | | Laboratory | 14 crops | Disease label |
| PlantDoc | 2,569 | | Field | 13 crops | BBox + label |
| Rice Disease DB | 10,845 | | Field | Rice | Disease label |
| Cassava Leaf DB | 21,367 | | Field | Cassava | Disease label |
| AI Challenger 2018 | 53,744 | | Field | Multi-crop | Segmentation |
| FGVC8-Plant Path. | 31,985 | 6 (multi-label) | Field | Apple, grape… | Multi-label |
| DeepWeeds (ref.) | 15,007 | | Field (UAV) | Weed species | Species label |

Practical Challenges in Real-World Deployment

Class Imbalance and Long-Tail Disease Distributions

Datasets for agricultural diseases are heavily imbalanced across classes, reflecting the current epidemiological situation of those diseases. Powdery mildew, early blight and downy mildew are among the most common diseases and are represented by thousands of images; however, many economically important pathogens that are not widespread or occur at low prevalence, for example newly invasive species, new forms of endemic disease, and quarantine organisms, typically have fewer than 50 verified field images available globally. As a result, when models are trained on such data with a standard cross-entropy loss, optimisation is biased towards the majority classes and the model is insensitive to minority classes, which are precisely the classes that matter most for AI detection in food security and biosecurity. The most common methods of dealing with class imbalance are loss function modification, data resampling, and generative data augmentation. Class-balanced focal loss [18] down-weights easy examples of the majority class and focuses the learning signal on difficult examples in the minority class. Oversampling the minority class with augmentation and undersampling the majority class when creating batches provides an alternative form of class rebalancing. Another promising way of generating additional minority-class training data is to synthesise photorealistic disease images using Generative Adversarial Networks (GANs) and, more recently, diffusion models. However, synthetic samples produced by these models may lack the pathological realism needed for sound XAI attribution analysis. Synthetic training data should therefore be evaluated both for its effect on classification performance and for the plausibility of the resulting explanations.
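
The effect of a class-balanced focal loss can be sketched as follows. The sketch assumes one common formulation, in which class weights are derived from the "effective number" of samples per class and the focal term (1 - p_t)^gamma scales down well-classified examples; whether this matches the exact formulation in [18] is an assumption, and the class counts and probabilities are made-up toy values.

```python
import math


def class_balanced_weights(counts, beta=0.999):
    """Per-class weights from the effective number of samples, normalised
    so that the weights sum to the number of classes."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


def focal_loss(probs, target, weights, gamma=2.0):
    """Focal loss for one example given predicted class probabilities."""
    p_t = probs[target]
    # (1 - p_t) ** gamma shrinks the loss of confidently correct examples.
    return -weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Toy setting: class 0 is a common disease, class 1 a rare quarantine pathogen.
weights = class_balanced_weights([5000, 40])
easy_majority = focal_loss([0.95, 0.05], target=0, weights=weights)
hard_minority = focal_loss([0.70, 0.30], target=1, weights=weights)
print(weights, easy_majority, hard_minority)
```

With these toy numbers the rare class receives the larger weight and the misclassified minority example dominates the loss, which is the behaviour the text attributes to this family of losses. Setting gamma to 0 and all weights to 1 recovers ordinary cross-entropy.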

Domain Shift and Generalisation to Unseen Environments

Domain shift between the conditions under which training data were collected and the field deployment context is the single most important barrier to the scalability of plant disease AI applications. Shifts in illumination between times of day, in sensor characteristics between smartphone models, in strain morphology and severity between geographic regions, in leaf development from seedling to senescence, and in composition between single-leaf close-ups and whole-plant views all alter the marginal distribution of the input data in ways that degrade classifier performance. More troublesome still, domain shift can corrupt XAI attributions without producing any obvious degradation of classification accuracy on the training distribution: a model can exhibit high average accuracy while relying on spurious correlations that yield misleading explanations in different environments. Domain generalisation evaluation, in which models are evaluated on test data collected from geographic regions, growing seasons or sensor types that were completely absent from training, has therefore become the proper evaluation framework for agricultural AI systems. A model that performs well on standard PlantVillage [16] test splits but poorly on PlantDoc [17], Cassava or cross-geographic field collections is not considered ready to deploy. Beyond performance metrics, XAI attribution stability analysis provides an additional diagnostic: it compares the focus of Grad-CAM [9] and/or SHAP [12] attributions across domains to check that the model continues to attend to pathologically relevant features rather than background features.
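
One simple way to quantify attribution stability is to measure what fraction of the positive attribution mass falls inside an annotated lesion region and to compare that fraction between domains. The metric name `lesion_focus`, the 3x3 heatmaps and the mask below are hypothetical illustrations; the text does not prescribe a specific stability measure.

```python
def lesion_focus(attribution, lesion_mask):
    """Fraction of positive attribution mass that lies inside the lesion mask."""
    total = 0.0
    inside = 0.0
    for attr_row, mask_row in zip(attribution, lesion_mask):
        for value, in_lesion in zip(attr_row, mask_row):
            value = max(value, 0.0)  # ignore negative evidence
            total += value
            if in_lesion:
                inside += value
    return inside / total if total > 0 else 0.0


# Toy lesion annotation (1 = lesion pixel) shared by both example images.
mask = [[0, 0, 0],
        [0, 1, 1],
        [0, 1, 1]]
# Toy heatmap for a laboratory image: attribution concentrated on the lesion.
lab_heatmap = [[0.0, 0.1, 0.0],
               [0.1, 0.9, 0.8],
               [0.0, 0.7, 0.6]]
# Toy heatmap for a field image: attribution drifts to the background.
field_heatmap = [[0.8, 0.7, 0.1],
                 [0.6, 0.2, 0.1],
                 [0.1, 0.1, 0.1]]

lab_focus = lesion_focus(lab_heatmap, mask)
field_focus = lesion_focus(field_heatmap, mask)
print(lab_focus, field_focus)
```

A large drop in this fraction between domains, as in the toy example, would flag that the model has shifted its focus from lesions to background even if its accuracy has not yet visibly changed.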

On-Device Inference and Model Compression

To operate plant disease AI in smallholder farming environments where Internet access is limited or sporadic, models must run on consumer smartphones with no reliance on the internet. This requires on-device inference libraries, and it is constrained by model size: a full-precision ResNet-50 [3] model has more than 25 million parameters (about 98 MB of storage), which can exceed the memory available on low-end Android devices and gives slow inference in the field without GPU acceleration. Model compression methods (such as knowledge distillation, structured filter pruning, and post-training quantisation to 8-bit or 4-bit integer arithmetic) can reduce both the size and the inference latency of these models so that they are suitable for mobile neural processing units. Optimised inference engines such as TensorFlow Lite and ONNX Runtime for Android/iOS exploit hardware-specific acceleration (ARM NEON SIMD, GPU compute shaders, and vendor-specific AI accelerators such as Qualcomm’s Hexagon NPU). However, model compression usually brings some loss of accuracy, and it can also degrade XAI attribution quality. Specifically, integer quantisation introduces rounding error into the gradient computation, leading to poor Grad-CAM visualisations; structured pruning may eliminate convolutional channels that contain pathologically relevant features (even if their average contribution to accuracy is small); and knowledge distillation does not reliably transfer the fidelity of the teacher model’s explanations to the student model. Research into XAI-aware compression techniques that explicitly preserve the quality of attribution maps while maintaining prediction accuracy is a developing area with immediate practical relevance to agricultural use.
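
The storage arithmetic and the rounding error introduced by 8-bit quantisation can be illustrated with a short sketch. It assumes a parameter count of 25.6 million for ResNet-50 and a simple per-tensor affine quantisation scheme; the helper names and the randomly generated weights are hypothetical and do not reproduce the behaviour of TensorFlow Lite or ONNX Runtime.

```python
import random


def model_size_mib(n_params, bits_per_param):
    """Storage needed for the parameters alone, in MiB."""
    return n_params * bits_per_param / 8 / 2 ** 20


def quantize_int8(values):
    """Map floats onto the integer range [-128, 127] with one scale per tensor."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for a constant tensor
    q = [round((v - lo) / scale) - 128 for v in values]
    return q, scale, lo


def dequantize(q, scale, lo):
    """Recover approximate float values from the integer codes."""
    return [lo + (qi + 128) * scale for qi in q]


# Size of an assumed 25.6M-parameter model at 32-bit and 8-bit precision.
fp32_size = model_size_mib(25_600_000, 32)
int8_size = model_size_mib(25_600_000, 8)

# Round-trip error on toy weights drawn from a fixed-seed generator.
rng = random.Random(0)
toy_weights = [rng.gauss(0.0, 0.05) for _ in range(1000)]
q, scale, lo = quantize_int8(toy_weights)
restored = dequantize(q, scale, lo)
max_error = max(abs(a - b) for a, b in zip(toy_weights, restored))
print(fp32_size, int8_size, scale, max_error)
```

The round-trip error is bounded by half the quantisation step, and it is this kind of small perturbation, accumulated through the layers, that can distort gradient-based attribution maps after quantisation.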

Figure 4: Multi-dimensional Comparison of XAI Methods

Figure 4 is a radar plot with five axes that details the strengths and weaknesses of several types of data-agnostic and data-sensitive explainability methods (e.g., LIME, SHAP, LRP).

Table 4. Effect of various model compression strategies on XAI in relation to their application for edge deployment.

| Strategy | Target Platform | Technique | XAI Impact |
| --- | --- | --- | --- |
| Quantisation | Mobile CPU/NPU | INT8 / INT4 post-training quantisation | Gradient noise may distort CAM maps |
| Pruning | Edge devices | Structured channel pruning + fine-tuning | May remove channels critical to attribution |
| Knowledge Distillation | Smartphone | Teacher–student training | Student explanations may differ from teacher |
| TFLite / ONNX RT | Android / iOS | Format conversion + hardware delegation | XAI must be revalidated post-conversion |

Human–AI Interaction and Trust Calibration

The effectiveness of AI disease detection and diagnostic systems in improving agriculture depends more on whether their explanations and outputs lead to better agricultural decision-making than on accuracy measured in the laboratory. Human-computer interaction research has established that the type of explanation presented alongside a system's output has a variable, non-monotonic relationship with user trust. When users receive accurate explanations that are relevant to the task at hand, they develop appropriately calibrated trust in the AI system and make proper use of its recommendations. Conversely, when users receive explanations that are visually appealing but inaccurate (i.e., technically unfaithful), they can develop overtrust, which suppresses proper human oversight of the AI system and can lead to outcomes worse than would result with no AI system in use. Similarly, when explanations are overly technical or provide excessive information, users can disengage from the AI system completely. User studies with farmers in Kenya and India using early prototypes for AI-assisted plant disease prediction yielded key interaction design guidelines: (1) a spatial heatmap overlay on the captured picture of the leaf is easier for non-expert users to understand than rankings of abstract feature importance values; (2) descriptions of the identified symptoms in the farmer's local language build trust and motivate farmers to treat their plants; (3) a clear, distinct indication of how confident the AI model is in its prediction reduces farmers' over-reliance on predictions made with low certainty; (4) showing the farmer's image of the crop alongside a reference image of the symptoms of the identified disease gives non-literate farmers a contextual anchor for visually validating the AI's prediction. These findings should inform the design of the human-AI interface (which types of XAI mechanism to provide, how to format them, and how to express uncertainty) at least as much as the underlying methodology used to produce the explanations.

Translating Explanations for Non-Expert End-Users

Outputs of XAI techniques such as saliency maps, SHAP value plots, attention rollout overlays and LIME superpixel attributions are designed for AI researchers and model developers. When these outputs are presented without additional explanatory information to smallholder farmers, rural extension agents, or non-literate end-users, they can cause confusion, create false confidence, or lead users to disengage completely. Connecting machine-generated explanations effectively to the actions and decisions made on a farm remains a major unresolved challenge for the agriculture industry. This section presents concrete strategies for converting technical XAI outputs into meaningful, actionable and trustworthy forms that enable non-expert agricultural stakeholders to make sound decisions.

One foundational strategy for explaining predictions to non-experts is to convert the model's explanation output (for example a Grad-CAM heatmap) into a short natural-language description. Rather than presenting the heatmap on its own, a system can generate a simple sentence about the part of the crop that the heatmap highlights, written in language farmers are familiar with, for example: "The AI has identified yellowing and irregular brown spots on the leaves near the leaf tips as being consistent with Early Blight caused by Alternaria solani." Large vision-language foundation models such as GPT-4V and InstructBLIP can also be used to generate natural-language descriptions of symptoms directly from leaf images. Retrieval-augmented generation (i.e., grounding the model's output in curated phytopathology reference literature) can further improve both the factual accuracy and the agronomic relevance of the generated content.

The second strategy is to create decision-support templates that map XAI attributions to structured, action-oriented advice. Rather than exposing farmers directly to probability scores or attribution values, the application interface can present a tiered output based on a consistent template. The tiers are: (1) a statement identifying the disease and indicating the level of confidence in layman's terms; (2) a brief explanation of the visible evidence that the AI used to reach its result, derived automatically from the XAI attribution map; (3) a specific recommended course of action; and (4) a referral to the local extension agent (or agricultural helpline) when the model's confidence is below a defined threshold. This template approach allows XAI attributions to serve as an internal verification mechanism while only meaningful agronomic information is shown to farmers. The thresholds for escalating to a referral should be set in partnership with agronomists for each crop-pathogen combination and should reflect realistic timelines for managing the disease in question.
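
The four-tier template can be sketched as a small data structure and a function that fills it. Everything in this sketch is a hypothetical illustration: the class and function names, the confidence wording, and the 0.70 and 0.90 thresholds are placeholders that would in practice be set with agronomists for each crop-pathogen combination.

```python
from dataclasses import dataclass


@dataclass
class FarmerAdvice:
    diagnosis: str        # tier 1: disease and confidence in plain words
    evidence: str         # tier 2: visible evidence taken from the attribution map
    action: str           # tier 3: recommended course of action
    refer_to_agent: bool  # tier 4: escalate to an extension agent or helpline


def build_advice(disease, confidence, evidence, action, referral_threshold=0.70):
    """Turn a prediction and its explanation into the tiered farmer-facing output."""
    if confidence >= 0.90:
        wording = "very likely"
    elif confidence >= referral_threshold:
        wording = "likely"
    else:
        wording = "possibly"
    return FarmerAdvice(
        diagnosis=f"This leaf {wording} has {disease}.",
        evidence=f"The AI looked at: {evidence}.",
        action=action,
        # Low-confidence predictions are escalated rather than acted on directly.
        refer_to_agent=confidence < referral_threshold,
    )


uncertain = build_advice(
    disease="early blight",
    confidence=0.55,
    evidence="brown spots with rings near the leaf tips",
    action="Remove the affected leaves and check nearby plants.",
)
confident = build_advice(
    disease="early blight",
    confidence=0.95,
    evidence="brown spots with rings near the leaf tips",
    action="Remove the affected leaves and check nearby plants.",
)
print(uncertain)
print(confident)
```

Keeping the probability score inside the function and exposing only the wording and the referral flag reflects the design goal stated above: attribution values and raw scores stay internal, while the farmer sees agronomic information.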

A third strategy is example-based (analogical) explanation, which uses prototypes and counterfactuals as the basis for explaining the model’s prediction. Instead of explaining the prediction in abstract terms, the interface can show a reference image from a disease atlas, labelled or annotated to indicate how the user’s leaf image exhibits similar symptoms. A side-by-side comparison of the user’s leaf image and the corresponding reference creates an immediate visual similarity that can be interpreted without prior knowledge of neural networks or attribution methods. For non-literate users, an image-to-image comparison is a more approachable anchor than a text description. The reference library can contain images taken in the field and documented by crop, pathogen, and stage of severity, and can grow through federated learning reports from agronomists and extension agents.

Assessing the quality of explanations for non-experts, as opposed to reporting purely technical performance metrics, requires distinct measures to be developed for user-centred evaluation. Aspects that can be assessed include the farmer's comprehension of the explanation; the appropriateness of the suggested action to the actual agronomic conditions in the field at the time; and calibration, i.e., how well the farmer's confidence in the AI system matches the system's own stated uncertainty. User-centred evaluation can be carried out through participatory design, in which farmers, extension officers and agronomists are directly involved in iteratively prototyping their preferred explanation formats, to ensure that the resulting formats are contextually appropriate, functionally useful, and free of unintended assumptions about literacy or access to external resources. Without a grounded evaluation framework, even technically superior XAI systems may provide no benefit to the smallholder farming communities that stand to gain most from intelligent plant disease detection.

Emerging Research Directions

Foundation Models and Vision-Language Models

The emergence of large vision-language foundation models such as CLIP (Contrastive Language-Image Pre-training) [19], Flamingo, InstructBLIP and GPT-4V opens approaches to plant disease detection that were not practical with task-specific classifiers. These models support zero-shot inference: they can identify or classify diseases that were never seen during fine-tuning by drawing on the rich cross-modal semantic representations encoded in their weights. For example, a CLIP-based tomato leaf classifier can match an image against a text prompt such as "a tomato leaf with early blight lesions showing concentric rings and yellow halos" and reach useful accuracy without any task-specific fine-tuning before deployment, which can fast-track deployment for newly emerging diseases that have not yet been catalogued. Moreover, self-supervised pre-training techniques such as SimCLR [20] use contrastive learning, in which augmented views of the same leaf are pulled together and other images serve as negative samples, to learn robust visual representations from unlabelled leaf images in low-label agricultural settings. Vision-language models can also produce natural-language descriptions of the symptoms observed on a leaf, offering a linguistic form of output alongside the class label. Prompt engineering and retrieval-augmented generation (i.e. conditioning model outputs on relevant phytopathology literature) help ground these symptom descriptions in existing knowledge of plant pathology and make them more readable, accurate and usable. Evaluating the factual correctness and agronomic relevance of the generated descriptions remains a significant interdisciplinary challenge that will require collaboration between AI researchers and plant pathologists.
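
The scoring step behind zero-shot classification can be sketched as follows. This is a minimal illustration, assuming the image and each text prompt have already been embedded into a shared space by a vision-language model; the three-dimensional vectors, the prompt wording and the temperature value are hypothetical stand-ins, and no real model is called.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, prompt_embs, temperature=0.01):
    """Score an image embedding against one text embedding per class.

    Returns the best-matching prompt and a softmax distribution over prompts.
    The temperature is an assumed value, not a tuned one.
    """
    labels = list(prompt_embs)
    logits = [cosine(image_emb, prompt_embs[name]) / temperature for name in labels]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


# Toy embeddings standing in for the outputs of a text and an image encoder.
prompt_embs = {
    "a tomato leaf with early blight lesions showing concentric rings": [0.9, 0.1, 0.2],
    "a tomato leaf with late blight": [0.1, 0.9, 0.3],
    "a healthy tomato leaf": [0.2, 0.2, 0.9],
}
image_emb = [0.8, 0.2, 0.3]

label, probs = zero_shot_classify(image_emb, prompt_embs)
print(label)
```

Adding a class for a newly emerging disease only requires adding a new text prompt to the dictionary; no retraining step is involved in this sketch.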

Causal Explainability

Currently available XAI techniques are primarily associational (they attribute a prediction to co-occurring features) rather than causal (attributing it to features that actually determine the output, as understood through the biology of the disease). The distinction between correlation and causation matters in plant pathology; for example, a model may produce a Grad-CAM [9] heatmap that correctly highlights the location of a lesion. However, the model could be relying on a spurious correlation between the lesion area and an image acquisition artifact, so the explanation looks plausible but is uninformative about the causal relationship between those regions and the prediction. Causal XAI methods based on structural causal models, do-calculus interventions, and counterfactual reasoning provide a theoretically principled framework for establishing causal attributions that goes beyond correlation-based approaches [15].

Plant disease applications require explicitly stated assumptions regarding cause and effect relations between

image conditions, leaf appearances, pathogen attributes, and predictive capability of models (the assumptions

will require verification by plant pathologists). One operational route to causal attribution is the twin network architecture, which produces counterfactual predictions by intervening on (manipulating) specific input variables while keeping the others constant. This methodology is more computationally intensive and theoretically more complex than gradient-based approaches, but it can be substantially more reliable when current data differ from the training data or when relationships between variables reflect spurious correlations. For these reasons, causal XAI is a promising route to reliable causal explanations in the demanding use cases associated with agricultural AI.
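
The idea of intervening on one variable while holding the others constant can be illustrated with a deliberately simplified sketch. It is not a twin network or a full structural causal model: the stand-in classifier, its weights and the two feature names are hypothetical, chosen so that the model leans on an acquisition artifact rather than on the lesion.

```python
import math


def toy_model(features):
    """Stand-in disease classifier with hypothetical weights.

    It depends weakly on lesion area and strongly on whether the photo
    was taken against a laboratory background (an acquisition artifact).
    """
    z = 0.5 * features["lesion_area"] + 3.0 * features["lab_background"] - 1.5
    return 1.0 / (1.0 + math.exp(-z))


def intervention_effect(model, features, variable, new_value):
    """Change in model output under do(variable = new_value), all else fixed."""
    counterfactual = dict(features)
    counterfactual[variable] = new_value
    return model(counterfactual) - model(features)


# A diseased leaf photographed in the laboratory (illustrative values).
leaf = {"lesion_area": 0.8, "lab_background": 1.0}

effect_lesion = intervention_effect(toy_model, leaf, "lesion_area", 0.0)
effect_background = intervention_effect(toy_model, leaf, "lab_background", 0.0)

print(f"removing the lesion changes the score by {effect_lesion:+.3f}")
print(f"removing the lab background changes the score by {effect_background:+.3f}")
```

In this toy case, removing the background changes the prediction far more than removing the lesion, which exposes the reliance on the artifact that a plausible-looking heatmap could hide.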

Federated Learning for Decentralised Collaboration

In any single geographic area, images of diseases are scarce, especially for rare diseases, newly emerging pathogens, and under-studied crops; this scarcity is a structural impediment to building good-quality disease detection models. Federated learning [21] addresses this impediment by allowing multiple organizations, researchers and individual users to collaboratively train a model without aggregating the raw image data in one central location. In a federated learning scheme, each participant independently trains a local model on its own private dataset and sends only model updates (gradients or parameters) to a central aggregation server. The server combines the participants' contributions using federated averaging or a more sophisticated aggregation method. Data privacy is better protected because the raw images never leave the facility where they originated; this is particularly important when the data relate to a disease outbreak and are economically sensitive, since premature release of outbreak information could disrupt commodity markets and trigger trade restrictions. Federated learning for plant disease detection faces particular technical challenges because agricultural datasets tend to be highly non-IID (not independent and identically distributed) across clients: disease prevalence, pathogen strains, crop varieties, and imaging conditions vary significantly across geographic regions, producing heterogeneous client-level distributions and causing convergence problems. Personalised federated learning, a more recent variant, lets participants maintain their own personal model components alongside the global model components and thus offers a mechanism for handling the heterogeneity of agricultural datasets. Distributional heterogeneity also affects explanations in the federated setting: Grad-CAM [9] or SHAP [12] results computed on an individual client reflect that client's own data distribution and cannot be mapped directly onto the globally aggregated model. Federated XAI aggregation protocols are therefore needed to produce globally representative Grad-CAM or SHAP values without requiring data centralisation.
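
The aggregation step of federated averaging [21] can be sketched in a few lines. This minimal example assumes each client has already trained locally and reports a flat list of parameters together with its local sample count; the client values and sizes are illustrative, and communication, client sampling and multiple rounds are omitted.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighting each by its local sample count."""
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    global_weights = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            global_weights[i] += w * size / total
    return global_weights


# Three hypothetical clients (e.g. regional labs) with different dataset sizes.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 300, 600]

global_model = federated_average(clients, sizes)
print(global_model)
```

Because the weighting follows dataset size, the client with 600 images dominates the average; under the non-IID conditions described above, this is one reason a single global model can fit smaller regions poorly.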

Continual Learning for Evolving Pathogen Landscapes

Plant pathogens are dynamic systems under constant evolutionary pressure. Mutation, recombination, and horizontal gene transfer continue to produce new strains that may differ in virulence, host range or symptom expression. Climate variability is expanding the geographic distribution of many plant pathogens, creating the risk of new disease/crop combinations as pathogens move into areas where they did not previously exist (so that existing training data are not representative of the new combination). The emergence of Xylella fastidiosa in the olive groves of Europe, and of wheat blast (Magnaporthe oryzae pathotype Triticum) in the wheat-growing regions of South Asia, illustrates the need for rapid, AI-based pathogen identification on a timescale that does not permit retraining from scratch on newly accumulated samples. Continual learning frameworks, also referred to as lifelong or incremental learning, allow new disease classes to be learned incrementally while avoiding catastrophic forgetting of previously learned classes. There are three categories of approaches to continual learning: regularisation-based methods (such as Elastic Weight Consolidation and Synaptic Intelligence), replay-based methods (such as generative replay and exemplar memory), and expansion-based methods (such as progressive neural networks and PackNet). Monitoring models through Explainable

Artificial Intelligence (XAI) methods supports continual learning by checking whether the model's attribution maps for the disease classes it already knows remain consistent as it learns new ones, which can help distinguish genuine catastrophic forgetting from cases where classification accuracy on held-out benchmarks is maintained. EfficientNet [5] and ViT [6] backbones have shown notable potential as base architectures for continual learning owing to their modular and reusable nature.
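
As an illustration of the regularisation-based family, the penalty used by Elastic Weight Consolidation can be sketched as below. The sketch assumes a diagonal Fisher information estimate has already been computed on the previously learned disease classes; the parameter values, Fisher values and the strength `lam` are hypothetical.

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta_old_i) ** 2.

    Parameters with a large Fisher value mattered for the old classes,
    so moving them away from their old values is penalised more heavily.
    """
    return 0.5 * lam * sum(
        f * (p - o) ** 2 for p, o, f in zip(params, old_params, fisher)
    )


def total_loss(new_task_loss, params, old_params, fisher, lam=1.0):
    """Loss on the new disease classes plus the consolidation penalty."""
    return new_task_loss + ewc_penalty(params, old_params, fisher, lam)


old_params = [1.0, -2.0]   # weights after training on the earlier classes
fisher = [10.0, 0.1]       # first weight was important, second was not
params = [1.5, -1.0]       # candidate weights while learning a new class

penalty = ewc_penalty(params, old_params, fisher)
loss = total_loss(0.4, params, old_params, fisher)
print(f"penalty = {penalty:.2f}, total loss = {loss:.2f}")
```

The first weight moves by only 0.5 but contributes most of the penalty, while the second moves by 1.0 at almost no cost; this is how the method leaves unimportant parameters free to adapt to a new pathogen.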

CONCLUSION

Deep learning applied to plant disease detection has reached very high levels of technical performance: reported accuracy now exceeds specialist accuracy in controlled environments, and there are substantial real-world applications in the field as well. The progression from shallow CNNs to residual and dense architectures, and now to vision transformers, has steadily improved classification accuracy on plant images while also enriching the internal feature representations available for attribution analysis. Transfer learning, domain adaptation, and data augmentation have partially closed the gap between laboratory benchmarks and field conditions, while model compression has made it more practical to develop and deploy deep learning applications on consumer-grade mobile devices. The central argument of this chapter, however, is that accuracy is essential to agricultural AI but is not sufficient on its own. Explainability is needed in order to trust agricultural AI, to hold it accountable for its decisions, and to provide the right level of oversight for its responsible use. The XAI landscape offers a wide array of methods, including gradient-based saliency mapping, perturbation-based probing, game-theoretic attribution, intrinsic attention mechanisms, backpropagation-based relevance propagation and counterfactual reasoning. Each method has its own strengths, weaknesses and use cases in agricultural AI for plant disease detection.

There are several urgent areas for research. First, faithfulness evaluation of XAI methods must be standardised as part of the evaluation of plant disease AI systems, so that explanation quality is assessed alongside classification accuracy. Second, user studies with farmers and agronomists must inform the human-centred design of interfaces that present technical explanations, so that they assist users in practical decision making. Third, incorporating causal reasoning into XAI approaches can yield methods with a greater chance of producing consistent and generalizable outputs under the distribution shifts typical of agricultural deployment. Additionally, foundation models, federated learning, and continual learning are converging to produce AI systems that can be powerful, collaborative, privacy-preserving, and adaptive to constantly shifting threats from plant pathogens. Success for explainable plant disease AI should be measured not just by benchmark scores but also by impact on crop yield and income, and by how well it helps farmers around the world meet the challenge of producing enough food sustainably. To have that impact, the field must integrate multiple disciplines (plant pathology, agriculture, computer vision, human-computer interaction and responsible AI) and sustain a commitment to technology that is both effective and genuinely trustworthy.

REFERENCES

1. Mohanty, S. P., Hughes, D. P., & Salathé, M. (2016). Using deep learning for image-based plant disease

detection. Frontiers in Plant Science, 7, Article 1419.

https://doi.org/10.3389/fpls.2016.01419

2. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image

recognition. International Conference on Learning Representations (ICLR).

3. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).

https://doi.org/10.1109/CVPR.2016.90

4. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional

networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp.

4700–4708).

https://doi.org/10.1109/CVPR.2017.243

5. Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks.

In Proceedings of the 36th International Conference on Machine Learning (pp. 6105–6114).

6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2020). An

image is worth 16×16 words: Transformers for image recognition at scale. arXiv.

https://arxiv.org/abs/2010.11929

7. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., et al. (2021). Swin Transformer: Hierarchical vision

transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on

Computer Vision (pp. 10012–10022). https://doi.org/10.1109/ICCV48922.2021.00986

8. Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv.

https://arxiv.org/abs/1702.08608

9. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual

explanations from deep networks via gradient-based localization. In Proceedings of the IEEE

International Conference on Computer Vision (pp. 618–626). https://doi.org/10.1109/ICCV.2017.74

10. Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2012). SLIC superpixels compared

to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence,

34(11), 2274–2282. https://doi.org/10.1109/TPAMI.2012.120

11. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions

of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining (pp. 1135–1144).

https://doi.org/10.1145/2939672.2939778

12. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances

in Neural Information Processing Systems (Vol. 30, pp. 4765–4774).

13. Abnar, S., & Zuidema, W. (2020). Quantifying attention flow in transformers. In Proceedings of the 58th

Annual Meeting of the Association for Computational Linguistics (pp. 4190–4197).

https://doi.org/10.18653/v1/2020.acl-main.385

14. Woo, S., Park, J., Lee, J. Y., & Kweon, I. S. (2018). CBAM: Convolutional block attention module. In

Proceedings of the European Conference on Computer Vision (pp. 3–19).

https://doi.org/10.1007/978-3-030-01234-2_1

15. Samek, W., Montavon, G., Lapuschkin, S., Anders, C. J., & Müller, K. R. (2021). Explaining deep neural

networks and beyond: A review of methods and applications. Proceedings of the IEEE, 109(3), 247–278.

https://doi.org/10.1109/JPROC.2021.3060483

16. Hughes, D. P., & Salathé, M. (2015). An open access repository of images on plant health to enable the

development of mobile disease diagnostics. arXiv. https://arxiv.org/abs/1511.08060

17. Singh, D., Jain, N., Jain, P., Kayal, P., Kumawat, S., & Batra, N. (2020). PlantDoc: A dataset for visual

plant disease detection. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD (pp. 249–253).

https://doi.org/10.1145/3371158.3371196

18. Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In

Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988).

https://doi.org/10.1109/ICCV.2017.324

19. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning

transferable visual models from natural language supervision. In Proceedings of the 38th International

Conference on Machine Learning (pp. 8748–8763).

20. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning

of visual representations. In Proceedings of the 37th International Conference on Machine Learning (pp.

1597–1607).

21. McMahan, B., Moore, E., Ramage, D., Hampson, S., & Agüera y Arcas, B. (2017).

Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International

Conference on Artificial Intelligence and Statistics (pp. 1273–1282).

22. Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of

the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1251–1258).

https://doi.org/10.1109/CVPR.2017.195

23. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for

discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition (pp. 2921–2929).

https://doi.org/10.1109/CVPR.2016.319