Page 413
www.rsisinternaonal.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
Predicting Pest and Disease Occurrence Using Synthetic Data and
Explainable Machine Learning Methods
Priyanka Balley*, Prof. Kanchan K. Doke
Department of Computer Engineering, Bharti Vidyapeeth College of Engineering, Navi Mumbai,
University of Mumbai.
DOI: https://doi.org/10.51583/IJLTEMAS.2026.150500037
Received: 29 April 2026; Accepted: 04 May 2026; Published: 26 May 2026
ABSTRACT
Predicting the occurrence of pests and diseases is an essential problem in agriculture, because such events
strongly affect crop productivity and, in turn, food security. Traditional methods suffer from a lack of datasets
and tend not to incorporate domain knowledge, which leads to suboptimal performance and limited
interpretability. This study addresses these gaps with a systematic machine learning framework that combines
synthetic data generation, robust predictive modeling, and explainability techniques to produce actionable
insights into pest and disease dynamics. Synthetic datasets are first generated from domain-driven logic that
simulates the correlations between critical environmental and biological factors (temperature, humidity,
rainfall, pest lifecycle stage, and soil moisture) and the incidence of pests or diseases. For interpretability,
Local Interpretable Model-agnostic Explanations (LIME) applied to a Random Forest model provide localized,
instance-level insights into feature contributions to individual predictions. As a complement, permutation
importance measures the global relevance of each feature by assessing its effect on model performance.
Together, these techniques give both a fine-grained and a holistic understanding of the model's behavior. The
integrated approach therefore addresses the limitations of traditional methods by improving predictive accuracy
and enhancing interpretability. The findings have important implications for precision agriculture, allowing
stakeholders to act on data-driven strategies for pest and disease management. The framework is reproducible
and adaptable to different agricultural contexts.
Keywords: Pest Prediction, Disease Modeling, Random Forest, LIME Explainability, Permutation Importance,
Sets
INTRODUCTION
Pest and disease outbreaks are major threats to global agricultural productivity, so management strategies must
be accurate and proactive. Traditional methods [1, 2, 3] of predicting pest and disease occurrence depend on
historical datasets, heuristic models, or expert knowledge, and do not generalize well across different
environmental and biological conditions. The lack of interpretability that accompanies contemporary machine
learning further limits the practical usability of these systems for agricultural stakeholders. This work
addresses these challenges with a novel and systematic approach that combines domain-specific synthetic data
generation with robust machine learning and state-of-the-art explainability techniques. Domain-driven,
controlled logic is used to simulate realistic interactions between key variables, including temperature,
humidity, rainfall, soil moisture, and pest lifecycle stage, in the synthetic datasets. These features form the
basis of a predictive model built with a Random Forest Classifier, chosen for its strong performance on
heterogeneous data and its inherent interpretability.
To support transparency and trust, Local Interpretable Model-agnostic Explanations (LIME) and permutation
importance are used. LIME provides fine-grained, instance-level insight into individual predictions, while
permutation importance assesses global feature relevance across samples of the dataset. Together, these methods
identify the key environmental and biological factors that drive pest and disease occurrence, which allows
stakeholders to meaningfully interpret and act upon
model outputs. The framework thus achieves two goals: it enhances both predictive accuracy and
interpretability, opening the door to data-driven, scalable solutions for pest and disease management in
agriculture.
Systematic Literature Review
Machine learning and deep learning are increasingly applied in agricultural studies, enabling substantial
advances in pest detection, crop management, and disease prediction. This section gives a comparative review
of relevant studies to place the present work within the existing body of knowledge. Chithambarathanu and
Jeyakumar [1] conducted an extensive survey of crop pest detection using machine learning and deep learning.
Their study highlighted the utility of convolutional neural networks (CNNs) for pest classification but pointed
out challenges such as computational overhead and scalability. These findings underline the importance of
balancing accuracy and computational efficiency in pest prediction models. Sailaja et al. [2] proposed a spatial
temperature prediction approach using machine learning and GIS. Although their work was oriented toward
meteorological applications, their methodology shows the need for spatial integration in agricultural modeling,
and their results emphasize the significance of spatial variability in pest and disease prediction frameworks.
Saravanan and Bhagavathiappan [3] presented hybrid deep learning models for crop yield prediction. Their work
demonstrated how hybrid approaches improve on the performance of standalone models, which aligns with this
study's focus on robust predictive frameworks. Kuppan and Priya [4] applied ensemble machine learning models
to yield prediction and reported a marked improvement in prediction accuracy, indicating that bagging and
boosting techniques enhance model performance and supporting the use of ensemble models such as the Random
Forest Classifier adopted in this work. Shinde and Ambhaikar [5] proposed a plant disease classification model
using both machine learning and deep learning classifiers. Their high classification accuracy came with low
explainability, which highlights the significance of interpretability tools such as LIME, applied in the current
study. Venkatasaichandrakanth and Iyapparaja [6] reviewed deep learning models for pest detection with an
emphasis on image-based solutions. They noted generalization problems caused by limited datasets, which
supports the use of synthetic datasets in the current study.
Attri et al. [7] reviewed crop management applications of machine learning, noting the potential of these
techniques for real-time decision-making. Their findings align with this study's focus on actionable insights
through explainable predictions. Nithya et al. [8] compared crop detection techniques using machine learning
and deep learning, finding that while deep learning provided higher accuracy, traditional machine learning
methods such as Random Forests were computationally more efficient. This comparison supports the choice of
Random Forest in this study as a balance between accuracy and efficiency. Anand et al. [9] analyzed Karnal bunt
disease prediction models built for the conditions of a specific agricultural region. They focused on regional
customization when modeling pests and diseases, which points to the value of a framework that can be adapted
across multiple geographies. Verma et al. [10] discussed machine learning for the management of urad bean
crops, emphasizing its ability to predict disease and pest incidence patterns. Their study showed the utility of
incorporating environmental factors such as rainfall and temperature, which are key features in the proposed
model. Chacón-Maldonado et al. [11] presented a hybrid deep learning model with explainability for olive fruit
pest forecasting. Their work again emphasized the need to pair prediction accuracy with interpretability, which
is the basis of this study. Nithya et al. [12] proposed an IoT-based crop yield prediction system based on
machine learning. Their system showed how real-time data sources can improve the reliability of predictions,
which complements the synthetic-data approach used in this work.
Mandrapa et al. [13] applied hyperspectral analysis to spider mite detection. Their results demonstrate that
feature selection can increase prediction accuracy, which agrees with the use of permutation importance for
feature analysis in this study. Abdel-salam et al. [14] proposed a hybrid feature selection framework for crop
yield prediction. Their results pointed out the importance of feature optimization in enhancing model
performance and support the relevance of domain-specific feature selection in this research. Ahmed and Yadav
[15] used machine learning and deep learning to predict apple plant diseases. Although they dealt with orchard
crops, their results
showed how hybrid models could effectively handle complex agricultural challenges. In summary, the reviewed
works collectively highlight the need to combine robust predictive models with interpretability and
adaptability. The proposed study addresses the gaps identified in these papers by combining synthetic data
generation, a Random Forest Classifier, and explainability techniques to offer a scalable, interpretable
solution for predicting pests and diseases.
Proposed Model Design Analysis
The proposed model for predicting the occurrence of pests and diseases brings together synthetic data
generation, machine learning, and explainability techniques in a methodically designed system that aims to
overcome the weaknesses of traditional methods. As shown in figure 1, it begins by creating a synthetic dataset
that captures realistic environmental and biological conditions. Domain expertise defines the correlations among
features such as Temperature (T), Humidity (H), Rainfall (R), Soil moisture (M), and Pest lifecycle stage (P),
and these interactions are modeled mathematically. For example, the relationship between temperature and pest
activity is mapped as a logarithmic function to capture nonlinear responses in pest behavior, given via
equation 1,
$$f(T) = k \, \ln(T - T_0), \quad T > T_0 \qquad (1)$$
where $k$ is a sensitivity constant and $T_0$ is the threshold temperature. This keeps the dataset aligned with
observed phenomena, enhancing its realism and predictive value. The classification model is a Random Forest
Classifier (F), an ensemble method in which multiple decision trees ($T_i$) are trained independently and their
outputs are aggregated. The prediction $\hat{y}$ is computed via equation 2,
$$\hat{y} = \arg\max_{k} \sum_{i} I\big(T_i(x) = k\big) \qquad (2)$$
Figure 1. Model Architecture of the Proposed Analysis Process
where $I(\cdot)$ is an indicator function, $x$ is the input vector, and $k$ indexes the classes. This
architecture is well suited to handling complex interactions among mixed data types and is less prone to
overfitting because of its inherent randomness and averaging mechanisms. To optimize the model's performance,
cross-entropy loss is minimized during training, defined via equation 3,
$$\mathcal{L} = -\sum_{i} \left[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\right] \qquad (3)$$
where $y_i$ and $\hat{y}_i$ represent the true and predicted probabilities, respectively. This supports stable
training, especially when the dataset is imbalanced. The proposed process places significant emphasis on
explainability. LIME generates instance-level explanations by perturbing the input space and studying the
model's responses. A locally weighted linear model, given via equation 4, is fitted to explain the prediction
for any instance $x'$,
$$g(z) = \sum_{i=1}^{m} \beta_i z_i \qquad (4)$$
where $z$ is the perturbed instance, $m$ is the number of features, and $\beta_i$ are the weights assigned to
each feature, reflecting their contribution to the prediction. Global feature importance is evaluated with
permutation importance. For a feature $j$, importance is measured as the decrease in accuracy $\Delta A$ when
the values of that feature are randomly permuted, via equation 5,
$$\Delta A = A_{\text{base}} - A_{\text{perm}}(j) \qquad (5)$$
where $A_{\text{base}}$ and $A_{\text{perm}}(j)$ are the baseline and permuted accuracies, respectively. This
quantifies the model's dependence on each feature and gives a comprehensive view of its behavior. To assess
generalizability, the model is evaluated on a validation set. The area under the receiver operating
characteristic (ROC) curve is computed via equation 6,
$$\text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d(\text{FPR}) \qquad (6)$$
where TPR and FPR stand for true positive rate and false positive rate. An AUC close to one indicates strong
discrimination capability. Integrating these components keeps the overall approach well balanced: the Random
Forest Classifier gives robust predictions, LIME provides actionable instance-level explanations, and
permutation importance checks that the model relies on meaningful features. This multi-faceted design not only
improves predictive accuracy but also gives stakeholders confidence by maintaining high interpretability and
contextual relevance.
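The aggregation rule in equation 2 and the AUC integral in equation 6 can be illustrated with a minimal, standard-library sketch. The function names, the vote list, and the ROC points below are hypothetical and chosen only for illustration; they are not taken from the authors' implementation.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Equation 2: return the class label predicted by the most trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns [(label, count)]; ties go to the label seen first
    return counts.most_common(1)[0][0]


def auc_trapezoid(fpr, tpr):
    """Equation 6: integrate TPR over FPR with the trapezoidal rule.

    fpr and tpr are ROC points sorted by increasing FPR.
    """
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        area += width * (tpr[i] + tpr[i - 1]) / 2.0
    return area


# Illustrative inputs (assumed, not from the paper)
votes = [1, 0, 1, 1, 0]                 # outputs of five hypothetical trees
prediction = majority_vote(votes)       # class 1 wins with three votes

roc_fpr = [0.0, 0.0, 0.5, 1.0]
roc_tpr = [0.0, 0.5, 1.0, 1.0]
area = auc_trapezoid(roc_fpr, roc_tpr)  # 0.875 for these points
```

In practice a library routine would normally replace both helpers; the sketch only makes the two formulas concrete.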
Comparative Result Analysis
The experimental setup assesses the proposed model on a synthetic dataset designed to emulate real-world pest
and disease dynamics under various environmental and biological conditions. The dataset contains 1,000 samples
and six variables: temperature, humidity, rainfall, soil moisture, pest lifecycle stage, and a binary target
variable indicating whether a pest or disease occurs.
The synthetic data were generated by sampling each feature from a uniform distribution over a realistic range
and deriving the target variable from domain-driven rules. The dataset was split into a training set (80%) and
a testing set (20%).
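A minimal sketch of this generation procedure follows, using only the Python standard library. The feature ranges, the lifecycle-stage coding, the occurrence rule, and the random seed are all assumptions made for illustration; the paper does not state the actual ranges or rules.

```python
import random


def generate_sample(rng):
    """Draw one synthetic record; every range and threshold here is assumed."""
    temperature = rng.uniform(10.0, 40.0)    # degrees Celsius (assumed range)
    humidity = rng.uniform(20.0, 100.0)      # percent (assumed range)
    rainfall = rng.uniform(0.0, 300.0)       # millimetres (assumed range)
    soil_moisture = rng.uniform(5.0, 50.0)   # percent (assumed range)
    lifecycle_stage = rng.randint(0, 3)      # assumed coding of four stages
    # Hypothetical domain rule: warm and humid conditions produce an occurrence
    occurrence = int(temperature > 25.0 and humidity > 60.0)
    return [temperature, humidity, rainfall, soil_moisture,
            lifecycle_stage, occurrence]


rng = random.Random(42)                      # fixed seed for reproducibility
dataset = [generate_sample(rng) for _ in range(1000)]

# 80/20 split into training and testing rows after shuffling
rng.shuffle(dataset)
split_index = len(dataset) * 8 // 10
train_rows = dataset[:split_index]
test_rows = dataset[split_index:]
```

A deterministic rule such as the one above makes the target perfectly learnable; a real generator would typically add noise so that the classification task is not trivial.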
The performance of the proposed Random Forest Classifier was compared against three existing methods:
Method [3] (Logistic Regression), Method [8] (Support Vector Machine), and Method [12] (Gradient Boosting).
Every model is evaluated on accuracy, precision, recall, F1-score, and AUC.
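For reference, the threshold-based metrics named above can be computed from true and predicted labels as in the following sketch. It assumes the standard binary definitions with class 1 as the positive (occurrence) class; the label lists are made up for illustration and are not the paper's data.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = occurrence)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels (assumed)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```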
Table 1: Model Accuracy Comparison

| Model | Accuracy (%) |
|---|---|
| Proposed Model | 85.4 |
| Method [3] | 78.6 |
| Method [8] | 81.2 |
| Method [12] | 83.1 |
The proposed model outperformed the other methods, achieving the highest accuracy due to its ability to capture
complex interactions between features.
Figure 2. Model’s Accuracy Analysis
Table 2: Precision Comparison

| Model | Precision (%) |
|---|---|
| Proposed Model | 88.3 |
| Method [3] | 75.4 |
| Method [8] | 79.8 |
| Method [12] | 84.6 |
Precision scores highlight the proposed model's strength in minimizing false positives, which is critical for
pest/disease management.
Table 3: Recall Comparison

| Model | Recall (%) |
|---|---|
| Proposed Model | 82.7 |
| Method [3] | |
| Method [8] | |
| Method [12] | |
The recall metric reflects the proposed model's superior ability to identify occurrences accurately.
Figure 3. Model’s Precision Analysis
Table 4: F1-Score Comparison

| Model | F1-Score (%) |
|---|---|
| Proposed Model | 85.4 |
| Method [3] | 73.2 |
| Method [8] | 78.1 |
| Method [12] | 81.9 |
The F1-score, which balances precision and recall, underscores the proposed model's robustness compared to
others.
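As a consistency check on the reported figures, the F1-score of the proposed model can be recomputed from its precision in Table 2 and the recall of 82.7% stated in the Conclusion. This assumes the standard harmonic-mean definition of F1, which the paper does not write out:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 88.3 \times 82.7}{88.3 + 82.7}
    = \frac{14604.82}{171.0}
    \approx 85.4
```

The result agrees with the value reported for the proposed model in Table 4.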
Table 5: Feature Importance Analysis (Permutation Importance)

| Feature | Importance (Proposed Model) | Importance (Method [12]) | Importance (Method [8]) | Importance (Method [3]) |
|---|---|---|---|---|
| Temperature | 0.25 | 0.21 | 0.18 | 0.15 |
| Humidity | 0.22 | 0.19 | 0.16 | 0.14 |
| Rainfall | 0.15 | 0.12 | 0.11 | 0.09 |
| Soil Moisture | 0.05 | 0.04 | 0.03 | 0.02 |
| Pest Lifecycle | 0.05 | 0.03 | 0.02 | 0.01 |
The proposed model shows greater sensitivity to key features such as temperature and humidity, in line with
domain knowledge.
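The permutation importance of equation 5 (baseline accuracy minus accuracy after shuffling one feature) can be sketched as follows. The stand-in model, the two-feature data, and the seed are hypothetical and serve only to show the mechanics; they do not reproduce the values in Table 5.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    correct = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return correct / len(rows)


def permutation_importance(model, rows, labels, feature_index, rng):
    """Equation 5: baseline accuracy minus accuracy with one feature shuffled."""
    base = accuracy(model, rows, labels)
    column = [row[feature_index] for row in rows]
    rng.shuffle(column)  # break the link between this feature and the labels
    permuted = [row[:feature_index] + [value] + row[feature_index + 1:]
                for row, value in zip(rows, column)]
    return base - accuracy(model, permuted, labels)


def toy_model(row):
    """Hypothetical stand-in classifier: occurrence when temperature > 25."""
    return int(row[0] > 25.0)


rng = random.Random(0)
# Two assumed features per row: temperature and humidity
rows = [[rng.uniform(10.0, 40.0), rng.uniform(20.0, 100.0)] for _ in range(200)]
labels = [toy_model(row) for row in rows]

temperature_importance = permutation_importance(toy_model, rows, labels, 0, rng)
humidity_importance = permutation_importance(toy_model, rows, labels, 1, rng)
```

Because the stand-in model ignores humidity, shuffling that column leaves accuracy unchanged and its importance is exactly zero, while shuffling temperature degrades accuracy.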
Figure 4. Model’s Important Feature Analysis
Table 6: Computational Efficiency (Training Time in Seconds)

| Model | Training Time (s) |
|---|---|
| Proposed Model | 1.32 |
| Method [3] | 0.57 |
| Method [8] | 2.15 |
| Method [12] | 3.42 |
As Table 6 shows, the proposed model offers a trade-off between computational efficiency and predictive
performance: its training time is well below that of Method [8] and Method [12], although higher than that of
Method [3].
Results: Across all reported metrics, including accuracy, precision, recall, and F1-score, the proposed Random
Forest Classifier outperforms the existing methods. The permutation importance analysis also shows that the
model relies on meaningful features, which provides interpretive insights to stakeholders. The training time is
somewhat higher than that of Method [3] but much lower than that of the more complex Method [8] and
Method [12]. The method may therefore be practical for real-world use. The evaluation supports the robustness
and scalability of the presented methodology.
Web Interface: Predict Pest
CONCLUSION & FUTURE SCOPES
The proposed method integrates synthetic data generation, a Random Forest classifier, and state-of-the-art
explainability techniques to forecast pest and disease outbreaks with high accuracy and interpretability. The
proposed model achieves an accuracy of 85.4%, compared with 78.6% for Method [3], 81.2% for Method [8], and
83.1% for Method [12]. It yields a precision of 88.3%, a recall of 82.7%, and an F1-score of 85.4%, indicating
balanced performance that limits both false positives and false negatives. This results in robust prediction of
pest and disease occurrence under complex environmental and biological conditions represented through
synthetic, domain-specific data samples. A key advantage of the approach is explainability: permutation
importance shows that temperature has the highest impact (0.25), followed by humidity (0.22), which agrees with
established domain knowledge. LIME instance-level explanations further support the model's predictions by
giving actionable insights into individual cases. In addition, with a training time of 1.32 seconds, the
proposed model balances performance and scalability, training faster than both Method [8] and Method [12]
while achieving better predictive performance. The study does have some limitations. Although the synthetic
dataset is designed to be realistic, it cannot capture the full variability of real-world conditions, which may
affect generalization. Future research should validate the model against diverse real-world datasets to improve
its robustness. Adding temporal and spatial data, such as pest migration patterns or localized weather
phenomena, should also increase predictive power.
Future work also includes ensemble methods that combine multiple classifiers, or the integration of more
advanced techniques such as deep learning to extract hierarchical features. This study lays a foundation for
applying interpretable machine learning in precision agriculture, providing scalable and actionable solutions
to mitigate the impacts of pests and diseases in real time. From this perspective, the methodology has good
prospects for changing pest and disease management practices across different agricultural landscapes.
REFERENCES
1. Chithambarathanu, M., Jeyakumar, M.K. Survey on crop pest detection using deep learning and
machine learning approaches. Multimed Tools Appl 82, 42277–42310 (2023).
https://doi.org/10.1007/s11042-023-15221-3
2. Sailaja, B., Gayatri, S., Rathod, S. et al. Spatial temperature prediction—a machine learning and GIS
perspective. Theor Appl Climatol 155, 9619–9642 (2024).
https://doi.org/10.1007/s00704-024-05167-3
3. Saravanan, K.S., Bhagavathiappan, V. Prediction of crop yield in India using machine learning and
hybrid deep learning models. Acta Geophys. 72, 4613–4632 (2024).
https://doi.org/10.1007/s11600-024-01312-8
4. Kuppan, P., Priya, V.V. Crop Yield Prediction Using Ensemble Machine Learning Techniques. SN
COMPUT. SCI. 5, 1160 (2024). https://doi.org/10.1007/s42979-024-03536-3
5. Shinde, N., Ambhaikar, A. An efficient plant disease prediction model based on machine learning and
deep learning classifiers. Evol. Intel. 18, 14 (2025). https://doi.org/10.1007/s12065-024-01000-y
6. Venkatasaichandrakanth, P., Iyapparaja, M. Review on Pest Detection and Classification in
Agricultural Environments Using Image-Based Deep Learning Models and Its Challenges. Opt. Mem.
Neural Networks 32, 295–309 (2023). https://doi.org/10.3103/S1060992X23040112
7. Attri, I., Awasthi, L.K. & Sharma, T.P. Machine learning in agriculture: a review of crop management
applications. Multimed Tools Appl 83, 12875–12915 (2024).
https://doi.org/10.1007/s11042-023-16105-2
8. Nithya, V., Josephine, M.S. & Jeyabalaraja, V. Comparative approach on crop detection using machine
learning and deep learning techniques. Int J Syst Assur Eng Manag 15, 4636–4648 (2024).
https://doi.org/10.1007/s13198-024-02483-9
9. Anand, S., Sandhu, S.K., Biswas, B. et al. Comparative analysis of different Karnal bunt disease
prediction models developed by machine learning techniques for Punjab conditions. Int J
Biometeorol 68, 1799–1810 (2024). https://doi.org/10.1007/s00484-024-02707-4
10. Verma, R., Kushwaha, K.P.S., Bijlwan, A. et al. Enhancing urad bean (Vigna mungo L.) crop
management with machine learning: Predictive analysis of pod rot severity and pod bug incidence
patterns. Australasian Plant Pathol. 53, 273–283 (2024). https://doi.org/10.1007/s13313-024-00967-7
11. Chacón-Maldonado, A.M., Melgar-García, L., Asencio-Cortés, G. et al. A novel method based on
hybrid deep learning with explainability for olive fruit pest forecasting. Neural Comput &
Applic (2024). https://doi.org/10.1007/s00521-024-10731-z
12. Nithya, V., Josephine, M.S. & Jeyabalaraja, V. IoT-Based Crop Yield Prediction System in Indian Sub-
continent Using Machine Learning Techniques. Remote Sens Earth Syst Sci 6, 156–166 (2023).
https://doi.org/10.1007/s41976-023-00097-6
13. Mandrapa, B., Spohrer, K., Wuttke, D. et al. Machine learning-based hyperspectral wavelength
selection and classification of spider mite-infested cucumber leaves. Exp Appl Acarol 93, 627–644
(2024). https://doi.org/10.1007/s10493-024-00953-0
14. Abdel-salam, M., Kumar, N. & Mahajan, S. A proposed framework for crop yield prediction using
hybrid feature selection approach and optimized machine learning. Neural Comput & Applic 36,
20723–20750 (2024). https://doi.org/10.1007/s00521-024-10226-x
15. Ahmed, I., Yadav, P.K. Predicting Apple Plant Diseases in Orchards Using Machine Learning and
Deep Learning Algorithms. SN COMPUT. SCI. 5, 700 (2024).
https://doi.org/10.1007/s42979-024-02959-2