
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
accommodate the high-dimensional, overlapping eligibility criteria typical of real-world welfare programs.
Subsequent research shifted toward supervised machine learning, employing classifiers such as logistic
regression, support vector machines, and decision trees to predict individual eligibility based on demographic
features [7]. Ensemble methods, particularly Random Forest, were shown to handle missing data and non-linear
interactions effectively, making them suitable for the heterogeneous citizen databases maintained by government
agencies [3].
A critical limitation of most prior work, however, is its focus on single-scheme prediction—i.e., building a
separate classifier for each welfare program. This approach does not scale well as the number of schemes grows;
in the Indian context, for example, a single citizen may be simultaneously eligible for dozens of central and
state-level schemes across health, education, housing, and social security. Multi-output classification, where a single
model predicts multiple binary targets simultaneously, has been proposed as a more scalable alternative [8]. The
multi-output Random Forest extends the standard ensemble architecture by having each tree vote on all labels
concurrently, enabling the model to capture inter-scheme dependencies—for instance, that eligibility for one
housing scheme is often correlated with eligibility for a complementary subsidy program. Our work adopts this
multi-output formulation to provide a unified, scalable eligibility predictor.
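As an illustration of the multi-output formulation, the following minimal sketch trains a single Random Forest on three binary eligibility targets using scikit-learn. The feature columns, the labelling rules, and the scheme names are hypothetical and exist only to generate demonstration data; they are not the features or schemes used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic citizens: [age, annual_income, household_size] (hypothetical features).
X = np.column_stack([
    rng.integers(18, 90, size=n_samples),
    rng.integers(10_000, 1_000_000, size=n_samples),
    rng.integers(1, 10, size=n_samples),
])

# Hypothetical eligibility rules, used only to create labels for the demo.
housing = (X[:, 1] < 300_000).astype(int)
subsidy = ((X[:, 1] < 300_000) & (X[:, 2] >= 4)).astype(int)  # correlated with housing
pension = (X[:, 0] >= 60).astype(int)
Y = np.column_stack([housing, subsidy, pension])  # shape (n_samples, n_schemes)

# One forest, three binary targets: a 2-D label matrix is passed to fit() directly.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, Y)

# Predict all schemes for one applicant in a single call.
applicant = np.array([[67, 150_000, 5]])
prediction = model.predict(applicant)  # shape (1, n_schemes)
print(dict(zip(["housing", "subsidy", "pension"], prediction[0])))
```

Because every tree is grown against the full label matrix, a single fitted model replaces the per-scheme classifiers described above; adding a scheme means adding a column to Y rather than training and maintaining another model.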
Retrieval-Augmented Generation in Domain-Specific Contexts
LLMs have demonstrated remarkable fluency in natural language generation, but their parametric knowledge is
frozen at training time, which leads to hallucinations when they are queried about rapidly changing or non-public information
[9]. The RAG architecture mitigates this by retrieving relevant documents from a trusted external knowledge
base at inference time and conditioning the LLM’s output on that retrieved context [4]. This approach has been
successfully deployed in several high-stakes domains. In healthcare, RAG-enhanced systems have been used to
generate evidence-based clinical recommendations by grounding responses in peer-reviewed medical literature
[10]. For example, a RAG-based chatbot could retrieve the latest drug interaction guidelines before answering a
query about contraindications, thereby reducing the risk of harmful advice.
In the legal domain, RAG systems have been developed to assist with statute comprehension and case law
retrieval, where factual precision is paramount [11]. Similarly, in public administration, RAG has been proposed
to power citizen-facing chatbots that can answer questions about tax forms, pension schemes, and social benefits
[12]. A common challenge across these implementations is retrieval latency when the knowledge base is
large: exhaustive search over millions of documents is computationally expensive and may introduce
unacceptable delays in real-time interactive systems. To address this, some works have proposed hybrid
retrieval pipelines that first perform a coarse-grained filter using metadata or keyword indexing before applying
dense embedding search [13]. Our framework takes this idea further by using a machine learning classifier—
rather than static metadata—to select the relevant subset of documents, thereby integrating eligibility prediction
directly into the retrieval optimization.
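The classifier-driven filtering idea can be sketched as a two-stage retrieval routine. The function name filtered_retrieve, the toy three-dimensional embeddings, the eligibility scores, and the 0.5 threshold are all illustrative assumptions, not the configuration used in our framework.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_retrieve(query_vec, docs, eligibility, threshold=0.5, top_k=2):
    """Return ids of the top_k documents among schemes the classifier kept.

    docs: list of (doc_id, scheme, embedding) tuples.
    eligibility: scheme -> predicted eligibility probability (stand-in
    for the classifier output; values here are made up).
    """
    # Stage 1: coarse filter driven by classifier scores, not static metadata.
    candidates = [d for d in docs if eligibility.get(d[1], 0.0) >= threshold]
    # Stage 2: dense similarity search over the surviving documents only.
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d[2]), reverse=True)
    return [doc_id for doc_id, _, _ in ranked[:top_k]]


docs = [
    ("housing_faq", "housing", [0.9, 0.1, 0.0]),
    ("housing_forms", "housing", [0.7, 0.3, 0.0]),
    ("pension_faq", "pension", [0.8, 0.2, 0.1]),
    ("health_faq", "health", [0.0, 0.1, 0.9]),
]
eligibility = {"housing": 0.91, "pension": 0.12, "health": 0.64}
query = [1.0, 0.0, 0.0]

result = filtered_retrieve(query, docs, eligibility)
print(result)
```

In this toy data the pension document is semantically close to the query but is never scored, because the predicted eligibility for that scheme falls below the threshold. The similarity computation therefore runs over three documents instead of four, and the same pruning applies at larger scale.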
AI for Digital Governance and the Indian Context
Several recent initiatives have explored AI-powered tools for digital governance in India. The “Common Service
Centers” (CSCs) program has piloted rule-based chatbots for answering queries about identity documents and
subsidy applications [14]. Academic research has also proposed deep learning architectures for predicting
eligibility for schemes like Pradhan Mantri Awas Yojana (housing for all) and Ayushman Bharat (health
insurance) using household survey data [15]. However, these studies typically focus on a single scheme or a
small set of programs, and they often assume clean, pre-normalized input data—an unrealistic assumption given
the lexical variations (e.g., “caste SC” vs. “scheduled caste”) and missing fields common in public records.
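To make the lexical-variation problem concrete, the sketch below maps noisy category strings to a canonical vocabulary using an abbreviation table followed by fuzzy string matching from Python's standard difflib module. The vocabulary, the alias table, the function name normalize_category, and the 0.8 cutoff are assumptions for illustration only.

```python
import difflib

# Hypothetical canonical vocabulary and abbreviation table.
CANONICAL = ["scheduled caste", "scheduled tribe", "other backward class", "general"]
ALIASES = {"sc": "scheduled caste", "st": "scheduled tribe", "obc": "other backward class"}


def normalize_category(raw, cutoff=0.8):
    """Map a raw category string to a canonical token, or None if no match."""
    # Lowercase and collapse whitespace before any comparison.
    token = " ".join(raw.lower().split())
    # Abbreviations are resolved by exact lookup on each word.
    for word in token.split():
        if word in ALIASES:
            return ALIASES[word]
    # Remaining variation (typos, spelling drift) is handled by fuzzy matching.
    matches = difflib.get_close_matches(token, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for raw in ["caste SC", "Schedulled Caste", "  GENERAL ", "xyz"]:
    print(repr(raw), "->", normalize_category(raw))
```

Returning None for unmatched strings keeps unknown values distinguishable from valid categories, so they can be treated as missing fields downstream instead of being forced into the nearest label.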
Our work differs from these prior efforts in three key ways. First, we explicitly address data heterogeneity
through a dedicated fuzzy matching preprocessing module that normalizes categorical tokens before
classification. Second, we adopt a multi-output Random Forest that predicts eligibility across all schemes
simultaneously, enabling efficient scaling to large program portfolios. Third, we integrate the classifier with a
RAG module in a filtering pipeline: the classifier reduces the generative component’s search space from all
schemes to only those where eligibility is above a threshold, thereby improving both the factual accuracy and