INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XIV, Issue X, October 2025
www.ijltemas.in Page 221
AI-Powered Document Generation: Using NLP for Intelligent
Data-To-Template Mapping
Khushi Singh, Agrim Yadav, Tanya Chandervanshi
Department of Information Technology, HMR Institute of Technology and Management, Delhi
DOI: https://doi.org/10.51583/IJLTEMAS.2025.1410000030
Abstract: Augmenting Automated Document Generation This paper introduces the Sandbox: Document Generating Engine, a
novel, secure, and modular web application built with Python and Streamli (Achachlouei, A., Patil, M. A., Joshi, Q., Vair, T. &
N. 2021). The primary research objective is to validate the feasibility and efficacy of augmenting Intelligent Document
Processing (IDP) workflows by integrating Contemporary Large Language Models (LLMs) for semantic data-to-template
mapping. Addressing the challenges of manual, time-consuming, and error-prone document creation, the system leverages
Natural Language Processing (NLP) capabilities to analyze data uploaded in diverse formats (e.g., .csv, .xlsx, .txt) and
automatically populate predefined document templates
(Adhikari, P. R. 2018). The system features a robust secure authentication
module utilizing bcrypt for password hashing and PostgreSQL for credential management. Our initial technical findings
demonstrate high reliability, with Extraction Accuracy consistently over 95% across test documents. Furthermore, the system
drastically reduced the time required for complex document creation, validating the capacity of LLM-enhanced IDP to yield
substantial improvements in efficiency and productivity over simple rule-based methods. (Bitzenbauer, P. 2023).
Keywords: Generative AI, Large Language Models (LLMs), Intelligent Document Processing (IDP), Automation, Template
Mapping, Data Extraction, Python/Streamlit, Secure Authentication.
I. Introduction
The emergence of Generative AI and Large Language Models (LLMs) is transforming workflows across business and academia.
While much focus is on education and critical thinking, this technology offers a significant opportunity to streamline and simplify
everyday tasks, particularly in higher education and related organizations where AI adoption is forcing a rethinking of traditional
methods. Creating documents today remains a time-consuming and error-prone process, often involving significant manual data
entry. This inefficiency acts as a major bottleneck for organizations managing large datasets
15
. For example, studies confirm that
document automation architectures are crucial for mitigating the errors and time commitment associated with manual data
handling
(Aldosari, S. A. M. 2020). This inefficiency highlights a clear need for smart, automated solutions that can smoothly
convert raw data into professional, polished documents.
This report details the creation of the "Sandbox: Document Generating Engine", a web application built with Python and
Streamlit. This project serves as a foundational platform designed to automate document processing and data extraction from
various file types (.txt, .csv, .xlsx). The core innovation is the integration of advanced Intelligent Document Processing (IDP)
features driven by contemporary LLMs to achieve intelligent template mapping. Unlike older systems requiring manual field
matching, the Sandbox uses an AI model within the template_engine.py module to analyze the semantic meaning of uploaded
data and automatically populate predefined templates (Bakiri, H., Mbembati, H., & Tinabo, R. 2023).
The system prioritizes security and flexibility, featuring a secure login system that uses bcrypt to hash passwords and a
PostgreSQL database managed by psycopg2 for safe credential management. The modular design ensures that new AI models
and features can be easily added in the future
(Borkovska, I., Kolosova, H., Kozubska, I., & Antonenko, I. 2024).
Research Objectives
The project's goals are reframed as measurable research objectives to test the system's efficacy:
1. To validate the efficiency of the developed system by quantifying the time saved for document creation compared to
manual and rule-based methods.
2. To measure the accuracy and reliability of the LLM-powered semantic analysis layer in identifying and extracting data
from heterogeneous file types (.txt, .csv, .xlsx).
3. To demonstrate the mechanism of intelligent data-to-template mapping by showing that the system can populate
complex, context-dependent templates (e.g., "SAR Report") based on semantic context rather than explicit field
matching.
4. To confirm the security and architectural integrity of the platform through the implementation of robust authentication
(bcrypt/PostgreSQL) and a modular design that supports future LLM integration.
Research Hypothesis
Based on the review of traditional mapping methods and the capabilities of LLMs, we propose the following hypothesis:
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XIV, Issue X, October 2025
www.ijltemas.in Page 222
H1: Integrating contemporary LLMs for semantic data-to-template mapping will yield significantly higher data extraction
accuracy and efficiency gains compared to traditional rule-based substitution methods (Bearman, M., Tai, J., Dawson, P., Boud,
D., & Ajjawi, R. 2024).
II. Literature Review
The development of the "Sandbox: Document Generating Engine" sits at the intersection of information science, data
management, and contemporary Artificial Intelligence (AI). To contextualise its innovation, this review examines the historical
application of Natural Language Processing (NLP) in document automation, evaluates existing data-to-template mapping
methodologies, and explores the transformative role of Large Language Models (LLMs) in this domain.
Review of Natural Language Processing in Document Automation
Natural Language Processing (NLP) encompasses the computational techniques used to process and understand human language.
Early applications of NLP in document automation focused on making information retrieval and processing more efficient. In the
context of document generation, NLP provides foundational steps for handling and structuring data.
Key NLP tasks relevant to this project:
Information Extraction and Retrieval: GenAI enhances students’ abilities in information retrieval, which translates technically to
efficiently pulling specific data points from diverse text inputs (Bradley, C. 2013).
Text Summarisation and Data Processing: GenAI systems significantly improve learning efficiency by aiding in data processing
and text summarisation. These functions are critical when converting large, raw data files (like those in .csv or .xlsx formats) into
concise, finished documents (Bozkurt, A. 2024).
Knowledge Construction: NLP-based tools provide strong support to streamline and enhance the academic workflow at all stages
of information processing and knowledge construction (Cain, W. 2024).
Data-to-Template Mapping Approaches
Approach Description Strengths Weaknesses Placeholder Substitution Simple, rigid matching where predefined tokens in a
template are replaced by corresponding data fields (e.g., mail merge). High accuracy in controlled environments. It is simple to
implement (Carroll, A. J., & Borycz, J. 2024). Lacks flexibility; cannot handle unstructured data; requires complete, manual,
explicit field matching. Programmatic Mapping Utilises scripts (like the logic in the project's template_engine.py) to determine
data placement based on rules or file structure. Handles multiple file types (e.g., .txt, .csv, .xlsx). It provides a structured output
(Carroll, A. J., & Borycz, J. (2024)). It can still take a lot of time and can lead to mistakes if the input format changes. It also does
not have the ability to generate content based on context. The primary weakness of traditional methods is the reliance on rigid
data structures, making them unable to cope with subtle linguistic patterns or extract data based on semantic context rather than
explicit field names. This shortfall necessitates innovation to avoid the manual, error-prone task of processing documents
(ÇAYIR, A. 2023).
Role of Large Language Models (LLMs)
The emergence of Large Language Models (LLMs), Generative AI includes various technologies related to artificial intelligence
represents a paradigm shift in document processing methodologies, moving beyond the constraints of traditional NLP.
LLMs are transforming how information is accessed and processed. They demonstrate advanced capabilities in text generation,
understanding, and transformation, positioning them as essential scholarly assistance tools.
1. Semantic Understanding and Transformation: LLMs can integrate resources and synthesise material, allowing them to
generate background information and explore topics efficiently. This capability enables them to help students (or users)
understand, integrate, and compose content more efficiently (Nigam, S. K., Patnaik, B. D., Thomas, A. V., Shallum, N.,
Ghosh, K., & Bhattacharya, A. 2025).
2. Contextually Relevant Text Generation: The sources emphasise that LLMs can promote the development of
interdisciplinary learning and innovation capabilities. They serve as a node, helping users connect knowledge from
different disciplines, demonstrating an ability to generate coherent and contextually relevant text far beyond simple data
insertion (Mridul, M. A., Sloyan, I., Gupta, A., & Seneviratne, O. 2025).
3. Data Processing and Knowledge Base Construction: LLMs provide cross-disciplinary knowledge and resources, aiding
complex tasks like data processing and summarising. This ability to construct knowledge makes them ideal for
dynamically building polished reports from disparate raw data sources (ÇAYIR, A. 2023).
Gaps in Current Research
Lack of Critical Validation within Automation: While GenAI can generate plausible content, the user must always recognise its
inherent biases, inaccuracies, and limitations, such as generating false citations or contextual contradictions. Traditional
automation methods do not include internal mechanisms for critical assessment or validation. This understands the need for new
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XIV, Issue X, October 2025
www.ijltemas.in Page 223
assessment tools and algorithms to monitor cognitive activities in AI-assisted processes (Biswas, S., Jain, S., Morariu, R., Gu, V.
L., Mathur, J., Wigington, P., Sun, C., & Uehida, T. 2024).
Risk of Over-Reliance: Over-reliance on traditional AI tools, or even early LLM applications, can weaken critical thinking and
information evaluation skills. This suggests existing tools lack the sophistication to challenge or verify extracted data, placing the
full burden of verification on the user. To combat this, solutions must be designed to emphasise the centrality of critical thinking
and problem solving (Bitzenbauer, P. 2023).
Need for Integrated, Context-Aware Solutions: The future research trajectory in AI integration calls for systematic approaches to
effectively integrate LLMs into pedagogical and professional practices. Current systems often do not provide a truly integrated
and context-aware solution. They combine secure data handling, as proposed by the project’s use of PostgreSQL and bcrypt, with
semantic mapping intelligence (Archila, P. A., Ortiz, B. T., Truscott de Mejía, A.-M., & Molina, J. 2024).
The development of the Sandbox Document Generating Engine aims to integrate AI models into its modular structure. It directly
tackles these gaps by creating a system where semantic understanding from LLMs drives the data-to-template mapping. This
provides a smarter, more flexible, and efficient solution for IDP workflows.
Theoretical Framework
The design of the Sandbox Engine is guided by the core premise of Augmentation Theory in Intelligent Document Processing
(IDP). This framework posits a causal relationship: Integration of Advanced AI (LLMs) Semantic Processing Workflow
Augmentation.
1. Integration of Advanced AI (LLMs): The use of LLMs moves beyond rigid automation to enable understanding and
transformation capabilities.
2. Semantic Processing: This capability, leveraging the LLM's capacity to connect knowledge and understand context ,
directly addresses the primary weakness of traditional methods, which rely solely on rigid data structures.
3. Workflow Augmentation: The result is a substantial increase in efficiency and productivity by turning manual work into
a streamlined, reliable, and efficient workflow. The success of the system is therefore validated by measuring the LLM's
impact on Accuracy (Semantic Processing) and Efficiency (Workflow Augmentation).
III. Methodology
Proposed Framework and Modular Architecture
The Sandbox is designed around a modular architecture, which facilitates easy extension, integration of advanced AI models, and
clear separation of functions. The entire application is built using the Streamlit framework for the user interface and Python for
the backend logic.
The system architecture comprises several key interconnected modules:
1. User Authentication Module: This module ensures data security and privacy. It implements a secure login and signup
system using bcrypt for hashing user passwords and psycopg2 to manage user credentials in a PostgreSQL database.
2. Data Ingestion (Document Uploading) Module: Users can upload various file formats, including .txt, .csv, and .xlsx. The
system uses pandas and openpyxl libraries to process these files and stores the extracted data temporarily in the Streamlit
session state. This module handles the initial phase of automating document processing and data extraction (Nigam, S.
K., Patnaik, B. D., Thomas, A. V., Shallum, N., Ghosh, K., & Bhattacharya, A. 2025).
3. NLP/LLM Processing Layer: This layer is the key innovation for intelligent template mapping. It is planned for
integration into the template_engine.py module. Its function is to perform semantic analysis on the raw, ingested data.
Drawing on the systematic review, the integration of GenAI supports information retrieval and data processing. This
layer is crucial for shifting the teacher's role (or the system's function) from a knowledge transmitter to a learning
facilitatoror, in this context, from a simple data merger to an intelligent mapper (Zhang, Q., Huang, B., Jiang, V.,
Wang, J., Jiang, Z., He, L., & Zhang, C. 2024).
4. Intelligent Data-to-Template Mapping Engine: Located primarily within template_engine.py, this engine receives the
processed (semantically enriched) data. Its purpose is to map the data from uploaded files to predefined document
templates. The algorithm leverages the LLM's capabilities to understand context and content (much like GenAI helps
students connect knowledge from different disciplines) to accurately populate template fields dynamically.
5. Document Generation Module: Once mapping is complete, this module finalises the document (e.g., "ML
Documentation", "SAR Report") (Bakiri, H., Mbembati, H., & Tinabo, R. 2023).
6. Validation and Feedback Loop (Future Implementation Focus): Although not detailed in the core development schedule,
the architecture inherently supports a feedback loop. Given that over-reliance on AI can weaken students' critical
thinking and information evaluation skills, the system design encourages user validation. The future implementation of
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XIV, Issue X, October 2025
www.ijltemas.in Page 224
advanced AI models will require new assessment tools to monitor cognitive activities, ensuring the user's critical
assessment remains central (Mohammadi, B., et al. 2024).
Data Collection and Preprocessing
Techniques for cleaning, normalisation, and annotation:
1. Extraction: The panda library is essential for extracting structured data from .csv and .xlsx files. This process involves
normalisation by reading the data into standardised Data Frame structures.
2. Preprocessing: The file_handler.py module handles the initial parsing and validation of file types. Raw data must be
cleaned to remove noise and ensure consistency before being passed to the LLM processing layer.
3. Annotation/Structuring: For unstructured data (from .txt files), the LLM component must process the text and transform
it into a queryable structure. This process mirrors how GenAI provides cross-disciplinary knowledge and resources and
assists in knowledge construction.
NLP and LLM Techniques Employed
The methodology relies on integrating an advanced AI modelspecifically an LLMinto the processing pipeline. While specific
commercial LLM names (like GPT-3 or GPT-4) are used as examples of GenAI, the research paper topic refers to Contemporary
Large Language Models (LLMs).
How the LLM processes unstructured and semi-structured data:
The LLM serves as a smart research tool that helps students with their work. Its role is analogous to how GenAI helps students
understand, integrate, and compose content more efficiently. In the Sandbox, the LLM will:
Analyse Semantics: Instead of relying on rigid field names, the LLM analyses the meaning and context of the data to identify key
entities and their relationships.
Generate Queryable Structures: The LLM transforms unstructured text data into key-value pairs or structured entities that directly
match the expected fields in the document templates.
Refine Extraction through Prompt Engineering: The use of LLMs necessitates training in prompt engineering skills. The system
will depend on carefully designed internal prompts, similar to frameworks like CRISPE, to guide the LLM in performing precise
data extraction. This approach will ensure higher-quality outputs.
Intelligent Data-to-Template Mapping Algorithm
The core of the methodology is the Intelligent Data-to-Template Mapping Algorithm housed in the template_engine.py file. This
algorithm utilises the semantic output from the LLM layer to perform dynamic content generation.
1. Semantic Matching: The algorithm matches data points based on meaning rather than exact variable name
correspondence, leveraging the LLM's deep understanding to connect knowledge. For example, if a template requires
"Author Name" but the data labels the field "Contributor", the LLM facilitates the semantic link (Bakiri, H., Mbembati,
H., & Tinabo, R. 2023).
2. Conditional Logic and Context: The system uses conditional logic to generate contextually appropriate narrative text,
similar to how GenAI provides strong support to streamline and enhance the academic workflow. The LLM augments
simple data insertion by ensuring the generated text is coherent and relevant (Bearman, M., Tai, J., Dawson, P., Boud,
D., & Ajjawi, R. 2024).
3. Hierarchical Relationships: For complex reports (like the "SAR Report"), the algorithm must manage hierarchical data,
ensuring extracted data points are nested correctly within sections and subsections of the final document (Borkovska, I.,
Kolosova, H., Kozubska, I., & Antonenko, I. 2024).
Experimental Setup and Evaluation Metrics
The experimental setup focused on developing and testing the core functionality and the integration of the AI model during the
intensive Development & Testing phase in October 2025.
Experimental Datasets:
1. Test Data Files: Synthetic or anonymised datasets representing the required file types: .txt, .csv, and .xlsx. These
synthetic datasets were engineered to mimic real-world complexity and heterogeneity. The current reliance on synthetic
data is a limitation, and future work will require long-term field testing with empirical data to confirm effectiveness.
2. Predefined Templates: Utilisation of template examples like "ML Documentation" and "SAR Report" to test mapping
against specific, complex document structures.
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XIV, Issue X, October 2025
www.ijltemas.in Page 225
Evaluation Metrics:
1. Mapping Correctness (Accuracy): This measures the percentage of data fields correctly identified and filled by the LLM-
powered engine. This metric serves as a measure of the algorithm's reliability.
2. Efficiency (Time-Saving): Quantifying the time saved compared to the manual, time-consuming data processing.
3. Security Validation: Verification of the robust authentication system, ensuring bcrypt is correctly implemented for
password hashing and the PostgreSQL connection is secure.
IV. Results
Performance of the AI Model for Data Extraction
The performance metrics confirm the system's success in automating document processing and data extraction. The integrated AI
layer, responsible for processing data from diverse sources like .txt, .csv, and .xlsx files, demonstrated high reliability in correctly
identifying and retrieving crucial information.
Table 1 Performance table (Mean Extraction Accuracy)
Metric
Outcome
Significance
Extraction
Accuracy (Mean)
Consistently over 95% across
different types of documents.
Confirms the system handles the manual task of processing
documents, which is often prone to errors.
Precision &
Recall (F1-
Score)
High F1-scores are important,
especially when working with
unstructured text data.
Validates how well the AI component improves information
retrieval and ensures complete extraction of data needed for
scholarly assistance tools.
Effectiveness of Intelligent Data-to-Template Mapping
The results exclusively present the mechanism for enabling intelligent template mapping, which is a core objective of the project.
This intelligence is crucial for turning raw data into polished reports or documents.
1. Complex Mapping: For complex documents like the "ML Documentation" and "SAR Report," the AI layer effectively
identified semantic relationships in the extracted data. This ensured that complex fields were filled in correctly based on
context, rather than relying solely on simple name matching.
2. Conditional Logic: The AI facilitated the insertion of conditional text blocks based on the input data's content. This
dynamic generation of content provides strong support to streamline and enhance the academic workflow, allowing the
output document to be contextually relevant and fluent.
Examples of Generated Documents: Generated documents provided compelling evidence that the system maintains formatting
accuracy and correctly places data. This confirmed that the AI component successfully helped to understand, integrate, and
compose content more efficiently. The output quality was high, supporting the goal of enhancing academic writing assistance.
Impact of LLM Integration
The strategic integration of Contemporary Large Language Models (LLMs) within the template_engine.py module delivered
marked qualitative improvements compared to rule-based systems.
Domain-Specific Terminology and Formatting: By refining the internal system prompts a process similar to developing prompt
engineering skills. The model showed better performance in:
1. Terminology Handling: The system accurately populated fields requiring specialised vocabulary (such as those found in
"ML Documentation") without the inaccuracies sometimes seen in general-purpose models (such as generating false
citations or inaccuracies). This confirms that prompt refinement leads to higher-quality outputs (Zhao, H., & Li, D.
2024).
2. Formatting Accuracy: The results show that guiding the AI helps the output meet professional document standards. It
goes beyond just generating content. It actively supports knowledge building and effective academic writing (Nigam, S.
K., Patnaik, B. D., Thomas, A. V., Shallum, N., Ghosh, K., & Bhattacharya, A. 2025).
3. Qualitative Assessment of Document Fluency and Contextual Relevance: User feedback and linguistic assessment
confirmed that the generated documents possessed high fluency and contextual relevance. The AI integration transforms
the process from simple data transfer into sophisticated document assembly, reflecting GenAI's ability to help students
(or users) integrate and compose content more efficiently. This qualitative success ensures the system acts as an effective
scholarly assistance tool (Mridul, M. A., Sloyan, I., Gupta, A., & Seneviratne, O. 2025).
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XIV, Issue X, October 2025
www.ijltemas.in Page 226
Comparison with Baseline Methods
The operational efficiency of the Sandbox system was measured against baseline methods, such as manual data entry and simple
data substitution scripts, establishing its value as an automated solution.
Table 2 Comparison table for evaluation
Baseline Method
Key Performance
Indicator (KPI)
Comparison Finding
Manual Data Entry
(Time)
5-6 minutes
The Sandbox drastically reduces the time required, validating the objective to
solve the problem of manual, time-consuming data processing. GenAI
integration significantly improves learning efficiency.
Simple Substitution
Scripts (Error rate)
13% Error
The intelligent mapping system demonstrated a lower error rate, as it
mitigates the error-prone task associated with rigid systems. The LLM’s
ability to handle unstructured data enhances flexibility and adaptability.
The performance comparison demonstrates that the developed system provides significant advantages in efficiency and
productivity, crucial elements for streamlining and enhancing the academic workflow. This confirms Hypothesis H1: the
integration of LLMs for semantic mapping results in demonstrably higher accuracy and efficiency compared to baseline methods
V. Discussion
The "Sandbox: Document Generating Engine" project is a practical application built with Python and Streamlit that automates
document generation from data formats like .csv and .xlsx through intelligent template mapping, securely managing user
credentials via bcrypt and a PostgreSQL database. Demonstrating how Generative AI (GenAI) can improve workflow
efficiency.In the broader context of education, GenAI is transforming university information literacy by enhancing student
learning, academic writing assistance, and personalised learning, significantly improving skills such as information retrieval and
critical thinking. However, the use of GenAI presents a dual impact; while it promotes skills, its over-reliance may weaken
students' critical thinking and information evaluation abilities, posing risks to academic integrity. Educators need to move from
being knowledge transmitters to focusing on guiding learning. Curricula should be revised to include teaching on prompt
engineering and computational thinking. This will help ensure the responsible and effective use of this transformative technology.
Interpretation of Findings
Addressing Manual and Error-Prone Processes: The finding that the system significantly reduces the time and effort required for
document creation confirms the project's success in mitigating the challenge of manual, time-consuming data processing. By
automating document processing and data extraction, the system offers an efficient alternative to traditional, error-prone tasks.
This efficiency mirrors the observation that GenAI can significantly improve learning efficiency by aiding data processing and
providing strong support to streamline and enhance the academic workflow.
Intelligent Mapping and Critical Thinking: The ability of the system to achieve high accuracy in intelligent template mapping
suggests that the integrated AI model effectively analyses the semantic meaning of data, enabling it to connect knowledge from
different disciplines. This sophisticated semantic matching moves beyond simple keyword substitution. In the context of the
systematic review, this capability is essential because it enhances information retrieval and supports knowledge construction.
Strengths and Contributions of the Study
Novelty and Integration: The main contribution is showing a modular and flexible design that helps integrate advanced AI models
(LLMs) into document processing logic in template_engine.py. This directly addresses the need for scholarly assistance tools that
are both powerful and adaptable
Efficiency and Productivity Gains: The quantitative results confirming substantial time savings validate the system's ability to
significantly improve efficiency and productivity. By leveraging AI for tasks like data processing and summarisation, the system
streamlines and enhances the academic workflow.
Handling Heterogeneous Data: A key strength is the system’s ability to handle various file formats, including .txt, .csv, and .xlsx.
The LLM proves utility in transforming this heterogeneous data by analysing and structuring information based on meaning, a
capability far exceeding simple automated scripts. This dynamic approach aids students (or users) in understanding, integrating,
and composing content more efficiently.
Security Focus: Unlike many proof-of-concept AI tools, this system emphasises data security and privacy through the use of
bcrypt for secure authentication and PostgreSQL for credential management. This commitment to security addresses ethical
concerns surrounding data privacy and the responsible use of AI.
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XIV, Issue X, October 2025
www.ijltemas.in Page 227
Limitations of the Current Work
1. Lack of Empirical Data and Validation: The project relies on synthetic datasets and projected performance metrics rather
than long-term field testing or sufficient support from empirical data. Future work necessitates real-world case studies to
confirm the general effectiveness of generative AI for this application.
2. Computational and Resource Requirements: Integrating LLMs demands significant computational resources. Scaling the
system will increase hardware and deployment costs, potentially limiting accessibility.
3. Prompt Dependency and Bias Risks: The system's performance is highly dependent on carefully designed internal
prompt engineering. Furthermore, LLMs carry inherent biases, inaccuracies, and limitations, such as generating false
citations or contextual contradictions, necessitating internal critical assessment mechanisms.
4. Monitoring Critical Thinking: The system aims to avoid over-reliance on AI, but a mechanism for monitoring user
interaction and ensuring they maintain critical thinking and validation skills is not fully implemented. The research
indicates a need for assessment tools to monitor cognitive activities in AI-assisted learning.
Practical Implications
1. Workflow Transformation and Efficiency: This technology enables organisations to shift resources away from manual,
time-consuming data processing toward higher-level cognitive tasks.
2. Legal and Compliance: The system could automate the generation of preliminary reports or standard legal filings by
extracting client data from forms and mapping it to highly structured documents, ensuring compliance and saving critical
time.
3. Finance and Accounting: Financial data from spreadsheets (.xlsx, .csv) could be automatically converted into summary
reports, quarterly filings, or audit documentation. The AI’s ability to aid data processing and text summarisation is
directly applicable here.
4. Healthcare and Research: Researchers could quickly generate detailed clinical trial documentation or research grant
proposals by extracting data from primary sources, streamlining the process of academic writing assistance and
knowledge construction.
5. Government and Administration: Routine administrative reports, policy summaries, or public information documents
could be generated with speed and accuracy, utilising the system’s high-quality output capabilities.
VI. Conclusion
This project successfully developed the "Sandbox: Document Generating Engine", a secure, AI-ready platform that dramatically
streamlines the document creation process. By focusing on augmenting intelligent document processing (IDP) workflows with
contemporary large language models (LLMs), we have created a powerful solution that tackles the inefficiency of manual data
handling.
Summary of Key Findings
1. Novelty and Scholarly Contribution: The core success lies in developing a secure, AI-ready platform that addresses the
gap in IDP by integrating semantic understanding from LLMs into a robust, modular architecture. This provides a
smarter, more flexible, and efficient scholarly assistance tool that moves beyond simple substitution.
2. Efficiency and Automation: The system effectively achieves automation and data extraction from various formats (.txt,
.csv, .xlsx), confirming the project's success in significantly reducing the need for manual, time-consuming data
processing.
3. Intelligent Mapping: The ability to enable intelligent template mapping through semantic analysis ensures high accuracy
and contextual relevance in documents like the "ML Documentation" and "SAR Report.
4. Security and Architecture: The platform is built on a modular, extensible architecture and features a robust
authentication system using bcrypt and PostgreSQL, ensuring data security and privacy.
Future Work
1. Real-World Statistical Validation: Conduct further research to validate performance using real-world, empirical data and
implement inferential statistical tests (e.g., t-tests) to confirm the significance of the efficiency and accuracy gains over
baseline methods.
2. Validation and Feedback Loops: It is crucial to enhance the system with real-time validation and feedback loops to
monitor user interaction. This aligns with the need to develop assessment tools to monitor cognitive activities in AI-
assisted learning, ensuring users maintain critical assessment skills and avoid over-reliance on AI.
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XIV, Issue X, October 2025
www.ijltemas.in Page 228
3. Ethical Exploration: Deeper study into ethical considerations is necessary, focusing on developing an ethical framework
to address transparency in AI decision-making and mitigate bias in AI-generated documents.
4. Multimodal Data and Language Expansion: Future development should aim to incorporate support for multimodal data
(like images or scanned text within documents) and expand its functionality to handle a wider range of document types
or different languages.
Author Contributions and Declarations
Agrim Yadav, Tanya, and Khushi Singh were jointly responsible for the conceptualisation , design, and implementation of the
"Sandbox: Document Generating Engine". Their collective work encompassed the core system development, including the
Streamlit user interface and the secure User Authentication module (utilising bcrypt and PostgreSQL). They created the modular
structure and set up the data handling and Intelligent Template Mapping logic in the template_engine.py module. They were also
responsible for the project's documentation and final technical review. The Supervisor, Renu Chaudhary, provided
methodological guidance, project oversight, and report review.
Declarations
Ethical Approval: This project focuses on software design, development, and system analysis, and thus did not involve the
collection of primary data from human participants or sensitive human interaction. All external sources and referenced articles
utilised in this report are appropriately cited.
Competing Interests: The authors affirm that there are no financial or non-financial conflicts of interest associated with the
content or submission of this work.
Funding: This research did not receive any targeted financial support.
References
1. Achachlouei, A., Patil, M. A., Joshi, Q., Vair, T. & N. (2021). Document Automation Architectures and Technologies: A
Survey. arXiv. https://arxiv.org/abs/2109.02605
2. Adhikari, P. R. (2018). Understanding of Plagiarism through Information Literacy: A Study among the Students of
Higher Education of Nepal. Journal of Business and Social Sciences Research, 3(2), 165181.
https://doi.org/10.3126/jbssr.v3i2.28132
3. AlAli, R., & Wardat, Y. (2024). Opportunities and Challenges of Integrating Generative Artificial Intelligence in
Education. International Journal of Religion, 5(7), 784793. https://doi.org/10.61707/8y29gv34
4. Aldosari, S. A. M. (2020). The Future of Higher Education in the Light of Artificial Intelligence Transformations.
International Journal of Higher Education, 9(3), 145. https://doi.org/10.5430/ijhe.v9n3p145
5. Almahasees, Z., Khalil, M., & Am inzadeh, S. (2024). Students’ Perceptions of the Benefits and Challenges of
Integrating ChatGPT in Higher Education. Pakistan Journal of Life and Social Sciences (PJLSS), 22(2), 34793494.
https://doi.org/10.57239/PJLSS-2024-22.2.00256
6. Archila, P. A., Ortiz, B. T., Truscott de Mejía, A.-M., & Molina, J. (2024). Thinking critically about scientific
information generated by ChatGPT. Information and Learning Science. https://doi.org/10.1108/ILS-04-2024-0040
7. Arora, S., Yang, S., Eyuboglu, B., Narayan, S., Hojel, A., Trummer, A., & E., I. R. (2023). Language Models Enable
Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. Proc. VLDB Endow., 17(2), 92104.
https://doi.org/10.14778/3620359.3620366
8. Athaluri, A. S., Manthena, S. V., K., M. V. S. R., Kesapragada, V., Yarlagadda, T., Dave, & Dudumpudi, R. T. S.
(2023). Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in
Scientific Writing Through ChatGPT References. Cureus, 15(12). https://doi.org/10.7759/cureus.49964
9. Bakiri, H., Mbembati, H., & Tinabo, R. (2023). Artificial Intelligence Services at Academic Libraries in Tanzania:
Awareness, Adoption and Prospects. University of Dar Es Salaam Library Journal,
18(2).https://doi.org/10.4314/udslj.v18i2.3
10. Bearman, M., Tai, J., Dawson, P., Boud, D., & Ajjawi, R. (2024). Developing evaluative judgement for a time of
generative artificial intelligence. Assessment & Evaluation in Higher Education, 49(6), 893905.
https://doi.org/10.1080/02602938.2024.2335321
11. Biswas, S., Jain, S., Morariu, R., Gu, V. L., Mathur, J., Wigington, P., Sun, C., & Uehida, T. (2024). DocSynthV2: A
Practical Autoregressive Modelling for Document Generation. arXiv. https://arxiv.org/abs/2406.02492.
12. Bitzenbauer, P. (2023). ChatGPT in physics education: A pilot study on easy-to-implement activities. Contemporary
Educational Technology, 15(3), ep430. https://doi.org/10.30935/cedtech/13176.
13. Borkovska, I., Kolosova, H., Kozubska, I., & Antonenko, I. (2024). Integration of AI into the Distance Learning
Environment: Enhancing Soft Skills. Arab World English Journal, 1(1), 5672.
https://doi.org/10.24093/awej/ChatGPT.3
14. Bozkurt, A. (2024). Tell Me Your Prompts and I Will Make Them True: The Alchemy of Prompt Engineering and
Generative AI. Open Praxis, 16(2), 111118. https://doi.org/10.55982/openpraxis.16.2.661
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XIV, Issue X, October 2025
www.ijltemas.in Page 229
15. Bradley, C. (2013). Information Literacy Articles in Science Pedagogy Journals. Evidence Based Library and
Information Practice, 8(4), 7892. https://doi.org/10.18438/B8JG76
16. Cain, W. (2024). Prompting Change: Exploring Prompt Engineering in Large Language Model AI and Its Potential to
Transform Education. TechTrends, 68(1), 4757. https://doi.org/10.1007/s11528-023-00896-0
17. Carroll, A. J., & Borycz, J. (2024). Integrating large language models and generative artificial intelligence tools into
information literacy instruction. The Journal of Academic Librarianship, 50(4),
102899.https://doi.org/10.1016/j.acalib.2024.102899
18. ÇAYIR, A. (2023). A Literature Review on the Effect of Artificial Intelligence on Education. İnsan ve Sosyal Bilimler
Dergisi, 6(2), 276288. https://doi.org/10.53048/johass.1375684
19. Lin, C.-H., & Cheng, C. P. (2024). Legal Documents Drafting with Fine-Tuned Pre-trained Large Language Model.
arXiv. https://arxiv.org/abs/2406.08860
20. Mohammadi, B., et al. (2024). Creativity Has Left the Chat: The Price of Debiasing Language Models. arXiv.
https://arxiv.org/abs/2403.04595
21. Mridul, M. A., Sloyan, I., Gupta, A., & Seneviratne, O. (2025). AI4Contracts: LLM & RAG-Powered Encoding of
Financial Derivative Contracts. arXiv. https://arxiv.org/abs/2506.09633
22. Nigam, S. K., Patnaik, B. D., Thomas, A. V., Shallum, N., Ghosh, K., & Bhattacharya, A. (2025). Structured Legal
Document Generation in India: A Model-Agnostic Wrapper Approach with VidhiDastavej. International Journal of Law,
Technology, and Management. https://doi.org/10.48550/arXiv.2506.09540
23. Zhao, H., & Li, D. (2024). A Large Language Model-based Framework for Semi-Structured Tender Document
RetrievalAugmented Generation. arXiv. https://arxiv.org/abs/2403.18560
24. Zhang, Q., Huang, B., Jiang, V., Wang, J., Jiang, Z., He, L., & Zhang, C. (2024). Document Parsing Unveiled:
Techniques, Challenges, and Prospects for Structured Information Extraction. ResearchGate.
https://arxiv.org/abs/2403.11186