
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
Despite the wealth of Argo data and its global coverage, the raw datasets present a significant accessibility barrier. Argo
data access generally requires proficiency in scientific programming languages such as Python or MATLAB,
familiarity with NetCDF file structures and OPeNDAP protocols, and domain-specific understanding of
oceanographic conventions including the Julian Day reference system and quality control flag interpretations.
This technical threshold effectively excludes policymakers, environmental managers, educators, and early-career
researchers who would benefit from ocean data-driven insights. Traditional methods of increasing data
accessibility have relied on web portals with form-driven query interfaces such as the Argo Data Management
portal and the Coriolis data centre. While these systems provide structured access, they require users to
understand data schemas, specify explicit parameter ranges, and manually formulate queries, a workflow that
is difficult for non-technical users.
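To make the two conventions named above concrete, the following minimal sketch shows the kind of handling a user currently has to write by hand. It assumes the standard Argo conventions that the JULD time variable counts days since 1950-01-01 00:00:00 UTC and that quality control flags '1' (good) and '2' (probably good) mark usable measurements. The function names are illustrative and are not taken from the FloatChat RAG codebase.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Argo's JULD variable counts days since 1950-01-01 00:00:00 UTC.
ARGO_EPOCH = datetime(1950, 1, 1, tzinfo=timezone.utc)

# Assumption: flags '1' (good) and '2' (probably good) are treated as usable.
ACCEPTED_QC = frozenset({"1", "2"})


def juld_to_datetime(juld: float) -> datetime:
    """Convert an Argo JULD value (fractional days) to a UTC datetime."""
    return ARGO_EPOCH + timedelta(days=juld)


def _normalize_flag(flag) -> str:
    """QC flags are often stored as single bytes in NetCDF; return them as text."""
    return flag.decode() if isinstance(flag, bytes) else str(flag)


def filter_by_qc(values, qc_flags, accepted=ACCEPTED_QC):
    """Keep only the measurements whose QC flag is in the accepted set."""
    return [
        value
        for value, flag in zip(values, qc_flags)
        if _normalize_flag(flag) in accepted
    ]
```

Even this small amount of domain knowledge is invisible in a form-driven portal, which is part of the barrier described above.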
Recent developments in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) have
opened the possibility of building intelligent data exploration interfaces (Lewis et al., 2020). RAG architectures
address a fundamental limitation of LLMs, namely their inability to access domain-specific, recent factual knowledge
beyond their training data, by fusing language generation with external retrieval mechanisms. This approach
enables systems to ground their responses in real data rather than relying solely on parametric knowledge,
resulting in substantial reductions in hallucination and increased factual accuracy in domain-specific applications
(Shuster et al., 2021).
This paper presents FloatChat RAG, an AI-powered conversational system that democratizes access to Argo
oceanographic data through an intuitive natural language chat interface. The key contributions of this work are:
(1) an end-to-end data pipeline that streams Argo NetCDF data from NOAA's THREDDS servers via OPeNDAP
and processes it into structured SQL and semantic vector formats; (2) a ChromaDB-based vector retrieval system
with sentence transformer embeddings for semantic metadata search; (3) a LangChain-based agentic framework
with nine specialized tools for semantic search, SQL retrieval, geographic and temporal filtering, and
visualization generation; (4) reliability mechanisms including multi-key API rotation, deterministic fallback
routes, and response caching; and (5) an interactive Streamlit-based chat interface with inline Plotly
visualizations and tool transparency features.
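The reliability mechanisms listed in contribution (4) can be illustrated with a short sketch. This is one possible arrangement under stated assumptions, not the system's actual implementation: the class name ResilientResponder, the call_model callable, and the fallback callable are all hypothetical stand-ins, and the cache here is a plain in-memory dictionary keyed on a whitespace- and case-normalized query.

```python
from typing import Callable, Dict, Sequence


class ResilientResponder:
    """Hypothetical sketch: response cache, API key rotation, deterministic fallback."""

    def __init__(
        self,
        api_keys: Sequence[str],
        call_model: Callable[[str, str], str],  # (api_key, query) -> answer
        fallback: Callable[[str], str],         # query -> answer, no LLM involved
    ) -> None:
        if not api_keys:
            raise ValueError("at least one API key is required")
        self._keys = list(api_keys)
        self._call_model = call_model
        self._fallback = fallback
        self._cache: Dict[str, str] = {}
        self._current = 0  # index of the key to try first

    def answer(self, query: str) -> str:
        cache_key = " ".join(query.lower().split())
        if cache_key in self._cache:
            return self._cache[cache_key]

        # Try each key at most once, starting from the last key that worked.
        for _ in range(len(self._keys)):
            try:
                result = self._call_model(self._keys[self._current], query)
            except Exception:
                # Assumption: any failure (rate limit, timeout) triggers rotation.
                self._current = (self._current + 1) % len(self._keys)
                continue
            self._cache[cache_key] = result
            return result

        # Every key failed: use the deterministic route and do not cache it,
        # so a later request can retry the model.
        return self._fallback(query)
```

In this arrangement a repeated question is served from the cache without consuming quota, and a full outage degrades to a deterministic answer rather than an error.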
LITERATURE REVIEW
Oceanographic Data Management Systems
Standardization efforts in oceanographic data management have been extensive. NetCDF, supported by Unidata
and CF (Climate and Forecast) conventions, serves as the de facto archival format for gridded and profile-based
ocean data (Rew and Davis, 1990; Eaton et al., 2011). The OPeNDAP framework advanced remote data access
by enabling subset retrieval from NetCDF files hosted on THREDDS servers without requiring full file
downloads (Gallagher et al., 2005). Several platforms provide direct Argo data access: the Argo GDAC supports
FTP and HTTP access to the global float archive (Argo Data Management Team, 2023), Euro-Argo ERIC offers
map-based spatial filtering, and the Argovis platform provides a RESTful API for querying by location, time,
and platform (Tucker et al., 2020). However, these systems require users to formulate structured queries with
explicit parameters and offer no natural language understanding or conversational interaction capabilities.
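The subset retrieval that OPeNDAP enables is expressed through a constraint expression appended to the dataset URL, in which each dimension of a variable is restricted with a [start:stride:stop] index range. The sketch below builds such a URL. The helper name opendap_subset_url and the example server address are hypothetical; TEMP is used as the variable name on the assumption that the file follows the usual Argo profile layout.

```python
from typing import Sequence, Tuple


def opendap_subset_url(
    dataset_url: str,
    variable: str,
    index_ranges: Sequence[Tuple[int, int]],
    response: str = "ascii",
) -> str:
    """Build a DAP2-style URL requesting a hyperslab of one variable.

    index_ranges holds one inclusive (start, stop) pair per dimension;
    the stride is fixed at 1 to keep the sketch short.
    """
    parts = []
    for start, stop in index_ranges:
        if start < 0 or stop < start:
            raise ValueError(f"invalid index range: ({start}, {stop})")
        parts.append(f"[{start}:1:{stop}]")
    return f"{dataset_url}.{response}?{variable}{''.join(parts)}"


# Hypothetical dataset address, used only to show the shape of the request:
# the first profile and the first 50 levels of the TEMP variable.
EXAMPLE_URL = opendap_subset_url(
    "https://example.org/thredds/dodsC/argo/profile.nc",
    "TEMP",
    [(0, 0), (0, 49)],
)
```

Some HTTP clients percent-encode the square brackets before sending the request. The point of the example is that the user must already know the variable name and its index layout, which is the kind of explicit parameter specification discussed above.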
Retrieval-Augmented Generation
Lewis et al. (2020) introduced RAG as a framework integrating a pre-trained parametric language model with a
non-parametric retrieval index accessed through a neural retriever. This method demonstrated substantial
improvements in factual accuracy and reduced hallucination compared to purely generative models. Subsequent
work has categorized RAG approaches as naive RAG, advanced RAG, and modular RAG (Gao et al., 2024). Dense
vector retrieval using embedding models like Sentence-BERT (Reimers and Gurevych, 2019), stored in
specialized vector databases such as ChromaDB, has emerged as a leading retrieval paradigm. While RAG has
been applied to various domain-specific knowledge extraction tasks, to the best of our knowledge no prior work has applied RAG to
interactive, conversational exploration of structured oceanographic observation data.
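The core operation behind dense vector retrieval is ranking stored embeddings by their similarity to a query embedding. The sketch below shows that ranking step with cosine similarity in plain Python. It is a simplified stand-in: in a real pipeline the vectors would come from an embedding model such as Sentence-BERT and the search would be delegated to a vector database, whereas here the vectors are supplied by the caller and the function names are illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vector: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in index.items()
    ]
    # Highest similarity first; ties keep insertion order because sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

The documents returned by such a ranking are what a RAG system passes to the language model as grounding context.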