FloatChat RAG: An AI-Powered Conversational System for Argo Oceanographic Data Exploration Using Retrieval-Augmented Generation
Oceanographic research involves massive volumes of heterogeneous data produced by autonomous profiling floats. The Argo program, one of the world's largest ocean observation efforts, distributes its data in NetCDF format, with temperature, salinity, and pressure measurements recorded at varying ocean depths. Accessing and querying this data, however, requires specialized knowledge of scientific programming, data formats, and oceanographic conventions, which creates barriers for non-technical users. This paper presents FloatChat RAG, an AI-powered conversational system that uses Retrieval-Augmented Generation (RAG) to enable natural language exploration of Argo float data. The system streams Argo NetCDF files via OPeNDAP from NOAA's THREDDS servers, loads the measurements into a SQLite relational database, and generates semantic vector embeddings with the all-MiniLM-L6-v2 sentence transformer model, which are stored in ChromaDB. A LangChain-based tool-calling agent, powered by Google's Gemini large language model, interprets user queries and autonomously selects from nine specialized tools spanning semantic search, structured SQL retrieval, geographic and temporal filtering, and interactive Plotly visualization generation. The system incorporates reliability mechanisms including API key rotation, deterministic fallback routes, and response caching. Evaluation on a proof-of-concept dataset from Indian Ocean Argo floats demonstrates 93.3% tool selection accuracy across 30 test queries, 100% factual correctness on deterministic queries, a semantic search precision@5 of 1.00, and a 0% hallucination rate. Users interact with the system through a Streamlit chat interface, which bridges the gap between raw oceanographic data and actionable insights.
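The structured SQL retrieval with geographic and temporal filtering mentioned in the abstract can be illustrated with a minimal sketch using Python's standard sqlite3 module. The table name `measurements`, its columns, the helper names, and the sample rows are all hypothetical and chosen for illustration only; the abstract does not describe the actual schema or tool implementation, and the values are made-up placeholders rather than real Argo observations.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory SQLite database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE measurements (
               float_id      TEXT,
               profile_date  TEXT,   -- ISO 8601 date, so string order = date order
               latitude      REAL,
               longitude     REAL,
               pressure_dbar REAL,
               temperature_c REAL,
               salinity_psu  REAL
           )"""
    )
    # Synthetic placeholder rows, not real Argo data.
    rows = [
        ("F001", "2023-01-15", -10.0, 75.0, 10.0, 28.1, 34.9),
        ("F001", "2023-01-15", -10.0, 75.0, 500.0, 9.4, 34.8),
        ("F002", "2023-03-02", 12.5, 88.0, 10.0, 27.5, 33.2),
        ("F003", "2022-11-20", -35.0, 20.0, 10.0, 18.0, 35.4),
    ]
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def query_region(conn, lat_min, lat_max, lon_min, lon_max, start, end):
    """Return measurements inside a bounding box and date range."""
    sql = """SELECT float_id, profile_date, pressure_dbar,
                    temperature_c, salinity_psu
             FROM measurements
             WHERE latitude BETWEEN ? AND ?
               AND longitude BETWEEN ? AND ?
               AND profile_date BETWEEN ? AND ?
             ORDER BY profile_date, pressure_dbar"""
    params = (lat_min, lat_max, lon_min, lon_max, start, end)
    return conn.execute(sql, params).fetchall()


conn = build_demo_db()
# Bounding box roughly covering part of the Indian Ocean, for the year 2023.
results = query_region(conn, -20.0, 20.0, 60.0, 100.0, "2023-01-01", "2023-12-31")
for row in results:
    print(row)
```

Parameterized queries keep user-derived values out of the SQL string, which matters when an agent fills in the filter bounds from free-form text. A tool of this kind returns rows deterministically, which is one plausible way to obtain answers that can be checked for factual correctness.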

This work is licensed under a Creative Commons Attribution 4.0 International License.