A Survey on Hybrid Caching Techniques to Reduce Latency in Large Language Model Systems
Large Language Models (LLMs) are widely applied in fields such as text summarization and generation, generative and conversational Artificial Intelligence (AI), and other Natural Language Processing (NLP) tasks. However, generating content afresh for each real-time task incurs high computational cost and latency.
To address this drawback, caching has been proposed as the most effective solution: caching mechanisms reuse a stored response instead of recomputing it for each redundant task. This survey explores caching strategies ranging from traditional key-value techniques to advanced hybrid strategies.
The paper highlights the effectiveness of caching techniques in improving the overall performance of LLM systems. The survey finds hybrid caching mechanisms to be the most useful, with an estimated 15-25% reduction in latency and a 10-20% improvement over traditional caching mechanisms.
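As a rough illustration of the hybrid idea, the sketch below pairs an exact-match key-value lookup with a similarity-based fallback. It is a minimal example under stated assumptions, not the method of any surveyed system: the names `HybridCache`, `embed`, `cosine`, and `answer` are hypothetical, the bag-of-words vector is a toy stand-in for a real sentence-embedding model, and the 0.8 threshold and least-recently-used eviction are arbitrary choices made for the example.

```python
import math
import re
from collections import Counter, OrderedDict


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a sentence-embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class HybridCache:
    """Exact-match lookup first, similarity fallback second, LRU eviction."""

    def __init__(self, capacity=128, threshold=0.8):
        self.capacity = capacity      # maximum number of cached prompts
        self.threshold = threshold    # minimum similarity for a fallback hit (assumed value)
        self.entries = OrderedDict()  # normalized prompt -> (vector, response)

    @staticmethod
    def _key(prompt):
        # Normalize case and whitespace so trivially different prompts share a key.
        return " ".join(prompt.lower().split())

    def get(self, prompt):
        key = self._key(prompt)
        if key in self.entries:               # tier 1: key-value hit
            self.entries.move_to_end(key)
            return self.entries[key][1]
        query = embed(prompt)                 # tier 2: linear similarity scan
        best_key, best_score = None, 0.0
        for k, (vec, _) in self.entries.items():
            score = cosine(query, vec)
            if score > best_score:
                best_key, best_score = k, score
        if best_key is not None and best_score >= self.threshold:
            self.entries.move_to_end(best_key)
            return self.entries[best_key][1]
        return None

    def put(self, prompt, response):
        key = self._key(prompt)
        self.entries[key] = (embed(prompt), response)
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry


def answer(prompt, cache, llm):
    """Return a cached response when one exists; otherwise call the model and store it."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = llm(prompt)
    cache.put(prompt, response)
    return response
```

Here `llm` is any callable that maps a prompt string to a response string. The linear scan in the second tier is only reasonable for small caches; a larger deployment would presumably replace it with an approximate nearest-neighbor index.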

This work is licensed under a Creative Commons Attribution 4.0 International License.