Design and Implementation of a Lightweight Neural Controller for LLM Agent Systems
Large language model (LLM) agents typically spend thousands of tokens on scaffolding (system prompts, tool schemas, and conversation history) before producing a single useful word. NEXUS (Neural EXecution & Understanding Substrate) is a 6.29-million-parameter neural controller that sits between a frozen LLM and its execution environment, replacing that token overhead with compact vector signals.
The controller has five subsystems that run together: (1) Protocol Cortex, which writes task descriptions directly into the LLM’s key-value (KV) cache so the model behaves as though it had received detailed instructions without those instructions appearing in the prompt; (2) Belief Engine, a recurrent state-space model that tracks what the agent currently believes about its environment; (3) Resource Router, a tool-selection classifier that uses explicit state-machine logic to guarantee valid tool calls; (4) Drift Sentinel, a lightweight monitor that detects when the agent’s output begins drifting off-task; and (5) Adapter Switch, which selects among small, low-rank weight updates (LoRA adapters) to specialize the LLM for different sub-tasks on the fly.
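One way a classifier combined with explicit state-machine logic can guarantee valid tool calls is to mask the classifier's scores so that only tools permitted in the current state can win. The following is a minimal sketch of that general idea, not the NEXUS implementation: the state names, tool names, and the `select_tool` helper are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical tool inventory and agent states (illustrative assumptions,
# not taken from the NEXUS code base).
TOOLS = ["search", "read_file", "write_file", "close_file", "wait"]

ALLOWED_TOOLS = {
    "idle": {"search", "read_file"},
    "file_open": {"read_file", "write_file", "close_file"},
    "awaiting_result": {"wait"},
}


def select_tool(logits, state):
    """Return the highest-scoring tool that is valid in `state`.

    `logits` holds one classifier score per entry in TOOLS. Scores of tools
    that the state machine forbids are replaced with -inf, so an invalid
    tool can never be selected regardless of what the classifier prefers.
    """
    allowed = ALLOWED_TOOLS[state]
    masked = [
        score if tool in allowed else -math.inf
        for tool, score in zip(TOOLS, logits)
    ]
    best = max(range(len(TOOLS)), key=lambda i: masked[i])
    return TOOLS[best]


if __name__ == "__main__":
    # The classifier prefers "write_file", but no file is open in "idle",
    # so the mask falls back to the best valid tool.
    scores = [0.1, 0.2, 3.0, 0.5, -1.0]
    print(select_tool(scores, "idle"))
    print(select_tool(scores, "file_open"))
```

In this sketch the validity guarantee comes from the mask, not from the learned scores, so it holds even for an untrained or poorly calibrated classifier.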
We make three separate claims. First, we describe the architecture and the training recipe for all five components. Second, we report a deployment result: an open-source Model Context Protocol (MCP) server, nexus-mcp-oss, which delivers 72.86% fewer tokens to the LLM in production through heuristic text-level compression (distinct from the KV-cache injection mechanism). Third, we present a controlled evaluation of the KV-cache injection mechanism itself, in which the Protocol Cortex is trained end-to-end with a frozen TinyLlama 1.1B and reaches a held-out perplexity of 8.91 versus 26,607 for an untrained baseline (a 2,987-fold improvement), and 30–77 times lower perplexity than Prefix-Tuning, ActAdd, and LLMLingua at matched compression. Three of the four trained components converge on the synthetic benchmark; trained checkpoints and benchmark data are publicly available. We are explicit about scope: the gains reported here are measured as token efficiency and predictive (perplexity) quality, and Sections 8.5 and 9.5 set out a concrete plan to test whether these efficiency gains carry through to downstream task quality in reasoning, planning, coding assistance, and multi-agent coordination.
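The headline numbers rest on two simple quantities: perplexity (the exponential of the mean negative log-likelihood per token) and the relative token reduction. The sketch below shows how such figures are conventionally computed; the function names are hypothetical, and the token counts in the usage example are invented solely to illustrate the arithmetic. Only the perplexities 8.91 and 26,607 come from the text above.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Defined as exp(-mean(log p)), the standard definition for language
    models; lower values mean the model predicts the text better.
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def fold_improvement(baseline_ppl, model_ppl):
    """How many times lower the model's perplexity is than the baseline's."""
    return baseline_ppl / model_ppl


def token_reduction_pct(tokens_before, tokens_after):
    """Percentage of tokens removed relative to the uncompressed input."""
    return 100.0 * (tokens_before - tokens_after) / tokens_before


if __name__ == "__main__":
    # A model that assigns probability 0.5 to every token has perplexity 2.
    print(perplexity([math.log(0.5)] * 4))

    # Ratio of the two perplexities quoted in the abstract. The rounded
    # inputs give a value close to the reported 2,987; the small gap is
    # presumably rounding of the published figures (an assumption).
    print(round(fold_improvement(26607, 8.91), 1))

    # Invented token counts: 10,000 tokens compressed to 2,714 is a
    # 72.86% reduction.
    print(round(token_reduction_pct(10_000, 2_714), 2))
```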
References
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing reasoning and acting in language models. ICLR 2023.
Shinn, N., Cassano, F., Labash, B., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023.
Jiang, H., Wu, Q., Luo, X., Li, D., Lin, C.-Y., Yang, Y., & Qiu, X. (2023). LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. arXiv preprint arXiv:2310.06839.
Chevalier, A., Wettig, A., Ajith, A., & Chen, D. (2023). Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS 2020.
Anthropic. (2024). Prompt Caching (API feature). https://www.anthropic.com/news/prompt-caching
Gu, A., & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. NeurIPS 2022.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. ICLR 2022.
Anthropic. (2024). Model Context Protocol. https://modelcontextprotocol.io
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. NeurIPS 2023.
Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., Zhao, S., Hong, L., Tian, R., Xie, R., Zhou, J., Gerstein, M., Li, D., Liu, Z., & Sun, M. (2023). ToolLLM: Facilitating large language models to master 16000+ real-world APIs. ICLR 2024.
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
Park, J. S., O’Brien, J., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. UIST 2023.
Gravitas, S. (2023). AutoGPT: An autonomous GPT-4 experiment. https://github.com/Significant-Gravitas/AutoGPT
Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U., & MacDiarmid, M. (2023). Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248.
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, B., Ren, R., Pan, A., Yin, P., Mazeika, M., Dombrowski, A. K., Goel, S., Li, N., Byun, M., Wang, Z., Mallen, A., Schwinn, L., Bhatt, U., Steinhardt, J., Fredrikson, M., & Hendrycks, D. (2023). Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.
Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. ACL-IJCNLP 2021.
Fu, D., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., & Ré, C. (2023). Hungry hungry hippos: Towards language modeling with state space models. ICLR 2023.
Poli, M., Massaroli, S., Nguyen, E., Fu, D., Dao, T., Baccus, S., Bengio, Y., Ermon, S., & Ré, C. (2023). Hyena hierarchy: Towards larger convolutional language models. ICML 2023.
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. NeurIPS 2017.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 2022.
Zhang, Z., et al. (2023). H2O: Heavy-hitter oracle for efficient generative inference of large language models. NeurIPS 2023.
Xiao, G., et al. (2024). Efficient streaming language models with attention sinks (StreamingLLM). ICLR 2024.
Liu, Z., et al. (2023). Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. NeurIPS 2023.
Mu, J., Li, X., & Goodman, N. (2023). Learning to compress prompts with gist tokens. NeurIPS 2023.
Gu, A., et al. (2022). Efficiently modeling long sequences with structured state spaces (S4). ICLR 2022.
Xi, Z., et al. (2023). The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864.
Langay, B. B. (2026). Agent Workspace: Browser-Based Multi-Tool Integration and Sidecar Control for Autonomous LLM Agents. Master of Computer Science Thesis, Anhui University of Technology, Ma'anshan, China.

This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in our journal are licensed under CC-BY 4.0, which permits authors to retain copyright of their work. This license allows for unrestricted use, sharing, and reproduction of the articles, provided that proper credit is given to the original authors and the source.