Smarter in Secret: Using Small LLMs for Secure Software Development
AI assistance in coding - without relying on the internet?
A few weeks ago, I was working on the tedious task of categorizing a large number of research papers, 4,407 of them to be exact. Instead of doing the work manually, I asked Cursor to build a web application with a graphical interface for me. The agent took about 5 minutes to complete the task, including two revisions on my part. It would have taken me the better part of a week to accomplish the same thing, starting with learning Flask…
The productivity boost of AI assistance is clear in a developer setting. However, these tools are not always available. In a classified setting, connections to the internet are often a no-no. Allowing an agent free rein to send code to the cloud may even constitute a crime.
In a previous post, I looked into a paper discussing the use case for small language models (SLMs) in healthcare. By running models locally, practitioners can use AI assistance while upholding core principles of patient confidentiality. And it is not difficult to think of other domains with similar confidentiality demands where AI assistance would significantly boost productivity.
Enabling AI assistance in an air-gapped setting, where local models are run, is an interesting problem—one where a solution could be a significant power-up for developers in the defense and security industries. Hence, I dove into the question: Can smaller models assist developers the way massive cloud-hosted LLMs like ChatGPT or Claude can?
The short answer is: yes, and they’re improving fast.
Why Small Models Matter
While large language models (LLMs) dominate headlines and developer tools, they come with significant operational baggage. These models demand substantial computational resources, require persistent internet connections, and often raise eyebrows—or red flags—when it comes to handling sensitive code. In many professional contexts, sending proprietary or mission-critical code to a third-party API is generally not an option.
And yet, the productivity gains offered by LLM-based tools, such as Cursor, GitHub Copilot, or ChatGPT, are undeniable. They’re transforming development workflows, reducing time spent on boilerplate, debugging, and even design. That gap, between what’s possible and what’s permissible, is where Small Language Models (SLMs) are beginning to shine.
These smaller models, often with fewer than 10 billion parameters and sometimes as few as one billion, are designed to be light enough to run on local machines or even embedded devices. With efficient quantization and pruning, some can run on a laptop GPU or even a modern smartphone. In doing so, they offer a unique trade-off: lower raw performance in exchange for total control over data, environment, and compute.
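To make this concrete, here is a minimal sketch of loading a quantized model entirely from local disk with the llama-cpp-python bindings. The GGUF filename and the generation settings are placeholders for whichever checkpoint you have downloaded, not a recommendation of a particular model.

```python
from llama_cpp import Llama

# Load a 4-bit quantized checkpoint from local disk.
# "phi-4-q4_k_m.gguf" is a placeholder filename; point it at whatever
# GGUF file you have pulled onto the machine.
llm = Llama(
    model_path="phi-4-q4_k_m.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm.create_completion(
    "Write a Python function that removes duplicate lines from a file.",
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```

Nothing in this snippet touches the network: the model weights, the prompt, and the generated code all stay on the local machine.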
But trade-offs only make sense if the results are usable. So the critical question is: Can these models support real software development tasks?
Small Models, Real Performance
Recent research has begun to show that, under the right conditions, small models can indeed perform surprisingly well on serious coding tasks. One of the most comprehensive evaluations comes from Souza et al. (2025), who tested multiple SLMs on 280 real Codeforces problems, ranging from simple string manipulation to algorithmically complex challenges with Elo ratings up to 2100. These are not artificial benchmarks—they're the same problems solved by programmers worldwide.
The results were striking. The PHI-4 14B model achieved a pass@3 rate of 63.6% on Python solutions alone, climbing to 73.6% when outputs from multiple languages were combined. These results are far better than many assumed possible from open, locally runnable systems. Other models, such as LLaMA 3.2 3B, DeepSeek-Coder 6.7B, and Qwen-Code 1.8B, also posted credible results, sufficient to support debugging, refactoring, and simple code generation workflows.
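For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled solutions passes all test cases. The sketch below shows the standard unbiased estimator used in most code-generation evaluations; it is illustrative, not the authors' evaluation script.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples were drawn for a problem,
    c of them passed all tests; estimate P(at least one of k passes)."""
    if n - c < k:
        return 1.0  # too few failing samples to draw an all-failing set of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations for a problem, 4 of them correct.
print(f"pass@3 ~ {pass_at_k(10, 4, 3):.3f}")
```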
Notably, most failures were not due to reasoning breakdowns but to small implementation bugs, such as off-by-one errors, incorrect edge-case handling, or forgetting to import a required library. These are the kinds of mistakes human developers also make (which is presumably why the models make them too), and they can often be caught with test cases or linting. In other words, small models are getting the structure right, even if they occasionally miss the polish.
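As a toy illustration of why a single test case goes a long way, consider a hypothetical model-written helper with exactly this kind of off-by-one slip:

```python
def last_n(items, n):
    """Return the last n elements of a list (model-generated, buggy)."""
    return items[-(n + 1):]   # off-by-one: should be items[-n:]

def test_last_n():
    # A minimal test is enough to catch the slip:
    # the buggy version returns [2, 3, 4] instead of [3, 4].
    assert last_n([1, 2, 3, 4], 2) == [3, 4]
```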
Teaching Models to Think: The CodePLAN Approach
One promising research direction for improving small-model reasoning is the CodePLAN framework proposed by Liu et al. (2024). Instead of trying to cram more raw code examples into smaller models, CodePLAN teaches them to reason step by step, mirroring how developers break down problems before coding.
The technique is elegant: a larger model (like GPT-4) is used to generate structured solution plans alongside the actual code. These plans are high-level outlines—"First define the input, then loop through the list, then check for duplicates"—that describe the logic of the solution. During training, the smaller model is tasked with learning to both produce this plan and use it to generate the corresponding code.
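To give a flavour of what such training data might look like, here is a hypothetical plan-then-code sample; the field names and prompt template are my own illustration, not the exact format used in the paper.

```python
# Illustrative sketch of a plan-then-code training sample, in the spirit of
# CodePLAN. The structure below is an assumption for demonstration purposes.

problem = "Given a list of integers, return True if it contains duplicates."

plan = (
    "1. Read the input list.\n"
    "2. Insert elements into a set while iterating.\n"
    "3. If an element is already in the set, return True.\n"
    "4. Return False after the loop."
)

code = '''def has_duplicates(nums):
    seen = set()
    for x in nums:
        if x in seen:
            return True
        seen.add(x)
    return False
'''

# Target sequence for supervised fine-tuning: the small model learns to
# emit the plan first, then the code that follows from it.
training_sample = {
    "prompt": f"### Problem\n{problem}\n\n### Plan\n",
    "completion": f"{plan}\n\n### Code\n{code}",
}
```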
By explicitly training the model to articulate how to solve the problem before attempting the code, the researchers observed significant performance improvements. Compared to conventional instruction tuning, CodePLAN improved pass@1 accuracy by more than 130% on the APPS benchmark. And unlike chain-of-thought prompting during inference, which can be cumbersome and slow, this planning step is internalized during training. The result is a compact model that thinks more like a seasoned engineer, without needing more tokens at runtime.
This planning-first method might be especially important for small models, which have limited capacity and often struggle with multi-step reasoning. By giving them an internal scaffold for breaking problems down, CodePLAN acts like a set of mental training wheels—until the model can ride on its own.
Why This Matters for Secure Development
The ability to run these kinds of intelligent assistants entirely offline opens up important new possibilities. In a secure facility with no outbound internet access, developers can still benefit from AI assistance—such as autocomplete, refactoring suggestions, simple tests, and even architectural scaffolding—without ever compromising confidentiality.
This isn’t just theoretical. Tools like LM Studio and Ollama already make it easy to run open models, such as CodeLLaMA, DeepSeek-Coder, or Phi-3, locally, with GPU support, chat interfaces, and IDE integrations. Developers can download a model, quantize it for speed, fine-tune it on internal codebases, and deploy it in complete isolation from the internet. There are also options for integrating these models into a local development environment, although some tools, like Cursor, still require a cloud connection to function.
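As a concrete example, the sketch below queries a locally served model through Ollama's REST API. It assumes Ollama is running on its default port and that a DeepSeek-Coder model has already been pulled; swap in whichever model tag you actually use.

```python
import requests

# Ask a locally served model (via Ollama) for a refactoring suggestion.
# Everything stays on localhost; no code leaves the machine.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-coder:6.7b",   # any locally pulled model tag works
        "prompt": (
            "Refactor this function to use a list comprehension:\n"
            "def squares(n):\n"
            "    out = []\n"
            "    for i in range(n):\n"
            "        out.append(i * i)\n"
            "    return out"
        ),
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])
```

The same pattern works for any model Ollama serves, which makes it easy to swap checkpoints as better small models appear.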
For organizations that operate under strict regulatory regimes, this is a genuine paradigm shift. It allows them to reclaim the productivity gains of generative AI without violating their internal policies, or worse, the law.
From Labs to Local Environments
We're also seeing broader momentum within the ecosystem. Surveys by Cheng et al. (2024), Wang et al. (2024), and Subramanian et al. (2025) catalogue dozens of SLMs now purpose-built for niche tasks, including code generation. Many are instruction-tuned, some support multiple languages, and a growing number carry permissive licenses. There's even a surge in tiny models—those with fewer than 2B parameters—designed to run on edge devices, mobile IDEs, or behind the scenes as AI pair programmers.
This represents a significantly different vision from the cloud-dependent AI services of recent years. It's distributed, open, modifiable—and yes, smaller. But it's also becoming more capable by the day.
This topic is definitely something I will be exploring further.
References
Cheng, C., Liu, S., Zhong, Z., Zhang, Y., Mao, X., Yang, S., and Zheng, S., 2024. A Survey of Small Language Models. arXiv preprint arXiv:2410.20011.
Liu, Y., Liu, J., Wang, Z., Wang, R., Huang, M., Wu, Z., Zhou, H., Zhao, W., Ma, J. and Huang, X., 2024. CodePLAN: Enhancing Code Generation via Prompt-Aware Solution Planning. arXiv preprint arXiv:2406.00515.
Souza, D., Gheyi, R., Albuquerque, L., Soares, G. and Ribeiro, M., 2025. Code Generation with Small Language Models: A Deep Evaluation on Codeforces. arXiv preprint arXiv:2504.07343.
Subramanian, S., Elango, V. and Gungor, M., 2025. Small Language Models (SLMs) Can Still Pack a Punch: A Survey. arXiv preprint arXiv:2501.05465.
Sun, Z., Lyu, C., Li, B., Wan, Y., Zhang, H., Li, G., and Jin, Z., 2024. Enhancing Code Generation Performance of Smaller Models by Distilling the Reasoning Ability of LLMs. arXiv preprint arXiv:2403.13271.
Wang, F., Zhang, Z., Zhang, X., Wu, Z., Mo, T., Lu, Q., Wang, W., Li, R., Xu, J., Tang, X., He, Q., Ma, Y., Huang, M. and Wang, S., 2024. A Comprehensive Survey of Small Language Models in the Era of Large Language Models. arXiv preprint arXiv:2411.03350.



