Edge inference matures: how choosing power mode and backend makes up to a 74% difference in tiny models

🕒 Published on Zendoric: June 28, 2026 · 09:00
A thorough benchmark on the Jetson Orin Nano Super 8GB shows that the 25 W mode is the sweet spot for continuous inference: 43% faster than 15 W, with better energy efficiency than MAXN. The choice between llama.cpp and Ollama can matter more than the hardware itself.
By Zendoric · June 28, 2026.
The Jetson Orin Nano Super 8GB costs around 250 dollars. With an NVIDIA Ampere GPU, 1,024 CUDA cores and 8 GB of unified LPDDR5 memory at 102 GB/s, it can run small language models in real time. Yuvraj Singh has just published one of the most complete edge inference benchmarks to date: eight LLMs, four power modes, two inference backends and thousands of logged requests, with energy-per-token metrics rather than speed alone. The result is a practical roadmap for anyone deploying AI at the network edge.
**The 25 W mode is the real turning point**
The most actionable conclusion of the study is that the 25 W mode (`nvpmodel -m 1`, GPU at 820 MHz, CPU at 1,420 MHz) wins on every dimension that matters in production. It delivers between 35% and 47% more tokens per second than the 15 W mode while also improving energy efficiency (tok/J): by between 1% and 7% over 15 W, and by between 9% and 23% over MAXN. The reason is that MAXN raises clocks to as much as 1,020 MHz on the GPU and 1,728 MHz on the CPU, drawing between 6.36 and 10.64 W on the `VDD_CPU_GPU_CV` rail, yet single-user inference on small models is limited by memory bandwidth, not by compute capacity. Pushing the GPU clock beyond 820 MHz adds raw speed, but the extra energy outweighs the gain in generated tokens.
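The trade-off described above can be written as a one-line relation. The symbols are introduced here for illustration and do not come from the study: $r$ is decode throughput in tok/s, $P$ is average power in watts, and $\eta$ is efficiency in tok/J.

```latex
\eta = \frac{r}{P},
\qquad
\frac{\eta_{\mathrm{MAXN}}}{\eta_{25}}
  = \frac{r_{\mathrm{MAXN}} / r_{25}}{P_{\mathrm{MAXN}} / P_{25}} < 1
\quad\Longleftrightarrow\quad
\frac{P_{\mathrm{MAXN}}}{P_{25}} > \frac{r_{\mathrm{MAXN}}}{r_{25}}
```

In words: MAXN is less efficient than 25 W whenever its power draw grows by a larger factor than its throughput. The 9% to 23% tok/J advantage reported for 25 W corresponds to $\eta_{25} / \eta_{\mathrm{MAXN}}$ between 1.09 and 1.23.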
For always-on systems this has direct consequences. MAXN may be appropriate when prefill latency matters more than the energy bill (it cuts time to first token by between 31% and 38% relative to 15 W), but for continuous text generation 25 W is the best trade-off. The 7 W mode is useful for efficiency research or battery-powered deployments, but it requires restarting the device between model loads because of memory constraints.
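For readers who want to script the mode switch, here is a minimal Python sketch that builds the `nvpmodel` invocations. Only mode ID 1 (25 W) is taken from the article; the helper names are hypothetical, and the IDs of the other modes should be checked on the target board because they are not given here. By default the sketch only returns the command line and runs nothing.

```python
import shlex
import subprocess

# The 25 W mode ID is the one quoted in the article (nvpmodel -m 1).
# IDs for the other modes are not given there, so none are assumed here.
MODE_25W = 1


def nvpmodel_set_cmd(mode_id: int) -> list[str]:
    """Build the command that selects a power mode by numeric ID."""
    if mode_id < 0:
        raise ValueError("mode_id must be non-negative")
    return ["sudo", "nvpmodel", "-m", str(mode_id)]


def nvpmodel_query_cmd() -> list[str]:
    """Build the command that reports the currently active power mode."""
    return ["sudo", "nvpmodel", "-q"]


def run(cmd: list[str], dry_run: bool = True) -> str:
    """Return the command line as text; execute it only when dry_run is False."""
    line = shlex.join(cmd)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return line


if __name__ == "__main__":
    # Dry run: shows what would be executed on the Jetson.
    print(run(nvpmodel_set_cmd(MODE_25W)))
    print(run(nvpmodel_query_cmd()))
```

Passing `dry_run=False` on the device itself would execute the commands; querying the mode afterwards is a cheap way to confirm the switch took effect before starting a benchmark run.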
**The choice of backend matters as much as the hardware**
The second most significant finding is the gap between llama.cpp and Ollama. On sub-1B transformer models, llama.cpp beats Ollama by between 36% and 74% in throughput, with a proportional advantage in energy efficiency. The most extreme gap falls outside that range: on Liquid AI's LFM2.5-350M at 25 W, llama.cpp generates 115.4 tok/s versus Ollama's 27.5, a 4.2-fold difference. In tok/J terms, the same model reaches 17.16 tok/J under llama.cpp and only 6.39 under Ollama in that mode, a gap of roughly 2.7 times.
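The LFM2.5-350M figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the four numbers quoted in this section; the function names are hypothetical, and the implied-power calculation assumes that tok/s and tok/J were measured over the same decode window.

```python
# Figures quoted above for LFM2.5-350M in 25 W mode.
LLAMA_CPP = {"tok_s": 115.4, "tok_j": 17.16}
OLLAMA = {"tok_s": 27.5, "tok_j": 6.39}


def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


def implied_watts(tok_s: float, tok_j: float) -> float:
    """Average power implied by the two metrics: (tok/s) / (tok/J) = J/s = W.

    Assumption: both metrics cover the same decode window.
    """
    return tok_s / tok_j


throughput_gap = ratio(LLAMA_CPP["tok_s"], OLLAMA["tok_s"])
efficiency_gap = ratio(LLAMA_CPP["tok_j"], OLLAMA["tok_j"])
llama_cpp_watts = implied_watts(**LLAMA_CPP)
ollama_watts = implied_watts(**OLLAMA)

if __name__ == "__main__":
    print(f"throughput gap: {throughput_gap:.2f}x")
    print(f"efficiency gap: {efficiency_gap:.2f}x")
    print(f"implied power, llama.cpp: {llama_cpp_watts:.2f} W")
    print(f"implied power, Ollama:    {ollama_watts:.2f} W")
```

Under that assumption, the throughput gap is about 4.2 times but the efficiency gap is about 2.7 times, because llama.cpp's implied draw (about 6.7 W) is higher than Ollama's (about 4.3 W). The faster backend works the board harder, so part of the speed advantage is paid for in power.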
The technical reason is not a configuration error: Ollama loaded all models with 100% GPU offload (confirmed via `ollama ps`), so no layer fell back to the CPU. The difference reflects inefficiencies in Ollama's CUDA kernels for SSM (state space model) architectures, which are the basis of Liquid AI's LFM models. The author used Ollama v0.24.0, the only version compatible with all models on JetPack R36.4.7 at the time of testing; more recent versions, which use a more up-to-date llama.cpp fork, could close part of that gap. The methodological caveat matters: Ollama's results are specific to that software state and should be revisited periodically.
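The study confirmed offload with the `ollama ps` command. A scripted alternative, which is not the author's method, is to read Ollama's `/api/ps` endpoint and compare each loaded model's `size` with its `size_vram`. The sketch below assumes the default local address and that the two fields are reported meaningfully on a unified-memory Jetson; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_PS_URL = "http://localhost:11434/api/ps"


def gpu_fraction(entry: dict) -> float:
    """Share of a loaded model's memory that is reported as GPU-resident."""
    size = entry.get("size", 0)
    if size <= 0:
        raise ValueError("entry has no positive 'size'")
    return entry.get("size_vram", 0) / size


def fully_offloaded(entry: dict) -> bool:
    """True when the whole model is reported as resident in GPU memory."""
    return gpu_fraction(entry) >= 1.0


def loaded_models(url: str = OLLAMA_PS_URL) -> list[dict]:
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp).get("models", [])


if __name__ == "__main__":
    for model in loaded_models():
        status = "OK" if fully_offloaded(model) else "partial CPU fallback"
        print(f"{model.get('name')}: {gpu_fraction(model):.0%} GPU ({status})")
```

Running a check like this before each benchmark pass would catch a silent CPU fallback, which would otherwise look like a backend inefficiency in the results.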
The difference practically vanishes on Qwen3-0.6B and Llama3.2-1B, where both backends deliver nearly identical results (between 1% and 6% apart). For well-supported conventional transformer architectures, Ollama can be a reasonable choice if its model-management ecosystem justifies the abstraction cost.
**Which model to choose and why 101 MB is enough for many cases**
The pure-speed champion is SmolLM2-135M-Instruct: 165.2 tok/s and 29.62 tok/J at 25 W, in a 101 MB GGUF file quantized to Q4_K_M. Its peak consumption in that configuration is around 5.6 W, which, as Singh notes, makes it operable from a USB-C power bank. This is no trivial detail: it means there are local-assistance, form-processing or voice-interface use cases that can run without a dedicated power supply.
In the ~1B parameter class, Liquid AI's LFM2.5-1.2B leads in speed (54.1 tok/s at 25 W, 15% faster than Llama3.2-1B and 33% faster than Gemma3-1B) with the smallest footprint in its category (698 MB). However, Gemma3-1B shows higher total energy efficiency (118.5 tok/J versus 116.2 for LFM2.5-1.2B) thanks to lower power draw during decode (6.82 W versus 8.52 W). The choice between the two depends on the system's energy budget, not just on speed.
**What this study says about the state of the art in edge AI**
As sector context, the proliferation of quality sub-2B models (SmolLM2, LFM2.5, Qwen3, Gemma3) is quietly but steadily transforming edge inference. Two years ago, running an instruction-following model on a low-power device required severe compromises in quality. Today, a board that costs 250 dollars can generate more than 165 tokens per second with models that deliver usable results on specific tasks.
What makes this benchmark valuable is not just which model wins, but how it is built: reproducible methodology, raw data published on Hugging Face with all tegrastats logs, open source code on GitHub, and a rigorous separation between prefill energy and decode energy to compute tok/J honestly. Tok/J separated by phase is a significant methodological advance over benchmarks that simply divide tokens by average watts, which artificially inflates efficiency on prefill-heavy workloads.
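To see why phase separation matters, here is a minimal sketch that integrates a power trace separately over the decode window and compares the result with a blended figure. It is not the study's code: the trace is made up, the names are hypothetical, and the "blended" formula is one plausible reading of the naive approach criticised above. Samples are assumed to be in time order, and partial intervals at the window edges are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    t: float      # seconds since the request started
    watts: float  # power reading on the measured rail


def energy_joules(samples: list[Sample], t0: float, t1: float) -> float:
    """Trapezoidal integral of power over the samples that fall in [t0, t1]."""
    window = [s for s in samples if t0 <= s.t <= t1]
    return sum((b.t - a.t) * (a.watts + b.watts) / 2
               for a, b in zip(window, window[1:]))


def decode_tok_per_joule(samples, t_first_token, t_end, generated_tokens):
    """Generated tokens divided by the energy spent in the decode phase only."""
    return generated_tokens / energy_joules(samples, t_first_token, t_end)


def blended_tok_per_joule(samples, t_end, prompt_tokens, generated_tokens):
    """All tokens (prompt + generated) divided by all energy for the request."""
    return (prompt_tokens + generated_tokens) / energy_joules(samples, 0.0, t_end)


# Hypothetical 1 Hz trace: 1 s of prefill near 10 W, then 4 s of decode near 6 W.
TRACE = [Sample(0, 10), Sample(1, 10), Sample(2, 6),
         Sample(3, 6), Sample(4, 6), Sample(5, 6)]

if __name__ == "__main__":
    decode = decode_tok_per_joule(TRACE, 1.0, 5.0, generated_tokens=130)
    blended = blended_tok_per_joule(TRACE, 5.0, prompt_tokens=590,
                                    generated_tokens=130)
    print(f"decode-only: {decode:.1f} tok/J, blended: {blended:.1f} tok/J")
```

With this made-up trace, the decode-only figure is 5.0 tok/J while the blended figure is 20.0 tok/J. The long prompt is processed in a single second, so counting its tokens makes the request look four times more efficient than its generation phase actually is.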
The practical implication for those designing edge AI pipelines is concrete: before scaling to more expensive hardware or adding nodes, it is worth auditing the device's power mode and the inference backend. In this study, those two decisions account for differences of up to roughly fourfold in throughput on the same board (4.2 times for LFM2.5-350M from the backend alone), with a smaller but still large gap in tok/J. That is optimization headroom that requires no new hardware.