Running LLMs on Jetson: OpenClaw Benchmark

Running large language models (LLMs) at the edge is no longer just possible; it's becoming practical. NVIDIA's Jetson devices in particular hold a strong position in this space, offering high performance at low power consumption.
In this article, we’ll cover:
- My experience running models on Jetson devices with OpenClaw
- Performance of different backends (vLLM, Ollama, llama.cpp)
- The impact of quantization and model selection
- Tool calling capabilities of models
- And why the Jetson platform matters
Model Deployment with OpenClaw
OpenClaw simplifies deployment by letting models be spun up with a single command. It provides multi-backend support, so different inference engines can be used within the same unified framework, while its Jetson-compatible architecture ensures smooth operation on edge devices. It also makes benchmarking straightforward, enabling quick comparisons across models and backends.
In this setup, three different backends were used. vLLM stands out with its high throughput and server-side optimizations, Ollama offers ease of use and fast setup for rapid experimentation, and llama.cpp provides flexibility with its CPU/GPU hybrid design and strong support for quantization, making it well-suited for resource-constrained environments.
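A practical consequence of this multi-backend setup is that all three engines can serve an OpenAI-compatible HTTP API (vLLM's server, Ollama, and llama.cpp's server all support this), so benchmark client code can stay backend-agnostic. Below is a minimal sketch of the request payload and the throughput calculation; the model name is a placeholder, and the exact payload fields should be checked against the backend you actually run.

```python
# Sketch of a backend-agnostic benchmark client. The model name below is
# a placeholder, not a value taken from the benchmark table.

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock time."""
    return completion_tokens / elapsed_s

payload = chat_payload("some-31b-model", "Summarize Jetson AGX Thor in one line.")
# e.g. send with: requests.post("http://localhost:8000/v1/chat/completions", json=payload)
# then read usage["completion_tokens"] from the response and time the call.

print(round(tokens_per_second(256, 33.0), 1))  # 256 tokens in 33 s -> 7.8 tok/s
```

Timing the full request and dividing by the reported completion-token count is the simplest methodology; streaming clients can instead measure time-to-first-token and inter-token latency separately.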
Tool Calling and Model Selection
Why Is It Necessary?
OpenClaw’s architecture isn’t built solely on “text generation.” The model is expected to:
- Generate function calls
- Return structured output (JSON, etc.)
- Understand tool invocation formats
Therefore, models without tool calling support are severely limited in real-world OpenClaw usage: they cannot trigger tools, cannot integrate into agent pipelines, and in practice remain just a "chat model." This is a major functional disadvantage.
The goal of this benchmark was not only to measure raw speed, but also to evaluate models that can be used in real agent and tool-driven systems. For this reason, models with tool calling support, strong instruction-following capabilities, and the ability to generate structured output were preferred in model selection.
What Does Tool Calling Enable?
- API calling (e.g., weather, database)
- Computation or code execution
- File/system operations
- Building agent-based systems
In short: the LLM becomes not just a conversational system, but an action-taking system.
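To make the distinction concrete, here is a minimal sketch of the tool-calling loop: the model emits a structured JSON call, and a dispatcher maps it to a real function. The tool names and the JSON shape are illustrative (loosely following the common function-calling convention), not OpenClaw's actual wire format.

```python
import json

# Hypothetical tools; a real agent would wrap APIs, files, or code execution.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

def run_sql(query: str) -> str:
    return f"executed: {query}"

TOOLS = {"get_weather": get_weather, "run_sql": run_sql}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and invoke the tool.

    Expected shape (illustrative): {"name": "...", "arguments": {...}}
    A model without tool-calling support emits free text instead, which
    fails to parse here -- exactly the limitation described above.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return "not a tool call: plain chat output"
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        return f"unknown tool: {call.get('name')}"
    return fn(**call.get("arguments", {}))

print(dispatch('{"name": "get_weather", "arguments": {"city": "Ankara"}}'))
# -> Sunny in Ankara
print(dispatch("The weather in Ankara is probably sunny."))
# -> not a tool call: plain chat output
```

The second call shows why a chat-only model falls out of the pipeline: its free-text answer never reaches a tool, so the agent loop stalls at the parsing step.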
Benchmark Results
📊 All Results
| Backend | Device | Model | Params | Tok/s | Quant |
|---|---|---|---|---|---|
| vLLM | AGX Thor | gemma-4-31B-it | 31B | 2.61 | bf16 |
| vLLM | AGX Thor | Nemotron-30B | 30B | 7.76 | – |
| vLLM | AGX Thor | Qwen3.5-27B | 27B | 2.35 | – |
| Ollama | AGX Thor | Qwen3.5-35B-A3B | 35B | 38 | Q5_K_M |
| Ollama | AGX Thor | Qwen3 30B A3B | 30B | 34 | – |
| Ollama | AGX Orin | Falcon-H1 3B | 3B | 29 | FP16 |
| Ollama | AGX Orin | Qwen3.5 4B | 4B | 21 | – |
| Ollama | AGX Orin | Phi-3.5 MoE 42B | 42B | 8 | – |
| Ollama | Orin NX 8GB | Qwen3.5 9B | 9B | 14 | – |
| Ollama | Orin NX 8GB | Nemotron Nano 9B v2 | 9B | 14 | – |
| llama.cpp | DGX Spark | Gemma-4 31B bf16 | 31B | 3.7 | bf16 |
| llama.cpp | DGX Spark | Gemma-4 31B int8 | 31B | 6.5 | AWQ int8 |
| llama.cpp | DGX Spark | Gemma-4 31B int4 | 31B | 10.6 | AWQ int4 |
| llama.cpp | DGX Spark | Gemma-4 26B-A4B MoE | 26B | 23.7 | bf16 |
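The quantization rows are largely a memory-bandwidth story: a decode-bound model's weight footprint shrinks roughly with bits per parameter, and throughput scales accordingly. A back-of-the-envelope estimate (weights only; KV cache and runtime overhead are ignored, so real memory use is higher):

```python
# Approximate weight footprint for a 31B-parameter model at different
# precisions. Weights only; KV cache and activation memory are ignored.
def weight_gb(params_b: float, bits_per_param: float) -> float:
    bytes_total = params_b * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: ~{weight_gb(31, bits):.0f} GB")
# bf16: ~62 GB, int8: ~31 GB, int4: ~16 GB
```

Each halving of precision halves the bytes the GPU must stream per generated token, which roughly tracks the tok/s progression in the table (3.7 → 6.5 → 10.6 for the 31B model), with some efficiency lost to dequantization overhead.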
Conclusion
- Running LLMs at the edge with Jetson is now realistic
- Tool calling support is the critical feature that elevates these systems to real agent architectures
- Right backend + quantization = maximum efficiency

