Running LLMs on Jetson: OpenClaw Benchmark

Running large language models (LLMs) at the edge is no longer just possible; it’s becoming practical. NVIDIA’s Jetson series in particular holds a significant position in this space, offering high performance at low power consumption.

In this article, we’ll cover:

  • My experience running models on Jetson devices with OpenClaw
  • Performance of different backends (vLLM, Ollama, llama.cpp)
  • The impact of quantization and model selection
  • Tool calling capabilities of models
  • And why the Jetson platform matters

Model Deployment with OpenClaw

OpenClaw simplifies the deployment process by enabling models to be spun up with a single command. It provides multi-backend support, allowing different inference engines to be used within the same unified framework, while its Jetson-compatible architecture ensures smooth operation on edge devices. In addition, it makes benchmarking straightforward, enabling quick comparisons across models and backends.

In this setup, three different backends were used. vLLM stands out for its high throughput and server-side optimizations; Ollama offers ease of use and fast setup for rapid experimentation; and llama.cpp provides flexibility with its CPU/GPU hybrid design and strong quantization support, making it well-suited for resource-constrained environments.
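As a rough illustration of how such comparisons can be made, throughput (tok/s) can be measured against any backend that exposes an OpenAI-compatible chat endpoint (vLLM and llama.cpp’s server do; Ollama offers one as well). This is a minimal sketch, not the exact benchmark harness used here; the URL and model name are placeholders:

```python
import time
import json
import urllib.request

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock time."""
    return completion_tokens / elapsed_s

def measure(base_url: str, model: str, prompt: str) -> float:
    """Send one chat completion and compute tok/s from the usage stats."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)

# Example (placeholder URL and model -- adjust to your backend):
# print(f"{measure('http://localhost:8000', 'qwen3', 'Hello'):.2f} tok/s")
```

Measuring a single request like this captures end-to-end latency, including prefill; longer generations give a better estimate of steady-state decode speed.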

Tool Calling and Model Selection

Why Is It Necessary?
OpenClaw’s architecture isn’t built solely on “text generation.” The model is expected to:

  • Generate function calls
  • Return structured output (JSON, etc.)
  • Understand tool invocation formats

Therefore, models without tool calling support create significant limitations in real-world usage scenarios within OpenClaw. Such models cannot trigger tools, cannot integrate into agent pipelines, and in practice remain just a “chat model” — a major limitation from a functional perspective.

The goal of this benchmark was not only to measure raw speed, but also to evaluate models that can be used in real agent and tool-driven systems. For this reason, models with tool calling support, strong instruction-following capabilities, and the ability to generate structured output were preferred in model selection.

What Does It Do?

  • API calling (e.g., weather, database)
  • Computation or code execution
  • File/system operations
  • Building agent-based systems

In short: the LLM becomes not just a conversational system, but an action-taking system.
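Concretely, tool calling means the model emits a structured call that the runtime routes to a local function instead of free-form text. The sketch below uses the widely adopted OpenAI-style tool schema with a hypothetical weather tool (the tool name and stub are illustrative, not part of OpenClaw’s API):

```python
import json

# OpenAI-style tool schema; get_weather is a hypothetical example tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    # Stub standing in for a real API call.
    return f"Sunny in {city}"

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to the matching local function."""
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])
    registry = {"get_weather": get_weather}
    return registry[fn["name"]](**args)

# A model with tool calling support returns a structured call like this
# instead of plain text:
model_output = {
    "function": {"name": "get_weather", "arguments": '{"city": "Ankara"}'}
}
print(dispatch(model_output))  # -> Sunny in Ankara
```

A model without tool calling support never produces the structured `function` payload above, which is exactly why it cannot participate in this loop.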

Benchmark Results

📊 All Results

Backend    Product      Model                  Params  Tok/s  Quant
vLLM       AGX Thor     gemma-4-31B-it         31B      2.61  bf16
vLLM       AGX Thor     Nemotron-30B           30B      7.76  -
vLLM       AGX Thor     Qwen3.5-27B            27B      2.35  -
Ollama     AGX Thor     Qwen3.5-35B-A3B        35B     38     Q5_K_M
Ollama     AGX Thor     Qwen3 30B A3B          30B     34     -
Ollama     AGX Orin     Falcon-H1 3B           3B      29     FP16
Ollama     AGX Orin     Qwen3.5 4B             4B      21     -
Ollama     AGX Orin     Phi-3.5 MoE 42B        42B      8     -
Ollama     Orin NX 8G   Qwen3.5 9B             9B      14     -
Ollama     Orin NX 8G   Nemotron Nano 9B v2    9B      14     -
llama.cpp  DGX Spark    Gemma-4 31B bf16       31B      3.7   bf16
llama.cpp  DGX Spark    Gemma-4 31B int8       31B      6.5   AWQ int8
llama.cpp  DGX Spark    Gemma-4 31B int4       31B     10.6   AWQ int4
llama.cpp  DGX Spark    Gemma-4 26B-A4B MoE    26B     23.7   bf16
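The bf16 → int8 → int4 progression in the Gemma-4 rows (3.7 → 6.5 → 10.6 tok/s) is consistent with decode being memory-bandwidth bound: fewer bytes per weight means fewer bytes streamed per generated token. A back-of-the-envelope estimate of the weight footprint (weights only, ignoring KV cache and activations):

```python
# Approximate bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_size_gb(params_billion: float, quant: str) -> float:
    """Approximate on-device weight footprint in GB, weights only."""
    return params_billion * BYTES_PER_PARAM[quant]

for quant in ("bf16", "int8", "int4"):
    print(f"Gemma-4 31B @ {quant}: ~{weight_size_gb(31, quant):.1f} GB")
# -> ~62.0 GB, ~31.0 GB, ~15.5 GB
```

The measured speedups (roughly 1.8x and 2.9x) fall somewhat short of the ideal 2x and 4x byte reductions, which is expected: dequantization overhead and non-weight traffic eat into the gain.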

Conclusion

  • Running LLMs at the edge with Jetson is now realistic
  • Tool calling support is the critical feature that elevates these systems to real agent architectures
  • Right backend + quantization = maximum efficiency