Ollama LLM: Get Up and Running with Large Language Models

What is Ollama?

Ollama is a tool for running large language models locally. When you run a model, Ollama creates an isolated environment so it does not conflict with other software on your system. This environment bundles the model weights, configuration files, and required dependencies. Models are downloaded and stored locally in the `~/.ollama` directory, which allows offline use and keeps your data on your machine: prompts never leave the device, a real benefit for developers working with sensitive inputs.
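
As a quick way to see what this local store looks like, here is a minimal sketch (in Python, not part of Ollama itself) that inspects `~/.ollama`. The `models/blobs` and `models/manifests` layout reflects current releases and may differ between Ollama versions.

```python
from pathlib import Path

# Inspect the local Ollama store. The models/blobs and models/manifests
# layout is assumed from current releases and may change between versions.
models_dir = Path.home() / ".ollama" / "models"
blobs_dir = models_dir / "blobs"
manifests_dir = models_dir / "manifests"

if not models_dir.exists():
    print(f"No local model store found at {models_dir}")
else:
    blobs = [b for b in blobs_dir.glob("*") if b.is_file()] if blobs_dir.exists() else []
    total_gib = sum(b.stat().st_size for b in blobs) / 2**30
    print(f"{len(blobs)} blob(s), ~{total_gib:.1f} GiB on disk")

    # Manifests are small metadata files describing each pulled model/tag.
    if manifests_dir.exists():
        for manifest in sorted(manifests_dir.rglob("*")):
            if manifest.is_file():
                print("manifest:", manifest.relative_to(manifests_dir))
```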


Ollama consists of a client and a server. The client is the command-line interface the user interacts with in the terminal. The server, written in Go, runs as a backend service and exposes a REST API.
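
Because the server exposes a REST API, you can talk to it directly instead of going through the CLI. The sketch below assumes the server is running on its default address (`localhost:11434`) and that a model such as `llama3.2` has already been pulled; it uses the third-party `requests` package.

```python
import requests  # third-party: pip install requests

BASE_URL = "http://localhost:11434"  # default Ollama server address

# List the models available locally.
tags = requests.get(f"{BASE_URL}/api/tags").json()
print("local models:", [m["name"] for m in tags.get("models", [])])

# Request a single, non-streamed completion from one of them.
resp = requests.post(
    f"{BASE_URL}/api/generate",
    json={"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```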


Figure 1: Ollama architecture

The server can be started in several ways: from the command line, through the desktop application (built on the Electron framework), or via Docker. In every case, the client and server communicate over HTTP.
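
However the server was started, the client reaches it the same way: plain HTTP on the local port. A quick liveness check, assuming the default port 11434, might look like this:

```python
import requests

# Ping the server over HTTP, regardless of whether it was launched from the
# command line, the desktop app, or Docker. Port 11434 is the default.
try:
    r = requests.get("http://localhost:11434/", timeout=2)
    print(r.status_code, r.text)  # the root path typically answers "Ollama is running"
except requests.ConnectionError:
    print("Ollama server is not reachable")
```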


Ollama uses llama.cpp as its engine for LLM text generation. llama.cpp is a lightweight, cross-platform library optimized for running LLaMA-family models on the CPU, which lets large language models run even on low-end devices. It also supports 4-bit and 8-bit quantization to cut memory consumption and processing costs.
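
A rough back-of-the-envelope calculation shows why quantization matters: weight memory scales approximately with bits per parameter (real quantized files add a small overhead for scales and metadata). For a hypothetical 7B-parameter model:

```python
# Approximate weight memory for a hypothetical 7B-parameter model at
# different precisions; actual quantized files are slightly larger because
# quantization blocks also store scaling factors.
PARAMS = 7e9

for name, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name:>5}: ~{gib:.1f} GiB of weights")

# Prints roughly: FP16 ~13.0 GiB, 8-bit ~6.5 GiB, 4-bit ~3.3 GiB
```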

What are GGML and GGUF?

GGML is a model file format whose primary purpose is to store and run large language models efficiently. It can hold quantized versions of models (e.g., 4-bit or 8-bit) and is well suited to CPU-based inference. GGUF is the newer, more flexible successor to GGML: it offers better performance, is not limited to LLaMA-family models the way GGML was, and works for both CPU and GPU inference.
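
To make the format a bit more concrete, here is a hedged sketch that reads just the fixed GGUF header. Per the GGUF specification, a file starts with the magic bytes `GGUF`, a little-endian uint32 version, then the tensor count and metadata key/value count as uint64s; the filename below is purely hypothetical.

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size header at the start of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))            # format version
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}

# Hypothetical local file name; point this at any .gguf file you have.
print(read_gguf_header("llama3.2-q4.gguf"))
```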



Figure 2: GGUF Model File Format

How Does Ollama Work?

The user starts a conversation from the CLI (`ollama run llama3.2`). The CLI client sends an HTTP request to the ollama-http-server to get information about the model and reads the local manifest metadata file. If the model is not available locally, the CLI asks the ollama-http-server to pull it, and the server downloads it from the remote repository to the local machine.
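
The pull step can also be driven directly over the REST API. The sketch below mirrors what the CLI triggers when a model is missing: a POST to /api/pull, which streams JSON status lines while the layers download. Field names follow the current public API docs, and the default server address is assumed.

```python
import json
import requests

# Ask the ollama-http-server to pull a model from the remote repository.
# The server streams progress as one JSON object per line.
with requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "llama3.2"},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        status = json.loads(line)
        print(status.get("status"), status.get("completed"), status.get("total"))
```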


The CLI first sends an empty request to the /api/generate endpoint on the ollama-http-server; the server uses this step to set things up internally (coordinated with Go channels) and load the model. The conversation proper then begins: the CLI sends requests to the /api/chat endpoint on the ollama-http-server, which relies on the llama.cpp engine (itself running as an HTTP server) to load the model and perform inference.
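
The same two-step flow can be reproduced over the REST API: an empty /api/generate call that asks the server to load the model, followed by the actual exchange through /api/chat. The model name and server address below are assumptions.

```python
import requests

BASE_URL = "http://localhost:11434"  # default server address
MODEL = "llama3.2"                   # assumes this model is already pulled

# Step 1: a generate request with no prompt; the server responds once the
# model is loaded into memory, mirroring what the CLI does up front.
requests.post(f"{BASE_URL}/api/generate", json={"model": MODEL})

# Step 2: the conversation itself goes through /api/chat with a message list.
reply = requests.post(
    f"{BASE_URL}/api/chat",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Explain what Ollama does in one sentence."}],
        "stream": False,
    },
).json()
print(reply["message"]["content"])
```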

Conclusion

Ollama is a powerful tool that lets users run large language models efficiently and securely. Its isolated environments, local model storage, and privacy guarantees stand out. The GGML and GGUF formats keep models compact and fast to load, while the llama.cpp engine delivers solid performance even on low-end devices. Ollama's client-server architecture and HTTP-based communication make it easy to interact with, both from the terminal and over the REST API. Together, these features make Ollama an indispensable tool for developers who want to run LLMs locally.