llama.cpp: The Complete Guide to the Most Important Open-Source LLM Inference Engine

In March 2023, one developer — Georgi Gerganov — wrote a single C++ file that could run Meta's LLaMA model on a MacBook. That file became llama.cpp, and it fundamentally changed how the world runs large language models.
Three years later, llama.cpp has grown into one of the most-starred LLM inference projects on GitHub, with 97,500+ stars, 15,370+ forks, and 1,502 contributors. It powers everything from Ollama to NVIDIA's RTX AI Garage, and it has become the de facto standard for running LLMs locally on consumer hardware.
This isn't a wrapper, a UI, or a framework — it's the raw C/C++ inference engine that everything else is built on.
Key Stats
| Metric | Value |
|---|---|
| GitHub Stars | 97,500+ |
| Forks | 15,370+ |
| Contributors | 1,502 |
| Created | March 2023 |
| Organization | ggml-org |
| Language | C++ (97%+) |
| License | MIT |
| Releases | 5,000+ |
| Open Issues | 1,241 |
| Supported Models | 70+ architectures |
| Backends | 15+ hardware platforms |
Why llama.cpp Matters
Zero Dependencies
Plain C/C++ implementation without any external dependencies. No Python, no PyTorch, no CUDA toolkit required for basic inference. This makes it uniquely portable.
Universal Hardware Support
llama.cpp runs on virtually any hardware:
- Apple Silicon: First-class citizen via ARM NEON, Accelerate, and Metal
- x86: AVX, AVX2, AVX512, and AMX support
- RISC-V: RVV, ZVFH, ZFH extensions
- NVIDIA GPUs: Custom CUDA kernels
- AMD GPUs: Via HIP
- Intel GPUs: Via SYCL
- Qualcomm: Hexagon backend (in progress)
- WebGPU: Browser inference (in progress)
Aggressive Quantization
1.5-bit to 8-bit integer quantization lets you run models that would otherwise need tens of gigabytes of VRAM on a laptop with 8 GB of RAM. This single feature democratized local LLM inference.
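The memory savings follow directly from bits-per-weight arithmetic. The sketch below is a back-of-envelope estimate (real GGUF files are slightly larger because of quantization scales and metadata, so treat these as lower bounds):

```python
# Back-of-envelope model size: parameters * bits-per-weight / 8 bytes.
# Real quantized files carry extra per-block scale data and metadata,
# so these numbers are approximate lower bounds.

def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

params_7b = 7e9
print(f"FP16  : {approx_size_gb(params_7b, 16):.1f} GB")    # 14.0 GB
print(f"8-bit : {approx_size_gb(params_7b, 8):.1f} GB")     # 7.0 GB
print(f"4-bit : {approx_size_gb(params_7b, 4):.1f} GB")     # 3.5 GB
print(f"1.58b : {approx_size_gb(params_7b, 1.58):.1f} GB")  # 1.4 GB
```

This is why a 4-bit 7B model fits comfortably in 8 GB of RAM while its FP16 original does not.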
The GGUF Standard
llama.cpp defined the GGUF (GGML Unified Format) file format, which has become the universal format for sharing quantized models. Hugging Face hosts thousands of GGUF models, and every major model gets a GGUF conversion within hours of release.
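A GGUF file is self-describing: per the public GGUF spec, it opens with a fixed header (4-byte magic `GGUF`, a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count, all little-endian). A minimal sketch of reading that header:

```python
import struct

# Minimal GGUF header reader, following the public GGUF spec:
# magic "GGUF" (4 bytes), uint32 version, uint64 tensor count,
# uint64 metadata key/value count -- all little-endian.

def read_gguf_header(data: bytes) -> dict:
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version,
            "tensor_count": n_tensors,
            "metadata_kv_count": n_kv}

# Synthetic header for illustration (version 3, 291 tensors, 24 metadata keys):
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(header))
```

The metadata key/value section that follows the header is what lets one file carry the tokenizer, chat template, and architecture details alongside the weights.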
Supported Model Architectures
llama.cpp supports 70+ model architectures, including:
Text-Only Models
| Category | Models |
|---|---|
| Meta | LLaMA, LLaMA 2, LLaMA 3 |
| Mistral | Mistral 7B, Mixtral MoE |
| Google | Gemma, Gemma 2, Gemma 3 |
| Microsoft | Phi, Phi-2, Phi-3, Phi-4, PhiMoE |
| Alibaba | Qwen, Qwen 2, Qwen 2.5, QwQ |
| DeepSeek | DeepSeek, DeepSeek V2/V3, DeepSeek R1 |
| Cohere | Command-R, Command-R+ |
| AI21 | Jamba |
| Databricks | DBRX |
| IBM | Granite models |
| TII | Falcon series |
| Code | Starcoder, CodeShell, Refact |
| Open-Source | OLMo, OLMoE, GPT-NeoX, Pythia, GPT-2, BERT, Bloom |
| Chinese | Chinese LLaMA, Baichuan, Yi, Aquila |
| Classic | Koala, StableLM, MPT, Mamba |
| Specialized | BitNet b1.58, Flan T5, OpenELM, GritLM |
Multimodal Models (Vision + Language)
- LLaVA 1.5/1.6, MobileVLM, Obsidian, Llama 3.2 Vision
- Gemma 3 Vision, SmolVLM, Pixtral, Qwen 2 VL/2.5 VL
- Mistral Small 3.1, InternVL 2.5, Phi 3/3.5/4 Vision
- And many more
Supported Backends
| Backend | Hardware | Status |
|---|---|---|
| Metal | Apple GPU | ✅ Stable |
| CUDA | NVIDIA GPU | ✅ Stable |
| HIP | AMD GPU | ✅ Stable |
| Vulkan | Cross-platform GPU | ✅ Stable |
| SYCL | Intel GPU | ✅ Stable |
| MUSA | Moore Threads GPU | ✅ Stable |
| CANN | Ascend NPU | ✅ Stable |
| OpenCL | Cross-platform | ✅ Stable |
| BLAS | CPU (optimized) | ✅ Stable |
| BLIS | CPU (BLAS-like) | ✅ Stable |
| ZenDNN | AMD CPU | ✅ Stable |
| RPC | Remote procedure call | ✅ Stable |
| IBM zDNN | IBM Z processors | ✅ Stable |
| WebGPU | Browser | 🔄 In Progress |
| Hexagon | Qualcomm Snapdragon | 🔄 In Progress |
| VirtGPU | Virtual GPU | ✅ Stable |
A unique feature is CPU+GPU hybrid inference: if a model is too large for your GPU's VRAM, llama.cpp splits the layers between CPU and GPU. You choose how many layers to offload with the `-ngl` (`--n-gpu-layers`) flag, and the remaining layers run on the CPU.
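Choosing a value for `-ngl` comes down to dividing your VRAM budget by the per-layer weight size. The sketch below uses illustrative numbers, not measured ones, and reserves a fixed overhead for the KV cache and scratch buffers (a simplifying assumption):

```python
# Rough sketch: estimate how many transformer layers fit in a VRAM budget.
# All figures here are illustrative assumptions, not measured values.

def layers_that_fit(vram_gb: float, n_layers: int, model_gb: float,
                    overhead_gb: float = 1.0) -> int:
    """Estimate an -ngl value: layers that fit after reserving
    overhead for the KV cache and scratch buffers."""
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - overhead_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a ~4 GB 4-bit 7B model with 32 layers on an 8 GB GPU:
print(layers_that_fit(vram_gb=8.0, n_layers=32, model_gb=4.0))  # 32 (all fit)
# the same model on a 4 GB GPU:
print(layers_that_fit(vram_gb=4.0, n_layers=32, model_gb=4.0))  # 24
```

You would then pass the result on the command line, e.g. `llama-cli -m model.gguf -ngl 24`.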
Core Tools
llama-cli
The primary command-line interface for text generation:
# Run a local model
llama-cli -m model.gguf
# Download and run from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
# Conversation mode with chat template
llama-cli -m model.gguf -cnv
llama-server
A lightweight, OpenAI API-compatible HTTP server:
# Start server on port 8080
llama-server -m model.gguf --port 8080
# With speculative decoding for faster inference
llama-server -m model.gguf -md draft.gguf
Supports chat completions, embeddings, reranking, custom grammars, JSON structured output, and parallel decoding across multiple request slots (set with the `--parallel` flag).
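Because the server speaks the OpenAI chat-completions wire format, any HTTP client works. Here is a minimal sketch using only the standard library, assuming a server started with `llama-server -m model.gguf --port 8080` (the `model` field is a placeholder; llama-server serves whichever model it was started with):

```python
import json
import urllib.request

def build_chat_request(prompt: str, n_predict: int = 128) -> dict:
    """Payload in the OpenAI chat-completions shape that llama-server accepts."""
    return {
        "model": "local",  # placeholder; the server uses its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": n_predict,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST to the OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Inspect the request payload (calling chat() requires a running server):
print(json.dumps(build_chat_request("Explain GGUF in one sentence."), indent=2))
```

The same payload works with the official OpenAI client libraries by pointing their base URL at the local server.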
llama-bench
Benchmarking tool for measuring inference performance across different configurations.
llama-perplexity
Evaluate model quality by computing perplexity scores on datasets.
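Perplexity is the exponential of the mean negative log-likelihood the model assigns to the reference tokens; lower is better. The arithmetic `llama-perplexity` performs over a corpus reduces to:

```python
import math

# Perplexity = exp(mean negative log-likelihood of the reference tokens).
# A lower score means the model found the text less "surprising".

def perplexity(token_probs: list[float]) -> float:
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every reference token scores 4:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

This is why perplexity is the standard way to measure how much quality a given quantization level costs: run it on the FP16 and quantized files and compare the scores.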
The Ecosystem
llama.cpp has spawned a massive ecosystem:
IDE Extensions
- llama.vscode — VS Code extension for FIM (Fill-in-the-Middle) completions
- llama.vim — Vim/Neovim plugin for FIM completions
Hugging Face Integration
- GGUF-my-repo — Convert any model to GGUF format online
- GGUF-my-LoRA — Convert LoRA adapters to GGUF
- GGUF-editor — Edit GGUF metadata in your browser
- Inference Endpoints — Host llama.cpp in the cloud via Hugging Face
Infrastructure & Deployment
- Ollama — The most popular llama.cpp wrapper for local inference
- GPUStack — Manage GPU clusters for running LLMs
- Paddler — LLMOps platform for hosting and scaling
- LLMKube — Kubernetes operator with multi-GPU support
- llama-swap — Transparent proxy with automatic model switching
- Kalavai — Crowdsourced LLM deployment
NVIDIA Collaboration
Support for the GPT-OSS model with native MXFP4 format was added in collaboration with NVIDIA's RTX AI Garage.
llama.cpp vs Alternative Inference Engines
| Feature | llama.cpp | Ollama | vLLM | MLX |
|---|---|---|---|---|
| Stars | 97.5K | ~120K | ~50K | ~25K |
| Language | C++ | Go + llama.cpp | Python | Python/C++ |
| Focus | Engine/Library | UX Wrapper | Production Serving | Apple Silicon |
| Quantization | ✅ 1.5-8bit | ✅ via GGUF | ✅ AWQ/GPTQ | ✅ Limited |
| Model Format | GGUF | GGUF (via llama.cpp) | SafeTensors | MLX format |
| Apple Silicon | ✅ First-class | ✅ Uses llama.cpp | ❌ CUDA-focused | ✅ Native |
| NVIDIA GPU | ✅ CUDA | ✅ CUDA | ✅ Optimized | ❌ |
| OpenAI API | ✅ llama-server | ✅ Built-in | ✅ Built-in | ❌ |
| Concurrency | Basic (parallel slots) | Basic | ✅ PagedAttention | Basic |
| Single-user | ✅ Best latency | ✅ Fast | Heavier overhead | ✅ Fast |
| Multi-user | Limited | Limited | ✅ Best throughput | Limited |
| Dependencies | None | Minimal | PyTorch ecosystem | Apple frameworks |
| Use Case | Edge/Embedded/Library | Desktop/Easy setup | Production/Scale | Mac development |
When to Choose Each
- llama.cpp: When you need the raw engine — maximum portability, minimal dependencies, embedding in other applications, or running on edge/exotic hardware. Also the best choice when you need the lowest single-user latency.
- Ollama: When you want the easiest possible setup — one command to pull and run any model. Built on top of llama.cpp.
- vLLM: When you need production-grade multi-user serving — high throughput, PagedAttention, tensor parallelism, and scaling to hundreds of concurrent users.
- MLX: When you're exclusively on Apple Silicon and want native Apple framework integration for ML research and development.
Quick Start
Install via Package Manager
# macOS
brew install llama.cpp
# nixpkgs
nix profile install nixpkgs#llama-cpp
# Windows
winget install llama.cpp
Run a Model
# Download and run from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
# Or use a local file
llama-cli -m ./my-model.gguf
# Start an API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
Build from Source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
Frequently Asked Questions
What is GGUF?
GGUF (GGML Unified Format) is the model file format used by llama.cpp. It's a single-file format that contains model weights, tokenizer, and metadata. Most models on Hugging Face are available in GGUF format.
Can I run it without a GPU?
Yes. llama.cpp was originally designed for CPU-only inference. With quantization, you can run 7B models on a laptop with 8GB of RAM.
What's the relationship with Ollama?
Ollama uses llama.cpp as its inference backend. Ollama adds a user-friendly CLI, model registry, and API server on top.
Does it support fine-tuned models?
Yes. Any model based on a supported architecture works, including LoRA fine-tunes and merged models.
How fast is it?
Performance depends on hardware and quantization. On an M2 MacBook Pro, you can expect 30-50+ tokens/second for 7B models at Q4_K_M quantization.
Conclusion
llama.cpp is arguably the single most important open-source project in the local AI revolution. By providing a zero-dependency, pure C/C++ inference engine that runs on everything from a Raspberry Pi to a multi-GPU server, Georgi Gerganov and 1,500+ contributors have democratized access to large language models.
With 97,500+ stars, 5,000+ releases, and support for 70+ model architectures across 15+ hardware backends, llama.cpp isn't just a library — it's the foundation layer on which the entire local LLM ecosystem is built. Ollama, LM Studio, GPT4All, Jan, and countless other tools all depend on it.
Whether you're building an edge AI application, prototyping with local models, or deploying a private inference server, llama.cpp is the engine that makes it possible.
