llama.cpp: The Complete Guide to the Most Important Open-Source LLM Inference Engine

In March 2023, one developer — Georgi Gerganov — wrote a single C++ file that could run Meta's LLaMA model on a MacBook. That file became llama.cpp, and it fundamentally changed how the world runs large language models.
Three years later, llama.cpp has grown into one of the most-starred LLM inference projects on GitHub, with 97,500+ stars, 15,370+ forks, and 1,502 contributors. It powers everything from Ollama to NVIDIA's RTX AI Garage, and it has become the de facto standard for running LLMs locally on consumer hardware.
This isn't a wrapper, a UI, or a framework — it's the raw C/C++ inference engine that everything else is built on.
Key Stats
| Metric | Value |
|---|---|
| GitHub Stars | 97,500+ |
| Forks | 15,370+ |
| Contributors | 1,502 |
| Created | March 2023 |
| Organization | ggml-org |
| Language | C++ (97%+) |
| License | MIT |
| Releases | 5,000+ |
| Open Issues | 1,241 |
| Supported Models | 70+ architectures |
| Backends | 15+ hardware platforms |
Why llama.cpp Matters
Zero Dependencies
Plain C/C++ implementation without any external dependencies. No Python, no PyTorch, no CUDA toolkit required for basic inference. This makes it uniquely portable.
Universal Hardware Support
llama.cpp runs on virtually any hardware:
- Apple Silicon: First-class citizen via ARM NEON, Accelerate, and Metal
- x86: AVX, AVX2, AVX512, and AMX support
- RISC-V: RVV, ZVFH, ZFH extensions
- NVIDIA GPUs: Custom CUDA kernels
- AMD GPUs: Via HIP
- Intel GPUs: Via SYCL
- Qualcomm: Hexagon backend (in progress)
- WebGPU: Browser inference (in progress)
Aggressive Quantization
1.5-bit to 8-bit integer quantization lets you run models that would otherwise need tens of gigabytes of VRAM on a laptop with 8 GB of RAM. This single feature democratized local LLM inference.
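The memory savings follow directly from bits-per-weight arithmetic. The sketch below is a back-of-envelope estimate (real GGUF files are slightly larger because of quantization scales and metadata, so treat these as lower bounds):

```python
# Back-of-envelope model size: parameters * bits-per-weight / 8 bytes.
# Real quantized files carry extra per-block scale data and metadata,
# so these numbers are approximate lower bounds.

def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

params_7b = 7e9
print(f"FP16  : {approx_size_gb(params_7b, 16):.1f} GB")    # 14.0 GB
print(f"8-bit : {approx_size_gb(params_7b, 8):.1f} GB")     # 7.0 GB
print(f"4-bit : {approx_size_gb(params_7b, 4):.1f} GB")     # 3.5 GB
print(f"1.58b : {approx_size_gb(params_7b, 1.58):.1f} GB")  # 1.4 GB
```

This is why a 4-bit 7B model fits comfortably in 8 GB of RAM while its FP16 original does not.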
The GGUF Standard
llama.cpp defined the GGUF (GGML Unified Format) file format, which has become the universal format for sharing quantized models. Hugging Face hosts thousands of GGUF models, and every major model gets a GGUF conversion within hours of release.
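A GGUF file is self-describing: per the public GGUF spec, it opens with a fixed header (4-byte magic `GGUF`, a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count, all little-endian). A minimal sketch of reading that header:

```python
import struct

# Minimal GGUF header reader, following the public GGUF spec:
# magic "GGUF" (4 bytes), uint32 version, uint64 tensor count,
# uint64 metadata key/value count -- all little-endian.

def read_gguf_header(data: bytes) -> dict:
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version,
            "tensor_count": n_tensors,
            "metadata_kv_count": n_kv}

# Synthetic header for illustration (version 3, 291 tensors, 24 metadata keys):
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(header))
```

The metadata key/value section that follows the header is what lets one file carry the tokenizer, chat template, and architecture details alongside the weights.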
Supported Model Architectures
llama.cpp supports 70+ model architectures, including:
Text-Only Models
| Category | Models |
|---|---|
| Meta | LLaMA, LLaMA 2, LLaMA 3 |
| Mistral | Mistral 7B, Mixtral MoE |
| Google | Gemma, Gemma 2, Gemma 3 |
| Microsoft | Phi, Phi-2, Phi-3, Phi-4, PhiMoE |
| Alibaba | Qwen, Qwen 2, Qwen 2.5, QwQ |
| DeepSeek | DeepSeek, DeepSeek V2/V3, DeepSeek R1 |
| Cohere | Command-R, Command-R+ |
| AI21 | Jamba |
| Databricks | DBRX |
| IBM | Granite models |
| TII | Falcon series |
| Code | Starcoder, CodeShell, Refact |
| Open-Source | OLMo, OLMoE, GPT-NeoX, Pythia, GPT-2, BERT, Bloom |
| Chinese | Chinese LLaMA, Baichuan, Yi, Aquila |
| Classic | Koala, StableLM, MPT, Mamba |
| Specialized | BitNet b1.58, Flan T5, OpenELM, GritLM |
Multimodal Models (Vision + Language)
- LLaVA 1.5/1.6, MobileVLM, Obsidian, Llama 3.2 Vision
- Gemma 3 Vision, SmolVLM, Pixtral, Qwen 2 VL/2.5 VL
- Mistral Small 3.1, InternVL 2.5, Phi 3/3.5/4 Vision
- And many more
Supported Backends
| Backend | Hardware | Status |
|---|---|---|
| Metal | Apple GPU | ✅ Stable |
| CUDA | NVIDIA GPU | ✅ Stable |
| HIP | AMD GPU | ✅ Stable |
| Vulkan | Cross-platform GPU | ✅ Stable |
| SYCL | Intel GPU | ✅ Stable |
| MUSA | Moore Threads GPU | ✅ Stable |
| CANN | Ascend NPU | ✅ Stable |
| OpenCL | Cross-platform | ✅ Stable |
| BLAS | CPU (optimized) | ✅ Stable |
| BLIS | CPU (BLAS-like) | ✅ Stable |
| ZenDNN | AMD CPU | ✅ Stable |
| RPC | Remote procedure call | ✅ Stable |
| IBM zDNN | IBM Z processors | ✅ Stable |
| WebGPU | Browser | 🔄 In Progress |
| Hexagon | Qualcomm Snapdragon | 🔄 In Progress |
| VirtGPU | Virtual GPU | ✅ Stable |
A unique feature is CPU+GPU hybrid inference: if a model is too large for your GPU's VRAM, llama.cpp splits the layers between CPU and GPU. You choose how many layers to offload with the `-ngl` (`--n-gpu-layers`) flag, and the remaining layers run on the CPU.
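Choosing a value for `-ngl` comes down to dividing your VRAM budget by the per-layer weight size. The sketch below uses illustrative numbers, not measured ones, and reserves a fixed overhead for the KV cache and scratch buffers (a simplifying assumption):

```python
# Rough sketch: estimate how many transformer layers fit in a VRAM budget.
# All figures here are illustrative assumptions, not measured values.

def layers_that_fit(vram_gb: float, n_layers: int, model_gb: float,
                    overhead_gb: float = 1.0) -> int:
    """Estimate an -ngl value: layers that fit after reserving
    overhead for the KV cache and scratch buffers."""
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - overhead_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a ~4 GB 4-bit 7B model with 32 layers on an 8 GB GPU:
print(layers_that_fit(vram_gb=8.0, n_layers=32, model_gb=4.0))  # 32 (all fit)
# the same model on a 4 GB GPU:
print(layers_that_fit(vram_gb=4.0, n_layers=32, model_gb=4.0))  # 24
```

You would then pass the result on the command line, e.g. `llama-cli -m model.gguf -ngl 24`.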
Core Tools
llama-cli
The primary command-line interface for text generation:
# Run a local model
llama-cli -m model.gguf
# Download and run from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
# Conversation mode with chat template
llama-cli -m model.gguf -cnv
llama-server
A lightweight, OpenAI API-compatible HTTP server:
# Start server on port 8080
llama-server -m model.gguf --port 8080
# With speculative decoding for faster inference
llama-server -m model.gguf -md draft.gguf
Supports chat completions, embeddings, reranking, custom grammars, JSON structured output, and parallel decoding across multiple request slots (set with the `--parallel` flag).
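Because the server speaks the OpenAI chat-completions wire format, any HTTP client works. Here is a minimal sketch using only the standard library, assuming a server started with `llama-server -m model.gguf --port 8080` (the `model` field is a placeholder; llama-server serves whichever model it was started with):

```python
import json
import urllib.request

def build_chat_request(prompt: str, n_predict: int = 128) -> dict:
    """Payload in the OpenAI chat-completions shape that llama-server accepts."""
    return {
        "model": "local",  # placeholder; the server uses its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": n_predict,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST to the OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Inspect the request payload (calling chat() requires a running server):
print(json.dumps(build_chat_request("Explain GGUF in one sentence."), indent=2))
```

The same payload works with the official OpenAI client libraries by pointing their base URL at the local server.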
llama-bench
Benchmarking tool for measuring inference performance across different configurations.
llama-perplexity
Evaluate model quality by computing perplexity scores on datasets.
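Perplexity is the exponential of the mean negative log-likelihood the model assigns to the reference tokens; lower is better. The arithmetic `llama-perplexity` performs over a corpus reduces to:

```python
import math

# Perplexity = exp(mean negative log-likelihood of the reference tokens).
# A lower score means the model found the text less "surprising".

def perplexity(token_probs: list[float]) -> float:
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every reference token scores 4:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

This is why perplexity is the standard way to measure how much quality a given quantization level costs: run it on the FP16 and quantized files and compare the scores.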
The Ecosystem
llama.cpp has spawned a massive ecosystem:
IDE Extensions
- llama.vscode — VS Code extension for FIM (Fill-in-the-Middle) completions
- llama.vim — Vim/Neovim plugin for FIM completions
Hugging Face Integration
- GGUF-my-repo — Convert any model to GGUF format online
- GGUF-my-LoRA — Convert LoRA adapters to GGUF
- GGUF-editor — Edit GGUF metadata in your browser
- Inference Endpoints — Host llama.cpp in the cloud via Hugging Face
Infrastructure & Deployment
- Ollama — The most popular llama.cpp wrapper for local inference
- GPUStack — Manage GPU clusters for running LLMs
- Paddler — LLMOps platform for hosting and scaling
- LLMKube — Kubernetes operator with multi-GPU support
- llama-swap — Transparent proxy with automatic model switching
- Kalavai — Crowdsourced LLM deployment
NVIDIA Collaboration
Support for the GPT-OSS model with native MXFP4 format was added in collaboration with NVIDIA's RTX AI Garage.
llama.cpp vs Alternative Inference Engines
| Feature | llama.cpp | Ollama | vLLM | MLX |
|---|---|---|---|---|
| Stars | 97.5K | ~120K | ~50K | ~25K |
| Language | C++ | Go + llama.cpp | Python | Python/C++ |
| Focus | Engine/Library | UX Wrapper | Production Serving | Apple Silicon |
| Quantization | ✅ 1.5-8bit | ✅ via GGUF | ✅ AWQ/GPTQ | ✅ Limited |
| Model Format | GGUF | GGUF (via llama.cpp) | SafeTensors | MLX format |
| Apple Silicon | ✅ First-class | ✅ Uses llama.cpp | ❌ CUDA-focused | ✅ Native |
| NVIDIA GPU | ✅ CUDA | ✅ CUDA | ✅ Optimized | ❌ |
| OpenAI API | ✅ llama-server | ✅ Built-in | ✅ Built-in | ❌ |
| Concurrency | Basic (parallel slots) | Basic | ✅ PagedAttention | Basic |
| Single-user | ✅ Best latency | ✅ Fast | Heavier overhead | ✅ Fast |
| Multi-user | Limited | Limited | ✅ Best throughput | Limited |
| Dependencies | None | Minimal | PyTorch ecosystem | Apple frameworks |
| Use Case | Edge/Embedded/Library | Desktop/Easy setup | Production/Scale | Mac development |
When to Choose Each
- llama.cpp: When you need the raw engine — maximum portability, minimal dependencies, embedding in other applications, or running on edge/exotic hardware. Also the best choice when you need the lowest single-user latency.
- Ollama: When you want the easiest possible setup — one command to pull and run any model. Built on top of llama.cpp.
- vLLM: When you need production-grade multi-user serving — high throughput, PagedAttention, tensor parallelism, and scaling to hundreds of concurrent users.
- MLX: When you're exclusively on Apple Silicon and want native Apple framework integration for ML research and development.
Quick Start
Install via Package Manager
# macOS
brew install llama.cpp
# nixpkgs
nix profile install nixpkgs#llama-cpp
# Windows
winget install llama.cpp
Run a Model
# Download and run from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
# Or use a local file
llama-cli -m ./my-model.gguf
# Start an API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
Build from Source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
Frequently Asked Questions
What is GGUF?
GGUF (GGML Unified Format) is the model file format used by llama.cpp. It's a single-file format that contains model weights, tokenizer, and metadata. Most models on Hugging Face are available in GGUF format.
Can I run it without a GPU?
Yes. llama.cpp was originally designed for CPU-only inference. With quantization, you can run 7B models on a laptop with 8GB of RAM.
What's the relationship with Ollama?
Ollama uses llama.cpp as its inference backend. Ollama adds a user-friendly CLI, model registry, and API server on top.
Does it support fine-tuned models?
Yes. Any model based on a supported architecture works, including LoRA fine-tunes and merged models.
How fast is it?
Performance depends on hardware and quantization. On an M2 MacBook Pro, you can expect 30-50+ tokens/second for 7B models at Q4_K_M quantization.
Conclusion
llama.cpp is arguably the single most important open-source project in the local AI revolution. By providing a zero-dependency, pure C/C++ inference engine that runs on everything from a Raspberry Pi to a multi-GPU server, Georgi Gerganov and 1,500+ contributors have democratized access to large language models.
With 97,500+ stars, 5,000+ releases, and support for 70+ model architectures across 15+ hardware backends, llama.cpp isn't just a library — it's the foundation layer on which the entire local LLM ecosystem is built. Ollama, LM Studio, GPT4All, Jan, and countless other tools all depend on it.
Whether you're building an edge AI application, prototyping with local models, or deploying a private inference server, llama.cpp is the engine that makes it possible.
