LLMFit: The Complete Guide to Finding Which LLMs Run on Your Hardware
Hundreds of models. One command. LLMFit is a Rust-powered terminal tool that right-sizes LLMs to your system's RAM, CPU, and GPU. It detects your hardware, scores each model across quality, speed, fit, and context dimensions, and tells you which ones will actually run well on your machine. With 12,900+ stars, multi-GPU support, MoE-aware memory estimation, Ollama/llama.cpp integration, and a polished TUI, this is a tool every local AI developer should know about.
What Is LLMFit?
A terminal tool (TUI + CLI + REST API) that answers the question: "Which LLMs can I actually run on my hardware?"
- Language: Rust
- License: MIT
- Stars: 12,900+ ⭐
- Forks: 719
- Contributors: 30
- Releases: 50
Hardware Detection
LLMFit probes your system automatically:
| Hardware | Detection Method |
|---|---|
| NVIDIA | Multi-GPU via nvidia-smi, VRAM aggregation |
| AMD | rocm-smi |
| Intel Arc | Discrete VRAM via sysfs, integrated via lspci |
| Apple Silicon | Unified memory via system_profiler |
| Ascend NPU | npu-smi |
| Backend | Auto-detects CUDA, Metal, ROCm, SYCL, CPU ARM/x86, Ascend |
Multi-Dimensional Scoring
Each model scores 0–100 on four dimensions:
| Dimension | What It Measures |
|---|---|
| Quality | Parameter count, model family, quantization penalty, task alignment |
| Speed | Estimated tok/s from GPU bandwidth, params, quantization |
| Fit | Memory utilization efficiency (sweet spot: 50–80%) |
| Context | Context window capability vs target |
Weights vary by use case: Chat weights Speed more heavily (0.35), while Reasoning emphasizes Quality (0.55).
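The composite score can be pictured as a weighted sum over the four dimensions. In this sketch, only the Chat Speed weight (0.35) and the Reasoning Quality weight (0.55) come from the description above; the remaining weights are illustrative placeholders, not LLMFit's actual values.

```python
# Hypothetical sketch of use-case-weighted composite scoring.
# Only speed=0.35 (chat) and quality=0.55 (reasoning) are documented;
# the other weights here are illustrative guesses that sum to 1.0.

WEIGHTS = {
    "chat":      {"quality": 0.25, "speed": 0.35, "fit": 0.25, "context": 0.15},
    "reasoning": {"quality": 0.55, "speed": 0.15, "fit": 0.15, "context": 0.15},
}

def composite_score(scores: dict, use_case: str) -> float:
    """Weighted sum of the four 0-100 dimension scores."""
    w = WEIGHTS[use_case]
    return sum(scores[dim] * w[dim] for dim in w)

scores = {"quality": 80, "speed": 60, "fit": 90, "context": 70}
print(composite_score(scores, "chat"))       # → 74.0 (speed-heavy weighting)
print(composite_score(scores, "reasoning"))  # → 77.0 (quality-heavy weighting)
```

The same raw dimension scores rank differently per use case, which is why a model can top the Reasoning list yet trail in Chat.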
Key Features
Dynamic Quantization
Walks the Q8_0 → Q2_K hierarchy, picking the highest-quality quantization that fits. If nothing fits at full context, it retries at half context.
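The walk described above can be sketched as a nested search: highest-quality quantization first, halving the context window if nothing fits. The bits-per-parameter figures and the KV-cache estimate below are rough placeholders, not LLMFit's actual formulas.

```python
# Illustrative sketch of the Q8_0 → Q2_K quantization walk.
# Bits-per-param values approximate GGUF effective sizes; the
# KV-cache term is a crude placeholder, not LLMFit's real model.

QUANT_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
              "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.4}

def weights_gb(params_b: float, bits: float) -> float:
    return params_b * bits / 8  # billions of params × bytes per param

def kv_cache_gb(context: int) -> float:
    return context * 0.0001  # crude per-token placeholder estimate

def pick_quant(params_b: float, budget_gb: float, context: int):
    for ctx in (context, context // 2):  # retry at half context if needed
        for name, bits in QUANT_BITS.items():  # highest quality first
            if weights_gb(params_b, bits) + kv_cache_gb(ctx) <= budget_gb:
                return name, ctx
    return None

print(pick_quant(8.0, 9.0, 8192))  # → ('Q6_K', 8192)
print(pick_quant(8.0, 4.0, 8192))  # → ('Q2_K', 4096), via the half-context retry
```

The second call shows the fallback: even Q2_K overflows a 4 GB budget at 8K context, so the walk repeats at 4K context and succeeds.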
MoE Support
Mixtral and DeepSeek-V2/V3 are detected automatically. Only active experts are counted toward VRAM, so Mixtral 8x7B drops from 23.9 GB to ~6.6 GB.
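The Mixtral numbers check out as back-of-envelope arithmetic: at roughly 4.1 bits per parameter (a typical effective rate for 4-bit GGUF quantization, an assumption here), counting only active parameters shrinks the estimate from ~24 GB to ~6.6 GB. The parameter counts are Mixtral's published figures.

```python
# Back-of-envelope check of the MoE figure above. Mixtral 8x7B has
# ~46.7B total parameters but only ~12.9B active per token (2 routed
# experts + shared layers). 4.1 bits/param is an assumed effective
# rate for 4-bit quantization, not LLMFit's exact constant.

BITS_PER_PARAM = 4.1

def vram_gb(params_billion: float) -> float:
    return params_billion * BITS_PER_PARAM / 8

total_params = 46.7   # all 8 experts + shared layers
active_params = 12.9  # 2 routed experts + shared layers per token

print(f"dense estimate: {vram_gb(total_params):.1f} GB")  # → 23.9 GB
print(f"MoE estimate:   {vram_gb(active_params):.1f} GB")  # → 6.6 GB
```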
Speed Estimation
A memory-bandwidth-bound formula: (bandwidth_GB_s / model_size_GB) × 0.55. The bandwidth database covers ~80 GPUs across NVIDIA, AMD, and Apple Silicon.
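Applied to concrete hardware, the formula above gives intuition for why the same model runs very differently across machines. The bandwidth figures below are public spec-sheet numbers, and the ~4.6 GB model size assumes a 4-bit-quantized 8B model; neither is taken from LLMFit's database.

```python
# The bandwidth-bound throughput formula quoted above, applied to two
# illustrative configurations. Bandwidths are spec-sheet values;
# 0.55 is the efficiency factor from the formula.

EFFICIENCY = 0.55

def estimated_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb * EFFICIENCY

# RTX 4090 (~1008 GB/s) running a ~4.6 GB 4-bit 8B model
print(round(estimated_tok_s(1008, 4.6), 1))  # → 120.5
# Apple M2 Max (~400 GB/s) running the same model
print(round(estimated_tok_s(400, 4.6), 1))   # → 47.8
```

Because generation is memory-bound, halving the model size (e.g. a more aggressive quantization) roughly doubles estimated tokens per second.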
Run Modes
- GPU — Model fits in VRAM
- MoE — Expert offloading (active in VRAM, inactive in RAM)
- CPU+GPU — Partial GPU offload
- CPU — Full system RAM
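The mode selection above can be sketched as a simple decision function. This is a simplification: LLMFit's real logic also accounts for quantization, context length, and offload ratios, but the basic ordering follows the list.

```python
# Simplified sketch of choosing among the four run modes listed above.
# Real selection also weighs quantization, context, and offload ratios.

def run_mode(model_gb, vram_gb, ram_gb, active_gb=None):
    """Pick a run mode; active_gb is the active-expert size for MoE models."""
    if model_gb <= vram_gb:
        return "GPU"       # whole model fits in VRAM
    if active_gb is not None and active_gb <= vram_gb \
            and model_gb <= vram_gb + ram_gb:
        return "MoE"       # active experts in VRAM, inactive experts in RAM
    if vram_gb > 0 and model_gb <= vram_gb + ram_gb:
        return "CPU+GPU"   # partial GPU offload
    if model_gb <= ram_gb:
        return "CPU"       # everything in system RAM
    return None            # does not fit

print(run_mode(4.6, 8, 32))                  # → GPU
print(run_mode(23.9, 8, 32, active_gb=6.6))  # → MoE
print(run_mode(23.9, 8, 32))                 # → CPU+GPU
print(run_mode(23.9, 0, 32))                 # → CPU
```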
Three Interfaces
TUI (Default)
llmfit
Interactive terminal UI: system specs at top, models in scrollable table sorted by composite score. Each row shows score, tok/s, best quantization, run mode, memory usage, use-case category.
Plan Mode (p) — Inverts the question: "What hardware is needed for this model?" Shows min/recommended VRAM/RAM/CPU cores, feasible run paths, upgrade deltas.
CLI Mode
llmfit --cli # Table of all models
llmfit fit --perfect -n 5 # Top 5 perfect fits
llmfit search "llama 8b" # Search by name/provider/size
llmfit recommend --json --use-case coding --limit 3
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
REST API
llmfit serve --host 0.0.0.0 --port 8787
curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"
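The endpoint can also be queried programmatically. This stdlib-only sketch builds the same URL as the curl example; only the parameters shown there (limit, min_fit, use_case) are assumed, and the fetch itself is left commented out since it requires a running `llmfit serve` instance.

```python
# Minimal client sketch for the /api/v1/models/top endpoint shown above,
# using only the Python standard library. Query parameters are the ones
# from the curl example; no other parameters are assumed.
import json
import urllib.parse
import urllib.request

def top_models_url(host="localhost", port=8787, **params):
    query = urllib.parse.urlencode(params)
    return f"http://{host}:{port}/api/v1/models/top?{query}"

url = top_models_url(limit=5, min_fit="good", use_case="coding")
print(url)

# With `llmfit serve` running, fetch and pretty-print the response:
# with urllib.request.urlopen(url) as resp:
#     print(json.dumps(json.load(resp), indent=2))
```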
Runtime Integration
Ollama
Auto-detects installed models and can download new ones from the TUI. Supports remote instances via the OLLAMA_HOST environment variable.
llama.cpp
Maps HuggingFace models to GGUF repos, downloads them into a local cache, and detects already-installed GGUF files.
LLMFit vs Alternatives
Category: local LLM hardware compatibility analyzer.
| Feature | LLMFit | llm-checker | Ollama |
|---|---|---|---|
| Focus | Hardware→Model matching | Actual model benchmarking | LLM runtime |
| Stars | 12.9K ⭐ | 1.4K ⭐ | 164K ⭐ |
| License | MIT | Other | MIT |
| Language | Rust | JavaScript | Go |
| TUI | ✅ Interactive | ❌ CLI only | ❌ |
| REST API | ✅ llmfit serve | ❌ | ✅ |
| Multi-GPU | ✅ VRAM aggregation | ❌ | ✅ |
| MoE Support | ✅ Expert offloading | ❌ Dense only | ✅ |
| Apple Silicon | ✅ Unified memory | ✅ | ✅ |
| Intel Arc | ✅ | ❌ | ❌ |
| Ascend NPU | ✅ | ❌ | ❌ |
| Dynamic Quantization | ✅ Q8_0→Q2_K | ❌ | ✅ Auto |
| Speed Estimation | ✅ GPU bandwidth-based | ✅ Real benchmark | ❌ |
| Multi-Dim Scoring | ✅ Quality/Speed/Fit/Context | ❌ | ❌ |
| Plan Mode | ✅ "What hardware do I need?" | ❌ | ❌ |
| Use-Case Filtering | ✅ Coding/Reasoning/Chat/etc. | ❌ | ❌ |
| Ollama Integration | ✅ Detect + Download | ✅ Pull + Benchmark | N/A |
| llama.cpp Integration | ✅ GGUF download | ❌ | Built-in |
| JSON Output | ✅ | ✅ | ✅ |
| Themes | ✅ 6 built-in | ❌ | ❌ |
| Model Database | ✅ HuggingFace (hundreds) | ✅ | ✅ Registry |
| Agent Skill | ✅ OpenClaw | ❌ | ❌ |
| Runs Models | ❌ Recommends | ✅ Via Ollama | ✅ Core function |
When to choose LLMFit: You want to know which models fit your hardware before downloading anything. You get multi-dimensional scoring, dynamic quantization, MoE support, Plan mode, and a polished TUI. It's the recommender, not the runner.
When to choose llm-checker: You want to actually benchmark models by running them via Ollama. Real throughput numbers instead of estimates. Simpler but slower (needs to download and run each model).
When to choose Ollama: You want to run LLMs locally, not choose them. Ollama is the runtime; LLMFit integrates with it to tell you what to run.
Quick Start
# macOS/Linux
brew install alexsjones/tap/llmfit
# or
cargo install llmfit
# Windows
winget install llmfit
# Run
llmfit # Interactive TUI
llmfit --cli # Classic CLI
Conclusion
LLMFit solves the "which model should I run?" problem that every local AI user faces. Instead of guessing, downloading, and then discovering your hardware can't handle a model, LLMFit scores hundreds of models against your actual RAM, CPU, and GPU in seconds. The multi-dimensional scoring (quality, speed, fit, context), dynamic quantization selection, MoE-aware memory estimation, and Plan mode ("what hardware do I need?") make it the most sophisticated hardware→model matching tool available. It's built in Rust and ships as a single binary, with 50 releases and 12.9K stars behind it.
