LLMFit: The Complete Guide to Finding Which LLMs Run on Your Hardware
Hundreds of models. One command. LLMFit is a Rust-powered terminal tool that right-sizes LLMs to your system's RAM, CPU, and GPU. It detects your hardware, scores each model across quality, speed, fit, and context dimensions, and tells you which ones will actually run well on your machine. With 12,900+ stars, multi-GPU support, MoE-aware memory estimation, Ollama/llama.cpp integration, and a polished TUI, this is a tool every local AI developer should know about.
What Is LLMFit?
A terminal tool (TUI + CLI + REST API) that answers the question: "Which LLMs can I actually run on my hardware?"
- Language: Rust
- License: MIT
- Stars: 12,900+ ⭐
- Forks: 719
- Contributors: 30
- Releases: 50
Hardware Detection
LLMFit probes your system automatically:
| Hardware | Detection Method |
|---|---|
| NVIDIA | Multi-GPU via nvidia-smi, VRAM aggregation |
| AMD | rocm-smi |
| Intel Arc | Discrete VRAM via sysfs, integrated via lspci |
| Apple Silicon | Unified memory via system_profiler |
| Ascend NPU | npu-smi |
| Backend | Auto-detects CUDA, Metal, ROCm, SYCL, CPU ARM/x86, Ascend |
Multi-Dimensional Scoring
Each model scores 0–100 on four dimensions:
| Dimension | What It Measures |
|---|---|
| Quality | Parameter count, model family, quantization penalty, task alignment |
| Speed | Estimated tok/s from GPU bandwidth, params, quantization |
| Fit | Memory utilization efficiency (sweet spot: 50–80%) |
| Context | Context window capability vs target |
Weights vary by use case: Chat weights Speed more heavily (0.35), while Reasoning emphasizes Quality (0.55).
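The composite score can be pictured as a weighted sum over the four dimensions. In this sketch, only the Chat Speed weight (0.35) and the Reasoning Quality weight (0.55) come from the description above; the remaining weights are illustrative placeholders, not LLMFit's actual values.

```python
# Hypothetical sketch of use-case-weighted composite scoring.
# Only speed=0.35 (chat) and quality=0.55 (reasoning) are documented;
# the other weights here are illustrative guesses that sum to 1.0.

WEIGHTS = {
    "chat":      {"quality": 0.25, "speed": 0.35, "fit": 0.25, "context": 0.15},
    "reasoning": {"quality": 0.55, "speed": 0.15, "fit": 0.15, "context": 0.15},
}

def composite_score(scores: dict, use_case: str) -> float:
    """Weighted sum of the four 0-100 dimension scores."""
    w = WEIGHTS[use_case]
    return sum(scores[dim] * w[dim] for dim in w)

scores = {"quality": 80, "speed": 60, "fit": 90, "context": 70}
print(composite_score(scores, "chat"))       # → 74.0 (speed-heavy weighting)
print(composite_score(scores, "reasoning"))  # → 77.0 (quality-heavy weighting)
```

The same raw dimension scores rank differently per use case, which is why a model can top the Reasoning list yet trail in Chat.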
Key Features
Dynamic Quantization
Walks the Q8_0 → Q2_K hierarchy, picking the highest-quality quantization that fits. If nothing fits at full context, it retries at half context.
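The walk described above can be sketched as a nested search: highest-quality quantization first, halving the context window if nothing fits. The bits-per-parameter figures and the KV-cache estimate below are rough placeholders, not LLMFit's actual formulas.

```python
# Illustrative sketch of the Q8_0 → Q2_K quantization walk.
# Bits-per-param values approximate GGUF effective sizes; the
# KV-cache term is a crude placeholder, not LLMFit's real model.

QUANT_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
              "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.4}

def weights_gb(params_b: float, bits: float) -> float:
    return params_b * bits / 8  # billions of params × bytes per param

def kv_cache_gb(context: int) -> float:
    return context * 0.0001  # crude per-token placeholder estimate

def pick_quant(params_b: float, budget_gb: float, context: int):
    for ctx in (context, context // 2):  # retry at half context if needed
        for name, bits in QUANT_BITS.items():  # highest quality first
            if weights_gb(params_b, bits) + kv_cache_gb(ctx) <= budget_gb:
                return name, ctx
    return None

print(pick_quant(8.0, 9.0, 8192))  # → ('Q6_K', 8192)
print(pick_quant(8.0, 4.0, 8192))  # → ('Q2_K', 4096), via the half-context retry
```

The second call shows the fallback: even Q2_K overflows a 4 GB budget at 8K context, so the walk repeats at 4K context and succeeds.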
MoE Support
Mixtral and DeepSeek-V2/V3 are detected automatically. Only active experts are counted toward VRAM, so Mixtral 8x7B drops from 23.9 GB to ~6.6 GB.
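The Mixtral numbers check out as back-of-envelope arithmetic: at roughly 4.1 bits per parameter (a typical effective rate for 4-bit GGUF quantization, an assumption here), counting only active parameters shrinks the estimate from ~24 GB to ~6.6 GB. The parameter counts are Mixtral's published figures.

```python
# Back-of-envelope check of the MoE figure above. Mixtral 8x7B has
# ~46.7B total parameters but only ~12.9B active per token (2 routed
# experts + shared layers). 4.1 bits/param is an assumed effective
# rate for 4-bit quantization, not LLMFit's exact constant.

BITS_PER_PARAM = 4.1

def vram_gb(params_billion: float) -> float:
    return params_billion * BITS_PER_PARAM / 8

total_params = 46.7   # all 8 experts + shared layers
active_params = 12.9  # 2 routed experts + shared layers per token

print(f"dense estimate: {vram_gb(total_params):.1f} GB")  # → 23.9 GB
print(f"MoE estimate:   {vram_gb(active_params):.1f} GB")  # → 6.6 GB
```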
Speed Estimation
A memory-bandwidth-bound formula: (bandwidth_GB_s / model_size_GB) × 0.55. The bandwidth database covers ~80 GPUs across NVIDIA, AMD, and Apple Silicon.
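Applied to concrete hardware, the formula above gives intuition for why the same model runs very differently across machines. The bandwidth figures below are public spec-sheet numbers, and the ~4.6 GB model size assumes a 4-bit-quantized 8B model; neither is taken from LLMFit's database.

```python
# The bandwidth-bound throughput formula quoted above, applied to two
# illustrative configurations. Bandwidths are spec-sheet values;
# 0.55 is the efficiency factor from the formula.

EFFICIENCY = 0.55

def estimated_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb * EFFICIENCY

# RTX 4090 (~1008 GB/s) running a ~4.6 GB 4-bit 8B model
print(round(estimated_tok_s(1008, 4.6), 1))  # → 120.5
# Apple M2 Max (~400 GB/s) running the same model
print(round(estimated_tok_s(400, 4.6), 1))   # → 47.8
```

Because generation is memory-bound, halving the model size (e.g. a more aggressive quantization) roughly doubles estimated tokens per second.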
Run Modes
- GPU — Model fits in VRAM
- MoE — Expert offloading (active in VRAM, inactive in RAM)
- CPU+GPU — Partial GPU offload
- CPU — Full system RAM
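The mode selection above can be sketched as a simple decision function. This is a simplification: LLMFit's real logic also accounts for quantization, context length, and offload ratios, but the basic ordering follows the list.

```python
# Simplified sketch of choosing among the four run modes listed above.
# Real selection also weighs quantization, context, and offload ratios.

def run_mode(model_gb, vram_gb, ram_gb, active_gb=None):
    """Pick a run mode; active_gb is the active-expert size for MoE models."""
    if model_gb <= vram_gb:
        return "GPU"       # whole model fits in VRAM
    if active_gb is not None and active_gb <= vram_gb \
            and model_gb <= vram_gb + ram_gb:
        return "MoE"       # active experts in VRAM, inactive experts in RAM
    if vram_gb > 0 and model_gb <= vram_gb + ram_gb:
        return "CPU+GPU"   # partial GPU offload
    if model_gb <= ram_gb:
        return "CPU"       # everything in system RAM
    return None            # does not fit

print(run_mode(4.6, 8, 32))                  # → GPU
print(run_mode(23.9, 8, 32, active_gb=6.6))  # → MoE
print(run_mode(23.9, 8, 32))                 # → CPU+GPU
print(run_mode(23.9, 0, 32))                 # → CPU
```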
Three Interfaces
TUI (Default)
llmfit
Interactive terminal UI: system specs at top, models in scrollable table sorted by composite score. Each row shows score, tok/s, best quantization, run mode, memory usage, use-case category.
Plan Mode (p) — Inverts the question: "What hardware is needed for this model?" Shows min/recommended VRAM/RAM/CPU cores, feasible run paths, upgrade deltas.
CLI Mode
llmfit --cli # Table of all models
llmfit fit --perfect -n 5 # Top 5 perfect fits
llmfit search "llama 8b" # Search by name/provider/size
llmfit recommend --json --use-case coding --limit 3
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
REST API
llmfit serve --host 0.0.0.0 --port 8787
curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"
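The endpoint can also be queried programmatically. This stdlib-only sketch builds the same URL as the curl example; only the parameters shown there (limit, min_fit, use_case) are assumed, and the fetch itself is left commented out since it requires a running `llmfit serve` instance.

```python
# Minimal client sketch for the /api/v1/models/top endpoint shown above,
# using only the Python standard library. Query parameters are the ones
# from the curl example; no other parameters are assumed.
import json
import urllib.parse
import urllib.request

def top_models_url(host="localhost", port=8787, **params):
    query = urllib.parse.urlencode(params)
    return f"http://{host}:{port}/api/v1/models/top?{query}"

url = top_models_url(limit=5, min_fit="good", use_case="coding")
print(url)

# With `llmfit serve` running, fetch and pretty-print the response:
# with urllib.request.urlopen(url) as resp:
#     print(json.dumps(json.load(resp), indent=2))
```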
Runtime Integration
Ollama
Auto-detects installed models and can download new ones from the TUI. Supports remote instances via the OLLAMA_HOST environment variable.
llama.cpp
Maps HuggingFace models to GGUF repos, downloads them into a local cache, and detects already-installed GGUF files.
LLMFit vs Alternatives
Category: local LLM hardware compatibility analyzer.
| Feature | LLMFit | llm-checker | Ollama |
|---|---|---|---|
| Focus | Hardware→Model matching | Actual model benchmarking | LLM runtime |
| Stars | 12.9K ⭐ | 1.4K ⭐ | 164K ⭐ |
| License | MIT | Other | MIT |
| Language | Rust | JavaScript | Go |
| TUI | ✅ Interactive | ❌ CLI only | ❌ |
| REST API | ✅ llmfit serve | ❌ | ✅ |
| Multi-GPU | ✅ VRAM aggregation | ❌ | ✅ |
| MoE Support | ✅ Expert offloading | ❌ Dense only | ✅ |
| Apple Silicon | ✅ Unified memory | ✅ | ✅ |
| Intel Arc | ✅ | ❌ | ❌ |
| Ascend NPU | ✅ | ❌ | ❌ |
| Dynamic Quantization | ✅ Q8_0→Q2_K | ❌ | ✅ Auto |
| Speed Estimation | ✅ GPU bandwidth-based | ✅ Real benchmark | ❌ |
| Multi-Dim Scoring | ✅ Quality/Speed/Fit/Context | ❌ | ❌ |
| Plan Mode | ✅ "What hardware do I need?" | ❌ | ❌ |
| Use-Case Filtering | ✅ Coding/Reasoning/Chat/etc. | ❌ | ❌ |
| Ollama Integration | ✅ Detect + Download | ✅ Pull + Benchmark | N/A |
| llama.cpp Integration | ✅ GGUF download | ❌ | Built-in |
| JSON Output | ✅ | ✅ | ✅ |
| Themes | ✅ 6 built-in | ❌ | ❌ |
| Model Database | ✅ HuggingFace (hundreds) | ✅ | ✅ Registry |
| Agent Skill | ✅ OpenClaw | ❌ | ❌ |
| Runs Models | ❌ Recommends | ✅ Via Ollama | ✅ Core function |
When to choose LLMFit: You want to know which models fit your hardware before downloading anything. You get multi-dimensional scoring, dynamic quantization, MoE support, Plan mode, and a polished TUI. It's the recommender, not the runner.
When to choose llm-checker: You want to actually benchmark models by running them via Ollama. Real throughput numbers instead of estimates. Simpler but slower (needs to download and run each model).
When to choose Ollama: You want to run LLMs locally, not choose them. Ollama is the runtime; LLMFit integrates with it to tell you what to run.
Quick Start
# macOS/Linux
brew install alexsjones/tap/llmfit
# or
cargo install llmfit
# Windows
winget install llmfit
# Run
llmfit # Interactive TUI
llmfit --cli # Classic CLI
Conclusion
LLMFit solves the "which model should I run?" problem that every local AI user faces. Instead of guessing, downloading, and then discovering your hardware can't handle a model, LLMFit scores hundreds of models against your actual RAM, CPU, and GPU in seconds. The multi-dimensional scoring (quality, speed, fit, context), dynamic quantization selection, MoE-aware memory estimation, and Plan mode ("what hardware do I need?") make it the most sophisticated hardware→model matching tool available. It's built in Rust and ships as a single binary, with 50 releases and 12.9K stars behind it.
