Everything you need to evaluate, install, and run BonfyreFPQ and the Bonfyre Stack. Benchmarks, hardware fits with pricing, reproduction commands, and full architecture reference.
Pure C model compression engine. No GPU required. No Python dependencies. No training.
The SLI bridge is live: native .fpq files run inference directly, no decompression step.
Near-lossless per-tensor across all 307 encoded tensors in Wan2.1-T2V-1.3B. Worst tensor: 0.999590.
Error stays controlled after 30 stacked transformer blocks. PSNR 35.97 dB. MSE 1.01e-3.
Cosine holds at 0.9976–0.9983 across the full diffusion schedule. No drift amplification.
| Model | Domain | Original | Compressed | Tensors | Avg Cos | Worst Cos | Avg bpw | HF Model |
|---|---|---|---|---|---|---|---|---|
| Wan2.1-T2V-14B | Video diffusion | 54 GB | 27 GB | 402 | 0.999882 | 0.999826 | 4.05 | Download → |
| Phi-4 (14B) | Language model | 28 GB | 28 GB | 162 | 1.000614 | 1.000149 | 4.08 | Download → |
| Whisper Large V3 | Speech recognition | 8.7 GB | 5.8 GB | 998 | 0.999916 | 0.999834 | 4.19 | Download → |
| Whisper Large V3 Turbo | Speech recognition | 1.6 GB | 1.6 GB | 228 | 0.999929 | 0.999858 | 4.18 | Download → |
| Wan2.1-T2V-1.3B | Video diffusion | 5.3 GB | 2.7 GB | 307 | 0.999874 | 0.999590 | n/a | local |
| SmolLM2-135M | Language model | 101 MB | 258 MB (F16) | 211 | 0.999855 | 0.999589 | n/a | GGUF |
| Gemma 2B-it | Language model | n/a | n/a | sampled | 0.99995 | 0.99995 | n/a | local |
| Whisper base.en | Speech recognition | n/a | n/a | sampled | 0.999808 | 0.999763 | n/a | GGML |
Artifacts on Hugging Face: (1) compatibility safetensors (direct Transformers load), and (2) native .fpq v12 files (rANS entropy-coded, 3.5–5.2× smaller, direct inference via the SLI bridge).
Loaded the original Wan2.1 model and the FPQ-compressed version into the same WanTransformer3DModel architecture, fed both identical synthetic inputs (seed=42, shape [1,16,1,60,104], BF16 on MPS), and compared full forward-pass outputs.
| Timestep | Cosine | PSNR (dB) | MSE |
|---|---|---|---|
| t = 0 | 0.99831 | 34.82 | 1.32e-3 |
| t = 100 | 0.99792 | 35.49 | 1.13e-3 |
| t = 500 | 0.99759 | 35.97 | 1.01e-3 |
| t = 900 | 0.99782 | 36.02 | 1.00e-3 |
| t = 999 | 0.99804 | 35.71 | 1.07e-3 |
Identical inputs at each timestep, BF16 on MPS. Cosine range: 0.99759–0.99831. Zero drift.
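The per-timestep metrics in the table can be reproduced with a small helper. This is a sketch only: the repo's wan_dit_compare.py is the authoritative script, and the PSNR peak convention here (peak magnitude of the reference tensor) is an assumption.

```python
import numpy as np

def compare_outputs(ref: np.ndarray, test: np.ndarray) -> dict:
    """Cosine, MSE, and PSNR between two forward-pass output tensors.
    PSNR uses the reference tensor's peak magnitude (an assumption)."""
    a = ref.ravel().astype(np.float64)
    b = test.ravel().astype(np.float64)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    mse = float(np.mean((a - b) ** 2))
    peak = float(np.max(np.abs(a)))
    psnr = 10.0 * np.log10(peak ** 2 / mse) if mse > 0 else float("inf")
    return {"cosine": cosine, "mse": mse, "psnr_db": psnr}
```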
994-token slice, max-length 512, stride 256. All runs on same hardware, same data.
| Method | PPL | Δ vs. baseline | Avg Cos | Worst Cos |
|---|---|---|---|---|
| Baseline (FP32) | 14.20 | – | 1.0000 | 1.0000 |
| BonfyreFPQ @3-bit | 14.48 | +1.97% | 0.999783 | 0.999588 |
| HQQ @3-bit (g64) | 32.38 | +128% | – | – |
| COORD @3-bit (v4) | 35.59 | +150% | 0.982761 | 0.982327 |
169 tensors. HQQ via standalone benchmark (group-size 64, axis 1, CPU). Proof pack.
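The 512/256 strided evaluation behind these numbers can be sketched as follows. `nll_fn` is a placeholder for the model call that returns summed negative log-likelihood over a target span; the repo's perplexity_benchmark.py is the authoritative implementation.

```python
import math

def strided_perplexity(n_tokens: int, nll_fn, max_length: int = 512, stride: int = 256) -> float:
    """Sliding-window perplexity: each window scores only the tokens not yet
    counted, so every token is conditioned on up to max_length of context.
    nll_fn(begin, end, target_begin) -> summed NLL over [target_begin, end)."""
    total_nll, counted, prev_end = 0.0, 0, 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        target_begin = prev_end              # score only fresh tokens
        total_nll += nll_fn(begin, end, target_begin)
        counted += end - target_begin
        prev_end = end
        if end == n_tokens:
            break
    return math.exp(total_nll / counted)
```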
From authors' own papers. Lower PPL = better. FP16 baseline: 5.12.
| Method | Bits | PPL | Δ vs. FP16 | Source |
|---|---|---|---|---|
| FP16 (baseline) | 16 | 5.12 | – | – |
| AQLM | 3.04 | 5.46 | +6.6% | Egiazarian et al., ICML 2024 |
| SpQR | 2.98 | 6.20 | +21.1% | Dettmers et al., 2023 |
| AWQ | 3 | 6.24 | +21.9% | Lin et al., MLSys 2024 |
| GPTQ | 3.00 | 8.06 | +57.4% | Frantar et al., 2022 |
| HQQ | 3 | not published for Llama-2 | – | Badri & Shaji, 2023 |
bonfyre-fpq quantize model.gguf compressed.gguf --bits 3
Direct runtime inference from .fpq is now working. Load a compressed model and run it immediately: no conversion, no extra RAM, no hacks. The SLI bridge (Spectral Lattice Inference) is fully integrated.
patch_model(hf_model, fpq, resolver) replaces nn.Linear layers with FPQLinear. No decode step, no weight copy.
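The patching pattern can be illustrated with a toy stand-in. `DequantLinear` and `patch_linears` below are illustrative names using naive absmax quantization, not the real FPQLinear/patch_model (which keeps weights in .fpq form and never materializes a full-precision copy).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DequantLinear(nn.Module):
    """Toy stand-in for FPQLinear: stores an int8 weight plus a per-row
    scale and dequantizes inside forward()."""
    def __init__(self, linear: nn.Linear, bits: int = 3):
        super().__init__()
        qmax = 2 ** (bits - 1) - 1
        w = linear.weight.detach()
        self.scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-12)
        self.qweight = torch.round(w / self.scale).to(torch.int8)
        self.bias = linear.bias

    def forward(self, x):
        return F.linear(x, self.qweight.float() * self.scale, self.bias)

def patch_linears(model: nn.Module, bits: int = 3) -> None:
    """Recursively replace every nn.Linear with the quantized stand-in, in place."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, DequantLinear(child, bits))
        else:
            patch_linears(child, bits)
```

The in-place module swap is what makes the bridge drop-in: the rest of the model graph (attention, norms, embeddings) is untouched.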
FPQ-X evolves BonfyreFPQ from a quantizer into a full compression algebra.
All six operator families are implemented and validated in fpqx_ops.c + fpq_bridge.py.
Low-rank SVD + E8 lattice + 16D RVQ + QJL projection + Ghost correction. The proven foundation delivering 0.999+ cosine across 1,790 tensors.
Learns S = I + ABᵀ via thin SVD of the ratio matrix Q = W/Ŵ - 1. Captures scaling distortion that additive methods miss. Auto-rollback if cosine doesn't improve.
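A minimal numpy sketch of the idea, assuming a plain thin-SVD fit of the ratio matrix; the shipped operator lives in fpqx_ops.c and adds the cosine rollback check.

```python
import numpy as np

def multiplicative_correction(W, W_hat, rank=4, eps=1e-8):
    """Fit a rank-r multiplicative correction from the elementwise ratio
    Q = W / W_hat - 1, then apply it to the quantized weight W_hat."""
    denom = np.where(np.abs(W_hat) < eps, eps, W_hat)  # guard tiny entries
    Q = W / denom - 1.0
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    A = U[:, :rank] * s[:rank]            # thin factors of the ratio matrix
    B = Vt[:rank, :]
    W_corr = (1.0 + A @ B) * W_hat        # elementwise multiplicative fix
    return W_corr, (A, B)
```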
Per-column linear predictor from the low-rank basis to the quantization residual. Uses already-available L factor to predict and cancel systematic error.
Attention-weighted K-means++ on KV cache vectors. Compresses along the sequence dimension: tokens that attend similarly share one cache atom. Orthogonal to weight quantization.
Profiles each tensor: η_L (low-rank energy), spectral gap, kurtosis, outlier fraction. A decision tree selects which operators to activate and at what rank.
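A sketch of the profiling statistics. The thresholds and operator letters in the toy selection step are assumptions for illustration, not the shipped decision tree.

```python
import numpy as np

def profile_tensor(w: np.ndarray, rank: int = 16):
    """Compute profiler statistics for one 2-D weight tensor, then run a
    toy operator-selection rule (thresholds are illustrative only)."""
    x = w.ravel().astype(np.float64)
    s = np.linalg.svd(w, compute_uv=False)
    stats = {
        "eta_L": float((s[:rank] ** 2).sum() / (s ** 2).sum()),  # low-rank energy
        "spectral_gap": float(s[0] / max(s[1], 1e-12)),
        "kurtosis": float(((x - x.mean()) ** 4).mean() / x.var() ** 2),
        "outlier_frac": float((np.abs(x - x.mean()) > 4 * x.std()).mean()),
    }
    ops = ["A"]                          # additive base quantizer always on
    if stats["eta_L"] > 0.5:
        ops.append("M")                  # multiplicative low-rank correction
    if stats["outlier_frac"] > 0.01:
        ops.append("P")                  # predictive residual cancellation
    return stats, ops
```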
Inner-group quantization that aligns bit boundaries to SIMD lanes. Stores scales per group, enabling vectorized unpacking without scatter/gather overhead.
| Dimension | FPQ v10 | FPQ-X |
|---|---|---|
| Error model | Additive only (W - Ŵ) | Additive × Multiplicative + Predictive |
| Per-tensor policy | Same pipeline for all | Δ profiles η_L, gap, kurtosis → selects operators |
| KV cache | Weight-only quantization | D operator: sequence-axis distillation |
| Hardware awareness | Generic packing | H operator: SIMD-lane-aligned groups |
| Objective | min ‖W - Ŵ‖ | min λ_R·Rate + λ_D·Distortion + λ_E·Execution |
| Research basis | Original FPQ design | 9 papers from early 2026 |
# Full A+M+Δ pipeline
bonfyre-fpqx compress model.safetensors compressed.safetensors --bits 3
# Roundtrip quality test
bonfyre-fpqx roundtrip model.safetensors --bits 3
# Per-tensor compressibility analysis
bonfyre-fpqx profile model.safetensors
# KV cache distillation
bonfyre-fpqx distill cache.safetensors distilled.safetensors --atoms 256
# Hardware-aligned repacking
bonfyre-fpqx pack model.safetensors packed.safetensors --bits 3 --group-size 128
Baseline cosine numbers (bonfyre-kvcache C benchmark). All 9 Python optimizations live in fpq_bridge.py.
| Bits | KV Cosine | Hardware implication |
|---|---|---|
| 5-bit | 0.99996 | ~5.3× more context: 8K ctx → 42K in same VRAM |
| 4-bit | 0.99994 | 4× more context; recommended for production |
| 3-bit | 0.99990 | 5.3× context, some quality loss on long sequences |
High-attention blocks dominate tile assignment. Codebook quality concentrates where the model actually looks.
The Δ-profiler analyzes each K/V layer (kurtosis, spectral gap, outlier fraction) to pick the right bit depth automatically.
One 256-tile codebook learned across 8 sample layers. Skips per-call K-means: all layers compress in amortized O(1).
K-means++ on KV vectors to K atoms (K ≪ N). Tokens that attend similarly share one atom. Bug-free nearest-centroid lookup.
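A minimal sketch of the distillation step, with greedy farthest-point seeding standing in for true k-means++ (an assumption made here so the example is deterministic).

```python
import numpy as np

def distill_kv(vectors, attn_weights, k, iters=8):
    """Attention-weighted k-means over KV cache vectors: tokens that attend
    alike share one cache atom. Farthest-point seeding stands in for k-means++."""
    atoms = [vectors[0]]
    for _ in range(k - 1):                      # deterministic seeding
        d = np.min([((vectors - a) ** 2).sum(1) for a in atoms], axis=0)
        atoms.append(vectors[d.argmax()])
    atoms = np.array(atoms, dtype=float)
    for _ in range(iters):
        d = ((vectors[:, None, :] - atoms[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)                    # nearest-atom lookup
        for j in range(k):
            m = assign == j
            if m.any():                         # attention-weighted mean
                w = attn_weights[m][:, None]
                atoms[j] = (w * vectors[m]).sum(0) / w.sum()
    return atoms, assign
```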
Only the delta vs previous frame is compressed. Each new token costs far less than storing a new frame.
E8 coordinate magnitude as a Huffman code-length proxy. High-cost blocks get upweighted: rate and quality jointly optimized.
Near-zero blocks (max abs ≤ 63) bypass the E8 lattice: 7-bit integer round + clamp. Significant throughput win on embedding layers.
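The fast path can be sketched as a simple dispatch; the `"e8"` branch below is a placeholder for the real lattice quantizer.

```python
import numpy as np

def encode_block(block: np.ndarray):
    """Dispatch one scale-normalized block: values that all fit the signed
    7-bit range [-64, 63] skip the E8 lattice search entirely."""
    if np.max(np.abs(block)) <= 63:
        q = np.clip(np.rint(block), -64, 63).astype(np.int8)
        return "int7", q
    return "e8", None  # fall through to the E8 lattice quantizer
```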
Per-row scale vector on each FPQLinear. Applied after SLI matmul: corrects per-output-channel amplitude drift.
ARM NEON 128-bit aligned pre-packing. Eliminates scatter/gather: vectorized unpacking on Apple Silicon and Jetson.
Use individually or compose: patch_kv_cache(adaptive_bits=True, shared_tiles=tiles) activates #4 + #5 simultaneously.
FPQ weight compression + KV cache optimizations change the hardware equation. What used to need cloud GPUs now fits on-device. Use cases shift depending on your hardware budget.
| Device | RAM | Approx. Cost (2026) | Before (BF16) | After (FPQ 4-bit + KV) |
|---|---|---|---|---|
| Raspberry Pi 5 | 8 GB | $80 | TinyLlama only, 512-token ctx, no video | TinyLlama + 2K ctx, Whisper turbo inference, local ASR pipeline |
| Jetson Orin Nano | 8 GB | $250 | Qwen 0.5B only, degraded at >512 ctx | Qwen 0.5B @ 4K ctx · NEON packing · embeddings + FPQ inference co-resident |
| Apple M1 MacBook (16 GB) | 16 GB unified | $900 refurb | Qwen 0.5B (tight), Wan 1.3B (no headroom) | Wan 1.3B + 4K ctx KV · HCP speech + SLI co-resident · NEON-packed |
| Apple M2/M3 Max (64 GB) | 64 GB unified | $2,500–3,500 | Phi-4 14B (no KV headroom past 2K) | Wan 14B @ 8K ctx · Phi-4 @ 32K ctx with delta KV · full pipeline concurrent |
| T4 cloud (16 GB VRAM) | 16 GB VRAM | ~$0.35/hr spot | Qwen 3B, 2K ctx max before OOM | Qwen 3B @ 8K ctx · Wan 1.3B + full diffusion sweep · shared KV codebook |
| RTX 4090 (24 GB VRAM) | 24 GB VRAM | $1,600 GPU / ~$0.50/hr cloud | Wan 1.3B (tight), Phi-4 14B doesn't fit | Wan 1.3B @ 32K ctx · Phi-4 14B fits · adaptive bits saves ~30% KV RAM |
| RTX 6000 Ada (48 GB VRAM) | 48 GB VRAM | ~$1.10/hr (RunPod) | Wan 14B (tight), long video sequences OOM | Wan 14B · 287 SLI layers @ 5-timestep sweep · multi-second video KV cached |
Local ASR with Whisper turbo. TinyLlama inference. Bonfyre pipeline binaries. Edge transcription kiosk.
Qwen 0.5B + 4K ctx. Wan 1.3B video. HCP speech + SLI co-resident. NEON packing. Full local pipeline.
Wan 14B @ 8K ctx. Phi-4 @ 32K ctx with delta KV. Full Bonfyre pipeline + inference concurrent. Production-grade local stack.
Burst GPU for video generation, large model inference, SLI sweeps. Use FPQ to fit bigger models on cheaper instances. T4 now handles what used to need A100.
TinyLlama 1.1B stays in .fpq at runtime, no decode step. The BF16 copy never materializes.
4-bit KV compression in same VRAM budget. Delta encoding makes each new token incremental.
H-operator pre-packing on Apple Silicon and Jetson: vectorized unpacking, no scatter/gather.
FPQ model + HCP speech + vector search + pipeline can run concurrently on a 16 GB Mac.
Every number from a real run. Raw logs, scripts, and CSVs in the repo. → FPQ overview
BF16 safetensors (drop-in) + native .fpq v12 files (rANS, direct SLI inference).
Qwen PPL (v8 vs v4 vs HQQ), Whisper roundtrip, CSV, PNG chart, reproduction commands.
View proof pack →
Full benchmark report: version progression, weight tables, KV cache, speed optimization, binary sizes.
View benchmarks doc →
Pure C11: main.c, fpq_codec.c, ggml_reader.c, fpq.h. Builds with make on macOS/Linux.
View source →
Apple M-series, after 6 optimization passes (P0–P6). All real. FPQ benchmarks →
| Metric | Before (P0) | After P5 | Improvement |
|---|---|---|---|
| Single embed | ~600 ms | 237 ms | 2.5× |
| 10-file batch embed | ~6,000 ms | 386 ms | 15.5× |
| Pipeline (6 stages) | 76 ms | 8 ms | 9.5× |
| Tag inference | ~150 ms | 6 ms | 25× |
| Hash hex | ~100 ns | ~10 ns | ~10× |
| Artifact struct | 1,076 bytes | 536 bytes | 2× cache density |
| Vector file (384-dim) | 6.4 KB JSON | 1,544 bytes VECF | 4.2× smaller |
bonfyre-moq replaces the Node.js MoQ relay with a pure C11 implementation.
Built on ngtcp2 + nghttp3 + OpenSSL 3 + SQLite. Ships with four live extension modules:
inline inference, RL self-tuning, gossip mesh, and lightweight consensus, all in the same binary.
inference.c + inference_onnx.c: Entropy-based scoring (score 0–100) on every forwarded MoQ object. Weak-linked ONNX Runtime via dlopen; falls back to the entropy estimator when ONNX Runtime is absent. Same hook as bonfyre-embed.
optimizer.c: Background RL agent tunes relay buffer size (16 KB–256 KB) and path policy (round-robin vs. least-loaded) every 2 s. Reward signal driven by real relay metrics. Exposes current params for use by the forwarding loop.
mesh.c + bonfyre-mesh.h: UDP multicast gossip beacon on 239.0.0.57:7942. A background thread maintains a live peer table (peer_info[]). Enables distributed relay clusters without a central registry. The shared peer table feeds the consensus module.
consensus.c: Simulated Raft that returns the stable leader peer from the mesh table. Used to shard MoQ PUBLISH_NAMESPACE announcements and SUBSCRIBE routing across a relay cluster without split-brain.
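The entropy fallback can be sketched in a few lines. The 0–100 scaling shown (8 bits/byte maps to 100) is an assumption about how inference.c normalizes its score.

```python
import math
from collections import Counter

def entropy_score(payload: bytes) -> int:
    """Score a payload 0-100 by Shannon entropy of its byte histogram.
    Normalization (8 bits/byte = 100) is an illustrative assumption."""
    if not payload:
        return 0
    n = len(payload)
    h = -sum(c / n * math.log2(c / n) for c in Counter(payload).values())
    return round(h / 8.0 * 100)
```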
# Build relay + all extension modules
make bonfyre-moq
# Run relay (MoQ + mesh + optimizer + inference)
./bonfyre-moq --host 127.0.0.1 --port 4443 \
--runtime-dir /tmp/bonfyre-moq \
--db /tmp/bonfyre-moq/relay.db
# Smoke-test all extension modules
make test-bonfyre # inference + optimizer + mesh + consensus
48 separate binaries. Not a monolith. Not a framework. Each is a standalone Unix process.
Each binary does one thing. Compose with pipes, files, or the pipeline binary. bonfyre-media-prep audio.wav | bonfyre-transcribe | bonfyre-brief
Every binary runs as its own process. No shared memory. If one crashes, nothing else does. Separate processes clean up on exit.
Whisper via libwhisper (Homebrew). LLM via llama-completion subprocess. SQLite via system library.
Every binary is standalone. Use one or all. ~2.1 MB total disk.
Each is standalone β you don't need to understand the whole system.
Run models directly from .fpq files. patch_model() replaces nn.Linear with FPQLinear. TinyLlama: 155 layers, cos=0.997, top-1=97.3%. bonfyre-oss.
patch_model(hf_model, fpq, resolver)
4Γ more context tokens in same VRAM. 9 optimization passes. Works with nn.Linear and FPQLinear simultaneously.
patch_kv_cache(model, bits=4, adaptive_bits=True)
Quantize LLM weights to 3-bit with 0.9999+ cosine. E8 lattice snap + ΞΌ-law warp + 16D RVQ. 42 KB binary.
bonfyre-quant benchmark model.gguf --bits 3
Replace Strapi's 500 MB with a 287 KB binary. Dynamic schemas, token auth, REST API. bonfyre-cms.
bonfyre-cms serve --port 8800
Local speech for public or private audio. Live proof. bonfyre-intake.
bonfyre-transcribe run audio.wav
Audio → transcript → summary → quality score → pricing → deliverable. 5–8 ms per stage. bonfyre-pipeline.
bonfyre-pipeline run --input audio.mp3
Embed docs + NEON SIMD cosine search. Replace $250/mo Pinecone. bonfyre-embed.
bonfyre-embed --insert-db my.db
Pure C WebTransport relay with inline AI scoring, RL self-tuning, gossip peer discovery, and consensus leader election. Replaces Node.js moq-edge.
./bonfyre-moq --port 4443 --db relay.db
Drop-in replacement. Set OPENAI_API_BASE=http://localhost:8787. 53 KB binary.
bonfyre-proxy serve --port 8787
Every number on this site comes from scripts in the repo.
# Weight roundtrip β Wan2.1 (307 tensors)
./bonfyre-fpq roundtrip-v9 ~/.local/share/models/wan2.1-t2v-1.3b/diffusion_pytorch_model.safetensors --bits 3
# Compress GGUF (llama.cpp compatible)
./bonfyre-fpq quantize model.gguf compressed.gguf --bits 3
# Compress safetensors
./bonfyre-fpq quantize model.safetensors compressed.safetensors --bits 3
# Perplexity benchmark
python3 perplexity_benchmark.py --model Qwen/Qwen2.5-0.5B --bits 3 --mode v8
# DiT forward-pass comparison
python3 scripts/wan_dit_compare.py
# SLI bridge inference test
python3 test_sli_plus_kvcache.py --device mps
# Full Bonfyre pipeline
git clone https://github.com/Nickgonzales76017/bonfyre.git && cd bonfyre && make
time ./bin/bonfyre-pipeline run --input audio.wav
# Run tests
make test # 167/167 tests
All scripts, logs, and CSVs: 10-Code/BonfyreFPQ/ Β· Proof pack: results/2026-04-10-proof-pack/
Build from source in under 60 seconds.
# From source (recommended)
git clone https://github.com/Nickgonzales76017/bonfyre.git
cd bonfyre
make # builds 2 libraries + 47 binaries
make install # copies to ~/.local/bin
# One command (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/Nickgonzales76017/bonfyre/main/install.sh | sh
# BonfyreFPQ / bonfyre-oss
git clone https://github.com/Nickgonzales76017/bonfyre-oss.git
cd bonfyre-oss && make
Requirements: C11 compiler (gcc or clang), SQLite3 dev headers, zlib. Optional: ONNX Runtime (embed), FreeSWITCH (tel), PyTorch (SLI bridge).
Full source. Real benchmarks. Reproduction scripts included.