50 composable C11 binaries that replace your entire backend stack. CMS, transcription, telephony, auth, payments, vector search, compression: each under 70 KB.
The first live proof corpus now traces public YouTube sources through Bonfyre's transient audio path into derived transcripts, handoff briefs, proof bundles, and in-app search without hosting the original media.
git clone https://github.com/Nickgonzales76017/bonfyre.git && cd bonfyre && make
The Shift Handoff app now starts from linked public videos, processes them transiently through Bonfyre, removes the downloaded media, and keeps only derived transcripts, briefs, and proof bundles. The rough edges are visible on purpose.
BonfyreFPQ is a pure C model compression engine that reduces neural network weight files while preserving output quality. No GPU required. No Python dependencies. No training. Just better math.
Near-lossless per-tensor representation across all 307 encoded tensors in Wan2.1-T2V-1.3B. Worst tensor: 0.999590.
Error stays controlled after 30 stacked transformer blocks. PSNR 35.97 dB. MSE 1.01e-3.
Cosine holds 0.9976–0.9983 across the full diffusion schedule. No drift amplification.
| Model | Domain | Original | Compressed | Tensors | Avg Cos | Worst Cos | Avg bpw | HF Model |
|---|---|---|---|---|---|---|---|---|
| Wan2.1-T2V-14B | Video diffusion | 54 GB | 27 GB | 402 | 0.999882 | 0.999826 | 4.05 | Download → |
| Phi-4 (14B) | Language model | 28 GB | 28 GB | 162 | 1.000614 | 1.000149 | 4.08 | Download → |
| Whisper Large V3 | Speech recognition | 8.7 GB | 5.8 GB | 998 | 0.999916 | 0.999834 | 4.19 | Download → |
| Whisper Large V3 Turbo | Speech recognition | 1.6 GB | 1.6 GB | 228 | 0.999929 | 0.999858 | 4.18 | Download → |
| Wan2.1-T2V-1.3B | Video diffusion | 5.3 GB | 2.7 GB | 307 | 0.999874 | 0.999590 | – | local |
| SmolLM2-135M | Language model | 101 MB | 258 MB (F16) | 211 | 0.999855 | 0.999589 | – | GGUF |
| Gemma 2B-it | Language model | – | – | sampled | 0.99995 | 0.99995 | – | local |
| Whisper base.en | Speech recognition | – | – | sampled | 0.999808 | 0.999763 | – | GGML |
All tests at 3-bit (FPQ3). Artifacts are published in two tracks on Hugging Face: (1) compatibility safetensors (direct Transformers load, larger files), and (2) native .fpq files (much smaller, but not yet directly usable in standard Transformers inference).
Production bar: compressed artifacts only count as success if they run inference directly with no offline decompression step and no quality regression.
Current blocker: until native .fpq closes that runtime gap, this is a storage result, not a finished inference innovation.
Verified example: Qwen2.5-3B safetensors = 6.18 GB total vs Qwen2.5-3B native .fpq = 692 MB total (~8.9x smaller).
Loaded the original Wan2.1 model and the FPQ-compressed version into the same WanTransformer3DModel architecture. Fed identical synthetic inputs (seed=42, shape [1,16,1,60,104], BF16 on MPS). Compared full forward pass outputs.
| Timestep | Cosine | PSNR (dB) | MSE |
|---|---|---|---|
| t = 0 | 0.99831 | 34.82 | 1.32e-3 |
| t = 100 | 0.99792 | 35.49 | 1.13e-3 |
| t = 500 | 0.99759 | 35.97 | 1.01e-3 |
| t = 900 | 0.99782 | 36.02 | 1.00e-3 |
| t = 999 | 0.99804 | 35.71 | 1.07e-3 |
Identical inputs at each timestep, BF16 on MPS. Cosine range: 0.99759–0.99831. Zero drift.
994-token slice, max-length 512, stride 256. All runs on same hardware, same data.
| Method | PPL | Δ baseline | Avg Cos | Worst Cos |
|---|---|---|---|---|
| Baseline (FP32) | 14.20 | – | 1.0000 | 1.0000 |
| BonfyreFPQ @3-bit | 14.48 | +1.97% | 0.999783 | 0.999588 |
| HQQ @3-bit (g64) | 32.38 | +128% | – | – |
| COORD @3-bit (v4) | 35.59 | +150% | 0.982761 | 0.982327 |
169 tensors quantized. HQQ run via standalone benchmark script (group-size 64, axis 1, CPU). Reproduced from proof pack.
All numbers from the authors' own papers. Lower PPL = better. FP16 baseline: 5.12.
| Method | Bits | PPL | Δ FP16 | Source |
|---|---|---|---|---|
| FP16 (baseline) | 16 | 5.12 | – | – |
| AQLM | 3.04 | 5.46 | +6.6% | Egiazarian et al., ICML 2024 |
| SpQR | 2.98 | 6.20 | +21.1% | Dettmers et al., 2023 |
| AWQ | 3 | 6.24 | +21.9% | Lin et al., MLSys 2024 |
| GPTQ | 3.00 | 8.06 | +57.4% | Frantar et al., 2022 |
| HQQ | 3 | not published for Llama-2 | – | Badri & Shaji, 2023 |
AQLM, SpQR, AWQ, and GPTQ numbers are from their published Llama-2-7B tables (AQLM Table 2, AWQ Table 4). All use WikiText-2 validation. AQLM is currently the best published result at 3-bit.
Per-model cosine and PPL numbers in the models table above. All artifacts on Hugging Face.
bonfyre-fpq quantize model.gguf compressed.gguf --bits 3
A 14B video model that needed 54 GB of disk now fits in 27 GB. Near-identical outputs. No retraining. No calibration data. No GPU required for compression. Compressed models are standard BF16 safetensors: load them exactly like the originals.
Most compression methods look good at the weight level but degrade when outputs are actually measured. FPQ demonstrates, through real end-to-end inference, that compression error doesn't accumulate across deep transformer stacks or iterative diffusion schedules.
Current status: compatibility safetensors are inference-ready today. Native .fpq is the smallest storage path but is not yet direct-inference-ready, which means the core product gap is still open. Both are published on Hugging Face with explicit naming.
Extreme compression regime. FP16 baseline: 5.12.
| Method | Bits | PPL | Δ FP16 | Source |
|---|---|---|---|---|
| AQLM | 2.02 | 6.59 | +28.7% | AQLM Table 1 |
| QuIP# | 2.02 | 8.22 | +60.5% | Tseng et al., 2024 |
At 2-bit, even state-of-the-art methods show 29–61% PPL degradation. BonfyreFPQ targets the 3–4 bit regime where near-lossless is achievable.
| Bits | PPL | Δ baseline | Avg Cos |
|---|---|---|---|
| FP32 | 11.95 | – | 1.000 |
| 4-bit | 14.77 | +23.6% | 0.9999 |
| 3-bit | 17.89 | +49.7% | 0.9997 |
KV cache is harder than weights: errors compound across 24 layers × every token. 4-bit is recommended; 3-bit degrades.
# Weight roundtrip: Wan2.1 (307 tensors, ~15 min on M-series)
./bonfyre-fpq roundtrip-v9 ~/.local/share/models/wan2.1-t2v-1.3b/diffusion_pytorch_model.safetensors --bits 3
# Compress to GGUF (llama.cpp compatible)
./bonfyre-fpq quantize model.gguf compressed.gguf --bits 3
# Compress safetensors (PyTorch/diffusers)
./bonfyre-fpq quantize model.safetensors compressed.safetensors --bits 3
# Perplexity benchmark
python3 perplexity_benchmark.py --model Qwen/Qwen2.5-0.5B --bits 3 --mode v8
# DiT forward-pass comparison (requires PyTorch + diffusers)
python3 scripts/wan_dit_compare.py
All scripts, logs, and CSV artifacts live in 10-Code/BonfyreFPQ/. Proof pack with raw logs: results/2026-04-10-proof-pack/
FPQ-X evolves BonfyreFPQ from a quantizer into a full compression algebra. Instead of compressing tensors in isolation, FPQ-X compresses information flow, optimizing the joint objective of rate, distortion, and hardware execution cost.
Low-rank SVD + E8 lattice + 16D RVQ + QJL projection + Ghost correction. The proven foundation delivering 0.999+ cosine across 1,790 tensors.
Learns S = I + ABᵀ via thin SVD of the ratio matrix Q = W/Ŵ - 1. Captures scaling distortion that additive methods miss. Auto-rollback if cosine doesn't improve.
Per-column linear predictor from the low-rank basis to the quantization residual. At decode time, uses the already-available L factor to predict and cancel systematic error.
Attention-weighted K-means++ on KV cache vectors. Compresses along the sequence dimension: tokens that attend similarly share one cache atom. Orthogonal to weight quantization.
Profiles each tensor: η_L (low-rank energy), spectral gap, kurtosis, outlier fraction. A decision tree selects which operators to activate and at what rank; no blanket compression.
Inner-group quantization that aligns bit boundaries to hardware SIMD lanes. Stores scales per group instead of per-channel, enabling vectorized unpacking without scatter/gather overhead.
| Dimension | FPQ v10 | FPQ-X |
|---|---|---|
| Error model | Additive only (W - Ŵ) | Additive × Multiplicative + Predictive |
| Per-tensor policy | Same pipeline for all | Π profiles η_L, gap, kurtosis → selects operators |
| KV cache | Weight-only quantization | D operator: sequence-axis distillation |
| Hardware awareness | Generic packing | H operator: SIMD-lane-aligned groups |
| Objective | min ‖W - Ŵ‖ | min λ_R·Rate + λ_D·Distortion + λ_E·Execution |
| Research basis | Original FPQ design | 9 papers from early 2026 |
# Full A+M+Π pipeline: compress and write output
bonfyre-fpqx compress model.safetensors compressed.safetensors --bits 3
# Encode+decode roundtrip: measure quality (no output file)
bonfyre-fpqx roundtrip model.safetensors --bits 3
# Per-tensor compressibility analysis: see which operators activate
bonfyre-fpqx profile model.safetensors
# KV cache distillation: sequence-axis compression
bonfyre-fpqx distill cache.safetensors distilled.safetensors --atoms 256
# Hardware-aligned repacking
bonfyre-fpqx pack model.safetensors packed.safetensors --bits 3 --group-size 128
Side-by-side against the industry incumbents. These numbers are real.
| | Deepgram | OpenAI Whisper API | Bonfyre + HCP |
|---|---|---|---|
| Cost | $0.006/min | $0.006/min | $0/min |
| Current public proof | Not run here | Not run here | 3 linked YouTube handoff sources, 0.5303-0.6887 confidence, 0.027-0.041 realtime factor |
| Model size | Cloud (N/A) | Cloud (N/A) | 29 MB default (tiny q5_0) / 44 MB (base q4_0) / 24 MB (tiny q4_0) |
| Quality visibility | Cloud summary | Cloud summary | Segment counts, confidence, realtime factor, proof JSON, and source trace exposed in the app |
| Post-process overhead | N/A (cloud) | N/A (cloud) | <1% of decode time (unified FFT, -O3) |
| Privacy | Cloud (data leaves device) | Cloud (data leaves device) | 100% local, offline, private |
| Internet required | Yes | Yes | No |
| Output formats | JSON, SRT | JSON, SRT, VTT | JSON + HCP metrics, TXT, SRT, VTT, meta.json |
| Novel algorithm | Proprietary cloud | Whisper (standard) | HCP quad-channel spectral + KIEL-CC Kalman + unified E-T Gate/formant + bigram/trigram semantic + morphological logit bias + context-seeded re-decode + quantization (q4_0/q5_0) |
| | Strapi | Express + Prisma | Bonfyre |
|---|---|---|---|
| Install size | ~500 MB | ~200 MB | ~2.1 MB |
| Dependencies | Node + 400 packages | Node + 80 packages | libc + SQLite |
| Startup time | 30–120 sec | 2–5 sec | < 50 ms |
| Idle memory | ~200 MB | ~80 MB | 15 MB |
| Build step | npm install (2 min) | npm install (45 sec) | make (8 sec) |
| Runtime | Node.js 18+ | Node.js 18+ | None (static binary) |
| Binaries | 1 monolith | 1 monolith | 50 composable |
Each is a standalone entry point β you don't need to understand the whole system.
Replace Strapi's 500 MB install with a 287 KB binary. Dynamic schemas, token auth, REST API. Repo: bonfyre-cms.
bonfyre-cms serve --port 8800
Local speech path for public or private audio: media prep, transcription, cleaning, paragraphs, and proof artifacts. The live Shift Handoff app shows the current public-origin results, including where transcription still needs pressure. Live proof. Repo: bonfyre-intake.
bonfyre-transcribe run audio.wav
Shrink JSON payloads to 9.3% of their original size with O(1) random field access. Near the Shannon limit with arithmetic coding. Repo: bonfyre-core (library).
liblambda-tensors
Audio → transcript → summary → quality score → pricing → packaged deliverable. One command, 5–8 ms per stage. Repo: bonfyre-pipeline.
bonfyre-pipeline run --input audio.mp3
Embed documents + NEON SIMD cosine search. Replace $250/mo Pinecone: local, 5 ms queries. Repo: bonfyre-embed.
bonfyre-embed --insert-db my.db
Auth, payments, metering, API keys, rate limiting, telephony β composable binaries. ~240 KB total. Umbrella repo. Telephony repo.
bonfyre-api + auth + pay + gate + tel
Drop-in replacement for OpenAI endpoints. Set OPENAI_API_BASE=http://localhost:8787 and existing code just works: transcription via HCP, completions via bonfyre-brief. 53 KB binary, localhost only.
bonfyre-proxy serve --port 8787
Quantize LLM weights to 3-bit with 0.9999+ cosine similarity and near-zero perplexity loss. E8 lattice snap + μ-law warp + 16D RVQ. Qwen 0.5B: PPL 12.07 vs 11.95 baseline (+1.0%). 42 KB binary.
bonfyre-quant benchmark model.gguf --bits 3
Bonfyre is no longer a single opaque repo. These are the public entry points linked from the live product surface.
The top-level router for architecture, comparisons, and the full system story.
bonfyre
Core substrate, hashing, canonicalization, compression helpers, and the shared C runtime library.
bonfyre-core
Transcription, ingest, media prep, transcript cleanup, paragraphization, and transcript-family workflows.
bonfyre-intake
Standalone JSON compression and family-aware tensor substrate.
liblambda-tensors
The single-process pipeline surface for audio-to-invoice and other end-to-end flows.
bonfyre-pipeline
ONNX-backed embeddings and local vector search for document workflows.
bonfyre-embed
Dynamic schemas, REST API, token auth, and compact content operations in one binary.
bonfyre-cms
FreeSWITCH-based telephony, mock call flows, SMS/MMS, and verification without Twilio lock-in.
bonfyre-tel
Bonfyre works as a high-performance companion backend for WordPress.
Use WordPress as the experience layer. Use Bonfyre as the tiny local-first engine behind search, media, AI workflows, packaging, auth, and monetization.
Use WordPress for themes, editors, plugins, and admin workflows.
Use Bonfyre for the heavy lifting: transcription, vector search, structured compression, packaging, metering, auth, pricing, and output generation.
Turn episode audio into draft blog posts, summaries, and quotes — automatically.
Index posts by meaning, not just keywords. Replace bloated search plugins.
Create editorial summaries and action items from long transcripts or notes for editors.
Back premium features or content tiers without plugin sprawl.
Produce PDFs, EPUBs, and downloadable guides from WordPress content.
Index docs, FAQs, uploads, and help content for fast semantic retrieval.
WordPress handles presentation. Bonfyre handles auth, metering, file packaging, and deliverables.
For agencies and consultants: raw call audio into organized, quality-scored client packets.
Enrich old WordPress content with topics, categories, and semantic clusters.
Turn one long post or transcript into snippets, email copy, and social-ready assets.
WordPress as public frontend. Bonfyre as semantic index + artifact pipeline for PDFs and transcripts.
Quoting and billing workflows for agencies — from proof bundles to invoices.
Upload raw voice notes, publish cleaned, structured, summarized versions.
Local-first transcription and search without cloud APIs, billing, or vendor lock-in.
Use WordPress as editor/admin, then Bonfyre to emit alternate site outputs, packages, and feeds.
| WordPress need | Bonfyre binaries |
|---|---|
| Smarter CMS / data layer | bonfyre-cms, bonfyre-api, bonfyre-index |
| Audio → article workflow | bonfyre-media-prep, bonfyre-transcribe, bonfyre-brief, bonfyre-pack |
| Semantic search | bonfyre-embed, bonfyre-vec, bonfyre-query |
| Premium content / subscriptions | bonfyre-auth, bonfyre-gate, bonfyre-meter, bonfyre-pay |
| Offers / quoting / deliverables | bonfyre-offer, bonfyre-render, bonfyre-emit, bonfyre-pack |
| Repurposing / multi-format output | bonfyre-render, bonfyre-emit, bonfyre-distribute |
You don't need to understand the binaries. Bonfyre is a behind-the-scenes engine that takes messy business input — calls, files, notes, recordings — and turns it into something useful, organized, and ready to use.
You keep using familiar tools on the front end. Bonfyre handles the hard part behind the scenes.
Every number below comes from a real run on this machine. Raw logs, scripts, and CSVs are in the repo. ← Back to FPQ overview
Inference-ready track: BF16 safetensors (drop-in, no special loader). Native .fpq track is published separately as an unfinished storage path until direct inference works without that extra runtime gap.
Qwen perplexity (v8 vs v4 vs HQQ), Whisper roundtrip, CSV, PNG chart, reproduction commands.
View proof pack →
Forward pass metrics, per-channel analysis, timestep sweep, timing data. Machine-readable.
View comparison script →
Full benchmark report: version progression, weight tables, KV cache, speed optimization, binary sizes.
View benchmarks doc →
Python script to reproduce Qwen PPL results. Supports v4/v8 modes, configurable tokens/stride.
View script →
Pure C11 engine: main.c, fpq_codec.c, ggml_reader.c, fpq.h. Builds with make on macOS/Linux.
View source →
Full 307-tensor v9 roundtrip log showing per-tensor cosine, adaptive rank, E8/RVQ diagnostics.
View log →
Apple M-series, measured after 5 optimization passes (P0–P5). All numbers are real. See also: FPQ compression benchmarks.
# Public-source proof path (transient media, retained artifacts)
git clone https://github.com/Nickgonzales76017/hcp-whisper.git
cd hcp-whisper && make
./hcp-whisper -m models/ggml-tiny.en-q5_0.bin -f your-audio.wav --output-json
# Inspect confidence, realtime factor, and retained derived artifacts
# Run the full test suite (167 tests)
make test # hcp-whisper: 167/167 tests
# Bonfyre pipeline (5-8 ms per stage)
git clone https://github.com/Nickgonzales76017/bonfyre.git
cd bonfyre && make
time ./bin/bonfyre-pipeline run --input audio.wav
Current public proof set: Nursing School Explained and AHRQ Patient Safety YouTube sources linked inside the Shift Handoff app. Bonfyre downloads source media transiently, processes it, deletes the local media copy, and publishes only derived artifacts.
Every pass shipped. Every test passes. 167 tests across hcp-whisper + 2 libraries + 47 binaries.
| Metric | Before (P0) | After P5 | Improvement |
|---|---|---|---|
| Single embed | ~600 ms (Python) | 237 ms | 2.5× |
| 10-file batch embed | ~6,000 ms | 386 ms | 15.5× |
| Pipeline (6 stages) | 76 ms | 8 ms | 9.5× |
| Tag inference | ~150 ms (Python) | 6 ms | 25× |
| Hash hex conversion | ~100 ns (snprintf) | ~10 ns (LUT) | ~10× |
| Artifact struct | 1,076 bytes | 536 bytes | 2× cache density |
| Operator lookup | O(n) linear | O(1) FNV hash | algorithmic |
| Token generation | O(n²) strlen loop | O(n) tracked offset | algorithmic |
| Vector file (384-dim) | 6.4 KB JSON | 1,544 bytes VECF | 4.2× smaller |
| Public proof confidence | not measured here | 0.5303-0.6887 across 3 linked handoff videos | visible, not hidden |
| HCP pipeline | N/A | spectral + KIEL-CC + E-T Gate + formant + logit bias | <1% overhead (unified FFT) |
| Flagged segments | undetected | 6 / 43 in the current public-origin proof set | shown in proof JSON |
| Duplicate code | 34 copies | 1 each (libbonfyre) | eliminated |
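Two of the micro-optimizations in the table are classic C idioms. A hedged re-sketch, not the libbonfyre source (the constants are the standard FNV-1a parameters):

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a 32-bit: standard offset basis and prime. An O(1) operator
   lookup hashes the name once and indexes a table, replacing an
   O(n) strcmp scan. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

/* LUT-based hash-to-hex: two table lookups per byte instead of a
   formatted snprintf call per digest. */
static void to_hex(const uint8_t *in, size_t n, char *out) {
    static const char lut[] = "0123456789abcdef";
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = lut[in[i] >> 4];
        out[2 * i + 1] = lut[in[i] & 0x0F];
    }
    out[2 * n] = '\0';
}
```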
47 separate binaries. Not a monolith. Not a framework. Each is a standalone Unix process.
Each binary does one thing. Compose them with pipes, files, or the pipeline binary. bonfyre-media-prep audio.wav | bonfyre-transcribe | bonfyre-brief
Every binary runs as its own process. No shared memory. If one crashes, nothing else does. 15-minute audio files process without leaks; separate processes clean up on exit.
Whisper via libwhisper (Homebrew). LLM via llama-completion as a subprocess. SQLite via system library. No static megabinary.
Not an LLM runner
Ollama, LocalAI, and LM Studio serve LLM inference. Bonfyre is a content processing pipeline: it uses models as tools inside a larger workflow, not as the product itself.
Not a framework
No SDK, no plugins, no config DSL. Each binary reads files or stdin, writes files or stdout. Compose them however you want β shell scripts, Makefiles, GitHub Actions.
Not a monolith
47 separate executables, each 34–287 KB. Use one binary for one job, or chain 10 into a pipeline. No coupling: swap, skip, or replace any stage.
Every binary declares its behavioral class. Transform binaries are pure β same inputs, same outputs, cacheable.
Every binary is standalone. Use one or use all. ~2.1 MB total disk.
Build from source in under 60 seconds.
# From source (recommended)
git clone https://github.com/Nickgonzales76017/bonfyre.git
cd bonfyre
make # builds 2 libraries + 47 binaries
make install # copies to ~/.local/bin
# One command (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/Nickgonzales76017/bonfyre/main/install.sh | sh
Requirements: C11 compiler (gcc or clang), SQLite3 dev headers, zlib. Optional: ONNX Runtime (for embed), FreeSWITCH (for tel).
Real end-user applications powered by Bonfyre binaries. Each runs a hybrid architecture: WASM client-side preview + GitHub Actions server-side pipeline. Drag-and-drop a file, watch it process.
Every app uses the same pattern: Git hooks run Bonfyre pipelines on commit. GitHub Actions process server-side. 22 KB WASM module gives instant client-side previews.
One repo. 50 binaries. ~2.1 MB. 167 tests. 5 optimization passes.