Everything you need to evaluate, install, and run BonfyreFPQ and the Bonfyre Stack. Benchmarks, hardware fits with pricing, reproduction commands, and full architecture reference.
Pure C model compression engine. No GPU required. No Python dependencies. No training.
The SLI bridge is live: native .fpq files run inference directly, no decompression step.
Near-lossless per-tensor across all 307 encoded tensors in Wan2.1-T2V-1.3B. Worst tensor: 0.999590.
Error stays controlled after 30 stacked transformer blocks. PSNR 35.97 dB. MSE 1.01e-3.
Cosine holds at 0.9976–0.9983 across the full diffusion schedule. No drift amplification.
| Model | Domain | Original | Compressed | Tensors | Avg Cos | Worst Cos | Avg bpw | HF Model |
|---|---|---|---|---|---|---|---|---|
| Wan2.1-T2V-14B | Video diffusion | 54 GB | 27 GB | 402 | 0.999882 | 0.999826 | 4.05 | Download → |
| Phi-4 (14B) | Language model | 28 GB | 28 GB | 162 | 1.000614 | 1.000149 | 4.08 | Download → |
| Whisper Large V3 | Speech recognition | 8.7 GB | 5.8 GB | 998 | 0.999916 | 0.999834 | 4.19 | Download → |
| Whisper Large V3 Turbo | Speech recognition | 1.6 GB | 1.6 GB | 228 | 0.999929 | 0.999858 | 4.18 | Download → |
| Wan2.1-T2V-1.3B | Video diffusion | 5.3 GB | 2.7 GB | 307 | 0.999874 | 0.999590 | n/a | local |
| SmolLM2-135M | Language model | 101 MB | 258 MB (F16) | 211 | 0.999855 | 0.999589 | n/a | GGUF |
| Gemma 2B-it | Language model | n/a | n/a | sampled | 0.99995 | 0.99995 | n/a | local |
| Whisper base.en | Speech recognition | n/a | n/a | sampled | 0.999808 | 0.999763 | n/a | GGML |
Artifacts on Hugging Face: (1) compatibility safetensors (direct Transformers load), and (2) native .fpq v12 files (rANS entropy-coded, 3.5–5.2× smaller, direct inference via the SLI bridge).
Loaded the original Wan2.1 model and the FPQ-compressed version into the same WanTransformer3DModel architecture, fed both identical synthetic inputs (seed=42, shape [1,16,1,60,104], BF16 on MPS), and compared full forward-pass outputs.
| Timestep | Cosine | PSNR (dB) | MSE |
|---|---|---|---|
| t = 0 | 0.99831 | 34.82 | 1.32e-3 |
| t = 100 | 0.99792 | 35.49 | 1.13e-3 |
| t = 500 | 0.99759 | 35.97 | 1.01e-3 |
| t = 900 | 0.99782 | 36.02 | 1.00e-3 |
| t = 999 | 0.99804 | 35.71 | 1.07e-3 |
Identical inputs at each timestep, BF16 on MPS. Cosine range: 0.99759–0.99831. Zero drift.
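The per-timestep metrics in the table can be reproduced with a small helper. This is a sketch only: the repo's wan_dit_compare.py is the authoritative script, and the PSNR peak convention here (peak magnitude of the reference tensor) is an assumption.

```python
import numpy as np

def compare_outputs(ref: np.ndarray, test: np.ndarray) -> dict:
    """Cosine, MSE, and PSNR between two forward-pass output tensors.
    PSNR uses the reference tensor's peak magnitude (an assumption)."""
    a = ref.ravel().astype(np.float64)
    b = test.ravel().astype(np.float64)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    mse = float(np.mean((a - b) ** 2))
    peak = float(np.max(np.abs(a)))
    psnr = 10.0 * np.log10(peak ** 2 / mse) if mse > 0 else float("inf")
    return {"cosine": cosine, "mse": mse, "psnr_db": psnr}
```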
994-token slice, max-length 512, stride 256. All runs on same hardware, same data.
| Method | PPL | Δ vs. baseline | Avg Cos | Worst Cos |
|---|---|---|---|---|
| Baseline (FP32) | 14.20 | – | 1.0000 | 1.0000 |
| BonfyreFPQ @3-bit | 14.48 | +1.97% | 0.999783 | 0.999588 |
| HQQ @3-bit (g64) | 32.38 | +128% | – | – |
| COORD @3-bit (v4) | 35.59 | +150% | 0.982761 | 0.982327 |
169 tensors. HQQ via standalone benchmark (group-size 64, axis 1, CPU). Proof pack.
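The 512/256 strided evaluation behind these numbers can be sketched as follows. `nll_fn` is a placeholder for the model call that returns summed negative log-likelihood over a target span; the repo's perplexity_benchmark.py is the authoritative implementation.

```python
import math

def strided_perplexity(n_tokens: int, nll_fn, max_length: int = 512, stride: int = 256) -> float:
    """Sliding-window perplexity: each window scores only the tokens not yet
    counted, so every token is conditioned on up to max_length of context.
    nll_fn(begin, end, target_begin) -> summed NLL over [target_begin, end)."""
    total_nll, counted, prev_end = 0.0, 0, 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        target_begin = prev_end              # score only fresh tokens
        total_nll += nll_fn(begin, end, target_begin)
        counted += end - target_begin
        prev_end = end
        if end == n_tokens:
            break
    return math.exp(total_nll / counted)
```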
From authors' own papers. Lower PPL = better. FP16 baseline: 5.12.
| Method | Bits | PPL | Δ vs. FP16 | Source |
|---|---|---|---|---|
| FP16 (baseline) | 16 | 5.12 | – | – |
| AQLM | 3.04 | 5.46 | +6.6% | Egiazarian et al., ICML 2024 |
| SpQR | 2.98 | 6.20 | +21.1% | Dettmers et al., 2023 |
| AWQ | 3 | 6.24 | +21.9% | Lin et al., MLSys 2024 |
| GPTQ | 3.00 | 8.06 | +57.4% | Frantar et al., 2022 |
| HQQ | 3 | not published for Llama-2 | – | Badri & Shaji, 2023 |
bonfyre-fpq quantize model.gguf compressed.gguf --bits 3
Direct runtime inference from .fpq is now working. Load a compressed model and run it immediately: no conversion, no extra RAM, no hacks. The SLI bridge (Spectral Lattice Inference) is fully integrated.
patch_model(hf_model, fpq, resolver) replaces nn.Linear layers with FPQLinear. No decode step, no weight copy.
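The patching pattern can be illustrated with a toy stand-in. `DequantLinear` and `patch_linears` below are illustrative names using naive absmax quantization, not the real FPQLinear/patch_model (which keeps weights in .fpq form and never materializes a full-precision copy).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DequantLinear(nn.Module):
    """Toy stand-in for FPQLinear: stores an int8 weight plus a per-row
    scale and dequantizes inside forward()."""
    def __init__(self, linear: nn.Linear, bits: int = 3):
        super().__init__()
        qmax = 2 ** (bits - 1) - 1
        w = linear.weight.detach()
        self.scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-12)
        self.qweight = torch.round(w / self.scale).to(torch.int8)
        self.bias = linear.bias

    def forward(self, x):
        return F.linear(x, self.qweight.float() * self.scale, self.bias)

def patch_linears(model: nn.Module, bits: int = 3) -> None:
    """Recursively replace every nn.Linear with the quantized stand-in, in place."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, DequantLinear(child, bits))
        else:
            patch_linears(child, bits)
```

The in-place module swap is what makes the bridge drop-in: the rest of the model graph (attention, norms, embeddings) is untouched.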
FPQ-X evolves BonfyreFPQ from a quantizer into a full compression algebra.
All six operator families are implemented and validated in fpqx_ops.c + fpq_bridge.py.
Low-rank SVD + E8 lattice + 16D RVQ + QJL projection + Ghost correction. The proven foundation delivering 0.999+ cosine across 1,790 tensors.
Learns S = I + ABᵀ via thin SVD of the ratio matrix Q = W/Ŵ - 1. Captures scaling distortion that additive methods miss. Auto-rollback if cosine doesn't improve.
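A minimal numpy sketch of the idea, assuming a plain thin-SVD fit of the ratio matrix; the shipped operator lives in fpqx_ops.c and adds the cosine rollback check.

```python
import numpy as np

def multiplicative_correction(W, W_hat, rank=4, eps=1e-8):
    """Fit a rank-r multiplicative correction from the elementwise ratio
    Q = W / W_hat - 1, then apply it to the quantized weight W_hat."""
    denom = np.where(np.abs(W_hat) < eps, eps, W_hat)  # guard tiny entries
    Q = W / denom - 1.0
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    A = U[:, :rank] * s[:rank]            # thin factors of the ratio matrix
    B = Vt[:rank, :]
    W_corr = (1.0 + A @ B) * W_hat        # elementwise multiplicative fix
    return W_corr, (A, B)
```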
Per-column linear predictor from the low-rank basis to the quantization residual. Uses already-available L factor to predict and cancel systematic error.
Attention-weighted K-means++ on KV cache vectors. Compresses along the sequence dimension: tokens that attend similarly share one cache atom. Orthogonal to weight quantization.
Profiles each tensor: η_L (low-rank energy), spectral gap, kurtosis, outlier fraction. A decision tree selects which operators to activate and at what rank.
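A sketch of the profiling statistics. The thresholds and operator letters in the toy selection step are assumptions for illustration, not the shipped decision tree.

```python
import numpy as np

def profile_tensor(w: np.ndarray, rank: int = 16):
    """Compute profiler statistics for one 2-D weight tensor, then run a
    toy operator-selection rule (thresholds are illustrative only)."""
    x = w.ravel().astype(np.float64)
    s = np.linalg.svd(w, compute_uv=False)
    stats = {
        "eta_L": float((s[:rank] ** 2).sum() / (s ** 2).sum()),  # low-rank energy
        "spectral_gap": float(s[0] / max(s[1], 1e-12)),
        "kurtosis": float(((x - x.mean()) ** 4).mean() / x.var() ** 2),
        "outlier_frac": float((np.abs(x - x.mean()) > 4 * x.std()).mean()),
    }
    ops = ["A"]                          # additive base quantizer always on
    if stats["eta_L"] > 0.5:
        ops.append("M")                  # multiplicative low-rank correction
    if stats["outlier_frac"] > 0.01:
        ops.append("P")                  # predictive residual cancellation
    return stats, ops
```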
Inner-group quantization that aligns bit boundaries to SIMD lanes. Stores scales per group, enabling vectorized unpacking without scatter/gather overhead.
| Dimension | FPQ v10 | FPQ-X |
|---|---|---|
| Error model | Additive only (W - Ŵ) | Additive × Multiplicative + Predictive |
| Per-tensor policy | Same pipeline for all | Δ profiles η_L, gap, kurtosis → selects operators |
| KV cache | Weight-only quantization | D operator: sequence-axis distillation |
| Hardware awareness | Generic packing | H operator: SIMD-lane-aligned groups |
| Objective | min ‖W - Ŵ‖ | min λ_R·Rate + λ_D·Distortion + λ_E·Execution |
| Research basis | Original FPQ design | 9 papers from early 2026 |
# Full A+M+Δ pipeline
bonfyre-fpqx compress model.safetensors compressed.safetensors --bits 3
# Roundtrip quality test
bonfyre-fpqx roundtrip model.safetensors --bits 3
# Per-tensor compressibility analysis
bonfyre-fpqx profile model.safetensors
# KV cache distillation
bonfyre-fpqx distill cache.safetensors distilled.safetensors --atoms 256
# Hardware-aligned repacking
bonfyre-fpqx pack model.safetensors packed.safetensors --bits 3 --group-size 128
Baseline cosine numbers (bonfyre-kvcache C benchmark). All 9 Python optimizations live in fpq_bridge.py.
| Bits | KV Cosine | Hardware implication |
|---|---|---|
| 5-bit | 0.99996 | ~5.3× more context: 8K ctx → 42K in same VRAM |
| 4-bit | 0.99994 | 4× more context; recommended for production |
| 3-bit | 0.99990 | 5.3× context, some quality loss on long sequences |
High-attention blocks dominate tile assignment. Codebook quality concentrates where the model actually looks.
The Δ-profiler analyzes each K/V layer (kurtosis, spectral gap, outlier fraction) to pick the right bit depth automatically.
One 256-tile codebook learned across 8 sample layers. Skips per-call K-means: all layers compress in amortized O(1).
K-means++ on KV vectors to K atoms (K ≪ N). Tokens that attend similarly share one atom. Bug-free nearest-centroid lookup.
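A minimal sketch of the distillation step, with greedy farthest-point seeding standing in for true k-means++ (an assumption made here so the example is deterministic).

```python
import numpy as np

def distill_kv(vectors, attn_weights, k, iters=8):
    """Attention-weighted k-means over KV cache vectors: tokens that attend
    alike share one cache atom. Farthest-point seeding stands in for k-means++."""
    atoms = [vectors[0]]
    for _ in range(k - 1):                      # deterministic seeding
        d = np.min([((vectors - a) ** 2).sum(1) for a in atoms], axis=0)
        atoms.append(vectors[d.argmax()])
    atoms = np.array(atoms, dtype=float)
    for _ in range(iters):
        d = ((vectors[:, None, :] - atoms[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)                    # nearest-atom lookup
        for j in range(k):
            m = assign == j
            if m.any():                         # attention-weighted mean
                w = attn_weights[m][:, None]
                atoms[j] = (w * vectors[m]).sum(0) / w.sum()
    return atoms, assign
```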
Only the delta vs previous frame is compressed. Each new token costs far less than storing a new frame.
E8 coordinate magnitude as a Huffman code-length proxy. High-cost blocks get upweighted: rate and quality jointly optimized.
Near-zero blocks (max abs ≤ 63) bypass the E8 lattice: 7-bit integer round + clamp. Significant throughput win on embedding layers.
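The fast path can be sketched as a simple dispatch; the `"e8"` branch below is a placeholder for the real lattice quantizer.

```python
import numpy as np

def encode_block(block: np.ndarray):
    """Dispatch one scale-normalized block: values that all fit the signed
    7-bit range [-64, 63] skip the E8 lattice search entirely."""
    if np.max(np.abs(block)) <= 63:
        q = np.clip(np.rint(block), -64, 63).astype(np.int8)
        return "int7", q
    return "e8", None  # fall through to the E8 lattice quantizer
```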
Per-row scale vector on each FPQLinear. Applied after SLI matmul: corrects per-output-channel amplitude drift.
ARM NEON 128-bit aligned pre-packing. Eliminates scatter/gather: vectorized unpacking on Apple Silicon and Jetson.
Use individually or compose: patch_kv_cache(adaptive_bits=True, shared_tiles=tiles) activates #4 + #5 simultaneously.
FPQ weight compression + KV cache optimizations change the hardware equation. What used to need cloud GPUs now fits on-device. Use cases shift depending on your hardware budget.
| Device | RAM | Approx. Cost (2026) | Before (BF16) | After (FPQ 4-bit + KV) |
|---|---|---|---|---|
| Raspberry Pi 5 | 8 GB | $80 | TinyLlama only, 512-token ctx, no video | TinyLlama + 2K ctx, Whisper turbo inference, local ASR pipeline |
| Jetson Orin Nano | 8 GB | $250 | Qwen 0.5B only, degraded at >512 ctx | Qwen 0.5B @ 4K ctx · NEON packing · embeddings + FPQ inference co-resident |
| Apple M1 MacBook (16 GB) | 16 GB unified | $900 refurb | Qwen 0.5B (tight), Wan 1.3B (no headroom) | Wan 1.3B + 4K ctx KV · HCP speech + SLI co-resident · NEON-packed |
| Apple M2/M3 Max (64 GB) | 64 GB unified | $2,500–3,500 | Phi-4 14B (no KV headroom past 2K) | Wan 14B @ 8K ctx · Phi-4 @ 32K ctx with delta KV · full pipeline concurrent |
| T4 cloud (16 GB VRAM) | 16 GB VRAM | ~$0.35/hr spot | Qwen 3B, 2K ctx max before OOM | Qwen 3B @ 8K ctx · Wan 1.3B + full diffusion sweep · shared KV codebook |
| RTX 4090 (24 GB VRAM) | 24 GB VRAM | $1,600 GPU / ~$0.50/hr cloud | Wan 1.3B (tight), Phi-4 14B doesn't fit | Wan 1.3B @ 32K ctx · Phi-4 14B fits · adaptive bits saves ~30% KV RAM |
| RTX 6000 Ada (48 GB VRAM) | 48 GB VRAM | ~$1.10/hr (RunPod) | Wan 14B (tight), long video sequences OOM | Wan 14B · 287 SLI layers @ 5-timestep sweep · multi-second video KV cached |
Local ASR with Whisper turbo. TinyLlama inference. Bonfyre pipeline binaries. Edge transcription kiosk.
Qwen 0.5B + 4K ctx. Wan 1.3B video. HCP speech + SLI co-resident. NEON packing. Full local pipeline.
Wan 14B @ 8K ctx. Phi-4 @ 32K ctx with delta KV. Full Bonfyre pipeline + inference concurrent. Production-grade local stack.
Burst GPU for video generation, large model inference, SLI sweeps. Use FPQ to fit bigger models on cheaper instances. T4 now handles what used to need A100.
TinyLlama 1.1B stays in .fpq at runtime, no decode step. The BF16 copy never materializes.
4-bit KV compression in same VRAM budget. Delta encoding makes each new token incremental.
H-operator pre-packing on Apple Silicon and Jetson: vectorized unpacking, no scatter/gather.
FPQ model + HCP speech + vector search + pipeline can run concurrently on a 16 GB Mac.
Every number from a real run. Raw logs, scripts, and CSVs in the repo. → FPQ overview
BF16 safetensors (drop-in) + native .fpq v12 files (rANS, direct SLI inference).
Qwen PPL (v8 vs v4 vs HQQ), Whisper roundtrip, CSV, PNG chart, reproduction commands.
View proof pack →
Full benchmark report: version progression, weight tables, KV cache, speed optimization, binary sizes.
View benchmarks doc →
Pure C11: main.c, fpq_codec.c, ggml_reader.c, fpq.h. Builds with make on macOS/Linux.
View source →
Apple M-series, after 6 optimization passes (P0–P6). All real. FPQ benchmarks →
| Metric | Before (P0) | After P5 | Improvement |
|---|---|---|---|
| Single embed | ~600 ms | 237 ms | 2.5× |
| 10-file batch embed | ~6,000 ms | 386 ms | 15.5× |
| Pipeline (6 stages) | 76 ms | 8 ms | 9.5× |
| Tag inference | ~150 ms | 6 ms | 25× |
| Hash hex | ~100 ns | ~10 ns | ~10× |
| Artifact struct | 1,076 bytes | 536 bytes | 2× cache density |
| Vector file (384-dim) | 6.4 KB JSON | 1,544 bytes VECF | 4.2× smaller |
bonfyre-moq replaces the Node.js MoQ relay with a pure C11 implementation.
Built on ngtcp2 + nghttp3 + OpenSSL 3 + SQLite. Ships with four live extension modules:
inline inference, RL self-tuning, gossip mesh, and lightweight consensus, all in the same binary.
inference.c + inference_onnx.c: Entropy-based scoring (score 0–100) on every forwarded MoQ object. Weak-linked ONNX Runtime via dlopen; falls back to the entropy estimator when ONNX Runtime is absent. Same hook as bonfyre-embed.
optimizer.c: Background RL agent tunes relay buffer size (16 KB–256 KB) and path policy (round-robin vs. least-loaded) every 2 s. Reward signal driven by real relay metrics. Exposes current params for use by the forwarding loop.
mesh.c + bonfyre-mesh.h: UDP multicast gossip beacon on 239.0.0.57:7942. A background thread maintains a live peer table (peer_info[]). Enables distributed relay clusters without a central registry. The shared peer table feeds the consensus module.
consensus.c: Simulated Raft that returns the stable leader peer from the mesh table. Used to shard MoQ PUBLISH_NAMESPACE announcements and SUBSCRIBE routing across a relay cluster without split-brain.
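The entropy fallback can be sketched in a few lines. The 0–100 scaling shown (8 bits/byte maps to 100) is an assumption about how inference.c normalizes its score.

```python
import math
from collections import Counter

def entropy_score(payload: bytes) -> int:
    """Score a payload 0-100 by Shannon entropy of its byte histogram.
    Normalization (8 bits/byte = 100) is an illustrative assumption."""
    if not payload:
        return 0
    n = len(payload)
    h = -sum(c / n * math.log2(c / n) for c in Counter(payload).values())
    return round(h / 8.0 * 100)
```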
# Build relay + all extension modules
make bonfyre-moq
# Run relay (MoQ + mesh + optimizer + inference)
./bonfyre-moq --host 127.0.0.1 --port 4443 \
--runtime-dir /tmp/bonfyre-moq \
--db /tmp/bonfyre-moq/relay.db
# Smoke-test all extension modules
make test-bonfyre # inference + optimizer + mesh + consensus
48 separate binaries. Not a monolith. Not a framework. Each is a standalone Unix process.
Each binary does one thing. Compose with pipes, files, or the pipeline binary. bonfyre-media-prep audio.wav | bonfyre-transcribe | bonfyre-brief
Every binary runs as its own process. No shared memory. If one crashes, nothing else does. Separate processes clean up on exit.
Whisper via libwhisper (Homebrew). LLM via llama-completion subprocess. SQLite via system library.
Every binary is standalone. Use one or all. ~2.1 MB total disk.
Each is standalone β you don't need to understand the whole system.
Run models directly from .fpq files. patch_model() replaces nn.Linear with FPQLinear. TinyLlama: 155 layers, cos=0.997, top-1=97.3%. bonfyre-oss.
patch_model(hf_model, fpq, resolver)
4Γ more context tokens in same VRAM. 9 optimization passes. Works with nn.Linear and FPQLinear simultaneously.
patch_kv_cache(model, bits=4, adaptive_bits=True)
Quantize LLM weights to 3-bit with 0.9999+ cosine. E8 lattice snap + ΞΌ-law warp + 16D RVQ. 42 KB binary.
bonfyre-quant benchmark model.gguf --bits 3
Replace Strapi's 500 MB with a 287 KB binary. Dynamic schemas, token auth, REST API. bonfyre-cms.
bonfyre-cms serve --port 8800
Local speech for public or private audio. Live proof. bonfyre-intake.
bonfyre-transcribe run audio.wav
Audio → transcript → summary → quality score → pricing → deliverable. 5–8 ms per stage. bonfyre-pipeline.
bonfyre-pipeline run --input audio.mp3
Embed docs + NEON SIMD cosine search. Replace $250/mo Pinecone. bonfyre-embed.
bonfyre-embed --insert-db my.db
Pure C WebTransport relay with inline AI scoring, RL self-tuning, gossip peer discovery, and consensus leader election. Replaces Node.js moq-edge.
./bonfyre-moq --port 4443 --db relay.db
Drop-in replacement. Set OPENAI_API_BASE=http://localhost:8787. 53 KB binary.
bonfyre-proxy serve --port 8787
Every number on this site comes from scripts in the repo.
# Weight roundtrip β Wan2.1 (307 tensors)
./bonfyre-fpq roundtrip-v9 ~/.local/share/models/wan2.1-t2v-1.3b/diffusion_pytorch_model.safetensors --bits 3
# Compress GGUF (llama.cpp compatible)
./bonfyre-fpq quantize model.gguf compressed.gguf --bits 3
# Compress safetensors
./bonfyre-fpq quantize model.safetensors compressed.safetensors --bits 3
# Perplexity benchmark
python3 perplexity_benchmark.py --model Qwen/Qwen2.5-0.5B --bits 3 --mode v8
# DiT forward-pass comparison
python3 scripts/wan_dit_compare.py
# SLI bridge inference test
python3 test_sli_plus_kvcache.py --device mps
# Full Bonfyre pipeline
git clone https://github.com/Nickgonzales76017/bonfyre.git && cd bonfyre && make
time ./bin/bonfyre-pipeline run --input audio.wav
# Run tests
make test # 167/167 tests
All scripts, logs, and CSVs: 10-Code/BonfyreFPQ/ Β· Proof pack: results/2026-04-10-proof-pack/
Build from source in under 60 seconds.
# From source (recommended)
git clone https://github.com/Nickgonzales76017/bonfyre.git
cd bonfyre
make # builds 2 libraries + 47 binaries
make install # copies to ~/.local/bin
# One command (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/Nickgonzales76017/bonfyre/main/install.sh | sh
# BonfyreFPQ / bonfyre-oss
git clone https://github.com/Nickgonzales76017/bonfyre-oss.git
cd bonfyre-oss && make
Requirements: C11 compiler (gcc or clang), SQLite3 dev headers, zlib. Optional: ONNX Runtime (embed), FreeSWITCH (tel), PyTorch (SLI bridge).
Full source. Real benchmarks. Reproduction scripts included.