LAC — Lo Audio Codec
Lossless audio codec for internal use. Target compression is FLAC-class (~50% of raw). Integer-only, bit-exact, streaming-oriented.
Scope
- Input: signed integer PCM passed as `i32` with `|sample| ≤ 2²³ − 1`. 8-bit, 16-bit, 20-bit, and 24-bit sources are all valid without conversion — they compress at the bit cost of their actual values, not a 24-bit ceiling.
- Sample rate: caller-specified; not encoded in the stream. The container or transport carries it.
- Channels: mono per encoded stream. Stereo is two independent mono streams — for example, two QUIC streams over a shared connection, one per channel. No cross-channel joint coding.
- Frames: independently decodable. No cross-frame state; a lost or corrupt frame never affects subsequent decodes.
Pipeline
```
samples → LPC analysis → residuals → partitioned Rice → frame bytes
                  ↓
        (inverse for decode)
```
Three encoder-side choices, searched per frame:
- LPC order: the reference encoder tries a sparse grid `{0, 2, 4, 6, 8, 10, 12, 16, 20, 24, 28, 32}` with a 2-order early-out once cost stops improving. Order 0 is verbatim (residuals equal the raw samples). The wire format permits any order in `[0, 32]`.
- Coefficient shift ∈ `[0, 5]`: widens the Q-format of the stored predictor coefficients from Q15 (range `[−1, 1)`) out to Q10 (range `[−32, 32)`) so low-frequency / narrow-resonance content doesn't clamp `|a[1]|` near 2. Chosen deterministically per order as the smallest shift that avoids clamping.
- Rice partition order ∈ `[0, 7]`: splits the residual stream into `2^partition_order` equal partitions, each with its own Rice parameter `k ∈ [0, 23]` chosen by convex descent.
Levinson-Durbin runs once up to order 32 into a flat stack-allocated
buffer (`LpcLevels`), and the per-order coefficients are consulted by
slice reference; the order search itself does no heap allocation.
Intended use
- QUIC streaming — one reliable stream per audio channel. Frames fit the per-stream framing (length-prefixed or datagram-mapped) without modification.
- Offline file playback — a container pairs the channel streams by timestamp; each stream decodes independently.
Frame size guidance
Frame size is a latency-vs-compression knob chosen at the application
layer. The codec accepts any frame_sample_count in [1, 65535], but
the LPC/Rice search amortises better on larger frames (shared header,
more samples per fitted coefficient vector). Concrete defaults:
| Use case | Frame size | Latency | Notes |
|---|---|---|---|
| Real-time voice, tight latency | 160 @ 16 kHz | 10 ms | matches WebRTC/Opus 10 ms mode |
| Real-time voice, balanced | 320 @ 16 kHz | 20 ms | default for MCU workload in `tests/mcu_mix.rs` |
| Game/conf streaming | 960 @ 48 kHz | 20 ms | one QUIC datagram per frame fits typical MTUs |
| Music streaming | 2048 @ 48 kHz | 43 ms | compression benefit flattens past this |
| Offline archival | 4096 @ 48 kHz | 85 ms | tightest LPC fit; default in `tests/corpus.rs`; matches FLAC's default blocksize for apples-to-apples compression comparison |
Partition orders that evenly divide the frame size dominate the search
cost. Power-of-two frame sizes (256, 512, 1024, 2048, 4096) unlock every
partition_order ∈ [0, 7]; 960 and 2880 (common WebRTC rates) allow
orders up to 6 and 5 respectively; prime sizes like 137 collapse to
partition_order = 0. Prefer power-of-two frame sizes unless a
container format constrains the choice.
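The divisibility rule above can be sketched as a small helper. This is illustrative only (`max_partition_order` is not part of the LAC API) and it captures only the even-division constraint; the spec may impose further per-partition limits, which would explain why 2880 caps out at order 5 rather than the 6 that pure divisibility allows:

```rust
/// Largest partition order p ∈ [0, 7] such that 2^p evenly divides the
/// frame length. Illustrative sketch of the divisibility rule only;
/// not part of the LAC API.
fn max_partition_order(frame_len: usize) -> u32 {
    (0..=7u32)
        .rev()
        .find(|&p| frame_len % (1usize << p) == 0)
        .unwrap_or(0)
}

fn main() {
    assert_eq!(max_partition_order(4096), 7); // power of two: every order
    assert_eq!(max_partition_order(960), 6);  // 960 = 2^6 · 15
    assert_eq!(max_partition_order(137), 0);  // prime: order 0 only
    println!("ok");
}
```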
Structure
```
lac/
├── Cargo.toml
├── README.md                ← you are here
├── Specification.md         ← wire format specification
├── corpus/                  ← test WAVs (speech + music), LFS-tracked via .gitattributes
├── src/
│   ├── lib.rs               ← public API and project-wide constants
│   ├── bit_io.rs            ← MSB-first bit reader/writer
│   ├── lpc.rs               ← Levinson-Durbin, LpcLevels flat buffer, residuals/synthesis
│   ├── rice.rs              ← zigzag + partitioned Rice coding, convex-descent k
│   ├── frame.rs             ← frame header, encode_frame, decode_frame
│   └── test_signals.rs      ← integer-only sine LUT for float-free test inputs
├── tests/
│   ├── corpus.rs            ← compression ratio + FLAC comparison on real audio
│   ├── synthetic.rs         ← bit-depth + pathological-content round-trips, no corpus needed
│   ├── latency.rs           ← P50/P95/P99/max encode+decode latency, peak heap, alloc count
│   └── mcu_mix.rs           ← end-to-end MCU workload (decode → mix → re-encode)
├── benches/
│   ├── codec.rs             ← nightly #[bench] harness (encode, decode, compute_residuals)
│   └── compare-flac.sh      ← diagnostic shell script: wall-clock flac encode across corpus
└── fuzz/
    ├── fuzz_targets/
    │   ├── decode_arbitrary.rs      ← decoder robustness under arbitrary bytes
    │   └── roundtrip_arbitrary.rs   ← encoder/decoder self-consistency
    └── dict/
        ├── decode_arbitrary.dict    ← libFuzzer dict: sync word + field boundary constants
        └── roundtrip_arbitrary.dict ← libFuzzer dict: sample-value boundaries
```
See Specification.md for the normative wire format.
Public API
Every sample is an i32 with magnitude bounded by 2²³ − 1. Narrower
integer sources go through unchanged:
```rust
use lac::{encode_frame, decode_frame};

// 16-bit microphone PCM → just widen with `i32::from`. Do NOT shift
// left by 8 to "align" to 24-bit: that multiplies residual magnitudes
// by 256 and costs 8 extra bits per residual in the Rice payload. The
// codec compresses at the bit cost of the actual sample magnitudes,
// not a 24-bit ceiling.
let pcm_16: Vec<i16> = /* from microphone */ Vec::new();
let samples: Vec<i32> = pcm_16.iter().map(|&s| i32::from(s)).collect();

let bytes = encode_frame(&samples);
let recovered: Vec<i32> = decode_frame(&bytes)?;
assert_eq!(recovered, samples);
# Ok::<(), lac::DecodeError>(())
```
For 24-bit PCM, samples are already in range — pass through directly.
For 8-bit PCM, use `i32::from(s as i8)` (signed) or the equivalent
conversion from your unsigned-offset-128 source.
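For the unsigned case, the conversion is a subtraction before widening. A minimal sketch, assuming a WAV-style offset-binary `u8` source with silence at 128 (`widen_u8` is a hypothetical helper, not part of the LAC API):

```rust
/// Widen unsigned 8-bit offset-binary PCM (silence at 128) to centred
/// i32 samples. Hypothetical helper for illustration only.
fn widen_u8(pcm_8: &[u8]) -> Vec<i32> {
    // Centre on zero first; no left-shift "alignment" is needed or wanted.
    pcm_8.iter().map(|&s| i32::from(s) - 128).collect()
}

fn main() {
    assert_eq!(widen_u8(&[128, 255, 0]), vec![0, 127, -128]);
    println!("ok");
}
```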
Round-trip is bit-exact: decode_frame(encode_frame(s)) == s for every
valid s.
Buffer-reusing API for hot loops
For the MCU re-encode fanout and QUIC senders that own a per-channel
scratch buffer, use encode_frame_into / decode_frame_into to
target a caller-owned Vec<u8> / Vec<i32> instead of allocating
fresh on each call:
```rust
use lac::{encode_frame_into, decode_frame_into};

let mut encoded = Vec::new(); // one buffer per channel, reused across frames
let mut decoded = Vec::new();

for frame_samples in frames_iter() {
    encode_frame_into(&frame_samples, &mut encoded);
    // … send `encoded` …
}
for incoming_bytes in incoming_iter() {
    decode_frame_into(&incoming_bytes, &mut decoded)?;
    // … consume `decoded` …
}
# fn frames_iter() -> impl Iterator<Item = Vec<i32>> { std::iter::empty() }
# fn incoming_iter() -> impl Iterator<Item = Vec<u8>> { std::iter::empty() }
# Ok::<(), lac::DecodeError>(())
```
Both `_into` variants clear the destination at entry and retain its
capacity, so steady-state usage makes zero allocations past the first
frame.
Output size expectations
For realistic audio (speech, music, ambient), compressed frames land around 15-55 % of raw sample bytes (speech near the low end, music near the high end). Callers reusing a scratch buffer can safely preallocate to 1× raw and take the extension cost only on the rare adversarial frame.
For untrusted input — payloads where residuals might be crafted to
maximise Rice output — the worst-case expansion bound is ~17× raw: at
the Rice k = 23 ceiling, each codeword is up to 535 bits (511 unary
zeros + terminator + 23 remainder), or ~67 bytes per residual. A
pipeline that must pre-size a bounded output buffer for arbitrary
input can use samples.len() * 68 bytes as a loose upper bound. The
encoder never exceeds this.
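The arithmetic behind the bound, written out as a checkable sketch (the constants come from the paragraph above):

```rust
// Worst-case Rice codeword at the k = 23 ceiling, as stated above:
// 511 unary zeros + 1 terminator bit + 23 remainder bits.
fn main() {
    let worst_codeword_bits = 511 + 1 + 23;
    assert_eq!(worst_codeword_bits, 535);

    // Rounded up to whole bytes: ⌈535 / 8⌉ = 67; the loose pre-sizing
    // constant pads this to 68 bytes per residual (≈ 17× a 4-byte sample).
    assert_eq!((worst_codeword_bits + 7) / 8, 67);
    println!("ok");
}
```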
Error recovery
On decode failure the caller substitutes `frame_sample_count` zeros
(silence) for the frame period. The count is recoverable from the
frame itself as long as the header parsed, even if the bitstream
body then failed — call `parse_header` on the same buffer:
```rust
use lac::{decode_frame, parse_header};

const SESSION_DEFAULT_FRAME: usize = 320; // negotiated at session start

let bytes = Vec::<u8>::new();
let samples = match decode_frame(&bytes) {
    Ok(s) => s,
    Err(_) => {
        let count = parse_header(&bytes)
            .map(|(h, _)| h.frame_sample_count as usize)
            .unwrap_or(SESSION_DEFAULT_FRAME);
        vec![0i32; count]
    }
};
```
When the header itself fails (`BadSyncWord`, `InvalidPredictionOrder`,
`InvalidPartitionOrder`, `InvalidCoefficientShift`, or `Truncated`
below 7 bytes), the frame length is unknowable and the caller must
fall back to a session-level default.
Concurrency
LAC's encode and decode APIs are pure functions with no shared state —
no globals, no internal Mutex, no unsafe. All public types are
Send + Sync. Calls on different threads never contend with each
other, and each call's scratch buffers are owned (stack or the
caller-supplied Vec).
The intended deployment shape for multi-channel and multi-stream
workloads is one thread or task per channel. The codec itself does
no threading: scheduling is left to the application so it can pick
whichever executor fits (tokio for async servers, rayon for data-
parallel workloads, std::thread for straight-ahead concurrency).
MCU re-encode fanout with stdlib primitives only:
```rust
use std::thread;
use lac::encode_frame;

let mixes: Vec<Vec<i32>> = Vec::new();
let outgoing: Vec<Vec<u8>> = thread::scope(|s| {
    let handles: Vec<_> = mixes
        .iter()
        .map(|mix| s.spawn(move || encode_frame(mix)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
});
```
Or with rayon, if the project already pulls it in:

```rust
// use rayon::prelude::*;
// let outgoing: Vec<Vec<u8>> = mixes.par_iter().map(|m| encode_frame(m)).collect();
```
The allocator you link against sets the ceiling on multi-core
scaling: glibc malloc has measurable lock contention at tens of
cores, whereas mimalloc / jemalloc keep per-thread caches and scale
further. The codec itself doesn't care which one you pick — it allocs
through the global allocator like any other Rust library.
Input-size caps on untrusted channels
Applications accepting LAC frames from untrusted peers should cap the
per-frame input size at the application layer. The decoder's
per-codeword unary-run bound (spec §4.2) prevents any single codeword
from consuming unbounded CPU, but total decode cost scales with
buffer length; an attacker handed an unbounded payload can force
proportional scan work. Typical real frames are sub-kilobyte; a cap
of 64 KB per frame is comfortably above any legitimate LAC payload
and cheap to enforce at the framing layer (QUIC stream length
field, length-prefixed framing, etc.). The Truncated error fires
naturally when a payload is cut, so a hard cap doesn't break legal
traffic — it just bounds pathological work.
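A minimal sketch of that framing-layer guard. The 64 KB constant is the suggestion above, not a codec requirement, and `accept_frame` is a hypothetical name:

```rust
/// Hypothetical framing-layer guard: reject oversized payloads before
/// they reach the decoder. 64 KB is the suggested cap from above.
const MAX_FRAME_BYTES: usize = 64 * 1024;

fn accept_frame(payload: &[u8]) -> Result<&[u8], &'static str> {
    if payload.len() > MAX_FRAME_BYTES {
        Err("frame exceeds per-frame input cap")
    } else {
        Ok(payload)
    }
}

fn main() {
    assert!(accept_frame(&[0u8; 1024]).is_ok());
    assert!(accept_frame(&vec![0u8; 65 * 1024]).is_err());
    println!("ok");
}
```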
Silence-substitution amplification
Spec §6.1 mandates that callers substitute frame_sample_count zeros
on decode failure. An attacker can craft a tiny frame (~10-byte
header with frame_sample_count = 65535) whose Rice payload is
malformed; the decoder rejects, the caller dutifully emits 65 535
output samples of silence. At 48 kHz mono i32, that's ~256 KB of
zeros per ~10-byte input frame — a ~25 000× amplification.
The output is silence, not attacker-chosen data, so this is a
downstream-resource-exhaustion vector (memory, bandwidth,
re-encode work at an MCU) rather than a data-injection vector.
Mitigation is at the application layer: cap `frame_sample_count`
to the session's negotiated frame size before invoking the silence
substitution. QUIC / WebRTC sessions already negotiate a frame size
at setup; using that as a hard upper bound on the silence-fill
length collapses the amplification ratio to 1×. An MCU that uses the
parsed header's `frame_sample_count` without validating it against
the session cap inherits the amplification unchanged.
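The cap itself is one line; sketched here with a hypothetical helper name:

```rust
/// Hypothetical helper: bound the silence fill by the session's
/// negotiated frame size so a forged header cannot amplify output.
fn silence_fill_len(header_count: usize, session_frame: usize) -> usize {
    header_count.min(session_frame)
}

fn main() {
    assert_eq!(silence_fill_len(65535, 320), 320); // forged header: clamped
    assert_eq!(silence_fill_len(160, 320), 160);   // legitimate short frame: unchanged
    println!("ok");
}
```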
Packet loss & concealment
Frames are independently decodable: losing one frame never corrupts another, regardless of which concealment strategy the application picks. This is a genuine deployment asset on lossy transports (QUIC datagrams, UDP), and the sections below walk through the plausible strategies in increasing order of quality.
Strategy 1: silence substitution (the default)
The baseline decode_frame returns Err on structural failure; the
application substitutes frame_sample_count zeros for the lost frame
period (see parse_header recovery pattern under Public API →
Error recovery). Fast, deterministic, audible as a brief cut —
acceptable for voice up to ~20 ms of loss, jarring beyond that.
Strategy 2: sample-and-hold
Repeat the last successfully decoded sample for the frame period. Zero-cost on the decoder side, preserves DC level so the click at the drop boundary is softer than silence. Quality at 20 ms of loss is better than silence for voice, slightly worse for music (DC hold on a non-stationary signal adds a small transient when the next frame arrives).
```rust
// After a successful decode, store the last sample for reuse on loss.
// On loss: fill the gap with that value.
# fn last_decoded_sample() -> i32 { 0 }
# const N: usize = 320;
let conceal = vec![last_decoded_sample(); N];
```
Strategy 3: linear fade
Interpolate from the last valid sample down to zero over the lost frame period. Removes the DC-hold transient and the "cut to silence" click both. Costs N integer adds per lost frame. Recommended baseline for any application that can afford 2-5 lines of PLC code.
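A sketch of the fade, integer-only (`linear_fade` is a hypothetical helper, not part of the LAC API; a production version would use an incremental step to keep the per-sample cost to adds, as the text suggests):

```rust
/// Ramp from the last valid sample toward zero over `n` lost samples.
/// Hypothetical helper for illustration; integer-only arithmetic.
fn linear_fade(last: i32, n: usize) -> Vec<i32> {
    (0..n)
        .map(|i| (i64::from(last) * (n - i) as i64 / n as i64) as i32)
        .collect()
}

fn main() {
    assert_eq!(linear_fade(1000, 4), vec![1000, 750, 500, 250]);
    println!("ok");
}
```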
Strategy 4: LPC-coefficient extrapolation
The last successfully decoded frame's AudioFrameHeader carries
the LPC coefficients the encoder chose — available from
parse_header at no extra cost — and the LPC filter is locally
stationary over a 20-40 ms horizon. Run the synthesis formula (§3.6
of Specification.md) forward from the last decoded samples to predict
the missing frame. Quality is best on pitched content (voiced
speech, sustained notes); on transients it degrades gracefully
because the predictor's autoregressive behaviour damps toward zero
over the frame.
Not built into the library — the math is straightforward and the
"right" tuning varies by deployment (how much damping, whether to
blend with sample-and-hold on transients, etc.). See src/lpc.rs's
lpc_synthesize_into for the integer synthesis routine that a
PLC implementation would call.
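The shape of such an extrapolator, sketched generically. This is not the library routine (that is `lpc_synthesize_into`, whose exact signature is not shown here); the Q-format layout below is an assumption for illustration, with each predicted sample computed as `(Σ coeffs[i] · x[t−1−i]) >> q_shift`:

```rust
/// Illustrative AR extrapolation with hypothetical coefficient layout.
/// `history` must hold at least `coeffs.len()` warm-up samples; returns
/// the `n` predicted samples that conceal the lost frame.
fn extrapolate(history: &[i32], coeffs: &[i32], q_shift: u32, n: usize) -> Vec<i32> {
    assert!(history.len() >= coeffs.len());
    let mut buf: Vec<i32> = history.to_vec();
    for _ in 0..n {
        // Integer dot-product of coefficients against the most recent samples.
        let acc: i64 = coeffs
            .iter()
            .enumerate()
            .map(|(i, &c)| i64::from(c) * i64::from(buf[buf.len() - 1 - i]))
            .sum();
        buf.push((acc >> q_shift) as i32); // arithmetic shift back out of Q-format
    }
    buf[history.len()..].to_vec()
}

fn main() {
    // Order-1 predictor with coefficient 1.0 in Q15 holds the last sample.
    assert_eq!(extrapolate(&[5], &[1 << 15], 15, 3), vec![5, 5, 5]);
    println!("ok");
}
```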
Multi-frame loss guidance
The strategies above are only useful up to a handful of consecutive lost frames. Rough thresholds at 20 ms frame periods:
| Consecutive lost frames | Effective loss | Verdict |
|---|---|---|
| 1 | 20 ms | Inaudible with fade or LPC extrapolation; brief click with silence or sample-and-hold |
| 2-3 | 40-60 ms | Noticeable glitch; LPC extrapolation minimises but cannot hide it |
| 4-10 | 80-200 ms | Audible dropout. PLC keeps the audio from sounding "broken" but doesn't restore content |
| > 10 | > 200 ms | Treat the stream as broken; reset the receiver's concealment state to avoid droning artifacts, and if possible ask the transport to signal "resync" upstream |
Mid-stream resync on a datagram transport uses the sync word
(`0x1ACC`) as an alignment anchor: on a string of bad frames,
search the next N bytes of the buffer for the big-endian sequence
`\x1a\xcc` and retry `parse_header` from each candidate offset
until one succeeds. The search is O(N); on a 20 ms frame at 48 kHz
there are at most ~180 bytes per frame to scan, so the amortised cost
is negligible.
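The candidate scan is a two-byte window search; a minimal sketch (`sync_candidates` is a hypothetical helper, not part of the LAC API):

```rust
/// Candidate offsets of the big-endian sync word 0x1ACC in a buffer.
/// Hypothetical helper; each offset would then be retried with
/// `parse_header` until one parses.
fn sync_candidates(buf: &[u8]) -> Vec<usize> {
    buf.windows(2)
        .enumerate()
        .filter(|&(_, w)| w == [0x1a, 0xcc])
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let buf = [0x00, 0x1a, 0xcc, 0xff, 0x1a, 0xcc];
    assert_eq!(sync_candidates(&buf), vec![1, 4]);
    println!("ok");
}
```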
Testing
```sh
cargo test                                                           # unit tests
cargo test --test corpus --release -- --nocapture                    # compression vs FLAC, lac_enc_ms
cargo test --test synthetic --release -- --nocapture                 # bit-depth + pathological content
cargo test --test latency --release -- --nocapture --test-threads=1  # p50/p95/p99 + alloc count
cargo test --test mcu_mix --release -- --nocapture --test-threads=1  # MCU throughput
cargo test --test conformance --release -- --nocapture               # byte-level spec conformance
cargo test --test determinism --release                              # encode byte-equality on repeat

cargo fuzz run decode_arbitrary -- -dict=dict/decode_arbitrary.dict
cargo fuzz run roundtrip_arbitrary -- -dict=dict/roundtrip_arbitrary.dict

cargo bench                # nightly bench
benches/compare-flac.sh    # flac side of the speed table
```
Published-crate caveat. `Cargo.toml` excludes `corpus/*` and
`fuzz/*` from the published tarball — they'd blow up the crate size,
and the audio isn't redistributable under crates.io's constraints anyway.
A user running `cargo test` against a dependency added with
`cargo add lac` sees every corpus test pass because the
`require_corpus!` macro silently skips missing files; the
compression-ratio assertions, FLAC comparisons, latency P99 checks,
and MCU throughput checks all go unrun. The full regression suite
requires the git repository (with LFS pulled). The synthetic,
conformance, determinism, and unit tests run unchanged from either
source.
Coverage at a glance:
- Unit — round-trips for every LPC order 0-32 and every partition order 0-7, prime frame lengths that force `partition_order = 0`, all-zero frames, full-scale sample magnitudes, malformed-header rejection for every field (`sync_word`, `prediction_order`, `partition_order`, `coefficient_shift`), truncated bitstreams, and a convex-descent vs exhaustive-search `select_k` differential.
- Corpus — round-trip + compression-ratio + FLAC subprocess comparison on a mixed speech and music corpus; asserts ratio ceilings so a codec regression fails CI; prints LAC encode wall-clock for correlation against `benches/compare-flac.sh`.
- Synthetic — deterministic LFSR-driven round-trips at 8/16/20/24-bit source widths and pathological content (all-zero, DC offset, Nyquist square, silence + click, full-scale constant, prime-length frame). No corpus dependency, so the tests run on every CI checkout.
- Latency — per-frame encode/decode timing on real speech with a custom tracking allocator for peak-heap and per-frame allocation-count numbers; reports P50/P95/P99/max and asserts P99 < frame period so a real-time regression fails CI.
- MCU — decode → PCM mix → re-encode simulation on real speech for 2/3/5/8/16 participants (continuous speech) plus an 8-participant rotating dominant-speaker variant; asserts MCU egress ≤ SFU-fanout egress.
- Fuzz — libFuzzer targets for decoder robustness and encoder/decoder self-consistency on arbitrary bytes, seeded with dictionaries of the wire-format constants (sync word, field boundaries) and sample-magnitude boundaries (8/16/20/24-bit ceilings).
Measurements
Reference hardware
| Short name | CPU | ISA highlights |
|---|---|---|
| 7840HS | AMD Ryzen 7 7840HS (laptop, 8c/16t, up to 5.1 GHz) | AVX-512 (F/BW/CD/DQ/VL/VNNI/VBMI), BMI2, FMA |
| RPi5 | Raspberry Pi 5 (Cortex-A76 quad, 2.4 GHz) | NEON |
| VF2 | StarFive VisionFive 2 (SiFive U74 quad, 1.5 GHz) | RVV 0.7 (some LLVM autovec, less mature than x86 or NEON) |
Numbers below are measured at the default `cargo build --release` (no
`target-cpu=native`, no project-level `RUSTFLAGS`). Empty cells are
awaiting measurement on the listed hardware. FLAC comparison uses both
`-5` (the CLI default, what production pipelines typically use) and
`-8` (`--best`, the compression upper bound).
Corpus attribution
The measurements are taken on two publicly-licensed audio corpora
checked into corpus/:
- Speech: the AMI Meeting Corpus (files named `ES2002a.*`), recorded by the AMI Consortium (University of Edinburgh, IDIAP, TNO, Brno University of Technology, University of Sheffield, and partners). Distributed under CC BY 4.0.
- Music: Kimiko Ishizaka's recording of J.S. Bach's Goldberg Variations, BWV 988 (files named `Kimiko Ishizaka - …`), from the Open Goldberg Variations project (Robert Douglass, producer). Released under CC0 1.0 — public domain dedication, no attribution legally required, credited here as a courtesy.
Both corpora are used unmodified apart from the file selection described in the tables below.
Compression (hardware-independent, bit-exact across targets)
LAC ratio = LAC encoded / raw PCM. Both codecs use the same 4096-sample
block size on this corpus — LAC's `tests/corpus.rs` sets
`FRAME_SIZE = 4096`, which matches FLAC's default blocksize at `-5`
and `-8` for ≤ 48 kHz content, so header and coefficient overhead is
amortised identically on both sides.
| Corpus file | Class | LAC | FLAC -5 | FLAC -8 | LAC / -5 | LAC / -8 |
|---|---|---|---|---|---|---|
| `ES2002a.Headset-0.wav` | headset speech, 16 kHz / 16-bit | 0.178 | 0.187 | 0.186 | 0.954 | 0.958 |
| `ES2002a.Mix-Headset.wav` | mixed meeting, 16 kHz / 16-bit | 0.292 | 0.300 | 0.297 | 0.975 | 0.984 |
| `ES2002a.Array1-01.wav` | array speech, 16 kHz / 16-bit | 0.375 | 0.378 | 0.377 | 0.989 | 0.994 |
| Goldberg Aria (01) | solo piano, 96 kHz / 24-bit | 0.483 | 0.458 | 0.457 | 1.053 | 1.056 |
| Goldberg Variatio 4 (05, fughetta) | solo piano, 96 kHz / 24-bit | 0.514 | 0.483 | 0.481 | 1.065 | 1.067 |
| Goldberg Variatio 16 (17, Ouverture) | solo piano, 96 kHz / 24-bit | 0.512 | 0.479 | 0.478 | 1.068 | 1.070 |
Speech reliably beats FLAC at both levels by a small margin; music
trails by 5-7 % (the Q-format gap at low frequencies, mitigated but
not eliminated by coefficient_shift). FLAC's jump from -5 to -8
buys essentially nothing on this corpus (≤ 0.2 pp of ratio), so the
realistic LAC-vs-FLAC comparison in practice is against -5. Numbers
are byte-identical regardless of hardware because LAC's output is
specified bit-exactly.
Encode wall-clock (ms, full file)
One table per hardware target; each has LAC alongside both FLAC levels
so the speed cost of each quality point is visible. The -5 column is
the most representative real-world comparison.
7840HS (AMD Ryzen 7 7840HS):
| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---|---|---|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 1158 | 221 | 436 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 1292 | 226 | 447 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 1367 | 223 | 469 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 809 | 272 | 647 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 2126 | 754 | 1741 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 3521 | 1166 | 2703 |
RPi5 (Raspberry Pi 5, Cortex-A76 @ 2.4 GHz):
| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---|---|---|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 2856 | 477 | 959 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 3249 | 495 | 1096 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 3363 | 505 | 1132 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 1904 | 606 | 1570 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 5201 | 1627 | 4324 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 9015 | 2572 | 6832 |
VF2 (StarFive VisionFive 2, SiFive U74 quad @ 1.5 GHz):
| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---|---|---|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 29385 | 2355 | 5614 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 33231 | 2502 | 6688 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 34899 | 2548 | 6878 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 18185 | 3184 | 9278 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 49811 | 8535 | 25454 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 88208 | 13544 | 40650 |
LAC is ~5-6× slower than FLAC -5 and ~2-3× slower than FLAC --best
on x86 because libFLAC ships hand-tuned SSE intrinsics for its
autocorrelation kernel and LAC relies on LLVM autovectorization.
End-to-end perf barely changes with target-cpu=native: the kernel does
pick up AVX-512 zmm dot-products, but the frame encode is bottlenecked
elsewhere (Rice k-search and bitstream assembly dominate the remaining
time).
On RPi5 (ARM Cortex-A76, NEON) LAC runs ~2.5× slower than on 7840HS in
absolute terms, but the ratio against FLAC shifts noticeably: LAC is
~6-7× slower than FLAC -5 on speech (wider gap, libFLAC's NEON path
is well-tuned for the 16 kHz / 16-bit content) but only ~3× slower on
96 kHz / 24-bit music (narrower gap — 24-bit content gives libFLAC's
specialization less leverage). Against FLAC --best the music gap
narrows further to ~1.2-1.3×. The 7840HS-vs-RPi5 delta in the LAC
column shows scalar autovec quality is broadly comparable across x86
and ARM backends; the delta in the FLAC columns shows where hand-tuned
intrinsics disappear on a different ISA.
On VF2 (RISC-V SiFive U74, RVV 0.7 — not supported by mainline libFLAC
or LLVM autovec yet) LAC runs ~10× slower than on RPi5. Both codecs
fall back to pure scalar execution; the gap between them widens to
~12-13× on speech and ~6× on music vs FLAC -5, or ~5× / ~2× vs
FLAC --best. Two factors compound: the U74 is a single-issue
in-order core vs the Cortex-A76's dual-issue out-of-order (base IPC is
~2× lower at the ISA-agnostic level), and LLVM's scalar Rust codegen
for RISC-V is less mature than its x86/ARM output — tighter inner
loops in libFLAC's hand-written C survive this better than LAC's
Rust does. The absolute numbers are still useful: even at 88 s to
encode 5 minutes of 96/24 stereo audio, LAC comfortably meets
realtime for streaming use (see the P99 latency table below).
Per-frame encode latency P99 (µs)
All rows use real AMI speech samples. Frame sample count sets the real-time deadline; P99 must stay below that period for the frame to ship inside its own playback slot.
| Test | Frame | Period | 7840HS P99 | RPi5 P99 | VF2 P99 |
|---|---|---|---|---|---|
| `latency_headset_speech_160` | 160 @ 16 kHz | 10 ms | 20 | 38 | 235 |
| `latency_headset_speech_320` | 320 @ 16 kHz | 20 ms | 36 | 76 | 499 |
| `latency_headset_speech_480` | 480 @ 16 kHz | 30 ms | 37 | 81 | 635 |
| `latency_headset_speech_prime` | 503 @ 16 kHz | 31 ms | 23 | 52 | 387 |
| `latency_array_speech_320` | 320 @ 16 kHz | 20 ms | 42 | 77 | 506 |
| `latency_mixed_meeting_320` | 320 @ 16 kHz | 20 ms | 43 | 84 | 551 |
P99 headroom is ~400-1300× on 7840HS, ~130-600× on RPi5, and
~36-81× on VF2. Every row on every platform stays comfortably inside
the realtime deadline — even VF2's worst case (mixed_meeting_320 at
551 µs on a 20 ms frame) has 36× margin. LAC meets its streaming
contract on every target tested.
MCU throughput (× realtime on one core)
Realtime multiplier = audio-ms processed per wall-clock-ms, per core.
"20× realtime" means one core sustains twenty simultaneous meetings
of the listed configuration.
| Test | Activity | 7840HS | RPi5 | VF2 |
|---|---|---|---|---|
| `mcu_mix_1on1_voice` (P=2) | continuous | 279× | 145× | 22× |
| `mcu_mix_3people_voice` (P=3) | continuous | 193× | 95× | 14× |
| `mcu_mix_5people_voice` (P=5) | continuous | 120× | 57× | 9× |
| `mcu_mix_8people_voice` (P=8) | continuous | 77× | 35× | 5× |
| `mcu_mix_8people_dominant_speaker` (P=8) | rotating speaker | 106× | 43× | 6× |
| `mcu_mix_16people_voice` (P=16) | continuous | 39× | 17× | 2.5× |
MCU egress byte count as a fraction of SFU fanout egress on 7840HS: 1.00 (P=2, trivially equal), 0.60 (P=3), 0.36 (P=5), 0.22 (P=8 continuous), 0.35 (P=8 dominant-speaker), 0.10 (P=16).

The continuous case is the lower bound — SFU fanout scales quadratically in participant count while MCU mix egress scales linearly, so the relative savings grow as the meeting does. The dominant-speaker case inverts that trend slightly: SFU fanout of N−1 near-silent streams is almost free, so the SFU baseline falls faster than the MCU mix cost does. These numbers are byte-accounting, not wall-clock.