lac/README.md
Kamal Tufekcic 7862cb1d9d
initial commit
Signed-off-by: Kamal Tufekcic <kamal@lo.sh>
2026-04-23 14:58:32 +03:00


LAC — Lo Audio Codec

Lossless audio codec for internal use. Target compression is FLAC-class (~50% of raw). Integer-only, bit-exact, streaming-oriented.

Scope

  • Input: signed integer PCM passed as i32 with |sample| ≤ 2²³ − 1. 8-bit, 16-bit, 20-bit, and 24-bit sources are all valid without conversion: they compress at the bit cost of their actual values, not a 24-bit ceiling.
  • Sample rate: caller-specified; not encoded in the stream. The container or transport carries it.
  • Channels: mono per encoded stream. Stereo is two independent mono streams — for example, two QUIC streams over a shared connection, one per channel. No cross-channel joint coding.
  • Frames: independently decodable. No cross-frame state; a lost or corrupt frame never affects subsequent decodes.
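Because each encoded stream is mono, an interleaved stereo source has to be split before encoding. A minimal sketch, assuming L/R interleaving; `split_stereo` is a hypothetical helper for illustration and not part of the crate:

```rust
/// Splits interleaved stereo PCM (L, R, L, R, ...) into two mono buffers,
/// one per encoded stream. Hypothetical helper, not part of the `lac` API.
/// A trailing unpaired sample is dropped.
fn split_stereo(interleaved: &[i32]) -> (Vec<i32>, Vec<i32>) {
    let mut left = Vec::with_capacity(interleaved.len() / 2);
    let mut right = Vec::with_capacity(interleaved.len() / 2);
    for pair in interleaved.chunks_exact(2) {
        left.push(pair[0]);
        right.push(pair[1]);
    }
    (left, right)
}
```

Each returned buffer would then be framed and encoded on its own stream.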

Pipeline

samples  →  LPC analysis  →  residuals  →  partitioned Rice  →  frame bytes
                                                ↓
                                       (inverse for decode)

Three encoder-side choices, searched per frame:

  • LPC order: the reference encoder tries a sparse grid {0, 2, 4, 6, 8, 10, 12, 16, 20, 24, 28, 32} with a 2-order early-out once cost stops improving. Order 0 is verbatim (residuals equal the raw samples). The wire format permits any order in [0, 32].
  • Coefficient shift ∈ [0, 5]: widens the Q-format of the stored predictor coefficients from Q15 (range [−1, 1)) out to Q10 (range [−32, 32)) so low-frequency / narrow-resonance content doesn't clamp |a[1]| near 2. Chosen deterministically per order as the smallest shift that avoids clamping.
  • Rice partition order ∈ [0, 7]: splits the residual stream into 2^partition_order equal partitions, each with its own Rice parameter k ∈ [0, 23] chosen by convex descent.

Levinson-Durbin runs once up to order 32 into a flat stack-allocated buffer (LpcLevels) and the per-order coefficients are consulted by slice reference; the order search itself does no heap allocation.
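To make the Rice parameter choice concrete, the sketch below computes the bit cost of one partition for a given k and picks the cheapest k. It is an illustration under stated assumptions, not the crate's code: it uses the textbook zigzag mapping and Rice codeword length, searches k exhaustively instead of by convex descent, and ignores any cap on the unary run. The function names are hypothetical.

```rust
/// Maps a signed residual to an unsigned value: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
fn zigzag(r: i32) -> u32 {
    ((r << 1) ^ (r >> 31)) as u32
}

/// Bits needed to Rice-code one partition with parameter `k`:
/// a unary quotient, one terminator bit, and `k` remainder bits per residual.
fn rice_cost_bits(residuals: &[i32], k: u32) -> u64 {
    residuals
        .iter()
        .map(|&r| u64::from(zigzag(r) >> k) + 1 + u64::from(k))
        .sum()
}

/// Exhaustive search over k in [0, 23]; ties go to the smallest k.
/// (A stand-in for the convex-descent search, which should find the same
/// minimum when the cost is convex in k.)
fn best_k(residuals: &[i32]) -> u32 {
    (0..=23u32)
        .min_by_key(|&k| rice_cost_bits(residuals, k))
        .unwrap()
}
```

For an all-zero partition the cheapest choice is k = 0 (one terminator bit per residual); larger residual magnitudes push the optimum toward larger k.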

Intended use

  • QUIC streaming — one reliable stream per audio channel. Frames fit the per-stream framing (length-prefixed or datagram-mapped) without modification.
  • Offline file playback — a container pairs the channel streams by timestamp; each stream decodes independently.

Frame size guidance

Frame size is a latency-vs-compression knob chosen at the application layer. The codec accepts any frame_sample_count in [1, 65535], but the LPC/Rice search amortises better on larger frames (shared header, more samples per fitted coefficient vector). Concrete defaults:

| Use case | Frame size | Frame period | Notes |
|---|---|---|---|
| Real-time voice, tight latency | 160 @ 16 kHz | 10 ms | matches WebRTC/Opus 10 ms mode |
| Real-time voice, balanced | 320 @ 16 kHz | 20 ms | default for MCU workload in tests/mcu_mix.rs |
| Game/conf streaming | 960 @ 48 kHz | 20 ms | one QUIC datagram per frame fits typical MTUs |
| Music streaming | 2048 @ 48 kHz | 43 ms | compression benefit flattens past this |
| Offline archival | 4096 @ 48 kHz | 85 ms | tightest LPC fit; default in tests/corpus.rs, matches FLAC's default blocksize for apples-to-apples compression comparison |

The partition orders available to the search are those whose partition count 2^partition_order evenly divides the frame size. Power-of-two frame sizes (256, 512, 1024, 2048, 4096) unlock every partition_order ∈ [0, 7]; 960 and 2880 (common WebRTC frame sizes) are both divisible by 2⁶ but not 2⁷, so they allow orders up to 6; prime sizes like 137 collapse to partition_order = 0. Prefer power-of-two frame sizes unless a container format constrains the choice.
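The divisibility rule can be checked quickly for any candidate frame size. A small sketch, assuming divisibility by 2^partition_order is the only constraint on the partition order; `max_partition_order` is a hypothetical helper, not a crate function:

```rust
/// Largest partition order p in [0, 7] such that 2^p divides `frame_len`.
/// Assumes divisibility is the only constraint on the partition order.
fn max_partition_order(frame_len: u32) -> u32 {
    assert!(frame_len > 0, "frame length must be at least 1");
    // The number of trailing zero bits is the largest power of two that
    // divides the length; the format caps the order at 7.
    frame_len.trailing_zeros().min(7)
}
```

With this helper, 4096 gives 7, 960 gives 6, and 137 gives 0.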

Structure

lac/
├── Cargo.toml
├── README.md                ← you are here
├── Specification.md         ← wire format specification
├── corpus/                  ← test WAVs (speech + music), LFS-tracked via .gitattributes
├── src/
│   ├── lib.rs               ← public API and project-wide constants
│   ├── bit_io.rs            ← MSB-first bit reader/writer
│   ├── lpc.rs               ← Levinson-Durbin, LpcLevels flat buffer, residuals/synthesis
│   ├── rice.rs              ← zigzag + partitioned Rice coding, convex-descent k
│   ├── frame.rs             ← frame header, encode_frame, decode_frame
│   └── test_signals.rs      ← integer-only sine LUT for float-free test inputs
├── tests/
│   ├── corpus.rs            ← compression ratio + FLAC comparison on real audio
│   ├── synthetic.rs         ← bit-depth + pathological-content round-trips, no corpus needed
│   ├── latency.rs           ← P50/P95/P99/max encode+decode latency, peak heap, alloc count
│   └── mcu_mix.rs           ← end-to-end MCU workload (decode → mix → re-encode)
├── benches/
│   ├── codec.rs             ← nightly #[bench] harness (encode, decode, compute_residuals)
│   └── compare-flac.sh      ← diagnostic shell script: wall-clock flac encode across corpus
└── fuzz/
    ├── fuzz_targets/
    │   ├── decode_arbitrary.rs     ← decoder robustness under arbitrary bytes
    │   └── roundtrip_arbitrary.rs  ← encoder/decoder self-consistency
    └── dict/
        ├── decode_arbitrary.dict   ← libFuzzer dict: sync word + field boundary constants
        └── roundtrip_arbitrary.dict ← libFuzzer dict: sample-value boundaries

See Specification.md for the normative wire format.

Public API

Every sample is an i32 with magnitude bounded by 2²³ − 1. Narrower integer sources go through unchanged:

use lac::{encode_frame, decode_frame};

// 16-bit microphone PCM → just widen with `i32::from`. Do NOT shift
// left by 8 to "align" to 24-bit: that multiplies residual magnitudes
// by 256 and costs 8 extra bits per residual in the Rice payload. The
// codec compresses at the bit cost of the actual sample magnitudes,
// not a 24-bit ceiling.
let pcm_16: Vec<i16> = /* from microphone */ Vec::new();
let samples: Vec<i32> = pcm_16.iter().map(|&s| i32::from(s)).collect();

let bytes = encode_frame(&samples);
let recovered: Vec<i32> = decode_frame(&bytes)?;
assert_eq!(recovered, samples);
# Ok::<(), lac::DecodeError>(())

For 24-bit PCM, samples are already in range — pass through directly. For 8-bit PCM, i32::from(s as i8) (signed) or the equivalent from your unsigned-offset-128 source.
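The two less obvious widenings can be sketched as follows. These are illustrative helpers with hypothetical names, assuming WAV-style unsigned 8-bit PCM (silence at 128) and packed little-endian 24-bit PCM; neither is part of the crate:

```rust
/// Unsigned 8-bit PCM (silence at 128) to signed i32 samples.
fn from_u8_offset(pcm: &[u8]) -> Vec<i32> {
    pcm.iter().map(|&b| i32::from(b) - 128).collect()
}

/// Packed little-endian signed 24-bit PCM (3 bytes per sample) to i32.
/// Trailing bytes that do not form a full sample are ignored.
fn from_s24le(bytes: &[u8]) -> Vec<i32> {
    bytes
        .chunks_exact(3)
        // Place the 3 bytes in the top of an i32, then shift back down so
        // the arithmetic shift sign-extends the 24-bit value.
        .map(|b| i32::from_le_bytes([0, b[0], b[1], b[2]]) >> 8)
        .collect()
}
```

In both cases the values keep their native magnitude; nothing is scaled up to fill 24 bits.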

Round-trip is bit-exact: decode_frame(encode_frame(s)) == s for every valid s.

Buffer-reusing API for hot loops

For the MCU re-encode fanout and QUIC senders that own a per-channel scratch buffer, use encode_frame_into / decode_frame_into to target a caller-owned Vec<u8> / Vec<i32> instead of allocating fresh on each call:

use lac::{encode_frame_into, decode_frame_into};

let mut encoded = Vec::new();  // one buffer per channel, reused across frames
let mut decoded = Vec::new();

for frame_samples in frames_iter() {
    encode_frame_into(&frame_samples, &mut encoded);
    // … send `encoded` …
}

for incoming_bytes in incoming_iter() {
    decode_frame_into(&incoming_bytes, &mut decoded)?;
    // … consume `decoded` …
}
# fn frames_iter() -> impl Iterator<Item = Vec<i32>> { std::iter::empty() }
# fn incoming_iter() -> impl Iterator<Item = Vec<u8>> { std::iter::empty() }
# Ok::<(), lac::DecodeError>(())

Both _into variants clear the destination at entry and retain its capacity, so steady-state usage makes zero allocations past the first frame.

Output size expectations

For realistic audio (speech, music, ambient), compressed frames land around 15-55 % of raw sample bytes (speech near the low end, music near the high end). Callers reusing a scratch buffer can safely preallocate to 1× raw and take the extension cost only on the rare adversarial frame.

For untrusted input — payloads where residuals might be crafted to maximise Rice output — the worst-case expansion bound is ~17× raw: at the Rice k = 23 ceiling, each codeword is up to 535 bits (511 unary zeros + terminator + 23 remainder), or ~67 bytes per residual. A pipeline that must pre-size a bounded output buffer for arbitrary input can use samples.len() * 68 bytes as a loose upper bound. The encoder never exceeds this.
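The two sizing rules above translate into a pair of one-liners. A sketch with hypothetical helper names, assuming "raw" means 4 bytes per sample (the i32 representation, which is what makes 68 bytes per sample come out at ~17× raw):

```rust
/// Loose upper bound on encoded size for arbitrary input, using the
/// 68-bytes-per-sample figure. Saturates instead of overflowing.
fn worst_case_encoded_len(sample_count: usize) -> usize {
    sample_count.saturating_mul(68)
}

/// Starting capacity for a reusable scratch buffer on realistic audio:
/// 1x the raw size, counting each sample as a 4-byte i32 (an assumption).
fn typical_scratch_capacity(sample_count: usize) -> usize {
    sample_count.saturating_mul(std::mem::size_of::<i32>())
}
```

For a 320-sample frame this gives 1280 bytes as the everyday capacity and 21 760 bytes as the bound for hostile input.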

Error recovery

On decode failure the caller substitutes frame_sample_count zeros (silence) for the frame period. The count is recoverable from the frame itself as long as the header parsed, even if the bitstream body then failed — call parse_header on the same buffer:

use lac::{decode_frame, parse_header};

const SESSION_DEFAULT_FRAME: usize = 320;  // negotiated at session start

let bytes = Vec::<u8>::new();
let samples = match decode_frame(&bytes) {
    Ok(s) => s,
    Err(_) => {
        let count = parse_header(&bytes)
            .map(|(h, _)| h.frame_sample_count as usize)
            .unwrap_or(SESSION_DEFAULT_FRAME);
        vec![0i32; count]
    }
};

When the header itself fails (BadSyncWord, InvalidPredictionOrder, InvalidPartitionOrder, InvalidCoefficientShift, or Truncated below 7 bytes), the frame length is unknowable and the caller must fall back to a session-level default.

Concurrency

LAC's encode and decode APIs are pure functions with no shared state — no globals, no internal Mutex, no unsafe. All public types are Send + Sync. Calls on different threads never contend with each other, and each call's scratch buffers are owned (stack or the caller-supplied Vec).

The intended deployment shape for multi-channel and multi-stream workloads is one thread or task per channel. The codec itself does no threading: scheduling is left to the application so it can pick whichever executor fits (tokio for async servers, rayon for data-parallel workloads, std::thread for straight-ahead concurrency).

MCU re-encode fanout with stdlib primitives only:

use std::thread;
use lac::encode_frame;

let mixes: Vec<Vec<i32>> = Vec::new();
let outgoing: Vec<Vec<u8>> = thread::scope(|s| {
    let handles: Vec<_> = mixes
        .iter()
        .map(|mix| s.spawn(move || encode_frame(mix)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
});

Or with rayon, if the project already pulls it in:

// use rayon::prelude::*;
// let outgoing: Vec<Vec<u8>> = mixes.par_iter().map(|m| encode_frame(m)).collect();

The allocator you link against sets the ceiling on multi-core scaling: glibc malloc has measurable lock contention at tens of cores, whereas mimalloc / jemalloc keep per-thread caches and scale further. The codec itself doesn't care which one you pick; it allocates through the global allocator like any other Rust library.

Input-size caps on untrusted channels

Applications accepting LAC frames from untrusted peers should cap the per-frame input size at the application layer. The decoder's per-codeword unary-run bound (spec §4.2) prevents any single codeword from consuming unbounded CPU, but total decode cost scales with buffer length; an attacker handed an unbounded payload can force proportional scan work. Typical real frames are sub-kilobyte; a cap of 64 KB per frame is comfortably above any legitimate LAC payload and cheap to enforce at the framing layer (QUIC stream length field, length-prefixed framing, etc.). The Truncated error fires naturally when a payload is cut, so a hard cap doesn't break legal traffic — it just bounds pathological work.
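A cap like this is a single length check in front of the decoder. A minimal sketch, assuming 64 KB means 64 × 1024 bytes; the constant and function names are hypothetical and live in application code, not in the crate:

```rust
/// Per-frame cap suggested above: 64 KB, taken here as 64 * 1024 bytes.
const MAX_FRAME_BYTES: usize = 64 * 1024;

/// Returns the payload unchanged if it is within the cap, or `None` if it
/// should be dropped before it ever reaches the decoder.
fn within_frame_cap(payload: &[u8]) -> Option<&[u8]> {
    (payload.len() <= MAX_FRAME_BYTES).then_some(payload)
}
```

A receiver would call this on each framed payload and treat `None` the same way as a decode failure.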

Silence-substitution amplification

Spec §6.1 mandates that callers substitute frame_sample_count zeros on decode failure. An attacker can craft a tiny frame (~10-byte header with frame_sample_count = 65535) whose Rice payload is malformed; the decoder rejects, the caller dutifully emits 65 535 output samples of silence. At 48 kHz mono i32, that's ~256 KB of zeros per ~10-byte input frame — a ~25 000× amplification.

The output is silence, not attacker-chosen data, so this is a downstream-resource-exhaustion vector (memory, bandwidth, re-encode work at an MCU) rather than a data-injection vector. Mitigation is at the application layer: cap frame_sample_count to the session's negotiated frame size before invoking the silence substitution. QUIC / WebRTC sessions already negotiate a frame size at setup; using that as a hard upper bound on the silence-fill length collapses the amplification ratio to 1×. An MCU that reads parse_header(&data).frame_sample_count without validating it against the session cap inherits the amplification unchanged.
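The mitigation amounts to clamping the header's count before allocating the silence. A sketch under the assumption that the application already has the negotiated frame size at hand; `silence_fill_len` is a hypothetical helper:

```rust
/// Length of the silence fill for a failed frame.
/// `header_count` is the frame_sample_count read from the header when the
/// header parsed, `None` otherwise; `session_cap` is the frame size
/// negotiated for the session.
fn silence_fill_len(header_count: Option<usize>, session_cap: usize) -> usize {
    match header_count {
        // Trust the header only up to the negotiated size.
        Some(n) => n.min(session_cap),
        // Header unreadable: fall back to the session default.
        None => session_cap,
    }
}
```

With a 320-sample session, a hostile header claiming 65 535 samples yields a 320-sample fill.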

Packet loss & concealment

Frames are independently decodable: losing one frame never corrupts another, regardless of which concealment strategy the application picks. This is a genuine deployment asset on lossy transports (QUIC datagrams, UDP), and the section below walks the plausible strategies in increasing quality order.

Strategy 1: silence substitution (the default)

The baseline decode_frame returns Err on structural failure; the application substitutes frame_sample_count zeros for the lost frame period (see parse_header recovery pattern under Public API → Error recovery). Fast, deterministic, audible as a brief cut — acceptable for voice up to ~20 ms of loss, jarring beyond that.

Strategy 2: sample-and-hold

Repeat the last successfully decoded sample for the frame period. Zero-cost on the decoder side, preserves DC level so the click at the drop boundary is softer than silence. Quality at 20 ms of loss is better than silence for voice, slightly worse for music (DC hold on a non-stationary signal adds a small transient when the next frame arrives).

// After a successful decode, store the last sample for reuse on loss.
// On loss: fill the gap with that value.
# fn last_decoded_sample() -> i32 { 0 }
# const N: usize = 320;
let conceal = vec![last_decoded_sample(); N];

Strategy 3: linear fade

Interpolate from the last valid sample down to zero over the lost frame period. Removes both the DC-hold transient and the "cut to silence" click. Costs N integer adds per lost frame. Recommended baseline for any application that can afford 2-5 lines of PLC code.
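One way to write the fade, as a sketch rather than a tuned implementation: it uses a multiply and divide per sample for clarity instead of the add-only incremental form, and `linear_fade` is a hypothetical name.

```rust
/// Fills a lost frame of `n` samples with a linear ramp from (just below)
/// `last` down to exactly zero on the final sample.
fn linear_fade(last: i32, n: usize) -> Vec<i32> {
    let n64 = n as i64;
    (0..n64)
        // Weight (n - 1 - i) / n; computed in i64 so the product cannot
        // overflow. The result never exceeds |last|, so the cast is safe.
        .map(|i| (i64::from(last) * (n64 - 1 - i) / n64) as i32)
        .collect()
}
```

For example, fading from 100 over 4 samples produces 75, 50, 25, 0.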

Strategy 4: LPC-coefficient extrapolation

The last successfully decoded frame's AudioFrameHeader carries the LPC coefficients the encoder chose — available from parse_header at no extra cost — and the LPC filter is locally stationary over a 20-40 ms horizon. Run the synthesis formula (§3.6 of Specification.md) forward from the last decoded samples to predict the missing frame. Quality is best on pitched content (voiced speech, sustained notes); on transients it degrades gracefully because the predictor's autoregressive behaviour damps toward zero over the frame.

Not built into the library — the math is straightforward and the "right" tuning varies by deployment (how much damping, whether to blend with sample-and-hold on transients, etc.). See src/lpc.rs's lpc_synthesize_into for the integer synthesis routine that a PLC implementation would call.

Multi-frame loss guidance

The strategies above are only useful up to a handful of consecutive lost frames. Rough thresholds at 20 ms frame periods:

| Consecutive lost frames | Effective loss | Verdict |
|---|---|---|
| 1 | 20 ms | Inaudible with fade or LPC extrapolation; brief click with silence or sample-and-hold |
| 2-3 | 40-60 ms | Noticeable glitch; LPC extrapolation minimises but cannot hide it |
| 4-10 | 80-200 ms | Audible dropout. PLC keeps the audio from sounding "broken" but doesn't restore content |
| > 10 | > 200 ms | Treat the stream as broken; reset the receiver's concealment state to avoid droning artifacts, and if possible ask the transport to signal "resync" upstream |

Mid-stream resync on a datagram transport uses the sync word (0x1ACC) as an alignment anchor: on a string of bad frames, search the next N bytes of the buffer for the big-endian sequence \x1a\xcc and retry parse_header from each candidate offset until one succeeds. The search is O(N); on a 20 ms frame at 48 kHz there are at most ~180 bytes per frame to scan, so amortised cost is negligible.
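The scan itself needs nothing from the codec. A sketch of the candidate search only; `sync_candidates` is a hypothetical helper, and each returned offset still has to be validated by retrying header parsing there:

```rust
/// Big-endian byte sequence of the sync word 0x1ACC.
const SYNC: [u8; 2] = [0x1A, 0xCC];

/// Returns every offset in `buf` where the sync word appears. Each offset
/// is only a candidate: the same two bytes can occur inside a payload, so
/// the caller still has to attempt a header parse at each one.
fn sync_candidates(buf: &[u8]) -> Vec<usize> {
    buf.windows(2)
        .enumerate()
        .filter(|(_, w)| w[0] == SYNC[0] && w[1] == SYNC[1])
        .map(|(i, _)| i)
        .collect()
}
```

The receiver would walk the returned offsets in order and resume decoding at the first one where the header parses.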

Testing

cargo test                                                           # unit tests
cargo test --test corpus    --release -- --nocapture                 # compression vs FLAC, lac_enc_ms
cargo test --test synthetic --release -- --nocapture                 # bit-depth + pathological content
cargo test --test latency   --release -- --nocapture --test-threads=1  # p50/p95/p99 + alloc count
cargo test --test mcu_mix   --release -- --nocapture --test-threads=1  # MCU throughput
cargo test --test conformance --release -- --nocapture               # byte-level spec conformance
cargo test --test determinism --release                              # encode byte-equality on repeat
cargo fuzz run decode_arbitrary    -- -dict=dict/decode_arbitrary.dict
cargo fuzz run roundtrip_arbitrary -- -dict=dict/roundtrip_arbitrary.dict
cargo bench                                                          # nightly bench
benches/compare-flac.sh                                              # flac side of the speed table

Published-crate caveat. Cargo.toml excludes corpus/* and fuzz/* from the published tarball — they'd blow up crate size and the audio isn't redistributable under crates.io's constraints anyway. A user running cargo test against a cargo add lac'd dependency sees every corpus test pass because the require_corpus! macro skips missing files silently; the compression-ratio assertions, FLAC comparisons, latency P99 checks, and MCU throughput checks all go unrun. The full regression suite requires the git repository (with LFS pulled). The synthetic, conformance, determinism, and unit tests run unchanged from either source.

Coverage at a glance:

  • Unit — round-trips for every LPC order 0-32 and every partition order 0-7, prime frame lengths that force partition_order = 0, all-zero frames, full-scale sample magnitudes, malformed-header rejection for every field (sync_word, prediction_order, partition_order, coefficient_shift), truncated bitstreams, and a convex-descent vs exhaustive-search select_k differential.
  • Corpus — round-trip + compression-ratio + FLAC subprocess comparison on a mixed speech and music corpus; asserts ratio ceilings so a codec regression fails CI; prints LAC encode wall-clock for correlation against benches/compare-flac.sh.
  • Synthetic — deterministic LFSR-driven round-trips at 8/16/20/24-bit source widths and pathological content (all-zero, DC offset, Nyquist square, silence + click, full-scale constant, prime-length frame). No corpus dependency so the tests run on every CI checkout.
  • Latency — per-frame encode/decode timing on real speech with a custom tracking allocator for peak-heap and per-frame allocation-count numbers; reports P50/P95/P99/max and asserts P99 < frame period so a real-time regression fails CI.
  • MCU — decode → PCM mix → re-encode simulation on real speech for 2/3/5/8/16 participants (continuous speech) plus an 8-participant rotating dominant-speaker variant; asserts MCU egress ≤ SFU-fanout egress.
  • Fuzz — libFuzzer targets for decoder robustness and encoder/decoder self-consistency on arbitrary bytes, seeded with dictionaries of the wire-format constants (sync word, field boundaries) and sample-magnitude boundaries (8/16/20/24-bit ceilings).

Measurements

Reference hardware

| Short name | CPU | ISA highlights |
|---|---|---|
| 7840HS | AMD Ryzen 7 7840HS (laptop, 8c/16t, up to 5.1 GHz) | AVX-512 (F/BW/CD/DQ/VL/VNNI/VBMI), BMI2, FMA |
| RPi5 | Raspberry Pi 5 (Cortex-A76 quad, 2.4 GHz) | NEON |
| VF2 | StarFive VisionFive 2 (SiFive U74 quad, 1.5 GHz) | RVV 0.7 (some LLVM autovec, less mature than x86 or NEON) |

Numbers below are measured at default cargo build --release (no target-cpu=native, no project-level RUSTFLAGS). Empty cells are awaiting measurement on the listed hardware. FLAC comparison uses both -5 (the CLI default, what production pipelines typically use) and -8 (--best, the compression upper bound).

Corpus attribution

The measurements are taken on two publicly-licensed audio corpora checked into corpus/:

  • Speech: the AMI Meeting Corpus (files named ES2002a.*), recorded by the AMI Consortium (University of Edinburgh, IDIAP, TNO, Brno University of Technology, University of Sheffield, and partners). Distributed under CC BY 4.0.
  • Music: Kimiko Ishizaka's recording of J.S. Bach's Goldberg Variations, BWV 988 (files named Kimiko Ishizaka - …), from the Open Goldberg Variations project (Robert Douglass, producer). Released under CC0 1.0 — public domain dedication, no attribution legally required, credited here as a courtesy.

Both corpora are used unmodified apart from the file selection described in the tables below.

Compression (hardware-independent, bit-exact across targets)

LAC ratio = LAC encoded / raw PCM. Both codecs use the same 4096-sample block size on this corpus — LAC's tests/corpus.rs sets FRAME_SIZE = 4096, which matches FLAC's default blocksize at -5 and -8 for ≤ 48 kHz content, so header and coefficient overhead is amortised identically on both sides.

| Corpus file | Class | LAC | FLAC -5 | FLAC -8 | LAC / -5 | LAC / -8 |
|---|---|---|---|---|---|---|
| ES2002a.Headset-0.wav | headset speech, 16 kHz / 16-bit | 0.178 | 0.187 | 0.186 | 0.954 | 0.958 |
| ES2002a.Mix-Headset.wav | mixed meeting, 16 kHz / 16-bit | 0.292 | 0.300 | 0.297 | 0.975 | 0.984 |
| ES2002a.Array1-01.wav | array speech, 16 kHz / 16-bit | 0.375 | 0.378 | 0.377 | 0.989 | 0.994 |
| Goldberg Aria (01) | solo piano, 96 kHz / 24-bit | 0.483 | 0.458 | 0.457 | 1.053 | 1.056 |
| Goldberg Variatio 4 (05, fughetta) | solo piano, 96 kHz / 24-bit | 0.514 | 0.483 | 0.481 | 1.065 | 1.067 |
| Goldberg Variatio 16 (17, Ouverture) | solo piano, 96 kHz / 24-bit | 0.512 | 0.479 | 0.478 | 1.068 | 1.070 |

Speech reliably beats FLAC at both levels by a small margin; music trails by 5-7 % (the Q-format gap at low frequencies, mitigated but not eliminated by coefficient_shift). FLAC's jump from -5 to -8 buys essentially nothing on this corpus (≤ 0.3 pp of ratio in the table above), so the realistic LAC-vs-FLAC comparison in practice is against -5. Numbers are byte-identical regardless of hardware because LAC's output is specified bit-exactly.

Encode wall-clock (ms, full file)

One table per hardware target; each has LAC alongside both FLAC levels so the speed cost of each quality point is visible. The -5 column is the most representative real-world comparison.

7840HS (AMD Ryzen 7 7840HS):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---|---|---|
| ES2002a.Headset-0.wav | ~42 min, 16 kHz / 16-bit | 1158 | 221 | 436 |
| ES2002a.Array1-01.wav | ~42 min, 16 kHz / 16-bit | 1292 | 226 | 447 |
| ES2002a.Mix-Headset.wav | ~42 min, 16 kHz / 16-bit | 1367 | 223 | 469 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 809 | 272 | 647 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 2126 | 754 | 1741 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 3521 | 1166 | 2703 |

RPi5 (Raspberry Pi 5, Cortex-A76 @ 2.4 GHz):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---|---|---|
| ES2002a.Headset-0.wav | ~42 min, 16 kHz / 16-bit | 2856 | 477 | 959 |
| ES2002a.Array1-01.wav | ~42 min, 16 kHz / 16-bit | 3249 | 495 | 1096 |
| ES2002a.Mix-Headset.wav | ~42 min, 16 kHz / 16-bit | 3363 | 505 | 1132 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 1904 | 606 | 1570 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 5201 | 1627 | 4324 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 9015 | 2572 | 6832 |

VF2 (StarFive VisionFive 2, SiFive U74 quad @ 1.5 GHz):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---|---|---|
| ES2002a.Headset-0.wav | ~42 min, 16 kHz / 16-bit | 29385 | 2355 | 5614 |
| ES2002a.Array1-01.wav | ~42 min, 16 kHz / 16-bit | 33231 | 2502 | 6688 |
| ES2002a.Mix-Headset.wav | ~42 min, 16 kHz / 16-bit | 34899 | 2548 | 6878 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 18185 | 3184 | 9278 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 49811 | 8535 | 25454 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 88208 | 13544 | 40650 |

On x86, LAC is ~5-6× slower than FLAC -5 and ~2.5-3× slower than FLAC --best on speech, and ~3× and ~1.2-1.3× slower respectively on music, because libFLAC ships hand-tuned SSE intrinsics for its autocorrelation kernel and LAC relies on LLVM autovectorization. End-to-end perf barely changes with target-cpu=native: the kernel does pick up AVX-512 zmm dot-products, but the frame encode is bottlenecked elsewhere (Rice k-search and bitstream assembly dominate the remaining time).

On RPi5 (ARM Cortex-A76, NEON) LAC runs ~2.5× slower than on 7840HS in absolute terms, but the ratio against FLAC shifts noticeably: LAC is ~6-7× slower than FLAC -5 on speech (wider gap, libFLAC's NEON path is well-tuned for the 16 kHz / 16-bit content) but only ~3× slower on 96 kHz / 24-bit music (narrower gap — 24-bit content gives libFLAC's specialization less leverage). Against FLAC --best the music gap narrows further to ~1.2-1.3×. The 7840HS-vs-RPi5 delta in the LAC column shows scalar autovec quality is broadly comparable across x86 and ARM backends; the delta in the FLAC columns shows where hand-tuned intrinsics disappear on a different ISA.

On VF2 (RISC-V SiFive U74, RVV 0.7, not supported by mainline libFLAC or LLVM autovec yet) LAC runs ~10× slower than on RPi5. Both codecs fall back to pure scalar execution; the gap between them widens to ~12-14× on speech and ~6× on music vs FLAC -5, or ~5× / ~2× vs FLAC --best. Two factors compound: the U74 is a dual-issue in-order core vs the Cortex-A76's 4-wide out-of-order design (base IPC is ~2× lower at the ISA-agnostic level), and LLVM's scalar Rust codegen for RISC-V is less mature than its x86/ARM output, so the tighter inner loops in libFLAC's hand-written C survive this better than LAC's Rust does. The absolute numbers are still useful: even at 88 s to encode 5 minutes of 96/24 stereo audio, LAC comfortably meets realtime for streaming use (see the P99 latency table below).

Per-frame encode latency P99 (µs)

All rows use real AMI speech samples. Frame sample count sets the real-time deadline; P99 must stay below that period for the frame to ship inside its own playback slot.

| Test | Frame | Period | 7840HS P99 | RPi5 P99 | VF2 P99 |
|---|---|---|---|---|---|
| latency_headset_speech_160 | 160 @ 16 kHz | 10 ms | 20 | 38 | 235 |
| latency_headset_speech_320 | 320 @ 16 kHz | 20 ms | 36 | 76 | 499 |
| latency_headset_speech_480 | 480 @ 16 kHz | 30 ms | 37 | 81 | 635 |
| latency_headset_speech_prime | 503 @ 16 kHz | 31 ms | 23 | 52 | 387 |
| latency_array_speech_320 | 320 @ 16 kHz | 20 ms | 42 | 77 | 506 |
| latency_mixed_meeting_320 | 320 @ 16 kHz | 20 ms | 43 | 84 | 551 |

P99 headroom (frame period divided by P99 encode time) is ~460-1350× on 7840HS, ~240-600× on RPi5, and ~36-81× on VF2. Every row on every platform stays comfortably inside the realtime deadline: even VF2's worst case (mixed_meeting_320 at 551 µs on a 20 ms frame) has 36× margin. LAC meets its streaming contract on every target tested.
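The headroom figures are the frame period divided by the measured P99. A small sketch of that arithmetic, with `headroom` as a hypothetical helper name:

```rust
/// Real-time headroom: how many times the P99 encode time fits into the
/// frame period. Values above 1.0 mean the deadline is met.
fn headroom(frame_samples: u32, sample_rate_hz: u32, p99_us: f64) -> f64 {
    let period_us = f64::from(frame_samples) * 1_000_000.0 / f64::from(sample_rate_hz);
    period_us / p99_us
}
```

For the VF2 worst case, `headroom(320, 16_000, 551.0)` comes out at roughly 36.3; for the 160-sample row on 7840HS, `headroom(160, 16_000, 20.0)` is 500.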

MCU throughput (× realtime on one core)

Realtime multiplier = audio-ms processed per wall-clock-ms, per core. "20× realtime" means one core sustains twenty simultaneous meetings of the listed configuration.

| Test | Activity | 7840HS | RPi5 | VF2 |
|---|---|---|---|---|
| mcu_mix_1on1_voice (P=2) | continuous | 279× | 145× | 22× |
| mcu_mix_3people_voice (P=3) | continuous | 193× | 95× | 14× |
| mcu_mix_5people_voice (P=5) | continuous | 120× | 57× | 9× |
| mcu_mix_8people_voice (P=8) | continuous | 77× | 35× | 5× |
| mcu_mix_8people_dominant_speaker (P=8) | rotating speaker | 106× | 43× | 6× |
| mcu_mix_16people_voice (P=16) | continuous | 39× | 17× | 2.5× |

MCU egress byte count as a fraction of SFU fanout egress on 7840HS: 1.00 (P=2, trivially equal), 0.60 (P=3), 0.36 (P=5), 0.22 (P=8 continuous), 0.35 (P=8 dominant-speaker), 0.10 (P=16). The continuous case is the lower bound — SFU fanout scales quadratically in participant count while MCU mix egress scales linearly, so the relative savings grow as the meeting does. The dominant-speaker case inverts that trend slightly: SFU fanout of N-1 near-silent streams is almost free, so the SFU baseline falls faster than the MCU mix cost does. These numbers are byte-accounting, not wall-clock.
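The quadratic-versus-linear scaling can be seen from stream counts alone. This is a sketch of the counting argument, not of the test's byte accounting; the function names are hypothetical, and it assumes the SFU forwards every other participant's stream to each receiver while the MCU sends each receiver one mixed stream.

```rust
/// Outgoing streams per frame period under SFU fanout: each of the `p`
/// participants receives the other `p - 1` streams unmixed.
fn sfu_egress_streams(p: u64) -> u64 {
    p * p.saturating_sub(1)
}

/// Outgoing streams per frame period under MCU mixing: each participant
/// receives a single mixed stream.
fn mcu_egress_streams(p: u64) -> u64 {
    p
}
```

For P = 2 both counts are 2, matching the trivially equal case; for P = 8 the counts are 56 and 8. The measured byte fractions sit above the pure stream-count ratio of 1/(P − 1), presumably because a mix of several voices compresses less well than a single voice.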