# LAC — Lo Audio Codec

Lossless audio codec for internal use. Target compression is FLAC-class (~50% of raw). Integer-only, bit-exact, streaming-oriented.

## Scope

- **Input**: signed integer PCM passed as `i32` with `|sample| ≤ 2²³ − 1`. 8-bit, 16-bit, 20-bit, and 24-bit sources are all valid without conversion — they compress at the bit cost of their actual values, not a 24-bit ceiling.
- **Sample rate**: caller-specified; not encoded in the stream. The container or transport carries it.
- **Channels**: mono per encoded stream. Stereo is two independent mono streams — for example, two QUIC streams over a shared connection, one per channel. No cross-channel joint coding.
- **Frames**: independently decodable. No cross-frame state; a lost or corrupt frame never affects subsequent decodes.

## Pipeline

```text
samples → LPC analysis → residuals → partitioned Rice → frame bytes
                          ↓ (inverse for decode)
```

Three encoder-side choices, searched per frame:

- **LPC order**: the reference encoder tries a sparse grid `{0, 2, 4, 6, 8, 10, 12, 16, 20, 24, 28, 32}` with a 2-order early-out once cost stops improving. Order 0 is verbatim (residuals equal the raw samples). The wire format permits any order in `[0, 32]`.
- **Coefficient shift** `∈ [0, 5]`: widens the Q-format of the stored predictor coefficients from Q15 (range `[−1, 1)`) out to Q10 (range `[−32, 32)`) so low-frequency / narrow-resonance content doesn't clamp `|a[1]|` near 2. Chosen deterministically per order as the smallest shift that avoids clamping.
- **Rice partition order** `∈ [0, 7]`: splits the residual stream into `2^partition_order` equal partitions, each with its own Rice parameter `k ∈ [0, 23]` chosen by convex descent.

Levinson-Durbin runs once up to order 32 into a flat stack-allocated buffer (`LpcLevels`), and the per-order coefficients are consulted by slice reference; the order search itself does no heap allocation.

## Intended use

- **QUIC streaming** — one reliable stream per audio channel.
  Frames fit the per-stream framing (length-prefixed or datagram-mapped) without modification.
- **Offline file playback** — a container pairs the channel streams by timestamp; each stream decodes independently.

## Frame size guidance

Frame size is a latency-vs-compression knob chosen at the application layer. The codec accepts any `frame_sample_count` in `[1, 65535]`, but the LPC/Rice search amortises better on larger frames (shared header, more samples per fitted coefficient vector). Concrete defaults:

| Use case | Frame size | Latency | Notes |
|---|---|---:|---|
| Real-time voice, tight latency | 160 @ 16 kHz | 10 ms | matches WebRTC/Opus 10 ms mode |
| Real-time voice, balanced | **320 @ 16 kHz** | 20 ms | default for MCU workload in `tests/mcu_mix.rs` |
| Game/conf streaming | **960 @ 48 kHz** | 20 ms | one QUIC datagram per frame fits typical MTUs |
| Music streaming | **2048 @ 48 kHz** | 43 ms | compression benefit flattens past this |
| Offline archival | **4096 @ 48 kHz** | 85 ms | tightest LPC fit; default in `tests/corpus.rs`, matches FLAC's default blocksize for apples-to-apples compression comparison |

The usable partition orders are those whose partition count `2^partition_order` evenly divides the frame size. Power-of-two frame sizes (256, 512, 1024, 2048, 4096) unlock every `partition_order ∈ [0, 7]`; 960 and 2880 (common WebRTC frame sizes) allow orders up to 6 and 5 respectively; prime sizes like 137 collapse to `partition_order = 0`. Prefer power-of-two frame sizes unless a container format constrains the choice.
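The divisibility rule can be sketched as a small helper. This is a minimal sketch under the simplifying assumption that a partition order is usable exactly when `2^partition_order` divides the frame size; `max_even_partition_order` is hypothetical, not part of the lac API, and the encoder's actual search may impose further constraints:

```rust
/// Largest partition order in [0, 7] whose 2^order partitions evenly
/// divide `frame_sample_count`. Hypothetical helper for illustration.
fn max_even_partition_order(frame_sample_count: usize) -> u32 {
    // trailing_zeros counts the powers of two dividing the frame size.
    frame_sample_count.trailing_zeros().min(7)
}

fn main() {
    assert_eq!(max_even_partition_order(4096), 7); // power of two: all orders usable
    assert_eq!(max_even_partition_order(960), 6);  // 960 = 2^6 * 15
    assert_eq!(max_even_partition_order(137), 0);  // prime: order 0 only
}
```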
## Structure

```text
lac/
├── Cargo.toml
├── README.md            ← you are here
├── Specification.md     ← wire format specification
├── corpus/              ← test WAVs (speech + music), LFS-tracked via .gitattributes
├── src/
│   ├── lib.rs           ← public API and project-wide constants
│   ├── bit_io.rs        ← MSB-first bit reader/writer
│   ├── lpc.rs           ← Levinson-Durbin, LpcLevels flat buffer, residuals/synthesis
│   ├── rice.rs          ← zigzag + partitioned Rice coding, convex-descent k
│   ├── frame.rs         ← frame header, encode_frame, decode_frame
│   └── test_signals.rs  ← integer-only sine LUT for float-free test inputs
├── tests/
│   ├── corpus.rs        ← compression ratio + FLAC comparison on real audio
│   ├── synthetic.rs     ← bit-depth + pathological-content round-trips, no corpus needed
│   ├── latency.rs       ← P50/P95/P99/max encode+decode latency, peak heap, alloc count
│   └── mcu_mix.rs       ← end-to-end MCU workload (decode → mix → re-encode)
├── benches/
│   ├── codec.rs         ← nightly #[bench] harness (encode, decode, compute_residuals)
│   └── compare-flac.sh  ← diagnostic shell script: wall-clock flac encode across corpus
└── fuzz/
    ├── fuzz_targets/
    │   ├── decode_arbitrary.rs    ← decoder robustness under arbitrary bytes
    │   └── roundtrip_arbitrary.rs ← encoder/decoder self-consistency
    └── dict/
        ├── decode_arbitrary.dict    ← libFuzzer dict: sync word + field boundary constants
        └── roundtrip_arbitrary.dict ← libFuzzer dict: sample-value boundaries
```

See `Specification.md` for the normative wire format.

## Public API

Every sample is an `i32` with magnitude bounded by `2²³ − 1`. Narrower integer sources go through unchanged:

```rust
use lac::{encode_frame, decode_frame};

// 16-bit microphone PCM → just widen with `i32::from`. Do NOT shift
// left by 8 to "align" to 24-bit: that multiplies residual magnitudes
// by 256 and costs 8 extra bits per residual in the Rice payload. The
// codec compresses at the bit cost of the actual sample magnitudes,
// not a 24-bit ceiling.
let pcm_16: Vec<i16> = /* from microphone */ Vec::new();
let samples: Vec<i32> = pcm_16.iter().map(|&s| i32::from(s)).collect();

let bytes = encode_frame(&samples);
let recovered: Vec<i32> = decode_frame(&bytes)?;
assert_eq!(recovered, samples);
# Ok::<(), lac::DecodeError>(())
```

For 24-bit PCM, samples are already in range — pass through directly. For 8-bit PCM, use `i32::from(s as i8)` (signed) or the equivalent from your unsigned-offset-128 source. Round-trip is bit-exact: `decode_frame(encode_frame(s)) == s` for every valid `s`.

### Buffer-reusing API for hot loops

For the MCU re-encode fanout and QUIC senders that own a per-channel scratch buffer, use [`encode_frame_into`] / [`decode_frame_into`] to target a caller-owned `Vec<u8>` / `Vec<i32>` instead of allocating fresh on each call:

```rust
use lac::{encode_frame_into, decode_frame_into};

let mut encoded = Vec::new(); // one buffer per channel, reused across frames
let mut decoded = Vec::new();

for frame_samples in frames_iter() {
    encode_frame_into(&frame_samples, &mut encoded);
    // … send `encoded` …
}
for incoming_bytes in incoming_iter() {
    decode_frame_into(&incoming_bytes, &mut decoded)?;
    // … consume `decoded` …
}
# fn frames_iter() -> impl Iterator<Item = Vec<i32>> { std::iter::empty() }
# fn incoming_iter() -> impl Iterator<Item = Vec<u8>> { std::iter::empty() }
# Ok::<(), lac::DecodeError>(())
```

Both `_into` variants clear the destination at entry and retain its capacity, so steady-state usage makes zero allocations past the first frame.

### Output size expectations

For realistic audio (speech, music, ambient), compressed frames land around **15-55 %** of raw sample bytes (speech near the low end, music near the high end). Callers reusing a scratch buffer can safely preallocate to 1× raw and take the extension cost only on the rare adversarial frame.
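As a concrete illustration of the preallocation guidance, a minimal self-contained sketch — the `raw_frame_bytes` helper is hypothetical, not part of the lac API:

```rust
use std::mem::size_of;

/// Hypothetical helper: 1× raw bytes for a frame of `n` i32 samples.
/// Realistic compressed frames land at 15-55 % of this, so a scratch
/// buffer of this capacity avoids reallocation in the steady state.
fn raw_frame_bytes(n: usize) -> usize {
    n * size_of::<i32>()
}

fn main() {
    // 20 ms @ 48 kHz mono → 960 samples → 3840-byte scratch buffer.
    assert_eq!(raw_frame_bytes(960), 3840);
    let encoded: Vec<u8> = Vec::with_capacity(raw_frame_bytes(960));
    assert!(encoded.capacity() >= 3840);
}
```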
For untrusted input — payloads where residuals might be crafted to maximise Rice output — the worst-case expansion bound is ~17× raw: at the Rice `k = 23` ceiling, each codeword is up to 535 bits (511 unary zeros + terminator + 23 remainder bits), or ~67 bytes per residual. A pipeline that must pre-size a bounded output buffer for arbitrary input can use `samples.len() * 68` bytes as a loose upper bound. The encoder never exceeds this.

### Error recovery

On decode failure the caller substitutes `frame_sample_count` zeros (silence) for the frame period. The count is recoverable from the frame itself as long as the *header* parsed, even if the bitstream body then failed — call [`parse_header`] on the same buffer:

```rust
use lac::{decode_frame, parse_header};

const SESSION_DEFAULT_FRAME: usize = 320; // negotiated at session start

let bytes = Vec::<u8>::new();
let samples = match decode_frame(&bytes) {
    Ok(s) => s,
    Err(_) => {
        let count = parse_header(&bytes)
            .map(|(h, _)| h.frame_sample_count as usize)
            .unwrap_or(SESSION_DEFAULT_FRAME);
        vec![0i32; count]
    }
};
```

When the header itself fails (`BadSyncWord`, `InvalidPredictionOrder`, `InvalidPartitionOrder`, `InvalidCoefficientShift`, or `Truncated` below 7 bytes), the frame length is unknowable and the caller must fall back to a session-level default.

[`encode_frame_into`]: https://docs.rs/lac/latest/lac/fn.encode_frame_into.html
[`decode_frame_into`]: https://docs.rs/lac/latest/lac/fn.decode_frame_into.html
[`parse_header`]: https://docs.rs/lac/latest/lac/fn.parse_header.html

## Concurrency

LAC's encode and decode APIs are pure functions with no shared state — no globals, no internal `Mutex`, no `unsafe`. All public types are `Send + Sync`. Calls on different threads never contend with each other, and each call's scratch buffers are owned (stack or the caller-supplied `Vec`). The intended deployment shape for multi-channel and multi-stream workloads is **one thread or task per channel**.
The codec itself does no threading: scheduling is left to the application so it can pick whichever executor fits (tokio for async servers, rayon for data-parallel workloads, `std::thread` for straight-ahead concurrency).

MCU re-encode fanout with stdlib primitives only:

```rust
use std::thread;
use lac::encode_frame;

let mixes: Vec<Vec<i32>> = Vec::new();
let outgoing: Vec<Vec<u8>> = thread::scope(|s| {
    let handles: Vec<_> = mixes
        .iter()
        .map(|mix| s.spawn(move || encode_frame(mix)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
});
```

Or with rayon, if the project already pulls it in:

```rust
// use rayon::prelude::*;
// let outgoing: Vec<Vec<u8>> = mixes.par_iter().map(|m| encode_frame(m)).collect();
```

The allocator you link against sets the ceiling on multi-core scaling: glibc `malloc` has measurable lock contention at tens of cores, whereas mimalloc / jemalloc keep per-thread caches and scale further. The codec itself doesn't care which one you pick — it allocates through the global allocator like any other Rust library.

### Input-size caps on untrusted channels

Applications accepting LAC frames from untrusted peers should cap the per-frame input size at the application layer. The decoder's per-codeword unary-run bound (spec §4.2) prevents any single codeword from consuming unbounded CPU, but total decode cost scales with buffer length; an attacker handing over an unbounded payload can force proportional scan work. Typical real frames are sub-kilobyte; **a cap of 64 KB per frame is comfortably above any legitimate LAC payload and cheap to enforce at the framing layer** (QUIC stream length field, length-prefixed framing, etc.). The `Truncated` error fires naturally when a payload is cut, so a hard cap doesn't break legal traffic — it just bounds pathological work.

### Silence-substitution amplification

Spec §6.1 mandates that callers substitute `frame_sample_count` zeros on decode failure.
An attacker can craft a tiny frame (a ~10-byte header with `frame_sample_count = 65535`) whose Rice payload is malformed; the decoder rejects it, and the caller dutifully emits 65 535 output samples of silence. As mono `i32` samples, that's **~256 KB of zeros per ~10-byte input frame — a ~25 000× amplification**. The output is silence, not attacker-chosen data, so this is a downstream-resource-exhaustion vector (memory, bandwidth, re-encode work at an MCU) rather than a data-injection vector.

Mitigation is at the application layer: **cap `frame_sample_count` to the session's negotiated frame size** before invoking the silence substitution. QUIC / WebRTC sessions already negotiate a frame size at setup; using that as a hard upper bound on the silence-fill length collapses the amplification ratio to ~1×. An MCU that reads `parse_header(&data).frame_sample_count` without validating it against the session cap inherits the amplification unchanged.

## Packet loss & concealment

Frames are independently decodable: losing one frame never corrupts another, regardless of which concealment strategy the application picks. This is a genuine deployment asset on lossy transports (QUIC datagrams, UDP). The strategies below are listed in increasing order of quality.

### Strategy 1: silence substitution (the default)

The baseline `decode_frame` returns `Err` on structural failure; the application substitutes `frame_sample_count` zeros for the lost frame period (see the `parse_header` recovery pattern under *Public API → Error recovery*). Fast, deterministic, audible as a brief cut — acceptable for voice up to ~20 ms of loss, jarring beyond that.

### Strategy 2: sample-and-hold

Repeat the last successfully decoded sample for the frame period. Zero-cost on the decoder side; preserves DC level, so the click at the drop boundary is softer than with silence.
Quality at 20 ms of loss is better than silence for voice, slightly worse for music (holding a DC level on a non-stationary signal adds a small transient when the next frame arrives).

```rust
// After a successful decode, store the last sample for reuse on loss.
// On loss: fill the gap with that value.
# fn last_decoded_sample() -> i32 { 0 }
# const N: usize = 320;
let conceal = vec![last_decoded_sample(); N];
```

### Strategy 3: linear fade

Interpolate from the last valid sample down to zero over the lost frame period. Removes both the DC-hold transient and the "cut to silence" click. Costs N integer adds per lost frame. Recommended baseline for any application that can afford 2-5 lines of PLC code.

### Strategy 4: LPC-coefficient extrapolation

The last successfully decoded frame's [`AudioFrameHeader`] carries the LPC coefficients the encoder chose — available from [`parse_header`] at no extra cost — and the LPC filter is locally stationary over a 20-40 ms horizon. Run the synthesis formula (§3.6 of `Specification.md`) forward from the last decoded samples to *predict* the missing frame. Quality is best on pitched content (voiced speech, sustained notes); on transients it degrades gracefully because the predictor's autoregressive behaviour damps toward zero over the frame.

This is not built into the library — the math is straightforward, and the "right" tuning varies by deployment (how much damping, whether to blend with sample-and-hold on transients, etc.). See `lpc_synthesize_into` in `src/lpc.rs` for the integer synthesis routine a PLC implementation would call.

### Multi-frame loss guidance

The strategies above are only useful up to a handful of consecutive lost frames.
Rough thresholds at 20 ms frame periods:

| Consecutive lost frames | Effective loss | Verdict |
|---|---|---|
| 1 | 20 ms | Inaudible with fade or LPC extrapolation; brief click with silence or sample-and-hold |
| 2-3 | 40-60 ms | Noticeable glitch; LPC extrapolation minimises but cannot hide it |
| 4-10 | 80-200 ms | Audible dropout. PLC keeps the audio from sounding "broken" but doesn't restore content |
| > 10 | > 200 ms | Treat the stream as broken; reset the receiver's concealment state to avoid droning artifacts, and if possible ask the transport to signal "resync" upstream |

Mid-stream resync on a datagram transport uses the sync word (`0x1ACC`) as an alignment anchor: on a string of bad frames, search the next `N` bytes of the buffer for the big-endian sequence `\x1a\xcc` and retry `parse_header` from each candidate offset until one succeeds. The search is O(N); on a 20 ms frame at 48 kHz there are at most ~180 bytes per frame to scan, so the amortised cost is negligible.

[`AudioFrameHeader`]: https://docs.rs/lac/latest/lac/struct.AudioFrameHeader.html

## Testing

```
cargo test                                                          # unit tests
cargo test --test corpus --release -- --nocapture                   # compression vs FLAC, lac_enc_ms
cargo test --test synthetic --release -- --nocapture                # bit-depth + pathological content
cargo test --test latency --release -- --nocapture --test-threads=1 # p50/p95/p99 + alloc count
cargo test --test mcu_mix --release -- --nocapture --test-threads=1 # MCU throughput
cargo test --test conformance --release -- --nocapture              # byte-level spec conformance
cargo test --test determinism --release                             # encode byte-equality on repeat
cargo fuzz run decode_arbitrary -- -dict=dict/decode_arbitrary.dict
cargo fuzz run roundtrip_arbitrary -- -dict=dict/roundtrip_arbitrary.dict
cargo bench                                                         # nightly bench
benches/compare-flac.sh                                             # flac side of the speed table
```

**Published-crate caveat.** `Cargo.toml` excludes `corpus/*` and `fuzz/*` from the published tarball — they'd blow up crate size and the audio
isn't redistributable under crates.io's constraints anyway. A user running `cargo test` against a `cargo add lac`'d dependency sees every corpus test *pass* because the `require_corpus!` macro skips missing files silently; the compression-ratio assertions, FLAC comparisons, latency P99 checks, and MCU throughput checks all go unrun. The full regression suite requires the git repository (with LFS pulled). The synthetic, conformance, determinism, and unit tests run unchanged from either source.

Coverage at a glance:

- **Unit** — round-trips for every LPC order 0-32 and every partition order 0-7, prime frame lengths that force `partition_order = 0`, all-zero frames, full-scale sample magnitudes, malformed-header rejection for every field (`sync_word`, `prediction_order`, `partition_order`, `coefficient_shift`), truncated bitstreams, and a convex-descent vs exhaustive-search `select_k` differential.
- **Corpus** — round-trip + compression-ratio + FLAC subprocess comparison on a mixed speech and music corpus; asserts ratio ceilings so a codec regression fails CI; prints LAC encode wall-clock for correlation against `benches/compare-flac.sh`.
- **Synthetic** — deterministic LFSR-driven round-trips at 8/16/20/24-bit source widths and pathological content (all-zero, DC offset, Nyquist square, silence + click, full-scale constant, prime-length frame). No corpus dependency, so the tests run on every CI checkout.
- **Latency** — per-frame encode/decode timing on real speech with a custom tracking allocator for peak-heap *and* per-frame allocation-count numbers; reports P50/P95/P99/max and asserts P99 < frame period so a real-time regression fails CI.
- **MCU** — decode → PCM mix → re-encode simulation on real speech for 2/3/5/8/16 participants (continuous speech) plus an 8-participant rotating dominant-speaker variant; asserts MCU egress ≤ SFU-fanout egress.
- **Fuzz** — libFuzzer targets for decoder robustness and encoder/decoder self-consistency on arbitrary bytes, seeded with dictionaries of the wire-format constants (sync word, field boundaries) and sample-magnitude boundaries (8/16/20/24-bit ceilings).

## Measurements

### Reference hardware

| Short name | CPU | ISA highlights |
|---|---|---|
| **7840HS** | AMD Ryzen 7 7840HS (laptop, 8c/16t, up to 5.1 GHz) | AVX-512 (F/BW/CD/DQ/VL/VNNI/VBMI), BMI2, FMA |
| **RPi5** | Raspberry Pi 5 (Cortex-A76 quad, 2.4 GHz) | NEON |
| **VF2** | StarFive VisionFive 2 (SiFive U74 quad, 1.5 GHz) | RVV 0.7 (some LLVM autovec, less mature than x86 or NEON) |

Numbers below are measured at default `cargo build --release` (no `target-cpu=native`, no project-level `RUSTFLAGS`). Empty cells are awaiting measurement on the listed hardware. The FLAC comparison uses both `-5` (the CLI default, what production pipelines typically use) and `-8` (`--best`, the compression upper bound).

### Corpus attribution

The measurements are taken on two publicly-licensed audio corpora checked into `corpus/`:

- **Speech**: the [AMI Meeting Corpus](https://groups.inf.ed.ac.uk/ami/corpus/) (files named `ES2002a.*`), recorded by the AMI Consortium (University of Edinburgh, IDIAP, TNO, Brno University of Technology, University of Sheffield, and partners). Distributed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
- **Music**: Kimiko Ishizaka's recording of J.S. Bach's *Goldberg Variations, BWV 988* (files named `Kimiko Ishizaka - …`), from the [Open Goldberg Variations project](https://opengoldbergvariations.org/) (Robert Douglass, producer). Released under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/) — public domain dedication, no attribution legally required, credited here as a courtesy.

Both corpora are used unmodified apart from the file selection described in the tables below.

### Compression (hardware-independent, bit-exact across targets)

LAC ratio = LAC encoded / raw PCM.
Both codecs use the same 4096-sample block size on this corpus — LAC's `tests/corpus.rs` sets `FRAME_SIZE = 4096`, which matches FLAC's default blocksize at `-5` and `-8` for ≤ 48 kHz content, so header and coefficient overhead is amortised identically on both sides.

| Corpus file | Class | LAC | FLAC -5 | FLAC -8 | LAC / -5 | LAC / -8 |
|---|---|---:|---:|---:|---:|---:|
| `ES2002a.Headset-0.wav` | headset speech, 16 kHz / 16-bit | 0.178 | 0.187 | 0.186 | 0.954 | 0.958 |
| `ES2002a.Mix-Headset.wav` | mixed meeting, 16 kHz / 16-bit | 0.292 | 0.300 | 0.297 | 0.975 | 0.984 |
| `ES2002a.Array1-01.wav` | array speech, 16 kHz / 16-bit | 0.375 | 0.378 | 0.377 | 0.989 | 0.994 |
| Goldberg Aria (01) | solo piano, 96 kHz / 24-bit | 0.483 | 0.458 | 0.457 | 1.053 | 1.056 |
| Goldberg Variatio 4 (05, fughetta) | solo piano, 96 kHz / 24-bit | 0.514 | 0.483 | 0.481 | 1.065 | 1.067 |
| Goldberg Variatio 16 (17, Ouverture) | solo piano, 96 kHz / 24-bit | 0.512 | 0.479 | 0.478 | 1.068 | 1.070 |

Speech reliably beats FLAC at both levels by a small margin; music trails by 5-7 % (the Q-format gap at low frequencies, mitigated but not eliminated by `coefficient_shift`). FLAC's jump from `-5` to `-8` buys essentially nothing on this corpus (≤ 0.2 pp of ratio), so the realistic LAC-vs-FLAC comparison in practice is against `-5`. The numbers are byte-identical regardless of hardware because LAC's output is specified bit-exactly.

### Encode wall-clock (ms, full file)

One table per hardware target; each lists LAC alongside both FLAC levels so the speed cost of each quality point is visible. The `-5` column is the most representative real-world comparison.
**7840HS** (AMD Ryzen 7 7840HS):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---:|---:|---:|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 1158 | 221 | 436 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 1292 | 226 | 447 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 1367 | 223 | 469 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 809 | 272 | 647 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 2126 | 754 | 1741 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 3521 | 1166 | 2703 |

**RPi5** (Raspberry Pi 5, Cortex-A76 @ 2.4 GHz):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---:|---:|---:|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 2856 | 477 | 959 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 3249 | 495 | 1096 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 3363 | 505 | 1132 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 1904 | 606 | 1570 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 5201 | 1627 | 4324 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 9015 | 2572 | 6832 |

**VF2** (StarFive VisionFive 2, SiFive U74 quad @ 1.5 GHz):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---:|---:|---:|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 29385 | 2355 | 5614 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 33231 | 2502 | 6688 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 34899 | 2548 | 6878 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 18185 | 3184 | 9278 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 49811 | 8535 | 25454 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 88208 | 13544 | 40650 |

LAC is ~5-6× slower than FLAC `-5` and ~2-3× slower than FLAC `--best` on x86 because libFLAC ships hand-tuned SSE intrinsics for its autocorrelation kernel while LAC relies on LLVM autovectorization.
End-to-end performance barely changes with `target-cpu=native`: the kernel does pick up AVX-512 zmm dot-products, but frame encode is bottlenecked elsewhere (the Rice k-search and bitstream assembly dominate the remaining time).

On RPi5 (ARM Cortex-A76, NEON) LAC runs ~2.5× slower than on 7840HS in absolute terms, but the *ratio* against FLAC shifts noticeably: LAC is ~6-7× slower than FLAC `-5` on speech (a wider gap — libFLAC's NEON path is well-tuned for the 16 kHz / 16-bit content) but only ~3× slower on 96 kHz / 24-bit music (a narrower gap — 24-bit content gives libFLAC's specialization less leverage). Against FLAC `--best` the music gap narrows further, to ~1.2-1.3×. The 7840HS-vs-RPi5 delta in the LAC column shows that scalar autovectorization quality is broadly comparable across the x86 and ARM backends; the delta in the FLAC columns shows what happens when hand-tuned intrinsics disappear on a different ISA.

On VF2 (RISC-V SiFive U74, RVV 0.7 — not supported by mainline libFLAC or LLVM autovectorization yet) LAC runs ~10× slower than on RPi5. Both codecs fall back to pure scalar execution; the gap between them *widens* to ~12-13× on speech and ~6× on music vs FLAC `-5`, or ~5× / ~2× vs FLAC `--best`. Two factors compound: the U74 is a single-issue in-order core vs the Cortex-A76's dual-issue out-of-order design (base IPC is ~2× lower at the ISA-agnostic level), and LLVM's scalar Rust codegen for RISC-V is less mature than its x86/ARM output — the tighter inner loops in libFLAC's hand-written C survive this better than LAC's Rust does. The absolute numbers are still useful: even at 88 s to encode 5 minutes of 96/24 stereo audio, LAC comfortably meets realtime for streaming use (see the P99 latency table below).

### Per-frame encode latency P99 (µs)

All rows use real AMI speech samples. The frame sample count sets the real-time deadline: P99 must stay below that period for the frame to ship inside its own playback slot.
| Test | Frame | Period | 7840HS P99 | RPi5 P99 | VF2 P99 |
|---|---|---:|---:|---:|---:|
| `latency_headset_speech_160` | 160 @ 16 kHz | 10 ms | 20 | 38 | 235 |
| `latency_headset_speech_320` | 320 @ 16 kHz | 20 ms | 36 | 76 | 499 |
| `latency_headset_speech_480` | 480 @ 16 kHz | 30 ms | 37 | 81 | 635 |
| `latency_headset_speech_prime` | 503 @ 16 kHz | 31 ms | 23 | 52 | 387 |
| `latency_array_speech_320` | 320 @ 16 kHz | 20 ms | 42 | 77 | 506 |
| `latency_mixed_meeting_320` | 320 @ 16 kHz | 20 ms | 43 | 84 | 551 |

P99 headroom is ~400-1300× on 7840HS, ~130-600× on RPi5, and ~36-81× on VF2. Every row on every platform stays comfortably inside the realtime deadline — even VF2's worst case (`mixed_meeting_320` at 551 µs on a 20 ms frame) has 36× margin. LAC meets its streaming contract on every target tested.

### MCU throughput (× realtime on one core)

Realtime multiplier = audio-ms processed per wall-clock-ms, per core. "20× realtime" means one core sustains twenty simultaneous meetings of the listed configuration.

| Test | Activity | 7840HS | RPi5 | VF2 |
|---|---|---:|---:|---:|
| `mcu_mix_1on1_voice` (P=2) | continuous | 279× | 145× | 22× |
| `mcu_mix_3people_voice` (P=3) | continuous | 193× | 95× | 14× |
| `mcu_mix_5people_voice` (P=5) | continuous | 120× | 57× | 9× |
| `mcu_mix_8people_voice` (P=8) | continuous | 77× | 35× | 5× |
| `mcu_mix_8people_dominant_speaker` (P=8) | rotating speaker | 106× | 43× | 6× |
| `mcu_mix_16people_voice` (P=16) | continuous | 39× | 17× | 2.5× |

MCU egress byte count as a fraction of SFU-fanout egress on 7840HS: 1.00 (P=2, trivially equal), 0.60 (P=3), 0.36 (P=5), 0.22 (P=8 continuous), 0.35 (P=8 dominant-speaker), 0.10 (P=16). The continuous case is the lower bound — SFU fanout scales quadratically in participant count while MCU mix egress scales linearly, so the relative savings grow as the meeting does.
The dominant-speaker case inverts that trend slightly: SFU fanout of N-1 near-silent streams is almost free, so the SFU baseline falls faster than the MCU mix cost does. These numbers are byte-accounting, not wall-clock.
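The quadratic-vs-linear scaling can be sketched with an idealized stream-count model. The helper functions below are hypothetical, not part of any test; measured byte ratios in the table above sit somewhat higher than this model predicts because a mixed stream compresses worse than its individual inputs:

```rust
/// Idealized egress stream counts per frame period: an SFU forwards
/// each of P speakers' streams to the other P-1 participants, while
/// an MCU sends one mixed stream to each participant.
fn sfu_streams(p: u32) -> u32 { p * (p - 1) } // quadratic in P
fn mcu_streams(p: u32) -> u32 { p }           // linear in P

fn main() {
    assert_eq!(sfu_streams(2), 2);
    assert_eq!(mcu_streams(2), 2);    // P=2: trivially equal, ratio 1.00
    assert_eq!(sfu_streams(16), 240); // P=16: the SFU side has exploded…
    assert_eq!(mcu_streams(16), 16);  // …while the MCU side grew linearly
}
```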