
# LAC — Lo Audio Codec

Lossless audio codec for internal use. Target compression is FLAC-class
(~50% of raw). Integer-only, bit-exact, streaming-oriented.

## Scope

- **Input**: signed integer PCM passed as `i32` with `|sample| ≤ 2²³ − 1`.
  8-bit, 16-bit, 20-bit, and 24-bit sources are all valid without
  conversion — they compress at the bit cost of their actual values, not
  a 24-bit ceiling.
- **Sample rate**: caller-specified; not encoded in the stream. The container
  or transport carries it.
- **Channels**: mono per encoded stream. Stereo is two independent mono
  streams — for example, two QUIC streams over a shared connection, one per
  channel. No cross-channel joint coding.
- **Frames**: independently decodable. No cross-frame state; a lost or corrupt
  frame never affects subsequent decodes.

## Pipeline

```text
samples  →  LPC analysis  →  residuals  →  partitioned Rice  →  frame bytes
                                                ↓
                                       (inverse for decode)
```

Three encoder-side choices, searched per frame:

- **LPC order**: the reference encoder tries a sparse grid
  `{0, 2, 4, 6, 8, 10, 12, 16, 20, 24, 28, 32}` with a 2-order early-out
  once cost stops improving. Order 0 is verbatim (residuals equal the raw
  samples). The wire format permits any order in `[0, 32]`.
- **Coefficient shift** `∈ [0, 5]`: widens the Q-format of the stored
  predictor coefficients from Q15 (range `[−1, 1)`) out to Q10 (range
  `[−32, 32)`) so low-frequency / narrow-resonance content doesn't clamp
  `|a[1]|` near 2. Chosen deterministically per order as the smallest
  shift that avoids clamping.
- **Rice partition order** `∈ [0, 7]`: splits the residual stream into
  `2^partition_order` equal partitions, each with its own Rice parameter
  `k ∈ [0, 23]` chosen by convex descent.
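
The Rice parameter search for a single partition can be sketched roughly as follows. This is an illustration only, not the crate's code: the helper names (`zigzag`, `rice_cost`, `select_k_sketch`) are hypothetical, and starting the descent at `k = 0` and stepping upward is an assumption about how the descent might be organised.

```rust
/// Zigzag-map a signed residual to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4.
pub fn zigzag(r: i32) -> u32 {
    ((r << 1) ^ (r >> 31)) as u32
}

/// Bits needed to Rice-code one partition at parameter `k`:
/// unary quotient + one terminator bit + `k` remainder bits per residual.
pub fn rice_cost(residuals: &[i32], k: u32) -> u64 {
    residuals
        .iter()
        .map(|&r| u64::from(zigzag(r) >> k) + 1 + u64::from(k))
        .sum()
}

/// Descent sketch: raise `k` while the total cost keeps falling,
/// stopping at the first non-improving step or at the k = 23 ceiling.
pub fn select_k_sketch(residuals: &[i32]) -> u32 {
    let mut k = 0;
    let mut best = rice_cost(residuals, k);
    while k < 23 {
        let next = rice_cost(residuals, k + 1);
        if next >= best {
            break;
        }
        best = next;
        k += 1;
    }
    k
}
```

For the residuals `[0, -1, 1, -2, 2]` the cost is 15 bits at `k = 0`, 14 bits at `k = 1`, and 16 bits at `k = 2`, so the sketch settles on `k = 1`.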

Levinson-Durbin runs once up to order 32 into a flat stack-allocated
buffer (`LpcLevels`) and the per-order coefficients are consulted by
slice reference; the order search itself does no heap allocation.

## Intended use

- **QUIC streaming** — one reliable stream per audio channel. Frames fit
  the per-stream framing (length-prefixed or datagram-mapped) without
  modification.
- **Offline file playback** — a container pairs the channel streams by
  timestamp; each stream decodes independently.

## Frame size guidance

Frame size is a latency-vs-compression knob chosen at the application
layer. The codec accepts any `frame_sample_count` in `[1, 65535]`, but
the LPC/Rice search amortises better on larger frames (shared header,
more samples per fitted coefficient vector). Concrete defaults:

| Use case | Frame size (samples) | Sample rate | Frame duration | Notes |
|---|---:|---|---:|---|
| Real-time voice, tight latency | 160 | 16 kHz | 10 ms | matches WebRTC/Opus 10 ms mode |
| Real-time voice, balanced | **320** | 16 kHz | 20 ms | default for MCU workload in `tests/mcu_mix.rs` |
| Game/conf streaming | **960** | 48 kHz | 20 ms | one QUIC datagram per frame fits typical MTUs |
| Music streaming | **2048** | 48 kHz | 43 ms | compression benefit flattens past this |
| Offline archival | **4096** | 48 kHz | 85 ms | tightest LPC fit; default in `tests/corpus.rs`, matches FLAC's default blocksize for apples-to-apples compression comparison |

A partition order is usable only when `2^partition_order` divides the
frame size evenly, so the frame size decides how much of the partition
search is available. Power-of-two frame sizes (256, 512, 1024, 2048,
4096) unlock every `partition_order ∈ [0, 7]`; 960 and 2880 (common
WebRTC frame sizes, each 2⁶ times an odd factor) allow orders up to 6;
prime sizes like 137 collapse to `partition_order = 0`. Prefer
power-of-two frame sizes unless a container format constrains the choice.
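
The divisibility rule can be checked with a one-liner. The sketch below assumes that even divisibility by `2^partition_order` is the only eligibility constraint; the function name is hypothetical and not part of the crate.

```rust
/// Largest partition order in [0, 7] whose partition count (2^order)
/// divides the frame size evenly. Assumes `frame_sample_count >= 1`.
pub fn max_partition_order(frame_sample_count: u32) -> u32 {
    // The number of trailing zero bits is the largest power of two
    // dividing the frame size; the wire format caps the order at 7.
    frame_sample_count.trailing_zeros().min(7)
}
```

Under that assumption, `max_partition_order(4096)` is 7, `max_partition_order(960)` is 6, and `max_partition_order(137)` is 0.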

## Structure

```text
lac/
├── Cargo.toml
├── README.md                ← you are here
├── Specification.md         ← wire format specification
├── corpus/                  ← test WAVs (speech + music), LFS-tracked via .gitattributes
├── src/
│   ├── lib.rs               ← public API and project-wide constants
│   ├── bit_io.rs            ← MSB-first bit reader/writer
│   ├── lpc.rs               ← Levinson-Durbin, LpcLevels flat buffer, residuals/synthesis
│   ├── rice.rs              ← zigzag + partitioned Rice coding, convex-descent k
│   ├── frame.rs             ← frame header, encode_frame, decode_frame
│   └── test_signals.rs      ← integer-only sine LUT for float-free test inputs
├── tests/
│   ├── corpus.rs            ← compression ratio + FLAC comparison on real audio
│   ├── synthetic.rs         ← bit-depth + pathological-content round-trips, no corpus needed
│   ├── latency.rs           ← P50/P95/P99/max encode+decode latency, peak heap, alloc count
│   └── mcu_mix.rs           ← end-to-end MCU workload (decode → mix → re-encode)
├── benches/
│   ├── codec.rs             ← nightly #[bench] harness (encode, decode, compute_residuals)
│   └── compare-flac.sh      ← diagnostic shell script: wall-clock flac encode across corpus
└── fuzz/
    ├── fuzz_targets/
    │   ├── decode_arbitrary.rs     ← decoder robustness under arbitrary bytes
    │   └── roundtrip_arbitrary.rs  ← encoder/decoder self-consistency
    └── dict/
        ├── decode_arbitrary.dict   ← libFuzzer dict: sync word + field boundary constants
        └── roundtrip_arbitrary.dict ← libFuzzer dict: sample-value boundaries
```

See `Specification.md` for the normative wire format.

## Public API

Every sample is an `i32` with magnitude bounded by `2²³ − 1`. Narrower
integer sources go through unchanged:

```rust
use lac::{encode_frame, decode_frame};

// 16-bit microphone PCM → just widen with `i32::from`. Do NOT shift
// left by 8 to "align" to 24-bit: that multiplies residual magnitudes
// by 256 and costs 8 extra bits per residual in the Rice payload. The
// codec compresses at the bit cost of the actual sample magnitudes,
// not a 24-bit ceiling.
let pcm_16: Vec<i16> = /* from microphone */ Vec::new();
let samples: Vec<i32> = pcm_16.iter().map(|&s| i32::from(s)).collect();

let bytes = encode_frame(&samples);
let recovered: Vec<i32> = decode_frame(&bytes)?;
assert_eq!(recovered, samples);
# Ok::<(), lac::DecodeError>(())
```

For 24-bit PCM, samples are already in range and pass through directly.
For 8-bit PCM, use `i32::from(s)` on signed `i8` samples, or
`i32::from(s) - 128` on an unsigned offset-128 source such as 8-bit WAV.
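
A minimal sketch of the two less obvious widenings, assuming unsigned offset-128 bytes for 8-bit input and packed little-endian 3-byte samples for 24-bit input. Both helper names are hypothetical and not part of the crate.

```rust
/// Unsigned offset-128 8-bit PCM (as stored in 8-bit WAV) -> signed i32.
pub fn widen_u8(s: u8) -> i32 {
    i32::from(s) - 128
}

/// Packed little-endian signed 24-bit PCM (3 bytes per sample) -> i32.
/// Trailing bytes that do not form a whole sample are ignored.
/// Note: the most negative 24-bit value, -2^23, lies outside the
/// |sample| <= 2^23 - 1 bound stated above; this sketch does not handle it.
pub fn widen_s24le(bytes: &[u8]) -> Vec<i32> {
    bytes
        .chunks_exact(3)
        .map(|b| {
            // Sign-extend from the top byte, then OR in the lower two bytes.
            (i32::from(b[2] as i8) << 16) | (i32::from(b[1]) << 8) | i32::from(b[0])
        })
        .collect()
}
```

Neither helper shifts samples up to a 24-bit ceiling, which keeps the residual magnitudes at the source's own scale.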

Round-trip is bit-exact: `decode_frame(encode_frame(s)) == s` for every
valid `s`.

### Buffer-reusing API for hot loops

For the MCU re-encode fanout and QUIC senders that own a per-channel
scratch buffer, use [`encode_frame_into`] / [`decode_frame_into`] to
target a caller-owned `Vec<u8>` / `Vec<i32>` instead of allocating
fresh on each call:

```rust
use lac::{encode_frame_into, decode_frame_into};

let mut encoded = Vec::new();  // one buffer per channel, reused across frames
let mut decoded = Vec::new();

for frame_samples in frames_iter() {
    encode_frame_into(&frame_samples, &mut encoded);
    // … send `encoded` …
}

for incoming_bytes in incoming_iter() {
    decode_frame_into(&incoming_bytes, &mut decoded)?;
    // … consume `decoded` …
}
# fn frames_iter() -> impl Iterator<Item = Vec<i32>> { std::iter::empty() }
# fn incoming_iter() -> impl Iterator<Item = Vec<u8>> { std::iter::empty() }
# Ok::<(), lac::DecodeError>(())
```

Both `_into` variants clear the destination at entry and retain its
capacity, so steady-state usage makes zero allocations past the first
frame.

### Output size expectations

For realistic audio (speech, music, ambient), compressed frames land
around **15-55 %** of raw sample bytes (speech near the low end, music
near the high end). Callers reusing a scratch buffer can safely
preallocate to 1× raw and take the extension cost only on the rare
adversarial frame.

For untrusted input (payloads where residuals might be crafted to
maximise Rice output) the worst-case expansion bound is ~17× the
4-byte `i32` input: at the Rice `k = 23` ceiling, each codeword is up
to 535 bits (511 unary zeros + terminator + 23 remainder), or ~67 bytes
per residual. A pipeline that must pre-size a bounded output buffer for
arbitrary input can use `samples.len() * 68` bytes as a loose upper
bound. The encoder never exceeds this.
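
The two sizing rules can be written down as a small sketch. The function names are hypothetical, and reading "1× raw" as the source sample width in bytes is an assumption.

```rust
/// Longest Rice codeword quoted above:
/// 511 unary zeros + 1 terminator bit + 23 remainder bits.
pub const MAX_CODEWORD_BITS: usize = 511 + 1 + 23;

/// Scratch capacity for realistic audio: 1x the raw sample bytes.
/// `bytes_per_source_sample` is the source width (2 for 16-bit PCM).
pub fn typical_capacity(sample_count: usize, bytes_per_source_sample: usize) -> usize {
    sample_count * bytes_per_source_sample
}

/// Loose upper bound for arbitrary input, using the
/// 68-bytes-per-sample figure from the text.
pub fn worst_case_capacity(sample_count: usize) -> usize {
    sample_count * 68
}
```

A sender could create its scratch buffer with `Vec::with_capacity(typical_capacity(320, 2))` and let the `Vec` grow on the rare oversized frame, while a pipeline with a hard output bound would size by `worst_case_capacity` instead.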

### Error recovery

On decode failure the caller substitutes `frame_sample_count` zeros
(silence) for the frame period. The count is recoverable from the
frame itself as long as the *header* parsed, even if the bitstream
body then failed — call [`parse_header`] on the same buffer:

```rust
use lac::{decode_frame, parse_header};

const SESSION_DEFAULT_FRAME: usize = 320;  // negotiated at session start

let bytes = Vec::<u8>::new();
let samples = match decode_frame(&bytes) {
    Ok(s) => s,
    Err(_) => {
        let count = parse_header(&bytes)
            .map(|(h, _)| h.frame_sample_count as usize)
            .unwrap_or(SESSION_DEFAULT_FRAME);
        vec![0i32; count]
    }
};
```

When the header itself fails (`BadSyncWord`, `InvalidPredictionOrder`,
`InvalidPartitionOrder`, `InvalidCoefficientShift`, or `Truncated`
below 7 bytes), the frame length is unknowable and the caller must
fall back to a session-level default.

[`encode_frame_into`]: https://docs.rs/lac/latest/lac/fn.encode_frame_into.html
[`decode_frame_into`]: https://docs.rs/lac/latest/lac/fn.decode_frame_into.html
[`parse_header`]: https://docs.rs/lac/latest/lac/fn.parse_header.html

## Concurrency

LAC's encode and decode APIs are pure functions with no shared state —
no globals, no internal `Mutex`, no `unsafe`. All public types are
`Send + Sync`. Calls on different threads never contend with each
other, and each call's scratch buffers are owned (stack or the
caller-supplied `Vec`).

The intended deployment shape for multi-channel and multi-stream
workloads is **one thread or task per channel**. The codec itself does
no threading: scheduling is left to the application so it can pick
whichever executor fits (tokio for async servers, rayon for
data-parallel workloads, `std::thread` for straight-ahead concurrency).

MCU re-encode fanout with stdlib primitives only:

```rust
use std::thread;
use lac::encode_frame;

let mixes: Vec<Vec<i32>> = Vec::new();
let outgoing: Vec<Vec<u8>> = thread::scope(|s| {
    let handles: Vec<_> = mixes
        .iter()
        .map(|mix| s.spawn(move || encode_frame(mix)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
});
```

Or with rayon, if the project already pulls it in:

```rust
// use rayon::prelude::*;
// let outgoing: Vec<Vec<u8>> = mixes.par_iter().map(|m| encode_frame(m)).collect();
```

The allocator you link against sets the ceiling on multi-core
scaling: glibc `malloc` has measurable lock contention at tens of
cores, whereas mimalloc / jemalloc keep per-thread caches and scale
further. The codec itself doesn't care which one you pick — it allocs
through the global allocator like any other Rust library.

### Input-size caps on untrusted channels

Applications accepting LAC frames from untrusted peers should cap the
per-frame input size at the application layer. The decoder's
per-codeword unary-run bound (spec §4.2) prevents any single codeword
from consuming unbounded CPU, but total decode cost scales with
buffer length; an attacker handed an unbounded payload can force
proportional scan work. Typical real frames are sub-kilobyte; **a cap
of 64 KB per frame is comfortably above any legitimate LAC payload
and cheap to enforce at the framing layer** (QUIC stream length
field, length-prefixed framing, etc.). The `Truncated` error fires
naturally when a payload is cut, so a hard cap doesn't break legal
traffic — it just bounds pathological work.
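
A sketch of the cap as a pre-decode gate. The constant and function names are hypothetical; only the 64 KB figure comes from the text above.

```rust
/// Per-frame input cap suggested above (64 KB).
pub const MAX_FRAME_BYTES: usize = 64 * 1024;

/// Returns the payload unchanged if it is within the cap, or `None`
/// if it should be dropped before it reaches the decoder.
pub fn admit_frame(payload: &[u8]) -> Option<&[u8]> {
    (payload.len() <= MAX_FRAME_BYTES).then_some(payload)
}
```

The check belongs where the frame length first becomes known, so an oversized payload is rejected before any decode work is spent on it.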

### Silence-substitution amplification

Spec §6.1 mandates that callers substitute `frame_sample_count` zeros
on decode failure. An attacker can craft a tiny frame (~10 bytes, with
a header declaring `frame_sample_count = 65535`) whose Rice payload is
malformed; the decoder rejects it, and the caller dutifully emits
65 535 output samples of silence. As mono `i32` samples, that's
**~256 KB of zeros per ~10-byte input frame, a ~25 000× amplification**.

The output is silence, not attacker-chosen data, so this is a
downstream-resource-exhaustion vector (memory, bandwidth,
re-encode work at an MCU) rather than a data-injection vector.
Mitigation is at the application layer: **cap `frame_sample_count`
to the session's negotiated frame size** before invoking the silence
substitution. QUIC / WebRTC sessions already negotiate a frame size
at setup; using that as a hard upper bound on the silence-fill
length collapses the amplification ratio to 1×. An MCU that takes
`frame_sample_count` from `parse_header` without validating it
against the session cap inherits the amplification unchanged.
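
A minimal sketch of the capped silence fill. The function name is hypothetical; the header count is passed in as an `Option` so the sketch does not depend on the exact return type of `parse_header`.

```rust
/// Number of zero samples to emit for a frame that failed to decode.
/// `header_count` is the `frame_sample_count` recovered from the header,
/// if the header parsed at all; `session_frame` is the negotiated size.
pub fn silence_fill_len(header_count: Option<usize>, session_frame: usize) -> usize {
    // Never trust the attacker-controlled count beyond the session cap.
    header_count.map_or(session_frame, |n| n.min(session_frame))
}
```

With a negotiated frame size of 320, a header claiming 65 535 samples yields a 320-sample fill, and a header that fails to parse falls back to the same 320.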

## Packet loss & concealment

Frames are independently decodable: losing one frame never corrupts
another, regardless of which concealment strategy the application
picks. This is a genuine deployment asset on lossy transports (QUIC
datagrams, UDP), and the section below walks the plausible strategies
in increasing quality order.

### Strategy 1: silence substitution (the default)

The baseline `decode_frame` returns `Err` on structural failure; the
application substitutes `frame_sample_count` zeros for the lost frame
period (see `parse_header` recovery pattern under *Public API →
Error recovery*). Fast, deterministic, audible as a brief cut —
acceptable for voice up to ~20 ms of loss, jarring beyond that.

### Strategy 2: sample-and-hold

Repeat the last successfully decoded sample for the frame period.
Zero-cost on the decoder side, preserves DC level so the click at
the drop boundary is softer than silence. Quality at 20 ms of loss
is better than silence for voice, slightly worse for music (DC hold
on a non-stationary signal adds a small transient when the next
frame arrives).

```rust
// After a successful decode, store the last sample for reuse on loss.
// On loss: fill the gap with that value.
# fn last_decoded_sample() -> i32 { 0 }
# const N: usize = 320;
let conceal = vec![last_decoded_sample(); N];
```

### Strategy 3: linear fade

Interpolate from the last valid sample down to zero over the lost
frame period. Removes the DC-hold transient and the "cut to silence"
click both. Costs N integer adds per lost frame. Recommended baseline
for any application that can afford 2-5 lines of PLC code.
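
One way the fade might look, as a sketch. The function name is hypothetical, and this version uses a multiply and divide per sample for clarity; an accumulator-based variant would reduce that to the integer adds mentioned above.

```rust
/// Fill a lost frame of `n` samples by ramping linearly from the last
/// valid sample towards zero. The final output sample is exactly zero.
pub fn linear_fade(last: i32, n: usize) -> Vec<i32> {
    (0..n)
        .map(|i| {
            // Remaining weight runs from (n - 1) / n down to 0 / n.
            let remaining = (n - 1 - i) as i64;
            (i64::from(last) * remaining / n as i64) as i32
        })
        .collect()
}
```

For example, `linear_fade(100, 4)` produces `[75, 50, 25, 0]`.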

### Strategy 4: LPC-coefficient extrapolation

The last successfully decoded frame's [`AudioFrameHeader`] carries
the LPC coefficients the encoder chose — available from
[`parse_header`] at no extra cost — and the LPC filter is locally
stationary over a 20-40 ms horizon. Run the synthesis formula (§3.6
of `Specification.md`) forward from the last decoded samples to *predict*
the missing frame. Quality is best on pitched content (voiced
speech, sustained notes); on transients it degrades gracefully
because the predictor's autoregressive behaviour damps toward zero
over the frame.

Not built into the library — the math is straightforward and the
"right" tuning varies by deployment (how much damping, whether to
blend with sample-and-hold on transients, etc.). See `src/lpc.rs`'s
`lpc_synthesize_into` for the integer synthesis routine that a
PLC implementation would call.
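
To make the idea concrete, here is a generic zero-residual extrapolation sketch. It is not the crate's `lpc_synthesize_into` and does not claim to match the normative synthesis formula: the function name, the `i16` coefficient type, the coefficient ordering (most recent sample first), the truncating shift, and the clamp are all assumptions. `frac_bits` stands for the number of fractional bits in the stored coefficients (15 at shift 0, down to 10 at shift 5).

```rust
/// Predict `n` samples forward from `history` using fixed-point LPC
/// coefficients and no residual. `coeffs[0]` weights the most recent sample.
pub fn lpc_extrapolate(history: &[i32], coeffs: &[i16], frac_bits: u32, n: usize) -> Vec<i32> {
    const MAX: i64 = (1 << 23) - 1;
    let order = coeffs.len();
    // Work buffer: `order` context samples (zero-padded when the history
    // is shorter than the order) followed by the `n` predicted samples.
    let mut buf = vec![0i32; order + n];
    let take = history.len().min(order);
    buf[order - take..order].copy_from_slice(&history[history.len() - take..]);
    for t in order..order + n {
        let mut acc: i64 = 0;
        for (j, &c) in coeffs.iter().enumerate() {
            acc += i64::from(c) * i64::from(buf[t - 1 - j]);
        }
        // Drop the fractional bits and keep the result in sample range.
        buf[t] = (acc >> frac_bits).clamp(-MAX, MAX) as i32;
    }
    buf.split_off(order)
}
```

With a single coefficient of 0.5 in Q15 (`1 << 14`) and a last sample of 1000, three extrapolated samples come out as 500, 250, 125, which shows the damping toward zero described above. A deployment would likely add its own damping or blend on top.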

### Multi-frame loss guidance

The strategies above are only useful up to a handful of consecutive
lost frames. Rough thresholds at 20 ms frame periods:

| Consecutive lost frames | Effective loss | Verdict |
|---|---|---|
| 1 | 20 ms | Inaudible with fade or LPC extrapolation; brief click with silence or sample-and-hold |
| 2-3 | 40-60 ms | Noticeable glitch; LPC extrapolation minimises but cannot hide it |
| 4-10 | 80-200 ms | Audible dropout. PLC keeps the audio from sounding "broken" but doesn't restore content |
| > 10 | > 200 ms | Treat the stream as broken; reset the receiver's concealment state to avoid droning artifacts, and if possible ask the transport to signal "resync" upstream |

Mid-stream resync on a datagram transport uses the sync word
(`0x1ACC`) as an alignment anchor: on a string of bad frames,
search the next `N` bytes of the buffer for the big-endian sequence
`\x1a\xcc` and retry `parse_header` from each candidate offset
until one succeeds. The search is O(N), and `N` only needs to cover a
frame or two: a 20 ms speech frame at 16 kHz compresses to roughly
180 bytes, so amortised cost is negligible.
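
The scan itself is a two-byte window search. A sketch, with hypothetical names (`SYNC`, `sync_candidates`):

```rust
/// Sync word 0x1ACC as it appears on the wire (big-endian).
pub const SYNC: [u8; 2] = [0x1A, 0xCC];

/// Yields every offset in `buf` at which the two sync bytes appear.
/// Candidates are not validated here; the caller retries header
/// parsing at each offset until one succeeds.
pub fn sync_candidates(buf: &[u8]) -> impl Iterator<Item = usize> + '_ {
    buf.windows(2)
        .enumerate()
        .filter(|(_, w)| w[0] == SYNC[0] && w[1] == SYNC[1])
        .map(|(offset, _)| offset)
}
```

A receiver would then try `parse_header(&buf[offset..])` for each yielded offset, since the two sync bytes can also occur by chance inside a Rice payload.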

[`AudioFrameHeader`]: https://docs.rs/lac/latest/lac/struct.AudioFrameHeader.html

## Testing

```sh
cargo test                                                           # unit tests
cargo test --test corpus    --release -- --nocapture                 # compression vs FLAC, lac_enc_ms
cargo test --test synthetic --release -- --nocapture                 # bit-depth + pathological content
cargo test --test latency   --release -- --nocapture --test-threads=1  # p50/p95/p99 + alloc count
cargo test --test mcu_mix   --release -- --nocapture --test-threads=1  # MCU throughput
cargo test --test conformance --release -- --nocapture               # byte-level spec conformance
cargo test --test determinism --release                              # encode byte-equality on repeat
cargo fuzz run decode_arbitrary    -- -dict=dict/decode_arbitrary.dict
cargo fuzz run roundtrip_arbitrary -- -dict=dict/roundtrip_arbitrary.dict
cargo bench                                                          # nightly bench
benches/compare-flac.sh                                              # flac side of the speed table
```

**Published-crate caveat.** `Cargo.toml` excludes `corpus/*` and
`fuzz/*` from the published tarball — they'd blow up crate size and
the audio isn't redistributable under crates.io's constraints anyway.
A user running `cargo test` against a `cargo add lac`'d dependency
sees every corpus test *pass* because the `require_corpus!` macro
skips missing files silently; the compression-ratio assertions,
FLAC comparisons, latency P99 checks, and MCU throughput checks all
go unrun. The full regression suite requires the git repository
(with LFS pulled). The synthetic, conformance, determinism, and
unit tests run unchanged from either source.

Coverage at a glance:

- **Unit** — round-trips for every LPC order 0-32 and every partition order
  0-7, prime frame lengths that force `partition_order = 0`, all-zero
  frames, full-scale sample magnitudes, malformed-header rejection for
  every field (`sync_word`, `prediction_order`, `partition_order`,
  `coefficient_shift`), truncated bitstreams, and a convex-descent vs
  exhaustive-search `select_k` differential.
- **Corpus** — round-trip + compression-ratio + FLAC subprocess comparison
  on a mixed speech and music corpus; asserts ratio ceilings so a codec
  regression fails CI; prints LAC encode wall-clock for correlation
  against `benches/compare-flac.sh`.
- **Synthetic** — deterministic LFSR-driven round-trips at 8/16/20/24-bit
  source widths and pathological content (all-zero, DC offset, Nyquist
  square, silence + click, full-scale constant, prime-length frame). No
  corpus dependency so the tests run on every CI checkout.
- **Latency** — per-frame encode/decode timing on real speech with a
  custom tracking allocator for peak-heap *and* per-frame allocation-count
  numbers; reports P50/P95/P99/max and asserts P99 < frame period so a
  real-time regression fails CI.
- **MCU** — decode → PCM mix → re-encode simulation on real speech for
  2/3/5/8/16 participants (continuous speech) plus an 8-participant
  rotating dominant-speaker variant; asserts MCU egress ≤ SFU-fanout egress.
- **Fuzz** — libFuzzer targets for decoder robustness and
  encoder/decoder self-consistency on arbitrary bytes, seeded with
  dictionaries of the wire-format constants (sync word, field boundaries)
  and sample-magnitude boundaries (8/16/20/24-bit ceilings).

## Measurements

### Reference hardware

| Short name | CPU | ISA highlights |
|---|---|---|
| **7840HS** | AMD Ryzen 7 7840HS (laptop, 8c/16t, up to 5.1 GHz) | AVX-512 (F/BW/CD/DQ/VL/VNNI/VBMI), BMI2, FMA |
| **RPi5** | Raspberry Pi 5 (Cortex-A76 quad, 2.4 GHz) | NEON |
| **VF2** | StarFive VisionFive 2 (SiFive U74 quad, 1.5 GHz) | RV64GC, no vector extension (scalar code only) |

Numbers below are measured at default `cargo build --release` (no
`target-cpu=native`, no project-level `RUSTFLAGS`). Empty cells are
awaiting measurement on the listed hardware. FLAC comparison uses both
`-5` (the CLI default, what production pipelines typically use) and
`-8` (`--best`, the compression upper bound).

### Corpus attribution

The measurements are taken on two publicly-licensed audio corpora
checked into `corpus/`:

- **Speech**: the [AMI Meeting Corpus](https://groups.inf.ed.ac.uk/ami/corpus/)
  (files named `ES2002a.*`), recorded by the AMI Consortium (University
  of Edinburgh, IDIAP, TNO, Brno University of Technology, University
  of Sheffield, and partners). Distributed under
  [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
- **Music**: Kimiko Ishizaka's recording of J.S. Bach's *Goldberg
  Variations, BWV 988* (files named `Kimiko Ishizaka - …`), from the
  [Open Goldberg Variations project](https://opengoldbergvariations.org/)
  (Robert Douglass, producer). Released under
  [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/) —
  public domain dedication, no attribution legally required, credited
  here as a courtesy.

Both corpora are used unmodified apart from the file selection
described in the tables below.

### Compression (hardware-independent, bit-exact across targets)

LAC ratio = LAC encoded / raw PCM. Both codecs use the same 4096-sample
block size on this corpus — LAC's `tests/corpus.rs` sets
`FRAME_SIZE = 4096`, which matches FLAC's default blocksize at `-5`
and `-8` for ≤ 48 kHz content, so header and coefficient overhead is
amortised identically on both sides.

| Corpus file | Class | LAC | FLAC -5 | FLAC -8 | LAC / -5 | LAC / -8 |
|---|---|---:|---:|---:|---:|---:|
| `ES2002a.Headset-0.wav` | headset speech, 16 kHz / 16-bit | 0.178 | 0.187 | 0.186 | 0.954 | 0.958 |
| `ES2002a.Mix-Headset.wav` | mixed meeting, 16 kHz / 16-bit | 0.292 | 0.300 | 0.297 | 0.975 | 0.984 |
| `ES2002a.Array1-01.wav` | array speech, 16 kHz / 16-bit | 0.375 | 0.378 | 0.377 | 0.989 | 0.994 |
| Goldberg Aria (01) | solo piano, 96 kHz / 24-bit | 0.483 | 0.458 | 0.457 | 1.053 | 1.056 |
| Goldberg Variatio 4 (05, fughetta) | solo piano, 96 kHz / 24-bit | 0.514 | 0.483 | 0.481 | 1.065 | 1.067 |
| Goldberg Variatio 16 (17, Ouverture) | solo piano, 96 kHz / 24-bit | 0.512 | 0.479 | 0.478 | 1.068 | 1.070 |

Speech reliably beats FLAC at both levels by a small margin; music
trails by 5-7 % (the Q-format gap at low frequencies, mitigated but
not eliminated by `coefficient_shift`). FLAC's jump from `-5` to `-8`
buys essentially nothing on this corpus (≤ 0.3 pp of ratio), so the
realistic LAC-vs-FLAC comparison in practice is against `-5`. Numbers
are byte-identical regardless of hardware because LAC's output is
specified bit-exactly.

### Encode wall-clock (ms, full file)

One table per hardware target; each has LAC alongside both FLAC levels
so the speed cost of each quality point is visible. The `-5` column is
the most representative real-world comparison.

**7840HS** (AMD Ryzen 7 7840HS):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---:|---:|---:|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 1158 | 221 | 436 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 1292 | 226 | 447 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 1367 | 223 | 469 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 809 | 272 | 647 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 2126 | 754 | 1741 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 3521 | 1166 | 2703 |

**RPi5** (Raspberry Pi 5, Cortex-A76 @ 2.4 GHz):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---:|---:|---:|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 2856 | 477 | 959 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 3249 | 495 | 1096 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 3363 | 505 | 1132 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 1904 | 606 | 1570 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 5201 | 1627 | 4324 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 9015 | 2572 | 6832 |

**VF2** (StarFive VisionFive 2, SiFive U74 quad @ 1.5 GHz):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---:|---:|---:|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 29385 | 2355 | 5614 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 33231 | 2502 | 6688 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 34899 | 2548 | 6878 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 18185 | 3184 | 9278 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 49811 | 8535 | 25454 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 88208 | 13544 | 40650 |

On x86, LAC is ~5-6× slower than FLAC `-5` on speech and ~3× slower on
music (~2.7-2.9× and ~1.2-1.3× against FLAC `--best`) because libFLAC
ships hand-tuned SSE intrinsics for its autocorrelation kernel and LAC
relies on LLVM autovectorization.
End-to-end perf barely changes with `target-cpu=native`: the kernel does
pick up AVX-512 zmm dot-products, but the frame encode is bottlenecked
elsewhere (Rice k-search and bitstream assembly dominate the remaining
time).

On RPi5 (ARM Cortex-A76, NEON) LAC runs ~2.5× slower than on 7840HS in
absolute terms, but the *ratio* against FLAC shifts noticeably: LAC is
~6-7× slower than FLAC `-5` on speech (wider gap, libFLAC's NEON path
is well-tuned for the 16 kHz / 16-bit content) but only ~3× slower on
96 kHz / 24-bit music (narrower gap — 24-bit content gives libFLAC's
specialization less leverage). Against FLAC `--best` the music gap
narrows further to ~1.2-1.3×. The 7840HS-vs-RPi5 delta in the LAC
column shows scalar autovec quality is broadly comparable across x86
and ARM backends; the delta in the FLAC columns shows where hand-tuned
intrinsics disappear on a different ISA.

On VF2 (RISC-V SiFive U74, an RV64GC core with no vector unit, so
neither libFLAC nor LLVM autovectorization has SIMD to target) LAC runs
~10× slower than on RPi5. Both codecs run pure scalar code; the gap
between them *widens* to ~12-14× on speech and ~6× on music vs FLAC
`-5`, or ~5× / ~2× vs FLAC `--best`. Two factors compound: the U74 is a
dual-issue in-order core vs the Cortex-A76's 4-wide out-of-order design
(base IPC is markedly lower regardless of ISA), and LLVM's scalar Rust
codegen for RISC-V is less mature than its x86/ARM output, so the
tighter inner loops in libFLAC's hand-written C survive this better
than LAC's Rust does. The absolute numbers are still useful: even at
88 s to encode 5 minutes of 96/24 stereo audio, LAC comfortably meets
realtime for streaming use (see the P99 latency table below).

### Per-frame encode latency P99 (µs)

All rows use real AMI speech samples. Frame sample count sets the
real-time deadline; P99 must stay below that period for the frame to
ship inside its own playback slot.

| Test | Frame | Period | 7840HS P99 | RPi5 P99 | VF2 P99 |
|---|---|---:|---:|---:|---:|
| `latency_headset_speech_160` | 160 @ 16 kHz | 10 ms | 20 | 38 | 235 |
| `latency_headset_speech_320` | 320 @ 16 kHz | 20 ms | 36 | 76 | 499 |
| `latency_headset_speech_480` | 480 @ 16 kHz | 30 ms | 37 | 81 | 635 |
| `latency_headset_speech_prime` | 503 @ 16 kHz | 31 ms | 23 | 52 | 387 |
| `latency_array_speech_320` | 320 @ 16 kHz | 20 ms | 42 | 77 | 506 |
| `latency_mixed_meeting_320` | 320 @ 16 kHz | 20 ms | 43 | 84 | 551 |

P99 headroom is ~400-1300× on 7840HS, ~240-600× on RPi5, and
~36-81× on VF2. Every row on every platform stays comfortably inside
the realtime deadline — even VF2's worst case (`mixed_meeting_320` at
551 µs on a 20 ms frame) has 36× margin. LAC meets its streaming
contract on every target tested.
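
The headroom figures follow from dividing the frame period by the measured P99. A small sketch of that arithmetic, with hypothetical function names:

```rust
/// Frame period in microseconds for a frame of `samples` at `rate_hz`
/// (truncated to whole microseconds).
pub fn frame_period_us(samples: u64, rate_hz: u64) -> u64 {
    samples * 1_000_000 / rate_hz
}

/// Real-time headroom: how many times the P99 encode latency fits
/// into one frame period (truncated to a whole multiple).
pub fn headroom(samples: u64, rate_hz: u64, p99_us: u64) -> u64 {
    frame_period_us(samples, rate_hz) / p99_us
}
```

For the VF2 worst case in the table, `headroom(320, 16_000, 551)` is 36; for the 7840HS 160-sample row, `headroom(160, 16_000, 20)` is 500.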

### MCU throughput (× realtime on one core)

Realtime multiplier = audio-ms processed per wall-clock-ms, per core.
"`20×` realtime" means one core sustains twenty simultaneous meetings
of the listed configuration.

| Test | Activity | 7840HS | RPi5 | VF2 |
|---|---|---:|---:|---:|
| `mcu_mix_1on1_voice` (P=2) | continuous | 279× | 145× | 22× |
| `mcu_mix_3people_voice` (P=3) | continuous | 193× | 95× | 14× |
| `mcu_mix_5people_voice` (P=5) | continuous | 120× | 57× | 9× |
| `mcu_mix_8people_voice` (P=8) | continuous | 77× | 35× | 5× |
| `mcu_mix_8people_dominant_speaker` (P=8) | rotating speaker | 106× | 43× | 6× |
| `mcu_mix_16people_voice` (P=16) | continuous | 39× | 17× | 2.5× |

MCU egress byte count as a fraction of SFU fanout egress on 7840HS:
1.00 (P=2, trivially equal), 0.60 (P=3), 0.36 (P=5), 0.22 (P=8
continuous), 0.35 (P=8 dominant-speaker), 0.10 (P=16). The continuous
case is the lower bound — SFU fanout scales quadratically in
participant count while MCU mix egress scales linearly, so the
relative savings grow as the meeting does. The dominant-speaker case
inverts that trend slightly: SFU fanout of N-1 near-silent streams is
almost free, so the SFU baseline falls faster than the MCU mix cost
does. These numbers are byte-accounting, not wall-clock.
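
The quadratic-versus-linear scaling can be seen from stream counts alone. The sketch below counts egress streams under the usual model (an SFU forwards each participant's stream to every other participant, an MCU sends one mix to each participant); the function names are hypothetical, and the measured byte fractions above need not match this count-only ratio because mixed and single-talker streams do not compress to the same size.

```rust
/// Egress streams from an SFU: each of `p` participants receives
/// the other `p - 1` streams.
pub fn sfu_egress_streams(p: u64) -> u64 {
    p * p.saturating_sub(1)
}

/// Egress streams from an MCU: one mixed stream per participant.
pub fn mcu_egress_streams(p: u64) -> u64 {
    p
}
```

At P = 8 this gives 56 SFU streams against 8 MCU streams, and at P = 2 both sides send 2, which is why the P = 2 case is trivially equal.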