# LAC — Lo Audio Codec

Lossless audio codec for internal use. Target compression is FLAC-class
(~50% of raw). Integer-only, bit-exact, streaming-oriented.

## Scope

- **Input**: signed integer PCM passed as `i32` with `|sample| ≤ 2²³ − 1`.
  8-bit, 16-bit, 20-bit, and 24-bit sources are all valid without
  conversion — they compress at the bit cost of their actual values, not
  a 24-bit ceiling.
- **Sample rate**: caller-specified; not encoded in the stream. The container
  or transport carries it.
- **Channels**: mono per encoded stream. Stereo is two independent mono
  streams — for example, two QUIC streams over a shared connection, one per
  channel. No cross-channel joint coding.
- **Frames**: independently decodable. No cross-frame state; a lost or corrupt
  frame never affects subsequent decodes.

## Pipeline

```text
samples → LPC analysis → residuals → partitioned Rice → frame bytes
↓
(inverse for decode)
```

Three encoder-side choices, searched per frame:

- **LPC order**: the reference encoder tries a sparse grid
  `{0, 2, 4, 6, 8, 10, 12, 16, 20, 24, 28, 32}` with a 2-order early-out
  once cost stops improving. Order 0 is verbatim (residuals equal the raw
  samples). The wire format permits any order in `[0, 32]`.
- **Coefficient shift** `∈ [0, 5]`: widens the Q-format of the stored
  predictor coefficients from Q15 (range `[−1, 1)`) out to Q10 (range
  `[−32, 32)`) so low-frequency / narrow-resonance content doesn't clamp
  `|a[1]|` near 2. Chosen deterministically per order as the smallest
  shift that avoids clamping.
- **Rice partition order** `∈ [0, 7]`: splits the residual stream into
  `2^partition_order` equal partitions, each with its own Rice parameter
  `k ∈ [0, 23]` chosen by convex descent.

Levinson-Durbin runs once up to order 32 into a flat stack-allocated
buffer (`LpcLevels`) and the per-order coefficients are consulted by
slice reference; the order search itself does no heap allocation.

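
As a rough illustration of the Rice stage, the sketch below zigzag-maps residuals, counts codeword bits, and picks `k` for one partition. The function names are hypothetical, the zigzag mapping is the conventional one rather than a quote from `Specification.md`, and the per-codeword unary cap is not modelled. It uses an exhaustive search over `k ∈ [0, 23]`; the reference encoder uses convex descent instead.

```rust
/// Map a signed residual to an unsigned value so that small magnitudes
/// of either sign get short codes: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4.
fn zigzag(r: i32) -> u32 {
    ((r << 1) ^ (r >> 31)) as u32
}

/// Bits in one Rice codeword: unary quotient, terminator bit, k remainder bits.
fn rice_bits(u: u32, k: u32) -> u64 {
    u64::from(u >> k) + 1 + u64::from(k)
}

/// Total payload bits for one partition coded with parameter `k`.
fn partition_bits(residuals: &[i32], k: u32) -> u64 {
    residuals.iter().map(|&r| rice_bits(zigzag(r), k)).sum()
}

/// Exhaustive search for the cheapest k in [0, 23] (ties go to the smaller k).
fn best_k(residuals: &[i32]) -> u32 {
    (0..=23u32)
        .min_by_key(|&k| partition_bits(residuals, k))
        .unwrap()
}
```

For the residuals `[0, -1, 1, -2, 2]` the cost is 15 bits at `k = 0`, 14 bits at `k = 1`, and 16 bits at `k = 2`, so `best_k` returns 1.
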
## Intended use

- **QUIC streaming** — one reliable stream per audio channel. Frames fit
  the per-stream framing (length-prefixed or datagram-mapped) without
  modification.
- **Offline file playback** — a container pairs the channel streams by
  timestamp; each stream decodes independently.

## Frame size guidance

Frame size is a latency-vs-compression knob chosen at the application
layer. The codec accepts any `frame_sample_count` in `[1, 65535]`, but
the LPC/Rice search amortises better on larger frames (shared header,
more samples per fitted coefficient vector). Concrete defaults:

| Use case | Frame size | Latency at 48 kHz | Notes |
|---|---|---|---|
| Real-time voice, tight latency | 160 @ 16 kHz (10 ms) | — | matches WebRTC/Opus 10 ms mode |
| Real-time voice, balanced | **320 @ 16 kHz (20 ms)** | — | default for MCU workload in `tests/mcu_mix.rs` |
| Game/conf streaming | **960 @ 48 kHz (20 ms)** | 20 ms | one QUIC datagram per frame fits typical MTUs |
| Music streaming | **2048 @ 48 kHz (43 ms)** | 43 ms | compression benefit flattens past this |
| Offline archival | **4096 @ 48 kHz (85 ms)** | — | tightest LPC fit; default in `tests/corpus.rs`, matches FLAC's default blocksize for apples-to-apples compression comparison |

Partition orders that evenly divide the frame size dominate the search
cost. Power-of-two frame sizes (256, 512, 1024, 2048, 4096) unlock every
`partition_order ∈ [0, 7]`; 960 (`2⁶ · 15`) and 2880 (`2⁶ · 45`), both
common WebRTC frame sizes, allow orders up to 6 by divisibility; prime
sizes like 137 collapse to `partition_order = 0`. Prefer power-of-two
frame sizes unless a container format constrains the choice.

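
The divisibility rule above can be checked with a few lines of integer code. This is a sketch that assumes even division by `2^partition_order` is the only constraint; `Specification.md` may impose others (for example a minimum partition length). The function name is hypothetical.

```rust
/// Upper end of the partition-order range stated in this README.
const MAX_PARTITION_ORDER: u32 = 7;

/// Largest partition order whose 2^order partitions divide the frame evenly.
/// Assumption: divisibility is the only limit; the spec may add more.
fn max_partition_order(frame_sample_count: u32) -> u32 {
    assert!(frame_sample_count > 0, "frames hold at least one sample");
    // The count of trailing zero bits is the largest power of two dividing n.
    frame_sample_count.trailing_zeros().min(MAX_PARTITION_ORDER)
}
```

Under that assumption, 4096 yields 7, 960 and 2880 yield 6, 160 yields 5, and 137 yields 0.
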
## Structure

```text
lac/
├── Cargo.toml
├── README.md                 ← you are here
├── Specification.md          ← wire format specification
├── corpus/                   ← test WAVs (speech + music), LFS-tracked via .gitattributes
├── src/
│   ├── lib.rs                ← public API and project-wide constants
│   ├── bit_io.rs             ← MSB-first bit reader/writer
│   ├── lpc.rs                ← Levinson-Durbin, LpcLevels flat buffer, residuals/synthesis
│   ├── rice.rs               ← zigzag + partitioned Rice coding, convex-descent k
│   ├── frame.rs              ← frame header, encode_frame, decode_frame
│   └── test_signals.rs       ← integer-only sine LUT for float-free test inputs
├── tests/
│   ├── corpus.rs             ← compression ratio + FLAC comparison on real audio
│   ├── synthetic.rs          ← bit-depth + pathological-content round-trips, no corpus needed
│   ├── latency.rs            ← P50/P95/P99/max encode+decode latency, peak heap, alloc count
│   └── mcu_mix.rs            ← end-to-end MCU workload (decode → mix → re-encode)
├── benches/
│   ├── codec.rs              ← nightly #[bench] harness (encode, decode, compute_residuals)
│   └── compare-flac.sh       ← diagnostic shell script: wall-clock flac encode across corpus
└── fuzz/
    ├── fuzz_targets/
    │   ├── decode_arbitrary.rs     ← decoder robustness under arbitrary bytes
    │   └── roundtrip_arbitrary.rs  ← encoder/decoder self-consistency
    └── dict/
        ├── decode_arbitrary.dict     ← libFuzzer dict: sync word + field boundary constants
        └── roundtrip_arbitrary.dict  ← libFuzzer dict: sample-value boundaries
```

See `Specification.md` for the normative wire format.

## Public API

Every sample is an `i32` with magnitude bounded by `2²³ − 1`. Narrower
integer sources go through unchanged:

```rust
use lac::{encode_frame, decode_frame};

// 16-bit microphone PCM → just widen with `i32::from`. Do NOT shift
// left by 8 to "align" to 24-bit: that multiplies residual magnitudes
// by 256 and costs 8 extra bits per residual in the Rice payload. The
// codec compresses at the bit cost of the actual sample magnitudes,
// not a 24-bit ceiling.
let pcm_16: Vec<i16> = /* from microphone */ Vec::new();
let samples: Vec<i32> = pcm_16.iter().map(|&s| i32::from(s)).collect();

let bytes = encode_frame(&samples);
let recovered: Vec<i32> = decode_frame(&bytes)?;
assert_eq!(recovered, samples);
# Ok::<(), lac::DecodeError>(())
```

For 24-bit PCM, samples are already in range, so pass them through directly.
For 8-bit PCM, widen signed samples with `i32::from(s as i8)`; for an
unsigned offset-128 source, widen and then subtract the 128 offset.

Round-trip is bit-exact: `decode_frame(encode_frame(s)) == s` for every
valid `s`.

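
The widening rules above can be collected into small helpers. This is a sketch with hypothetical function names, not part of the `lac` API; the packed 24-bit layout (little-endian, three bytes per sample) is an assumption about the source format.

```rust
/// Signed 16-bit PCM: value-preserving widen, no scaling or shifting.
fn widen_i16(pcm: &[i16]) -> Vec<i32> {
    pcm.iter().map(|&s| i32::from(s)).collect()
}

/// Unsigned 8-bit PCM with silence at 128: widen, then remove the offset.
fn widen_u8_offset(pcm: &[u8]) -> Vec<i32> {
    pcm.iter().map(|&s| i32::from(s) - 128).collect()
}

/// Packed little-endian signed 24-bit PCM (assumed layout): sign-extend.
/// Note: the raw value -2^23 falls outside the |sample| <= 2^23 - 1 input
/// range stated above, so a caller would need to handle it before encoding.
fn widen_s24le(bytes: &[u8]) -> Vec<i32> {
    bytes
        .chunks_exact(3)
        // Place the three bytes in the top of an i32, then shift back down
        // arithmetically so the sign bit is replicated.
        .map(|b| i32::from_le_bytes([0, b[0], b[1], b[2]]) >> 8)
        .collect()
}
```

None of the helpers scale the signal, so a 16-bit source keeps its 16-bit residual magnitudes.
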
### Buffer-reusing API for hot loops

For the MCU re-encode fanout and QUIC senders that own a per-channel
scratch buffer, use [`encode_frame_into`] / [`decode_frame_into`] to
target a caller-owned `Vec<u8>` / `Vec<i32>` instead of allocating
fresh on each call:

```rust
use lac::{encode_frame_into, decode_frame_into};

let mut encoded = Vec::new(); // one buffer per channel, reused across frames
let mut decoded = Vec::new();

for frame_samples in frames_iter() {
    encode_frame_into(&frame_samples, &mut encoded);
    // … send `encoded` …
}

for incoming_bytes in incoming_iter() {
    decode_frame_into(&incoming_bytes, &mut decoded)?;
    // … consume `decoded` …
}
# fn frames_iter() -> impl Iterator<Item = Vec<i32>> { std::iter::empty() }
# fn incoming_iter() -> impl Iterator<Item = Vec<u8>> { std::iter::empty() }
# Ok::<(), lac::DecodeError>(())
```

Both `_into` variants clear the destination at entry and retain its
capacity, so steady-state usage makes zero allocations past the first
frame.

### Output size expectations

For realistic audio (speech, music, ambient), compressed frames land
around **15-55 %** of raw sample bytes (speech near the low end, music
near the high end). Callers reusing a scratch buffer can safely
preallocate to 1× raw and take the extension cost only on the rare
adversarial frame.

For untrusted input — payloads where residuals might be crafted to
maximise Rice output — the worst-case expansion bound is ~17× raw: at
the Rice `k = 23` ceiling, each codeword is up to 535 bits (511 unary
zeros + terminator + 23 remainder), or ~67 bytes per residual. A
pipeline that must pre-size a bounded output buffer for arbitrary
input can use `samples.len() * 68` bytes as a loose upper bound. The
encoder never exceeds this.

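
The arithmetic behind that bound, written out as a sketch. The constant and function names are hypothetical; the numbers are the ones quoted above.

```rust
/// Longest unary run per codeword, as quoted above.
const MAX_UNARY_ZEROS: usize = 511;
/// Rice parameter ceiling, as quoted above.
const MAX_RICE_K: usize = 23;

/// Worst-case bits in a single codeword: unary zeros + terminator + remainder.
fn worst_case_codeword_bits() -> usize {
    MAX_UNARY_ZEROS + 1 + MAX_RICE_K
}

/// Loose output-buffer bound from this README: 68 bytes per input sample.
fn loose_output_bound(sample_count: usize) -> usize {
    sample_count * 68
}
```

535 bits round up to 67 bytes, which is about 17× the 4 bytes an `i32` sample occupies; a 4096-sample frame is therefore bounded by 278 528 bytes.
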
### Error recovery

On decode failure the caller substitutes `frame_sample_count` zeros
(silence) for the frame period. The count is recoverable from the
frame itself as long as the *header* parsed, even if the bitstream
body then failed — call [`parse_header`] on the same buffer:

```rust
use lac::{decode_frame, parse_header};

const SESSION_DEFAULT_FRAME: usize = 320; // negotiated at session start

let bytes = Vec::<u8>::new();
let samples = match decode_frame(&bytes) {
    Ok(s) => s,
    Err(_) => {
        let count = parse_header(&bytes)
            .map(|(h, _)| h.frame_sample_count as usize)
            .unwrap_or(SESSION_DEFAULT_FRAME);
        vec![0i32; count]
    }
};
```

When the header itself fails (`BadSyncWord`, `InvalidPredictionOrder`,
`InvalidPartitionOrder`, `InvalidCoefficientShift`, or `Truncated`
below 7 bytes), the frame length is unknowable and the caller must
fall back to a session-level default.

[`encode_frame_into`]: https://docs.rs/lac/latest/lac/fn.encode_frame_into.html
[`decode_frame_into`]: https://docs.rs/lac/latest/lac/fn.decode_frame_into.html
[`parse_header`]: https://docs.rs/lac/latest/lac/fn.parse_header.html

## Concurrency

LAC's encode and decode APIs are pure functions with no shared state —
no globals, no internal `Mutex`, no `unsafe`. All public types are
`Send + Sync`. Calls on different threads never contend with each
other, and each call's scratch buffers are owned (stack or the
caller-supplied `Vec`).

The intended deployment shape for multi-channel and multi-stream
workloads is **one thread or task per channel**. The codec itself does
no threading: scheduling is left to the application so it can pick
whichever executor fits (tokio for async servers, rayon for
data-parallel workloads, `std::thread` for straight-ahead concurrency).

MCU re-encode fanout with stdlib primitives only:

```rust
use std::thread;
use lac::encode_frame;

let mixes: Vec<Vec<i32>> = Vec::new();
let outgoing: Vec<Vec<u8>> = thread::scope(|s| {
    let handles: Vec<_> = mixes
        .iter()
        .map(|mix| s.spawn(move || encode_frame(mix)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
});
```

Or with rayon, if the project already pulls it in:

```rust
// use rayon::prelude::*;
// let outgoing: Vec<Vec<u8>> = mixes.par_iter().map(|m| encode_frame(m)).collect();
```

The allocator you link against sets the ceiling on multi-core
scaling: glibc `malloc` has measurable lock contention at tens of
cores, whereas mimalloc / jemalloc keep per-thread caches and scale
further. The codec itself doesn't care which one you pick: it allocates
through the global allocator like any other Rust library.

### Input-size caps on untrusted channels

Applications accepting LAC frames from untrusted peers should cap the
per-frame input size at the application layer. The decoder's
per-codeword unary-run bound (spec §4.2) prevents any single codeword
from consuming unbounded CPU, but total decode cost scales with
buffer length; an attacker handed an unbounded payload can force
proportional scan work. Typical real frames are sub-kilobyte; **a cap
of 64 KB per frame is comfortably above any legitimate LAC payload
and cheap to enforce at the framing layer** (QUIC stream length
field, length-prefixed framing, etc.). The `Truncated` error fires
naturally when a payload is cut, so a hard cap doesn't break legal
traffic — it just bounds pathological work.

### Silence-substitution amplification

Spec §6.1 mandates that callers substitute `frame_sample_count` zeros
on decode failure. An attacker can craft a tiny frame (~10-byte
header with `frame_sample_count = 65535`) whose Rice payload is
malformed; the decoder rejects, the caller dutifully emits 65 535
output samples of silence. At 48 kHz mono `i32`, that's **~256 KB of
zeros per ~10-byte input frame — a ~25 000× amplification**.

The output is silence, not attacker-chosen data, so this is a
downstream-resource-exhaustion vector (memory, bandwidth,
re-encode work at an MCU) rather than a data-injection vector.
Mitigation is at the application layer: **cap `frame_sample_count`
to the session's negotiated frame size** before invoking the silence
substitution. QUIC / WebRTC sessions already negotiate a frame size
at setup; using that as a hard upper bound on the silence-fill
length collapses the amplification ratio to 1×. An MCU that takes
`frame_sample_count` from the header returned by `parse_header`
without validating it against the session cap inherits the
amplification unchanged.

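
A minimal sketch of that cap, kept independent of the `lac` types so it stands alone. The function name and the `Option<u16>` wrapper are hypothetical: `header_count` stands for the `frame_sample_count` read from a successfully parsed header, or `None` when the header did not parse.

```rust
/// Number of silence samples to emit for a frame that failed to decode.
/// The header's claim is never trusted beyond the negotiated frame size.
fn silence_len(header_count: Option<u16>, session_frame_size: usize) -> usize {
    match header_count {
        // Header parsed: honour its count, but clamp it to the session cap.
        Some(n) => usize::from(n).min(session_frame_size),
        // Header unusable: fall back to the session-level default.
        None => session_frame_size,
    }
}
```

With a 320-sample session, a hostile header claiming 65 535 samples produces 320 zeros rather than 65 535.
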
## Packet loss & concealment

Frames are independently decodable: losing one frame never corrupts
another, regardless of which concealment strategy the application
picks. This is a genuine deployment asset on lossy transports (QUIC
datagrams, UDP), and the section below walks the plausible strategies
in increasing quality order.

### Strategy 1: silence substitution (the default)

The baseline `decode_frame` returns `Err` on structural failure; the
application substitutes `frame_sample_count` zeros for the lost frame
period (see `parse_header` recovery pattern under *Public API →
Error recovery*). Fast, deterministic, audible as a brief cut —
acceptable for voice up to ~20 ms of loss, jarring beyond that.

### Strategy 2: sample-and-hold

Repeat the last successfully decoded sample for the frame period.
Zero-cost on the decoder side, preserves DC level so the click at
the drop boundary is softer than silence. Quality at 20 ms of loss
is better than silence for voice, slightly worse for music (DC hold
on a non-stationary signal adds a small transient when the next
frame arrives).

```rust
// After a successful decode, store the last sample for reuse on loss.
// On loss: fill the gap with that value.
# fn last_decoded_sample() -> i32 { 0 }
# const N: usize = 320;
let conceal = vec![last_decoded_sample(); N];
```

### Strategy 3: linear fade

Interpolate from the last valid sample down to zero over the lost
frame period. Removes the DC-hold transient and the "cut to silence"
click both. Costs N integer adds per lost frame. Recommended baseline
for any application that can afford 2-5 lines of PLC code.

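
One way the fade could look, as a sketch rather than library code. The function name is hypothetical, and for readability it uses one multiply and one divide per sample; an incremental accumulator would bring that down to the adds-only cost mentioned above.

```rust
/// Fill a lost frame by ramping linearly from `last_sample` toward zero.
/// The final concealed sample is exactly 0, so the next frame (or further
/// silence) starts from a clean level.
fn linear_fade(last_sample: i32, frame_len: usize) -> Vec<i32> {
    let n = frame_len as i64;
    (0..n)
        // Weight runs (n-1)/n, (n-2)/n, ..., 0; i64 avoids overflow.
        .map(|i| (i64::from(last_sample) * (n - 1 - i) / n) as i32)
        .collect()
}
```

For example, fading from 1000 over a four-sample frame yields `[750, 500, 250, 0]`.
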
### Strategy 4: LPC-coefficient extrapolation

The last successfully decoded frame's [`AudioFrameHeader`] carries
the LPC coefficients the encoder chose — available from
[`parse_header`] at no extra cost — and the LPC filter is locally
stationary over a 20-40 ms horizon. Run the synthesis formula (§3.6
of `Specification.md`) forward from the last decoded samples to *predict*
the missing frame. Quality is best on pitched content (voiced
speech, sustained notes); on transients it degrades gracefully
because the predictor's autoregressive behaviour damps toward zero
over the frame.

Not built into the library — the math is straightforward and the
"right" tuning varies by deployment (how much damping, whether to
blend with sample-and-hold on transients, etc.). See `src/lpc.rs`'s
`lpc_synthesize_into` for the integer synthesis routine that a
PLC implementation would call.

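
A rough sketch of the idea, under loudly stated assumptions: the function below is hypothetical and does not call `lpc_synthesize_into`; it assumes `coeffs[j]` weights the sample `j + 1` steps back, that coefficients are fixed-point with `frac_bits` fractional bits (15 minus the coefficient shift), and that the prediction is a plain arithmetic right shift of the accumulated sum. The normative rounding and sign conventions are in §3.6 of `Specification.md` and may differ. No extra damping or blending is applied.

```rust
/// Largest sample magnitude accepted by the codec (2^23 - 1).
const MAX_SAMPLE: i64 = (1 << 23) - 1;

/// Predict `out_len` samples by running an LPC predictor forward with a
/// zero residual, starting from the tail of `history`.
fn lpc_extrapolate(history: &[i32], coeffs: &[i16], frac_bits: u32, out_len: usize) -> Vec<i32> {
    let order = coeffs.len();
    assert!(history.len() >= order, "need at least `order` past samples");
    // Seed the working buffer with the most recent `order` samples.
    let mut buf: Vec<i32> = history[history.len() - order..].to_vec();
    for _ in 0..out_len {
        let n = buf.len();
        let acc: i64 = coeffs
            .iter()
            .enumerate()
            .map(|(j, &c)| i64::from(c) * i64::from(buf[n - 1 - j]))
            .sum();
        // Keep predictions inside the valid sample range.
        let predicted = (acc >> frac_bits).clamp(-MAX_SAMPLE, MAX_SAMPLE) as i32;
        buf.push(predicted);
    }
    // Drop the seed samples; what remains is the concealed frame.
    buf.split_off(order)
}
```

With a single Q15 coefficient of 0.5 (`16384`) and a last sample of 1000, the output decays as 500, 250, 125, which shows the damping behaviour described above.
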
### Multi-frame loss guidance

The strategies above are only useful up to a handful of consecutive
lost frames. Rough thresholds at 20 ms frame periods:

| Consecutive lost frames | Effective loss | Verdict |
|---|---|---|
| 1 | 20 ms | Inaudible with fade or LPC extrapolation; brief click with silence or sample-and-hold |
| 2-3 | 40-60 ms | Noticeable glitch; LPC extrapolation minimises but cannot hide it |
| 4-10 | 80-200 ms | Audible dropout. PLC keeps the audio from sounding "broken" but doesn't restore content |
| > 10 | > 200 ms | Treat the stream as broken; reset the receiver's concealment state to avoid droning artifacts, and if possible ask the transport to signal "resync" upstream |

Mid-stream resync on a datagram transport uses the sync word
(`0x1ACC`) as an alignment anchor: on a string of bad frames,
search the next `N` bytes of the buffer for the big-endian sequence
`\x1a\xcc` and retry `parse_header` from each candidate offset
until one succeeds. The search is O(N); on a 20 ms frame at 48 kHz
there are at most ~180 bytes per frame to scan, so amortised cost
is negligible.

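
The scan itself might look like the sketch below. The function name is hypothetical; each returned offset is only a candidate, since payload bytes can contain the same two-byte pattern, so the caller would still retry `parse_header` on `&buf[offset..]` until one succeeds.

```rust
/// Sync word 0x1ACC in big-endian byte order.
const SYNC: [u8; 2] = [0x1A, 0xCC];

/// Offsets in `buf` where the two sync bytes appear back to back.
fn sync_candidates(buf: &[u8]) -> Vec<usize> {
    buf.windows(2)
        .enumerate()
        .filter(|&(_, w)| w[0] == SYNC[0] && w[1] == SYNC[1])
        .map(|(i, _)| i)
        .collect()
}
```

Buffers shorter than two bytes yield no candidates.
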
[`AudioFrameHeader`]: https://docs.rs/lac/latest/lac/struct.AudioFrameHeader.html

## Testing

```
cargo test                                                            # unit tests
cargo test --test corpus --release -- --nocapture                     # compression vs FLAC, lac_enc_ms
cargo test --test synthetic --release -- --nocapture                  # bit-depth + pathological content
cargo test --test latency --release -- --nocapture --test-threads=1   # p50/p95/p99 + alloc count
cargo test --test mcu_mix --release -- --nocapture --test-threads=1   # MCU throughput
cargo test --test conformance --release -- --nocapture                # byte-level spec conformance
cargo test --test determinism --release                               # encode byte-equality on repeat
cargo fuzz run decode_arbitrary -- -dict=dict/decode_arbitrary.dict
cargo fuzz run roundtrip_arbitrary -- -dict=dict/roundtrip_arbitrary.dict
cargo bench                                                           # nightly bench
benches/compare-flac.sh                                               # flac side of the speed table
```

**Published-crate caveat.** `Cargo.toml` excludes `corpus/*` and
`fuzz/*` from the published tarball — they'd blow up crate size and
the audio isn't redistributable under crates.io's constraints anyway.
A user running `cargo test` against a `cargo add lac`'d dependency
sees every corpus test *pass* because the `require_corpus!` macro
skips missing files silently; the compression-ratio assertions,
FLAC comparisons, latency P99 checks, and MCU throughput checks all
go unrun. The full regression suite requires the git repository
(with LFS pulled). The synthetic, conformance, determinism, and
unit tests run unchanged from either source.

Coverage at a glance:

- **Unit** — round-trips for every LPC order 0-32 and every partition order
  0-7, prime frame lengths that force `partition_order = 0`, all-zero
  frames, full-scale sample magnitudes, malformed-header rejection for
  every field (`sync_word`, `prediction_order`, `partition_order`,
  `coefficient_shift`), truncated bitstreams, and a convex-descent vs
  exhaustive-search `select_k` differential.
- **Corpus** — round-trip + compression-ratio + FLAC subprocess comparison
  on a mixed speech and music corpus; asserts ratio ceilings so a codec
  regression fails CI; prints LAC encode wall-clock for correlation
  against `benches/compare-flac.sh`.
- **Synthetic** — deterministic LFSR-driven round-trips at 8/16/20/24-bit
  source widths and pathological content (all-zero, DC offset, Nyquist
  square, silence + click, full-scale constant, prime-length frame). No
  corpus dependency so the tests run on every CI checkout.
- **Latency** — per-frame encode/decode timing on real speech with a
  custom tracking allocator for peak-heap *and* per-frame allocation-count
  numbers; reports P50/P95/P99/max and asserts P99 < frame period so a
  real-time regression fails CI.
- **MCU** — decode → PCM mix → re-encode simulation on real speech for
  2/3/5/8/16 participants (continuous speech) plus an 8-participant
  rotating dominant-speaker variant; asserts MCU egress ≤ SFU-fanout egress.
- **Fuzz** — libFuzzer targets for decoder robustness and
  encoder/decoder self-consistency on arbitrary bytes, seeded with
  dictionaries of the wire-format constants (sync word, field boundaries)
  and sample-magnitude boundaries (8/16/20/24-bit ceilings).

## Measurements

### Reference hardware

| Short name | CPU | ISA highlights |
|---|---|---|
| **7840HS** | AMD Ryzen 7 7840HS (laptop, 8c/16t, up to 5.1 GHz) | AVX-512 (F/BW/CD/DQ/VL/VNNI/VBMI), BMI2, FMA |
| **RPi5** | Raspberry Pi 5 (Cortex-A76 quad, 2.4 GHz) | NEON |
| **VF2** | StarFive VisionFive 2 (SiFive U74 quad, 1.5 GHz) | RVV 0.7 (some LLVM autovec, less mature than x86 or NEON) |

Numbers below are measured at default `cargo build --release` (no
`target-cpu=native`, no project-level `RUSTFLAGS`). Empty cells are
awaiting measurement on the listed hardware. FLAC comparison uses both
`-5` (the CLI default, what production pipelines typically use) and
`-8` (`--best`, the compression upper bound).

### Corpus attribution

The measurements are taken on two publicly-licensed audio corpora
checked into `corpus/`:

- **Speech**: the [AMI Meeting Corpus](https://groups.inf.ed.ac.uk/ami/corpus/)
  (files named `ES2002a.*`), recorded by the AMI Consortium (University
  of Edinburgh, IDIAP, TNO, Brno University of Technology, University
  of Sheffield, and partners). Distributed under
  [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
- **Music**: Kimiko Ishizaka's recording of J.S. Bach's *Goldberg
  Variations, BWV 988* (files named `Kimiko Ishizaka - …`), from the
  [Open Goldberg Variations project](https://opengoldbergvariations.org/)
  (Robert Douglass, producer). Released under
  [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/) —
  public domain dedication, no attribution legally required, credited
  here as a courtesy.

Both corpora are used unmodified apart from the file selection
described in the tables below.

### Compression (hardware-independent, bit-exact across targets)

LAC ratio = LAC encoded / raw PCM. Both codecs use the same 4096-sample
block size on this corpus — LAC's `tests/corpus.rs` sets
`FRAME_SIZE = 4096`, which matches FLAC's default blocksize at `-5`
and `-8` for ≤ 48 kHz content, so header and coefficient overhead is
amortised identically on both sides.

| Corpus file | Class | LAC | FLAC -5 | FLAC -8 | LAC / -5 | LAC / -8 |
|---|---|---:|---:|---:|---:|---:|
| `ES2002a.Headset-0.wav` | headset speech, 16 kHz / 16-bit | 0.178 | 0.187 | 0.186 | 0.954 | 0.958 |
| `ES2002a.Mix-Headset.wav` | mixed meeting, 16 kHz / 16-bit | 0.292 | 0.300 | 0.297 | 0.975 | 0.984 |
| `ES2002a.Array1-01.wav` | array speech, 16 kHz / 16-bit | 0.375 | 0.378 | 0.377 | 0.989 | 0.994 |
| Goldberg Aria (01) | solo piano, 96 kHz / 24-bit | 0.483 | 0.458 | 0.457 | 1.053 | 1.056 |
| Goldberg Variatio 4 (05, fughetta) | solo piano, 96 kHz / 24-bit | 0.514 | 0.483 | 0.481 | 1.065 | 1.067 |
| Goldberg Variatio 16 (17, Ouverture) | solo piano, 96 kHz / 24-bit | 0.512 | 0.479 | 0.478 | 1.068 | 1.070 |

Speech reliably beats FLAC at both levels by a small margin; music
trails by 5-7 % (the Q-format gap at low frequencies, mitigated but
not eliminated by `coefficient_shift`). FLAC's jump from `-5` to `-8`
buys essentially nothing on this corpus (≤ 0.3 pp of ratio), so the
realistic LAC-vs-FLAC comparison in practice is against `-5`. Numbers
are byte-identical regardless of hardware because LAC's output is
specified bit-exactly.

### Encode wall-clock (ms, full file)

One table per hardware target; each has LAC alongside both FLAC levels
so the speed cost of each quality point is visible. The `-5` column is
the most representative real-world comparison.

**7840HS** (AMD Ryzen 7 7840HS):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---:|---:|---:|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 1158 | 221 | 436 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 1292 | 226 | 447 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 1367 | 223 | 469 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 809 | 272 | 647 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 2126 | 754 | 1741 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 3521 | 1166 | 2703 |

**RPi5** (Raspberry Pi 5, Cortex-A76 @ 2.4 GHz):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---:|---:|---:|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 2856 | 477 | 959 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 3249 | 495 | 1096 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 3363 | 505 | 1132 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 1904 | 606 | 1570 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 5201 | 1627 | 4324 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 9015 | 2572 | 6832 |

**VF2** (StarFive VisionFive 2, SiFive U74 quad @ 1.5 GHz):

| Corpus file | Duration | LAC | FLAC -5 | FLAC -8 |
|---|---|---:|---:|---:|
| `ES2002a.Headset-0.wav` | ~42 min, 16 kHz / 16-bit | 29385 | 2355 | 5614 |
| `ES2002a.Array1-01.wav` | ~42 min, 16 kHz / 16-bit | 33231 | 2502 | 6688 |
| `ES2002a.Mix-Headset.wav` | ~42 min, 16 kHz / 16-bit | 34899 | 2548 | 6878 |
| Goldberg Variatio 4 (05) | ~68 s, 96 kHz / 24-bit stereo | 18185 | 3184 | 9278 |
| Goldberg Variatio 16 (17) | ~188 s, 96 kHz / 24-bit stereo | 49811 | 8535 | 25454 |
| Goldberg Aria (01) | ~300 s, 96 kHz / 24-bit stereo | 88208 | 13544 | 40650 |

On x86, LAC is ~5-6× slower than FLAC `-5` and ~2.7-2.9× slower than
FLAC `--best` on speech, and ~3× / ~1.2-1.3× slower on music, because
libFLAC ships hand-tuned SSE intrinsics for its autocorrelation kernel
and LAC relies on LLVM autovectorization.
End-to-end perf barely changes with `target-cpu=native`: the kernel does
pick up AVX-512 zmm dot-products, but the frame encode is bottlenecked
elsewhere (Rice k-search and bitstream assembly dominate the remaining
time).

On RPi5 (ARM Cortex-A76, NEON) LAC runs ~2.5× slower than on 7840HS in
absolute terms, but the *ratio* against FLAC shifts noticeably: LAC is
~6-7× slower than FLAC `-5` on speech (wider gap, libFLAC's NEON path
is well-tuned for the 16 kHz / 16-bit content) but only ~3-3.5× slower
on 96 kHz / 24-bit music (narrower gap — 24-bit content gives libFLAC's
specialization less leverage). Against FLAC `--best` the music gap
narrows further to ~1.2-1.3×. The 7840HS-vs-RPi5 delta in the LAC
column shows scalar autovec quality is broadly comparable across x86
and ARM backends; the delta in the FLAC columns shows where hand-tuned
intrinsics disappear on a different ISA.

On VF2 (RISC-V SiFive U74, RVV 0.7 — not supported by mainline libFLAC
or LLVM autovec yet) LAC runs ~10× slower than on RPi5. Both codecs
fall back to pure scalar execution; the gap between them *widens* to
~12-14× on speech and ~6× on music vs FLAC `-5`, or ~5× / ~2× vs
FLAC `--best`. Two factors compound: the U74 is a dual-issue
in-order core vs the Cortex-A76's 4-wide out-of-order design (base IPC
is ~2× lower at the ISA-agnostic level), and LLVM's scalar Rust codegen
for RISC-V is less mature than its x86/ARM output — tighter inner
loops in libFLAC's hand-written C survive this better than LAC's
Rust does. The absolute numbers are still useful: even at 88 s to
encode 5 minutes of 96/24 stereo audio, LAC comfortably meets
realtime for streaming use (see the P99 latency table below).

### Per-frame encode latency P99 (µs)

All rows use real AMI speech samples. Frame sample count sets the
real-time deadline; P99 must stay below that period for the frame to
ship inside its own playback slot.

| Test | Frame | Period | 7840HS P99 | RPi5 P99 | VF2 P99 |
|---|---|---:|---:|---:|---:|
| `latency_headset_speech_160` | 160 @ 16 kHz | 10 ms | 20 | 38 | 235 |
| `latency_headset_speech_320` | 320 @ 16 kHz | 20 ms | 36 | 76 | 499 |
| `latency_headset_speech_480` | 480 @ 16 kHz | 30 ms | 37 | 81 | 635 |
| `latency_headset_speech_prime` | 503 @ 16 kHz | 31 ms | 23 | 52 | 387 |
| `latency_array_speech_320` | 320 @ 16 kHz | 20 ms | 42 | 77 | 506 |
| `latency_mixed_meeting_320` | 320 @ 16 kHz | 20 ms | 43 | 84 | 551 |

P99 headroom (frame period divided by P99) is ~460-1370× on 7840HS,
~240-600× on RPi5, and ~36-81× on VF2. Every row on every platform
stays comfortably inside the realtime deadline — even VF2's worst case
(`mixed_meeting_320` at 551 µs on a 20 ms frame) has 36× margin. LAC
meets its streaming contract on every target tested.

### MCU throughput (× realtime on one core)

Realtime multiplier = audio-ms processed per wall-clock-ms, per core.
"`20×` realtime" means one core sustains twenty simultaneous meetings
of the listed configuration.

| Test | Activity | 7840HS | RPi5 | VF2 |
|---|---|---:|---:|---:|
| `mcu_mix_1on1_voice` (P=2) | continuous | 279× | 145× | 22× |
| `mcu_mix_3people_voice` (P=3) | continuous | 193× | 95× | 14× |
| `mcu_mix_5people_voice` (P=5) | continuous | 120× | 57× | 9× |
| `mcu_mix_8people_voice` (P=8) | continuous | 77× | 35× | 5× |
| `mcu_mix_8people_dominant_speaker` (P=8) | rotating speaker | 106× | 43× | 6× |
| `mcu_mix_16people_voice` (P=16) | continuous | 39× | 17× | 2.5× |

MCU egress byte count as a fraction of SFU fanout egress on 7840HS:
1.00 (P=2, trivially equal), 0.60 (P=3), 0.36 (P=5), 0.22 (P=8
continuous), 0.35 (P=8 dominant-speaker), 0.10 (P=16). The continuous
case is the lower bound — SFU fanout scales quadratically in
participant count while MCU mix egress scales linearly, so the
relative savings grow as the meeting does. The dominant-speaker case
inverts that trend slightly: SFU fanout of N-1 near-silent streams is
almost free, so the SFU baseline falls faster than the MCU mix cost
does. These numbers are byte-accounting, not wall-clock.