diff --git a/Specification.md b/Specification.md new file mode 100644 index 0000000..eedae43 --- /dev/null +++ b/Specification.md @@ -0,0 +1,784 @@ +# LAC Wire Format + +Normative specification of the LAC bitstream. This document is the authority +on byte layout, field semantics, and encoder/decoder constraints. + +## 1. Conventions + +- All multi-byte integer fields are **big-endian**. +- Bit streams are **MSB-first**: the first bit written occupies bit 7 of its + byte, subsequent bits fill lower positions, and a new byte begins once + eight bits have been emitted. +- Samples are **signed integers** passed as `i32` with magnitude bounded + by `|sample| ≤ 2²³ − 1`. The upper 9 bits of each `i32` must be a + consistent sign extension of the 24-bit-magnitude value. Narrower source + formats (8-bit, 16-bit, 20-bit integer PCM) are passed through directly + — they trivially satisfy the magnitude bound — and compress at the bit + cost of their actual values, not a 24-bit ceiling. The codec does not + carry bit-depth metadata; the container or application layer is + responsible for remembering the source format. +- Sample rate is **not** part of the bitstream. The container or transport + carries it. +- A frame encodes **one channel**. Stereo is two independent streams of + frames, one per channel. + +## 2. Frame Layout + +A frame is a contiguous byte sequence: + +```text ++--------+--------------------+ +| header | rice_bitstream | ++--------+--------------------+ +``` + +The `header` is fixed-structure (variable length because the coefficient +array depends on `prediction_order`). The `rice_bitstream` is a bit-packed +payload; its byte length is `ceil(total_rice_bits / 8)` with zero padding in +the low bits of the last byte. + +Decoder input is the complete frame. There is no intra-frame continuation or +fragmentation — the transport layer handles that. + +## 3. Frame Header + +```text + Offset Size Field Type Constraint + ------ ---- -------------------- ------- ---------------------------- + 0-1 2 sync_word u16 BE == 0x1ACC + 2 1 prediction_order u8 ∈ [0, 32] + 3 1 partition_order u8 ∈ [0, 7] + 4 1 coefficient_shift u8 ∈ [0, 5] + 5-6 2 frame_sample_count u16 BE ≥ 1, % (1 << partition_order) == 0 + 7+ 2·p lpc_coefficients i16 BE[] length = prediction_order = p +``` + +Total header length: `7 + 2 · prediction_order` bytes. + +### 3.1 `sync_word` + +Fixed value `0x1ACC`. Present to identify a LAC frame on lightly framed +transports and to reject foreign payloads at the first check. Decoders +**must** reject any frame whose first two bytes are not `0x1ACC`. + +### 3.2 `prediction_order` + +Integer order of the LPC analysis filter used to produce residuals. + +- Value `0` is **verbatim mode**: no prediction, residuals equal the samples, + and the `lpc_coefficients` array is empty (zero bytes). +- Values `1` through `32` are standard LPC orders; `lpc_coefficients` carries + exactly that many predictor coefficients, interpreted in the Q-format + determined by `coefficient_shift` (§3.4). + +Decoders **must** reject values above 32. + +### 3.3 `partition_order` + +Controls how the residual stream is split for Rice coding. + +- `partition_count = 1 << partition_order`. +- The residual stream is divided into `partition_count` equal partitions of + `frame_sample_count / partition_count` samples each. + +Decoders **must** reject values above 7 and **must** reject frames where +`frame_sample_count` is not a multiple of `partition_count`. + +### 3.4 `coefficient_shift` + +Controls the fixed-point scale of the stored Q-format LPC predictor +coefficients. Coefficients are stored as 16-bit integers interpreted as +`Q(15 − coefficient_shift)`: + +| shift | Q-format | Real-value range | Use case | +|-------|----------|-------------------|------------------------------------------------| +| 0 | Q15 | `[−1, 1)` | Coefficients with magnitude < 1 (most orders > 1) | +| 1 | Q14 | `[−2, 2)` | Low-frequency content, `a[1]` near −2 | +| 2 | Q13 | `[−4, 4)` | Extreme bass / narrow resonances | +| 3 | Q12 | `[−8, 8)` | Pathological transients | +| 4 | Q11 | `[−16, 16)` | Reserved for synthetic signals | +| 5 | Q10 | `[−32, 32)` | Upper bound; decoder rejects larger values | + +The encoder **must** select the smallest `coefficient_shift` at which no +coefficient's real value exceeds the representable range for that shift — +i.e., the smallest scale that does not clamp. Smaller shifts give finer +precision and thus smaller residuals when no clamping is required. + +If no `coefficient_shift ∈ [0, 5]` suffices (the real coefficient +magnitude exceeds the Q10 range at `shift = 5`), the encoder +saturates each offending coefficient independently to the i16 range +`[−32768, 32767]`. Bit-exact round-trip is preserved because the +decoder applies the synthesis formula to whatever 16-bit values the +wire carries; the cost of saturation is compression — the predictor +no longer matches the encoder's ideal coefficients, so residuals +grow. Real audio at the input-magnitude contract (§1) rarely reaches +this case; synthetic or adversarial inputs can force it. + +Decoders **must** reject values above 5. The shift applies uniformly to +every coefficient in `lpc_coefficients`; there is no per-coefficient +scale. + +When `prediction_order == 0` (verbatim frame), `coefficient_shift` +**must** be `0`. The shift only modifies how stored coefficients are +interpreted, and a verbatim frame stores none. Decoders **must** +reject frames with `prediction_order == 0` and `coefficient_shift != 0` +as malformed; this rule closes the space of legal but meaningless +headers so two implementations agree bit-for-bit on which inputs +round-trip. + +### 3.5 `frame_sample_count` + +Number of audio samples produced by this frame (also the number of residuals +in the Rice bitstream). The value **must** be in `[1, 65535]`. + +Decoders **must** reject `frame_sample_count == 0`: a zero-sample frame +trivially satisfies the partition-divisibility check below +(`0 mod n == 0` for any `n`) but carries no audio and has no legal +Rice payload. + +For compliance with `partition_order`, the value **must** satisfy +`frame_sample_count mod (1 << partition_order) == 0`. Decoders **must** +reject frames where this does not hold. + +### 3.6 `lpc_coefficients` + +Array of `prediction_order` predictor coefficients, each a 16-bit big-endian +signed integer interpreted in `Q(15 − coefficient_shift)` format — see §3.4 +for the shift semantics. + +The wire format does not distinguish coefficients by derivation. The +synthesis formula below applies identically whether the encoder +obtained the values from Levinson-Durbin analysis, from a fixed +coefficient template (e.g. FLAC-style integer predictors), from a +trained model, or from any other strategy. What goes on the wire is +just `prediction_order` 16-bit integers; how the encoder chose them +is encoder-internal and not observable to the decoder. + +Synthesis formula (applied in the decoder): + +```text +s = 15 − coefficient_shift +bias = 1 << (s − 1) +predict[i] = (Σ_{j=0..terms-1} coeff[j] · sample[i − j − 1] + bias) >> s +sample[i] = residual[i] + predict[i] +``` + +where `terms = min(i, prediction_order)`. The `+ bias` term implements +round-to-nearest for the right shift and is **required** for bit-exact +decoding. For the default `coefficient_shift = 0`: `s = 15`, `bias = 16384`. + +The `>> s` operator **must** be an **arithmetic right shift** on +signed integers — equivalent to floor division by `2^s`. Combined with +the `+ bias` pre-add, this implements **round-half toward +∞**: on a +value whose scaled form is exactly `k + 0.5`, the result is `k + 1` +for both positive and negative `k`. Implementations using truncating +integer division (C's `/` on signed integers, which rounds toward +zero) **will diverge** from this on any `sum + bias` that is negative +and not evenly divisible by `2^s`: arithmetic shift rounds further +from zero, truncating division rounds toward zero. Concrete example: +at `s = 15`, `sum = -32769`, `bias = 16384`, arithmetic shift gives +`(-16385) >> 15 = -1`, truncating division gives `-16385 / 32768 = 0`. +Decoders in languages whose native integer division does not floor +**must** emulate arithmetic right shift explicitly on the signed +accumulator. + +#### Accumulator width + +The inner sum `Σ coeff[j] · sample[i − j − 1]` **must** be computed in +a signed integer accumulator of at least **49 bits** (equivalently: an +`i64` or wider). Worst-case bounds at `prediction_order = 32`, +`coefficient_shift = 5` (Q10), and full-scale samples give a product +of magnitude `(2¹⁵) · (2²³ − 1) ≈ 2³⁸` per term, summed over 32 terms +for a maximum of `~2⁴³`. Adding the bias keeps the result below `2⁴⁴`. +A 32-bit accumulator overflows at orders ≥ 16 with full-scale inputs — +implementations that reach for `int32_t` because samples are 32-bit +will silently corrupt high-order frames. + +JavaScript / TypeScript implementers should note that `Number` is an +IEEE 754 double, not a signed integer: its 53-bit safe-integer range +covers in-contract accumulator values, but adversarial bitstreams (see +§6.2) can produce out-of-contract samples whose synthesis arithmetic +lands in the 2⁴⁹–2⁵¹ range and beyond, where `Number` silently loses +low bits to float rounding. For bit-exact spec compliance in JS/TS, +**use `BigInt` for the accumulator** — it has the integer semantics +the spec requires; `Number` does not. + +#### Warm-up (`terms == 0`) + +When `i == 0`, `terms = min(0, prediction_order) = 0`. The sum is +empty and `predict[0] = 0` — the `(0 + bias) >> s` formula is **not** +applied. Stating this explicitly avoids an implementation that +mechanically applies the formula in the warm-up case and produces +`predict[0] = bias >> s`, which is zero only in specific +`(bias, s)` parametrisations and surprising in any other. + +For `0 < i < prediction_order`, the sum truncates to the available +`i` predecessors (`terms = i`). The formula applies as stated. + +#### Sign convention for stored coefficients + +The synthesis formula uses `+Σ`. Classical Levinson-Durbin +implementations that derive LPC from the error-prediction AR model + +```text +x[n] = −Σ a[j] · x[n-j] + e[n] (error convention) +``` + +produce coefficients `a[j]` whose sign is the **opposite** of what +the synthesis formula expects; those encoders **must** negate before +quantisation so the wire value is `coeff[j-1] = −a[j]`. +Implementations using the predictor convention + +```text +x̂[n] = +Σ c[j] · x[n-j] (predictor convention) +``` + +store `c[j]` directly. + +Both conventions are common in DSP literature. Encoders **must** +verify that the coefficients emitted on the wire, when substituted +into the synthesis formula above, reproduce the encoder's own +prediction. The reference implementation uses the error convention +and negates at quantisation time. + +#### Overflow semantics of the final add + +The `residual[i] + predict[i]` add is specified as a **wrapping i32 +add** (two's complement, modulo `2³²`; in languages without native +signed-overflow semantics, compute `(residual + predict) & 0xFFFFFFFF` +and then re-interpret as a signed 32-bit integer via sign-extension +of bit 31). On well-formed bitstreams — those produced by a compliant +encoder from in-contract samples (§1) — the result stays inside the +sample-magnitude contract and the wrap is never observable. +Adversarial bitstreams with crafted coefficients and residuals **may** +produce any `i32` value; the decoder **must not** panic, abort, or +reject on the basis of this add's result. The consequences of +out-of-contract decoder output are addressed in §6.2. + +## 4. Rice Bitstream + +Immediately follows the header. Flat MSB-first bitstream structured as +consecutive partition payloads: + +```text ++---------+---------+-----+-----------+ +| part. 0 | part. 1 | ... | part. P-1 | ++---------+---------+-----+-----------+ +``` + +where `P = 1 << partition_order`. Each partition has the same structure: + +```text ++-------+-------------+-------------+-----+-------------+ +| k (5) | codeword 0 | codeword 1 | ... | codeword M-1| ++-------+-------------+-------------+-----+-------------+ +``` + +where `M = frame_sample_count / P` is the per-partition residual count. + +Partitions are **bit-contiguous**: the 5-bit `k` field of partition +`i + 1` begins at the bit immediately following the last codeword of +partition `i`. There is no byte alignment between partitions. Only +the final trailing padding described in §4.3 is byte-aligned. + +Within a partition, the bit cursor likewise advances continuously: +codeword 0 begins at the bit immediately following the 5-bit `k` +field, codeword 1 immediately after codeword 0's remainder bit, and so +on. A conformant decoder maintains a single bit-read position across +the entire Rice bitstream — from the `k` of partition 0 through the +last codeword of partition P−1 — and never realigns to a byte or bit +boundary between fields. This is implicit in the byte-stream decoder +design (a bit reader that consumes bits sequentially needs no +special handling at field boundaries) but stated here so second-team +implementations do not introduce a spurious alignment. + +### 4.1 Per-Partition Parameter `k` + +Five-bit unsigned integer, MSB-first, immediately before the partition's +codewords. `k` is the Rice parameter for this partition and must be in +`[0, 23]`. Decoders **must** reject values above 23 as malformed. + +### 4.2 Codeword + +Each residual is encoded by: + +1. **Zigzag mapping** from signed to unsigned: + + ```text + z = (r << 1) ^ (r >> 31) interpreted as u32 + ``` + + where `(r >> 31)` is an **arithmetic right shift** on the i32 + residual, sign-extending the sign bit to all 32 bit positions — `0` + for non-negative `r`, `-1` (all ones) for negative `r`. The entire + expression is **masked to 32 bits** before being interpreted as + `u32`; in languages with arbitrary-precision integers (e.g. Python) + or where native bitwise ops return signed 32-bit (e.g. JavaScript / + TypeScript `Number`, where `(x ^ y) >>> 0` coerces to u32), this + mask is explicit (`& 0xFFFFFFFF` or `>>> 0`) and **required** — + without it, negative residuals produce zigzag values with extra + high bits set. Implementations in languages whose native right + shift is logical on unsigned types **must** coerce `r` to a signed + 32-bit type first; implementations in languages where `>>` on + signed types is implementation-defined (e.g. pre-C++20 C/C++) + **must** emulate the arithmetic shift explicitly. + + The `r << 1` factor is always safe on i32. Residuals on the + encoder side are bounded by `|r| ≤ |sample| + |predict| ≤ 2·2²³ + ≈ 2²⁴`, so `r << 1` has magnitude `≤ 2²⁵` and fits in i32 without + overflow or undefined behaviour — even for `r` at the most + negative value the encoder can ever produce from in-contract + input (§1). Decoder-side code does not perform this shift; the + inverse uses `z >> 1` on a u32, which is always defined. + + The mapping sends + + ```text + {0, −1, 1, −2, 2, −3, 3, …} → {0, 1, 2, 3, 4, 5, 6, …} + ``` + + so small magnitudes of either sign map to small unsigned values. + + The decoder's inverse is + + ```text + r = ((z >> 1) as i32) ^ −((z & 1) as i32) + ``` + + where both shifts here are natural (unsigned-u32 logical for the + first, integer negation for the second). Stating this inverse + explicitly removes any ambiguity about how an implementation must + invert the zigzag. + +2. **Rice code** at parameter `k`: + - **Unary part**: `q = z >> k` zero-bits followed by a single + terminating 1-bit. + - **Remainder part**: `k` bits of `z & ((1 << k) − 1)`, MSB-first. + (The remainder part is absent when `k == 0`.) + +Total codeword length: `q + 1 + k` bits. + +#### Decoder-side unary-run bound + +Decoders **must** reject any codeword whose unary run length satisfies + +```text +q > (2³² − 1) >> k (equivalently, q > u32::MAX >> k) +``` + +A valid codeword reconstructs `z = (q << k) | remainder` as a u32. +`q > u32::MAX >> k` implies `q << k ≥ 2³²`, which either overflows +u32 silently (a critical decoder bug class — corrupt output with no +error) or indicates a malformed stream. Either way the frame **must** +be discarded with `InvalidParameter` or equivalent rejection class +(§6). The bound varies with `k`: at `k = 23` it is `511`, at `k = 0` +it is `u32::MAX` (no practical constraint). + +This rule also caps the CPU cost of unary scanning on adversarial +input: without the cap, a decoder could be forced to scan an +arbitrarily long run of zero bits before reaching either a `1` or +the buffer end. + +### 4.3 Byte Padding + +After the last codeword of the last partition, any unused bits of the final +byte are zero-padded on the LSB side. The encoder writes `0` for all padding +bits; the decoder ignores them. + +### 4.4 Bitstream Length + +The Rice bitstream's total bit length is the sum of codeword bit +lengths across every partition, plus 5 bits per partition for the `k` +fields: + +```text +total_bits = P · 5 + Σ_{all residuals} (q + 1 + k_partition) +``` + +This depends on every residual's quotient and cannot be computed +from the frame header alone. A decoder **streams-decodes** until it +has produced exactly `frame_sample_count` samples, then stops. +Any bits remaining inside the last byte are padding (§4.3) and +carry no information. + +A decoder **must not** require the Rice bitstream's byte length to be +signalled out-of-band. The header plus the zero-padded byte-aligned +tail fully determines the frame boundary; `parse_header`'s +`bytes_consumed` return plus streaming Rice decode locates the end of +the frame in the input buffer. + +## 5. Degenerate Cases + +### 5.1 All-Zero Frame + +For an all-zero sample vector, the encoder **must** use +`prediction_order = 0` because the Levinson-Durbin recursion is +undefined at `R[0] = 0`. Residuals equal the input (all zeros). + +Partition-order and per-partition `k` selection remain at the +encoder's discretion; any legal `(partition_order, k)` combination +produces a bit-exact-decodable frame. The minimum-cost choice — +`partition_order = 0`, `k = 0` — produces a Rice payload of exactly +`5 + frame_sample_count` bits. Compliant encoders are **not** +required to pick this minimum. + +### 5.2 Single-Sample Frame + +`frame_sample_count = 1` is valid but forces `partition_order = 0` (the only +value that divides 1 evenly). The single sample is Rice-coded directly +because no predecessors exist for any LPC order. + +## 6. Error Recovery + +Decoders **must** detect and reject every frame that violates the +constraints elsewhere in this document. The exhaustive list of +rejection classes, each of which is a distinct error condition so +callers can distinguish them in telemetry, is: + +1. **Sync word mismatch** — bytes `0-1` differ from `0x1ACC` (§3.1). +2. **`prediction_order` out of range** — value > `32` (§3.2). +3. **`partition_order` out of range** — value > `7` (§3.3). +4. **`coefficient_shift` out of range** — value > `5` (§3.4). +5. **Verbatim frame with non-zero shift** — `prediction_order == 0` and + `coefficient_shift != 0` (§3.4). +6. **`frame_sample_count == 0`** (§3.5). +7. **`frame_sample_count` not divisible by `partition_count`** (§3.3). +8. **Buffer truncated** — fewer bytes than the header plus coefficient + array requires, or fewer bits than the Rice bitstream demands + during streaming decode. This class is intentionally coarse-grained: + a single `Truncated` variant covers header truncation, missing `k` + fields, mid-codeword exhaustion, and every other "buffer ends early" + shape the decoder can encounter. Sub-categorising these provides no + caller benefit — the recovery action (discard and substitute + silence, see §6.1) is identical regardless of where truncation + happened. +9. **Per-partition `k` out of range** — value > `23` (§4.1). +10. **Unary-run cap exceeded** — any codeword with `q > u32::MAX >> k` + (§4.2). + +On any of these, the decoder **must** discard the frame, produce no +output samples, and signal the error to the caller. **No partial +state may propagate** to the next frame's decode — frames are +independent (§2), so subsequent frames decode cleanly regardless. + +### 6.1 Caller-side silence substitution + +On rejection, the caller substitutes `frame_sample_count` zeros +(silence) for the frame period. The count is obtained as follows: + +- **Post-header rejections** (classes 8-10 above — `Truncated` in the + Rice bitstream, `InvalidParameter` during Rice decode): the frame + header parsed successfully before the failure, so the count is + recoverable. The caller re-parses just the header on the same buffer + (reference API: `parse_header(data)`) and reads `frame_sample_count` + from the resulting `AudioFrameHeader`. +- **Pre-header rejections** (classes 1-7 above): the header itself + failed; the frame length is not recoverable from the bitstream. The + caller **must** fall back to a session-level default frame size + carried out-of-band by the container or transport (WebRTC and QUIC + audio sessions typically negotiate this at session setup). + +This asymmetry is inherent to the wire format: `frame_sample_count` +lives inside the header at offset 5, so any rejection that happens +while parsing bytes 0-4 precedes its discovery. + +### 6.2 Decoder output magnitude + +On well-formed bitstreams produced by a compliant encoder from +in-contract samples (§1), decoder output satisfies +`|sample| ≤ 2²³ − 1`. + +Adversarial bitstreams — those with hand-crafted coefficients and +residuals that pass every rejection check in this section yet +produce arithmetic results outside the sample-magnitude contract — +**may** produce output samples of any `i32` value, including values +that exceed `2²³ − 1`. The decoder **must not** panic or reject on +this basis: the wrapping-add semantics of §3.6 are precisely what +makes every bit sequence produce a defined output, which is the +ground of the "no partial state propagates" contract at the top of +this section. + +Callers that re-feed decoder output into LAC's encoder (for example, +an MCU decode → PCM mix → re-encode pipeline) **should** validate or +clamp to the input magnitude contract before re-encoding. A +compliant encoder assumes its input satisfies `|sample| ≤ 2²³ − 1` +and is not required to re-validate. + +## 7. Encoder Guidance (non-normative) + +The reference encoder's search has three phases: + +```text +# Phase 0: all-zero short-circuit +R[0] = Σ sample[i]² +if R[0] == 0: + emit frame with prediction_order = 0, any legal partition_order (§5.1) + return + +# Phase 1: sparse LPC order grid with stop-when-stale early-out +for prediction_order in [0, 2, 4, 6, 8, 10, 12, 16, 20, 24, 28, 32]: + coeffs_q31 = levinson_durbin(samples, prediction_order) # cached + shift = smallest s such that every |coeff_real| < 2^s + coeffs_stored = quantize(coeffs_q31, to: Q(15 - shift)) + residuals = compute_residuals(samples, coeffs_stored, shift) + for partition_order in 0..=7: + if frame_sample_count % (1 << partition_order) != 0: continue + rice_bits = estimate_cost(residuals, partition_order) + total = header_bits(prediction_order) + rice_bits + track minimum over (prediction_order, partition_order) + if no improvement for 2 consecutive grid entries: break + +# Phase 2: fixed-predictor post-pass +for (fp_order, fp_coeffs, fp_shift) in FIXED_PREDICTORS: + residuals = compute_residuals(samples, fp_coeffs, fp_shift) + for partition_order in 0..=7: + if frame_sample_count % (1 << partition_order) != 0: continue + rice_bits = estimate_cost(residuals, partition_order) + total = header_bits(fp_order) + rice_bits + track minimum over (fp_order, partition_order) + +emit frame with the (order, partition_order) that minimised `total` +``` + +The sparse grid + early-out is a speed/compression trade-off; a +compliant encoder may still exhaustively search every integer order +`0..=32` for marginal gains at higher cost. The produced bitstreams +are interchangeable. + +The fixed-predictor post-pass tries FLAC-style integer predictors +(orders 1-4 with a small static coefficient table) after the LPC +grid. These evaluate quickly and occasionally beat the Levinson- +Durbin winner on content where a low-order integer polynomial fits +better than the statistically-optimal LPC fit — silent-plus-DC, very +smooth tones, polynomial-ish sensor data. Running them second avoids +tripping the stop-when-stale heuristic in the LPC phase. + +The reference encoder's `FIXED_PREDICTORS` table, materialising the +FLAC-style `[1]`, `[2, −1]`, `[3, −3, 1]`, `[4, −6, 4, −1]` integer +predictors at the smallest `coefficient_shift` that represents each +coefficient without clamping: + +| `prediction_order` | `lpc_coefficients` (Q-format integers) | `coefficient_shift` | Real-value interpretation | +|-------------------:|---------------------------------------------|--------------------:|------------------------------| +| 1 | `[16384]` | 1 (Q14) | `[1]` | +| 2 | `[16384, −8192]` | 2 (Q13) | `[2, −1]` | +| 3 | `[24576, −24576, 8192]` | 2 (Q13) | `[3, −3, 1]` | +| 4 | `[16384, −24576, 16384, −4096]` | 3 (Q12) | `[4, −6, 4, −1]` | + +These are the exact wire-format bytes a second-team encoder would emit +to match the reference's fixed-predictor outputs bit-for-bit. Compliant +encoders MAY use a different set (or none), since §3.6 treats the +coefficient field as opaque — decoders apply the synthesis formula +identically regardless of source. + +The `R[0] == 0` short-circuit is both a correctness requirement (§5.1 +— Levinson-Durbin is undefined at zero autocorrelation) and an +encoder-cost optimisation: on digital silence, the sparse grid and +fixed-predictor pass produce identical zero residuals and order 0 +wins on header size alone. + +Levinson-Durbin runs once to order 32 with all intermediate orders saved +into a flat buffer (one recursion pass yields all orders 1..32 at +`O(order²)` cost), so the outer loop fetch is free and order selection +is effectively `O(orders_tried × N)`. + +`shift` is determined per order by the coefficient magnitudes — there +is no shift search, as smaller shifts are always at least as good as +larger ones when they don't clamp (saturation, §3.4, is the +fallback). Rice cost at a given `partition_order` is exact and +closed-form given the per-partition `k`, so the inner search +introduces no estimation error. + +#### Levinson-Durbin numerical choices (reference, non-normative) + +The reference encoder runs Levinson-Durbin with i64 autocorrelation +accumulators, Q31 working coefficients, and widens to i128 for +reflection-coefficient intermediates at orders where Q31 would lose +precision (typically above order ~12). Rounding on the Q31→Q(15−shift) +quantisation step is round-half-up via `(a_q31 + bias) >> shift_amt`, +with `bias = 1 << (14 + shift)` — the direct analogue of the synthesis +formula's rounding (§3.6), chosen so analysis and synthesis agree on +tie-break direction. + +None of these choices are normative. Two encoders making different +precision or rounding choices will produce different coefficient +bytes on the same input, but both bitstreams decode correctly under +§3.6 so long as the coefficients they emit faithfully represent their +own LPC decision. §3.6's "wire format doesn't distinguish coefficients +by derivation" clause is exactly what permits this freedom. + +#### Rice `k` selection (reference, non-normative) + +Cost is closed-form for a given `(partition, k)`: `bit_cost(k) = +N · (1 + k) + Σ (v >> k)` where `v` ranges over the zigzag-mapped +residuals of the partition. Exhaustive search over `k ∈ [0, 23]` is +always acceptable and is the simplest compliant choice. + +The reference encoder uses convex descent: seed `k_seed = +⌊log₂(mean(v))⌋`, then walk either direction. Ties break **toward +smaller `k`** on descent (condition `cost ≤ best_cost`, so equal +costs at `k−1` overwrite `k`) and **strictly larger costs only** on +ascent (condition `cost < best_cost`, so equal costs at `k+1` don't +override `k`). This yields a unique `k` per partition matching an +exhaustive search's first-wins tie-break. + +On the reference corpus, the sparse grid's compression matches +exhaustive-search output within ~0.2 percentage points (measured); +the differential test caps the acceptable excess at 0.5%. +Implementations that prefer tighter compression at higher encode +cost can extend the grid without wire-format consequences — decoders +do not care which orders an encoder tried. + +### 7.1 Frame size (non-normative) + +The codec accepts any `frame_sample_count` in `[1, 65535]`. Larger +frames amortise the 7-byte fixed header and the LPC coefficient vector +over more samples, and generally compress better. Smaller frames give +tighter latency and finer partition-order granularity on transient +content. + +Recommended defaults by use case: + +- **Real-time voice** (QUIC streaming, MCU mix): 160-320 samples at + 16 kHz, or 480-960 at 48 kHz — matches 10-20 ms frame periods used + by Opus / WebRTC. +- **Real-time full-band** (game audio, music conferencing): 1024-2048 + at 48 kHz (21-43 ms). +- **Offline / archival**: 4096-8192 at 48 kHz; compression gains + flatten past this. + +Power-of-two frame sizes expose every `partition_order ∈ [0, 7]` to +the encoder's search. Non-power-of-two sizes restrict the search to +partition counts that divide evenly; a prime frame size forces +`partition_order = 0`. Encoders SHOULD prefer frame sizes with several +small-prime factors when free to choose. + +## 8. Versioning + +This document specifies **LAC version 1**, identified on the wire by +`sync_word = 0x1ACC`. No per-frame version byte is carried; the sync +word uniquely identifies the wire format. + +Future revisions of the format **must** use a distinct `sync_word`. The +recommended allocation is `0x1ACD` for v2, `0x1ACE` for v3, and so on +inside the `0x1ACC..0x1ACF` cluster, with the cluster boundary making +casual grep / hex-dump inspection robust. A revision whose wire format +cannot be made compatible with v1 at the header level **must** pick a +sync word outside this cluster. + +This approach is exhaustive by construction: + +- A v1 decoder that encounters a v2 frame sees an unrecognised sync + word on the first check (§3.1) and rejects cleanly — the same error + path as foreign or corrupted payloads. +- A v2-aware decoder dispatches on the sync word before reading any + further field, so it can fall back to v1 parsing when appropriate + or decode v2 frames natively. + +Because every field in §3 has a strict range with no reserved +high-bit patterns, in-place extension (flag bits inside existing +fields) is **not** a supported evolution path. New features go into a +new `sync_word`, not into reinterpreting existing field values. + +Transports that multiplex LAC frames with other formats should frame +each LAC payload explicitly (length prefix or stream separator); the +sync word alone is not a framing delimiter, only a format identifier. + +## 9. Implementation notes (non-normative) + +### 9.1 GPU offload is out of scope + +LAC is a scalar integer codec. The reference implementation, and any +conforming implementation this document anticipates, runs on a CPU. +GPU offload is deliberately not a goal: + +- **Levinson-Durbin** is serial by construction (each iteration depends + on the previous) and its intermediate accumulator needs more than 64 + bits of precision at higher orders — a poor fit for WGSL or SPIR-V + compute shaders, which have no native 128-bit integer arithmetic. +- **Rice decode** uses a data-dependent unary run for every residual; + on GPU execution models this diverges warps badly and its + sequential bit-cursor progression fights SIMD lane packing. +- **LPC synthesis** has a tight per-sample feedback loop (sample `i` + depends on samples `i-1`, `i-2`, …, `i-order`), so each channel is + inherently serial. +- **The one plausibly GPU-parallel phase** — residual computation + inside the encoder's order search — is also the phase where the + CPU's autovectorized implementation is already well-served by SIMD + on any modern target. At the measured encode latencies (P99 under + 50 µs on x86 for a 20 ms frame period, >400× headroom), there is no + motivation to offload it. + +A hypothetical future revision whose hot path genuinely benefited from +GPU execution (large-batch archival encoding across many channels at +once, for instance) would need to change the wire format to carry +enough shape metadata for batched kernels — i.e., a new sync word +under the versioning rules in §8, not a retrofit. + +## 10. Conformance test vectors + +The reference repository's `tests/conformance.rs` holds the canonical +test-vector set for this specification: + +- **`DECODE_FIXTURES`** — `(samples, bytes)` pairs pinned at the byte + level. A conformant decoder **must** produce the `samples` array + when fed the `bytes` array. Encoders have latitude (§3.6, §7), so a + second-team encoder's bytes for the same samples may differ; the + decoder direction is the normative one. Coverage includes the + smallest valid frames (single-sample verbatim, 4- and 8-sample + silence), single-sample polarity boundaries (±1, ±(2²³−1)), DC + offset, alternating-polarity Nyquist-like content, smooth polynomial + (fixed-predictor territory), and a 16-sample growing-amplitude + pattern exercising partition search. +- **`REJECT_FIXTURES`** — hand-constructed malformed inputs mapped to + their expected rejection variants. Covers every class in §6 (1-10): + bad sync, each field-range violation, verbatim + non-zero shift, + `frame_sample_count == 0`, non-divisible partition count, header / + coefficient / Rice-bitstream truncation, per-partition `k > 23`. +- **`reject_unary_run_above_cap`** — a programmatic test for §6 class + 10 (`q > u32::MAX >> k`). The minimal triggering payload is ~75 + bytes of mostly zeros; construction logic is in the test, not a + const fixture. + +Second-team implementations should port the decode fixtures +byte-for-byte and the reject fixtures byte-and-variant-for-variant. +`encode_matches_fixtures` in the same file is reference-specific (it +asserts the reference encoder's exact bytes) and is **not** a +conformance requirement — see §3.6's encoder-latitude clause. + +### 10.1 Reference encoder exemplars (non-normative) + +The same `(samples, bytes)` pairs, read in the encoder direction, +serve as **reference-encoder validation targets** for implementations +that want to match this project's reference byte-for-byte — a common +goal for porting work even though the spec does not require it. The +fixture set is deliberately chosen to pin every encoder-discretion +axis from §7: + +- `single_zero`, `single_pos_one`, `single_neg_one` — single-sample + frames. All three fall in the §5.2 "warm-up-is-whole-frame" regime + where the encoder's order choice is nearly arbitrary; pinning the + bytes fixes this project's choice (order 0 with the minimum-cost + Rice encoding). +- `silence_4`, `silence_8` — force `partition_order` tie-breaks on an + all-zero frame (every `partition_order ∈ [0, log₂(N)]` produces + identical cost; the reference picks the smallest via its convex- + descent tie-break). +- `dc_100_4`, `alternating_small_4` — exercise the order-vs-verbatim + decision. DC content favours a low-order LPC fit with small + residuals; alternating content favours order 1 with `a = −1` + (approximated at the closest Q-format). Pinning the bytes fixes + the reference's decision boundary. +- `single_full_scale_pos`, `single_full_scale_neg` — maximum-magnitude + single samples. Exercise the `|sample| ≤ 2²³ − 1` boundary on both + sides and fix the zigzag-of-extremum output. +- `linear_ramp_8` — smooth polynomial content, fixed-predictor + territory. Pins the reference's fixed-predictor-vs-LPC tie-break. +- `lfsr_noise_16` — exercises partition search on a frame large + enough for `partition_order > 0` to be competitive. + +A second-team encoder that produces the same bytes for every entry +here is **likely** (not guaranteed) to produce matching bytes on +wider inputs, since the tie-break axes are the ones most sensitive +to encoder discretion. An encoder that produces different bytes is +still compliant so long as its own bytes round-trip — see §3.6, §7. \ No newline at end of file