Kamal Tufekcic 7862cb1d9d
initial commit
Signed-off-by: Kamal Tufekcic <kamal@lo.sh>
2026-04-23 14:58:32 +03:00

# LAC Wire Format
Normative specification of the LAC bitstream. This document is the authority
on byte layout, field semantics, and encoder/decoder constraints.
## 1. Conventions
- All multi-byte integer fields are **big-endian**.
- Bit streams are **MSB-first**: the first bit written occupies bit 7 of its
byte, subsequent bits fill lower positions, and a new byte begins once
eight bits have been emitted.
- Samples are **signed integers** passed as `i32` with magnitude bounded
by `|sample| ≤ 2²³ − 1`. The upper 9 bits of each `i32` must be a
consistent sign extension of the 24-bit-magnitude value. Narrower source
formats (8-bit, 16-bit, 20-bit integer PCM) are passed through directly
— they trivially satisfy the magnitude bound — and compress at the bit
cost of their actual values, not a 24-bit ceiling. The codec does not
carry bit-depth metadata; the container or application layer is
responsible for remembering the source format.
- Sample rate is **not** part of the bitstream. The container or transport
carries it.
- A frame encodes **one channel**. Stereo is two independent streams of
frames, one per channel.
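The input contract above can be expressed as executable checks. A non-normative Python sketch; `in_contract` and `sign_extension_consistent` are illustrative names, not part of any API:

```python
SAMPLE_MAX = (1 << 23) - 1  # magnitude bound from the contract above

def in_contract(sample: int) -> bool:
    """True iff an input sample satisfies |sample| <= 2**23 - 1."""
    return -SAMPLE_MAX <= sample <= SAMPLE_MAX

def sign_extension_consistent(raw_u32: int) -> bool:
    """True iff the upper 9 bits of the 32-bit word are a consistent
    sign extension of the 24-bit value (bits 23..31 all 0 or all 1)."""
    top = (raw_u32 >> 23) & 0x1FF
    return top == 0 or top == 0x1FF
```

Note that the two checks are independent: `0xFF800000` (i.e. −2²³) is a consistent sign extension but still violates the magnitude bound.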
## 2. Frame Layout
A frame is a contiguous byte sequence:
```text
+--------+--------------------+
| header | rice_bitstream |
+--------+--------------------+
```
The `header` is fixed-structure (variable length because the coefficient
array depends on `prediction_order`). The `rice_bitstream` is a bit-packed
payload; its byte length is `ceil(total_rice_bits / 8)` with zero padding in
the low bits of the last byte.
Decoder input is the complete frame. There is no intra-frame continuation or
fragmentation — the transport layer handles that.
## 3. Frame Header
```text
Offset Size Field Type Constraint
------ ---- -------------------- ------- ----------------------------
0-1 2 sync_word u16 BE == 0x1ACC
2 1 prediction_order u8 ∈ [0, 32]
3 1 partition_order u8 ∈ [0, 7]
4 1 coefficient_shift u8 ∈ [0, 5]
5-6 2 frame_sample_count u16 BE ≥ 1, % (1 << partition_order) == 0
7+ 2·p lpc_coefficients i16 BE[] length = prediction_order = p
```
Total header length: `7 + 2 · prediction_order` bytes.
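A non-normative Python sketch of header parsing under the constraints of §3.1–§3.5 (the reference API's `parse_header` differs in shape; the dict keys here just mirror the table):

```python
import struct

SYNC_WORD = 0x1ACC

def parse_header(data: bytes):
    """Parse and validate a LAC frame header.
    Returns (header, bytes_consumed); raises ValueError with the
    matching §6 rejection class on any malformed header."""
    if len(data) < 7:
        raise ValueError("Truncated")
    sync, order, part_order, shift, count = struct.unpack_from(">HBBBH", data)
    if sync != SYNC_WORD:
        raise ValueError("SyncMismatch")
    if order > 32:
        raise ValueError("InvalidParameter: prediction_order")
    if part_order > 7:
        raise ValueError("InvalidParameter: partition_order")
    if shift > 5:
        raise ValueError("InvalidParameter: coefficient_shift")
    if order == 0 and shift != 0:
        raise ValueError("InvalidParameter: verbatim frame with non-zero shift")
    if count == 0:
        raise ValueError("InvalidParameter: frame_sample_count == 0")
    if count % (1 << part_order) != 0:
        raise ValueError("InvalidParameter: partition divisibility")
    end = 7 + 2 * order
    if len(data) < end:
        raise ValueError("Truncated")
    coeffs = list(struct.unpack_from(f">{order}h", data, 7)) if order else []
    return {"prediction_order": order, "partition_order": part_order,
            "coefficient_shift": shift, "frame_sample_count": count,
            "lpc_coefficients": coeffs}, end
```

For example, an order-2 header with `frame_sample_count = 4` and coefficients `[100, -50]` occupies `7 + 2·2 = 11` bytes.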
### 3.1 `sync_word`
Fixed value `0x1ACC`. Present to identify a LAC frame on lightly framed
transports and to reject foreign payloads at the first check. Decoders
**must** reject any frame whose first two bytes are not `0x1ACC`.
### 3.2 `prediction_order`
Integer order of the LPC analysis filter used to produce residuals.
- Value `0` is **verbatim mode**: no prediction, residuals equal the samples,
and the `lpc_coefficients` array is empty (zero bytes).
- Values `1` through `32` are standard LPC orders; `lpc_coefficients` carries
exactly that many predictor coefficients, interpreted in the Q-format
determined by `coefficient_shift` (§3.4).
Decoders **must** reject values above 32.
### 3.3 `partition_order`
Controls how the residual stream is split for Rice coding.
- `partition_count = 1 << partition_order`.
- The residual stream is divided into `partition_count` equal partitions of
`frame_sample_count / partition_count` samples each.
Decoders **must** reject values above 7 and **must** reject frames where
`frame_sample_count` is not a multiple of `partition_count`.
### 3.4 `coefficient_shift`
Controls the fixed-point scale of the stored Q-format LPC predictor
coefficients. Coefficients are stored as 16-bit integers interpreted as
`Q(15 − coefficient_shift)`:
| shift | Q-format | Real-value range | Use case |
|-------|----------|-------------------|------------------------------------------------|
| 0 | Q15 | `[−1, 1)` | Coefficients with magnitude < 1 (most orders > 1) |
| 1 | Q14 | `[−2, 2)` | Low-frequency content, `a[1]` near 2 |
| 2 | Q13 | `[−4, 4)` | Extreme bass / narrow resonances |
| 3 | Q12 | `[−8, 8)` | Pathological transients |
| 4 | Q11 | `[−16, 16)` | Reserved for synthetic signals |
| 5 | Q10 | `[−32, 32)` | Upper bound; decoder rejects larger values |
The encoder **must** select the smallest `coefficient_shift` at which no
coefficient's real value exceeds the representable range for that shift —
i.e., the smallest scale that does not clamp. Smaller shifts give finer
precision and thus smaller residuals when no clamping is required.
If no `coefficient_shift ∈ [0, 5]` suffices (the real coefficient
magnitude exceeds the Q10 range at `shift = 5`), the encoder
saturates each offending coefficient independently to the i16 range
`[−32768, 32767]`. Bit-exact round-trip is preserved because the
decoder applies the synthesis formula to whatever 16-bit values the
wire carries; the cost of saturation is compression — the predictor
no longer matches the encoder's ideal coefficients, so residuals
grow. Real audio at the input-magnitude contract (§1) rarely reaches
this case; synthetic or adversarial inputs can force it.
Decoders **must** reject values above 5. The shift applies uniformly to
every coefficient in `lpc_coefficients`; there is no per-coefficient
scale.
When `prediction_order == 0` (verbatim frame), `coefficient_shift`
**must** be `0`. The shift only modifies how stored coefficients are
interpreted, and a verbatim frame stores none. Decoders **must**
reject frames with `prediction_order == 0` and `coefficient_shift != 0`
as malformed; this rule closes the space of legal but meaningless
headers so two implementations agree bit-for-bit on which inputs
round-trip.
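A non-normative sketch of shift selection and quantisation under this section. `select_shift` and `quantize` are illustrative names; the reference encoder's exact rounding is described in §7 and may differ from Python's `round`:

```python
def select_shift(real_coeffs) -> int:
    """Smallest shift in [0, 5] whose Q(15 - shift) range
    [-2**shift, 2**shift) holds every coefficient; returns 5 when
    none does (the encoder then saturates, per §3.4)."""
    for s in range(6):
        if all(-(1 << s) <= c < (1 << s) for c in real_coeffs):
            return s
    return 5

def quantize(real_coeffs, shift: int):
    """Scale to the stored i16 grid, saturating offending values
    independently to [-32768, 32767]."""
    scale = 1 << (15 - shift)
    return [max(-32768, min(32767, round(c * scale))) for c in real_coeffs]
```

A real coefficient of exactly `1.0` illustrates the boundary: Q15's range `[-1, 1)` excludes it, so the smallest legal shift is `1` and the wire value is `16384`.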
### 3.5 `frame_sample_count`
Number of audio samples produced by this frame (also the number of residuals
in the Rice bitstream). The value **must** be in `[1, 65535]`.
Decoders **must** reject `frame_sample_count == 0`: a zero-sample frame
trivially satisfies the partition-divisibility check below
(`0 mod n == 0` for any `n`) but carries no audio and has no legal
Rice payload.
For compliance with `partition_order`, the value **must** satisfy
`frame_sample_count mod (1 << partition_order) == 0`. Decoders **must**
reject frames where this does not hold.
### 3.6 `lpc_coefficients`
Array of `prediction_order` predictor coefficients, each a 16-bit big-endian
signed integer interpreted in `Q(15 − coefficient_shift)` format — see §3.4
for the shift semantics.
The wire format does not distinguish coefficients by derivation. The
synthesis formula below applies identically whether the encoder
obtained the values from Levinson-Durbin analysis, from a fixed
coefficient template (e.g. FLAC-style integer predictors), from a
trained model, or from any other strategy. What goes on the wire is
just `prediction_order` 16-bit integers; how the encoder chose them
is encoder-internal and not observable to the decoder.
Synthesis formula (applied in the decoder):
```text
s    = 15 - coefficient_shift
bias = 1 << (s - 1)
predict[i] = (Σ_{j=0..terms-1} coeff[j] · sample[i - j - 1] + bias) >> s
sample[i] = residual[i] + predict[i]
```
where `terms = min(i, prediction_order)`. The `+ bias` term implements
round-to-nearest for the right shift and is **required** for bit-exact
decoding. For the default `coefficient_shift = 0`: `s = 15`, `bias = 16384`.
The `>> s` operator **must** be an **arithmetic right shift** on
signed integers — equivalent to floor division by `2^s`. Combined with
the `+ bias` pre-add, this implements **round-half toward +∞**: on a
value whose scaled form is exactly `k + 0.5`, the result is `k + 1`
for both positive and negative `k`. Implementations using truncating
integer division (C's `/` on signed integers, which rounds toward
zero) **will diverge** from this on any `sum + bias` that is negative
and not evenly divisible by `2^s`: arithmetic shift rounds further
from zero, truncating division rounds toward zero. Concrete example:
at `s = 15`, `sum = -32769`, `bias = 16384`, arithmetic shift gives
`(-16385) >> 15 = -1`, truncating division gives `-16385 / 32768 = 0`.
Decoders in languages whose native integer division does not floor
**must** emulate arithmetic right shift explicitly on the signed
accumulator.
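The divergence is easy to reproduce. A non-normative Python sketch (Python's `>>` on negative integers already floors, so it models the spec's operator directly, while `truncating_div` models C's signed `/`):

```python
def arithmetic_shift(x: int, s: int) -> int:
    """The spec's operator: arithmetic right shift, i.e. floor
    division by 2**s. Python's >> on ints already behaves this way."""
    return x >> s

def truncating_div(x: int, s: int) -> int:
    """C-style signed division: rounds toward zero. NON-COMPLIANT,
    shown only to demonstrate the divergence."""
    q = abs(x) // (1 << s)
    return -q if x < 0 else q

# The spec's concrete case: s = 15, sum = -32769, bias = 16384.
val = -32769 + 16384  # -16385, negative and not divisible by 2**15
```

Here `arithmetic_shift(val, 15)` yields `-1` and `truncating_div(val, 15)` yields `0`, reproducing the divergence described above.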
#### Accumulator width
The inner sum `Σ coeff[j] · sample[i - j - 1]` **must** be computed in
a signed integer accumulator of at least **49 bits** (equivalently: an
`i64` or wider). Worst-case bounds at `prediction_order = 32`,
`coefficient_shift = 5` (Q10), and full-scale samples give a product
of magnitude `(2¹⁵) · (2²³ − 1) ≈ 2³⁸` per term, summed over 32 terms
for a maximum of `~2⁴³`. Adding the bias keeps the result below `2⁴⁴`.
A 32-bit accumulator overflows at orders ≥ 16 with full-scale inputs —
implementations that reach for `int32_t` because samples are 32-bit
will silently corrupt high-order frames.
JavaScript / TypeScript implementers should note that `Number` is an
IEEE 754 double, not a signed integer: its 53-bit safe-integer range
covers in-contract accumulator values, but adversarial bitstreams (see
§6.2) can produce out-of-contract samples whose synthesis arithmetic
lands in the 2⁴⁹–2⁵¹ range and beyond, where `Number` silently loses
low bits to float rounding. For bit-exact spec compliance in JS/TS,
**use `BigInt` for the accumulator** — it has the integer semantics
the spec requires; `Number` does not.
#### Warm-up (`terms == 0`)
When `i == 0`, `terms = min(0, prediction_order) = 0`. The sum is
empty and `predict[0] = 0` — the `(0 + bias) >> s` formula is **not**
applied. Stating this explicitly pins down the warm-up case: an
implementation that mechanically applies the formula would compute
`predict[0] = (0 + bias) >> s`, which under this spec's
`bias = 1 << (s - 1)` happens to evaluate to zero as well, but the
empty sum, not the shift arithmetic, is the normative definition.
For `0 < i < prediction_order`, the sum truncates to the available
`i` predecessors (`terms = i`). The formula applies as stated.
#### Sign convention for stored coefficients
The synthesis formula uses `+Σ`. Classical Levinson-Durbin
implementations that derive LPC from the error-prediction AR model
```text
x[n] = −Σ a[j] · x[n-j] + e[n] (error convention)
```
produce coefficients `a[j]` whose sign is the **opposite** of what
the synthesis formula expects; those encoders **must** negate before
quantisation so the wire value is `coeff[j-1] = -a[j]`.
Implementations using the predictor convention
```text
x̂[n] = +Σ c[j] · x[n-j] (predictor convention)
```
store `c[j]` directly.
Both conventions are common in DSP literature. Encoders **must**
verify that the coefficients emitted on the wire, when substituted
into the synthesis formula above, reproduce the encoder's own
prediction. The reference implementation uses the error convention
and negates at quantisation time.
#### Overflow semantics of the final add
The `residual[i] + predict[i]` add is specified as a **wrapping i32
add** (two's complement, modulo `2³²`; in languages without native
signed-overflow semantics, compute `(residual + predict) & 0xFFFFFFFF`
and then re-interpret as a signed 32-bit integer via sign-extension
of bit 31). On well-formed bitstreams — those produced by a compliant
encoder from in-contract samples (§1) — the result stays inside the
sample-magnitude contract and the wrap is never observable.
Adversarial bitstreams with crafted coefficients and residuals **may**
produce any `i32` value; the decoder **must not** panic, abort, or
reject on the basis of this add's result. The consequences of
out-of-contract decoder output are addressed in §6.2.
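Pulling §3.6 together, a non-normative Python sketch of the synthesis loop. Names are illustrative; Python's unbounded integers stand in for the wide accumulator, and `wrap_i32` models the wrapping final add:

```python
MASK32 = 0xFFFFFFFF

def wrap_i32(x: int) -> int:
    """Wrapping i32 add semantics: mask to 32 bits, then re-interpret
    as signed via sign-extension of bit 31."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def synthesize(residuals, coeffs, coefficient_shift: int):
    """LPC synthesis: warm-up truncation (terms = min(i, order)),
    bias rounding, arithmetic right shift, wrapping final add."""
    s = 15 - coefficient_shift
    bias = 1 << (s - 1)
    order = len(coeffs)
    samples = []
    for i, r in enumerate(residuals):
        terms = min(i, order)
        if terms == 0:
            predict = 0  # empty sum: the (0 + bias) >> s form is NOT applied
        else:
            acc = sum(coeffs[j] * samples[i - j - 1] for j in range(terms))
            predict = (acc + bias) >> s  # Python >> floors, as required
        samples.append(wrap_i32(r + predict))
    return samples
```

With `coefficient_shift = 0` and an empty coefficient array this degenerates to verbatim mode: every `predict[i]` is zero and the output equals the residuals.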
## 4. Rice Bitstream
Immediately follows the header. Flat MSB-first bitstream structured as
consecutive partition payloads:
```text
+---------+---------+-----+-----------+
| part. 0 | part. 1 | ... | part. P-1 |
+---------+---------+-----+-----------+
```
where `P = 1 << partition_order`. Each partition has the same structure:
```text
+-------+-------------+-------------+-----+-------------+
| k (5) | codeword 0 | codeword 1 | ... | codeword M-1|
+-------+-------------+-------------+-----+-------------+
```
where `M = frame_sample_count / P` is the per-partition residual count.
Partitions are **bit-contiguous**: the 5-bit `k` field of partition
`i + 1` begins at the bit immediately following the last codeword of
partition `i`. There is no byte alignment between partitions. Only
the final trailing padding described in §4.3 is byte-aligned.
Within a partition, the bit cursor likewise advances continuously:
codeword 0 begins at the bit immediately following the 5-bit `k`
field, codeword 1 immediately after codeword 0's remainder bit, and so
on. A conformant decoder maintains a single bit-read position across
the entire Rice bitstream — from the `k` of partition 0 through the
last codeword of partition P−1 — and never realigns to a byte or bit
boundary between fields. This is implicit in the byte-stream decoder
design (a bit reader that consumes bits sequentially needs no
special handling at field boundaries) but stated here so second-team
implementations do not introduce a spurious alignment.
### 4.1 Per-Partition Parameter `k`
Five-bit unsigned integer, MSB-first, immediately before the partition's
codewords. `k` is the Rice parameter for this partition and must be in
`[0, 23]`. Decoders **must** reject values above 23 as malformed.
### 4.2 Codeword
Each residual is encoded by:
1. **Zigzag mapping** from signed to unsigned:
```text
z = (r << 1) ^ (r >> 31) interpreted as u32
```
where `(r >> 31)` is an **arithmetic right shift** on the i32
residual, sign-extending the sign bit to all 32 bit positions — `0`
for non-negative `r`, `-1` (all ones) for negative `r`. The entire
expression is **masked to 32 bits** before being interpreted as
`u32`; in languages with arbitrary-precision integers (e.g. Python)
or where native bitwise ops return signed 32-bit (e.g. JavaScript /
TypeScript `Number`, where `(x ^ y) >>> 0` coerces to u32), this
mask is explicit (`& 0xFFFFFFFF` or `>>> 0`) and **required** —
without it, negative residuals produce zigzag values with extra
high bits set. Implementations in languages whose native right
shift is logical on unsigned types **must** coerce `r` to a signed
32-bit type first; implementations in languages where `>>` on
signed types is implementation-defined (e.g. pre-C++20 C/C++)
**must** emulate the arithmetic shift explicitly.
The `r << 1` factor is always safe on i32. Residuals on the
encoder side are bounded by `|r| ≤ |sample| + |predict| ≤ 2·2²³
≈ 2²⁴`, so `r << 1` has magnitude `≤ 2²⁵` and fits in i32 without
overflow or undefined behaviour — even for `r` at the most
negative value the encoder can ever produce from in-contract
input (§1). Decoder-side code does not perform this shift; the
inverse uses `z >> 1` on a u32, which is always defined.
The mapping sends
```text
{0, -1, 1, -2, 2, -3, 3, …} → {0, 1, 2, 3, 4, 5, 6, …}
```
so small magnitudes of either sign map to small unsigned values.
The decoder's inverse is
```text
r = ((z >> 1) as i32) ^ (-((z & 1) as i32))
```
where the `z >> 1` is a logical (unsigned u32) shift and the
`-((z & 1) as i32)` term is ordinary integer negation. Stating this inverse
explicitly removes any ambiguity about how an implementation must
invert the zigzag.
2. **Rice code** at parameter `k`:
- **Unary part**: `q = z >> k` zero-bits followed by a single
terminating 1-bit.
- **Remainder part**: `k` bits of `z & ((1 << k) - 1)`, MSB-first.
(The remainder part is absent when `k == 0`.)
Total codeword length: `q + 1 + k` bits.
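A non-normative Python sketch of both steps. The explicit `& 0xFFFFFFFF` mask is the one the zigzag step calls out as required in arbitrary-precision languages; `rice_encode` returns the codeword as an MSB-first bit list for clarity:

```python
MASK32 = 0xFFFFFFFF

def zigzag(r: int) -> int:
    """Signed i32 residual -> u32. Python ints are arbitrary
    precision, so the 32-bit mask is explicit and required."""
    return ((r << 1) ^ (r >> 31)) & MASK32

def unzigzag(z: int) -> int:
    """Inverse mapping: u32 -> signed residual."""
    return (z >> 1) ^ -(z & 1)

def rice_encode(z: int, k: int):
    """Bits (MSB-first) of one codeword: q zero-bits, a terminating
    1-bit, then k remainder bits."""
    q = z >> k
    bits = [0] * q + [1]
    for b in range(k - 1, -1, -1):  # remainder, MSB-first
        bits.append((z >> b) & 1)
    return bits
```

For example, `z = 9` at `k = 2` gives quotient `q = 2` and remainder `1`, i.e. five bits total, matching the `q + 1 + k` length formula.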
#### Decoder-side unary-run bound
Decoders **must** reject any codeword whose unary run length satisfies
```text
q > (2³² - 1) >> k     (equivalently, q > u32::MAX >> k)
```
A valid codeword reconstructs `z = (q << k) | remainder` as a u32.
`q > u32::MAX >> k` implies `q << k ≥ 2³²`, which either overflows
u32 silently (a critical decoder bug class — corrupt output with no
error) or indicates a malformed stream. Either way the frame **must**
be discarded with `InvalidParameter` or equivalent rejection class
(§6). The bound varies with `k`: at `k = 23` it is `511`, at `k = 0`
it is `u32::MAX` (no practical constraint).
This rule also caps the CPU cost of unary scanning on adversarial
input: without the cap, a decoder could be forced to scan an
arbitrarily long run of zero bits before reaching either a `1` or
the buffer end.
### 4.3 Byte Padding
After the last codeword of the last partition, any unused bits of the final
byte are zero-padded on the LSB side. The encoder writes `0` for all padding
bits; the decoder ignores them.
### 4.4 Bitstream Length
The Rice bitstream's total bit length is the sum of codeword bit
lengths across every partition, plus 5 bits per partition for the `k`
fields:
```text
total_bits = P · 5 + Σ_{all residuals} (q + 1 + k_partition)
```
This depends on every residual's quotient and cannot be computed
from the frame header alone. A decoder **streams-decodes** until it
has produced exactly `frame_sample_count` samples, then stops.
Any bits remaining inside the last byte are padding (§4.3) and
carry no information.
A decoder **must not** require the Rice bitstream's byte length to be
signalled out-of-band. The header plus the zero-padded byte-aligned
tail fully determines the frame boundary; `parse_header`'s
`bytes_consumed` return plus streaming Rice decode locates the end of
the frame in the input buffer.
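A non-normative sketch of a streaming decoder for the Rice payload: one MSB-first bit cursor across the whole bitstream, the 5-bit `k` fields, the unary-run cap of §4.2, and the `Truncated` rejection. Class and function names are illustrative; it returns the zigzag-mapped values, with §4.2's inverse mapping and §3.6's synthesis applied afterwards:

```python
U32_MAX = 0xFFFFFFFF

class BitReader:
    """Single MSB-first bit cursor; never realigns between fields."""
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0
    def bit(self) -> int:
        byte, off = divmod(self.pos, 8)
        if byte >= len(self.data):
            raise ValueError("Truncated")
        self.pos += 1
        return (self.data[byte] >> (7 - off)) & 1
    def bits(self, n: int) -> int:
        v = 0
        for _ in range(n):
            v = (v << 1) | self.bit()
        return v

def decode_rice(payload: bytes, partition_order: int, sample_count: int):
    """Stream-decode exactly sample_count zigzag values, then stop;
    any bits left in the final byte are padding."""
    rdr = BitReader(payload)
    partitions = 1 << partition_order
    per_partition = sample_count // partitions
    out = []
    for _ in range(partitions):
        k = rdr.bits(5)
        if k > 23:
            raise ValueError("InvalidParameter: k out of range")
        cap = U32_MAX >> k  # unary-run bound from §4.2
        for _ in range(per_partition):
            q = 0
            while rdr.bit() == 0:
                q += 1
                if q > cap:
                    raise ValueError("InvalidParameter: unary run cap")
            remainder = rdr.bits(k)
            out.append((q << k) | remainder)
    return out
```

As a worked byte: a single partition with `k = 0` and three zero-valued residuals is the bits `00000 1 1 1`, i.e. the one padded byte `0x07`, which also illustrates the `5 + frame_sample_count`-bit minimum of §5.1.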
## 5. Degenerate Cases
### 5.1 All-Zero Frame
For an all-zero sample vector, the encoder **must** use
`prediction_order = 0` because the Levinson-Durbin recursion is
undefined at `R[0] = 0`. Residuals equal the input (all zeros).
Partition-order and per-partition `k` selection remain at the
encoder's discretion; any legal `(partition_order, k)` combination
produces a bit-exact-decodable frame. The minimum-cost choice —
`partition_order = 0`, `k = 0` — produces a Rice payload of exactly
`5 + frame_sample_count` bits. Compliant encoders are **not**
required to pick this minimum.
### 5.2 Single-Sample Frame
`frame_sample_count = 1` is valid but forces `partition_order = 0` (the only
value that divides 1 evenly). The single sample is Rice-coded directly
because no predecessors exist for any LPC order.
## 6. Error Recovery
Decoders **must** detect and reject every frame that violates the
constraints elsewhere in this document. The exhaustive list of
rejection classes, each of which is a distinct error condition so
callers can distinguish them in telemetry, is:
1. **Sync word mismatch** — bytes `0-1` differ from `0x1ACC` (§3.1).
2. **`prediction_order` out of range** — value > `32` (§3.2).
3. **`partition_order` out of range** — value > `7` (§3.3).
4. **`coefficient_shift` out of range** — value > `5` (§3.4).
5. **Verbatim frame with non-zero shift** — `prediction_order == 0` and
`coefficient_shift != 0` (§3.4).
6. **`frame_sample_count == 0`** (§3.5).
7. **`frame_sample_count` not divisible by `partition_count`** (§3.3).
8. **Buffer truncated** — fewer bytes than the header plus coefficient
array requires, or fewer bits than the Rice bitstream demands
during streaming decode. This class is intentionally coarse-grained:
a single `Truncated` variant covers header truncation, missing `k`
fields, mid-codeword exhaustion, and every other "buffer ends early"
shape the decoder can encounter. Sub-categorising these provides no
caller benefit — the recovery action (discard and substitute
silence, see §6.1) is identical regardless of where truncation
happened.
9. **Per-partition `k` out of range** — value > `23` (§4.1).
10. **Unary-run cap exceeded** — any codeword with `q > u32::MAX >> k`
(§4.2).
On any of these, the decoder **must** discard the frame, produce no
output samples, and signal the error to the caller. **No partial
state may propagate** to the next frame's decode — frames are
independent (§2), so subsequent frames decode cleanly regardless.
### 6.1 Caller-side silence substitution
On rejection, the caller substitutes `frame_sample_count` zeros
(silence) for the frame period. The count is obtained as follows:
- **Post-header rejections** (classes 8-10 above — `Truncated` in the
Rice bitstream, `InvalidParameter` during Rice decode): the frame
header parsed successfully before the failure, so the count is
recoverable. The caller re-parses just the header on the same buffer
(reference API: `parse_header(data)`) and reads `frame_sample_count`
from the resulting `AudioFrameHeader`.
- **Pre-header rejections** (classes 1-7 above): the header itself
failed; the frame length is not recoverable from the bitstream. The
caller **must** fall back to a session-level default frame size
carried out-of-band by the container or transport (WebRTC and QUIC
audio sessions typically negotiate this at session setup).
This asymmetry is inherent to the wire format: `frame_sample_count`
lives inside the header at offset 5, so any rejection that happens
while parsing bytes 0-4 precedes its discovery.
### 6.2 Decoder output magnitude
On well-formed bitstreams produced by a compliant encoder from
in-contract samples (§1), decoder output satisfies
`|sample| ≤ 2²³ − 1`.
Adversarial bitstreams — those with hand-crafted coefficients and
residuals that pass every rejection check in this section yet
produce arithmetic results outside the sample-magnitude contract —
**may** produce output samples of any `i32` value, including values
that exceed `2²³ − 1`. The decoder **must not** panic or reject on
this basis: the wrapping-add semantics of §3.6 are precisely what
makes every bit sequence produce a defined output, which is the
ground of the "no partial state propagates" contract at the top of
this section.
Callers that re-feed decoder output into LAC's encoder (for example,
an MCU decode → PCM mix → re-encode pipeline) **should** validate or
clamp to the input magnitude contract before re-encoding. A
compliant encoder assumes its input satisfies `|sample| ≤ 2²³ − 1`
and is not required to re-validate.
## 7. Encoder Guidance (non-normative)
The reference encoder's search has three phases:
```text
# Phase 0: all-zero short-circuit
R[0] = Σ sample[i]²
if R[0] == 0:
emit frame with prediction_order = 0, any legal partition_order (§5.1)
return
# Phase 1: sparse LPC order grid with stop-when-stale early-out
for prediction_order in [0, 2, 4, 6, 8, 10, 12, 16, 20, 24, 28, 32]:
coeffs_q31 = levinson_durbin(samples, prediction_order) # cached
shift = smallest s such that every |coeff_real| < 2^s
coeffs_stored = quantize(coeffs_q31, to: Q(15 - shift))
residuals = compute_residuals(samples, coeffs_stored, shift)
for partition_order in 0..=7:
if frame_sample_count % (1 << partition_order) != 0: continue
rice_bits = estimate_cost(residuals, partition_order)
total = header_bits(prediction_order) + rice_bits
track minimum over (prediction_order, partition_order)
if no improvement for 2 consecutive grid entries: break
# Phase 2: fixed-predictor post-pass
for (fp_order, fp_coeffs, fp_shift) in FIXED_PREDICTORS:
residuals = compute_residuals(samples, fp_coeffs, fp_shift)
for partition_order in 0..=7:
if frame_sample_count % (1 << partition_order) != 0: continue
rice_bits = estimate_cost(residuals, partition_order)
total = header_bits(fp_order) + rice_bits
track minimum over (fp_order, partition_order)
emit frame with the (order, partition_order) that minimised `total`
```
The sparse grid + early-out is a speed/compression trade-off; a
compliant encoder may still exhaustively search every integer order
`0..=32` for marginal gains at higher cost. The produced bitstreams
are interchangeable.
The fixed-predictor post-pass tries FLAC-style integer predictors
(orders 1-4 with a small static coefficient table) after the LPC
grid. These evaluate quickly and occasionally beat the Levinson-
Durbin winner on content where a low-order integer polynomial fits
better than the statistically-optimal LPC fit — silent-plus-DC, very
smooth tones, polynomial-ish sensor data. Running them second avoids
tripping the stop-when-stale heuristic in the LPC phase.
The reference encoder's `FIXED_PREDICTORS` table, materialising the
FLAC-style `[1]`, `[2, -1]`, `[3, -3, 1]`, `[4, -6, 4, -1]` integer
predictors at the smallest `coefficient_shift` that represents each
coefficient without clamping:
| `prediction_order` | `lpc_coefficients` (Q-format integers) | `coefficient_shift` | Real-value interpretation |
|-------------------:|---------------------------------------------|--------------------:|------------------------------|
| 1 | `[16384]` | 1 (Q14) | `[1]` |
| 2 | `[16384, -8192]` | 2 (Q13) | `[2, -1]` |
| 3 | `[24576, -24576, 8192]` | 2 (Q13) | `[3, -3, 1]` |
| 4 | `[16384, -24576, 16384, -4096]` | 3 (Q12) | `[4, -6, 4, -1]` |
These are the exact wire-format bytes a second-team encoder would emit
to match the reference's fixed-predictor outputs bit-for-bit. Compliant
encoders MAY use a different set (or none), since §3.6 treats the
coefficient field as opaque — decoders apply the synthesis formula
identically regardless of source.
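The table rows can be re-derived mechanically: each wire integer is the real coefficient times `2^(15 − coefficient_shift)`. A non-normative check (`FIXED` and `to_wire` are illustrative names, not reference identifiers):

```python
# FLAC-style fixed-predictor taps, by prediction order.
FIXED = {1: [1], 2: [2, -1], 3: [3, -3, 1], 4: [4, -6, 4, -1]}

def to_wire(real_coeffs, shift: int):
    """Real-valued taps -> stored Q(15 - shift) i16 integers."""
    scale = 1 << (15 - shift)
    return [c * scale for c in real_coeffs]
```

Each result stays inside i16 without saturation, confirming the table's shift choices are the smallest clamping-free ones.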
The `R[0] == 0` short-circuit is both a correctness requirement (§5.1
— Levinson-Durbin is undefined at zero autocorrelation) and an
encoder-cost optimisation: on digital silence, the sparse grid and
fixed-predictor pass produce identical zero residuals and order 0
wins on header size alone.
Levinson-Durbin runs once to order 32 with all intermediate orders saved
into a flat buffer (one recursion pass yields all orders 1..32 at
`O(order²)` cost), so the outer loop fetch is free and order selection
is effectively `O(orders_tried × N)`.
`shift` is determined per order by the coefficient magnitudes — there
is no shift search, as smaller shifts are always at least as good as
larger ones when they don't clamp (saturation, §3.4, is the
fallback). Rice cost at a given `partition_order` is exact and
closed-form given the per-partition `k`, so the inner search
introduces no estimation error.
#### Levinson-Durbin numerical choices (reference, non-normative)
The reference encoder runs Levinson-Durbin with i64 autocorrelation
accumulators, Q31 working coefficients, and widens to i128 for
reflection-coefficient intermediates at orders where Q31 would lose
precision (typically above order ~12). Rounding on the Q31→Q(15−shift)
quantisation step is round-half-up via `(a_q31 + bias) >> shift_amt`,
with `bias = 1 << (14 + shift)` — the direct analogue of the synthesis
formula's rounding (§3.6), chosen so analysis and synthesis agree on
tie-break direction.
None of these choices are normative. Two encoders making different
precision or rounding choices will produce different coefficient
bytes on the same input, but both bitstreams decode correctly under
§3.6 so long as the coefficients they emit faithfully represent their
own LPC decision. §3.6's "wire format doesn't distinguish coefficients
by derivation" clause is exactly what permits this freedom.
#### Rice `k` selection (reference, non-normative)
Cost is closed-form for a given `(partition, k)`: `bit_cost(k) =
N · (1 + k) + Σ (v >> k)` where `v` ranges over the zigzag-mapped
residuals of the partition. Exhaustive search over `k ∈ [0, 23]` is
always acceptable and is the simplest compliant choice.
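A non-normative sketch of the closed-form cost and the exhaustive search. Python's `min` is stable, which yields the first-wins tie-break described below:

```python
def bit_cost(zigzag_vals, k: int) -> int:
    """Exact Rice bit cost of one partition at parameter k:
    N * (1 + k) unary-terminator and remainder bits, plus the
    summed quotients."""
    return len(zigzag_vals) * (1 + k) + sum(v >> k for v in zigzag_vals)

def best_k(zigzag_vals) -> int:
    """Exhaustive search over k in [0, 23], first-wins on ties."""
    return min(range(24), key=lambda k: bit_cost(zigzag_vals, k))
```

For the values `[8, 9, 10, 11]`, `k = 2`, `3`, and `4` all cost 20 bits; the exhaustive search returns `k = 2`, the first of the tied minima.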
The reference encoder uses convex descent: seed `k_seed =
⌊log₂(mean(v))⌋`, then walk either direction. Ties break **toward
smaller `k`** on descent (condition `cost ≤ best_cost`, so equal
costs at `k-1` overwrite `k`) and **strictly larger costs only** on
ascent (condition `cost < best_cost`, so equal costs at `k+1` don't
override `k`). This yields a unique `k` per partition matching an
exhaustive search's first-wins tie-break.
On the reference corpus, the sparse grid's compression matches
exhaustive-search output within ~0.2 percentage points (measured);
the differential test caps the acceptable excess at 0.5%.
Implementations that prefer tighter compression at higher encode
cost can extend the grid without wire-format consequences — decoders
do not care which orders an encoder tried.
### 7.1 Frame size (non-normative)
The codec accepts any `frame_sample_count` in `[1, 65535]`. Larger
frames amortise the 7-byte fixed header and the LPC coefficient vector
over more samples, and generally compress better. Smaller frames give
tighter latency and finer partition-order granularity on transient
content.
Recommended defaults by use case:
- **Real-time voice** (QUIC streaming, MCU mix): 160-320 samples at
16 kHz, or 480-960 at 48 kHz — matches 10-20 ms frame periods used
by Opus / WebRTC.
- **Real-time full-band** (game audio, music conferencing): 1024-2048
at 48 kHz (21-43 ms).
- **Offline / archival**: 4096-8192 at 48 kHz; compression gains
flatten past this.
Power-of-two frame sizes expose every `partition_order ∈ [0, 7]` to
the encoder's search. Non-power-of-two sizes restrict the search to
partition counts that divide evenly; a prime frame size forces
`partition_order = 0`. Encoders SHOULD prefer frame sizes with several
small-prime factors when free to choose.
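A non-normative helper showing which partition orders the §3.3 divisibility rule leaves available for a given frame size:

```python
def legal_partition_orders(frame_sample_count: int):
    """Partition orders in [0, 7] whose partition count divides the
    frame size evenly (the §3.3 constraint)."""
    return [p for p in range(8)
            if frame_sample_count % (1 << p) == 0]
```

A 4096-sample frame exposes all eight orders; 480 samples (10 ms at 48 kHz) stop at order 5; a prime size such as 7 forces order 0.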
## 8. Versioning
This document specifies **LAC version 1**, identified on the wire by
`sync_word = 0x1ACC`. No per-frame version byte is carried; the sync
word uniquely identifies the wire format.
Future revisions of the format **must** use a distinct `sync_word`. The
recommended allocation is `0x1ACD` for v2, `0x1ACE` for v3, and so on
inside the `0x1ACC..0x1ACF` cluster, with the cluster boundary making
casual grep / hex-dump inspection robust. A revision whose wire format
cannot be made compatible with v1 at the header level **must** pick a
sync word outside this cluster.
This approach is exhaustive by construction:
- A v1 decoder that encounters a v2 frame sees an unrecognised sync
word on the first check (§3.1) and rejects cleanly — the same error
path as foreign or corrupted payloads.
- A v2-aware decoder dispatches on the sync word before reading any
further field, so it can fall back to v1 parsing when appropriate
or decode v2 frames natively.
Because every field in §3 has a strict range with no reserved
high-bit patterns, in-place extension (flag bits inside existing
fields) is **not** a supported evolution path. New features go into a
new `sync_word`, not into reinterpreting existing field values.
Transports that multiplex LAC frames with other formats should frame
each LAC payload explicitly (length prefix or stream separator); the
sync word alone is not a framing delimiter, only a format identifier.
## 9. Implementation notes (non-normative)
### 9.1 GPU offload is out of scope
LAC is a scalar integer codec. The reference implementation, and any
conforming implementation this document anticipates, runs on a CPU.
GPU offload is deliberately not a goal:
- **Levinson-Durbin** is serial by construction (each iteration depends
on the previous) and its intermediate accumulator needs more than 64
bits of precision at higher orders — a poor fit for WGSL or SPIR-V
compute shaders, which have no native 128-bit integer arithmetic.
- **Rice decode** uses a data-dependent unary run for every residual;
on GPU execution models this diverges warps badly and its
sequential bit-cursor progression fights SIMD lane packing.
- **LPC synthesis** has a tight per-sample feedback loop (sample `i`
depends on samples `i-1`, `i-2`, …, `i-order`), so each channel is
inherently serial.
- **The one plausibly GPU-parallel phase** — residual computation
inside the encoder's order search — is also the phase where the
CPU's autovectorized implementation is already well-served by SIMD
on any modern target. At the measured encode latencies (P99 under
50 µs on x86 for a 20 ms frame period, >400× headroom), there is no
motivation to offload it.
A hypothetical future revision whose hot path genuinely benefited from
GPU execution (large-batch archival encoding across many channels at
once, for instance) would need to change the wire format to carry
enough shape metadata for batched kernels — i.e., a new sync word
under the versioning rules in §8, not a retrofit.
## 10. Conformance test vectors
The reference repository's `tests/conformance.rs` holds the canonical
test-vector set for this specification:
- **`DECODE_FIXTURES`** — `(samples, bytes)` pairs pinned at the byte
level. A conformant decoder **must** produce the `samples` array
when fed the `bytes` array. Encoders have latitude (§3.6, §7), so a
second-team encoder's bytes for the same samples may differ; the
decoder direction is the normative one. Coverage includes the
smallest valid frames (single-sample verbatim, 4- and 8-sample
silence), single-sample polarity boundaries (±1, ±(2²³−1)), DC
offset, alternating-polarity Nyquist-like content, smooth polynomial
(fixed-predictor territory), and a 16-sample growing-amplitude
pattern exercising partition search.
- **`REJECT_FIXTURES`** — hand-constructed malformed inputs mapped to
their expected rejection variants. Covers every class in §6 (1-10):
bad sync, each field-range violation, verbatim + non-zero shift,
`frame_sample_count == 0`, non-divisible partition count, header /
coefficient / Rice-bitstream truncation, per-partition `k > 23`.
- **`reject_unary_run_above_cap`** — a programmatic test for §6 class
10 (`q > u32::MAX >> k`). The minimal triggering payload is ~75
bytes of mostly zeros; construction logic is in the test, not a
const fixture.
Second-team implementations should port the decode fixtures
byte-for-byte and the reject fixtures byte-and-variant-for-variant.
`encode_matches_fixtures` in the same file is reference-specific (it
asserts the reference encoder's exact bytes) and is **not** a
conformance requirement — see §3.6's encoder-latitude clause.
### 10.1 Reference encoder exemplars (non-normative)
The same `(samples, bytes)` pairs, read in the encoder direction,
serve as **reference-encoder validation targets** for implementations
that want to match this project's reference byte-for-byte — a common
goal for porting work even though the spec does not require it. The
fixture set is deliberately chosen to pin every encoder-discretion
axis from §7:
- `single_zero`, `single_pos_one`, `single_neg_one` — single-sample
frames. All three fall in the §5.2 "warm-up-is-whole-frame" regime
where the encoder's order choice is nearly arbitrary; pinning the
bytes fixes this project's choice (order 0 with the minimum-cost
Rice encoding).
- `silence_4`, `silence_8` — force `partition_order` tie-breaks on an
all-zero frame (every `partition_order ∈ [0, log₂(N)]` produces
identical cost; the reference picks the smallest via its convex-
descent tie-break).
- `dc_100_4`, `alternating_small_4` — exercise the order-vs-verbatim
decision. DC content favours a low-order LPC fit with small
residuals; alternating content favours order 1 with `a = -1`
(approximated at the closest Q-format). Pinning the bytes fixes
the reference's decision boundary.
- `single_full_scale_pos`, `single_full_scale_neg` — maximum-magnitude
single samples. Exercise the `|sample| ≤ 2²³ − 1` boundary on both
sides and fix the zigzag-of-extremum output.
- `linear_ramp_8` — smooth polynomial content, fixed-predictor
territory. Pins the reference's fixed-predictor-vs-LPC tie-break.
- `lfsr_noise_16` — exercises partition search on a frame large
enough for `partition_order > 0` to be competitive.
A second-team encoder that produces the same bytes for every entry
here is **likely** (not guaranteed) to produce matching bytes on
wider inputs, since the tie-break axes are the ones most sensitive
to encoder discretion. An encoder that produces different bytes is
still compliant so long as its own bytes round-trip — see §3.6, §7.