initial commit
Signed-off-by: Kamal Tufekcic <kamal@lo.sh>
commit
7862cb1d9d
2884 changed files with 16797 additions and 0 deletions
784
Specification.md
Normal file
# LAC Wire Format

Normative specification of the LAC bitstream. This document is the authority
on byte layout, field semantics, and encoder/decoder constraints.

## 1. Conventions

- All multi-byte integer fields are **big-endian**.
- Bit streams are **MSB-first**: the first bit written occupies bit 7 of its
  byte, subsequent bits fill lower positions, and a new byte begins once
  eight bits have been emitted.
- Samples are **signed integers** passed as `i32` with magnitude bounded
  by `|sample| ≤ 2²³ − 1`. The upper 9 bits of each `i32` must be a
  consistent sign extension of the 24-bit-magnitude value. Narrower source
  formats (8-bit, 16-bit, 20-bit integer PCM) are passed through directly
  — they trivially satisfy the magnitude bound — and compress at the bit
  cost of their actual values, not a 24-bit ceiling. The codec does not
  carry bit-depth metadata; the container or application layer is
  responsible for remembering the source format.
- Sample rate is **not** part of the bitstream. The container or transport
  carries it.
- A frame encodes **one channel**. Stereo is two independent streams of
  frames, one per channel.
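
The MSB-first bit order described above can be sketched as a small bit writer. This is a minimal illustration, not the reference implementation; `BitWriter`, `push_bit`, `push_bits`, and `into_bytes` are hypothetical names introduced here.

```rust
/// Hypothetical MSB-first bit writer (illustrative names, not the reference API).
pub struct BitWriter {
    bytes: Vec<u8>,
    bit_count: usize, // total number of bits written so far
}

impl BitWriter {
    pub fn new() -> Self {
        BitWriter { bytes: Vec::new(), bit_count: 0 }
    }

    /// Append one bit. The first bit of each byte lands in bit 7,
    /// later bits fill lower positions.
    pub fn push_bit(&mut self, bit: bool) {
        let offset = self.bit_count % 8;
        if offset == 0 {
            // Eight bits have been emitted (or nothing yet): start a new byte.
            self.bytes.push(0);
        }
        if bit {
            if let Some(last) = self.bytes.last_mut() {
                *last |= 0x80u8 >> offset;
            }
        }
        self.bit_count += 1;
    }

    /// Append the low `n` bits of `value` (n <= 32), most significant first.
    pub fn push_bits(&mut self, value: u32, n: u32) {
        for i in (0..n).rev() {
            self.push_bit((value >> i) & 1 == 1);
        }
    }

    /// Finish writing. Unused low bits of the last byte stay zero.
    pub fn into_bytes(self) -> Vec<u8> {
        self.bytes
    }
}
```

Writing the three bits `1, 0, 1` yields the single byte `0xA0`: the bits occupy positions 7, 6, and 5, and the remaining low bits are zero.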

## 2. Frame Layout

A frame is a contiguous byte sequence:

```text
+--------+--------------------+
| header |   rice_bitstream   |
+--------+--------------------+
```

The `header` is fixed-structure (variable length because the coefficient
array depends on `prediction_order`). The `rice_bitstream` is a bit-packed
payload; its byte length is `ceil(total_rice_bits / 8)` with zero padding in
the low bits of the last byte.

Decoder input is the complete frame. There is no intra-frame continuation or
fragmentation — the transport layer handles that.

## 3. Frame Header

```text
Offset  Size  Field                 Type      Constraint
------  ----  --------------------  -------   ----------------------------
0-1     2     sync_word             u16 BE    == 0x1ACC
2       1     prediction_order      u8        ∈ [0, 32]
3       1     partition_order       u8        ∈ [0, 7]
4       1     coefficient_shift     u8        ∈ [0, 5]
5-6     2     frame_sample_count    u16 BE    ≥ 1, % (1 << partition_order) == 0
7+      2·p   lpc_coefficients      i16 BE[]  length = prediction_order = p
```

Total header length: `7 + 2 · prediction_order` bytes.

### 3.1 `sync_word`

Fixed value `0x1ACC`. Present to identify a LAC frame on lightly framed
transports and to reject foreign payloads at the first check. Decoders
**must** reject any frame whose first two bytes are not `0x1ACC`.

### 3.2 `prediction_order`

Integer order of the LPC analysis filter used to produce residuals.

- Value `0` is **verbatim mode**: no prediction, residuals equal the samples,
  and the `lpc_coefficients` array is empty (zero bytes).
- Values `1` through `32` are standard LPC orders; `lpc_coefficients` carries
  exactly that many predictor coefficients, interpreted in the Q-format
  determined by `coefficient_shift` (§3.4).

Decoders **must** reject values above 32.

### 3.3 `partition_order`

Controls how the residual stream is split for Rice coding.

- `partition_count = 1 << partition_order`.
- The residual stream is divided into `partition_count` equal partitions of
  `frame_sample_count / partition_count` samples each.

Decoders **must** reject values above 7 and **must** reject frames where
`frame_sample_count` is not a multiple of `partition_count`.

### 3.4 `coefficient_shift`

Controls the fixed-point scale of the stored Q-format LPC predictor
coefficients. Coefficients are stored as 16-bit integers interpreted as
`Q(15 − coefficient_shift)`:

| shift | Q-format | Real-value range  | Use case                                          |
|-------|----------|-------------------|---------------------------------------------------|
| 0     | Q15      | `[−1, 1)`         | Coefficients with magnitude < 1 (most orders > 1) |
| 1     | Q14      | `[−2, 2)`         | Low-frequency content, `a[1]` near −2             |
| 2     | Q13      | `[−4, 4)`         | Extreme bass / narrow resonances                  |
| 3     | Q12      | `[−8, 8)`         | Pathological transients                           |
| 4     | Q11      | `[−16, 16)`       | Reserved for synthetic signals                    |
| 5     | Q10      | `[−32, 32)`       | Upper bound; decoder rejects larger values        |

The encoder **must** select the smallest `coefficient_shift` at which no
coefficient's real value exceeds the representable range for that shift —
i.e., the smallest scale that does not clamp. Smaller shifts give finer
precision and thus smaller residuals when no clamping is required.

If no `coefficient_shift ∈ [0, 5]` suffices (the real coefficient
magnitude exceeds the Q10 range at `shift = 5`), the encoder
saturates each offending coefficient independently to the i16 range
`[−32768, 32767]`. Bit-exact round-trip is preserved because the
decoder applies the synthesis formula to whatever 16-bit values the
wire carries; the cost of saturation is compression — the predictor
no longer matches the encoder's ideal coefficients, so residuals
grow. Real audio at the input-magnitude contract (§1) rarely reaches
this case; synthetic or adversarial inputs can force it.

Decoders **must** reject values above 5. The shift applies uniformly to
every coefficient in `lpc_coefficients`; there is no per-coefficient
scale.

When `prediction_order == 0` (verbatim frame), `coefficient_shift`
**must** be `0`. The shift only modifies how stored coefficients are
interpreted, and a verbatim frame stores none. Decoders **must**
reject frames with `prediction_order == 0` and `coefficient_shift != 0`
as malformed; this rule closes the space of legal but meaningless
headers so two implementations agree bit-for-bit on which inputs
round-trip.
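
The shift-selection and saturation rules above can be sketched as follows. This is a minimal illustration that assumes the encoder holds its real-valued coefficients as `f64`; the reference encoder works from Q31 integers (§7), so `choose_shift` and `quantize` are hypothetical helpers, not its API. Rounding is round-half-up, chosen to mirror the tie-break direction of the synthesis formula.

```rust
/// Smallest shift in [0, 5] whose range [-2^shift, 2^shift) holds every
/// coefficient. Falls back to 5 when none suffices (saturation case).
pub fn choose_shift(coeffs: &[f64]) -> u8 {
    for shift in 0u8..=5 {
        let limit = f64::from(1u32 << shift);
        if coeffs.iter().all(|&c| c >= -limit && c < limit) {
            return shift;
        }
    }
    5
}

/// Quantise to Q(15 - shift), rounding half up, then saturate each value
/// independently to the i16 range.
pub fn quantize(coeffs: &[f64], shift: u8) -> Vec<i16> {
    let scale = f64::from(1u32 << (15 - u32::from(shift)));
    coeffs
        .iter()
        .map(|&c| {
            let q = (c * scale + 0.5).floor();
            q.clamp(-32768.0, 32767.0) as i16
        })
        .collect()
}
```

For the real-valued predictor `[2, −1]`, `choose_shift` returns 2 (Q13) and `quantize` gives `[16384, −8192]`, which agrees with the order-2 entry of the fixed-predictor table in §7. An empty coefficient slice yields shift 0, consistent with the verbatim-frame rule.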

### 3.5 `frame_sample_count`

Number of audio samples produced by this frame (also the number of residuals
in the Rice bitstream). The value **must** be in `[1, 65535]`.

Decoders **must** reject `frame_sample_count == 0`: a zero-sample frame
trivially satisfies the partition-divisibility check below
(`0 mod n == 0` for any `n`) but carries no audio and has no legal
Rice payload.

For compliance with `partition_order`, the value **must** satisfy
`frame_sample_count mod (1 << partition_order) == 0`. Decoders **must**
reject frames where this does not hold.

### 3.6 `lpc_coefficients`

Array of `prediction_order` predictor coefficients, each a 16-bit big-endian
signed integer interpreted in `Q(15 − coefficient_shift)` format — see §3.4
for the shift semantics.

The wire format does not distinguish coefficients by derivation. The
synthesis formula below applies identically whether the encoder
obtained the values from Levinson-Durbin analysis, from a fixed
coefficient template (e.g. FLAC-style integer predictors), from a
trained model, or from any other strategy. What goes on the wire is
just `prediction_order` 16-bit integers; how the encoder chose them
is encoder-internal and not observable to the decoder.

Synthesis formula (applied in the decoder):

```text
s          = 15 − coefficient_shift
bias       = 1 << (s − 1)
predict[i] = (Σ_{j=0..terms-1} coeff[j] · sample[i − j − 1] + bias) >> s
sample[i]  = residual[i] + predict[i]
```

where `terms = min(i, prediction_order)`. The `+ bias` term implements
round-to-nearest for the right shift and is **required** for bit-exact
decoding. For the default `coefficient_shift = 0`: `s = 15`, `bias = 16384`.

The `>> s` operator **must** be an **arithmetic right shift** on
signed integers — equivalent to floor division by `2^s`. Combined with
the `+ bias` pre-add, this implements **round-half toward +∞**: on a
value whose scaled form is exactly `k + 0.5`, the result is `k + 1`
for both positive and negative `k`. Implementations using truncating
integer division (C's `/` on signed integers, which rounds toward
zero) **will diverge** from this on any `sum + bias` that is negative
and not evenly divisible by `2^s`: arithmetic shift rounds further
from zero, truncating division rounds toward zero. Concrete example:
at `s = 15`, `sum = -32769`, `bias = 16384`, arithmetic shift gives
`(-16385) >> 15 = -1`, truncating division gives `-16385 / 32768 = 0`.
Decoders in languages whose native integer division does not floor
**must** emulate arithmetic right shift explicitly on the signed
accumulator.
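
The divergence between arithmetic shift and truncating division can be checked directly. The sketch below is illustrative only; `round_shift` and `truncating_divide` are hypothetical helper names. In Rust, `>>` on `i64` is an arithmetic shift and `/` truncates toward zero, so the two functions reproduce the two behaviours contrasted above.

```rust
/// Rounding step required by the synthesis formula:
/// add the bias, then arithmetic right shift (floor division by 2^s).
pub fn round_shift(sum: i64, s: u32) -> i64 {
    let bias = 1i64 << (s - 1);
    (sum + bias) >> s
}

/// The non-compliant variant: same bias, but truncating division,
/// which rounds negative intermediates toward zero.
pub fn truncating_divide(sum: i64, s: u32) -> i64 {
    let bias = 1i64 << (s - 1);
    (sum + bias) / (1i64 << s)
}
```

With `s = 15` and `sum = -32769`, `round_shift` returns `-1` while `truncating_divide` returns `0`, matching the concrete example in the text.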

#### Accumulator width

The inner sum `Σ coeff[j] · sample[i − j − 1]` **must** be computed in
a signed integer accumulator of at least **49 bits** (equivalently: an
`i64` or wider). Worst-case bounds at `prediction_order = 32`,
`coefficient_shift = 5` (Q10), and full-scale samples give a product
of magnitude `(2¹⁵) · (2²³ − 1) ≈ 2³⁸` per term, summed over 32 terms
for a maximum of `~2⁴³`. Adding the bias keeps the result below `2⁴⁴`.
A 32-bit accumulator overflows at orders ≥ 16 with full-scale inputs —
implementations that reach for `int32_t` because samples are 32-bit
will silently corrupt high-order frames.

JavaScript / TypeScript implementers should note that `Number` is an
IEEE 754 double, not a signed integer: its 53-bit safe-integer range
covers in-contract accumulator values, but adversarial bitstreams (see
§6.2) can produce out-of-contract samples whose synthesis arithmetic
lands in the 2⁴⁹–2⁵¹ range and beyond, where `Number` silently loses
low bits to float rounding. For bit-exact spec compliance in JS/TS,
**use `BigInt` for the accumulator** — it has the integer semantics
the spec requires; `Number` does not.

#### Warm-up (`terms == 0`)

When `i == 0`, `terms = min(0, prediction_order) = 0`. The sum is
empty and `predict[0] = 0` — the `(0 + bias) >> s` formula is **not**
applied. Stating this explicitly avoids an implementation that
mechanically applies the formula in the warm-up case and produces
`predict[0] = bias >> s`, which is zero only in specific
`(bias, s)` parametrisations and surprising in any other.

For `0 < i < prediction_order`, the sum truncates to the available
`i` predecessors (`terms = i`). The formula applies as stated.

#### Sign convention for stored coefficients

The synthesis formula uses `+Σ`. Classical Levinson-Durbin
implementations that derive LPC from the error-prediction AR model

```text
x[n] = −Σ a[j] · x[n-j] + e[n]         (error convention)
```

produce coefficients `a[j]` whose sign is the **opposite** of what
the synthesis formula expects; those encoders **must** negate before
quantisation so the wire value is `coeff[j-1] = −a[j]`.
Implementations using the predictor convention

```text
x̂[n] = +Σ c[j] · x[n-j]                (predictor convention)
```

store `c[j]` directly.

Both conventions are common in DSP literature. Encoders **must**
verify that the coefficients emitted on the wire, when substituted
into the synthesis formula above, reproduce the encoder's own
prediction. The reference implementation uses the error convention
and negates at quantisation time.

#### Overflow semantics of the final add

The `residual[i] + predict[i]` add is specified as a **wrapping i32
add** (two's complement, modulo `2³²`; in languages without native
signed-overflow semantics, compute `(residual + predict) & 0xFFFFFFFF`
and then re-interpret as a signed 32-bit integer via sign-extension
of bit 31). On well-formed bitstreams — those produced by a compliant
encoder from in-contract samples (§1) — the result stays inside the
sample-magnitude contract and the wrap is never observable.
Adversarial bitstreams with crafted coefficients and residuals **may**
produce any `i32` value; the decoder **must not** panic, abort, or
reject on the basis of this add's result. The consequences of
out-of-contract decoder output are addressed in §6.2.
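
Putting the pieces of §3.6 together, the decoder's synthesis loop might look like the sketch below. It is a minimal illustration under stated assumptions, not the reference decoder: `synthesize` is a hypothetical name, the header fields are assumed to be validated already (`coefficient_shift ≤ 5`), and `predict` is reduced modulo `2³²` before the wrapping add, which is one reading of the final-add rule.

```rust
/// Reconstruct samples from residuals using the synthesis formula.
/// Assumes `coefficient_shift` has already been validated to be <= 5.
pub fn synthesize(residuals: &[i32], coeffs: &[i16], coefficient_shift: u8) -> Vec<i32> {
    let s = 15 - u32::from(coefficient_shift);
    let bias = 1i64 << (s - 1);
    let mut samples: Vec<i32> = Vec::with_capacity(residuals.len());
    for (i, &residual) in residuals.iter().enumerate() {
        // Only the predecessors that exist take part in the sum.
        let terms = i.min(coeffs.len());
        let predict: i64 = if i == 0 {
            // Warm-up: predict[0] is defined as 0, the formula is not applied.
            0
        } else {
            // i64 accumulator: 32 terms of i16 * i32 stay below 2^52.
            let mut sum: i64 = 0;
            for j in 0..terms {
                sum += i64::from(coeffs[j]) * i64::from(samples[i - j - 1]);
            }
            // `>>` on i64 is an arithmetic shift in Rust.
            (sum + bias) >> s
        };
        // Wrapping add modulo 2^32; never panics on crafted input.
        samples.push(residual.wrapping_add(predict as i32));
    }
    samples
}
```

With the order-1 predictor `[16384]` at `coefficient_shift = 1` (real value 1, so each prediction equals the previous sample), the residuals `[5, 1, 1, -2]` decode to `[5, 6, 7, 5]`. With an empty coefficient slice the output equals the residuals, which is the verbatim mode of §3.2.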

## 4. Rice Bitstream

Immediately follows the header. Flat MSB-first bitstream structured as
consecutive partition payloads:

```text
+---------+---------+-----+-----------+
| part. 0 | part. 1 | ... | part. P-1 |
+---------+---------+-----+-----------+
```

where `P = 1 << partition_order`. Each partition has the same structure:

```text
+-------+-------------+-------------+-----+-------------+
| k (5) | codeword 0  | codeword 1  | ... | codeword M-1|
+-------+-------------+-------------+-----+-------------+
```

where `M = frame_sample_count / P` is the per-partition residual count.

Partitions are **bit-contiguous**: the 5-bit `k` field of partition
`i + 1` begins at the bit immediately following the last codeword of
partition `i`. There is no byte alignment between partitions. Only
the final trailing padding described in §4.3 is byte-aligned.

Within a partition, the bit cursor likewise advances continuously:
codeword 0 begins at the bit immediately following the 5-bit `k`
field, codeword 1 immediately after codeword 0's remainder bit, and so
on. A conformant decoder maintains a single bit-read position across
the entire Rice bitstream — from the `k` of partition 0 through the
last codeword of partition P−1 — and never realigns to a byte or bit
boundary between fields. This is implicit in the byte-stream decoder
design (a bit reader that consumes bits sequentially needs no
special handling at field boundaries) but stated here so second-team
implementations do not introduce a spurious alignment.

### 4.1 Per-Partition Parameter `k`

Five-bit unsigned integer, MSB-first, immediately before the partition's
codewords. `k` is the Rice parameter for this partition and must be in
`[0, 23]`. Decoders **must** reject values above 23 as malformed.

### 4.2 Codeword

Each residual is encoded by:

1. **Zigzag mapping** from signed to unsigned:

   ```text
   z = (r << 1) ^ (r >> 31)           interpreted as u32
   ```

   where `(r >> 31)` is an **arithmetic right shift** on the i32
   residual, sign-extending the sign bit to all 32 bit positions — `0`
   for non-negative `r`, `-1` (all ones) for negative `r`. The entire
   expression is **masked to 32 bits** before being interpreted as
   `u32`; in languages with arbitrary-precision integers (e.g. Python)
   or where native bitwise ops return signed 32-bit (e.g. JavaScript /
   TypeScript `Number`, where `(x ^ y) >>> 0` coerces to u32), this
   mask is explicit (`& 0xFFFFFFFF` or `>>> 0`) and **required** —
   without it, negative residuals produce zigzag values with extra
   high bits set. Implementations in languages whose native right
   shift is logical on unsigned types **must** coerce `r` to a signed
   32-bit type first; implementations in languages where `>>` on
   signed types is implementation-defined (e.g. pre-C++20 C/C++)
   **must** emulate the arithmetic shift explicitly.

   The `r << 1` factor is always safe on i32. Residuals on the
   encoder side are bounded by `|r| ≤ |sample| + |predict| ≤ 2·2²³
   = 2²⁴`, so `r << 1` has magnitude `≤ 2²⁵` and fits in i32 without
   overflow or undefined behaviour — even for `r` at the most
   negative value the encoder can ever produce from in-contract
   input (§1). Decoder-side code does not perform this shift; the
   inverse uses `z >> 1` on a u32, which is always defined.

   The mapping sends

   ```text
   {0, −1, 1, −2, 2, −3, 3, …}  →  {0, 1, 2, 3, 4, 5, 6, …}
   ```

   so small magnitudes of either sign map to small unsigned values.

   The decoder's inverse is

   ```text
   r = ((z >> 1) as i32) ^ −((z & 1) as i32)
   ```

   where `z >> 1` is a logical right shift on the u32 and the unary
   minus is ordinary two's-complement negation on the i32, yielding
   `0` or `-1` (all ones). Stating this inverse explicitly removes any
   ambiguity about how an implementation must invert the zigzag.

2. **Rice code** at parameter `k`:
   - **Unary part**: `q = z >> k` zero-bits followed by a single
     terminating 1-bit.
   - **Remainder part**: `k` bits of `z & ((1 << k) − 1)`, MSB-first.
     (The remainder part is absent when `k == 0`.)

   Total codeword length: `q + 1 + k` bits.
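
The two encoding steps above can be sketched as follows. This is an illustration only: `zigzag`, `unzigzag`, and `rice_codeword` are hypothetical names, and the codeword is returned as a string of `'0'`/`'1'` characters for readability rather than packed into bytes. In Rust, `>>` on `i32` is already an arithmetic shift and the `as u32` cast reinterprets the 32-bit pattern, so no extra mask is needed.

```rust
/// Zigzag map: 0, -1, 1, -2, 2, ... becomes 0, 1, 2, 3, 4, ...
pub fn zigzag(r: i32) -> u32 {
    ((r << 1) ^ (r >> 31)) as u32
}

/// Inverse of `zigzag`, as used by the decoder.
pub fn unzigzag(z: u32) -> i32 {
    ((z >> 1) as i32) ^ -((z & 1) as i32)
}

/// Rice codeword for `z` at parameter `k` (k <= 23), written as text:
/// `q` zero bits, one terminating 1 bit, then `k` remainder bits MSB-first.
/// Intended for small illustrative values; `q` can be large when `k` is small.
pub fn rice_codeword(z: u32, k: u32) -> String {
    let q = z >> k;
    let mut bits = "0".repeat(q as usize);
    bits.push('1');
    for i in (0..k).rev() {
        bits.push(if (z >> i) & 1 == 1 { '1' } else { '0' });
    }
    bits
}
```

For example, the residual `-3` maps to `z = 5`; at `k = 1` the quotient is `q = 2` and the remainder is `1`, so the codeword is `0011`, which is `q + 1 + k = 4` bits long.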

#### Decoder-side unary-run bound

Decoders **must** reject any codeword whose unary run length satisfies

```text
q > (2³² − 1) >> k       (equivalently, q > u32::MAX >> k)
```

A valid codeword reconstructs `z = (q << k) | remainder` as a u32.
`q > u32::MAX >> k` implies `q << k ≥ 2³²`, which either overflows
u32 silently (a critical decoder bug class — corrupt output with no
error) or indicates a malformed stream. Either way the frame **must**
be discarded with `InvalidParameter` or equivalent rejection class
(§6). The bound varies with `k`: at `k = 23` it is `511`, at `k = 0`
it is `u32::MAX` (no practical constraint).

This rule also caps the CPU cost of unary scanning on adversarial
input: without the cap, a decoder could be forced to scan an
arbitrarily long run of zero bits before reaching either a `1` or
the buffer end.
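
A decoder-side sketch of the unary-run bound is shown below. It is a minimal illustration, not the reference decoder: `BitReader`, `read_bit`, `read_codeword`, and the `RiceError` enum are hypothetical names, chosen to echo the `Truncated` and `InvalidParameter` classes mentioned in the text. The bound is checked while scanning, so the loop stops as soon as the run can no longer be valid.

```rust
#[derive(Debug, PartialEq)]
pub enum RiceError {
    Truncated,
    InvalidParameter,
}

/// Hypothetical MSB-first bit reader with a single running bit position.
pub struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // bit index from the start of `data`
}

impl<'a> BitReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        BitReader { data, pos: 0 }
    }

    pub fn read_bit(&mut self) -> Result<u32, RiceError> {
        let byte = *self.data.get(self.pos / 8).ok_or(RiceError::Truncated)?;
        let bit = (byte >> (7 - self.pos % 8)) & 1;
        self.pos += 1;
        Ok(u32::from(bit))
    }
}

/// Read one Rice codeword at parameter `k` and return the zigzag value `z`.
pub fn read_codeword(r: &mut BitReader, k: u32) -> Result<u32, RiceError> {
    if k > 23 {
        return Err(RiceError::InvalidParameter);
    }
    let max_q = u32::MAX >> k;
    let mut q: u32 = 0;
    // Count zero bits up to the terminating 1 bit.
    while r.read_bit()? == 0 {
        if q == max_q {
            // One more zero would make q exceed u32::MAX >> k.
            return Err(RiceError::InvalidParameter);
        }
        q += 1;
    }
    let mut remainder = 0u32;
    for _ in 0..k {
        remainder = (remainder << 1) | r.read_bit()?;
    }
    Ok((q << k) | remainder)
}
```

At `k = 23` the reader gives up on the 512th consecutive zero bit; at `k = 0` the cap is `u32::MAX`, so in practice the buffer end (reported as `Truncated`) is reached first.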

### 4.3 Byte Padding

After the last codeword of the last partition, any unused bits of the final
byte are zero-padded on the LSB side. The encoder writes `0` for all padding
bits; the decoder ignores them.

### 4.4 Bitstream Length

The Rice bitstream's total bit length is the sum of codeword bit
lengths across every partition, plus 5 bits per partition for the `k`
fields:

```text
total_bits = P · 5 + Σ_{all residuals} (q + 1 + k_partition)
```

This depends on every residual's quotient and cannot be computed
from the frame header alone. A decoder **stream-decodes** until it
has produced exactly `frame_sample_count` samples, then stops.
Any bits remaining inside the last byte are padding (§4.3) and
carry no information.

A decoder **must not** require the Rice bitstream's byte length to be
signalled out-of-band. The header plus the zero-padded byte-aligned
tail fully determines the frame boundary; `parse_header`'s
`bytes_consumed` return plus streaming Rice decode locates the end of
the frame in the input buffer.

## 5. Degenerate Cases

### 5.1 All-Zero Frame

For an all-zero sample vector, the encoder **must** use
`prediction_order = 0` because the Levinson-Durbin recursion is
undefined at `R[0] = 0`. Residuals equal the input (all zeros).

Partition-order and per-partition `k` selection remain at the
encoder's discretion; any legal `(partition_order, k)` combination
produces a bit-exact-decodable frame. The minimum-cost choice —
`partition_order = 0`, `k = 0` — produces a Rice payload of exactly
`5 + frame_sample_count` bits. Compliant encoders are **not**
required to pick this minimum.
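
As a worked example of the minimum-cost case, the sketch below builds the complete frame for digital silence. It is illustrative only; `encode_all_zero_frame` is a hypothetical name. Each zero residual zigzags to `z = 0`, which at `k = 0` is the single bit `1`, so the payload is five zero bits (the `k` field) followed by `frame_sample_count` one bits and zero padding.

```rust
/// Build a minimum-cost all-zero frame:
/// prediction_order = 0, partition_order = 0, coefficient_shift = 0, k = 0.
pub fn encode_all_zero_frame(frame_sample_count: u16) -> Vec<u8> {
    assert!(frame_sample_count >= 1, "zero-sample frames are illegal");
    // sync_word, prediction_order, partition_order, coefficient_shift
    let mut frame: Vec<u8> = vec![0x1A, 0xCC, 0, 0, 0];
    frame.extend_from_slice(&frame_sample_count.to_be_bytes());

    // Payload: 5-bit k = 0, then one '1' bit per residual, MSB-first.
    let total_bits = 5 + usize::from(frame_sample_count);
    let mut payload = vec![0u8; (total_bits + 7) / 8];
    for bit in 5..total_bits {
        payload[bit / 8] |= 0x80u8 >> (bit % 8);
    }
    frame.extend_from_slice(&payload);
    frame
}
```

For eight silent samples the payload is 13 bits, `00000 11111111` plus three padding bits, giving the nine-byte frame `1A CC 00 00 00 00 08 07 F8`.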

### 5.2 Single-Sample Frame

`frame_sample_count = 1` is valid but forces `partition_order = 0` (the only
value that divides 1 evenly). The single sample is Rice-coded directly
because no predecessors exist for any LPC order.

## 6. Error Recovery

Decoders **must** detect and reject every frame that violates the
constraints elsewhere in this document. The exhaustive list of
rejection classes, each of which is a distinct error condition so
callers can distinguish them in telemetry, is:

1. **Sync word mismatch** — bytes `0-1` differ from `0x1ACC` (§3.1).
2. **`prediction_order` out of range** — value > `32` (§3.2).
3. **`partition_order` out of range** — value > `7` (§3.3).
4. **`coefficient_shift` out of range** — value > `5` (§3.4).
5. **Verbatim frame with non-zero shift** — `prediction_order == 0` and
   `coefficient_shift != 0` (§3.4).
6. **`frame_sample_count == 0`** (§3.5).
7. **`frame_sample_count` not divisible by `partition_count`** (§3.3).
8. **Buffer truncated** — fewer bytes than the header plus coefficient
   array requires, or fewer bits than the Rice bitstream demands
   during streaming decode. This class is intentionally coarse-grained:
   a single `Truncated` variant covers header truncation, missing `k`
   fields, mid-codeword exhaustion, and every other "buffer ends early"
   shape the decoder can encounter. Sub-categorising these provides no
   caller benefit — the recovery action (discard and substitute
   silence, see §6.1) is identical regardless of where truncation
   happened.
9. **Per-partition `k` out of range** — value > `23` (§4.1).
10. **Unary-run cap exceeded** — any codeword with `q > u32::MAX >> k`
    (§4.2).

On any of these, the decoder **must** discard the frame, produce no
output samples, and signal the error to the caller. **No partial
state may propagate** to the next frame's decode — frames are
independent (§2), so subsequent frames decode cleanly regardless.
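
The header-level checks (classes 1-7, plus the header part of class 8) can be sketched as one parsing function. This is an illustration, not the reference `parse_header`: the names `parse_header_sketch`, `Header`, and `HeaderError` are hypothetical, and checking for the seven fixed bytes before the sync word is an assumption, since the text does not fix an order between those two checks.

```rust
#[derive(Debug, PartialEq)]
pub enum HeaderError {
    SyncMismatch,
    PredictionOrderOutOfRange,
    PartitionOrderOutOfRange,
    CoefficientShiftOutOfRange,
    VerbatimWithShift,
    ZeroSampleCount,
    SampleCountNotDivisible,
    Truncated,
}

#[derive(Debug, PartialEq)]
pub struct Header {
    pub prediction_order: u8,
    pub partition_order: u8,
    pub coefficient_shift: u8,
    pub frame_sample_count: u16,
    pub lpc_coefficients: Vec<i16>,
}

/// Parse and validate a frame header; returns the header and bytes consumed.
pub fn parse_header_sketch(data: &[u8]) -> Result<(Header, usize), HeaderError> {
    if data.len() < 7 {
        return Err(HeaderError::Truncated);
    }
    if data[0] != 0x1A || data[1] != 0xCC {
        return Err(HeaderError::SyncMismatch);
    }
    let (prediction_order, partition_order, coefficient_shift) = (data[2], data[3], data[4]);
    if prediction_order > 32 {
        return Err(HeaderError::PredictionOrderOutOfRange);
    }
    if partition_order > 7 {
        return Err(HeaderError::PartitionOrderOutOfRange);
    }
    if coefficient_shift > 5 {
        return Err(HeaderError::CoefficientShiftOutOfRange);
    }
    if prediction_order == 0 && coefficient_shift != 0 {
        return Err(HeaderError::VerbatimWithShift);
    }
    let frame_sample_count = u16::from_be_bytes([data[5], data[6]]);
    if frame_sample_count == 0 {
        return Err(HeaderError::ZeroSampleCount);
    }
    if frame_sample_count % (1u16 << partition_order) != 0 {
        return Err(HeaderError::SampleCountNotDivisible);
    }
    // Coefficient array: prediction_order big-endian i16 values.
    let end = 7 + 2 * usize::from(prediction_order);
    if data.len() < end {
        return Err(HeaderError::Truncated);
    }
    let lpc_coefficients = data[7..end]
        .chunks_exact(2)
        .map(|b| i16::from_be_bytes([b[0], b[1]]))
        .collect();
    let header = Header {
        prediction_order,
        partition_order,
        coefficient_shift,
        frame_sample_count,
        lpc_coefficients,
    };
    Ok((header, end))
}
```

Returning the consumed byte count alongside the header lets the caller start the Rice decode at the right offset, in the spirit of the `bytes_consumed` value mentioned in §4.4.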

### 6.1 Caller-side silence substitution

On rejection, the caller substitutes `frame_sample_count` zeros
(silence) for the frame period. The count is obtained as follows:

- **Post-header rejections** (classes 8-10 above — `Truncated` in the
  Rice bitstream, `InvalidParameter` during Rice decode): the frame
  header parsed successfully before the failure, so the count is
  recoverable. The caller re-parses just the header on the same buffer
  (reference API: `parse_header(data)`) and reads `frame_sample_count`
  from the resulting `AudioFrameHeader`.
- **Pre-header rejections** (classes 1-7 above): the header itself
  failed; the frame length is not recoverable from the bitstream. The
  caller **must** fall back to a session-level default frame size
  carried out-of-band by the container or transport (WebRTC and QUIC
  audio sessions typically negotiate this at session setup).

This asymmetry is inherent to the wire format: `frame_sample_count`
lives inside the header at offset 5, so any rejection that happens
while parsing bytes 0-4 precedes its discovery.

### 6.2 Decoder output magnitude

On well-formed bitstreams produced by a compliant encoder from
in-contract samples (§1), decoder output satisfies
`|sample| ≤ 2²³ − 1`.

Adversarial bitstreams — those with hand-crafted coefficients and
residuals that pass every rejection check in this section yet
produce arithmetic results outside the sample-magnitude contract —
**may** produce output samples of any `i32` value, including values
that exceed `2²³ − 1`. The decoder **must not** panic or reject on
this basis: the wrapping-add semantics of §3.6 are precisely what
makes every bit sequence produce a defined output, which is the
ground of the "no partial state propagates" contract at the top of
this section.

Callers that re-feed decoder output into LAC's encoder (for example,
an MCU decode → PCM mix → re-encode pipeline) **should** validate or
clamp to the input magnitude contract before re-encoding. A
compliant encoder assumes its input satisfies `|sample| ≤ 2²³ − 1`
and is not required to re-validate.

## 7. Encoder Guidance (non-normative)

The reference encoder's search has three phases:

```text
# Phase 0: all-zero short-circuit
R[0] = Σ sample[i]²
if R[0] == 0:
    emit frame with prediction_order = 0, any legal partition_order (§5.1)
    return

# Phase 1: sparse LPC order grid with stop-when-stale early-out
for prediction_order in [0, 2, 4, 6, 8, 10, 12, 16, 20, 24, 28, 32]:
    coeffs_q31 = levinson_durbin(samples, prediction_order) # cached
    shift = smallest s such that every |coeff_real| < 2^s
    coeffs_stored = quantize(coeffs_q31, to: Q(15 - shift))
    residuals = compute_residuals(samples, coeffs_stored, shift)
    for partition_order in 0..=7:
        if frame_sample_count % (1 << partition_order) != 0: continue
        rice_bits = estimate_cost(residuals, partition_order)
        total = header_bits(prediction_order) + rice_bits
        track minimum over (prediction_order, partition_order)
    if no improvement for 2 consecutive grid entries: break

# Phase 2: fixed-predictor post-pass
for (fp_order, fp_coeffs, fp_shift) in FIXED_PREDICTORS:
    residuals = compute_residuals(samples, fp_coeffs, fp_shift)
    for partition_order in 0..=7:
        if frame_sample_count % (1 << partition_order) != 0: continue
        rice_bits = estimate_cost(residuals, partition_order)
        total = header_bits(fp_order) + rice_bits
        track minimum over (fp_order, partition_order)

emit frame with the (order, partition_order) that minimised `total`
```

The sparse grid + early-out is a speed/compression trade-off; a
compliant encoder may still exhaustively search every integer order
`0..=32` for marginal gains at higher cost. The produced bitstreams
are interchangeable.

The fixed-predictor post-pass tries FLAC-style integer predictors
(orders 1-4 with a small static coefficient table) after the LPC
grid. These evaluate quickly and occasionally beat the
Levinson-Durbin winner on content where a low-order integer polynomial
fits better than the statistically-optimal LPC fit — silent-plus-DC, very
smooth tones, polynomial-ish sensor data. Running them second avoids
tripping the stop-when-stale heuristic in the LPC phase.

The reference encoder's `FIXED_PREDICTORS` table, materialising the
FLAC-style `[1]`, `[2, −1]`, `[3, −3, 1]`, `[4, −6, 4, −1]` integer
predictors at the smallest `coefficient_shift` that represents each
coefficient without clamping:

| `prediction_order` | `lpc_coefficients` (Q-format integers) | `coefficient_shift` | Real-value interpretation |
|-------------------:|----------------------------------------|--------------------:|---------------------------|
| 1                  | `[16384]`                              | 1 (Q14)             | `[1]`                     |
| 2                  | `[16384, −8192]`                       | 2 (Q13)             | `[2, −1]`                 |
| 3                  | `[24576, −24576, 8192]`                | 2 (Q13)             | `[3, −3, 1]`              |
| 4                  | `[16384, −24576, 16384, −4096]`        | 3 (Q12)             | `[4, −6, 4, −1]`          |

These are the exact wire-format bytes a second-team encoder would emit
to match the reference's fixed-predictor outputs bit-for-bit. Compliant
encoders MAY use a different set (or none), since §3.6 treats the
coefficient field as opaque — decoders apply the synthesis formula
identically regardless of source.

The `R[0] == 0` short-circuit is both a correctness requirement (§5.1
— Levinson-Durbin is undefined at zero autocorrelation) and an
encoder-cost optimisation: on digital silence, the sparse grid and
fixed-predictor pass produce identical zero residuals and order 0
wins on header size alone.

Levinson-Durbin runs once to order 32 with all intermediate orders saved
into a flat buffer (one recursion pass yields all orders 1..32 at
`O(order²)` cost), so the outer loop fetch is free and order selection
is effectively `O(orders_tried × N)`.

`shift` is determined per order by the coefficient magnitudes — there
is no shift search, as smaller shifts are always at least as good as
larger ones when they don't clamp (saturation, §3.4, is the
fallback). Rice cost at a given `partition_order` is exact and
closed-form given the per-partition `k`, so the inner search
introduces no estimation error.

#### Levinson-Durbin numerical choices (reference, non-normative)

The reference encoder runs Levinson-Durbin with i64 autocorrelation
accumulators, Q31 working coefficients, and widens to i128 for
reflection-coefficient intermediates at orders where Q31 would lose
precision (typically above order ~12). Rounding on the Q31→Q(15−shift)
quantisation step is round-half-up via `(a_q31 + bias) >> shift_amt`,
with `bias = 1 << (14 + shift)` — the direct analogue of the synthesis
formula's rounding (§3.6), chosen so analysis and synthesis agree on
tie-break direction.

None of these choices are normative. Two encoders making different
precision or rounding choices will produce different coefficient
bytes on the same input, but both bitstreams decode correctly under
§3.6 so long as the coefficients they emit faithfully represent their
own LPC decision. §3.6's "wire format doesn't distinguish coefficients
by derivation" clause is exactly what permits this freedom.

#### Rice `k` selection (reference, non-normative)

Cost is closed-form for a given `(partition, k)`: `bit_cost(k) =
N · (1 + k) + Σ (v >> k)` where `N` is the partition's residual count
and `v` ranges over the zigzag-mapped residuals of the partition.
Exhaustive search over `k ∈ [0, 23]` is always acceptable and is the
simplest compliant choice.

The reference encoder uses convex descent: seed `k_seed =
⌊log₂(mean(v))⌋`, then walk either direction. On descent, ties break
**toward smaller `k`** (condition `cost ≤ best_cost`, so equal
costs at `k−1` overwrite `k`); on ascent, **only strictly smaller
costs** are accepted (condition `cost < best_cost`, so equal costs at
`k+1` don't override `k`). This yields a unique `k` per partition
matching an exhaustive search's first-wins tie-break.
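
The closed-form cost and the exhaustive search with a first-wins tie-break can be sketched as below. This illustrates the simplest compliant choice named above, not the reference encoder's convex descent; `bit_cost` and `best_k` are hypothetical names. As in the formula, the cost excludes the 5-bit `k` field itself.

```rust
/// Bits needed to Rice-code one partition at parameter `k`:
/// N * (1 + k) for terminators and remainders, plus the unary runs.
pub fn bit_cost(zigzagged: &[u32], k: u32) -> u64 {
    let n = zigzagged.len() as u64;
    let unary: u64 = zigzagged.iter().map(|&v| u64::from(v >> k)).sum();
    n * (1 + u64::from(k)) + unary
}

/// Exhaustive search over k in [0, 23]. The strict `<` comparison keeps
/// the smallest k among equal costs (first-wins).
pub fn best_k(zigzagged: &[u32]) -> (u32, u64) {
    let mut best = (0u32, bit_cost(zigzagged, 0));
    for k in 1..=23 {
        let cost = bit_cost(zigzagged, k);
        if cost < best.1 {
            best = (k, cost);
        }
    }
    best
}
```

For the zigzag values `[5, 5, 5, 5]` the costs at `k = 0, 1, 2, 3` are 24, 16, 16, and 16 bits, so the search settles on `k = 1`, the smallest parameter that reaches the minimum.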

On the reference corpus, the sparse grid's compression matches
exhaustive-search output within ~0.2 percentage points (measured);
the differential test caps the acceptable excess at 0.5%.
Implementations that prefer tighter compression at higher encode
cost can extend the grid without wire-format consequences — decoders
do not care which orders an encoder tried.
|
||||
|
||||
### 7.1 Frame size (non-normative)

The codec accepts any `frame_sample_count` in `[1, 65535]`. Larger
frames amortise the 7-byte fixed header and the LPC coefficient vector
over more samples, and generally compress better. Smaller frames give
lower latency and finer partition-order granularity on transient
content.

Recommended defaults by use case:

- **Real-time voice** (QUIC streaming, MCU mix): 160-320 samples at
  16 kHz, or 480-960 at 48 kHz, matching the 10-20 ms frame periods
  used by Opus / WebRTC.
- **Real-time full-band** (game audio, music conferencing): 1024-2048
  samples at 48 kHz (21-43 ms).
- **Offline / archival**: 4096-8192 samples at 48 kHz; compression
  gains flatten past this.

Power-of-two frame sizes of at least 128 samples expose every
`partition_order ∈ [0, 7]` to the encoder's search. Other sizes
restrict the search to partition counts that divide the frame evenly;
an odd frame size (any prime other than 2, for example) forces
`partition_order = 0`. Encoders SHOULD prefer frame sizes with several
factors of two when free to choose.

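The frame-period figures and the divisibility constraint above can be checked with a short sketch. Both function names are hypothetical, and the second function assumes that `partition_order = p` splits the frame into `2^p` equal partitions; any further constraints the decoder places on partition size are not modelled here.

```rust
/// Frame period in milliseconds for a given sample count and sample rate.
fn frame_duration_ms(frame_sample_count: u32, sample_rate_hz: u32) -> f64 {
    1000.0 * f64::from(frame_sample_count) / f64::from(sample_rate_hz)
}

/// Partition orders in [0, 7] whose partition count divides the frame.
/// ASSUMPTION: order p means 2^p equal-length partitions.
fn searchable_partition_orders(frame_sample_count: u32) -> Vec<u32> {
    (0..=7u32)
        .filter(|&p| frame_sample_count != 0 && frame_sample_count % (1u32 << p) == 0)
        .collect()
}
```

Under that assumption, a 1024-sample frame exposes orders 0 through 7, a 960-sample frame (64 × 15) exposes orders 0 through 6, and a 997-sample frame exposes only order 0.
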
## 8. Versioning

This document specifies **LAC version 1**, identified on the wire by
`sync_word = 0x1ACC`. No per-frame version byte is carried; the sync
word uniquely identifies the wire format.

Future revisions of the format **must** use a distinct `sync_word`. The
recommended allocation is `0x1ACD` for v2, `0x1ACE` for v3, and so on
inside the `0x1ACC..0x1ACF` cluster; keeping revisions inside one
cluster makes them easy to recognise in casual grep / hex-dump
inspection. A revision whose wire format cannot be made compatible
with v1 at the header level **must** pick a sync word outside this
cluster.

This approach is exhaustive by construction:

- A v1 decoder that encounters a v2 frame sees an unrecognised sync
  word on the first check (§3.1) and rejects cleanly, taking the same
  error path as for foreign or corrupted payloads.
- A v2-aware decoder dispatches on the sync word before reading any
  further field, so it can fall back to v1 parsing when appropriate
  or decode v2 frames natively.

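The dispatch described in the two cases above can be sketched as a classifier that looks only at the sync word. This is a hedged sketch: the enum and function names are hypothetical, and it assumes the sync word is a 16-bit big-endian field occupying the first two bytes of the frame.

```rust
/// Outcome of inspecting only the sync word (names are illustrative).
#[derive(Debug, PartialEq, Eq)]
enum SyncClass {
    /// LAC v1 frame.
    V1,
    /// Sync word in 0x1ACD..=0x1ACF, recommended for later revisions.
    ReservedCluster(u16),
    /// Anything else: foreign or corrupted payload.
    Foreign(u16),
    /// Fewer than two bytes available.
    Truncated,
}

const SYNC_V1: u16 = 0x1ACC;

/// ASSUMPTION: the sync word sits at byte offset 0, big-endian.
fn classify_sync(frame: &[u8]) -> SyncClass {
    let sync = match frame {
        [hi, lo, ..] => u16::from_be_bytes([*hi, *lo]),
        _ => return SyncClass::Truncated,
    };
    match sync {
        SYNC_V1 => SyncClass::V1,
        0x1ACD..=0x1ACF => SyncClass::ReservedCluster(sync),
        other => SyncClass::Foreign(other),
    }
}
```

A v1-only decoder would treat every result other than `V1` as a rejection; a decoder aware of later revisions would branch on `ReservedCluster` before reading any further header field.
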
Because every field in §3 has a strict range with no reserved
high-bit patterns, in-place extension (flag bits inside existing
fields) is **not** a supported evolution path. New features go into a
new `sync_word`, not into reinterpreting existing field values.

Transports that multiplex LAC frames with other formats should frame
each LAC payload explicitly (length prefix or stream separator); the
sync word alone is not a framing delimiter, only a format identifier.

## 9. Implementation notes (non-normative)

### 9.1 GPU offload is out of scope

LAC is a scalar integer codec. The reference implementation, and any
conforming implementation this document anticipates, runs on a CPU.
GPU offload is deliberately not a goal:

- **Levinson-Durbin** is serial by construction (each iteration depends
  on the previous) and its intermediate accumulator needs more than 64
  bits of precision at higher orders, a poor fit for WGSL or SPIR-V
  compute shaders, which have no native 128-bit integer arithmetic.
- **Rice decode** uses a data-dependent unary run for every residual;
  on GPU execution models this causes severe warp divergence, and its
  sequential bit-cursor progression works against SIMD lane packing.
- **LPC synthesis** has a tight per-sample feedback loop (sample `i`
  depends on samples `i-1`, `i-2`, …, `i-order`), so each channel is
  inherently serial.
- **The one plausibly GPU-parallel phase** (residual computation
  inside the encoder's order search) is also the phase where the
  CPU's autovectorized implementation is already well-served by SIMD
  on any modern target. At the measured encode latencies (P99 under
  50 µs on x86 for a 20 ms frame period, >400× headroom), there is no
  motivation to offload it.

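The serial dependency in Rice decode is easiest to see in code: the bit position of each residual is known only after the previous one has been fully read. The sketch below is illustrative and makes assumptions that this section does not confirm: the unary quotient is taken to be a run of 0 bits ended by a 1 bit, followed by `k` remainder bits, MSB-first. The type and function names are hypothetical, and rejection is reduced to `None`.

```rust
/// MSB-first bit cursor over a byte slice (illustrative).
struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // position in bits from the start of `data`
}

impl<'a> BitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0 }
    }

    /// Returns the next bit, or None when the input is exhausted.
    fn read_bit(&mut self) -> Option<u32> {
        let byte = *self.data.get(self.pos / 8)?;
        let bit = (byte >> (7 - (self.pos % 8))) & 1;
        self.pos += 1;
        Some(u32::from(bit))
    }
}

/// Decode one zigzag-mapped value with Rice parameter `k`.
/// ASSUMPTION: unary run of zeros terminated by a one, then k bits.
fn decode_rice(r: &mut BitReader<'_>, k: u32) -> Option<u32> {
    if k > 23 {
        return None; // per-partition k above 23 is rejected
    }
    let cap = u32::MAX >> k;
    let mut q: u32 = 0;
    // The loop length depends on the data: this is the divergent part.
    while r.read_bit()? == 0 {
        q += 1;
        if q > cap {
            return None; // unary run above the cap
        }
    }
    let mut rem: u32 = 0;
    for _ in 0..k {
        rem = (rem << 1) | r.read_bit()?;
    }
    Some((q << k) | rem)
}
```

Because `decode_rice` advances `pos` by a value-dependent amount, the second residual cannot be located without decoding the first, which is the property that makes lane-parallel decoding awkward.
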
A hypothetical future revision whose hot path genuinely benefited from
GPU execution (large-batch archival encoding across many channels at
once, for instance) would need to change the wire format to carry
enough shape metadata for batched kernels, i.e. a new sync word
under the versioning rules in §8, not a retrofit.

## 10. Conformance test vectors

The reference repository's `tests/conformance.rs` holds the canonical
test-vector set for this specification:

- **`DECODE_FIXTURES`**: `(samples, bytes)` pairs pinned at the byte
  level. A conformant decoder **must** produce the `samples` array
  when fed the `bytes` array. Encoders have latitude (§3.6, §7), so a
  second-team encoder's bytes for the same samples may differ; the
  decoder direction is the normative one. Coverage includes the
  smallest valid frames (single-sample verbatim, 4- and 8-sample
  silence), single-sample polarity boundaries (±1, ±(2²³−1)), DC
  offset, alternating-polarity Nyquist-like content, smooth polynomial
  (fixed-predictor territory), and a 16-sample growing-amplitude
  pattern exercising partition search.
- **`REJECT_FIXTURES`**: hand-constructed malformed inputs mapped to
  their expected rejection variants. Covers every class in §6 (1-10):
  bad sync, each field-range violation, verbatim + non-zero shift,
  `frame_sample_count == 0`, non-divisible partition count, header /
  coefficient / Rice-bitstream truncation, per-partition `k > 23`.
- **`reject_unary_run_above_cap`**: a programmatic test for §6 class
  10 (`q > u32::MAX >> k`). The minimal triggering payload is ~75
  bytes of mostly zeros; construction logic is in the test, not a
  const fixture.

Second-team implementations should port the decode fixtures
byte-for-byte and the reject fixtures byte-and-variant-for-variant.
`encode_matches_fixtures` in the same file is reference-specific (it
asserts the reference encoder's exact bytes) and is **not** a
conformance requirement; see §3.6's encoder-latitude clause.

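A second-team port of the decode-direction check might look like the sketch below. The struct layout and function name are hypothetical and are not taken from `tests/conformance.rs`; the decoder is passed in as a closure so the harness makes no assumption about its API, and no fixture data is reproduced here.

```rust
/// One decode fixture: the bytes on the wire and the samples they must
/// decode to. Field names are illustrative, not the reference's.
struct DecodeFixture {
    name: &'static str,
    samples: &'static [i32],
    bytes: &'static [u8],
}

/// Runs `decode` over every fixture and returns the names of those that
/// either fail to decode or decode to the wrong samples.
fn failing_fixtures<E>(
    fixtures: &[DecodeFixture],
    decode: impl Fn(&[u8]) -> Result<Vec<i32>, E>,
) -> Vec<&'static str> {
    fixtures
        .iter()
        .filter(|f| match decode(f.bytes) {
            Ok(out) => out.as_slice() != f.samples,
            Err(_) => true,
        })
        .map(|f| f.name)
        .collect()
}
```

An empty result means the decoder reproduced every pinned `samples` array; the reject fixtures would need a parallel harness that also compares the rejection variant.
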
### 10.1 Reference encoder exemplars (non-normative)

The same `(samples, bytes)` pairs, read in the encoder direction,
serve as **reference-encoder validation targets** for implementations
that want to match this project's reference byte-for-byte, a common
goal for porting work even though the spec does not require it. The
fixture set is deliberately chosen to pin every encoder-discretion
axis from §7:

- `single_zero`, `single_pos_one`, `single_neg_one`: single-sample
  frames. All three fall in the §5.2 "warm-up-is-whole-frame" regime
  where the encoder's order choice is nearly arbitrary; pinning the
  bytes fixes this project's choice (order 0 with the minimum-cost
  Rice encoding).
- `silence_4`, `silence_8`: force `partition_order` tie-breaks on an
  all-zero frame (every `partition_order ∈ [0, log₂(N)]` produces
  identical cost; the reference picks the smallest via its
  convex-descent tie-break).
- `dc_100_4`, `alternating_small_4`: exercise the order-vs-verbatim
  decision. DC content favours a low-order LPC fit with small
  residuals; alternating content favours order 1 with `a = −1`
  (approximated at the closest Q-format). Pinning the bytes fixes
  the reference's decision boundary.
- `single_full_scale_pos`, `single_full_scale_neg`: maximum-magnitude
  single samples. Exercise the `|sample| ≤ 2²³ − 1` boundary on both
  sides and fix the zigzag-of-extremum output.
- `linear_ramp_8`: smooth polynomial content, fixed-predictor
  territory. Pins the reference's fixed-predictor-vs-LPC tie-break.
- `lfsr_noise_16`: exercises partition search on a frame large
  enough for `partition_order > 0` to be competitive.

A second-team encoder that produces the same bytes for every entry
here is **likely** (not guaranteed) to produce matching bytes on
wider inputs, since the tie-break axes are the ones most sensitive
to encoder discretion. An encoder that produces different bytes is
still compliant so long as its own bytes round-trip; see §3.6, §7.