# Design Specification: High-Performance UDP File Transfer Protocol (HP-UDP) v5.2

## 1. The Motivation (The "Why")

The development of HP-UDP is driven by a singular goal: to engineer the fastest possible file transfer protocol that mathematically guarantees perfect data integrity across volatile network conditions.

While TCP is the foundational workhorse of the internet, its general-purpose congestion algorithms inherently throttle performance on Long Fat Networks (LFNs). HP-UDP was built to prove that it is possible to outperform TCP in raw throughput by replacing reactive safety nets with proactive, domain-specific algorithms.

This protocol democratizes high-speed data movement, giving developers and engineers the ability to send and receive massive files cleanly, reliably, and at maximum hardware limits. It is a rigorous demonstration of advanced systems engineering, built to prove what is possible when legacy constraints are stripped away.

## 2. Architectural Overview

HP-UDP is an application-layer file transfer mechanism built on top of UDP. The design is lean, avoids unnecessary overhead, and focuses intently on its primary goal of speed while ensuring the reliability required for production use.

The architecture is built upon five core pillars:

- **Ultra-Fast Initiation:** A 0-RTT Optimistic Handshake with wire-speed calibration burst.

- **Reliable Data Transfer:** Adaptive Forward Error Correction (FEC) for proactive loss recovery.

- **Adaptive Throughput:** Loss-driven congestion control with deficit-accumulator pacing and rate-proportional Heartbeat feedback.

- **Guaranteed Integrity:** Graceful Teardown linked to pipelined, full-file checksum verification.

- **Resilient Session Management:** Timeout-driven cleanup on both sender and receiver with linger states.

**Note on Security Scope:** HP-UDP v5.0 adds optional end-to-end encryption via ephemeral X25519 key exchange and AES-128-GCM per-packet encryption (§4.5), providing confidentiality, integrity, and perfect forward secrecy. Encryption is backward-compatible: unencrypted transfers still work when the `Encrypted` flag is unset. HP-UDP intentionally omits *authentication* — it does not verify the identity of the remote endpoint. In the target deployment environment (managed networks with known infrastructure behind SDNs), endpoint identity is established at the network layer. Optional certificate or pre-shared-key authentication may be added in a future revision as a separate concern.

## 3. Packet Wire Format (Custom Header)

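The protocol uses a tightly packed, fixed-width **32-byte** binary header for every datagram. The header length is a whole number of 64-bit words (four 8-byte words), but fields are packed without padding, so most multi-byte fields (for example `SequenceNum` at offset `0x05`) are not naturally aligned and should be read with byte-wise or `memcpy`-style accessors. The hard MTU cap is **1400 bytes total** (header + payload), yielding a maximum payload of **1368 bytes** (`MTUHardCap(1400) − HeaderSize(32)`). This keeps every datagram within the standard 1500-byte Ethernet MTU, avoiding IP-level fragmentation on such paths.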

**Byte order:** **All multi-byte fields in the entire protocol** — header fields, heartbeat payload fields (§6B), SESSION_REQ payload fields (§4C), PUSH_REQ/PUSH_ACCEPT payload fields (§11C), and NACK arrays — are in **big-endian (network byte order)**. C implementations must use `htonl`/`ntohl` (32-bit) and `htonll`/`ntohll` (64-bit) or equivalent for every multi-byte field on the wire. Go implementations use `binary.BigEndian` methods. This applies uniformly; there are no little-endian fields anywhere in the protocol.

| Byte Offset | Size | Field Name | Description |
|---|---|---|---|
| 0x00 | 1 Byte | PacketType | 0x00 SESSION_REQ, 0x01 DATA, 0x02 PARITY, 0x03 HEARTBEAT, 0x04 SESSION_REJECT, 0x05 TRANSFER_COMPLETE, 0x06 ACK_CLOSE, 0x07 PULL_REQ, 0x08 PUSH_REQ, 0x09 PUSH_ACCEPT, 0x0A SESSION_ACCEPT, 0x0B RESUME_REQ, 0x0C RESUME_ACCEPT, 0x0D LIST_REQ, 0x0E LIST_RESP. |
| 0x01 | 4 Bytes | SessionID | Client-generated random identifier for the active transfer (see §4 for collision handling). |
| 0x05 | 8 Bytes | SequenceNum | Strictly incrementing 64-bit chunk identifier. Eliminates rollover concerns up to ~16 EB file sizes. |
| 0x0D | 8 Bytes | BlockGroup | 64-bit identifier for the FEC block this packet belongs to. Aligned with SequenceNum address space. |
| 0x15 | 2 Bytes | PayloadLen | Size of the raw data payload (max 1368 bytes). |
| 0x17 | 1 Byte | Flags | Bitmask: 0x01 = End of File, 0x02 = Calibration Burst, 0x04 = Encrypted (payload is AES-128-GCM ciphertext; see §4.5). |
| 0x18 | 8 Bytes | SenderTimestampNs | The sender's clock timestamp in nanoseconds at the moment each DATA or PARITY packet is built (C: clock_gettime(CLOCK_MONOTONIC) converted to nanoseconds; Go: time.Now().UnixNano(), which reads the wall clock, so the value is only step-free if the system clock is not adjusted mid-transfer). Non-data control packets leave this field zero. The receiver echoes this value verbatim as EchoTimestampNs in the Heartbeat payload; the sender computes RTT = now_ns − EchoTimestampNs using only its own clock (§6B). |
| 0x20 | Variable | Payload | Raw file bytes, FEC parity data, or protocol metadata. |
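The header table maps directly onto a pair of encode/decode routines. The following is a minimal Go sketch, not a reference implementation: the `Header` struct and the `Marshal`/`ParseHeader` names are illustrative assumptions, while the offsets and big-endian byte order come from the table and the byte-order note above.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// HeaderSize is the fixed header length from the wire-format table.
const HeaderSize = 32

// Header mirrors the wire header. The struct itself is a hypothetical
// in-memory form; only the offsets used below are taken from the spec.
type Header struct {
	PacketType        uint8
	SessionID         uint32
	SequenceNum       uint64
	BlockGroup        uint64
	PayloadLen        uint16
	Flags             uint8
	SenderTimestampNs uint64
}

// Marshal writes the header into a fresh 32-byte slice in network byte order.
func (h Header) Marshal() []byte {
	b := make([]byte, HeaderSize)
	b[0x00] = h.PacketType
	binary.BigEndian.PutUint32(b[0x01:], h.SessionID)
	binary.BigEndian.PutUint64(b[0x05:], h.SequenceNum)
	binary.BigEndian.PutUint64(b[0x0D:], h.BlockGroup)
	binary.BigEndian.PutUint16(b[0x15:], h.PayloadLen)
	b[0x17] = h.Flags
	binary.BigEndian.PutUint64(b[0x18:], h.SenderTimestampNs)
	return b
}

// ParseHeader decodes the first 32 bytes of a datagram.
func ParseHeader(b []byte) (Header, error) {
	if len(b) < HeaderSize {
		return Header{}, errors.New("datagram shorter than header")
	}
	return Header{
		PacketType:        b[0x00],
		SessionID:         binary.BigEndian.Uint32(b[0x01:]),
		SequenceNum:       binary.BigEndian.Uint64(b[0x05:]),
		BlockGroup:        binary.BigEndian.Uint64(b[0x0D:]),
		PayloadLen:        binary.BigEndian.Uint16(b[0x15:]),
		Flags:             b[0x17],
		SenderTimestampNs: binary.BigEndian.Uint64(b[0x18:]),
	}, nil
}

func main() {
	h := Header{PacketType: 0x01, SessionID: 0xDEADBEEF, SequenceNum: 42, PayloadLen: 1368, Flags: 0x02}
	wire := h.Marshal()
	back, err := ParseHeader(wire)
	fmt.Println(len(wire), back == h, err)
}
```

Because the accessors work on byte slices, the unaligned field offsets are handled without any struct-packing tricks.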

## 4. Connection Establishment: 0-RTT Optimistic Handshake

To eliminate latency before data transmission begins, the protocol uses a Zero Round-Trip Time (0-RTT) handshake with a wire-speed calibration burst to probe link capacity.

### A. SessionID Generation

The **client** generates the SessionID as a cryptographically random 32-bit integer (C: `getrandom()` or `/dev/urandom`; Go: `crypto/rand`). This keeps the handshake 0-RTT since no server round-trip is required for ID assignment.

- **Collision Handling:** The server maintains a set of active SessionIDs. If an incoming `SESSION_REQ` carries a SessionID that is already in use, the server responds with a `SESSION_REJECT` (Type `0x04`) containing a reason code. The client generates a new random SessionID and retransmits the `SESSION_REQ`. At typical concurrency levels (hundreds of concurrent transfers), 32-bit random IDs yield negligible collision probability: a new session collides with one of 200 active sessions with probability 200 / 2^32, or about 1 in 21 million.

- **Stale Session Protection:** SessionIDs are held in a reserved pool for 10 seconds after session purge to prevent late-arriving packets from a completed transfer from being misattributed to a new session reusing the same ID.
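The collision and stale-session rules can be sketched as a small table keyed by SessionID. This is an illustrative Go sketch under stated assumptions: the `sessionTable` type and its method names are hypothetical, time is passed in as nanoseconds for testability, and a reserved ID is treated the same as an active one (the caller would send `SESSION_REJECT`), which is one reading of the reserved-pool rule.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// lingerNs is the 10-second reservation applied after a session is purged.
const lingerNs = int64(10_000_000_000)

// sessionTable tracks active SessionIDs and IDs still held in reserve.
type sessionTable struct {
	active   map[uint32]bool
	reserved map[uint32]int64 // ID -> time (ns) at which the reservation expires
}

func newSessionTable() *sessionTable {
	return &sessionTable{active: map[uint32]bool{}, reserved: map[uint32]int64{}}
}

// Admit registers id and returns true, or returns false when the ID is
// active or still reserved.
func (t *sessionTable) Admit(id uint32, nowNs int64) bool {
	if t.active[id] {
		return false
	}
	if exp, ok := t.reserved[id]; ok {
		if nowNs < exp {
			return false
		}
		delete(t.reserved, id)
	}
	t.active[id] = true
	return true
}

// Purge ends a session and moves its ID into the reserved pool.
func (t *sessionTable) Purge(id uint32, nowNs int64) {
	if t.active[id] {
		delete(t.active, id)
		t.reserved[id] = nowNs + lingerNs
	}
}

// newSessionID draws a 32-bit ID from the OS CSPRNG (client side).
func newSessionID() (uint32, error) {
	var b [4]byte
	if _, err := rand.Read(b[:]); err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint32(b[:]), nil
}

func main() {
	tbl := newSessionTable()
	id, err := newSessionID()
	if err != nil {
		panic(err)
	}
	fmt.Println(tbl.Admit(id, 0))              // first use
	fmt.Println(tbl.Admit(id, 1))              // duplicate while active
	tbl.Purge(id, 2_000_000_000)               // reservation runs until t = 12 s
	fmt.Println(tbl.Admit(id, 5_000_000_000))  // inside the reservation window
	fmt.Println(tbl.Admit(id, 12_000_000_000)) // reservation expired
}
```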

### B. SESSION_REQ Validation

Before allocating resources, the receiver validates the `SESSION_REQ` payload:

- **File Size:** Must be non-zero and no larger than the configured maximum (default: 1 TB). Prevents out-of-memory crashes from corrupted or malicious packets.

- **File Name:** Must be non-empty. The receiver uses only the base filename (path separators stripped) to prevent path traversal.

If validation fails, the receiver sends a `SESSION_REJECT` and logs a diagnostic warning.
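A minimal Go sketch of these checks follows. The function names are hypothetical, "1 TB" is read as 2^40 bytes, and the base filename is taken as the last component after either separator. Rejecting `.` and `..` is an extra precaution that the text above does not state.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// defaultMaxFileSize assumes the 1 TB default means 2^40 bytes.
const defaultMaxFileSize = uint64(1) << 40

// baseFileName keeps only the component after the last '/' or '\'.
func baseFileName(name string) string {
	if i := strings.LastIndexAny(name, `/\`); i >= 0 {
		return name[i+1:]
	}
	return name
}

// validateSessionReq returns the sanitized file name, or an error that the
// caller would turn into a SESSION_REJECT.
func validateSessionReq(fileSize uint64, fileName string, maxSize uint64) (string, error) {
	if fileSize == 0 || fileSize > maxSize {
		return "", errors.New("file size out of range")
	}
	base := baseFileName(fileName)
	if base == "" || base == "." || base == ".." {
		return "", errors.New("file name empty or unsafe")
	}
	return base, nil
}

func main() {
	name, err := validateSessionReq(4096, "../../etc/passwd", defaultMaxFileSize)
	fmt.Println(name, err)
	_, err = validateSessionReq(0, "report.bin", defaultMaxFileSize)
	fmt.Println(err)
}
```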

### C. Handshake Sequence

The handshake is **0-RTT when unencrypted** and **1-RTT when encrypted** (§4.5). Both flows are described below.

- **Step 1 (The Request):** The client transmits `Packet 0` (`SESSION_REQ`). The payload contains: `FileSize` (8 bytes), `xxHash64` checksum (8 bytes), `InitialRate` (4 bytes, 0 = use calibration mode), and `FileName` (variable, null-terminated). **If the Encrypted flag (0x04) is set in the header Flags field,** a 32-byte `SenderPublicKey` (X25519 ephemeral) is appended after `InitialRate` and before `FileName`. The null terminator is appended after the filename bytes and is **not** counted in `PayloadLen`. Filenames must not contain embedded null bytes. The receiver strips all path separators (`/` and `\`) from the filename before writing — C implementations must scan for both characters to be platform-neutral.

- **Step 1.5 (Key Exchange — encrypted mode only):** If the `Encrypted` flag is set, the receiver responds with a `SESSION_ACCEPT` (Type `0x0A`) carrying its own 32-byte ephemeral `ReceiverPublicKey` as the payload. Both sides compute the shared secret via X25519 and derive the session key via HKDF (§4.5). The sender blocks until `SESSION_ACCEPT` arrives or the sender inactivity timeout fires. **In unencrypted mode, this step is skipped entirely** — the sender proceeds directly to Step 2.

- **Step 2 (The Calibration Packet Train):** The client begins transmitting `DATA` packets with the `Calibration` flag (`0x02`) set. In unencrypted mode, this starts immediately after Step 1 (0-RTT). In encrypted mode, this starts after receiving `SESSION_ACCEPT` (1-RTT). The burst consists of **10 packets** sent back-to-back at wire speed. This small packet train probes the link without flooding router buffers, even on constrained links like satellite. The token bucket is initialized at a default starting rate of 2 MB/s. **In encrypted mode, calibration DATA packets are encrypted** — there are no plaintext data packets on the wire once the key exchange completes.

- **Step 3 (Dispersion Measurement):** The receiver timestamps the arrival of the first and last calibration-flagged packet. Although 10 packets are sent back-to-back at wire speed, they arrive *spread out* by the bottleneck link — if 10 packets arrive over 10ms, the bottleneck bandwidth is `(10 − 1) × 1368 / 0.010 = 1.23 MB/s` (or `(10 − 1) × 1352 / 0.010` in encrypted mode). The receiver reports this as the `DispersionNs` field in the first heartbeat, giving the sender a direct measurement of the path's bottleneck capacity before the first rate adjustment. The sender can use this to seed the CC's peak rate estimate.

- **Step 4 (The Alignment):** The server parses the request, validates the payload (§4B), allocates the memory ring buffer, and begins accepting incoming data. If the server cannot allocate resources in time, initial calibration packets are dropped — these will be detected and recovered via the standard Heartbeat/NACK mechanism (§6).

- **Step 5 (Steady State):** Upon receiving the first Heartbeat response (§6), the sender transitions out of calibration mode. It clears the `Calibration` flag. If the heartbeat includes a valid `DispersionNs`, the sender uses the derived bandwidth as the initial `peakRate` estimate. The loss-driven congestion controller (§6) governs the sending rate from this point forward.
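The dispersion arithmetic from Step 3 can be checked with a few lines of Go. This is a sketch: the function name is an assumption, and the result is expressed in bytes per second by scaling the nanosecond dispersion by 10^9.

```go
package main

import "fmt"

// bottleneckBandwidth converts a calibration-burst dispersion into bytes
// per second. It returns 0 when the measurement is unusable.
func bottleneckBandwidth(burstSize, maxPayload int, dispersionNs uint64) float64 {
	if burstSize < 2 || dispersionNs == 0 {
		return 0
	}
	// (BurstSize - 1) payloads cross the bottleneck between the first and
	// last arrival timestamps.
	bytes := float64(burstSize-1) * float64(maxPayload)
	return bytes * 1e9 / float64(dispersionNs)
}

func main() {
	// 10 packets spread over 10 ms, unencrypted and encrypted payload sizes.
	fmt.Printf("%.0f B/s\n", bottleneckBandwidth(10, 1368, 10_000_000))
	fmt.Printf("%.0f B/s\n", bottleneckBandwidth(10, 1352, 10_000_000))
}
```

With the figures from Step 3 (10 packets, 10 ms), the unencrypted case evaluates to 1,231,200 B/s, matching the 1.23 MB/s quoted above.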

#### Configurable Initial Rate Override

If the `InitialRate` field in `SESSION_REQ` is non-zero, the sender skips calibration mode and begins transmitting at the specified bytes-per-second rate immediately. This is intended for known environments (e.g., a dedicated 10 Gbps LAN) where the operator can confidently set the initial rate. The adaptive congestion controller still takes over after the first Heartbeat.

**Design History:** v2.0 used 50 packets at 1ms spacing (~1.38 MB/s probe). v3.0 changed to 100 packets at wire speed. Both had problems: the 1ms spacing made LAN ramp-up take too long, while 100 wire-speed packets (140 KB) instantly filled router buffers on Starlink and poisoned the initial peakRate measurement. The v3.1 packet train (10 packets) is small enough to avoid buffer overflow on any reasonable link, while the dispersion measurement extracts the same capacity information that 100 packets would provide.

## 4.5. End-to-End Encryption

HP-UDP optionally encrypts all DATA and PARITY payloads using **AES-128-GCM** with ephemeral X25519 key exchange. Encryption is negotiated during the handshake (§4C Step 1.5) and is all-or-nothing for a session — once the `Encrypted` flag is set, every DATA and PARITY packet in the session is encrypted. Control packets (HEARTBEAT, TRANSFER_COMPLETE, ACK_CLOSE) are **not** encrypted; their payloads contain only protocol metadata, not file content.

### A. Ephemeral Key Exchange

Both sides generate a fresh **X25519 keypair** (32-byte public key, 32-byte private key) at the start of each session. Private keys exist only in memory for the duration of the transfer and are securely zeroed on session teardown. This provides **perfect forward secrecy**: there is no persistent key material that could decrypt recorded traffic after the session ends.

Key exchange is embedded in the existing handshake flow with no additional round trips beyond the 1-RTT `SESSION_ACCEPT`:

| Flow | Sender Key In | Receiver Key In | Added RTTs |
|---|---|---|---|
| Direct send/recv | SESSION_REQ payload | SESSION_ACCEPT (0x0A) payload | +1 (0-RTT → 1-RTT) |
| Serve daemon push | PUSH_REQ payload | PUSH_ACCEPT payload (extended) | +0 (already 1-RTT) |
| Serve daemon pull | PULL_REQ payload | SESSION_REQ payload (server is sender) | +0 (already 1-RTT) |

For `push` and `pull` via the serve daemon, the existing round trip already accommodates the key exchange — no additional latency is introduced. Only the basic `send`/`recv` flow gains one round trip.

### B. Session Key Derivation

Both sides independently derive the same symmetric key:

```
shared_secret = X25519(my_private_key, their_public_key)     // 32 bytes
okm           = HKDF-SHA256(
                    ikm  = shared_secret,
                    salt = SessionID (4 bytes, big-endian),
                    info = "hp-udp-aes128-v5",
                    len  = 24                                 // 16-byte key + 8-byte iv_base
                )
session_key   = okm[0..15]                                    // AES-128 key
iv_base       = okm[16..23]                                   // Nonce base (replaces random init)
```

The 24-byte HKDF output serves a dual purpose: bytes 0–15 are the AES-128-GCM session key, and bytes 16–23 become the shared `iv_base` used in nonce construction (§4.5C). Both sides derive the same `iv_base` deterministically — no additional exchange is needed. The random `iv_base` generated during key-pair initialisation is overwritten by the HKDF-derived value at the end of `hpudp_crypto_derive()`.

The `SessionID` salt ensures that even if the same ephemeral keypair were accidentally reused (implementation bug), different sessions would derive different keys and nonce bases. The `info` string binds the key to the protocol version and cipher suite, preventing cross-protocol key reuse.

C implementations: OpenSSL `EVP_KDF` with `OSSL_KDF_NAME_HKDF`, or libsodium `crypto_kdf_hkdf_sha256_extract` followed by `crypto_kdf_hkdf_sha256_expand` (HKDF needs both the extract and the expand step). Go: `golang.org/x/crypto/hkdf`.
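The derivation can be sketched in Go using only the standard library. This is an illustration rather than the reference code: `hkdfSHA256` is a hand-written RFC 5869 extract-and-expand used here to avoid an external dependency (a real build would more likely use the `hkdf` package named above), and `newKeypair` and `deriveSession` are hypothetical helper names.

```go
package main

import (
	"bytes"
	"crypto/ecdh"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hkdfSHA256 is RFC 5869 extract-then-expand (length must be <= 255*32).
func hkdfSHA256(ikm, salt, info []byte, length int) []byte {
	ext := hmac.New(sha256.New, salt)
	ext.Write(ikm)
	prk := ext.Sum(nil)

	var okm, prev []byte
	for counter := byte(1); len(okm) < length; counter++ {
		exp := hmac.New(sha256.New, prk)
		exp.Write(prev)
		exp.Write(info)
		exp.Write([]byte{counter})
		prev = exp.Sum(nil)
		okm = append(okm, prev...)
	}
	return okm[:length]
}

// newKeypair generates an ephemeral X25519 keypair.
func newKeypair() (*ecdh.PrivateKey, error) {
	return ecdh.X25519().GenerateKey(rand.Reader)
}

// deriveSession returns the 16-byte AES-128 key and the 8-byte iv_base.
func deriveSession(priv *ecdh.PrivateKey, peerPub []byte, sessionID uint32) (key, ivBase []byte, err error) {
	pub, err := ecdh.X25519().NewPublicKey(peerPub)
	if err != nil {
		return nil, nil, err
	}
	secret, err := priv.ECDH(pub)
	if err != nil {
		return nil, nil, err
	}
	var salt [4]byte
	binary.BigEndian.PutUint32(salt[:], sessionID) // SessionID as big-endian salt
	okm := hkdfSHA256(secret, salt[:], []byte("hp-udp-aes128-v5"), 24)
	return okm[:16], okm[16:24], nil
}

func main() {
	sender, err := newKeypair()
	if err != nil {
		panic(err)
	}
	receiver, err := newKeypair()
	if err != nil {
		panic(err)
	}
	const sessionID = 0x1234ABCD
	k1, iv1, err := deriveSession(sender, receiver.PublicKey().Bytes(), sessionID)
	if err != nil {
		panic(err)
	}
	k2, iv2, err := deriveSession(receiver, sender.PublicKey().Bytes(), sessionID)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(k1, k2), bytes.Equal(iv1, iv2), len(k1), len(iv1))
}
```

Both sides arrive at the same key and nonce base without transmitting either, and changing the `SessionID` salt changes both outputs.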

### C. Per-Packet Encryption

Each DATA and PARITY packet is encrypted independently using AES-128-GCM. The packet header (32 bytes) is **not encrypted** — it is passed as Additional Authenticated Data (AAD) so that the receiver can route, reorder, and identify packets before decryption. The header is authenticated by the GCM tag, preventing tampering.

#### Wire Layout (Encrypted Packet)

```
┌──────────────────┬──────────────────────────┬──────────────┐
│ Header (32 B)    │ Ciphertext (PayloadLen B) │ GCM Tag (16B)│
│ cleartext, AAD   │ AES-128-GCM output        │ auth tag     │
└──────────────────┴──────────────────────────┴──────────────┘
 Total ≤ 1400 bytes.  PayloadLen ≤ 1352 (= 1368 − 16 tag).
```

`PayloadLen` in the header reflects the **plaintext length** (which equals the ciphertext length in GCM). The receiver reads `PayloadLen + 16` bytes from the payload area to get ciphertext + tag. **Encrypted MaxPayload = 1352 bytes.** Unencrypted transfers retain MaxPayload = 1368.
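The wire layout corresponds closely to Go's `cipher.AEAD` interface, with the cleartext header passed as additional data. The sketch below is illustrative: `newAEAD`, `sealPacket`, and `openPacket` are hypothetical names, the nonce is supplied by the caller, and the opener treats everything after the header as ciphertext plus tag rather than slicing by `PayloadLen + 16`.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"errors"
	"fmt"
)

const (
	headerSize          = 32
	gcmTagSize          = 16
	mtuHardCap          = 1400
	maxEncryptedPayload = mtuHardCap - headerSize - gcmTagSize // 1352
)

// newAEAD builds one AES-128-GCM instance for the whole session.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key) // a 16-byte key selects AES-128
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealPacket returns header || ciphertext || tag, authenticating the header as AAD.
func sealPacket(aead cipher.AEAD, header, nonce, plaintext []byte) ([]byte, error) {
	if len(header) != headerSize {
		return nil, errors.New("header must be 32 bytes")
	}
	if len(nonce) != aead.NonceSize() {
		return nil, errors.New("nonce must be 12 bytes")
	}
	if len(plaintext) > maxEncryptedPayload {
		return nil, errors.New("plaintext exceeds encrypted MaxPayload")
	}
	out := make([]byte, headerSize, headerSize+len(plaintext)+gcmTagSize)
	copy(out, header)
	return aead.Seal(out, nonce, plaintext, header), nil
}

// openPacket verifies the tag over header and ciphertext, then returns the plaintext.
func openPacket(aead cipher.AEAD, nonce, datagram []byte) ([]byte, error) {
	if len(datagram) < headerSize+gcmTagSize || len(nonce) != aead.NonceSize() {
		return nil, errors.New("malformed encrypted packet")
	}
	return aead.Open(nil, nonce, datagram[headerSize:], datagram[:headerSize])
}

func main() {
	aead, err := newAEAD(make([]byte, 16)) // all-zero demo key, never use in practice
	if err != nil {
		panic(err)
	}
	header := make([]byte, headerSize)
	header[0] = 0x01 // DATA
	nonce := make([]byte, 12)
	pkt, err := sealPacket(aead, header, nonce, []byte("chunk bytes"))
	if err != nil {
		panic(err)
	}
	plain, err := openPacket(aead, nonce, pkt)
	fmt.Println(len(pkt), string(plain), err)
}
```

A single `cipher.AEAD` value can be reused for every packet in the session; only the nonce changes per call.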

#### Nonce Construction (12 Bytes)

AES-GCM requires a unique nonce for every packet encrypted under the same key. Nonce reuse completely breaks GCM's confidentiality and authenticity guarantees. The nonce is constructed deterministically from the HKDF-derived `iv_base` (§4.5B) and the packet's sequence number:

| Bytes | Field | Purpose |
|---|---|---|
| 0–7 | iv_base (8 bytes) | Session-scoped nonce base derived from HKDF output (§4.5B). Identical on both sides; never transmitted. Binds nonce space to this session's key material. |
| 8–11 | Low 32 bits of SequenceNum (big-endian, 4 bytes) | Unique per packet within the session. Combined with iv_base, produces a distinct 96-bit nonce for every packet. Applied as htobe32(seq & 0xFFFFFFFF). |

**Nonce uniqueness proof:** `SequenceNum` is strictly incrementing and never reused within a session. The maximum file size is 1 TB; at 1352 bytes per encrypted payload, that is at most ~810 million packets, well below the 2^32 (≈4.29 billion) wrap point. Therefore the low 32 bits of `SequenceNum` are globally unique within any single session, and combined with the session-scoped `iv_base`, the 12-byte nonce is unique for every (session, packet) pair.

The nonce is **not transmitted on the wire**. Both sides compute it independently from `iv_base` (derived identically via HKDF) and the `SequenceNum` in the packet header. This saves 12 bytes per packet compared to an explicit nonce.
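The nonce layout and the packet-count bound can be sketched in Go as follows. The function names are assumptions, and the example covers the DATA-packet case in which `SequenceNum` is the global chunk index.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buildNonce concatenates the 8-byte iv_base with the low 32 bits of the
// sequence number in big-endian order.
func buildNonce(ivBase [8]byte, seq uint64) [12]byte {
	var n [12]byte
	copy(n[:8], ivBase[:])
	binary.BigEndian.PutUint32(n[8:], uint32(seq)) // conversion keeps the low 32 bits
	return n
}

// fitsNonceSpace reports whether every chunk of a file gets a distinct
// low-32-bit sequence value.
func fitsNonceSpace(fileSize, payloadPerPacket uint64) bool {
	packets := (fileSize + payloadPerPacket - 1) / payloadPerPacket
	return packets <= 1<<32
}

func main() {
	iv := [8]byte{1, 2, 3, 4, 5, 6, 7, 8}
	fmt.Printf("%x\n", buildNonce(iv, 0x1_0000_00FF))
	fmt.Println(fitsNonceSpace(1<<40, 1352))
}
```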

### D. Encryption Placement in the Data Path

Encryption is applied **after FEC encoding**. The FEC encoder operates on plaintext data shards and produces plaintext parity shards. Each shard (DATA or PARITY) is then encrypted independently before transmission. On the receiver side, each packet is decrypted individually, then the plaintext shards are passed to the FEC decoder for reconstruction if needed.

```
Sender:  file → chunk → FEC encode (plaintext) → encrypt each shard → transmit
Receiver: receive → decrypt each shard → FEC decode (plaintext) → reassemble → disk
```

This ordering means FEC reconstruction operates on plaintext, as the design intends: the Reed-Solomon code is computed over the original data bytes, not over ciphertext. Encrypting before FEC would save no work, because all `k` shards would still have to be decrypted, and a reconstructed ciphertext shard would then need the original plaintext to verify, creating a circular dependency.

### E. Payload Format Changes for Key Exchange

When the `Encrypted` flag (`0x04`) is set, the following payloads are extended with a 32-byte ephemeral public key:

| Packet Type | Standard Payload | Encrypted Payload |
|---|---|---|
| SESSION_REQ | FileSize(8B) + Hash(8B) + InitialRate(4B) + FileName(null-term) | FileSize(8B) + Hash(8B) + InitialRate(4B) + PubKey(32B) + FileName(null-term) |
| SESSION_ACCEPT | (does not exist in unencrypted mode) | PubKey(32B) |
| PUSH_REQ | FileSize(8B) + Hash(8B) + InitialRate(4B) + FileName(null-term) | FileSize(8B) + Hash(8B) + InitialRate(4B) + PubKey(32B) + FileName(null-term) |
| PUSH_ACCEPT | (no payload) | PubKey(32B) |
| PULL_REQ | FileName(null-term) | PubKey(32B) + FileName(null-term) |

The receiver determines whether to parse the public key by checking the `Encrypted` flag in the packet header. If a receiver does not support encryption and receives a request with the flag set, it responds with `SESSION_REJECT` (reason code `0x06 = ENCRYPTION_UNSUPPORTED`).
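Parsing the `SESSION_REQ` payload in both modes can be sketched as below. Assumptions: `payload` holds exactly `PayloadLen` bytes (so the null terminator is not included, per §4C), and the type and function names are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	flagEncrypted   = 0x04
	fixedFieldsSize = 8 + 8 + 4 // FileSize + Hash + InitialRate
	pubKeySize      = 32
)

// SessionReq is an illustrative decoded form of the SESSION_REQ payload.
type SessionReq struct {
	FileSize    uint64
	Hash        uint64
	InitialRate uint32
	PubKey      []byte // nil when the Encrypted flag is unset
	FileName    string
}

// parseSessionReq decodes the payload; the header Flags byte decides
// whether a 32-byte public key precedes the file name.
func parseSessionReq(flags uint8, payload []byte) (SessionReq, error) {
	encrypted := flags&flagEncrypted != 0
	need := fixedFieldsSize
	if encrypted {
		need += pubKeySize
	}
	if len(payload) <= need { // at least one file-name byte is required
		return SessionReq{}, errors.New("payload too short or file name missing")
	}
	req := SessionReq{
		FileSize:    binary.BigEndian.Uint64(payload[0:]),
		Hash:        binary.BigEndian.Uint64(payload[8:]),
		InitialRate: binary.BigEndian.Uint32(payload[16:]),
	}
	off := fixedFieldsSize
	if encrypted {
		req.PubKey = append([]byte(nil), payload[off:off+pubKeySize]...)
		off += pubKeySize
	}
	name := payload[off:]
	if bytes.IndexByte(name, 0) >= 0 {
		return SessionReq{}, errors.New("file name contains a null byte")
	}
	req.FileName = string(name)
	return req, nil
}

func main() {
	p := make([]byte, fixedFieldsSize)
	binary.BigEndian.PutUint64(p[0:], 4096)
	p = append(p, []byte("data.bin")...)
	req, err := parseSessionReq(0x00, p)
	fmt.Println(req.FileSize, req.FileName, req.PubKey == nil, err)
}
```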

### F. Security Properties and Non-Goals

- **Confidentiality:** File content is encrypted in transit. An eavesdropper sees only ciphertext and packet headers (which contain sequence numbers and timing, but no file content).

- **Integrity:** GCM's authentication tag detects any modification to the ciphertext or header. A tampered packet fails decryption and is treated as a lost packet (NACKed for retransmission).

- **Forward secrecy:** Ephemeral keys are destroyed after each session. Recorded traffic cannot be decrypted retroactively.

- **Replay protection:** The strictly-incrementing nonce (derived from SequenceNum) combined with the per-session key means replayed packets from a previous session will fail decryption (wrong key), and replayed packets within a session are detected by the existing duplicate-sequence-number check in the receiver's ring buffer.

- **Non-goal — Authentication:** HP-UDP does not authenticate the identity of the remote endpoint. Any party that can reach the port can initiate a key exchange. This is intentional: in the target deployment environment (managed networks with known infrastructure), endpoint identity is established at the network layer, not the application layer. A future revision may add optional pre-shared-key or certificate-based authentication as a separate concern.

- **Non-goal: Metadata privacy.** Packet headers and unencrypted control payloads (including the file size carried in `SESSION_REQ`, sequence numbers, and timing) are visible to observers. Traffic analysis can reveal transfer size and duration. Metadata encryption is out of scope.

### G. Performance Budget

| Operation | Throughput (AES-NI) | Impact at 100 MB/s wire speed |
|---|---|---|
| AES-128-GCM encrypt | 4–6 GB/s single-thread | <3% CPU |
| AES-128-GCM decrypt | 4–6 GB/s single-thread | <3% CPU |
| X25519 scalar multiply | ~50 µs per session | Negligible |
| HKDF-SHA256 derivation | ~1 µs per session | Negligible |
| Payload reduction (1368→1352) | 1.2% fewer data bytes per packet | ~1.2% more packets for same file |

**Net throughput impact: <5%.** AES-NI hardware acceleration is present on most x86 CPUs manufactured since ~2010–2011 (Intel Westmere / AMD Bulldozer onward), although some low-end parts shipped without it. C implementations should use OpenSSL's `EVP_aes_128_gcm` (which auto-detects AES-NI) or a SIMD-accelerated library. Go's `crypto/aes` + `crypto/cipher` uses AES-NI on amd64 automatically.

**Implementation Note:** Do not allocate and free GCM cipher contexts per packet. Pre-allocate one context at session start and reuse it across packets, updating only the nonce via `EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, nonce)` (OpenSSL) or by calling `Seal` on the same `cipher.AEAD` value with a new nonce (Go). Context reuse eliminates ~25,000 allocations/sec at 35 MB/s throughput.

## 5. Core Reliability Mechanisms

### A. Sequence Buffering (Zero-Blocking Receiver)

The receiver will never halt reading from the network socket. Incoming data is immediately mapped into memory.

- **Memory Architecture:** The receiver allocates a contiguous ring buffer based on the initial `FileSize`.

- **Placement:** As datagrams arrive, their `SequenceNum` dictates their exact memory offset. Out-of-order packets are slotted into their correct positions seamlessly.

- **Disk I/O:** Contiguous blocks are flushed from the buffer to disk asynchronously. Go: a dedicated goroutine reads and writes sequentially. C (Linux): `io_uring` submission queue — the main `epoll` event loop submits write SQEs for contiguous regions and reaps completions without blocking the network read path. This eliminates the flush thread entirely.
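The placement rule (sequence number decides the memory offset) and the contiguous flush frontier can be sketched as follows. This is a simplified illustration: it uses a flat full-file buffer rather than a true ring, and the `receiver` type and its methods are hypothetical names.

```go
package main

import "fmt"

// receiver holds a full-file buffer plus per-chunk arrival flags.
type receiver struct {
	buf        []byte
	got        []bool
	maxPayload uint64
	next       uint64 // lowest sequence number not yet received
}

func newReceiver(fileSize, maxPayload uint64) *receiver {
	chunks := (fileSize + maxPayload - 1) / maxPayload
	return &receiver{buf: make([]byte, fileSize), got: make([]bool, chunks), maxPayload: maxPayload}
}

// place copies payload to offset seq*maxPayload. It returns false for
// duplicates and for packets that fall outside the file.
func (r *receiver) place(seq uint64, payload []byte) bool {
	if seq >= uint64(len(r.got)) || r.got[seq] {
		return false
	}
	off := seq * r.maxPayload
	if off+uint64(len(payload)) > uint64(len(r.buf)) {
		return false
	}
	copy(r.buf[off:], payload)
	r.got[seq] = true
	for r.next < uint64(len(r.got)) && r.got[r.next] {
		r.next++ // advance the contiguous frontier
	}
	return true
}

// flushable returns how many leading bytes are contiguous and can go to disk.
func (r *receiver) flushable() uint64 {
	n := r.next * r.maxPayload
	if n > uint64(len(r.buf)) {
		n = uint64(len(r.buf))
	}
	return n
}

func main() {
	r := newReceiver(10, 4) // tiny sizes for demonstration only
	r.place(2, []byte("ij"))
	r.place(0, []byte("abcd"))
	fmt.Println(r.flushable())
	r.place(1, []byte("efgh"))
	fmt.Println(r.flushable(), string(r.buf))
}
```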

### B. Adaptive Forward Error Correction (FEC)

To eliminate latency penalties from round-trip retransmissions, the protocol proactively embeds mathematical redundancy that adapts to observed network conditions.

#### Block Grouping

Data packets are organized into sequential `BlockGroups`. The default block size is 100 data packets per group. The `BlockGroup` identifier for a DATA packet is computed as:

```
BlockGroup = SequenceNum / BlockSize   (integer division)
```

PARITY packets use the same `BlockGroup` value as the data packets they protect. The `SequenceNum` field of a PARITY packet is its **zero-based index within the block** (0, 1, 2, …, m−1), not a global sequence number. A C receiver must distinguish PARITY from DATA packets using the `PacketType` field (`0x02`) and interpret `SequenceNum` accordingly.

#### Dynamic Parity Ratio

The parity packet count per block is dynamically adjusted based on the observed packet loss rate, reported via Heartbeat metrics (§6).

| Observed Loss Rate | Parity Ratio | Parity Packets per 100-Packet Block |
|---|---|---|
| < 0.5% | 2% | 2 |
| 0.5% – 2% | 5% | 5 |
| 2% – 5% | 10% | 10 |
| 5% – 10% | 15% | 15 |
| > 10% | 20% | 20 |

The sender initializes at **5% parity** during calibration and adjusts after the first Heartbeat containing loss data. Adjustments are applied on **block group boundaries** — mid-block changes are not permitted, as this would invalidate the Reed-Solomon coding parameters for that group.
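The parity table and the block-boundary rule can be sketched as a small scheduler. Assumptions: loss is given in basis points as in the Heartbeat, each band includes its lower bound (the table leaves shared boundaries open), exactly 10% maps to the 15% band, and all names are hypothetical.

```go
package main

import "fmt"

const blockSize = 100

// parityPercent maps a loss rate in basis points to a parity ratio in percent.
func parityPercent(lossBp uint16) int {
	switch {
	case lossBp < 50:
		return 2
	case lossBp < 200:
		return 5
	case lossBp < 500:
		return 10
	case lossBp <= 1000:
		return 15
	default:
		return 20
	}
}

// paritySchedule defers ratio changes until the next block group starts.
type paritySchedule struct {
	current int // percent in force for the block being encoded
	pending int // percent requested by the latest heartbeat
}

func newParitySchedule() *paritySchedule {
	return &paritySchedule{current: 5, pending: 5} // 5% during calibration
}

func (p *paritySchedule) onHeartbeat(lossBp uint16) {
	p.pending = parityPercent(lossBp)
}

// onBlockStart applies any pending change and returns the parity packet
// count for a full 100-packet block.
func (p *paritySchedule) onBlockStart() int {
	p.current = p.pending
	return p.current * blockSize / 100
}

func main() {
	s := newParitySchedule()
	fmt.Println(s.onBlockStart()) // calibration default
	s.onHeartbeat(300)            // 3.00% loss arrives mid-block
	fmt.Println(s.current)        // unchanged until the next boundary
	fmt.Println(s.onBlockStart())
}
```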

#### Parity Generation

Parity packets are generated using **Reed-Solomon erasure coding** over GF(2^8) with the irreducible polynomial `x^8 + x^4 + x^3 + x^2 + 1` (0x11d). For a block of `k` data packets with `m` parity packets, any `k` of the `k+m` total packets are sufficient to reconstruct the original data. The encoding uses a Vandermonde-derived matrix whose top `k` rows form an identity matrix, ensuring data packets pass through unchanged and only parity is computed.
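The field arithmetic underlying the Reed-Solomon code can be illustrated with a bitwise multiply in GF(2^8) reduced by 0x11d. This is only a sketch of the field operations, not an encoder; production code would normally use lookup tables or SIMD, and the function names are assumptions.

```go
package main

import "fmt"

// gfMul multiplies two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1.
func gfMul(a, b byte) byte {
	var p byte
	for b != 0 {
		if b&1 != 0 {
			p ^= a // addition in GF(2^8) is XOR
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d // low byte of 0x11d folds the x^8 term back in
		}
		b >>= 1
	}
	return p
}

// gfInv finds the multiplicative inverse by exhaustive search (fine for a demo).
func gfInv(a byte) byte {
	for b := 1; b < 256; b++ {
		if gfMul(a, byte(b)) == 1 {
			return byte(b)
		}
	}
	return 0 // only reached for a == 0, which has no inverse
}

func main() {
	fmt.Printf("0x%02x\n", gfMul(0x02, 0x80)) // x * x^7 = x^8, reduced by the polynomial
	inv := gfInv(0x53)
	fmt.Println(gfMul(0x53, inv) == 1)
}
```

Every non-zero element has an inverse because the polynomial is irreducible, which is what makes the erasure-recovery matrix invertible.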

#### On-the-Fly Recovery

If the receiver detects missing packets within a completed block group (all expected sequence numbers accounted for or timed out), it attempts FEC reconstruction immediately. Only packets that cannot be recovered via FEC are reported as NACKs in the next Heartbeat.

#### Tail Block Handling

The final block group of a file transfer will almost certainly contain fewer than 100 data packets. The FEC parameters adapt as follows:

- The tail block size `k_tail` equals the remaining packet count after the last full block.

- The parity count `m_tail` is calculated using the current adaptive parity ratio, with a **minimum of 2 parity packets** regardless of block size (even a 1-packet tail block gets 2 parity packets). This ensures the most loss-vulnerable portion of the transfer — the tail — has adequate redundancy.

- The sender sets the `EndOfFile` flag (`0x01`) on the final data packet and the final parity packet of the tail block. This signals to the receiver that no further block groups will follow.

The key formulas for computing totals and boundaries are:

```
TotalChunks      = ceil(FileSize / MaxPayload)          // number of DATA packets (MaxPayload = 1368)
k_tail           = TotalChunks % BlockSize               // 0 means last block is full
                   (if k_tail == 0: k_tail = BlockSize)
FinalPayloadSize = FileSize % MaxPayload                 // bytes in last DATA packet
                   (if FinalPayloadSize == 0: FinalPayloadSize = MaxPayload)
```

**EOF Detection in FEC-Recovered Packets:** The `EndOfFile` flag is only meaningful on the wire. When the final DATA packet is *recovered via FEC reconstruction* rather than received directly, the flag is not propagated into the reconstructed shard. Receivers must therefore detect end-of-transfer by checking `SequenceNum == TotalChunks − 1` (known from the `SESSION_REQ` `FileSize` field) rather than relying solely on the Flags field.
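The tail-block formulas and the sequence-based EOF check translate into a few integer operations. In this sketch the names are hypothetical, `fileSize` is assumed to be non-zero (it is validated in §4B), and `m_tail` is rounded up before the minimum of 2 is applied, since the rounding direction is not stated above.

```go
package main

import "fmt"

// layout collects the derived sizes for one transfer.
type layout struct {
	TotalChunks      uint64
	KTail            uint64
	MTail            uint64
	FinalPayloadSize uint64
}

// computeLayout applies the formulas above. parityPct is the current
// adaptive parity ratio in percent.
func computeLayout(fileSize, maxPayload, blockSize, parityPct uint64) layout {
	l := layout{TotalChunks: (fileSize + maxPayload - 1) / maxPayload}
	l.KTail = l.TotalChunks % blockSize
	if l.KTail == 0 {
		l.KTail = blockSize // last block is full
	}
	l.FinalPayloadSize = fileSize % maxPayload
	if l.FinalPayloadSize == 0 {
		l.FinalPayloadSize = maxPayload
	}
	l.MTail = (l.KTail*parityPct + 99) / 100 // assumed: round up
	if l.MTail < 2 {
		l.MTail = 2 // minimum tail redundancy
	}
	return l
}

// blockGroup is the integer division from the Block Grouping section.
func blockGroup(seq, blockSize uint64) uint64 { return seq / blockSize }

// isFinalChunk detects end of transfer without relying on the EndOfFile flag.
func isFinalChunk(seq uint64, l layout) bool { return seq == l.TotalChunks-1 }

func main() {
	l := computeLayout(250*1368+100, 1368, 100, 5)
	fmt.Printf("%+v\n", l)
	fmt.Println(blockGroup(250, 100), isFinalChunk(250, l))
}
```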

## 6. Adaptive Congestion and Flow Control

HP-UDP separates **congestion control** (network path capacity) from **flow control** (receiver processing capacity). The congestion controller is **loss-driven**: the primary signal is the observed packet loss rate, not the ratio of delivery rate to send rate. The delivery rate acts as a ceiling, not a decision driver.

**Design Rationale (v3.0):** The v2.0 spec used a delivery-rate-ratio algorithm where the sender increased only if `EffectiveRate ≥ 0.95 × SendRate`. In practice, this threshold was unreachable: delivery rate is bounded by send rate (the receiver cannot report receiving more than was sent), and timing jitter in heartbeat measurement windows made the ratio consistently fall below 0.95 even on a clean link. This caused the sender to spiral to the rate floor. Loss rate is the correct primary signal because it reflects whether the network has headroom independently of the send rate.

### A. The Heartbeat Packet

The receiver periodically sends a `HEARTBEAT` (Type `0x03`) packet to the sender. The heartbeat interval is **rate-proportional** based on the last measured `NetworkDeliveryRate`:

| Last Measured Network Delivery Rate | Heartbeat Interval |
|---|---|
| < 10 MB/s | 100ms |
| 10 – 100 MB/s | 50ms |
| 100 MB/s – 1 GB/s | 25ms |
| > 1 GB/s | 10ms |

**Implementation Note:** The v2.0 spec described the interval as based on "current effective send rate" inferred from packet arrival rate. The implementation initially used cumulative bytes written to disk, which gave incorrect results (e.g., passing 136 MB to a function expecting bytes/sec). The correct approach is to track the `NetworkDeliveryRate` computed in each heartbeat and use that value directly for interval selection.
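Interval selection from the last measured `NetworkDeliveryRate` is a direct table lookup. The sketch below assumes decimal units (1 MB = 10^6 bytes) and that each band includes its lower bound; neither is stated above, and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// heartbeatInterval picks the interval from the last measured delivery
// rate in bytes per second (the heartbeat field is 32 bits wide).
func heartbeatInterval(deliveryRate uint32) time.Duration {
	switch {
	case deliveryRate < 10_000_000: // < 10 MB/s
		return 100 * time.Millisecond
	case deliveryRate < 100_000_000: // 10 to 100 MB/s
		return 50 * time.Millisecond
	case deliveryRate <= 1_000_000_000: // 100 MB/s to 1 GB/s
		return 25 * time.Millisecond
	default: // > 1 GB/s
		return 10 * time.Millisecond
	}
}

func main() {
	for _, r := range []uint32{2_000_000, 35_000_000, 500_000_000, 2_000_000_000} {
		fmt.Println(r, heartbeatInterval(r))
	}
}
```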

### B. Heartbeat Payload

The Heartbeat payload contains dual metrics, RTT echo, and a NACK array. All multi-byte fields below are **big-endian (network byte order)** — C implementations must serialize/deserialize with `htonl`/`ntohl` and `htonll`/`ntohll`:

| Field | Size | Description |
|---|---|---|
| NetworkDeliveryRate | 4 Bytes | Bytes per second successfully received from the socket into the ring buffer during the last heartbeat interval. Reflects network path capacity. |
| StorageFlushRate | 4 Bytes | Bytes per second flushed from the ring buffer to disk during the last heartbeat interval. Reflects receiver I/O capacity. |
| LossRate | 2 Bytes | Packet loss percentage for the current reporting window, encoded as basis points (e.g., 150 = 1.50%). Primary signal for congestion control. |
| EchoTimestampNs | 8 Bytes | The verbatim value of SenderTimestampNs from the most recently received DATA or PARITY packet header. The sender computes RTT = now_ns − EchoTimestampNs using only its own monotonic clock, eliminating cross-machine clock-skew error entirely. If no DATA/PARITY packets were received in the interval, the receiver echoes the previous value unchanged (the sender's frozen-timestamp guard ignores stale repeats — see §6B). |
| DispersionNs | 8 Bytes | Calibration burst dispersion measurement: the time (nanoseconds) between the first and last received calibration-flagged packet. Zero outside of calibration. The sender uses this to compute bottleneck bandwidth in bytes per second: BW = (BurstSize − 1) × MaxPayload × 10^9 / DispersionNs (where MaxPayload = 1368, or 1352 in encrypted mode). See §4C. |
| HighestContiguous | 8 Bytes | The highest SequenceNum such that all packets 0..N have been received or FEC-recovered. Allows sender to track receiver progress. |
| NACKCount | 2 Bytes | Number of unrecoverable sequence numbers in the NACK array that follows. |
| NACKArray | 8 Bytes × N | Array of 64-bit SequenceNum values that were not recoverable via FEC and require retransmission. Bounded to sequences between HighestContiguous+1 and the highest received sequence number (never NACKs packets the sender hasn't transmitted yet). The array is physically limited to fit within one packet: MaxNACKs = floor((MaxPayload − HeartbeatFixedSize) / 8) = floor((1368 − 36) / 8) = 166. Rate-limiting contexts state this figure loosely as 167; implementations must not exceed the true computed limit of 166. If more than 166 sequences are pending, the receiver sends the highest-priority subset and the remainder appear in a subsequent heartbeat. |
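The payload table can be turned into a serializer as sketched below. Assumptions: fields appear on the wire in the same order as the table rows (which gives the 36-byte fixed part), and the `Heartbeat` type and function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	heartbeatFixedSize = 4 + 4 + 2 + 8 + 8 + 8 + 2 // 36 bytes
	maxPayload         = 1368
	maxNACKs           = (maxPayload - heartbeatFixedSize) / 8 // 166
)

// Heartbeat is an illustrative in-memory form of the payload.
type Heartbeat struct {
	NetworkDeliveryRate uint32
	StorageFlushRate    uint32
	LossRateBp          uint16 // basis points
	EchoTimestampNs     uint64
	DispersionNs        uint64
	HighestContiguous   uint64
	NACKs               []uint64
}

// Marshal encodes the heartbeat in network byte order.
func (h Heartbeat) Marshal() ([]byte, error) {
	if len(h.NACKs) > maxNACKs {
		return nil, errors.New("too many NACKs for one heartbeat")
	}
	b := make([]byte, heartbeatFixedSize+8*len(h.NACKs))
	binary.BigEndian.PutUint32(b[0:], h.NetworkDeliveryRate)
	binary.BigEndian.PutUint32(b[4:], h.StorageFlushRate)
	binary.BigEndian.PutUint16(b[8:], h.LossRateBp)
	binary.BigEndian.PutUint64(b[10:], h.EchoTimestampNs)
	binary.BigEndian.PutUint64(b[18:], h.DispersionNs)
	binary.BigEndian.PutUint64(b[26:], h.HighestContiguous)
	binary.BigEndian.PutUint16(b[34:], uint16(len(h.NACKs)))
	for i, s := range h.NACKs {
		binary.BigEndian.PutUint64(b[heartbeatFixedSize+8*i:], s)
	}
	return b, nil
}

// ParseHeartbeat decodes a payload and bounds-checks the NACK array.
func ParseHeartbeat(b []byte) (Heartbeat, error) {
	if len(b) < heartbeatFixedSize {
		return Heartbeat{}, errors.New("heartbeat too short")
	}
	h := Heartbeat{
		NetworkDeliveryRate: binary.BigEndian.Uint32(b[0:]),
		StorageFlushRate:    binary.BigEndian.Uint32(b[4:]),
		LossRateBp:          binary.BigEndian.Uint16(b[8:]),
		EchoTimestampNs:     binary.BigEndian.Uint64(b[10:]),
		DispersionNs:        binary.BigEndian.Uint64(b[18:]),
		HighestContiguous:   binary.BigEndian.Uint64(b[26:]),
	}
	n := int(binary.BigEndian.Uint16(b[34:]))
	if n > maxNACKs || len(b) < heartbeatFixedSize+8*n {
		return Heartbeat{}, errors.New("NACK count exceeds payload")
	}
	h.NACKs = make([]uint64, n)
	for i := range h.NACKs {
		h.NACKs[i] = binary.BigEndian.Uint64(b[heartbeatFixedSize+8*i:])
	}
	return h, nil
}

func main() {
	hb := Heartbeat{NetworkDeliveryRate: 35_000_000, LossRateBp: 150, HighestContiguous: 99, NACKs: []uint64{100, 103}}
	wire, err := hb.Marshal()
	fmt.Println(len(wire), maxNACKs, err)
}
```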

#### RTT Measurement — Same-Clock Design

Each DATA and PARITY packet carries a **sender timestamp** in the fixed `SenderTimestampNs` header field (offset `0x18`, 8 bytes, nanoseconds on the sender's own clock, set at packet-build time). Non-data control packets leave this field zero.

**Problem with prior versions:** v3.x had the receiver set `EchoTimestampNs = receiver's time.Now()`. The sender then computed `RTT = sender_now − receiver_now`, a cross-machine clock comparison. A 4-second clock skew between machines produced a 5-second RTT estimate, permanently locking the NACK cooldown above the receiver's 5-second inactivity timeout and killing the transfer.

**Fix:** The receiver now echoes the sender's own timestamp verbatim: `EchoTimestampNs = pkt.Header.SenderTimestampNs`. The sender computes `RTT = now_ns − EchoTimestampNs` using only its own monotonic clock. Cross-clock error is eliminated entirely.

#### Frozen-Timestamp RTT Guard

**Problem:** When the sender is idle (honouring a NACK cooldown), the receiver keeps echoing the same stale `SenderTimestampNs` from the last packet it received. Each heartbeat makes `RTT = now − staleTs` grow by one heartbeat interval. After enough heartbeats the RTT inflates past 5 seconds, the NACK cooldown exceeds the receiver's inactivity timeout, and the transfer dies.

**Fix:** The `TokenBucket` tracks `lastEchoNs` — the highest `EchoTimestampNs` it has processed. RTT is updated only when `echoNs > lastEchoNs`. Stale repeated echoes are silently ignored; the RTT estimate stays locked at its last valid measurement until the sender transmits a new packet and the receiver reflects a fresh timestamp.

### C. Loss-Driven Rate Adjustment Algorithm

The sender adjusts its rate based on the `LossRate` reported in each Heartbeat. The effective delivery rate (`rawEffective = NetworkDeliveryRate`) serves as a ceiling for decreases. `StorageFlushRate` is still reported in heartbeats for observability but is no longer an input to the rate controller: the receiver uses a pre-allocated full-file ring buffer, so disk lag cannot cause packet loss. Including `StorageFlushRate` in the minimum caused out-of-order packets to stall the contiguous flush frontier to 0, making `StorageFlushRate ≈ 0` and falsely triggering the delivery-collapse guard on every heartbeat. Rate increases are **gated to once per RTT** to prevent the sender from making dozens of upward adjustments before any feedback arrives (critical on high-latency links like satellite).

#### RTT-Aware Rate Gating

The sender tracks the time of its last rate increase. When a heartbeat signals an increase, the sender checks whether at least one RTT has elapsed since the previous increase. If not, the increase is suppressed (treated as a hold). Decreases are **not gated** — they are applied immediately upon consecutive loss signals to clear the pipe as fast as possible.

#### Phased Growth Model

The congestion controller operates in two phases, similar to TCP's slow start and congestion avoidance but adapted for loss-driven UDP with FEC:

**Phase 1 — Probe (Multiplicative Increase):** While loss stays below 1% and the sender has not yet observed the link ceiling, it probes aggressively with multiplicative increase, applied once per RTT:

```
S_new = S_current × 1.25
```

This is more conservative than the v3.0 1.5× multiplier but is applied per RTT rather than per heartbeat, giving the network time to signal back before each step.

**Phase 2 — Congestion Avoidance (Additive Increase):** Once the sender observes loss entering the 1%–5% hold zone for the first time, it has found the approximate ceiling of the link. The controller permanently transitions to Phase 2 and never returns to Phase 1 for this session. Probing uses additive increase, applied once per RTT:

```
S_new = S_current + (MaxPayload / RTT)
```

This adds approximately one packet per RTT of additional bandwidth, gently probing for headroom without risking a burst of loss.

#### Decision Logic

Let `L` = reported `LossRate` in basis points, `E` = effective delivery rate (`NetworkDeliveryRate` — see §6C rationale above), `S` = current send rate. On each Heartbeat reception, the **delivery-collapse guard is checked first**:

| Condition | Action | Rationale |
|---|---|---|
| NACKCount > 0 AND E < S × 0.25 | Hold + permanently transition to Phase 2. Evaluated before loss thresholds. | OS socket buffer overflow. Packets are dropped before reaching the receiver application — reported LossRate stays 0% (no FEC failures counted) while delivery collapses. NACKs confirm real packet loss. Entering Phase 2 fires the 1.5× ceiling immediately, cutting the target rate to near actual link capacity. Threshold lowered from 50% to 25%: on high-latency paths (50ms+ RTT) approximately 50% of packets are legitimately in-flight during warm-up, causing the old 50% threshold to fire prematurely on measurement lag. |
| L < 100 (loss < 1%) | Increase (once per RTT): Phase 1: S × 1.25. Phase 2: S + MaxPayload/RTT. | Link has headroom. FEC absorbs transient loss. RTT gating prevents runaway probing on high-latency paths. |
| 100 ≤ L ≤ 500 (1% – 5%) | Hold: S = S. If first time entering this zone, transition to Phase 2 permanently. | FEC is handling the loss. The link ceiling has been discovered. Switch to additive probing from this point forward. |
| L > 500 (loss > 5%), consecutive confirmation | Decrease: S = smoothed(E) × 0.85 | Drop to 85% of the EWMA-smoothed effective delivery rate. The 15% undershoot allows router queues to drain; FEC bridges the gap during recovery. Requires two consecutive above-threshold heartbeats to trigger (see below). |

**Design Rationale (v3.1 / v4.0):** Five cumulative changes. (1) The v3.0 decrease formula `E × 1.05` set the new rate *above* the rate that just caused severe loss, sustaining congestion. Dropping to `E × 0.85` gives queues time to drain. (2) The 1.5× multiplicative increase per heartbeat was replaced with 1.25× per RTT — on a Starlink link with 100ms heartbeats and 40ms RTT, the old algorithm made 2.5 increases per RTT, compounding to ~1.95× per RTT. (3) The permanent transition to additive increase after discovering the link ceiling prevents repeated boom-bust oscillation at the capacity boundary. (4) `StorageFlushRate` removed from effective-rate formula (v4.0): pre-allocated ring buffers always stall flush at 0 for out-of-order arrivals, making `min(NetworkDeliveryRate, StorageFlushRate) ≈ 0` and permanently tripping the delivery-collapse guard. (5) Delivery-collapse threshold lowered 50%→25% (v4.0): legitimate in-flight packets on 50ms+ RTT paths account for ~50% of the window, causing false collapses during ramp-up with the old threshold.

#### EWMA Smoothing

Raw delivery rate measurements from individual heartbeats are noisy due to timing jitter, especially on high-latency links. The sender maintains an exponentially weighted moving average (EWMA) of the effective delivery rate:

```
smoothed = α × raw_sample + (1 − α) × smoothed_previous
```

The default smoothing factor is `α = 0.3`, which provides moderate damping: after 3 samples the average has absorbed about 66% of a step change (1 − 0.7³ ≈ 0.66). The smoothed rate is used as the target when decreasing, preventing single-heartbeat noise from crashing the rate.
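
A short Go sketch shows how quickly the average follows a step change at `α = 0.3`. The `ewma` type is a hypothetical name, and seeding the average with the first sample is an assumption; the spec does not say how the average is initialised.

```go
package main

import "fmt"

const alpha = 0.3 // default smoothing factor from the text

// ewma holds the smoothed effective delivery rate.
type ewma struct {
	value  float64
	seeded bool
}

// update folds one raw heartbeat sample into the average and returns it.
// Seeding with the first sample is an assumption, not stated in the spec.
func (e *ewma) update(sample float64) float64 {
	if !e.seeded {
		e.value, e.seeded = sample, true
		return e.value
	}
	e.value = alpha*sample + (1-alpha)*e.value
	return e.value
}

func main() {
	var e ewma
	e.update(10) // steady 10 MB/s
	for i := 0; i < 3; i++ {
		e.update(20) // delivery steps up to 20 MB/s
	}
	fmt.Printf("%.2f MB/s\n", e.value) // 16.57 MB/s
}
```

The intermediate values are 13.00, 15.10, and 16.57, so a single outlier sample moves the decrease target by at most 30% of its deviation.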

#### Consecutive Decrease Requirement

A single heartbeat reporting > 5% loss may be a transient spike (e.g., a router briefly queuing). The sender requires **two consecutive** above-threshold heartbeats before executing a decrease. The first signal starts a "decrease streak" counter; the second confirms it. Any increase or hold resets the streak to zero.

#### Auto-Ceiling

The ceiling is two-tiered based on phase:

```
Phase 1:  if rate > peakDeliveryRate × 4.0: rate = peakDeliveryRate × 4.0
Phase 2:  if rate > peakDeliveryRate × 1.5: rate = peakDeliveryRate × 1.5
```

**Phase 1 (4× — runaway prevention):** During the multiplicative probe, delivery-rate measurements lag the send rate because the sender increases 25% per heartbeat and the receiver's measurement window has not stabilised. For example, at a 7.63 MB/s send rate the receiver may report only 4.19 MB/s delivery. A tight multiplier like 1.5× would fire immediately, giving a 6.28 MB/s ceiling below the current rate and locking the sender at ~5.68 MB/s for the entire transfer on a 110 MB/s Gigabit link. The generous 4× multiplier prevents this while still bounding the exponential: on a clean link where FEC absorbs all drops (LossRate remains 0% throughout), Phase 2 is never entered and without any Phase 1 ceiling the target rate grows without bound (observed: 345 trillion MB/s). With 4×, the target is capped at ~400 MB/s on a Gigabit LAN — effectively the same as nodelay (pacing is disabled at that rate anyway), but without the absurd log output.

**Phase 2 (1.5× — avoidance bound):** Once Phase 2 is entered, the delivery rate was measured near actual link capacity — the loss event that triggered the transition occurred at or near the ceiling — so 1.5× is a tight and reliable upper bound for the additive probing that follows.

**Implementation Note:** Earlier revisions gated the ceiling on a fixed warmup period (originally 3 heartbeats, later extended to 5) to avoid locking in a low `peakRate` from cold-start measurements. This approach was superseded by Phase 2 gating, which is more principled: the ceiling is irrelevant during Phase 1 (the sender is still discovering the link ceiling) and correct during Phase 2 (the delivery rate was measured near capacity). The warmup constant no longer exists in the implementation.
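
The two-tier clamp is small enough to show directly. The sketch below uses a hypothetical `applyCeiling` helper and replays the ramp-up figures quoted above (7.63 MB/s send rate, 4.19 MB/s measured peak) to show why a 1.5× multiplier is too tight in Phase 1.

```go
package main

import "fmt"

// applyCeiling clamps the target rate to the phase-dependent auto-ceiling.
func applyCeiling(rate, peakDeliveryRate float64, phase2 bool) float64 {
	mult := 4.0 // Phase 1: loose runaway brake
	if phase2 {
		mult = 1.5 // Phase 2: tight avoidance bound
	}
	if ceiling := peakDeliveryRate * mult; rate > ceiling {
		return ceiling
	}
	return rate
}

func main() {
	// Figures from the ramp-up example: 7.63 MB/s send rate, 4.19 MB/s peak.
	fmt.Printf("phase 1: %.3f MB/s\n", applyCeiling(7.63, 4.19, false))
	fmt.Printf("phase 2: %.3f MB/s\n", applyCeiling(7.63, 4.19, true))
}
```

With the 4× multiplier the 7.63 MB/s target passes through untouched (the ceiling is 16.76 MB/s); with 1.5× it would be cut to roughly 6.28 MB/s, below the rate the sender had already reached.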

### D. Deficit-Accumulator Pacing

The sender must convert the target rate (bytes/sec) into inter-packet timing. The naive approach — computing a per-packet interval and sleeping for that duration — fails in practice because OS timer granularity (~1ms on Windows, ~100µs on Linux) and language-runtime preemption (e.g., Go's asynchronous goroutine preemption, or signal-based preemption in C with certain threading models) make sub-millisecond sleeps unreliable.

Instead, the sender uses a **deficit accumulator**:

- A `tokens` balance (in bytes) accrues credit at the target rate over elapsed wall-clock time. Elapsed time must use a monotonic clock source (C: `clock_gettime(CLOCK_MONOTONIC)`; Go: `time.Now()` which uses the monotonic component internally).

- Each packet send debits `tokens` by the packet size.

- Tokens are capped at a **2ms burst budget**: `max_tokens = rate_bytes_per_sec × 0.002`. This prevents idle periods from banking enough credit to burst a large backlog of packets instantly.

- When the deficit grows large enough that the corresponding sleep would be ≥ 1ms, the sender sleeps and resets the deficit to zero. Go: `time.Sleep()`. C: `clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &target_ts, NULL)` — the `TIMER_ABSTIME` flag prevents drift from accumulating across consecutive sleeps.

- Deficits smaller than 1ms are carried forward — they accumulate across multiple packets and eventually trigger a single coarser sleep.

This produces the correct long-term average rate without relying on sub-millisecond timer precision.

**C Implementation Note:** Linux `clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME)` provides ~50–100µs precision on default kernels, significantly better than Go's ~1ms floor. For sub-100µs precision (useful at 10 GbE+ rates), run the sender thread under `SCHED_FIFO` real-time scheduling via `sched_setscheduler()`. The deficit accumulator is still the correct architecture — even with precise timers, per-packet nanosleep at 67,000 pps would burn CPU on syscall overhead. The accumulator batches the timing debt and issues one sleep per deficit threshold crossing, regardless of timer precision.

**Design Rationale:** The v2.0 spec described a "dynamic Token Bucket" with "microsecond interval between packet dispatches." The Go prototype initially used `runtime.Gosched()`-based busy-wait spin loops for sub-ms pacing. Go 1.14 introduced asynchronous goroutine preemption, which can interrupt goroutines even inside tight loops, causing the spin to overshoot to ~1ms per packet. At 1400 bytes per 1ms, or 1.4 MB/s, this created an artificial throughput ceiling regardless of the target rate. The deficit accumulator was developed to solve this without platform-specific timer hacks.

### E. NACK-Driven Retransmission

When the sender receives a NACK array in a Heartbeat, it queues the identified packets for retransmission. Retransmitted packets carry their original `SequenceNum` and `BlockGroup`.

Retransmissions are **interleaved with forward progress**: the sender processes at most **3 NACKed packets per send-loop iteration**, then sends the next new data packet. This prevents NACK storms (e.g., 169 NACKs on a satellite link) from monopolizing bandwidth and stalling `seqNum` advancement.

The pending NACK set must be maintained as a **deduplicated set** (hash set or bitset), not a FIFO queue. Each heartbeat may report the same sequence numbers as the previous one (the receiver keeps NACKing until the packet arrives). If the same sequence is appended to a plain list on every heartbeat, the retransmit queue grows without bound. A set ensures each sequence is queued at most once regardless of how many heartbeats repeat it. If a NACK arrives for a sequence that has already been pruned from the sender's chunk cache (because `HighestContiguous` advanced past it via FEC recovery), the sender silently skips it — the receiver has already recovered the packet and the NACK is stale.

**Implementation Note:** The v2.0 spec stated that retransmissions are "injected ahead of new data." In practice, draining the entire NACK queue before sending any new data caused a fatal stall on Starlink: at 0.38 MB/s with 169 NACKs, each send-loop iteration spent ~0.62s on retransmits, and `seqNum` never advanced. The 3-per-iteration cap ensures the transfer always makes forward progress even under heavy loss.
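
A deduplicated pending set with the 3-per-iteration cap might look like the following Go sketch. The names `nackSet`, `addAll`, and `retransmitSome` are hypothetical, and the chunk cache is modelled as a plain map for brevity. In this sketch a stale (already pruned) sequence still counts against the cap, which is an assumption.

```go
package main

import "fmt"

const maxRetransmitsPerIteration = 3

// nackSet is a deduplicated set of sequence numbers awaiting retransmission.
type nackSet map[uint64]struct{}

// addAll merges a heartbeat's NACK array; repeats are absorbed by the set.
func (s nackSet) addAll(seqs []uint64) {
	for _, seq := range seqs {
		s[seq] = struct{}{}
	}
}

// retransmitSome services at most maxRetransmitsPerIteration pending NACKs.
// Sequences missing from cache were already pruned, so they are dropped
// silently. It returns the number of packets handed to send.
func (s nackSet) retransmitSome(cache map[uint64][]byte, send func([]byte)) int {
	sent, taken := 0, 0
	for seq := range s {
		if taken == maxRetransmitsPerIteration {
			break
		}
		taken++
		delete(s, seq) // deleting during range is allowed in Go
		if pkt, ok := cache[seq]; ok {
			send(pkt)
			sent++
		}
	}
	return sent
}

func main() {
	pending := nackSet{}
	cache := map[uint64][]byte{7: {0x07}, 8: {0x08}, 9: {0x09}, 12: {0x0c}}
	for hb := 0; hb < 3; hb++ {
		pending.addAll([]uint64{7, 8, 9, 12}) // the receiver repeats the same NACKs
	}
	fmt.Println("pending:", len(pending)) // pending: 4
	n := pending.retransmitSome(cache, func([]byte) {})
	fmt.Println("retransmitted:", n, "still pending:", len(pending)) // retransmitted: 3 still pending: 1
}
```

The send loop would call `retransmitSome` once per iteration and then send the next new data packet, so forward progress continues regardless of how many NACKs are pending.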

## 7. Session Timeout and Failure Recovery

### A. Receiver Inactivity Timeout

If the receiver does not receive any packets (data, parity, or retransmission) for a period of **5 consecutive expected heartbeat intervals**, with a **minimum floor of 5 seconds**, it declares the session dead:

- The ring buffer is freed.

- Any partially written file is deleted from disk (or retained with a `.partial` suffix if configured for resume support in a future version).

- The SessionID is moved to the reserved pool for the standard 10-second stale protection window.

- A warning is logged with the last received `SequenceNum` and `HighestContiguous` for diagnostics.

**Implementation Note:** The 5-second floor was added after integration testing showed that at the lowest heartbeat tier (100ms), the computed timeout of 500ms was too aggressive — the natural gap between the sender finishing its data blast and NACK retransmissions arriving through a lossy proxy would trigger a false timeout. The floor gives the NACK retransmission cycle time to recover.
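
The timeout rule reduces to a `max` of two durations. A minimal Go sketch, with `inactivityTimeout` as a hypothetical helper name and the 2-second interval as an invented example tier:

```go
package main

import (
	"fmt"
	"time"
)

const inactivityFloor = 5 * time.Second

// inactivityTimeout returns max(5 × heartbeat interval, 5 seconds).
func inactivityTimeout(heartbeatInterval time.Duration) time.Duration {
	if t := 5 * heartbeatInterval; t > inactivityFloor {
		return t
	}
	return inactivityFloor
}

func main() {
	fmt.Println(inactivityTimeout(100 * time.Millisecond)) // 5s: the floor applies, not 500ms
	fmt.Println(inactivityTimeout(2 * time.Second))        // 10s: a hypothetical slow tier
}
```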

### B. Sender Inactivity Timeout

The sender monitors receiver liveness through two complementary mechanisms:

#### Hard Abort — Heartbeat Staleness (Data Phase)

During active data transmission, the sender tracks the timestamp of the most recent Heartbeat received. If no Heartbeat arrives for **3 seconds** (`SenderHeartbeatTimeout`), the sender immediately aborts the session and frees all resources. This covers the common case of a client disconnecting mid-transfer (e.g. Ctrl+C) and ensures the serve daemon's busy flag is released within seconds rather than minutes.

The client may also send a `SESSION_REJECT` with reason `CLIENT_DISCONNECT (0x08)` to signal a graceful shutdown; the sender releases resources immediately on receipt without waiting for the timeout.

#### Probe State — Intermittent Loss (Pre-completion)

If the sender does not receive a Heartbeat for **5 consecutive expected heartbeat intervals**, it assumes the receiver has failed or the return path is temporarily broken:

- The sender pauses transmission and enters a **Probe state**, sending a single DATA packet every 500ms.

- If no Heartbeat arrives within **10 seconds** of entering Probe state, the sender tears down the session and frees resources.

- If a Heartbeat does arrive during the Probe state, normal transmission resumes from the last unacknowledged sequence number.

## 8. Graceful Teardown (The Final Handshake)

To prevent "tail drop" issues where the final packets are lost and the connection deadlocks, the protocol implements a synchronized teardown with timeout-driven cleanup on both sides.

### A. Normal Teardown Sequence

- **The EOF Signal:** The sender sets the `EndOfFile` flag (`0x01`) on the final data packet and the final parity packet of the tail block. It stops reading new data but keeps the socket open, listening for Heartbeats. **Socket ownership must be exclusive before entering teardown:** Go implementations must stop the heartbeat listener goroutine to prevent it from consuming packets the teardown loop needs (see Lesson F). C implementations using a single-threaded `epoll` event loop have exclusive socket access by construction — no action is needed.

- **The Verification (Pipelined Hashing):** The receiver continuously updates a streaming `xxHash64` as contiguous blocks are flushed from the ring buffer to disk. Upon receiving the EOF-flagged packets and verifying the completed hash against the `SESSION_REQ` metadata, it sends a `TRANSFER_COMPLETE` (Type `0x05`) packet.

- **The Sender Linger (Zombie State):** The sender receives `TRANSFER_COMPLETE`, responds with `ACK_CLOSE` (Type `0x06`), frees heavy memory allocations (including the sliding window ring buffer), and enters a **3-second Linger state**. Duplicate `TRANSFER_COMPLETE` packets arriving in this window are answered with a repeat `ACK_CLOSE`.

- **The Receiver Linger:** After sending `TRANSFER_COMPLETE`, the receiver enters its own **3-second Linger state**. If `ACK_CLOSE` is not received within this window, the receiver retransmits `TRANSFER_COMPLETE` up to **3 times** at 1-second intervals. If no `ACK_CLOSE` is received after all retries, the receiver considers the transfer successful (the file hash was verified) and performs a unilateral teardown.

- **The Purge:** After the Linger timeout expires on either side, the SessionID is moved to the reserved pool (10-second stale protection), and all local file descriptors are closed.

### B. Teardown NACK Handling

During the teardown wait (after all data is sent, before `TRANSFER_COMPLETE` arrives), the sender continues to process Heartbeat packets synchronously. If a Heartbeat contains NACKs, the sender retransmits the requested packets from the sliding window ring buffer and resets the read deadline. This ensures the receiver can complete even if some late packets were lost.

Teardown retransmits are **paced through the token bucket** at the congestion controller's current rate. Without pacing, a backlog of queued Heartbeats (e.g., after a brief receive gap) can be drained all at once, causing the sender to fire hundreds of retransmit packets in tens of milliseconds — a burst that overwhelms the same congested link that caused the NACKs in the first place. See Lesson J for the observed failure case.

#### RTT-Aware NACK Cooldown

**Problem:** The teardown retransmit loop had no cooldown. On a 50ms-RTT path with a 50ms heartbeat interval, every in-flight heartbeat triggered a redundant retransmit of the same lost packets. The retransmit flood caused fresh congestion, which caused more NACKs, creating a self-reinforcing spiral. Observed: 59,908 reported NACKs for approximately 780 actual losses.

**Fix:** A `nackCooldown map[uint64]time.Time` gates each sequence number to at most one retransmit per `RTT × 1.25` (RTT plus 25% margin). The map is seeded with all NACKs outstanding at the moment the main send loop ends. On each teardown heartbeat, a sequence is only retransmitted if its cooldown timestamp has elapsed; otherwise it is silently skipped until the next eligible window.

#### Tail-Drop Deadlock Prevention

**Problem:** The receiver's NACK scan window is bounded by `HighestReceived`. If the last packets of the file are dropped, `HighestReceived` never advances to the end of the file, so the receiver's NACK list is empty. The sender sees 0 NACKs, sends nothing, and the receiver hits its 5-second inactivity timeout — a deadlock neither side can break without external intervention.

**Fix:** In the teardown loop, if a heartbeat arrives with `NACKCount == 0` but `hb.HighestContiguous < totalChunks−1`, the sender proactively computes up to 167 missing tail sequences (from `HighestContiguous+1` through `totalChunks−1`) and injects them into the retransmit pipeline. These injected sequences flow through the NACK cooldown gate exactly like receiver-reported NACKs, preventing the same tail sequences from being re-injected on every heartbeat.
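
A sketch of the tail computation in Go, assuming sequence numbers are zero-based `uint64` values and that at least chunk 0 has been received (the spec does not say how an empty `HighestContiguous` is encoded). `missingTail` is a hypothetical helper name.

```go
package main

import "fmt"

const maxTailInject = 167 // per-heartbeat cap quoted in the text

// missingTail returns the tail sequences the sender should inject when a
// heartbeat carries no NACKs but the contiguous frontier is short of the end.
func missingTail(nackCount int, highestContiguous, totalChunks uint64) []uint64 {
	if nackCount != 0 || totalChunks == 0 || highestContiguous >= totalChunks-1 {
		return nil
	}
	first := highestContiguous + 1
	n := totalChunks - first
	if n > maxTailInject {
		n = maxTailInject
	}
	seqs := make([]uint64, 0, n)
	for seq := first; seq < first+n; seq++ {
		seqs = append(seqs, seq)
	}
	return seqs
}

func main() {
	// The last four packets of a 1000-chunk file were dropped.
	fmt.Println(missingTail(0, 995, 1000)) // [996 997 998 999]
}
```

The returned sequences would then pass through the same cooldown gate as receiver-reported NACKs before being retransmitted.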

#### Teardown Micro-Burst Prevention

**Problem:** At high transfer speeds (100+ MB/s) the TokenBucket's 2ms burst allowance is approximately 200 KB. A full 167-packet retransmit batch is approximately 232 KB at `MaxPayload = 1368 bytes` — it fires as a near-simultaneous burst, flooding the OS UDP socket buffer and the serve daemon's 256-slot receive channel, causing most retransmits to be silently dropped.

**Fix:** Teardown retransmits are chunked into **batches of 10 packets** with a **2ms sleep** between batches. A full 167-packet batch is spread over approximately 34ms, which is imperceptible to the user and shorter than a heartbeat interval. Each 10-packet burst is about 14 KB, far below the capacity of the serve daemon's 256-slot receive channel.
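
The batching loop is straightforward. In this sketch `sendInBatches` is a hypothetical helper and the actual socket write is abstracted behind a `send` callback.

```go
package main

import (
	"fmt"
	"time"
)

// sendInBatches sends pkts in groups of batchSize, sleeping gap between
// groups. It returns the number of sleeps taken.
func sendInBatches(pkts [][]byte, batchSize int, gap time.Duration, send func([]byte)) int {
	if batchSize <= 0 {
		batchSize = 1
	}
	sleeps := 0
	for i := 0; i < len(pkts); i += batchSize {
		end := i + batchSize
		if end > len(pkts) {
			end = len(pkts)
		}
		for _, p := range pkts[i:end] {
			send(p)
		}
		if end < len(pkts) { // no sleep after the final batch
			time.Sleep(gap)
			sleeps++
		}
	}
	return sleeps
}

func main() {
	pkts := make([][]byte, 167)
	sent := 0
	sleeps := sendInBatches(pkts, 10, 2*time.Millisecond, func([]byte) { sent++ })
	fmt.Println(sent, "packets,", sleeps, "sleeps") // 167 packets, 16 sleeps
}
```

167 packets form 17 batches separated by 16 sleeps, so the sleeps alone account for 32ms; send time brings the total close to the 34ms figure above.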

### C. Hash Mismatch

If the receiver completes all data reception but the `xxHash64` does not match the expected value from `SESSION_REQ`:

- The receiver sends a `SESSION_REJECT` (Type `0x04`) with a reason code indicating hash failure.

- The partially written file is deleted.

- The sender receives the rejection, tears down the session, and may optionally retry the entire transfer.

### Progress Bar — Repair State

Once the main send loop completes and the sender enters the teardown loop (§8B), the progress bar changes from the normal `100% | 40.0 MB/s | NACKs: N` display to `100% | Repairing... | NACKs: N`. This tells the user the network is actively recovering dropped tail packets rather than hanging. The repair state persists until `TRANSFER_COMPLETE` is received or the teardown timeout expires.

## 9. Development Phases

1. **Phase 1 (Go Prototype) — COMPLETE:** Validated FEC mathematics (GF(2⁸) Reed-Solomon), Heartbeat state machine, adaptive FEC tuning, wire-speed calibration burst, loss-driven congestion control with deficit-accumulator pacing, NACK retransmission with forward-progress interleaving. Proved 100% file integrity under simulated packet loss at 0%, 1%, 5%, 10%, 15%. Tested on LAN (1 Gbps Ethernet, 41 MB/s throughput) and WAN (Starlink satellite, variable latency/loss). 86 unit tests across protocol, sender, and receiver packages. Phase 1 throughput bottleneck identified: per-packet `conn.Write()` syscall overhead limits Go to ~30–41 MB/s on LAN versus 93 MB/s for FTP (which uses kernel-level TCP segmentation with a single large `write()`).

2. **Phase 2 (C Productionization):** Translate the validated logic into C. Key optimizations:

   - **Syscall batching:** Use `sendmmsg()` (Linux) or `GSO` (Generic Segmentation Offload) to submit multiple UDP packets per kernel transition. At 1376 bytes per packet and 93 MB/s target, the sender must dispatch ~67,000 packets/sec. Per-packet `sendto()` incurs ~15µs of context-switch overhead each, consuming ~1 second of CPU per second of transfer. `sendmmsg()` with batches of 16–64 packets amortizes this to ~1,000–4,000 syscalls/sec. On the receiver side, `recvmmsg()` provides the same benefit.

   - **Zero-copy receive:** `mmap()`-backed ring buffer with `PACKET_RX_RING`/`AF_XDP` to avoid kernel-to-userspace copy on receive.

   - **SIMD Reed-Solomon:** Intel ISA-L or hand-rolled AVX2/NEON for FEC encode/decode. The Go prototype achieves ~60 MB/s encode and ~100 MB/s reconstruct; SIMD should reach multi-GB/s.

   - **io_uring:** Asynchronous disk I/O for the receiver's write path, eliminating the flush thread.

**Phase 1 Throughput Analysis:** FTP achieves 93 MB/s on the same LAN because TCP's `write()` pushes megabytes at once — the kernel handles segmentation into ~1500-byte frames internally. HP-UDP's per-packet `conn.Write()` makes ~67,000 syscalls/sec at full speed, each costing a user-kernel context switch. This is not a fundamental protocol limitation — it's a syscall-overhead problem that `sendmmsg()` batching in Phase 2 will eliminate.
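
The syscall arithmetic behind this analysis can be reproduced with a few lines of Go. The sketch assumes 1 MB = 10⁶ bytes and uses the ~15µs per-syscall cost quoted in this document; `syscallsPerSecond` is a hypothetical helper.

```go
package main

import "fmt"

// syscallsPerSecond returns the syscall rate needed to sustain rateBytes
// (bytes/sec) with packets of pktBytes, submitting batch packets per call.
func syscallsPerSecond(rateBytes, pktBytes float64, batch int) float64 {
	return rateBytes / pktBytes / float64(batch)
}

func main() {
	const rate, pkt, costSec = 93e6, 1376.0, 15e-6
	for _, batch := range []int{1, 16, 64} {
		n := syscallsPerSecond(rate, pkt, batch)
		fmt.Printf("batch=%2d  syscalls/s=%6.0f  cpu=%.2f s per s\n", batch, n, n*costSec)
	}
}
```

With a batch size of 1 the result is roughly 67,600 syscalls per second and about one CPU-second of overhead per second of transfer; batches of 16 and 64 reduce that to roughly 4,200 and 1,100 syscalls per second.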

## 10. Lessons Learned (Phase 1)

The following empirical findings emerged during Phase 1 implementation and testing. They are documented here to guide the Phase 2 C port and future protocol revisions.

### A. Delivery-Rate-Ratio CC Is Fundamentally Broken

The v2.0 algorithm increased the rate only when `EffectiveRate ≥ 0.95 × SendRate`. Since delivery rate is bounded by send rate and measurement windows never align perfectly, this ratio consistently falls below 0.95 even on a lossless link. The sender spirals to the rate floor. Loss rate is the correct primary signal.

### B. Sub-Millisecond Pacing Is Unreliable in Userspace

Both `time.Sleep(<1ms)` and busy-wait spin loops fail for sub-millisecond pacing on Windows (minimum ~1ms granularity) and on any platform using Go 1.14+ (asynchronous goroutine preemption interrupts tight loops at ~1ms intervals). The deficit accumulator sidesteps this entirely by sleeping only when the accumulated deficit justifies a ≥1ms sleep. The C port should use `clock_nanosleep()` or similar, but should still avoid relying on sub-millisecond precision for correctness.

### C. Calibration Burst Must Not Flood the Link

A 50 MB/s starting rate on a Starlink connection (~10 Mbps effective uplink) caused massive packet loss during the first 100ms, which poisoned the `peakRate` measurement and locked the auto-ceiling at ~0.38 MB/s for the entire transfer. The starting rate must be conservative enough for the worst expected link (2 MB/s default), while the calibration burst itself runs at wire speed to discover the actual capacity.

### D. NACK Storms Stall Forward Progress

On a satellite link with ~30ms RTT and 5% loss, each heartbeat reported ~169 NACKs. Processing all NACKs before each new data packet caused the send loop to spend its entire bandwidth on retransmissions, preventing `seqNum` from advancing. Capping retransmissions at 3 per iteration restored forward progress.

### E. Early Delivery-Rate Measurements Are Unreliable

The first few heartbeats arrive during or immediately after the calibration burst, when the receiver is still allocating buffers and the network path hasn't stabilized. More broadly, during Phase 1 ramp-up the receiver's measurement window hasn't caught up to the sender's current rate — at a 7.63 MB/s send rate the receiver may only report 4.19 MB/s delivery because the sender had only been at that rate for one 100ms heartbeat interval. Any ceiling derived from these measurements will be artificially low. The correct solution is to gate the auto-ceiling on Phase 2 entry rather than a fixed warmup period: by the time Phase 2 is entered, the sender has been near the link ceiling long enough for delivery measurements to be meaningful.

### F. Socket Ownership During Teardown

The sender's heartbeat listener goroutine and the teardown synchronous read loop compete for the same socket. If the goroutine is still running when the sender enters teardown, it consumes packets (including `TRANSFER_COMPLETE`) that the teardown loop needs. The goroutine must be stopped before entering teardown, and any queued NACKs must be drained synchronously.

### G. Per-Packet Syscall Overhead Is the Phase 1 Bottleneck

FTP achieves 93 MB/s on the same Gigabit LAN where HP-UDP reaches ~30–41 MB/s. The difference is not the protocol or the language — it's the syscall pattern. FTP writes large buffers to a TCP socket; the kernel segments them into packets internally. HP-UDP calls `conn.Write()` for every 1376-byte packet, requiring ~67,000 user-kernel context switches per second at full speed. Each syscall costs ~15µs of overhead, consuming nearly 100% of available CPU time at target throughput. The Phase 2 C port must use `sendmmsg()`/`recvmmsg()` or equivalent batching to amortize this cost across 16–64 packets per syscall.

### H. Go-Specific Memory Pressure

Three allocation patterns in the Go prototype created significant memory and CPU overhead that are not visible in the protocol design but directly impacted measured throughput. All three were fixed in Phase 1 and their equivalents must be avoided in the Phase 2 C port.

1. **FEC Encoder Construction Cost.** `NewRSEncoder(k, m)` builds a `k×k` Vandermonde matrix and inverts it using O(k³) GF(2⁸) operations. For the default block size of `k=100` this is ~2 million GF operations, measured at ~4 ms per call. With ~1,720 FEC blocks in a 237 MB transfer, constructing a fresh encoder for each block costs ~6.9 seconds of CPU time — roughly equal to the entire transfer duration at 35 MB/s. The fix is to cache the encoder keyed on `(dataShards, parityShards)` and reuse it across blocks. The matrix is deterministic for a given `(k, m)` pair and encoding only reads from it (no mutation), so the cached instance is safe for concurrent use. The Phase 2 C port must pre-build encoder matrices at session start and reuse them.

2. **FEC Shard Buffer Allocations.** Each data packet that enters the FEC encoder requires a `MaxPayload`-sized buffer padded to equal length before RS encoding. At the default block size of 100 and a 35 MB/s transfer rate, this produces ~25,000 allocations per second, generating ~470 MB of heap churn per 237 MB transfer and placing the garbage collector on the critical path. The fix is a `sync.Pool` of pre-allocated `MaxPayload`-sized buffers checked out at encode time and returned immediately after the parity computation completes. The Phase 2 C port should maintain a fixed pool of shard-sized stack or heap buffers reused across blocks.

3. **Unbounded Retransmit Cache Growth.** The original sender cached every transmitted chunk in a `map[uint64][]byte` to service NACK retransmissions. Without eviction, this map retained all ~172,000 chunks for a 237 MB transfer, consuming ~470 MB of heap memory for the entire session duration. **Fixed in Phase 1:** replaced with a bounded `SlidingWindow` ring buffer (50,000 slots, ~68 MB peak). Entries are evicted when `HighestContiguous` advances (received from each heartbeat), since the receiver has already confirmed contiguous receipt up to that point and will never NACK those sequences. When the window is full (all 50,000 un-acknowledged slots occupied), the sender pauses sending new packets until the receiver's `HighestContiguous` advances and frees slots — providing memory-bounded backpressure. The Phase 2 C port should use the same ring-buffer pattern with a **power-of-2 slot count** (65,536 recommended, ~89 MB peak) so that index wrapping uses a bitmask (`idx & 0xFFFF`) instead of a modulo operation — eliminating a division on the hot path for every packet sent and every `HighestContiguous` advance.
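
The power-of-2 ring buffer with bitmask indexing can be sketched in Go as follows. This is an illustration of the pattern, not the prototype's `SlidingWindow`: the type, its methods, and the choice to assign sequence numbers inside `put` are all assumptions.

```go
package main

import "fmt"

const (
	windowSlots = 1 << 16         // 65,536 slots, a power of two
	slotMask    = windowSlots - 1 // 0xFFFF
)

// slidingWindow retains sent packets until the receiver confirms them.
type slidingWindow struct {
	slots [windowSlots][]byte
	base  uint64 // lowest sequence still retained
	next  uint64 // next sequence number to assign
}

// put stores pkt under the next sequence number. It returns false when the
// window is full and the sender must pause (backpressure).
func (w *slidingWindow) put(pkt []byte) (uint64, bool) {
	if w.next-w.base == windowSlots {
		return 0, false
	}
	seq := w.next
	w.slots[seq&slotMask] = pkt // bitmask instead of modulo
	w.next++
	return seq, true
}

// get returns the cached packet for seq if it is still inside the window.
func (w *slidingWindow) get(seq uint64) ([]byte, bool) {
	if seq < w.base || seq >= w.next {
		return nil, false
	}
	return w.slots[seq&slotMask], true
}

// advance evicts every sequence up to and including highestContiguous.
func (w *slidingWindow) advance(highestContiguous uint64) {
	for w.base <= highestContiguous && w.base < w.next {
		w.slots[w.base&slotMask] = nil
		w.base++
	}
}

func main() {
	w := &slidingWindow{}
	for i := 0; i < windowSlots; i++ {
		w.put([]byte{byte(i)})
	}
	_, ok := w.put([]byte{0})
	fmt.Println("accepted when full:", ok) // accepted when full: false
	w.advance(99)                          // receiver confirmed sequences 0..99
	seq, ok := w.put([]byte{1})
	fmt.Println(seq, seq&slotMask, ok) // 65536 0 true
}
```

A `get` that returns `false` for a sequence below `base` corresponds to the stale-NACK case: the receiver has already confirmed that packet, so the sender skips it.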

### I. Auto-Ceiling Overshoot Causes Persistent NACK Storms

With the original auto-ceiling multiplier of 4×, the target rate on a Gigabit LAN reached 396 MB/s (4× a measured peak delivery of ~99 MB/s). At this target the token bucket's pacing budget was so large that no sleep was ever triggered — the sender fired packets as fast as the CPU allowed. The OS NIC queue and the receiver's socket buffer were overwhelmed, systematically dropping the same ~167 packets per heartbeat interval. Because these drops were clustered within FEC blocks (the sender had burst through several consecutive blocks before the receiver could drain them), the parity overhead was insufficient to recover them. The result: every heartbeat for the entire transfer carried the same 167-entry NACK list, the sender retransmitted them repeatedly, and the teardown phase required ~3.5 seconds of retransmit cycles before the receiver could recover all blocks and issue `TRANSFER_COMPLETE`.

Two fixes address this together:

1. **Two-tier ceiling (Phase 1: 4×, Phase 2: 1.5×).** The fix went through two iterations. Applying a uniform 1.5× ceiling during Phase 1 ramp-up caused the opposite problem: with a 4.19 MB/s peak delivery measurement at heartbeat 6, a 1.5× ceiling produced a 6.28 MB/s cap — *below* the current send rate — locking the sender at ~5.68 MB/s for a 42-second transfer on a 110 MB/s Gigabit link. Making the ceiling Phase 2-only solved the ramp-up problem but exposed a new failure: on a clean Gigabit LAN where FEC absorbs all OS socket buffer drops, LossRate remains 0% throughout, Phase 2 is never entered, and without any ceiling the multiplicative probe compounded to 345 trillion MB/s (observed). The two-tier ceiling resolves both: Phase 1 uses 4× as a loose runaway brake (capping at ~400 MB/s on a Gigabit LAN, which effectively disables pacing without absurd log output), and Phase 2 uses 1.5× as a tight avoidance bound (by Phase 2 entry the delivery rate was measured near actual link capacity, so 1.5× is meaningful).

2. **NACK deduplication.** The sender's retransmit queue was a plain slice, and each heartbeat appended the full NACK list again (167 entries roughly every 25 ms), so the queue grew without bound. The fix is to use a set (hash map of pending sequence numbers): a sequence already waiting for retransmission is not added a second time. This prevents the queue from accumulating thousands of stale duplicate entries and ensures the 3-retransmits-per-iteration cap is spent on distinct missing packets. The Phase 2 C port should maintain a fixed-size bitset or hash set of pending NACK sequences rather than a FIFO queue.

### J. OS Socket Buffer Drops Are Invisible to the Loss Rate Signal (LFN)

On a real Long Fat Network (1 GB file sent over a ~20 MB/s WAN link), Phase 1 multiplicative probing ramped the sender from 2 MB/s to 71 MB/s (the 4× Phase 1 ceiling) in under 5 seconds. The link could only sustain ~20 MB/s. The excess was absorbed by OS socket buffers, which then overflowed. From that point, packets were dropped at the OS layer before reaching the receiver application. The observed effect:

- **LossRate stayed 0.00% for the entire 3-minute transfer.** The receiver's loss counter is FEC-failure-based: `packetsLost / (packetsReceived + packetsLost)`. When the OS drops all packets in a heartbeat window, `packetsReceived = 0` and `packetsLost = 0`, so `totalPackets = 0` and the guard clause produces 0% loss. The CC never detected any congestion.

- **167 NACKs persisted for the entire transfer.** The early burst blew through ~800 FEC blocks before any heartbeat arrived. With the OS dropping most packets, those blocks could not be FEC-recovered. The sender retransmitted them on every heartbeat, but each retransmit went out at the 71 MB/s ceiling rate, flooding the link again and preventing recovery.

- **Teardown never completed.** The sender sent all data but the receiver could never verify the hash because the 167 missing sequences were never recovered. After 3.5 minutes the sender timed out waiting for TRANSFER_COMPLETE.
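
The blind spot in the first bullet is easy to reproduce in isolation. The sketch below follows the quoted formula and expresses the result in basis points, as Appendix A does; the function name `lossRateBP` is hypothetical.

```go
package main

// lossRateBP returns the heartbeat-window loss rate in basis points
// (100 bp = 1%), using packetsLost / (packetsReceived + packetsLost).
func lossRateBP(packetsReceived, packetsLost uint64) uint64 {
	total := packetsReceived + packetsLost
	if total == 0 {
		// Guard clause against division by zero. This is the blind spot:
		// a window in which the OS dropped every packet looks exactly
		// like a loss-free window.
		return 0
	}
	return packetsLost * 10000 / total
}
```

A window with 95 packets received and 5 lost reports 500 bp, while a window in which nothing reached the application reports 0 bp, which is why a separate delivery-based signal is needed.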

Additionally, a secondary failure compounded the problem during teardown: a 7-second gap in heartbeat reception caused a backlog to accumulate. When heartbeats resumed, the sender drained the entire queue at once — 39 calls to the retransmit function in 45ms, each firing all 167 packets at wire speed (~199 MB/s burst on a 20 MB/s link).

Two fixes address this:

1. **Delivery-collapse guard.** Before evaluating the loss-rate thresholds, `OnHeartbeat` checks `NACKCount > 0 AND E < S × 0.25`, where `E` is the smoothed delivery rate and `S` is the current send rate (threshold lowered from 0.5 in v4.0; see §6C). If both are true, the sender holds and permanently enters Phase 2. The Phase 2 ceiling (1.5× peak delivery) fires immediately, cutting the target from 71 MB/s to ~26 MB/s. With the rate near actual link capacity, the normal loss signals take over and back it down further. The `NACKCount` condition is critical: it prevents false holds on cold-start windows where delivery is transiently near zero but the link is healthy.

2. **Paced teardown retransmits.** The retransmit function now accepts the token bucket and calls `Pace()` for each packet. A backlog of 39 queued heartbeats retransmitting 167 packets each is spread over ~600ms at 15 MB/s rather than firing in 45ms, giving the receiver time to process them and advance `HighestContiguous`.
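
The guard and the two-tier ceiling it hands off to can be sketched as two small functions. This is an illustration under assumptions: the function and constant names are hypothetical, rates are plain `float64` values in MB/s, and the multipliers are the Appendix A defaults.

```go
package main

const (
	collapseThreshold = 0.25 // guard fires when delivery < 25% of send rate
	phase1CeilingMult = 4.0  // loose runaway brake during ramp-up
	phase2CeilingMult = 1.5  // tight avoidance bound after Phase 2 entry
)

// deliveryCollapsed reports whether the delivery-collapse guard should fire:
// NACKs are outstanding and the delivery rate has fallen below a quarter of
// the send rate.
func deliveryCollapsed(nackCount int, deliveryRate, sendRate float64) bool {
	return nackCount > 0 && deliveryRate < sendRate*collapseThreshold
}

// rateCeiling returns the auto-ceiling for the current phase, given the
// peak delivery rate observed so far.
func rateCeiling(inPhase2 bool, peakDelivery float64) float64 {
	if inPhase2 {
		return peakDelivery * phase2CeilingMult
	}
	return peakDelivery * phase1CeilingMult
}
```

For a sender at 71 MB/s with 167 NACKs outstanding, any delivery rate below 17.75 MB/s trips the guard. A cold-start window with zero NACKs never does, regardless of how low delivery is.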

### K. Backpressure Starves NACK Retransmits (Window-Full Deadlock)

**Problem:** When the sender's sliding window fills (50,000 un-acknowledged slots), the main send loop must pause to avoid growing memory without bound. The original implementation used a bare `for sw.IsFull(seqNum) { time.Sleep(time.Millisecond) }` inner loop. While spinning there, the outer loop never returned to its NACK-processing step at the top. If the very first DATA packet was dropped, `HighestContiguous` on the receiver stayed at 0. Because `Advance(0)` is a no-op (guarding against the zero-value case), `hc` in the sliding window stayed at its sentinel value (`MaxUint64`). With that sentinel, `IsFull` returned `true` at exactly seq 50,000, which corresponds to ~68 MB, roughly 6% of a 1 GB file. The sender froze, NACKs queued in `nackPending` went unserviced, and the receiver detected no incoming packets for 5 seconds and declared an inactivity timeout.

**Fix:** Replace `for sw.IsFull(seqNum) { sleep }` with `if sw.IsFull(seqNum) { sleep; continue }`. The `continue` jumps back to the top of the outer `for seqNum < totalChunks` loop, so every backpressure iteration still drains `nackPending` and retransmits pending sequences before sleeping. The retransmit of the lost first packet allows `HighestContiguous` to advance on the next heartbeat, which unblocks `IsFull` and resumes the main send loop normally.
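
The control-flow change can be shown with a stripped-down loop. This is a sketch, not the prototype's send loop: `sendHooks` and its fields are hypothetical stand-ins for the real NACK drain, window check, socket write, and 1 ms sleep.

```go
package main

// sendHooks bundles the operations the loop depends on, so the control
// flow can be shown without sockets or timers.
type sendHooks struct {
	drainNACKs func()                // retransmit sequences in nackPending
	isFull     func(seq uint64) bool // sliding-window backpressure check
	send       func(seq uint64)      // transmit one DATA packet
	sleep      func()                // e.g. a 1 ms sleep in a real sender
}

// sendLoop applies the fix: when the window is full it sleeps and then
// continues the outer loop, so NACK servicing still runs on every
// backpressure iteration instead of being starved by an inner spin.
func sendLoop(totalChunks uint64, h sendHooks) {
	seqNum := uint64(0)
	for seqNum < totalChunks {
		h.drainNACKs() // always runs, even while the window is full
		if h.isFull(seqNum) {
			h.sleep()
			continue // back to the top: NACKs are serviced before re-checking
		}
		h.send(seqNum)
		seqNum++
	}
}
```

If `drainNACKs` is what eventually lets the window advance, as in the lost-first-packet case, this loop makes progress where the bare inner `for` would spin until the receiver times out.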

**Observed signature:** Transfer ramps to full speed (100+ MB/s), freezes at approximately 6% of a 1 GB file, receiver reports inactivity timeout 5 seconds later. NACKs counter shows 0 during the freeze (the NACK retransmit loop never ran). The freeze point scales exactly with window size: `windowSlots × MaxPayload / fileSize`.

### L. Cooldown Spin-Lock During Teardown

**Problem:** The §8B NACK cooldown gate (`RTT × 1.25`) can trap the teardown retransmit loop in an infinite spin. After the sender retransmits a batch of NACKed sequences, every entry in the NACK set has a fresh `lastRetransmitTime`. The retransmit loop condition (`nackSet.count > 0`) remains true because entries are only removed after successful retransmit, but the cooldown gate skips every entry with `continue`. Because the loop is unbounded (a `while`, not a `for` with an iteration cap), `continue` restarts the loop indefinitely. The clock variable `now` is only refreshed inside the batch-sleep path, which is never reached when `sent == 0`. The sender is permanently trapped in the retransmit function and never returns to the teardown main loop to check its deadline or read the next heartbeat.

**Fix:** Track consecutive cooldown skips. When the skip count reaches `nackSet.count`, all entries are in cooldown — break out of the retransmit loop and return to the teardown main loop. The next heartbeat will either confirm the sequences were received (removing them from the set) or re-NACK them (at which point their cooldown timestamps will have elapsed). This is the same class of bug as Lesson K: a tight inner loop that starves the outer control loop because its exit condition cannot be reached from within.
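
A minimal sketch of the skip-counting exit follows. The names (`nackEntry`, `retransmitPending`, `nackCooldownNs`) are hypothetical, timestamps are plain nanosecond integers, and the sketch assumes that a retransmitted entry leaves the local queue until a later heartbeat re-NACKs it.

```go
package main

// nackEntry is a hypothetical NACK-set record.
type nackEntry struct {
	seq              uint64
	lastRetransmitNs int64
}

// nackCooldownNs returns the cooldown gate, RTT × 1.25.
func nackCooldownNs(rttNs int64) int64 { return rttNs + rttNs/4 }

// retransmitPending retransmits every entry whose cooldown has elapsed and
// returns the entries still pending plus the number sent. It counts
// consecutive cooldown skips and stops once a full lap has been skipped,
// so it cannot spin when every remaining entry is cooling down.
func retransmitPending(queue []nackEntry, nowNs, cooldownNs int64, send func(seq uint64)) ([]nackEntry, int) {
	sent, skips := 0, 0
	for len(queue) > 0 {
		e := queue[0]
		queue = queue[1:]
		if nowNs-e.lastRetransmitNs < cooldownNs {
			queue = append(queue, e) // still cooling down: rotate to the back
			skips++
			if skips >= len(queue) {
				break // everything left is in cooldown: return to the caller
			}
			continue
		}
		send(e.seq)
		sent++
		skips = 0 // progress was made, so restart the lap count
	}
	return queue, sent
}
```

The caller, here the teardown main loop, regains control after at most one full lap of skips and can check its deadline or read the next heartbeat.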

**Observed signature:** Sender enters teardown, processes 1–2 heartbeats, retransmits NACKed packets, then hangs indefinitely. The receiver reports the session complete (all data received), but the sender process never exits. CPU usage spikes to 100% on one core. Ctrl-C is required to terminate.

### M. Receiver-Side Self-Completion

**Problem:** §8A defines a two-phase teardown where the sender sends `TRANSFER_COMPLETE` after the receiver's heartbeats confirm all data received. In the serve daemon architecture (§11), the sender's linger timeout fires a single `TRANSFER_COMPLETE` and immediately exits. Under WAN conditions (50ms RTT, 1–5% loss via `tc netem`), this single fire-and-forget control packet is frequently dropped. The receiver has all data (`HighestContiguous + 1 == TotalChunks`, `NACKCount == 0`) but remains stuck in the DATA state waiting for a `TRANSFER_COMPLETE` that will never arrive, eventually hitting the 5-second inactivity timeout.

**Fix:** The receiver must autonomously detect transfer completion. When constructing a heartbeat with `NACKCount == 0` and `HighestContiguous + 1 >= TotalChunks`, the receiver self-initiates teardown: it flushes and fsyncs the output file, sends `ACK_CLOSE`, and transitions to the linger state. This makes the receiver resilient to lost control packets and removes the dependency on a single `TRANSFER_COMPLETE` arriving reliably. The sender's `TRANSFER_COMPLETE` still accelerates the handshake when it does arrive, but is no longer required for correctness.
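
The completion test itself is a two-condition predicate, evaluated each time a heartbeat is built. The function name below is hypothetical; the conditions are the ones stated in the fix.

```go
package main

// shouldSelfComplete reports whether the receiver may self-initiate
// teardown: no outstanding NACKs and the contiguous prefix covers the
// whole file.
func shouldSelfComplete(nackCount int, highestContiguous, totalChunks uint64) bool {
	return nackCount == 0 && highestContiguous+1 >= totalChunks
}
```

Because `HighestContiguous` is 0 both before any packet arrives and after sequence 0 arrives, a real receiver would presumably also confirm that sequence 0 has been received before acting on this predicate for a single-chunk file. That extra check is an assumption of this note, not something the specification states.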

**Observed signature:** Transfer runs at full speed, sender exits normally, but the daemon session hangs for 5 seconds and then reports `timed out`. The daemon's debug log shows `pkts == TotalChunks` and `HighestContiguous + 1 == TotalChunks` — all data is present but the session never transitions to teardown.

### N. Serve Daemon Must Handle PARITY Packets

**Problem:** §11 specifies that the serve daemon runs the "normal receiver" flow for push transfers, but does not explicitly require PARITY packet handling. An implementation that dispatches DATA but ignores PARITY (e.g., `case PARITY: break`) will function on lossless links but degrade severely under WAN conditions. Without FEC recovery, every lost DATA packet requires a full NACK → retransmit round trip (`RTT × 1.25` cooldown + propagation). At 1–5% loss with 50ms RTT, this converts what should be a transparent FEC correction into a NACK storm that slows teardown convergence and can interact with Lesson L to cause hangs.

**Fix:** The serve daemon's packet dispatch must handle `PARITY` packets identically to the standalone receiver: maintain a per-session FEC block pool, track parity shard arrival, and attempt Reed-Solomon recovery after each new parity shard. Recovered data shards are written to the mmap, marked in the receive bitset, and advance `HighestContiguous` — exactly as if they had arrived as DATA packets. PARITY reception should also update the inactivity timer to prevent timeout during FEC-dominated tail phases where the only arriving packets are parity shards.

## 11. Serve Daemon — Bidirectional File Hub

The serve daemon is a persistent multi-session UDP server that manages a file directory and services both pull requests (clients fetch files) and push requests (clients deposit files). It listens on a **single UDP socket** and dispatches packets by `(src_addr, session_id)` tuple. Up to **`HPUDP_MAX_SESSIONS = 16` concurrent transfers** are supported. Additional requests beyond that limit receive `SERVER_BUSY (0x03)` and may retry. `LIST_REQ` is always answered regardless of the current session count.
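
A session table keyed by the `(src_addr, session_id)` tuple might look like the sketch below. The type and method names are hypothetical, the source address is kept as a string for brevity, and collision handling (reason code `0x01`) is left out.

```go
package main

const maxSessions = 16 // HPUDP_MAX_SESSIONS

// sessionKey identifies a transfer by (src_addr, session_id).
type sessionKey struct {
	srcAddr   string // source "ip:port" of the packet
	sessionID uint32
}

// session is a placeholder for per-transfer state.
type session struct {
	packets uint64
}

type sessionTable struct {
	slots map[sessionKey]*session
}

func newSessionTable() *sessionTable {
	return &sessionTable{slots: make(map[sessionKey]*session)}
}

// open allocates a slot for k. ok == false means the table is full and the
// caller should answer with SERVER_BUSY (0x03).
// Assumption: a repeated request for a live key returns the existing slot.
func (t *sessionTable) open(k sessionKey) (*session, bool) {
	if s, exists := t.slots[k]; exists {
		return s, true
	}
	if len(t.slots) >= maxSessions {
		return nil, false
	}
	s := &session{}
	t.slots[k] = s
	return s, true
}

// lookup returns the session a DATA or PARITY packet belongs to, or nil.
func (t *sessionTable) lookup(k sessionKey) *session { return t.slots[k] }

// release frees the slot when a transfer completes or times out.
func (t *sessionTable) release(k sessionKey) { delete(t.slots, k) }
```

`LIST_REQ` would bypass `open` entirely, since listing is answered regardless of the session count.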

### A. SESSION_REJECT Reason Codes

All `SESSION_REJECT` packets carry a single reason-code byte in the payload:

| Code | Name | Meaning |
|---|---|---|
| 0x01 | SESSION_ID_COLLISION | The submitted SessionID is already active on the server. |
| 0x02 | HASH_MISMATCH | Received file hash does not match the value declared in SESSION_REQ. |
| 0x03 | SERVER_BUSY | The server is at its concurrent-session limit (`HPUDP_MAX_SESSIONS`); try again later. |
| 0x04 | FILE_NOT_FOUND | The requested filename is not in the serve manifest. |
| 0x05 | FILE_EXISTS | A push was rejected because the filename already exists on the server (no-overwrite policy). |
| 0x06 | ENCRYPTION_UNSUPPORTED | The receiver does not support encryption and received a request with the Encrypted flag (0x04) set. |
| 0x07 | RESUME_HASH_MISMATCH | The sender rejected a RESUME_REQ because the FullHash or PartialHash in the request did not match the file on disk. Defined in v5.1; RESUME_REQ is reserved and unused in the current implementation (§12C). |
| 0x08 | CLIENT_DISCONNECT | Sent by the client as a graceful disconnect signal (e.g. Ctrl+C). The sender releases the session immediately upon receipt rather than waiting for the heartbeat timeout. |
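
The table maps directly onto a small enumeration. The Go identifiers below are hypothetical; the numeric values and names are those listed above.

```go
package main

// rejectReason is the single reason-code byte carried by SESSION_REJECT.
type rejectReason byte

const (
	reasonSessionIDCollision     rejectReason = 0x01
	reasonHashMismatch           rejectReason = 0x02
	reasonServerBusy             rejectReason = 0x03
	reasonFileNotFound           rejectReason = 0x04
	reasonFileExists             rejectReason = 0x05
	reasonEncryptionUnsupported  rejectReason = 0x06
	reasonResumeHashMismatch     rejectReason = 0x07
	reasonClientDisconnect       rejectReason = 0x08
)

// String returns the protocol name of the reason code.
func (r rejectReason) String() string {
	switch r {
	case reasonSessionIDCollision:
		return "SESSION_ID_COLLISION"
	case reasonHashMismatch:
		return "HASH_MISMATCH"
	case reasonServerBusy:
		return "SERVER_BUSY"
	case reasonFileNotFound:
		return "FILE_NOT_FOUND"
	case reasonFileExists:
		return "FILE_EXISTS"
	case reasonEncryptionUnsupported:
		return "ENCRYPTION_UNSUPPORTED"
	case reasonResumeHashMismatch:
		return "RESUME_HASH_MISMATCH"
	case reasonClientDisconnect:
		return "CLIENT_DISCONNECT"
	}
	return "UNKNOWN"
}

// parseReject extracts the reason code from a SESSION_REJECT payload.
func parseReject(payload []byte) (rejectReason, bool) {
	if len(payload) < 1 {
		return 0, false
	}
	return rejectReason(payload[0]), true
}
```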

### B. PULL_REQ — Client-Initiated Pull (NAT Traversal)

The PULL_REQ mechanism allows a client behind NAT to retrieve a file from a serve daemon that has a public IP address, without any port-forwarding configuration on the client side.

**Wire format:** Packet type `0x07`. Payload is a null-terminated UTF-8 filename with no fixed-size prefix.

**Flow:**

1. **Client** generates a random `SessionID` (same CSPRNG path as §4A), binds a local UDP socket on an OS-assigned ephemeral port, and sends `PULL_REQ` to the server's control port. This outbound packet punches the NAT hole: the NAT mapping records *client-ip:ephemeral-port → server-ip:control-port*.

2. **Server** receives `PULL_REQ`. If busy or filename not in manifest, sends `SESSION_REJECT` back to the client's address. Otherwise, it fires a normal `SESSION_REQ` to the client's address from a new outbound connection. Because the server's IP was the destination of the punching packet, port-restricted-cone NAT routers (the most common home router class) allow this inbound from the same IP on any source port.

3. **Client** receives the `SESSION_REQ` on its bound socket, records the sender's address (server ephemeral port), and enters the normal receiver flow using its existing socket — no rebind required. The same socket that sent `PULL_REQ` carries the entire transfer.

4. The server's sender thread uses the `SessionID` supplied in the `PULL_REQ` header, eliminating an extra round-trip for ID assignment.

### C. LIST_REQ / LIST_RESP — Catalog Query

`LIST_REQ (0x0D)` allows a client to query the serve daemon's file manifest without initiating a transfer. The serve daemon responds with `LIST_RESP (0x0E)` immediately, regardless of how many transfers are in progress (`SERVER_BUSY` does not apply to listing).

**Wire format — LIST_REQ:** Packet type `0x0D`. No payload. The client generates a random `SessionID` used solely to match the response.

**Wire format — LIST_RESP:** Packet type `0x0E`. Payload is a UTF-8 string where each entry is a tab-separated `filename\tsize\n` line (filename, a tab character, the file's byte count as a decimal integer, and a newline). Empty payload means no files are available. The list is truncated to fit within a single MTU payload (`MaxPayload = 1368` bytes) if the directory is very large. Example: `backup.tar.gz\t524288000\nreport.pdf\t2097152\n`.
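
A client-side parser for the LIST_RESP payload could look like the sketch below. The names `listEntry` and `parseListResp` are hypothetical. Splitting on the last tab of each line is an assumption made so that a filename containing a tab still parses.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// listEntry is one file advertised by the serve daemon.
type listEntry struct {
	Name string
	Size uint64 // byte count, sent as a decimal integer
}

// parseListResp decodes a LIST_RESP payload of "filename\tsize\n" lines.
// An empty payload yields an empty list.
func parseListResp(payload string) ([]listEntry, error) {
	var entries []listEntry
	for _, line := range strings.Split(payload, "\n") {
		if line == "" {
			continue // trailing newline or empty payload
		}
		tab := strings.LastIndexByte(line, '\t')
		if tab < 0 {
			return nil, errors.New("LIST_RESP: entry has no tab separator")
		}
		size, err := strconv.ParseUint(line[tab+1:], 10, 64)
		if err != nil {
			return nil, err
		}
		entries = append(entries, listEntry{Name: line[:tab], Size: size})
	}
	return entries, nil
}
```

Applied to the example payload above, this yields two entries: `backup.tar.gz` with 524288000 bytes and `report.pdf` with 2097152 bytes.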

**Flow:**

1. **Client** generates a random `SessionID`, binds a local UDP socket, and sends `LIST_REQ` to the server's control port.

2. **Server** reads the manifest under a read lock and immediately writes a `LIST_RESP` to the client's address with the same `SessionID`. The busy flag is not checked.

3. **Client** waits up to 2 seconds for a matching `LIST_RESP` (matching by `SessionID`), retrying up to 3 times on timeout.

### D. PUSH_REQ / PUSH_ACCEPT — Client-Initiated Push

The push flow allows a client to deposit a file into the serve daemon's directory. Three security invariants are always enforced:

1. **Base-name-only rule:** The filename from the `PUSH_REQ` payload is sanitized to its base name — the substring after the last `/` or `\` separator (C: manual reverse scan for both separators). Path traversal sequences such as `../../etc/passwd` or `/absolute/path` are reduced to just the final filename component before any further processing.

2. **No-overwrite rule:** If a file with the sanitized name already exists in the serve directory, the request is rejected with reason code `FILE_EXISTS (0x05)`.

3. **Post-hash atomic rename:** The incoming file is written to `filename.tmp` during the transfer. Only after a successful `TRANSFER_COMPLETE` (xxHash64 verified) is the `.tmp` renamed to its final path. A failed or interrupted transfer leaves no partial file in the directory.
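
The base-name-only rule in item 1 reduces to a reverse scan for either separator. The sketch below is a hypothetical Go rendering of that scan. The additional rejection of empty, `.`, and `..` results in `safeBaseName` is a precaution assumed here, not a rule stated in the specification.

```go
package main

// baseName returns the substring after the last '/' or '\' separator,
// scanning from the end of the name.
func baseName(name string) string {
	for i := len(name) - 1; i >= 0; i-- {
		if name[i] == '/' || name[i] == '\\' {
			return name[i+1:]
		}
	}
	return name
}

// safeBaseName sanitizes name and additionally rejects results that cannot
// name a regular file (assumed precaution, not part of the stated rule).
func safeBaseName(name string) (string, bool) {
	base := baseName(name)
	if base == "" || base == "." || base == ".." {
		return "", false
	}
	return base, true
}
```

For the inputs mentioned above, `../../etc/passwd` reduces to `passwd` and `/absolute/path` reduces to `path`.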

**PUSH_REQ wire format** (type `0x08`): `FileSize(8B, big-endian)` + `FileHash(8B, big-endian)` + `InitialRate(4B, big-endian)` + optional `PubKey(32B)` (encrypted mode only) + null-terminated filename. The layout mirrors SESSION_REQ (§4C Step 1).
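
The PUSH_REQ payload layout can be sketched as an encoder and decoder pair. This covers the payload only, not the 32-byte packet header. The struct and function names are hypothetical, and the `encrypted` argument stands in for the header's `Encrypted` flag.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
)

// pushReq mirrors the PUSH_REQ payload fields.
type pushReq struct {
	FileSize    uint64
	FileHash    uint64
	InitialRate uint32
	PubKey      []byte // 32 bytes in encrypted mode, empty otherwise
	Filename    string
}

// encode serializes the payload: three big-endian integers, the optional
// public key, then the null-terminated filename.
func (p pushReq) encode() []byte {
	buf := make([]byte, 20, 20+len(p.PubKey)+len(p.Filename)+1)
	binary.BigEndian.PutUint64(buf[0:8], p.FileSize)
	binary.BigEndian.PutUint64(buf[8:16], p.FileHash)
	binary.BigEndian.PutUint32(buf[16:20], p.InitialRate)
	buf = append(buf, p.PubKey...)
	buf = append(buf, p.Filename...)
	return append(buf, 0)
}

// decodePushReq parses a payload; encrypted says whether a 32-byte public
// key precedes the filename.
func decodePushReq(payload []byte, encrypted bool) (pushReq, error) {
	var p pushReq
	if len(payload) < 20 {
		return p, errors.New("PUSH_REQ: payload shorter than fixed fields")
	}
	p.FileSize = binary.BigEndian.Uint64(payload[0:8])
	p.FileHash = binary.BigEndian.Uint64(payload[8:16])
	p.InitialRate = binary.BigEndian.Uint32(payload[16:20])
	rest := payload[20:]
	if encrypted {
		if len(rest) < 32 {
			return p, errors.New("PUSH_REQ: missing public key")
		}
		p.PubKey = append([]byte(nil), rest[:32]...)
		rest = rest[32:]
	}
	nul := bytes.IndexByte(rest, 0)
	if nul < 0 {
		return p, errors.New("PUSH_REQ: filename not null-terminated")
	}
	p.Filename = string(rest[:nul])
	return p, nil
}
```

The decoded filename would still need the base-name sanitization described above before it touches the filesystem.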

**PUSH_ACCEPT wire format** (type `0x09`): No payload in unencrypted mode. In encrypted mode, a single `PubKey(32B)` — the server's ephemeral X25519 public key for completing the key exchange.

**Flow:**

1. **Client** generates a `SessionID`, sends `PUSH_REQ` to the daemon port containing the file metadata and filename. If encrypted, includes the client's X25519 public key in the payload (§4.5E).

2. **Server** validates (not over session limit, base-name safe, file does not already exist). If encrypted, generates its own X25519 keypair and derives the shared session key. Sends `PUSH_ACCEPT` (including its pub key in encrypted mode). Allocates a session slot and begins receiving DATA packets on the same shared socket.

3. **Client** receives `PUSH_ACCEPT`. If encrypted, calls `hpudp_crypto_derive()` with the server's pub key. Begins the calibration burst and data transmission to the same daemon port using the same `SessionID`.

4. **Server** receives DATA/PARITY packets, dispatches them by `(src_addr, session_id)` to the allocated session slot, writes to `filename.tmp` via mmap. On `TRANSFER_COMPLETE`: verifies hash, atomically renames `filename.tmp` to final path. On failure: tmp is deleted and the slot is cleared.

### E. Manifest Lifecycle

The manifest is a filename-to-absolute-path map built at daemon startup by a non-recursive directory scan. Symlinks and directories are excluded. Files added to the directory after startup are invisible until the daemon is restarted — this is an intentional security boundary. Successful push transfers atomically add their promoted file to the in-memory manifest under a write-lock (Go: `sync.RWMutex`; C: `pthread_rwlock_t`). PULL_REQ handlers acquire a read-lock when consulting the manifest.
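
The locking discipline can be sketched with `sync.RWMutex` as the paragraph suggests. The `manifest` type and its methods are hypothetical names, and the startup directory scan is omitted.

```go
package main

import "sync"

// manifest maps a filename to its absolute path.
type manifest struct {
	mu    sync.RWMutex
	files map[string]string
}

func newManifest() *manifest {
	return &manifest{files: make(map[string]string)}
}

// lookup is the PULL_REQ path: many readers may hold the lock at once.
func (m *manifest) lookup(name string) (string, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	path, ok := m.files[name]
	return path, ok
}

// add is the push-promotion path: called after the .tmp file has been
// renamed, it takes the write lock briefly to publish the new entry.
func (m *manifest) add(name, absPath string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.files[name] = absPath
}
```

Readers never observe a half-published entry, and a pushed file becomes visible to pulls only after its hash has been verified and the rename has completed.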

## 12. Resumable Transfers

A transfer interrupted mid-flight (network loss, client Ctrl+C, daemon restart) can be resumed transparently without any wire-level negotiation. The mechanism is entirely receiver-side: a binary checkpoint sidecar written periodically by the receiver is loaded automatically when a matching new session arrives.

### A. Checkpoint Sidecar

While receiving, the receiver writes a checkpoint sidecar (filename: `<output>.hpudp-ckpt`) on every heartbeat. The sidecar is a binary file with the following layout. **All fields are little-endian** — this is a local file format, not a wire format.

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 bytes | magic | Always 0x48505543 ("HPUC"). Identifies the file as a valid checkpoint. |
| 4 | 4 bytes | version | Always 1 in the current implementation. |
| 8 | 8 bytes | file_size | Total declared file size from the original SESSION_REQ. |
| 16 | 8 bytes | file_hash | xxHash64 of the complete file, as declared in SESSION_REQ. Used to match the checkpoint to a resumed session. |
| 24 | 8 bytes | total_chunks | ceil(file_size / MaxPayload) — total expected DATA packets. |
| 32 | 8 bytes | highest_contiguous | Highest sequence number N such that all sequences 0…N have been received. |
| 40 | variable | recv_bits | Receive bitset: ceil(total_chunks / 8) bytes. Bit i is set if sequence i has been received. Allows the receiver to skip already-received out-of-order packets on resume. |

The sidecar is written atomically (write to `<output>.hpudp-ckpt.tmp`, then rename) to prevent partial reads. It is deleted on successful transfer completion.
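
The sidecar layout and the write-then-rename step can be sketched as follows. The type and function names are hypothetical. The sketch assumes the magic is stored as a little-endian `uint32` like every other field, which means the on-disk bytes read `43 55 50 48` rather than the ASCII order `HPUC`.

```go
package main

import (
	"encoding/binary"
	"errors"
	"os"
)

const (
	ckptMagic      = 0x48505543 // "HPUC"
	ckptVersion    = 1
	ckptHeaderSize = 40 // recv_bits starts at offset 40
)

// checkpoint mirrors the sidecar fields after magic and version.
type checkpoint struct {
	FileSize          uint64
	FileHash          uint64
	TotalChunks       uint64
	HighestContiguous uint64
	RecvBits          []byte // ceil(TotalChunks / 8) bytes
}

// marshal serializes the checkpoint, all fields little-endian.
func (c checkpoint) marshal() []byte {
	le := binary.LittleEndian
	buf := make([]byte, ckptHeaderSize, ckptHeaderSize+len(c.RecvBits))
	le.PutUint32(buf[0:4], ckptMagic)
	le.PutUint32(buf[4:8], ckptVersion)
	le.PutUint64(buf[8:16], c.FileSize)
	le.PutUint64(buf[16:24], c.FileHash)
	le.PutUint64(buf[24:32], c.TotalChunks)
	le.PutUint64(buf[32:40], c.HighestContiguous)
	return append(buf, c.RecvBits...)
}

// unmarshalCheckpoint validates magic, version, and bitset length.
func unmarshalCheckpoint(b []byte) (checkpoint, error) {
	var c checkpoint
	le := binary.LittleEndian
	if len(b) < ckptHeaderSize {
		return c, errors.New("checkpoint: truncated header")
	}
	if le.Uint32(b[0:4]) != ckptMagic || le.Uint32(b[4:8]) != ckptVersion {
		return c, errors.New("checkpoint: bad magic or version")
	}
	c.FileSize = le.Uint64(b[8:16])
	c.FileHash = le.Uint64(b[16:24])
	c.TotalChunks = le.Uint64(b[24:32])
	c.HighestContiguous = le.Uint64(b[32:40])
	if uint64(len(b)-ckptHeaderSize) != (c.TotalChunks+7)/8 {
		return c, errors.New("checkpoint: bitset length mismatch")
	}
	c.RecvBits = append([]byte(nil), b[ckptHeaderSize:]...)
	return c, nil
}

// writeCheckpoint writes to "<path>.tmp" and renames it over path, so a
// reader never sees a partially written sidecar.
func writeCheckpoint(path string, c checkpoint) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, c.marshal(), 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}
```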

### B. Transparent Resume Flow

Resume is fully transparent to the sender: the sender always initiates with a normal `SESSION_REQ` from sequence 0. The `hpudp resume` command is identical to `hpudp send`; checkpoint matching happens entirely at the receiver.

1. **Receiver** receives `SESSION_REQ`. It checks for a sidecar at `<output_path>.hpudp-ckpt`. A sidecar matches if its `file_hash` equals the hash in the incoming SESSION_REQ. If no sidecar is found, the receiver proceeds with a fresh transfer.

2. **Receiver** loads the sidecar: restores the receive bitset and `highest_contiguous`. The mmap file already contains the previously-received data at the correct offsets.

3. **Sender** transmits from seq 0 as normal. The receiver skips (discards) packets whose bit is already set in `recv_bits`. New packets are written normally.

4. On the first heartbeat, the receiver reports the restored `HighestContiguous`. The sender's sliding window advances past all already-acknowledged sequences, freeing ring buffer slots.

5. Transfer completes and hash is verified as normal. The sidecar is deleted on success.
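
Steps 1 and 3 come down to a hash comparison and a bitset test per packet. The sketch below uses hypothetical names and assumes bit `i` lives in byte `i / 8` at bit position `i % 8` (least significant bit first). The specification does not state the bit order within a byte.

```go
package main

// recvBitset records which sequences have been received.
type recvBitset []byte

// has reports whether seq is already marked as received.
func (b recvBitset) has(seq uint64) bool {
	return b[seq/8]&(byte(1)<<(seq%8)) != 0
}

// set marks seq as received.
func (b recvBitset) set(seq uint64) {
	b[seq/8] |= byte(1) << (seq % 8)
}

// accept reports whether a packet with this sequence should be written.
// Already-received or out-of-range sequences are discarded; new ones are
// marked and accepted.
func (b recvBitset) accept(seq uint64) bool {
	if seq/8 >= uint64(len(b)) || b.has(seq) {
		return false
	}
	b.set(seq)
	return true
}

// sidecarMatches applies the resume rule: a sidecar belongs to the incoming
// session if its stored file_hash equals the hash in SESSION_REQ.
func sidecarMatches(sidecarHash, sessionReqHash uint64) bool {
	return sidecarHash == sessionReqHash
}
```

On resume the receiver would load `recv_bits` from the sidecar into such a bitset, after which every retransmitted packet from the earlier attempt is dropped by `accept` without touching the mmap.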

**Design Rationale:** Receiver-side transparent resume is simpler and more robust than wire-level RESUME_REQ / RESUME_ACCEPT negotiation. The sender needs no changes. The receiver handles hash validation locally via the stored `file_hash`. There is no cross-machine hash-of-partial-data coordination that could mismatch due to FEC recovery reordering partial shard data.

### C. Packet Types 0x0B / 0x0C — Reserved

Packet types `0x0B RESUME_REQ` and `0x0C RESUME_ACCEPT` are defined in the protocol header and assigned their type codes, but are **not used in the current implementation**. They are reserved for a future wire-level resume negotiation protocol (e.g., for scenarios where the sender and receiver are on different machines and the receiver wants to skip already-received bytes at the wire level to save bandwidth). Receivers must silently ignore packets with these type codes.

## Appendix A: Protocol Constants (Defaults)

| Parameter | Default Value | Configurable |
|---|---|---|
| MTU Hard Cap | 1400 bytes (total) | No |
| Header Size | 32 bytes (4 × 64-bit aligned, includes SenderTimestampNs) | No |
| Max Payload | 1368 bytes unencrypted (MTUHardCap(1400) − HeaderSize(32)); 1352 bytes encrypted (1368 − GCM_TagSize(16)) | No |
| FEC Block Size | 100 data packets | Yes |
| FEC Initial Parity | 5% | Yes |
| FEC Tail Min Parity | 2 packets | Yes |
| Calibration Burst Size | 10 packets (packet train) | Yes |
| Calibration Burst Spacing | 0 (wire speed) | Yes |
| Default Starting Rate | 2 MB/s | Yes |
| EWMA Smoothing Factor (α) | 0.3 | Yes |
| Loss Threshold: Increase | < 100 bp (1%) | Yes |
| Loss Threshold: Hold / Phase Transition | 100–500 bp (1–5%) | Yes |
| Loss Threshold: Decrease | > 500 bp (5%) | Yes |
| Consecutive Decrease Signals | 2 | Yes |
| Phase 1 Increase Multiplier | 1.25× per RTT | Yes |
| Phase 2 Additive Increase | MaxPayload / RTT per RTT | Yes |
| Decrease Factor | 0.85× smoothed delivery rate | Yes |
| Auto-Ceiling Multiplier (Phase 1) | 4× peak delivery rate (runaway prevention) | Yes |
| Auto-Ceiling Multiplier (Phase 2) | 1.5× peak delivery rate (avoidance bound) | Yes |
| Deficit Accumulator Burst Cap | 2ms of credit | Yes |
| Deficit Sleep Threshold | ≥ 1ms | No (OS-dependent) |
| Max NACKs per Send Iteration | 3 | Yes |
| Rate Floor | 10 KB/s | Yes |
| Inactivity Timeout | max(5 × heartbeat interval, 5s) | Yes |
| Sender Probe Interval | 500ms | Yes |
| Sender Probe Timeout | 10 seconds | Yes |
| Linger Duration (both sides) | 3 seconds | Yes |
| Receiver Teardown Retries | 3 | Yes |
| Stale SessionID Reservation | 10 seconds | Yes |
| Max SESSION_REQ File Size | 1 TB | Yes |
| Teardown Batch Size | 10 packets per sleep | Yes |
| Teardown Batch Sleep | 2 ms | Yes |
| Delivery-Collapse Threshold | 25% of current send rate (was 50%) | Yes |
| NACK Cooldown Margin | RTT × 1.25 (RTT + 25%) | Yes |
| Tail-Drop Injection Limit | 167 sequences per heartbeat | Yes |
| Sliding Window Slots | 65,536 (2¹⁶, ~89 MB peak). Go prototype uses 50,000; C implementations should use power-of-2 for bitmask index wrapping. | Yes |
| Encryption Cipher | AES-128-GCM (128-bit key, 128-bit auth tag, 96-bit nonce) | No |
| GCM Tag Size | 16 bytes | No |
| GCM Nonce Size | 12 bytes: iv_base(8B) || seq_low32_be(4B). iv_base is derived from HKDF output bytes 16–23; seq_low32_be is the low 32 bits of SequenceNum in big-endian. Not transmitted. | No |
| Key Exchange | X25519 ephemeral (32-byte public key per side) | No |
| Key Derivation | HKDF-SHA256, salt = SessionID (4B big-endian), info = "hp-udp-aes128-v5", output = 24 bytes (bytes 0–15: AES-128 session key; bytes 16–23: iv_base) | No |
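
The nonce construction in the last rows can be sketched without any cryptographic library, since it is only byte layout. The function names are hypothetical, the 24-byte input stands for the HKDF-SHA256 output described above, and `SequenceNum` is assumed to be a 64-bit value.

```go
package main

import "encoding/binary"

// splitKeyMaterial splits the 24-byte HKDF output into the AES-128 session
// key (bytes 0-15) and iv_base (bytes 16-23).
func splitKeyMaterial(okm [24]byte) (key [16]byte, ivBase [8]byte) {
	copy(key[:], okm[0:16])
	copy(ivBase[:], okm[16:24])
	return key, ivBase
}

// gcmNonce builds the 12-byte GCM nonce: iv_base(8B) || seq_low32_be(4B).
// The nonce is recomputed on both sides and never transmitted.
func gcmNonce(ivBase [8]byte, sequenceNum uint64) [12]byte {
	var nonce [12]byte
	copy(nonce[:8], ivBase[:])
	// uint32 conversion keeps only the low 32 bits of the sequence number.
	binary.BigEndian.PutUint32(nonce[8:], uint32(sequenceNum))
	return nonce
}
```

Both endpoints derive the same `iv_base` from the shared secret, so the receiver can rebuild each packet's nonce from the sequence number in the header alone.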

## Appendix B: Revision Log

- **v5.2** — C implementation alignment: HKDF output extended to 24 bytes (adds HKDF-derived `iv_base` for nonce construction); GCM nonce redesigned to `iv_base(8B) || seq_low32_be(4B)` replacing the 3-field `SessionID+PacketType+UniqueID` scheme; PUSH_ACCEPT payload simplified (no ephemeral port — daemon is single-socket); PUSH_REQ payload aligned with SESSION_REQ (adds `FileHash` and `InitialRate`); serve daemon upgraded to 16 concurrent sessions; LIST_RESP format updated to tab-separated `filename\tsize\n` lines; §12 Resume rewritten as transparent receiver-side checkpoint mechanism (no wire RESUME_REQ/RESUME_ACCEPT — types 0x0B/0x0C reserved); checkpoint sidecar renamed `.hpudp-ckpt` with binary format (magic + version + bitset).

- **v5.1** — Resumable transfers (§12): checkpoint sidecar (`.hpuft-resume`), `RESUME_REQ (0x0B)` / `RESUME_ACCEPT (0x0C)` packet types, FEC block-boundary alignment, PartialHash cross-validation; graceful client disconnect: `CLIENT_DISCONNECT` reject reason `(0x08)`, sender hard-abort on 3s heartbeat staleness (`SenderHeartbeatTimeout`); catalog query: `LIST_REQ (0x0D)` / `LIST_RESP (0x0E)`, served regardless of busy state; `RESUME_HASH_MISMATCH (0x07)` reject code.

- **v5.0** — End-to-end encryption: ephemeral X25519 key exchange + AES-128-GCM per-packet encryption (§4.5); `SESSION_ACCEPT` (0x0A) packet type; `Encrypted` flag (0x04); 1-RTT handshake when encrypted (0-RTT preserved for unencrypted); deterministic nonce from header fields; encrypt-after-FEC data path; extended payloads for SESSION_REQ, PUSH_REQ, PUSH_ACCEPT, PULL_REQ; `ENCRYPTION_UNSUPPORTED` reject code; encrypted MaxPayload 1352 bytes; backward compatible with v4.x unencrypted transfers.

- **v4.2** — C implementation readiness: universal big-endian byte order declaration covering all payload fields (not just header); language-neutral wording throughout (monotonic clock, `clock_nanosleep`, `pthread_rwlock_t`, manual basename scan, ephemeral `bind()`+`getsockname()`); expanded deficit-accumulator C guidance (`TIMER_ABSTIME`, `SCHED_FIFO`); architecture-neutral teardown socket ownership (epoll single-thread vs goroutine); power-of-2 sliding window slot count (65,536) for bitmask index wrapping; `io_uring` receiver disk I/O note.

- **v4.1** — Sender sliding window: bounded ring buffer (50,000 slots) replaces unbounded `map`; backpressure NACK starvation deadlock fix (`if IsFull { continue }` replaces bare spin).

- **v4.0** — WAN reliability overhaul: same-clock RTT measurement via `SenderTimestampNs` header field (HeaderSize 24→32, MaxPayload 1376→1368); frozen-timestamp RTT guard (`lastEchoNs`); RTT-aware NACK cooldown map; tail-drop deadlock prevention via proactive tail injection; teardown micro-burst prevention (batch-10 / 2ms); `StorageFlushRate` removed from CC effective-rate; delivery-collapse threshold 50%→25%; progress bar repair state.

- **v3.1** — Calibration burst reduced to 10 packets (packet train); dispersion-based bottleneck measurement (`DispersionNs`); sender timestamps for RTT measurement; two-phase congestion controller (Phase 1 multiplicative 1.25×/RTT, Phase 2 additive); decrease formula changed from `E × 1.05` to `E × 0.85`; rate-increase gating to once per RTT; Phase 2 throughput analysis; `sendmmsg()`/`recvmmsg()`/`io_uring` roadmap.

- **v3.0** — Delivery-rate-ratio CC replaced with loss-driven algorithm; delivery-collapse guard; paced teardown retransmits; NACK deduplication (set instead of queue); two-tier auto-ceiling (Phase 1: 4×, Phase 2: 1.5×); consecutive decrease requirement; EWMA smoothing for delivery rate.

- **v2.0** — 50-packet calibration burst at 1ms spacing; per-packet sender timestamps in payload metadata; 1.5× multiplicative increase per heartbeat; initial delivery-rate-ratio CC algorithm.

- **v1.0** — Initial specification: 0-RTT handshake, SessionID generation/collision handling, FEC (Reed-Solomon GF(2⁸)), adaptive parity ratio, deficit-accumulator pacing, Heartbeat/NACK mechanism, inactivity timeouts, graceful teardown with linger states, serve daemon (PULL_REQ/PUSH_REQ).