Changelog highlights from recent revisions:

- Encrypted flag (0x04) added to the packet header; it defaults to 0.
- 0x0A SESSION_ACCEPT — carries the receiver's ephemeral public key during the encrypted handshake key exchange.
- The calibration burst is encrypted in encrypted mode.
- 0x06 ENCRYPTION_UNSUPPORTED added as a SESSION_REJECT reason code.
- Byte-order guidance: Go uses binary.BigEndian; C uses htonl/ntohl on every field.
- time.Now().UnixNano() replaced with "monotonic nanosecond timestamp" plus platform guidance for C (clock_gettime(CLOCK_MONOTONIC)) and Go (time.Now().UnixNano()).
- An io_uring submission queue replaces the Go async flush goroutine; the single-threaded epoll event loop architecture eliminates the socket ownership race described in Lesson F.
- Timing guidance: clock_gettime(CLOCK_MONOTONIC) for elapsed time, clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME) for sleeps, optional SCHED_FIFO for sub-100µs precision.
- Ring-buffer indexing via bitmask (idx & 0xFFFF) replaces modulo, eliminating a division per packet on the hot path.
- Filename sanitization scans for both / and \ to extract the base name, replacing Go-specific filepath.Base().
- Locking primitives specified per platform (C: pthread_rwlock_t, Go: sync.RWMutex); ephemeral socket bind in C uses bind() with port 0.
- The sentChunks map[uint64][]byte was replaced by a bounded SlidingWindow ring buffer (50,000 slots, ~68 MB peak). Entries are evicted on each HighestContiguous advance from incoming heartbeats; the tail pointer never holds confirmed data. The sender blocks new packet sends — but continues processing NACK retransmits — when the window is full, providing natural memory-safe backpressure.
- Backpressure bug: the old loop was for sw.IsFull(seqNum) { time.Sleep(1ms) } — a bare spin that never returned to the NACK-processing step at the top of the outer loop. If the first DATA packet was dropped, HighestContiguous stayed at 0 (Advance(0) is a no-op), the window filled at exactly seq 50,000 (~68 MB = ~6% of a 1 GB file), and NACKs in nackPending went unserviced — causing receiver inactivity timeout. Fixed by replacing for IsFull { sleep } with if IsFull { sleep; continue }, so every backpressure iteration still executes NACK retransmits before sleeping.
- SenderTimestampNs (8 bytes, unix nanoseconds) added at offset 0x18. HeaderSize 24→32 bytes; MaxPayload 1376→1368 bytes. The sender timestamp is now in the fixed header (not the payload) for all DATA and PARITY packets.
- The receiver echoes pkt.Header.SenderTimestampNs verbatim as EchoTimestampNs. The sender computes RTT = time.Now().UnixNano() − EchoTimestampNs using only its own clock, eliminating cross-machine clock-skew error entirely.
- The TokenBucket tracks lastEchoNs; RTT is only updated when echoNs > lastEchoNs. Stale repeated echoes (sender idle during NACK cooldown) are silently ignored, preventing RTT inflation past the receiver's 5-second inactivity timeout.
- When HighestContiguous < totalChunks−1, the sender proactively injects up to 167 missing tail sequences into the retransmit pipeline, breaking the deadlock where tail drops prevent HighestReceived from advancing.
- rawEffective = NetworkDeliveryRate only. StorageFlushRate was removed from the rate formula; disk-flush stalls (always 0 due to pre-allocated ring buffer out-of-order writes) no longer falsely trigger the delivery-collapse guard.
- During tail recovery the progress display shows "Repairing..." instead of a speed figure, indicating tail recovery rather than a hang.
- Heartbeats gained the DispersionNs field and the EchoTimestamp field; the sender computes RTT for rate-gating.
- The decrease target changed from E × 1.05 to E × 0.85. The old formula sustained congestion by targeting above the capacity that caused loss; the new formula drops below to drain router queues, relying on FEC to bridge the gap.
- sendmmsg()/recvmmsg() batching, io_uring, and throughput analysis explaining the Phase 1 FTP speed gap (~30 MB/s vs 93 MB/s due to per-packet syscall overhead).

While TCP is the foundational workhorse of the internet, its general-purpose congestion algorithms inherently throttle performance on Long Fat Networks (LFNs).
HP-UDP was built to prove that it is possible to outperform TCP in raw throughput by replacing reactive safety nets with proactive, domain-specific algorithms.
This protocol democratizes high-speed data movement, giving developers and engineers the ability to send and receive massive files cleanly, reliably, and at maximum hardware limits. It is a rigorous demonstration of advanced systems engineering, built to prove what is possible when legacy constraints are stripped away.
HP-UDP is an application-layer file transfer mechanism built on top of UDP. The design is lean, avoids unnecessary overhead, and focuses intently on its primary goal of speed while ensuring the reliability required for production use.
The architecture is built upon five core pillars:
Note on Security Scope: HP-UDP v5.0 adds optional end-to-end encryption via ephemeral X25519 key exchange and AES-128-GCM per-packet encryption (§4.5), providing confidentiality, integrity, and perfect forward secrecy. Encryption is backward-compatible: unencrypted transfers still work when the Encrypted flag is unset. HP-UDP intentionally omits authentication — it does not verify the identity of the remote endpoint. In the target deployment environment (managed networks with known infrastructure behind SDNs), endpoint identity is established at the network layer. Optional certificate or pre-shared-key authentication may be added in a future revision as a separate concern.
The protocol utilizes a tightly packed, fixed-width 32-byte binary header for every datagram. The header is naturally aligned for 64-bit systems (four 8-byte words). The hard MTU cap is 1400 bytes total (header + payload), yielding a maximum payload of 1368 bytes (MTUHardCap(1400) − HeaderSize(32)). This ensures safe passage within standard 1500-byte ethernet MTUs without IP-level fragmentation.
Byte order: All multi-byte fields in the entire protocol — header fields, heartbeat payload fields (§6B), SESSION_REQ payload fields (§4C), PUSH_REQ/PUSH_ACCEPT payload fields (§11C), and NACK arrays — are in big-endian (network byte order). C implementations must use htonl/ntohl (32-bit) and htonll/ntohll (64-bit) or equivalent for every multi-byte field on the wire. Go implementations use binary.BigEndian methods. This applies uniformly; there are no little-endian fields anywhere in the protocol.
| Byte Offset | Size | Field Name | Description |
|---|---|---|---|
| 0x00 | 1 Byte | PacketType | 0x00 SESSION_REQ, 0x01 DATA, 0x02 PARITY, 0x03 HEARTBEAT, 0x04 SESSION_REJECT, 0x05 TRANSFER_COMPLETE, 0x06 ACK_CLOSE, 0x07 PULL_REQ, 0x08 PUSH_REQ, 0x09 PUSH_ACCEPT, 0x0A SESSION_ACCEPT. |
| 0x01 | 4 Bytes | SessionID | Client-generated random identifier for the active transfer (see §4 for collision handling). |
| 0x05 | 8 Bytes | SequenceNum | Strictly incrementing 64-bit chunk identifier. Eliminates rollover concerns up to ~16 EB file sizes. |
| 0x0D | 8 Bytes | BlockGroup | 64-bit identifier for the FEC block this packet belongs to. Aligned with SequenceNum address space. |
| 0x15 | 2 Bytes | PayloadLen | Size of the raw data payload (max 1368 bytes). |
| 0x17 | 1 Byte | Flags | Bitmask: 0x01 = End of File, 0x02 = Calibration Burst, 0x04 = Encrypted (payload is AES-128-GCM ciphertext; see §4.5). |
| 0x18 | 8 Bytes | SenderTimestampNs | The sender's monotonic clock timestamp in nanoseconds at the moment each DATA or PARITY packet is built (C: clock_gettime(CLOCK_MONOTONIC) converted to nanoseconds; Go: time.Now().UnixNano()). Non-data control packets leave this field zero. The receiver echoes this value verbatim as EchoTimestampNs in the Heartbeat payload; the sender computes RTT = now_ns − EchoTimestampNs using only its own clock (§6B). |
| 0x20 | Variable | Payload | Raw file bytes, FEC parity data, or protocol metadata. |
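The layout above maps directly to serialization with binary.BigEndian at the documented offsets. The sketch below is illustrative, not the reference implementation; the struct and method names are assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const HeaderSize = 32

// Header mirrors the 32-byte fixed wire header (all multi-byte fields big-endian).
type Header struct {
	PacketType        byte
	SessionID         uint32
	SequenceNum       uint64
	BlockGroup        uint64
	PayloadLen        uint16
	Flags             byte
	SenderTimestampNs uint64
}

// Marshal packs the header at the documented byte offsets.
func (h *Header) Marshal(buf []byte) {
	buf[0x00] = h.PacketType
	binary.BigEndian.PutUint32(buf[0x01:], h.SessionID)
	binary.BigEndian.PutUint64(buf[0x05:], h.SequenceNum)
	binary.BigEndian.PutUint64(buf[0x0D:], h.BlockGroup)
	binary.BigEndian.PutUint16(buf[0x15:], h.PayloadLen)
	buf[0x17] = h.Flags
	binary.BigEndian.PutUint64(buf[0x18:], h.SenderTimestampNs)
}

// Unmarshal reverses Marshal.
func Unmarshal(buf []byte) Header {
	return Header{
		PacketType:        buf[0x00],
		SessionID:         binary.BigEndian.Uint32(buf[0x01:]),
		SequenceNum:       binary.BigEndian.Uint64(buf[0x05:]),
		BlockGroup:        binary.BigEndian.Uint64(buf[0x0D:]),
		PayloadLen:        binary.BigEndian.Uint16(buf[0x15:]),
		Flags:             buf[0x17],
		SenderTimestampNs: binary.BigEndian.Uint64(buf[0x18:]),
	}
}

func main() {
	h := Header{PacketType: 0x01, SessionID: 0xDEADBEEF, SequenceNum: 42, PayloadLen: 1368, Flags: 0x01}
	buf := make([]byte, HeaderSize)
	h.Marshal(buf)
	fmt.Println(Unmarshal(buf) == h)
}
```

Note that the 0x01 offset of SessionID means the 32- and 64-bit fields are not naturally aligned within the buffer; binary.BigEndian handles unaligned access, and the four-word alignment claim in §3 applies to the overall 32-byte size.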
To eliminate latency before data transmission begins, the protocol uses a Zero Round-Trip Time (0-RTT) handshake with a wire-speed calibration burst to probe link capacity.
The client generates the SessionID as a cryptographically random 32-bit integer (C: getrandom() or /dev/urandom; Go: crypto/rand). This keeps the handshake 0-RTT since no server round-trip is required for ID assignment.
If a SESSION_REQ carries a SessionID that is already in use, the server responds with a SESSION_REJECT (Type 0x04) containing a reason code. The client generates a new random SessionID and retransmits the SESSION_REQ. At typical concurrency levels (hundreds of concurrent transfers), 32-bit random IDs yield negligible collision probability (~1 in 10 million at 200 concurrent sessions).

Before allocating resources, the receiver validates the SESSION_REQ payload:
If validation fails, the receiver sends a SESSION_REJECT and logs a diagnostic warning.
The handshake is 0-RTT when unencrypted and 1-RTT when encrypted (§4.5). Both flows are described below.
Step 1 — Packet 0 (SESSION_REQ). The payload contains: FileSize (8 bytes), xxHash64 checksum (8 bytes), InitialRate (4 bytes, 0 = use calibration mode), and FileName (variable, null-terminated). If the Encrypted flag (0x04) is set in the header Flags field, a 32-byte SenderPublicKey (X25519 ephemeral) is appended after InitialRate and before FileName. The null terminator is appended after the filename bytes and is not counted in PayloadLen. Filenames must not contain embedded null bytes. The receiver strips all path separators (/ and \) from the filename before writing — C implementations must scan for both characters to be platform-neutral.

Step 1.5 — Key exchange (encrypted mode only). If the Encrypted flag is set, the receiver responds with a SESSION_ACCEPT (Type 0x0A) carrying its own 32-byte ephemeral ReceiverPublicKey as the payload. Both sides compute the shared secret via X25519 and derive the session key via HKDF (§4.5). The sender blocks until SESSION_ACCEPT arrives or the sender inactivity timeout fires. In unencrypted mode, this step is skipped entirely — the sender proceeds directly to Step 2.

Step 2 — Calibration burst. The sender transmits DATA packets with the Calibration flag (0x02) set. In unencrypted mode, this starts immediately after Step 1 (0-RTT). In encrypted mode, this starts after receiving SESSION_ACCEPT (1-RTT). The burst consists of 10 packets sent back-to-back at wire speed. This small packet train probes the link without flooding router buffers, even on constrained links like satellite. The token bucket is initialized at a default starting rate of 2 MB/s. In encrypted mode, calibration DATA packets are encrypted — there are no plaintext data packets on the wire once the key exchange completes. Only burst packets carry the Calibration flag.

Step 3 — Dispersion measurement. The receiver measures the time between the first and last calibration packet and derives the bottleneck bandwidth; with 10 ms of dispersion, for example: (10 − 1) × 1368 / 0.010 = 1.23 MB/s (or (10 − 1) × 1352 / 0.010 in encrypted mode). The receiver reports this as the DispersionNs field in the first heartbeat, giving the sender a direct measurement of the path's bottleneck capacity before the first rate adjustment. The sender can use this to seed the CC's peak rate estimate.
If the heartbeat includes a valid DispersionNs, the sender uses the derived bandwidth as the initial peakRate estimate. The loss-driven congestion controller (§6) governs the sending rate from this point forward.

If the InitialRate field in SESSION_REQ is non-zero, the sender skips calibration mode and begins transmitting at the specified bytes-per-second rate immediately. This is intended for known environments (e.g., a dedicated 10 Gbps LAN) where the operator can confidently set the initial rate. The adaptive congestion controller still takes over after the first Heartbeat.
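The dispersion-to-bandwidth conversion reduces to a few lines. BurstSize, MaxPayload, and the formula come from this section; the function name is an illustrative assumption, not the reference implementation.

```go
package main

import "fmt"

const (
	BurstSize  = 10   // calibration packets sent back-to-back at wire speed
	MaxPayload = 1368 // bytes per unencrypted DATA payload
)

// bandwidthFromDispersion converts the receiver-reported dispersion (ns between
// the first and last calibration packet) into a bottleneck-bandwidth estimate
// in bytes/second: (BurstSize − 1) payload gaps divided by the elapsed time.
func bandwidthFromDispersion(dispersionNs uint64) float64 {
	if dispersionNs == 0 {
		return 0 // no calibration measurement in this heartbeat
	}
	return float64((BurstSize-1)*MaxPayload) / (float64(dispersionNs) / 1e9)
}

func main() {
	// 10 ms of dispersion across the 10-packet burst ≈ 1.23 MB/s bottleneck.
	fmt.Printf("%.2f MB/s\n", bandwidthFromDispersion(10_000_000)/1e6)
}
```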
HP-UDP optionally encrypts all DATA and PARITY payloads using AES-128-GCM with ephemeral X25519 key exchange. Encryption is negotiated during the handshake (§4C Step 1.5) and is all-or-nothing for a session — once the Encrypted flag is set, every DATA and PARITY packet in the session is encrypted. Control packets (HEARTBEAT, TRANSFER_COMPLETE, ACK_CLOSE) are not encrypted; their payloads contain only protocol metadata, not file content.
Both sides generate a fresh X25519 keypair (32-byte public key, 32-byte private key) at the start of each session. Private keys exist only in memory for the duration of the transfer and are securely zeroed on session teardown. This provides perfect forward secrecy: there is no persistent key material that could decrypt recorded traffic after the session ends.
Key exchange is embedded in the existing handshake flow with no additional round trips beyond the 1-RTT SESSION_ACCEPT:
| Flow | Sender Key In | Receiver Key In | Added RTTs |
|---|---|---|---|
| Direct send/recv | SESSION_REQ payload | SESSION_ACCEPT (0x0A) payload | +1 (0-RTT → 1-RTT) |
| Serve daemon push | PUSH_REQ payload | PUSH_ACCEPT payload (extended) | +0 (already 1-RTT) |
| Serve daemon pull | PULL_REQ payload | SESSION_REQ payload (server is sender) | +0 (already 1-RTT) |
For push and pull via the serve daemon, the existing round trip already accommodates the key exchange — no additional latency is introduced. Only the basic send/recv flow gains one round trip.
Both sides independently derive the same symmetric key:
shared_secret = X25519(my_private_key, their_public_key) // 32 bytes
session_key = HKDF-SHA256(
ikm = shared_secret,
salt = SessionID (4 bytes, big-endian),
info = "hp-udp-aes128-v5",
len = 16 // AES-128 key
)
The SessionID salt ensures that even if the same ephemeral keypair were accidentally reused (implementation bug), different sessions would derive different keys. The info string binds the key to the protocol version and cipher suite, preventing cross-protocol key reuse.
C implementations: OpenSSL EVP_KDF with OSSL_KDF_NAME_HKDF, or libsodium crypto_kdf_hkdf_sha256_expand. Go: golang.org/x/crypto/hkdf.
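Production Go code would call golang.org/x/crypto/hkdf as noted above. The standard-library-only sketch below spells out the two HKDF-SHA256 steps (extract, then one expand block) so the derivation is explicit; deriveSessionKey is an illustrative name.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// deriveSessionKey performs HKDF-SHA256 (RFC 5869) extract + expand to produce
// the 16-byte AES-128 session key, with salt = big-endian SessionID and
// info = "hp-udp-aes128-v5" as specified above.
func deriveSessionKey(sharedSecret []byte, sessionID uint32) []byte {
	salt := make([]byte, 4)
	binary.BigEndian.PutUint32(salt, sessionID)

	// Extract: PRK = HMAC-SHA256(salt, IKM).
	ext := hmac.New(sha256.New, salt)
	ext.Write(sharedSecret)
	prk := ext.Sum(nil)

	// Expand: T(1) = HMAC-SHA256(PRK, info || 0x01); 16 bytes fit in one block.
	exp := hmac.New(sha256.New, prk)
	exp.Write([]byte("hp-udp-aes128-v5"))
	exp.Write([]byte{0x01})
	return exp.Sum(nil)[:16]
}

func main() {
	secret := make([]byte, 32) // stands in for the X25519 shared secret
	a := deriveSessionKey(secret, 0x1234)
	b := deriveSessionKey(secret, 0x5678)
	// Same secret, different SessionID salt: distinct session keys.
	fmt.Println(len(a) == 16, string(a) != string(b))
}
```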
Each DATA and PARITY packet is encrypted independently using AES-128-GCM. The packet header (32 bytes) is not encrypted — it is passed as Additional Authenticated Data (AAD) so that the receiver can route, reorder, and identify packets before decryption. The header is authenticated by the GCM tag, preventing tampering.
┌──────────────────┬───────────────────────────┬───────────────┐
│  Header (32 B)   │ Ciphertext (PayloadLen B) │ GCM Tag (16 B)│
│  cleartext, AAD  │    AES-128-GCM output     │   auth tag    │
└──────────────────┴───────────────────────────┴───────────────┘

Total ≤ 1400 bytes. PayloadLen ≤ 1352 (= 1368 − 16 tag).
PayloadLen in the header reflects the plaintext length (which equals the ciphertext length in GCM). The receiver reads PayloadLen + 16 bytes from the payload area to get ciphertext + tag. Encrypted MaxPayload = 1352 bytes. Unencrypted transfers retain MaxPayload = 1368.
AES-GCM requires a unique nonce for every packet encrypted under the same key. Nonce reuse completely breaks GCM's confidentiality and authenticity guarantees. The nonce is constructed deterministically from fields already present in the packet header:
| Bytes | Field | Purpose |
|---|---|---|
| 0–3 | SessionID (4 bytes) | Scopes nonce space to this transfer. Redundant with HKDF salt but provides defense-in-depth against implementation errors. |
| 4 | PacketType (1 byte) | Domain separator: 0x01 (DATA) vs 0x02 (PARITY). Prevents nonce collision between DATA and PARITY packets that may share the same SequenceNum value within a block. |
| 5–11 | Packet Unique ID (7 bytes, big-endian) | DATA: lower 56 bits of SequenceNum (globally unique, strictly incrementing). PARITY: lower 56 bits of (BlockGroup << 8) \| ParityIndex, where ParityIndex is the zero-based SequenceNum field of the PARITY packet (0..m−1). |
The composite (BlockGroup << 8) | ParityIndex is therefore unique across the transfer (up to 256 parity packets per block, which exceeds the maximum of 20). The PacketType byte at offset 4 prevents any DATA nonce from colliding with any PARITY nonce. 56 bits of unique ID supports 2^56 packets per session (~98 exabytes at 1368 bytes/packet), well beyond the 1 TB maximum file size.
The nonce is not transmitted on the wire. Both sides compute it deterministically from the packet header fields. This saves 12 bytes per packet compared to an explicit nonce.
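A sketch of the deterministic nonce layout and its use with Go's standard AEAD API (cipher.NewGCM defaults to exactly the 12-byte nonce this scheme requires). The header is passed as AAD as described above; the key, helper name, and example values are illustrative, and a PARITY packet would pass (BlockGroup << 8) | ParityIndex as the unique ID.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// buildNonce derives the 12-byte per-packet nonce from header fields alone:
// SessionID(4) | PacketType(1) | 56-bit unique ID(7), all big-endian.
func buildNonce(sessionID uint32, packetType byte, uniqueID uint64) [12]byte {
	var n [12]byte
	n[0] = byte(sessionID >> 24)
	n[1] = byte(sessionID >> 16)
	n[2] = byte(sessionID >> 8)
	n[3] = byte(sessionID)
	n[4] = packetType // domain separator: DATA vs PARITY
	for i := 0; i < 7; i++ {
		n[5+i] = byte(uniqueID >> uint(8*(6-i))) // lower 56 bits, big-endian
	}
	return n
}

func main() {
	key := make([]byte, 16) // session key from HKDF; zeros only for illustration
	block, _ := aes.NewCipher(key)
	aead, _ := cipher.NewGCM(block) // 12-byte nonce, 16-byte tag

	header := make([]byte, 32) // cleartext header, authenticated as AAD
	plain := []byte("chunk bytes")
	nonce := buildNonce(0xCAFE, 0x01, 42) // DATA packet, SequenceNum 42

	ct := aead.Seal(nil, nonce[:], plain, header)   // ciphertext || 16-byte tag
	pt, err := aead.Open(nil, nonce[:], ct, header) // receiver recomputes nonce
	fmt.Println(err == nil, bytes.Equal(pt, plain), len(ct) == len(plain)+16)
}
```

Flipping any header byte after sealing makes Open fail, which is the tamper protection the AAD provides.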
Encryption is applied after FEC encoding. The FEC encoder operates on plaintext data shards and produces plaintext parity shards. Each shard (DATA or PARITY) is then encrypted independently before transmission. On the receiver side, each packet is decrypted individually, then the plaintext shards are passed to the FEC decoder for reconstruction if needed.
Sender:   file → chunk → FEC encode (plaintext) → encrypt each shard → transmit
Receiver: receive → decrypt each shard → FEC decode (plaintext) → reassemble → disk
This ordering means FEC reconstruction operates on plaintext, which is correct — the Reed-Solomon math must see the original data bytes, not ciphertext (encrypting before FEC would require decrypting all k shards before reconstruction, which is the same work, but reconstructed ciphertext shards would then need the original plaintext to verify, creating a circular dependency).
When the Encrypted flag (0x04) is set, the following payloads are extended with a 32-byte ephemeral public key:
| Packet Type | Standard Payload | Encrypted Payload |
|---|---|---|
| SESSION_REQ | FileSize(8B) + Hash(8B) + InitialRate(4B) + FileName(null-term) | FileSize(8B) + Hash(8B) + InitialRate(4B) + PubKey(32B) + FileName(null-term) |
| SESSION_ACCEPT | (does not exist in unencrypted mode) | PubKey(32B) |
| PUSH_REQ | FileSize(8B) + FileName(null-term) | FileSize(8B) + PubKey(32B) + FileName(null-term) |
| PUSH_ACCEPT | Port(2B) | Port(2B) + PubKey(32B) |
| PULL_REQ | FileName(null-term) | PubKey(32B) + FileName(null-term) |
The receiver determines whether to parse the public key by checking the Encrypted flag in the packet header. If a receiver does not support encryption and receives a request with the flag set, it responds with SESSION_REJECT (reason code 0x06 = ENCRYPTION_UNSUPPORTED).
| Operation | Throughput (AES-NI) | Impact at 100 MB/s wire speed |
|---|---|---|
| AES-128-GCM encrypt | 4–6 GB/s single-thread | <3% CPU |
| AES-128-GCM decrypt | 4–6 GB/s single-thread | <3% CPU |
| X25519 scalar multiply | ~50 µs per session | Negligible |
| HKDF-SHA256 derivation | ~1 µs per session | Negligible |
| Payload reduction (1368→1352) | 1.2% fewer data bytes per packet | ~1.2% more packets for same file |
Net throughput impact: <5%. AES-NI hardware acceleration is present on every x86 CPU manufactured since ~2010 (Intel Westmere / AMD Bulldozer). C implementations should use OpenSSL's EVP_aes_128_gcm (which auto-detects AES-NI) or a SIMD-accelerated library. Go's crypto/aes + crypto/cipher uses AES-NI on amd64 automatically.
Implementations should allocate the AES-GCM cipher context once per session and reuse it for every packet, supplying only a fresh nonce: EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, nonce) (OpenSSL), or calling the cipher.AEAD seal call with a new nonce (Go). Context reuse eliminates ~25,000 allocations/sec at 35 MB/s throughput.
The receiver will never halt reading from the network socket. Incoming data is immediately mapped into a pre-allocated in-memory ring buffer sized to FileSize. Each packet's SequenceNum dictates its exact memory offset, so out-of-order packets are slotted into their correct positions seamlessly. Disk writes are handled by an io_uring submission queue — the main epoll event loop submits write SQEs for contiguous regions and reaps completions without blocking the network read path. This eliminates the flush thread entirely.

To eliminate latency penalties from round-trip retransmissions, the protocol proactively embeds mathematical redundancy that adapts to observed network conditions.
Data packets are organized into sequential BlockGroups. The default block size is 100 data packets per group. The BlockGroup identifier for a DATA packet is computed as:
BlockGroup = SequenceNum / BlockSize (integer division)
PARITY packets use the same BlockGroup value as the data packets they protect. The SequenceNum field of a PARITY packet is its zero-based index within the block (0, 1, 2, …, m−1), not a global sequence number. A C receiver must distinguish PARITY from DATA packets using the PacketType field (0x02) and interpret SequenceNum accordingly.
The parity packet count per block is dynamically adjusted based on the observed packet loss rate, reported via Heartbeat metrics (§6).
| Observed Loss Rate | Parity Ratio | Parity Packets per 100-Packet Block |
|---|---|---|
| < 0.5% | 2% | 2 |
| 0.5% – 2% | 5% | 5 |
| 2% – 5% | 10% | 10 |
| 5% – 10% | 15% | 15 |
| > 10% | 20% | 20 |
The sender initializes at 5% parity during calibration and adjusts after the first Heartbeat containing loss data. Adjustments are applied on block group boundaries — mid-block changes are not permitted, as this would invalidate the Reed-Solomon coding parameters for that group.
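The loss-to-parity mapping is a direct lookup. The table leaves boundary inclusivity open, so the cutoffs below are one reasonable reading; the function name is illustrative.

```go
package main

import "fmt"

// parityPacketsPerBlock maps observed loss (basis points, as reported in the
// heartbeat LossRate field) to the parity count for a 100-packet block,
// following the adaptive table above. Changes apply only on block boundaries.
func parityPacketsPerBlock(lossBps uint16) int {
	switch {
	case lossBps < 50: // < 0.5%
		return 2
	case lossBps < 200: // 0.5% – 2%
		return 5
	case lossBps < 500: // 2% – 5%
		return 10
	case lossBps < 1000: // 5% – 10%
		return 15
	default: // > 10%
		return 20
	}
}

func main() {
	fmt.Println(parityPacketsPerBlock(30), parityPacketsPerBlock(150), parityPacketsPerBlock(1200))
}
```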
Parity packets are generated using Reed-Solomon erasure coding over GF(2^8) with the irreducible polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). For a block of k data packets with m parity packets, any k of the k+m total packets are sufficient to reconstruct the original data. The encoding uses a Vandermonde-derived matrix whose top k rows form an identity matrix, ensuring data packets pass through unchanged and only parity is computed.
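For reference, multiplication in GF(2^8) under the 0x11d polynomial can be sketched with shift-and-XOR. Real encoders precompute log/antilog tables; this version exists only to make the reduction step concrete.

```go
package main

import "fmt"

// gfMul multiplies two elements of GF(2^8) under the 0x11d reduction
// polynomial (x^8 + x^4 + x^3 + x^2 + 1) via Russian-peasant shift-and-add.
func gfMul(a, b uint8) uint8 {
	var p uint8
	for b > 0 {
		if b&1 == 1 {
			p ^= a // addition in GF(2^8) is XOR
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d // reduce: the x^8 term was just shifted out, fold in 0x11d
		}
		b >>= 1
	}
	return p
}

func main() {
	// 2 × 0x80 overflows once and reduces: 0x100 ^ 0x11d = 0x1d.
	fmt.Printf("0x%02x\n", gfMul(2, 0x80))
}
```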
If the receiver detects missing packets within a completed block group (all expected sequence numbers accounted for or timed out), it attempts FEC reconstruction immediately. Only packets that cannot be recovered via FEC are reported as NACKs in the next Heartbeat.
The final block group of a file transfer will almost certainly contain fewer than 100 data packets. The FEC parameters adapt as follows:
- k_tail equals the remaining packet count after the last full block.
- m_tail is calculated using the current adaptive parity ratio, with a minimum of 2 parity packets regardless of block size (even a 1-packet tail block gets 2 parity packets). This ensures the most loss-vulnerable portion of the transfer — the tail — has adequate redundancy.
- The sender sets the EndOfFile flag (0x01) on the final data packet and the final parity packet of the tail block. This signals to the receiver that no further block groups will follow.

The key formulas for computing totals and boundaries are:
TotalChunks = ceil(FileSize / MaxPayload) // number of DATA packets (MaxPayload = 1368)
k_tail = TotalChunks % BlockSize // 0 means last block is full
(if k_tail == 0: k_tail = BlockSize)
FinalPayloadSize = FileSize % MaxPayload // bytes in last DATA packet
(if FinalPayloadSize == 0: FinalPayloadSize = MaxPayload)
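The formulas above can be checked with a small helper. tailParams is an illustrative name; the constants come from this spec.

```go
package main

import "fmt"

const (
	MaxPayload = 1368 // unencrypted DATA payload bytes
	BlockSize  = 100  // data packets per full FEC block
)

// tailParams computes chunk totals and tail-block geometry from FileSize,
// mirroring the spec formulas (k_tail and FinalPayloadSize fold the zero
// cases back to a full block / full payload).
func tailParams(fileSize uint64) (totalChunks, kTail, finalPayload uint64) {
	totalChunks = (fileSize + MaxPayload - 1) / MaxPayload // ceil division
	kTail = totalChunks % BlockSize
	if kTail == 0 {
		kTail = BlockSize // last block is exactly full
	}
	finalPayload = fileSize % MaxPayload
	if finalPayload == 0 {
		finalPayload = MaxPayload
	}
	return
}

func main() {
	// A 1 MiB file: ceil(1,048,576 / 1368) = 767 chunks; the tail block has
	// 67 data packets and the last packet carries 688 bytes.
	fmt.Println(tailParams(1 << 20))
}
```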
The EndOfFile flag is only meaningful on the wire. When the final DATA packet is recovered via FEC reconstruction rather than received directly, the flag is not propagated into the reconstructed shard. Receivers must therefore detect end-of-transfer by checking SequenceNum == TotalChunks − 1 (known from the SESSION_REQ FileSize field) rather than relying solely on the Flags field.
HP-UDP separates congestion control (network path capacity) from flow control (receiver processing capacity). The congestion controller is loss-driven: the primary signal is the observed packet loss rate, not the ratio of delivery rate to send rate. The delivery rate acts as a ceiling, not a decision driver.
Prior versions gated rate increases on EffectiveRate ≥ 0.95 × SendRate. In practice, this threshold was unreachable: delivery rate is bounded by send rate (the receiver cannot report receiving more than was sent), and timing jitter in heartbeat measurement windows made the ratio consistently fall below 0.95 even on a clean link. This caused the sender to spiral to the rate floor. Loss rate is the correct primary signal because it reflects whether the network has headroom independently of the send rate.
The receiver periodically sends a HEARTBEAT (Type 0x03) packet to the sender. The heartbeat interval is rate-proportional based on the last measured NetworkDeliveryRate:
| Last Measured Network Delivery Rate | Heartbeat Interval |
|---|---|
| < 10 MB/s | 100ms |
| 10 – 100 MB/s | 50ms |
| 100 MB/s – 1 GB/s | 25ms |
| > 1 GB/s | 10ms |
The receiver re-selects the interval using the NetworkDeliveryRate computed in each heartbeat, taking that value directly for interval selection.
The Heartbeat payload contains dual metrics, RTT echo, and a NACK array. All multi-byte fields below are big-endian (network byte order) — C implementations must serialize/deserialize with htonl/ntohl and htonll/ntohll:
| Field | Size | Description |
|---|---|---|
| NetworkDeliveryRate | 4 Bytes | Bytes per second successfully received from the socket into the ring buffer during the last heartbeat interval. Reflects network path capacity. |
| StorageFlushRate | 4 Bytes | Bytes per second flushed from the ring buffer to disk during the last heartbeat interval. Reflects receiver I/O capacity. |
| LossRate | 2 Bytes | Packet loss percentage for the current reporting window, encoded as basis points (e.g., 150 = 1.50%). Primary signal for congestion control. |
| EchoTimestampNs | 8 Bytes | The verbatim value of SenderTimestampNs from the most recently received DATA or PARITY packet header. The sender computes RTT = now_ns − EchoTimestampNs using only its own monotonic clock, eliminating cross-machine clock-skew error entirely. If no DATA/PARITY packets were received in the interval, the receiver echoes the previous value unchanged (the sender's frozen-timestamp guard ignores stale repeats — see §6B). |
| DispersionNs | 8 Bytes | Calibration burst dispersion measurement: the time (nanoseconds) between the first and last received calibration-flagged packet. Zero outside of calibration. The sender uses this to compute bottleneck bandwidth: BW = (BurstSize − 1) × MaxPayload / (DispersionNs / 10^9), where MaxPayload = 1368. See §4C. |
| HighestContiguous | 8 Bytes | The highest SequenceNum N such that all packets 0..N have been received or FEC-recovered. Allows the sender to track receiver progress. |
| NACKCount | 2 Bytes | Number of unrecoverable sequence numbers in the NACK array that follows. |
| NACKArray | 8 Bytes × N | Array of 64-bit SequenceNum values that were not recoverable via FEC and require retransmission. Bounded to sequences between HighestContiguous+1 and the highest received sequence number (never NACKs packets the sender hasn't transmitted yet). The array is physically limited to fit within one packet: MaxNACKs = floor((MaxPayload − HeartbeatFixedSize) / 8) = floor((1368 − 36) / 8) = 166. Some rate-limiting contexts in this spec round this to 167 for simplicity; implementations must not exceed the true computed limit of 166. If more sequences are pending, the receiver sends the highest-priority subset and the remainder appear in a subsequent heartbeat. |
Each DATA and PARITY packet carries a sender timestamp in the fixed SenderTimestampNs header field (offset 0x18, 8 bytes, unix nanoseconds set at packet-build time). Non-data control packets leave this field zero.
Problem with prior versions: v3.x had the receiver set EchoTimestampNs = receiver's time.Now(). The sender then computed RTT = sender_now − receiver_now, a cross-machine clock comparison. A 4-second clock skew between machines produced a 5-second RTT estimate, permanently locking the NACK cooldown above the receiver's 5-second inactivity timeout and killing the transfer.
Fix: The receiver now echoes the sender's own timestamp verbatim: EchoTimestampNs = pkt.Header.SenderTimestampNs. The sender computes RTT = now_ns − EchoTimestampNs using only its own monotonic clock. Cross-clock error is eliminated entirely.
Problem: When the sender is idle (honouring a NACK cooldown), the receiver keeps echoing the same stale SenderTimestampNs from the last packet it received. Each heartbeat makes RTT = now − staleTs grow by one heartbeat interval. After enough heartbeats the RTT inflates past 5 seconds, the NACK cooldown exceeds the receiver's inactivity timeout, and the transfer dies.
Fix: The TokenBucket tracks lastEchoNs — the highest EchoTimestampNs it has processed. RTT is updated only when echoNs > lastEchoNs. Stale repeated echoes are silently ignored; the RTT estimate stays locked at its last valid measurement until the sender transmits a new packet and the receiver reflects a fresh timestamp.
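The frozen-timestamp guard reduces to a few lines of state. The type and method names below are illustrative, not the implementation's.

```go
package main

import (
	"fmt"
	"time"
)

// rttEstimator applies the frozen-timestamp guard from §6B: an echo updates
// RTT only if it is newer than the last one processed, so stale repeats from
// an idle sender cannot inflate the estimate.
type rttEstimator struct {
	lastEchoNs int64 // highest EchoTimestampNs processed so far
	rttNs      int64 // last valid RTT measurement
}

// onHeartbeat returns true if the echo was fresh and RTT was updated.
func (r *rttEstimator) onHeartbeat(echoNs, nowNs int64) bool {
	if echoNs <= r.lastEchoNs {
		return false // stale repeat: hold the last valid RTT
	}
	r.lastEchoNs = echoNs
	r.rttNs = nowNs - echoNs // sender's own clock on both ends: no skew
	return true
}

func main() {
	var r rttEstimator
	base := time.Now().UnixNano()
	r.onHeartbeat(base, base+40_000_000)            // fresh echo → RTT 40ms
	updated := r.onHeartbeat(base, base+90_000_000) // same echo again → ignored
	fmt.Println(updated, time.Duration(r.rttNs))
}
```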
The sender adjusts its rate based on the LossRate reported in each Heartbeat. The effective delivery rate (rawEffective = NetworkDeliveryRate) serves as a ceiling for decreases. StorageFlushRate is still reported in heartbeats for observability but is no longer an input to the rate controller: the receiver uses a pre-allocated full-file ring buffer, so disk lag cannot cause packet loss. Including StorageFlushRate in the minimum caused out-of-order packets to stall the contiguous flush frontier to 0, making StorageFlushRate ≈ 0 and falsely triggering the delivery-collapse guard on every heartbeat. Rate increases are gated to once per RTT to prevent the sender from making dozens of upward adjustments before any feedback arrives (critical on high-latency links like satellite).
The sender tracks the time of its last rate increase. When a heartbeat signals an increase, the sender checks whether at least one RTT has elapsed since the previous increase. If not, the increase is suppressed (treated as a hold). Decreases are not gated — they are applied immediately upon consecutive loss signals to clear the pipe as fast as possible.
The congestion controller operates in two phases, similar to TCP's slow start and congestion avoidance but adapted for loss-driven UDP with FEC:
Phase 1 — Probe (Multiplicative Increase): While loss is < 1%, the sender has never observed the link ceiling. It probes aggressively with multiplicative increase, applied once per RTT:
S_new = S_current × 1.25
This is more conservative than the v3.0 1.5× multiplier but is applied per RTT rather than per heartbeat, giving the network time to signal back before each step.
Phase 2 — Congestion Avoidance (Additive Increase): Once the sender observes loss entering the 1%–5% hold zone for the first time, it has found the approximate ceiling of the link. The controller permanently transitions to Phase 2 and never returns to Phase 1 for this session. Probing uses additive increase, applied once per RTT:
S_new = S_current + (MaxPayload / RTT)
This adds approximately one packet per RTT of additional bandwidth, gently probing for headroom without risking a burst of loss.
Let L = reported LossRate in basis points, E = effective delivery rate (NetworkDeliveryRate — see §6C rationale above), S = current send rate. On each Heartbeat reception, the delivery-collapse guard is checked first:
| Condition | Action | Rationale |
|---|---|---|
| NACKCount > 0 AND E < S × 0.25 | Hold + permanently transition to Phase 2. Evaluated before loss thresholds. | OS socket buffer overflow. Packets are dropped before reaching the receiver application — reported LossRate stays 0% (no FEC failures counted) while delivery collapses. NACKs confirm real packet loss. Entering Phase 2 fires the 1.5× ceiling immediately, cutting the target rate to near actual link capacity. Threshold lowered from 50% to 25%: on high-latency paths (50ms+ RTT) approximately 50% of packets are legitimately in-flight during warm-up, causing the old 50% threshold to fire prematurely on measurement lag. |
| L < 100 (loss < 1%) | Increase (once per RTT): Phase 1: S × 1.25. Phase 2: S + MaxPayload/RTT. | Link has headroom. FEC absorbs transient loss. RTT gating prevents runaway probing on high-latency paths. |
| 100 ≤ L ≤ 500 (1% – 5%) | Hold: S = S. If first time entering this zone, transition to Phase 2 permanently. | FEC is handling the loss. The link ceiling has been discovered. Switch to additive probing from this point forward. |
| L > 500 (loss > 5%), consecutive confirmation | Decrease: S = smoothed(E) × 0.85 | Drop to 85% of the EWMA-smoothed effective delivery rate. The 15% undershoot allows router queues to drain; FEC bridges the gap during recovery. Requires two consecutive above-threshold heartbeats to trigger (see below). |
E × 1.05 set the new rate above the rate that just caused severe loss, sustaining congestion. Dropping to E × 0.85 gives queues time to drain. (2) The 1.5× multiplicative increase per heartbeat was replaced with 1.25× per RTT — on a Starlink link with 100ms heartbeats and 40ms RTT, the old algorithm made 2.5 increases per RTT, compounding to ~1.95× per RTT. (3) The permanent transition to additive increase after discovering the link ceiling prevents repeated boom-bust oscillation at the capacity boundary. (4) StorageFlushRate removed from effective-rate formula (v4.0): pre-allocated ring buffers always stall flush at 0 for out-of-order arrivals, making min(NetworkDeliveryRate, StorageFlushRate) ≈ 0 and permanently tripping the delivery-collapse guard. (5) Delivery-collapse threshold lowered 50%→25% (v4.0): legitimate in-flight packets on 50ms+ RTT paths account for ~50% of the window, causing false collapses during ramp-up with the old threshold.
Raw delivery rate measurements from individual heartbeats are noisy due to timing jitter, especially on high-latency links. The sender maintains an exponentially weighted moving average (EWMA) of the effective delivery rate:
smoothed = α × raw_sample + (1 − α) × smoothed_previous
The default smoothing factor is α = 0.3, which provides moderate dampening (converges in ~3 samples). The smoothed rate is used as the target when decreasing, preventing single-heartbeat noise from crashing the rate.
A single heartbeat reporting > 5% loss may be a transient spike (e.g., a router briefly queuing). The sender requires two consecutive above-threshold heartbeats before executing a decrease. The first signal starts a "decrease streak" counter; the second confirms it. Any increase or hold resets the streak to zero.
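The EWMA smoothing and the two-consecutive-confirmation rule can be sketched together. This is a minimal illustration of the scheme described above, not the implementation's code; names like `ccState` and `onHeartbeat` are invented for the example.

```go
package main

import "fmt"

const (
	alpha          = 0.3  // EWMA smoothing factor
	decreaseFactor = 0.85 // drop to 85% of smoothed delivery rate
	lossDecreaseBP = 500  // decrease threshold: >5% loss in basis points
)

type ccState struct {
	smoothed       float64 // EWMA of effective delivery rate (bytes/sec)
	sendRate       float64 // current target send rate (bytes/sec)
	decreaseStreak int     // consecutive above-threshold heartbeats
}

// onHeartbeat folds one raw delivery-rate sample into the EWMA and only
// executes a decrease on the second consecutive above-threshold signal.
func (cc *ccState) onHeartbeat(rawRate float64, lossBP int) {
	if cc.smoothed == 0 {
		cc.smoothed = rawRate // seed on first sample
	} else {
		cc.smoothed = alpha*rawRate + (1-alpha)*cc.smoothed
	}
	if lossBP > lossDecreaseBP {
		cc.decreaseStreak++
		if cc.decreaseStreak >= 2 { // second consecutive signal confirms
			cc.sendRate = cc.smoothed * decreaseFactor
			cc.decreaseStreak = 0
		}
	} else {
		cc.decreaseStreak = 0 // any increase or hold resets the streak
	}
}

func main() {
	cc := &ccState{sendRate: 50e6}
	cc.onHeartbeat(40e6, 600) // first >5% signal: streak starts, rate held
	fmt.Println(cc.sendRate == 50e6) // true
	cc.onHeartbeat(40e6, 600) // second consecutive signal: decrease fires
	fmt.Println(cc.sendRate < 50e6) // true
}
```

A transient single-heartbeat spike leaves the rate untouched; only a sustained signal cuts it, and the cut targets the smoothed rate rather than the noisy raw sample.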
The ceiling is two-tiered based on phase:
Phase 1: if rate > peakDeliveryRate × 4.0: rate = peakDeliveryRate × 4.0
Phase 2: if rate > peakDeliveryRate × 1.5: rate = peakDeliveryRate × 1.5
Phase 1 (4× — runaway prevention): During the multiplicative probe, delivery-rate measurements lag the send rate because the sender increases 25% per heartbeat and the receiver's measurement window has not stabilised. For example, at a 7.63 MB/s send rate the receiver may report only 4.19 MB/s delivery. A tight multiplier like 1.5× would fire immediately, giving a 6.28 MB/s ceiling below the current rate and locking the sender at ~5.68 MB/s for the entire transfer on a 110 MB/s Gigabit link. The generous 4× multiplier prevents this while still bounding the exponential: on a clean link where FEC absorbs all drops (LossRate remains 0% throughout), Phase 2 is never entered and without any Phase 1 ceiling the target rate grows without bound (observed: 345 trillion MB/s). With 4×, the target is capped at ~400 MB/s on a Gigabit LAN — effectively the same as nodelay (pacing is disabled at that rate anyway), but without the absurd log output.
Phase 2 (1.5× — avoidance bound): Once Phase 2 is entered, the delivery rate was measured near actual link capacity — the loss event that triggered the transition occurred at or near the ceiling — so 1.5× is a tight and reliable upper bound for the additive probing that follows.
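The two-tier ceiling reduces to a small pure function. A sketch, assuming the phase flag and peak-delivery figure are tracked elsewhere; the 17.5 MB/s peak in the second example is an illustrative value consistent with the 71 MB/s → ~26 MB/s cut described later in Lesson J.

```go
package main

import "fmt"

// applyCeiling caps the target rate at 4× peak delivery during Phase 1
// (runaway prevention) and at 1.5× once Phase 2 has been entered
// (tight avoidance bound).
func applyCeiling(rate, peakDelivery float64, phase2 bool) float64 {
	mult := 4.0
	if phase2 {
		mult = 1.5
	}
	if ceiling := peakDelivery * mult; rate > ceiling {
		return ceiling
	}
	return rate
}

func main() {
	// Phase 1 on a Gigabit LAN: the exponential probe is bounded at ~400 MB/s
	// instead of growing without limit.
	fmt.Println(applyCeiling(345e12, 99e6, false)) // 3.96e+08
	// Phase 2 after overshooting a ~20 MB/s WAN link: 71 MB/s is cut to ~26 MB/s.
	fmt.Println(applyCeiling(71e6, 17.5e6, true)) // 2.625e+07
}
```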
An earlier revision used a fixed warmup constant to exclude cold-start measurements when seeding peakRate. This approach was superseded by Phase 2 gating, which is more principled: the ceiling is irrelevant during Phase 1 (the sender is still discovering the link ceiling) and correct during Phase 2 (the delivery rate was measured near capacity). The warmup constant no longer exists in the implementation.
The sender must convert the target rate (bytes/sec) into inter-packet timing. The naive approach — computing a per-packet interval and sleeping for that duration — fails in practice because OS timer granularity (~1ms on Windows, ~100µs on Linux) and language-runtime preemption (e.g., Go's asynchronous goroutine preemption, or signal-based preemption in C with certain threading models) make sub-millisecond sleeps unreliable.
Instead, the sender uses a deficit accumulator:
- A tokens balance (in bytes) accrues credit at the target rate over elapsed wall-clock time. Elapsed time must use a monotonic clock source (C: clock_gettime(CLOCK_MONOTONIC); Go: time.Now(), which uses the monotonic component internally).
- Each packet sent debits tokens by the packet size.
- The balance is capped at max_tokens = rate_bytes_per_sec × 0.002 (2ms of credit). This prevents idle periods from banking enough credit to burst a large backlog of packets instantly.
- When the accumulated deficit justifies a sleep of at least 1ms, the sender sleeps. Go: time.Sleep(). C: clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &target_ts, NULL) — the TIMER_ABSTIME flag prevents drift from accumulating across consecutive sleeps.

This produces the correct long-term average rate without relying on sub-millisecond timer precision.
clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME) provides ~50–100µs precision on default kernels, significantly better than Go's ~1ms floor. For sub-100µs precision (useful at 10 GbE+ rates), run the sender thread under SCHED_FIFO real-time scheduling via sched_setscheduler(). The deficit accumulator is still the correct architecture — even with precise timers, per-packet nanosleep at 67,000 pps would burn CPU on syscall overhead. The accumulator batches the timing debt and issues one sleep per deficit threshold crossing, regardless of timer precision.
An earlier prototype used runtime.Gosched()-based busy-wait spin loops for sub-ms pacing. Go 1.14+ introduces asynchronous goroutine preemption that signals goroutines at safe points — even inside tight loops — causing the spin to overshoot to ~1ms per packet. At 1400 bytes/1ms = 1.4 MB/s, this created an artificial throughput ceiling regardless of the target rate. The deficit accumulator was developed to solve this without platform-specific timer hacks.
When the sender receives a NACK array in a Heartbeat, it queues the identified packets for retransmission. Retransmitted packets carry their original SequenceNum and BlockGroup.
Retransmissions are interleaved with forward progress: the sender processes at most 3 NACKed packets per send-loop iteration, then sends the next new data packet. This prevents NACK storms (e.g., 169 NACKs on a satellite link) from monopolizing bandwidth and stalling seqNum advancement.
The pending NACK set must be maintained as a deduplicated set (hash set or bitset), not a FIFO queue. Each heartbeat may report the same sequence numbers as the previous one (the receiver keeps NACKing until the packet arrives). If the same sequence is appended to a plain list on every heartbeat, the retransmit queue grows without bound. A set ensures each sequence is queued at most once regardless of how many heartbeats repeat it. If a NACK arrives for a sequence that has already been pruned from the sender's chunk cache (because HighestContiguous advanced past it via FEC recovery), the sender silently skips it — the receiver has already recovered the packet and the NACK is stale.
Without the cap, a heavy-loss link could spend the entire send loop on retransmissions, and seqNum never advanced. The 3-per-iteration cap ensures the transfer always makes forward progress even under heavy loss.
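The deduplicated pending set and the per-iteration cap can be sketched together. A minimal illustration of the structure described above; `retransmitQueue` and its methods are invented names, and the real implementation also checks the chunk cache for pruned sequences before resending.

```go
package main

import "fmt"

const maxNACKsPerIter = 3 // retransmit cap per send-loop iteration

// retransmitQueue holds pending NACKed sequences as a set, so the same
// sequence repeated across heartbeats occupies exactly one slot.
type retransmitQueue struct {
	pending map[uint64]struct{}
}

func newRetransmitQueue() *retransmitQueue {
	return &retransmitQueue{pending: make(map[uint64]struct{})}
}

// addNACKs merges one heartbeat's NACK list into the set; repeats from
// earlier heartbeats collapse instead of growing a queue without bound.
func (q *retransmitQueue) addNACKs(seqs []uint64) {
	for _, s := range seqs {
		q.pending[s] = struct{}{}
	}
}

// take removes and returns up to maxNACKsPerIter pending sequences for
// retransmission before the send loop emits its next new data packet.
func (q *retransmitQueue) take() []uint64 {
	out := make([]uint64, 0, maxNACKsPerIter)
	for s := range q.pending {
		out = append(out, s)
		delete(q.pending, s)
		if len(out) == maxNACKsPerIter {
			break
		}
	}
	return out
}

func main() {
	q := newRetransmitQueue()
	q.addNACKs([]uint64{7, 9, 12})
	q.addNACKs([]uint64{7, 9, 12, 15}) // next heartbeat repeats the NACKs
	fmt.Println(len(q.pending))        // 4, not 7: duplicates collapsed
	fmt.Println(len(q.take()))         // 3: per-iteration cap
	fmt.Println(len(q.pending))        // 1 still pending
}
```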
If the receiver does not receive any packets (data, parity, or retransmission) for a period of 5 consecutive expected heartbeat intervals, with a minimum floor of 5 seconds, it declares the session dead:
- The partially written file is deleted (or kept with a .partial suffix if configured for resume support in a future version).
- The last received SequenceNum and HighestContiguous are logged for diagnostics.

If the sender does not receive a Heartbeat for 5 consecutive expected heartbeat intervals, it assumes the receiver has failed or the return path is broken, and it aborts the session.
To prevent "tail drop" issues where the final packets are lost and the connection deadlocks, the protocol implements a synchronized teardown with timeout-driven cleanup on both sides.
1. The sender sets the EndOfFile flag (0x01) on the final data packet and the final parity packet of the tail block. It stops reading new data but keeps the socket open, listening for Heartbeats. Socket ownership must be exclusive before entering teardown: Go implementations must stop the heartbeat listener goroutine to prevent it from consuming packets the teardown loop needs (see Lesson F). C implementations using a single-threaded epoll event loop have exclusive socket access by construction — no action is needed.
2. The receiver computes a running xxHash64 as contiguous blocks are flushed from the ring buffer to disk. Upon receiving the EOF-flagged packets and verifying the completed hash against the SESSION_REQ metadata, it sends a TRANSFER_COMPLETE (Type 0x05) packet.
3. Upon receiving TRANSFER_COMPLETE, the sender responds with ACK_CLOSE (Type 0x06), frees heavy memory allocations (including the sliding window ring buffer), and enters a 3-second Linger state. Duplicate TRANSFER_COMPLETE packets arriving in this window are answered with a repeat ACK_CLOSE.
4. After sending TRANSFER_COMPLETE, the receiver enters its own 3-second Linger state. If ACK_CLOSE is not received within this window, the receiver retransmits TRANSFER_COMPLETE up to 3 times at 1-second intervals. If no ACK_CLOSE is received after all retries, the receiver considers the transfer successful (the file hash was verified) and performs a unilateral teardown.
5. During the teardown wait (after all data is sent, before TRANSFER_COMPLETE arrives), the sender continues to process Heartbeat packets synchronously. If a Heartbeat contains NACKs, the sender retransmits the requested packets from the sliding window ring buffer and resets the read deadline. This ensures the receiver can complete even if some late packets were lost.
Teardown retransmits are paced through the token bucket at the congestion controller's current rate. Without pacing, a backlog of queued Heartbeats (e.g., after a brief receive gap) can be drained all at once, causing the sender to fire hundreds of retransmit packets in tens of milliseconds — a burst that overwhelms the same congested link that caused the NACKs in the first place. See Lesson J for the observed failure case.
Problem: The teardown retransmit loop had no cooldown. On a 50ms-RTT path with a 50ms heartbeat interval, every in-flight heartbeat triggered a redundant retransmit of the same lost packets. The retransmit flood caused fresh congestion, which caused more NACKs, creating a self-reinforcing spiral. Observed: 59,908 reported NACKs for approximately 780 actual losses.
Fix: A nackCooldown map[uint64]time.Time gates each sequence number to at most one retransmit per RTT × 1.25 (RTT plus 25% margin). The map is seeded with all NACKs outstanding at the moment the main send loop ends. On each teardown heartbeat, a sequence is only retransmitted if its cooldown timestamp has elapsed; otherwise it is silently skipped until the next eligible window.
Problem: The receiver's NACK scan window is bounded by HighestReceived. If the last packets of the file are dropped, HighestReceived never advances to the end of the file, so the receiver's NACK list is empty. The sender sees 0 NACKs, sends nothing, and the receiver hits its 5-second inactivity timeout — a deadlock neither side can break without external intervention.
Fix: In the teardown loop, if a heartbeat arrives with NACKCount == 0 but hb.HighestContiguous < totalChunks−1, the sender proactively computes up to 167 missing tail sequences (from HighestContiguous+1 through totalChunks−1) and injects them into the retransmit pipeline. These injected sequences flow through the NACK cooldown gate exactly like receiver-reported NACKs, preventing the same tail sequences from being re-injected on every heartbeat.
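Computing the injected tail sequences is a bounded range walk. A sketch under the assumptions above (`missingTail` is an invented name; the real path additionally routes the result through the NACK cooldown gate):

```go
package main

import "fmt"

const tailInjectLimit = 167 // max sequences injected per heartbeat

// missingTail returns up to tailInjectLimit sequence numbers in
// (highestContiguous, totalChunks), i.e. the tail the receiver can no
// longer NACK because HighestReceived never advanced past the drop.
func missingTail(highestContiguous, totalChunks uint64) []uint64 {
	var out []uint64
	for seq := highestContiguous + 1; seq < totalChunks && len(out) < tailInjectLimit; seq++ {
		out = append(out, seq)
	}
	return out
}

func main() {
	// Receiver stuck 5 chunks short of a 1,000-chunk file.
	fmt.Println(missingTail(994, 1000)) // [995 996 997 998 999]
	// A large gap is still capped at the per-heartbeat injection limit.
	fmt.Println(len(missingTail(0, 1000))) // 167
}
```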
Problem: At high transfer speeds (100+ MB/s) the TokenBucket's 2ms burst allowance is approximately 200 KB. A full 167-packet retransmit batch is approximately 232 KB at MaxPayload = 1368 bytes — it fires as a near-simultaneous burst, flooding the OS UDP socket buffer and the serve daemon's 256-slot receive channel, causing most retransmits to be silently dropped.
Fix: Teardown retransmits are chunked into batches of 10 packets with a 2ms sleep between batches. 167 packets are spread over approximately 34ms — invisible to the user, well within any heartbeat interval, and guaranteed to fit through any buffer in the path.
If the receiver completes all data reception but the xxHash64 does not match the expected value from SESSION_REQ:
- The receiver sends SESSION_REJECT (Type 0x04) with a reason code indicating hash failure.

Once the main send loop completes and the sender enters the teardown loop (§8B), the progress bar changes from the normal 100% | 40.0 MB/s | NACKs: N display to 100% | Repairing... | NACKs: N. This tells the user the network is actively recovering dropped tail packets rather than hanging. The repair state persists until TRANSFER_COMPLETE is received or the teardown timeout expires.
Planned Phase 2 work:

- Per-packet conn.Write() syscall overhead limits Go to ~30–41 MB/s on LAN versus 93 MB/s for FTP (which uses kernel-level TCP segmentation with a single large write()).
- Use sendmmsg() (Linux) or GSO (Generic Segmentation Offload) to submit multiple UDP packets per kernel transition. At 1376 bytes per packet and a 93 MB/s target, the sender must dispatch ~67,000 packets/sec. Per-packet sendto() incurs ~15µs of context-switch overhead each, consuming ~1 second of CPU per second of transfer. sendmmsg() with batches of 16–64 packets amortizes this to ~1,000–4,000 syscalls/sec. On the receiver side, recvmmsg() provides the same benefit.
- Use an mmap()-backed ring buffer with PACKET_RX_RING/AF_XDP to avoid the kernel-to-userspace copy on receive.

For comparison, FTP's single large write() pushes megabytes at once — the kernel handles segmentation into ~1500-byte frames internally. HP-UDP's per-packet conn.Write() makes ~67,000 syscalls/sec at full speed, each costing a user-kernel context switch. This is not a fundamental protocol limitation — it's a syscall-overhead problem that sendmmsg() batching in Phase 2 will eliminate.
The following empirical findings emerged during Phase 1 implementation and testing. They are documented here to guide the Phase 2 C port and future protocol revisions.
The v2.0 algorithm increased the rate only when EffectiveRate ≥ 0.95 × SendRate. Since delivery rate is bounded by send rate and measurement windows never align perfectly, this ratio consistently falls below 0.95 even on a lossless link. The sender spirals to the rate floor. Loss rate is the correct primary signal.
Both time.Sleep(<1ms) and busy-wait spin loops fail for sub-millisecond pacing on Windows (minimum ~1ms granularity) and on any platform using Go 1.14+ (asynchronous goroutine preemption interrupts tight loops at ~1ms intervals). The deficit accumulator sidesteps this entirely by sleeping only when the accumulated deficit justifies a ≥1ms sleep. The C port should use clock_nanosleep() or similar, but should still avoid relying on sub-millisecond precision for correctness.
A 50 MB/s starting rate on a Starlink connection (~10 Mbps effective uplink) caused massive packet loss during the first 100ms, which poisoned the peakRate measurement and locked the auto-ceiling at ~0.38 MB/s for the entire transfer. The starting rate must be conservative enough for the worst expected link (2 MB/s default), while the calibration burst itself runs at wire speed to discover the actual capacity.
On a satellite link with ~30ms RTT and 5% loss, each heartbeat reported ~169 NACKs. Processing all NACKs before each new data packet caused the send loop to spend its entire bandwidth on retransmissions, preventing seqNum from advancing. Capping retransmissions at 3 per iteration restored forward progress.
The first few heartbeats arrive during or immediately after the calibration burst, when the receiver is still allocating buffers and the network path hasn't stabilized. More broadly, during Phase 1 ramp-up the receiver's measurement window hasn't caught up to the sender's current rate — at a 7.63 MB/s send rate the receiver may only report 4.19 MB/s delivery because the sender had only been at that rate for one 100ms heartbeat interval. Any ceiling derived from these measurements will be artificially low. The correct solution is to gate the auto-ceiling on Phase 2 entry rather than a fixed warmup period: by the time Phase 2 is entered, the sender has been near the link ceiling long enough for delivery measurements to be meaningful.
The sender's heartbeat listener goroutine and the teardown synchronous read loop compete for the same socket. If the goroutine is still running when the sender enters teardown, it consumes packets (including TRANSFER_COMPLETE) that the teardown loop needs. The goroutine must be stopped before entering teardown, and any queued NACKs must be drained synchronously.
FTP achieves 93 MB/s on the same Gigabit LAN where HP-UDP reaches ~30–41 MB/s. The difference is not the protocol or the language — it's the syscall pattern. FTP writes large buffers to a TCP socket; the kernel segments them into packets internally. HP-UDP calls conn.Write() for every 1376-byte packet, requiring ~67,000 user-kernel context switches per second at full speed. Each syscall costs ~15µs of overhead, consuming nearly 100% of available CPU time at target throughput. The Phase 2 C port must use sendmmsg()/recvmmsg() or equivalent batching to amortize this cost across 16–64 packets per syscall.
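The arithmetic behind this lesson is worth making explicit. The figures below (93 MB/s target, 1376-byte packets, ~15µs per syscall, batches of 64) are the ones quoted in the text; the program just runs the numbers.

```go
package main

import "fmt"

func main() {
	const (
		targetBps   = 93e6   // FTP-level throughput target, bytes/sec
		packetBytes = 1376.0 // unencrypted packet size on the wire
		syscallSec  = 15e-6  // ~15µs context-switch cost per syscall
		batch       = 64.0   // packets per sendmmsg() batch
	)
	pps := targetBps / packetBytes
	fmt.Printf("packets/sec:              %.0f\n", pps)            // ~67587
	fmt.Printf("CPU sec per transfer sec: %.2f\n", pps*syscallSec) // ~1.01
	fmt.Printf("syscalls/sec at batch=64: %.0f\n", pps/batch)      // ~1056
}
```

One syscall per packet consumes essentially an entire CPU-second per second of transfer, which is why the prototype plateaus well below link capacity; batching divides the syscall count by the batch size without changing the packet rate.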
Three allocation patterns in the Go prototype created significant memory and CPU overhead that is not visible in the protocol design but directly impacted measured throughput. All three were fixed in Phase 1 and their equivalents must be avoided in the Phase 2 C port.
NewRSEncoder(k, m) builds a k×k Vandermonde matrix and inverts it using O(k³) GF(2⁸) operations. For the default block size of k=100 this is ~2 million GF operations, measured at ~4 ms per call. With ~1,720 FEC blocks in a 237 MB transfer, constructing a fresh encoder for each block costs ~6.9 seconds of CPU time — roughly equal to the entire transfer duration at 35 MB/s. The fix is to cache the encoder keyed on (dataShards, parityShards) and reuse it across blocks. The matrix is deterministic for a given (k, m) pair and encoding only reads from it (no mutation), so the cached instance is safe for concurrent use. The Phase 2 C port must pre-build encoder matrices at session start and reuse them.
Each FEC shard was allocated as a fresh MaxPayload-sized buffer padded to equal length before RS encoding. At the default block size of 100 and a 35 MB/s transfer rate, this produces ~25,000 allocations per second, generating ~470 MB of heap churn per 237 MB transfer and placing the garbage collector on the critical path. The fix is a sync.Pool of pre-allocated MaxPayload-sized buffers checked out at encode time and returned immediately after the parity computation completes. The Phase 2 C port should maintain a fixed pool of shard-sized stack or heap buffers reused across blocks.
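A minimal sketch of the pool pattern, with the parity computation itself elided; `encodeBlock` is an invented name and the real code hands the shards to the RS encoder between checkout and return.

```go
package main

import (
	"fmt"
	"sync"
)

const maxPayload = 1368 // fixed shard size (unencrypted MaxPayload)

// shardPool hands out MaxPayload-sized scratch buffers and takes them
// back after encoding, replacing ~25,000 heap allocations/sec with reuse.
var shardPool = sync.Pool{
	New: func() any { return make([]byte, maxPayload) },
}

func encodeBlock(chunks [][]byte) {
	shards := make([][]byte, len(chunks))
	for i, c := range chunks {
		buf := shardPool.Get().([]byte)
		n := copy(buf, c)
		for j := n; j < maxPayload; j++ {
			buf[j] = 0 // zero-pad short tail chunks to equal shard length
		}
		shards[i] = buf
	}
	// ... parity computation over shards would run here ...
	for _, s := range shards {
		shardPool.Put(s) // returned immediately after parity completes
	}
}

func main() {
	encodeBlock([][]byte{{1, 2, 3}, make([]byte, maxPayload)})
	buf := shardPool.Get().([]byte)
	fmt.Println(len(buf)) // 1368: pooled buffers are always full-size
}
```

Explicit zero-padding on checkout matters because a pooled buffer still holds the previous block's bytes.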
The sender originally retained every sent chunk in an unbounded map[uint64][]byte to service NACK retransmissions. Without eviction, this map retained all ~172,000 chunks for a 237 MB transfer, consuming ~470 MB of heap memory for the entire session duration. Fixed in Phase 1: replaced with a bounded SlidingWindow ring buffer (50,000 slots, ~68 MB peak). Entries are evicted when HighestContiguous advances (received from each heartbeat), since the receiver has already confirmed contiguous receipt up to that point and will never NACK those sequences. When the window is full (all 50,000 un-acknowledged slots occupied), the sender pauses sending new packets until the receiver's HighestContiguous advances and frees slots — providing memory-bounded backpressure. The Phase 2 C port should use the same ring-buffer pattern with a power-of-2 slot count (65,536 recommended, ~89 MB peak) so that index wrapping uses a bitmask (idx & 0xFFFF) instead of a modulo operation — eliminating a division on the hot path for every packet sent and every HighestContiguous advance.
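The power-of-2 indexing recommended for the C port looks like this in any language. A storage-only sketch: eviction on HighestContiguous advance and the IsFull backpressure check are omitted, and `slidingWindow` is an invented name.

```go
package main

import "fmt"

const (
	windowSlots = 1 << 16        // 65,536: power of 2, as recommended
	slotMask    = windowSlots - 1 // 0xFFFF
)

// slidingWindow maps a monotonically increasing sequence number onto a
// fixed slot array with a bitmask instead of a modulo — no division on
// the per-packet hot path.
type slidingWindow struct {
	buf [windowSlots][]byte
}

func (w *slidingWindow) store(seq uint64, payload []byte) {
	w.buf[seq&slotMask] = payload
}

func (w *slidingWindow) load(seq uint64) []byte {
	return w.buf[seq&slotMask]
}

func main() {
	w := &slidingWindow{}
	w.store(70000, []byte("chunk")) // wraps past slot 65535
	fmt.Println(70000 & slotMask)   // 4464: same result as 70000 % 65536
	fmt.Println(string(w.load(70000)))
}
```

The bitmask is only equivalent to modulo when the slot count is a power of 2, which is exactly why 65,536 is recommended over the Go prototype's 50,000.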
With the original auto-ceiling multiplier of 4×, the target rate on a Gigabit LAN reached 396 MB/s (4× a measured peak delivery of ~99 MB/s). At this target the token bucket's pacing budget was so large that no sleep was ever triggered — the sender fired packets as fast as the CPU allowed. The OS NIC queue and the receiver's socket buffer were overwhelmed, systematically dropping the same ~167 packets per heartbeat interval. Because these drops were clustered within FEC blocks (the sender had burst through several consecutive blocks before the receiver could drain them), the parity overhead was insufficient to recover them. The result: every heartbeat for the entire transfer carried the same 167-entry NACK list, the sender retransmitted them repeatedly, and the teardown phase required ~3.5 seconds of retransmit cycles before the receiver could recover all blocks and issue TRANSFER_COMPLETE.
Three fixes address this together:
On a real Long Fat Network (1 GB file sent over a ~20 MB/s WAN link), Phase 1 multiplicative probing ramped the sender from 2 MB/s to 71 MB/s (the 4× Phase 1 ceiling) in under 5 seconds. The link could only sustain ~20 MB/s. The excess was absorbed by OS socket buffers, which then overflowed. From that point, packets were dropped at the OS layer before reaching the receiver application. The observed effect:
The receiver computes LossRate as packetsLost / (packetsReceived + packetsLost). When the OS drops all packets in a heartbeat window, packetsReceived = 0 and packetsLost = 0, so totalPackets = 0 and the guard clause produces 0% loss. The CC never detected any congestion.

Additionally, a secondary failure compounded the problem during teardown: a 7-second gap in heartbeat reception caused a backlog to accumulate. When heartbeats resumed, the sender drained the entire queue at once — 39 calls to the retransmit function in 45ms, each firing all 167 packets at wire speed (~199 MB/s burst on a 20 MB/s link).
Two fixes address this:
1. Delivery-collapse guard: OnHeartbeat checks NACKCount > 0 AND E < S × 0.25 (threshold lowered from 0.5 in v4.0 — see §6C). If both are true, the sender holds and permanently enters Phase 2. The Phase 2 ceiling (1.5× peak delivery) fires immediately, cutting the target from 71 MB/s to ~26 MB/s. With the rate near actual link capacity, the normal loss signals take over and back it down further. The NACKCount condition is critical — it prevents false holds on cold-start windows where delivery is transiently near zero but the link is healthy.
2. Teardown retransmits are paced through the token bucket, calling Pace() for each packet. A backlog of 39 queued heartbeats retransmitting 167 packets each is spread over ~600ms at 15 MB/s rather than firing in 45ms, giving the receiver time to process them and advance HighestContiguous.

Problem: When the sender's sliding window fills (50,000 un-acknowledged slots), the main send loop must pause to avoid growing memory without bound. The original implementation used a bare for sw.IsFull(seqNum) { time.Sleep(1ms) } inner loop. While spinning there, the outer loop never returned to its NACK-processing step at the top. If the very first DATA packet was dropped, HighestContiguous on the receiver stayed at 0. Because Advance(0) is a no-op (guarding against the zero-value case), hc in the sliding window stayed at its sentinel value (MaxUint64). With that sentinel, IsFull returned true at exactly seq 50,000, which corresponds to ~68 MB — roughly 6% of a 1 GB file. The sender froze, NACKs queued in nackPending went unserviced, and the receiver detected no incoming packets for 5 seconds and declared an inactivity timeout.
Fix: Replace for sw.IsFull(seqNum) { sleep } with if sw.IsFull(seqNum) { sleep; continue }. The continue jumps back to the top of the outer for seqNum < totalChunks loop, so every backpressure iteration still drains nackPending and retransmits pending sequences before sleeping. The retransmit of the lost first packet allows HighestContiguous to advance on the next heartbeat, which unblocks IsFull and resumes the main send loop normally.
Observed signature: Transfer ramps to full speed (100+ MB/s), freezes at approximately 6% of a 1 GB file, receiver reports inactivity timeout 5 seconds later. NACKs counter shows 0 during the freeze (the NACK retransmit loop never ran). The freeze point scales exactly with window size: windowSlots × MaxPayload / fileSize.
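The structural difference between the broken and fixed loops fits in a skeleton. This is a toy model of the fix described above, not the implementation: the window, NACK drain, and send are injected as closures so the control flow can be exercised in isolation, and the toy window unblocks once the drain has run a few times (standing in for a retransmit landing and HighestContiguous advancing).

```go
package main

import (
	"fmt"
	"time"
)

// sendLoop is the corrected skeleton: the backpressure branch uses
// `continue`, so the NACK drain at the top of the loop runs on every
// iteration even while the window is full. The broken version spun in
// an inner `for isFull { sleep }` and never reached drainNACKs.
func sendLoop(totalChunks uint64, isFull func(uint64) bool, drainNACKs func(), sendData func(uint64)) {
	var seqNum uint64
	for seqNum < totalChunks {
		drainNACKs() // always serviced first, even under backpressure
		if isFull(seqNum) {
			time.Sleep(time.Millisecond)
			continue // back to the top: NACK retransmits keep flowing
		}
		sendData(seqNum)
		seqNum++
	}
}

func main() {
	nackRuns, window := 0, 3 // toy 3-slot window
	sendLoop(5,
		func(seq uint64) bool { return seq >= uint64(window) },
		func() {
			nackRuns++
			if nackRuns >= 5 { // a retransmit "lands": window advances
				window = 5
			}
		},
		func(uint64) {})
	fmt.Println(nackRuns >= 5) // true: NACKs were serviced during the stall
}
```

With the bare spin, the toy transfer above would hang forever at seq 3; with `continue`, the drain runs during the stall, the window advances, and the loop completes.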
The serve daemon is a persistent single-lane UDP server that manages a file manifest and services both pull requests (clients fetch files) and push requests (clients deposit files). It listens on a single control port and handles one transfer at a time — concurrent requests receive SERVER_BUSY and may retry.
All SESSION_REJECT packets carry a single reason-code byte in the payload:
| Code | Name | Meaning |
|---|---|---|
0x01 | SESSION_ID_COLLISION | The submitted SessionID is already active on the server. |
0x02 | HASH_MISMATCH | Received file hash does not match the value declared in SESSION_REQ. |
0x03 | SERVER_BUSY | A transfer is already in progress; try again later. |
0x04 | FILE_NOT_FOUND | The requested filename is not in the serve manifest. |
0x05 | FILE_EXISTS | A push was rejected because the filename already exists on the server (no-overwrite policy). |
0x06 | ENCRYPTION_UNSUPPORTED | The receiver does not support encryption and received a request with the Encrypted flag (0x04) set. |
The PULL_REQ mechanism allows a client behind NAT to retrieve a file from a serve daemon that has a public IP address, without any port-forwarding configuration on the client side.
Wire format: Packet type 0x07. Payload is a null-terminated UTF-8 filename with no fixed-size prefix.
Flow:
1. The client generates a SessionID (same CSPRNG path as §4A), binds a local UDP socket on an OS-assigned ephemeral port, and sends PULL_REQ to the server's control port. This outbound packet punches the NAT hole: the NAT mapping records client-ip:ephemeral-port → server-ip:control-port.
2. The server validates the PULL_REQ. If busy or the filename is not in the manifest, it sends SESSION_REJECT back to the client's address. Otherwise, it fires a normal SESSION_REQ to the client's address from a new outbound connection. Because the server's IP was the destination of the punching packet, port-restricted-cone NAT routers (the most common home router class) allow this inbound from the same IP on any source port.
3. The client receives the SESSION_REQ on its bound socket, records the sender's address (server ephemeral port), and enters the normal receiver flow using its existing socket — no rebind required. The same socket that sent PULL_REQ carries the entire transfer.
4. The session uses the SessionID supplied in the PULL_REQ header, eliminating an extra round-trip for ID assignment.

The push flow allows a client to deposit a file into the serve daemon's directory. Three security invariants are always enforced:
1. Filename sanitization: the filename in the PUSH_REQ payload is sanitized to its base name — the substring after the last / or \ separator (Go: filepath.Base(); C: manual reverse scan for both separators). Path traversal sequences such as ../../etc/passwd or /absolute/path are reduced to just the final filename component before any further processing.
2. No overwrite: if the sanitized filename already exists on the server, the push is rejected with FILE_EXISTS (0x05).
3. Atomic visibility: the incoming file is written as filename.tmp during the transfer. Only after a successful TRANSFER_COMPLETE (xxHash64 verified) is the .tmp renamed to its final path and added to the live manifest. A failed or interrupted transfer leaves no partial file in the manifest.

PUSH_REQ wire format (type 0x08): 8-byte big-endian FileSize followed by a null-terminated filename. Total payload: 9 bytes minimum.
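The base-name sanitization can be sketched as a single forward scan for both separators — the manual equivalent of filepath.Base() that a C port would use; `sanitizeFilename` is an invented name.

```go
package main

import "fmt"

// sanitizeFilename reduces a submitted path to its final component by
// finding the last '/' or '\' separator, neutralizing path-traversal
// and absolute-path payloads from either OS family.
func sanitizeFilename(name string) string {
	last := -1
	for i := 0; i < len(name); i++ {
		if name[i] == '/' || name[i] == '\\' {
			last = i
		}
	}
	return name[last+1:]
}

func main() {
	fmt.Println(sanitizeFilename("../../etc/passwd")) // passwd
	fmt.Println(sanitizeFilename(`C:\temp\evil.exe`)) // evil.exe
	fmt.Println(sanitizeFilename("report.pdf"))       // report.pdf
}
```

Scanning for both separators matters because a Windows-style `\` payload would pass through a Unix-only filepath.Base() untouched.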
PUSH_ACCEPT wire format (type 0x09): 2-byte big-endian Port — the ephemeral UDP port the server has bound for the incoming data transfer.
Flow:
1. The client generates a SessionID and sends PUSH_REQ to the control port containing the filename and declared file size.
2. The server validates the request, binds a new UDP socket on an ephemeral data port (Go: net.ListenUDP(":0"); C: bind() with sin_port = 0, then getsockname() to retrieve the assigned port), and replies with PUSH_ACCEPT containing the assigned data port.
3. The client connects to server-host:data-port and starts a normal HP-UDP sender using the same SessionID from step 1.
4. The server writes the incoming file to filename.tmp. On success: atomic rename (rename()) from tmp to final path + manifest promotion under a write-lock. On failure: the tmp is deleted and the slot is cleared.

The manifest is a filename-to-absolute-path map built at daemon startup by a non-recursive directory scan. Symlinks and directories are excluded. Files added to the directory after startup are invisible until the daemon is restarted — this is an intentional security boundary. Successful push transfers atomically add their promoted file to the in-memory manifest under a write-lock (Go: sync.RWMutex; C: pthread_rwlock_t). PULL_REQ handlers acquire a read-lock when consulting the manifest.
| Parameter | Default Value | Configurable |
|---|---|---|
| MTU Hard Cap | 1400 bytes (total) | No |
| Header Size | 32 bytes (4 × 64-bit aligned, includes SenderTimestampNs) | No |
| Max Payload | 1368 bytes unencrypted (MTUHardCap(1400) − HeaderSize(32)); 1352 bytes encrypted (1368 − GCM_TagSize(16)) | No |
| FEC Block Size | 100 data packets | Yes |
| FEC Initial Parity | 5% | Yes |
| FEC Tail Min Parity | 2 packets | Yes |
| Calibration Burst Size | 10 packets (packet train) | Yes |
| Calibration Burst Spacing | 0 (wire speed) | Yes |
| Default Starting Rate | 2 MB/s | Yes |
| EWMA Smoothing Factor (α) | 0.3 | Yes |
| Loss Threshold: Increase | < 100 bp (1%) | Yes |
| Loss Threshold: Hold / Phase Transition | 100–500 bp (1–5%) | Yes |
| Loss Threshold: Decrease | > 500 bp (5%) | Yes |
| Consecutive Decrease Signals | 2 | Yes |
| Phase 1 Increase Multiplier | 1.25× per RTT | Yes |
| Phase 2 Additive Increase | MaxPayload / RTT per RTT | Yes |
| Decrease Factor | 0.85× smoothed delivery rate | Yes |
| Auto-Ceiling Multiplier (Phase 1) | 4× peak delivery rate (runaway prevention) | Yes |
| Auto-Ceiling Multiplier (Phase 2) | 1.5× peak delivery rate (avoidance bound) | Yes |
| Deficit Accumulator Burst Cap | 2ms of credit | Yes |
| Deficit Sleep Threshold | ≥ 1ms | No (OS-dependent) |
| Max NACKs per Send Iteration | 3 | Yes |
| Rate Floor | 10 KB/s | Yes |
| Inactivity Timeout | max(5 × heartbeat interval, 5s) | Yes |
| Sender Probe Interval | 500ms | Yes |
| Sender Probe Timeout | 10 seconds | Yes |
| Linger Duration (both sides) | 3 seconds | Yes |
| Receiver Teardown Retries | 3 | Yes |
| Stale SessionID Reservation | 10 seconds | Yes |
| Max SESSION_REQ File Size | 1 TB | Yes |
| Teardown Batch Size | 10 packets per sleep | Yes |
| Teardown Batch Sleep | 2 ms | Yes |
| Delivery-Collapse Threshold | 25% of current send rate (was 50%) | Yes |
| NACK Cooldown Margin | RTT × 1.25 (RTT + 25%) | Yes |
| Tail-Drop Injection Limit | 167 sequences per heartbeat | Yes |
| Sliding Window Slots | 65,536 (2^16, ~89 MB peak). Go prototype uses 50,000; C implementations should use a power of 2 for bitmask index wrapping. | Yes |
| Encryption Cipher | AES-128-GCM (128-bit key, 128-bit auth tag, 96-bit nonce) | No |
| GCM Tag Size | 16 bytes | No |
| GCM Nonce Size | 12 bytes (constructed from SessionID + PacketType + UniqueID; not transmitted) | No |
| Key Exchange | X25519 ephemeral (32-byte public key per side) | No |
| Key Derivation | HKDF-SHA256, salt = SessionID, info = "hp-udp-aes128-v5" | No |
Revision summary:

- v5.0 (encryption): SESSION_ACCEPT (0x0A) packet type; Encrypted flag (0x04); 1-RTT handshake when encrypted (0-RTT preserved for unencrypted); deterministic nonce from header fields; encrypt-after-FEC data path; extended payloads for SESSION_REQ, PUSH_REQ, PUSH_ACCEPT, PULL_REQ; ENCRYPTION_UNSUPPORTED reject code; encrypted MaxPayload 1352 bytes; backward compatible with v4.x unencrypted transfers.
- C port guidance: platform-specific primitives (clock_nanosleep, pthread_rwlock_t, manual basename scan, ephemeral bind()+getsockname()); expanded deficit-accumulator C guidance (TIMER_ABSTIME, SCHED_FIFO); architecture-neutral teardown socket ownership (epoll single-thread vs goroutine); power-of-2 sliding window slot count (65,536) for bitmask index wrapping; io_uring receiver disk I/O note.
- Memory and deadlock fixes: bounded sliding-window ring buffer replacing the unbounded sender chunk map; backpressure NACK starvation deadlock fix (if IsFull { continue } replaces bare spin).
- v4.0: SenderTimestampNs header field (HeaderSize 24→32, MaxPayload 1376→1368); frozen-timestamp RTT guard (lastEchoNs); RTT-aware NACK cooldown map; tail-drop deadlock prevention via proactive tail injection; teardown micro-burst prevention (batch-10 / 2ms); StorageFlushRate removed from CC effective-rate; delivery-collapse threshold 50%→25%; progress bar repair state.
- Congestion control rework: heartbeat dispersion field (DispersionNs); sender timestamps for RTT measurement; two-phase congestion controller (Phase 1 multiplicative 1.25×/RTT, Phase 2 additive); decrease formula changed from E × 1.05 to E × 0.85; rate-increase gating to once per RTT; Phase 2 throughput analysis; sendmmsg()/recvmmsg()/io_uring roadmap.