Design Specification: High-Performance UDP File Transfer Protocol (HP-UDP) v5.0

Changelog from v4.2 → v5.0 (End-to-End Encryption):
Changelog from v4.1 → v4.2 (C Implementation Readiness):
Changelog from v4.0 → v4.1 (Sender Sliding Window):
Changelog from v3.1 → v4.0 (WAN Reliability Overhaul):
Changelog from v3.0 → v3.1 (retained):
Changelog from v1.0 (retained):

1. The Motivation (The "Why")

The development of HP-UDP is driven by a singular goal: to engineer the fastest possible file transfer protocol that mathematically guarantees perfect data integrity across volatile network conditions.

While TCP is the foundational workhorse of the internet, its general-purpose congestion algorithms inherently throttle performance on Long Fat Networks (LFNs). HP-UDP was built to prove that it is possible to outperform TCP in raw throughput by replacing reactive safety nets with proactive, domain-specific algorithms.

This protocol democratizes high-speed data movement, giving developers and engineers the ability to send and receive massive files cleanly, reliably, and at maximum hardware limits. It is a rigorous demonstration of advanced systems engineering, built to prove what is possible when legacy constraints are stripped away.

2. Architectural Overview

HP-UDP is an application-layer file transfer mechanism built on top of UDP. The design is lean, avoids unnecessary overhead, and focuses intently on its primary goal of speed while ensuring the reliability required for production use.

The architecture is built upon five core pillars:

Note on Security Scope: HP-UDP v5.0 adds optional end-to-end encryption via ephemeral X25519 key exchange and AES-128-GCM per-packet encryption (§4.5), providing confidentiality, integrity, and perfect forward secrecy. Encryption is backward-compatible: unencrypted transfers still work when the Encrypted flag is unset. HP-UDP intentionally omits authentication — it does not verify the identity of the remote endpoint. In the target deployment environment (managed networks with known infrastructure behind SDNs), endpoint identity is established at the network layer. Optional certificate or pre-shared-key authentication may be added in a future revision as a separate concern.

3. Packet Wire Format (Custom Header)

The protocol utilizes a tightly packed, fixed-width 32-byte binary header for every datagram. The header is naturally aligned for 64-bit systems (four 8-byte words). The hard MTU cap is 1400 bytes total (header + payload), yielding a maximum payload of 1368 bytes (MTUHardCap(1400) − HeaderSize(32)). This ensures safe passage within standard 1500-byte Ethernet MTUs without IP-level fragmentation.

Byte order: All multi-byte fields in the entire protocol — header fields, heartbeat payload fields (§6B), SESSION_REQ payload fields (§4C), PUSH_REQ/PUSH_ACCEPT payload fields (§11C), and NACK arrays — are in big-endian (network byte order). C implementations must use htonl/ntohl (32-bit) and htonll/ntohll (64-bit) or equivalent for every multi-byte field on the wire. Go implementations use binary.BigEndian methods. This applies uniformly; there are no little-endian fields anywhere in the protocol.

Byte Offset Size Field Name Description
0x00 1 Byte PacketType 0x00 SESSION_REQ, 0x01 DATA, 0x02 PARITY, 0x03 HEARTBEAT, 0x04 SESSION_REJECT, 0x05 TRANSFER_COMPLETE, 0x06 ACK_CLOSE, 0x07 PULL_REQ, 0x08 PUSH_REQ, 0x09 PUSH_ACCEPT, 0x0A SESSION_ACCEPT.
0x01 4 Bytes SessionID Client-generated random identifier for the active transfer (see §4 for collision handling).
0x05 8 Bytes SequenceNum Strictly incrementing 64-bit chunk identifier. Eliminates rollover concerns up to ~16 EB file sizes.
0x0D 8 Bytes BlockGroup 64-bit identifier for the FEC block this packet belongs to. Aligned with SequenceNum address space.
0x15 2 Bytes PayloadLen Size of the raw data payload (max 1368 bytes).
0x17 1 Byte Flags Bitmask: 0x01 = End of File, 0x02 = Calibration Burst, 0x04 = Encrypted (payload is AES-128-GCM ciphertext; see §4.5).
0x18 8 Bytes SenderTimestampNs The sender's monotonic clock timestamp in nanoseconds at the moment each DATA or PARITY packet is built (C: clock_gettime(CLOCK_MONOTONIC) converted to nanoseconds; Go: time.Now().UnixNano()). Non-data control packets leave this field zero. The receiver echoes this value verbatim as EchoTimestampNs in the Heartbeat payload; the sender computes RTT = now_ns − EchoTimestampNs using only its own clock (§6B).
0x20 Variable Payload Raw file bytes, FEC parity data, or protocol metadata.
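The fixed offsets above translate directly into serialization code. The following is a minimal Go sketch (struct and function names are illustrative, not mandated by the spec) showing every multi-byte field written big-endian at its table offset:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const HeaderSize = 32

type Header struct {
	PacketType        byte   // offset 0x00
	SessionID         uint32 // offset 0x01
	SequenceNum       uint64 // offset 0x05
	BlockGroup        uint64 // offset 0x0D
	PayloadLen        uint16 // offset 0x15
	Flags             byte   // offset 0x17
	SenderTimestampNs uint64 // offset 0x18
}

// Marshal writes every multi-byte field in network byte order, per §3.
func (h *Header) Marshal(buf []byte) {
	buf[0x00] = h.PacketType
	binary.BigEndian.PutUint32(buf[0x01:], h.SessionID)
	binary.BigEndian.PutUint64(buf[0x05:], h.SequenceNum)
	binary.BigEndian.PutUint64(buf[0x0D:], h.BlockGroup)
	binary.BigEndian.PutUint16(buf[0x15:], h.PayloadLen)
	buf[0x17] = h.Flags
	binary.BigEndian.PutUint64(buf[0x18:], h.SenderTimestampNs)
}

// UnmarshalHeader is the exact inverse of Marshal.
func UnmarshalHeader(buf []byte) Header {
	return Header{
		PacketType:        buf[0x00],
		SessionID:         binary.BigEndian.Uint32(buf[0x01:]),
		SequenceNum:       binary.BigEndian.Uint64(buf[0x05:]),
		BlockGroup:        binary.BigEndian.Uint64(buf[0x0D:]),
		PayloadLen:        binary.BigEndian.Uint16(buf[0x15:]),
		Flags:             buf[0x17],
		SenderTimestampNs: binary.BigEndian.Uint64(buf[0x18:]),
	}
}

func main() {
	h := Header{PacketType: 0x01, SessionID: 0xDEADBEEF, SequenceNum: 42, PayloadLen: 1368}
	var buf [HeaderSize]byte
	h.Marshal(buf[:])
	fmt.Println(UnmarshalHeader(buf[:]) == h)
}
```

A C implementation would use a packed struct plus explicit htonl/htonll conversion at the same offsets; the layout is identical.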

4. Connection Establishment: 0-RTT Optimistic Handshake

To eliminate latency before data transmission begins, the protocol uses a Zero Round-Trip Time (0-RTT) handshake with a wire-speed calibration burst to probe link capacity.

A. SessionID Generation

The client generates the SessionID as a cryptographically random 32-bit integer (C: getrandom() or /dev/urandom; Go: crypto/rand). This keeps the handshake 0-RTT since no server round-trip is required for ID assignment.
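A minimal Go sketch of the client-side generation (the function name is illustrative):

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// NewSessionID draws 4 cryptographically random bytes and interprets them
// as a 32-bit identifier; collision handling is the receiver's concern (§4).
func NewSessionID() (uint32, error) {
	var b [4]byte
	if _, err := rand.Read(b[:]); err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint32(b[:]), nil
}

func main() {
	id, err := NewSessionID()
	fmt.Println(id, err)
}
```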

B. SESSION_REQ Validation

Before allocating resources, the receiver validates the SESSION_REQ payload:

If validation fails, the receiver sends a SESSION_REJECT and logs a diagnostic warning.

C. Handshake Sequence

The handshake is 0-RTT when unencrypted and 1-RTT when encrypted (§4.5). Both flows are described below.

Configurable Initial Rate Override

If the InitialRate field in SESSION_REQ is non-zero, the sender skips calibration mode and begins transmitting at the specified bytes-per-second rate immediately. This is intended for known environments (e.g., a dedicated 10 Gbps LAN) where the operator can confidently set the initial rate. The adaptive congestion controller still takes over after the first Heartbeat.

Design History: v2.0 used 50 packets at 1ms spacing (~1.38 MB/s probe). v3.0 changed to 100 packets at wire speed. Both had problems: the 1ms spacing made LAN ramp-up take too long, while 100 wire-speed packets (140 KB) instantly filled router buffers on Starlink and poisoned the initial peakRate measurement. The v3.1 packet train (10 packets) is small enough to avoid buffer overflow on any reasonable link, while the dispersion measurement extracts the same capacity information that 100 packets would provide.

4.5. End-to-End Encryption

HP-UDP optionally encrypts all DATA and PARITY payloads using AES-128-GCM with ephemeral X25519 key exchange. Encryption is negotiated during the handshake (§4C Step 1.5) and is all-or-nothing for a session — once the Encrypted flag is set, every DATA and PARITY packet in the session is encrypted. Control packets (HEARTBEAT, TRANSFER_COMPLETE, ACK_CLOSE) are not encrypted; their payloads contain only protocol metadata, not file content.

A. Ephemeral Key Exchange

Both sides generate a fresh X25519 keypair (32-byte public key, 32-byte private key) at the start of each session. Private keys exist only in memory for the duration of the transfer and are securely zeroed on session teardown. This provides perfect forward secrecy: there is no persistent key material that could decrypt recorded traffic after the session ends.

Key exchange is embedded in the existing handshake flow with no additional round trips beyond the 1-RTT SESSION_ACCEPT:

Flow                 Sender Key In          Receiver Key In                          Added RTTs
Direct send/recv     SESSION_REQ payload    SESSION_ACCEPT (0x0A) payload            +1 (0-RTT → 1-RTT)
Serve daemon push    PUSH_REQ payload       PUSH_ACCEPT payload (extended)           +0 (already 1-RTT)
Serve daemon pull    PULL_REQ payload       SESSION_REQ payload (server is sender)   +0 (already 1-RTT)

For push and pull via the serve daemon, the existing round trip already accommodates the key exchange — no additional latency is introduced. Only the basic send/recv flow gains one round trip.

B. Session Key Derivation

Both sides independently derive the same symmetric key:

shared_secret = X25519(my_private_key, their_public_key)     // 32 bytes
session_key   = HKDF-SHA256(
                    ikm  = shared_secret,
                    salt = SessionID (4 bytes, big-endian),
                    info = "hp-udp-aes128-v5",
                    len  = 16                                  // AES-128 key
                )

The SessionID salt ensures that even if the same ephemeral keypair were accidentally reused (implementation bug), different sessions would derive different keys. The info string binds the key to the protocol version and cipher suite, preventing cross-protocol key reuse.

C implementations: OpenSSL EVP_KDF with OSSL_KDF_NAME_HKDF, or libsodium crypto_kdf_hkdf_sha256_expand. Go: golang.org/x/crypto/hkdf.
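The derivation can be sketched without any HKDF library, since a 16-byte output needs only one HKDF-Expand block (16 ≤ the 32-byte SHA-256 output). The following Go sketch implements HKDF-Extract plus the first Expand block directly from crypto/hmac; the function name is illustrative, and production code should prefer the library implementations named above:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// DeriveSessionKey implements HKDF-Extract + the first HKDF-Expand block
// of the derivation in §4.5B, yielding a 16-byte AES-128 key.
func DeriveSessionKey(sharedSecret []byte, sessionID uint32) []byte {
	salt := make([]byte, 4)
	binary.BigEndian.PutUint32(salt, sessionID) // SessionID salt, big-endian

	// Extract: PRK = HMAC-SHA256(salt, ikm)
	ext := hmac.New(sha256.New, salt)
	ext.Write(sharedSecret)
	prk := ext.Sum(nil)

	// Expand, block 1: T1 = HMAC-SHA256(PRK, info || 0x01)
	exp := hmac.New(sha256.New, prk)
	exp.Write([]byte("hp-udp-aes128-v5"))
	exp.Write([]byte{0x01})
	return exp.Sum(nil)[:16]
}

func main() {
	secret := make([]byte, 32) // placeholder for the X25519 shared secret
	fmt.Printf("%x\n", DeriveSessionKey(secret, 0xCAFEBABE))
}
```

Because both sides feed the same shared secret, SessionID, and info string into the same construction, they derive identical keys with no key material on the wire.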

C. Per-Packet Encryption

Each DATA and PARITY packet is encrypted independently using AES-128-GCM. The packet header (32 bytes) is not encrypted — it is passed as Additional Authenticated Data (AAD) so that the receiver can route, reorder, and identify packets before decryption. The header is authenticated by the GCM tag, preventing tampering.

Wire Layout (Encrypted Packet)

┌──────────────────┬──────────────────────────┬──────────────┐
│ Header (32 B)    │ Ciphertext (PayloadLen B) │ GCM Tag (16B)│
│ cleartext, AAD   │ AES-128-GCM output        │ auth tag     │
└──────────────────┴──────────────────────────┴──────────────┘
 Total ≤ 1400 bytes.  PayloadLen ≤ 1352 (= 1368 − 16 tag).

PayloadLen in the header reflects the plaintext length (which equals the ciphertext length in GCM). The receiver reads PayloadLen + 16 bytes from the payload area to get ciphertext + tag. Encrypted MaxPayload = 1352 bytes. Unencrypted transfers retain MaxPayload = 1368.

Nonce Construction (12 Bytes)

AES-GCM requires a unique nonce for every packet encrypted under the same key. Nonce reuse completely breaks GCM's confidentiality and authenticity guarantees. The nonce is constructed deterministically from fields already present in the packet header:

Bytes    Field                                    Purpose
0–3      SessionID (4 bytes)                      Scopes nonce space to this transfer. Redundant with HKDF salt but provides defense-in-depth against implementation errors.
4        PacketType (1 byte)                      Domain separator: 0x01 (DATA) vs 0x02 (PARITY). Prevents nonce collision between DATA and PARITY packets that may share the same SequenceNum value within a block.
5–11     Packet Unique ID (7 bytes, big-endian)   DATA: lower 56 bits of SequenceNum (globally unique, strictly incrementing). PARITY: lower 56 bits of (BlockGroup << 8) | ParityIndex, where ParityIndex is the zero-based SequenceNum field of the PARITY packet (0..m−1).

Nonce uniqueness proof: DATA nonces are unique because SequenceNum is strictly incrementing across the entire transfer and never reused. PARITY nonces are unique because BlockGroup is strictly incrementing and ParityIndex is unique within a block; the compound (BlockGroup << 8) | ParityIndex is therefore unique across the transfer (room for 256 parity packets per block, well above the maximum of 20). The PacketType byte at offset 4 prevents any DATA nonce from colliding with any PARITY nonce. 56 bits of unique ID supports 2^56 packets per session (~98 exabytes at 1368 bytes/packet), well beyond the 1 TB maximum file size.

The nonce is not transmitted on the wire. Both sides compute it deterministically from the packet header fields. This saves 12 bytes per packet compared to an explicit nonce.

D. Encryption Placement in the Data Path

Encryption is applied after FEC encoding. The FEC encoder operates on plaintext data shards and produces plaintext parity shards. Each shard (DATA or PARITY) is then encrypted independently before transmission. On the receiver side, each packet is decrypted individually, then the plaintext shards are passed to the FEC decoder for reconstruction if needed.

Sender:  file → chunk → FEC encode (plaintext) → encrypt each shard → transmit
Receiver: receive → decrypt each shard → FEC decode (plaintext) → reassemble → disk

This ordering means FEC reconstruction operates on plaintext, which is correct: the Reed-Solomon math must see the original data bytes, not ciphertext. Encrypting before FEC would offer no benefit, since the receiver would still need to decrypt every received shard before reconstruction, and any shard rebuilt by the decoder would emerge as plaintext anyway, leaving a mix of plaintext and ciphertext shards that the per-packet GCM tags could no longer authenticate uniformly.

E. Payload Format Changes for Key Exchange

When the Encrypted flag (0x04) is set, the following payloads are extended with a 32-byte ephemeral public key:

Packet Type      Standard Payload                                                  Encrypted Payload
SESSION_REQ      FileSize(8B) + Hash(8B) + InitialRate(4B) + FileName(null-term)   FileSize(8B) + Hash(8B) + InitialRate(4B) + PubKey(32B) + FileName(null-term)
SESSION_ACCEPT   (does not exist in unencrypted mode)                              PubKey(32B)
PUSH_REQ         FileSize(8B) + FileName(null-term)                                FileSize(8B) + PubKey(32B) + FileName(null-term)
PUSH_ACCEPT      Port(2B)                                                          Port(2B) + PubKey(32B)
PULL_REQ         FileName(null-term)                                               PubKey(32B) + FileName(null-term)

The receiver determines whether to parse the public key by checking the Encrypted flag in the packet header. If a receiver does not support encryption and receives a request with the flag set, it responds with SESSION_REJECT (reason code 0x06 = ENCRYPTION_UNSUPPORTED).

F. Security Properties and Non-Goals

G. Performance Budget

Operation                        Throughput (AES-NI)       Impact at 100 MB/s wire speed
AES-128-GCM encrypt              4–6 GB/s single-thread    <3% CPU
AES-128-GCM decrypt              4–6 GB/s single-thread    <3% CPU
X25519 scalar multiply           ~50 µs per session        Negligible
HKDF-SHA256 derivation           ~1 µs per session         Negligible
Payload reduction (1368→1352)    1.2% fewer data bytes     ~1.2% more packets for same file

Net throughput impact: <5%. AES-NI hardware acceleration is present on every x86 CPU manufactured since ~2010 (Intel Westmere / AMD Bulldozer). C implementations should use OpenSSL's EVP_aes_128_gcm (which auto-detects AES-NI) or a SIMD-accelerated library. Go's crypto/aes + crypto/cipher uses AES-NI on amd64 automatically.

Implementation Note: Do not allocate and free GCM cipher contexts per packet. Pre-allocate one context at session start and reuse it across packets, updating only the nonce via EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, nonce) (OpenSSL) or by resetting the cipher.AEAD seal call with a new nonce (Go). Context reuse eliminates ~25,000 allocations/sec at 35 MB/s throughput.

5. Core Reliability Mechanisms

A. Sequence Buffering (Zero-Blocking Receiver)

The receiver will never halt reading from the network socket. Incoming data is immediately mapped into memory.

B. Adaptive Forward Error Correction (FEC)

To eliminate latency penalties from round-trip retransmissions, the protocol proactively embeds mathematical redundancy that adapts to observed network conditions.

Block Grouping

Data packets are organized into sequential BlockGroups. The default block size is 100 data packets per group. The BlockGroup identifier for a DATA packet is computed as:

BlockGroup = SequenceNum / BlockSize   (integer division)

PARITY packets use the same BlockGroup value as the data packets they protect. The SequenceNum field of a PARITY packet is its zero-based index within the block (0, 1, 2, …, m−1), not a global sequence number. A C receiver must distinguish PARITY from DATA packets using the PacketType field (0x02) and interpret SequenceNum accordingly.
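The grouping rule reduces to a pair of small helpers (a Go sketch with illustrative names):

```go
package main

import "fmt"

const BlockSize = 100 // default data packets per FEC group (§5B)

// BlockGroupFor maps a DATA packet's global sequence number to its group
// via integer division, per §5B.
func BlockGroupFor(seq uint64) uint64 {
	return seq / BlockSize
}

// IsParity tells the receiver to interpret SequenceNum as a zero-based
// parity index within the block rather than a global sequence number.
func IsParity(packetType byte) bool {
	return packetType == 0x02
}

func main() {
	fmt.Println(BlockGroupFor(0), BlockGroupFor(99), BlockGroupFor(250))
}
```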

Dynamic Parity Ratio

The parity packet count per block is dynamically adjusted based on the observed packet loss rate, reported via Heartbeat metrics (§6).

Observed Loss Rate    Parity Ratio    Parity Packets per 100-Packet Block
< 0.5%                2%              2
0.5% – 2%             5%              5
2% – 5%               10%             10
5% – 10%              15%             15
> 10%                 20%             20

The sender initializes at 5% parity during calibration and adjusts after the first Heartbeat containing loss data. Adjustments are applied on block group boundaries — mid-block changes are not permitted, as this would invalidate the Reed-Solomon coding parameters for that group.
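The schedule above is a straight lookup. A Go sketch, taking loss in basis points (the Heartbeat LossRate encoding of §6B); the assignment of the exact boundary values to a bucket is an implementation choice not pinned down by the table:

```go
package main

import "fmt"

// ParityPerBlock maps observed loss (basis points) to parity packets per
// full 100-packet block, following the §5B schedule.
func ParityPerBlock(lossBasisPoints uint16) int {
	switch {
	case lossBasisPoints < 50: // < 0.5%
		return 2
	case lossBasisPoints < 200: // 0.5% – 2%
		return 5
	case lossBasisPoints < 500: // 2% – 5%
		return 10
	case lossBasisPoints < 1000: // 5% – 10%
		return 15
	default: // > 10%
		return 20
	}
}

func main() {
	fmt.Println(ParityPerBlock(0), ParityPerBlock(150), ParityPerBlock(1200))
}
```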

Parity Generation

Parity packets are generated using Reed-Solomon erasure coding over GF(2^8) with the irreducible polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). For a block of k data packets with m parity packets, any k of the k+m total packets are sufficient to reconstruct the original data. The encoding uses a Vandermonde-derived matrix whose top k rows form an identity matrix, ensuring data packets pass through unchanged and only parity is computed.

On-the-Fly Recovery

If the receiver detects missing packets within a completed block group (all expected sequence numbers accounted for or timed out), it attempts FEC reconstruction immediately. Only packets that cannot be recovered via FEC are reported as NACKs in the next Heartbeat.

Tail Block Handling

The final block group of a file transfer will almost certainly contain fewer than 100 data packets. The FEC parameters adapt as follows:

The key formulas for computing totals and boundaries are:

TotalChunks      = ceil(FileSize / MaxPayload)          // number of DATA packets (MaxPayload = 1368 unencrypted, 1352 encrypted; §4.5C)
k_tail           = TotalChunks % BlockSize               // 0 means last block is full
                   (if k_tail == 0: k_tail = BlockSize)
FinalPayloadSize = FileSize % MaxPayload                 // bytes in last DATA packet
                   (if FinalPayloadSize == 0: FinalPayloadSize = MaxPayload)
EOF Detection in FEC-Recovered Packets: The EndOfFile flag is only meaningful on the wire. When the final DATA packet is recovered via FEC reconstruction rather than received directly, the flag is not propagated into the reconstructed shard. Receivers must therefore detect end-of-transfer by checking SequenceNum == TotalChunks − 1 (known from the SESSION_REQ FileSize field) rather than relying solely on the Flags field.
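The tail-block arithmetic above can be sketched directly (Go, unencrypted MaxPayload; function names are illustrative):

```go
package main

import "fmt"

const (
	MaxPayload = 1368 // unencrypted payload size (§3)
	BlockSize  = 100  // data packets per FEC group (§5B)
)

// TotalChunks = ceil(FileSize / MaxPayload), the number of DATA packets.
func TotalChunks(fileSize uint64) uint64 {
	return (fileSize + MaxPayload - 1) / MaxPayload
}

// TailBlockSize is k for the final (possibly short) FEC block.
func TailBlockSize(totalChunks uint64) uint64 {
	k := totalChunks % BlockSize
	if k == 0 {
		k = BlockSize // last block is exactly full
	}
	return k
}

// FinalPayloadSize is the byte count carried by the last DATA packet.
func FinalPayloadSize(fileSize uint64) uint64 {
	r := fileSize % MaxPayload
	if r == 0 {
		r = MaxPayload
	}
	return r
}

func main() {
	const fileSize = 1_000_000
	tc := TotalChunks(fileSize)
	fmt.Println(tc, TailBlockSize(tc), FinalPayloadSize(fileSize))
}
```

For a 1,000,000-byte file this yields 731 chunks, a 31-packet tail block, and a 1360-byte final payload.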

6. Adaptive Congestion and Flow Control

HP-UDP separates congestion control (network path capacity) from flow control (receiver processing capacity). The congestion controller is loss-driven: the primary signal is the observed packet loss rate, not the ratio of delivery rate to send rate. The delivery rate acts as a ceiling, not a decision driver.

Design Rationale (v3.0): The v2.0 spec used a delivery-rate-ratio algorithm where the sender increased only if EffectiveRate ≥ 0.95 × SendRate. In practice, this threshold was unreachable: delivery rate is bounded by send rate (the receiver cannot report receiving more than was sent), and timing jitter in heartbeat measurement windows made the ratio consistently fall below 0.95 even on a clean link. This caused the sender to spiral to the rate floor. Loss rate is the correct primary signal because it reflects whether the network has headroom independently of the send rate.

A. The Heartbeat Packet

The receiver periodically sends a HEARTBEAT (Type 0x03) packet to the sender. The heartbeat interval is rate-proportional based on the last measured NetworkDeliveryRate:

Last Measured Network Delivery Rate    Heartbeat Interval
< 10 MB/s                              100ms
10 – 100 MB/s                          50ms
100 MB/s – 1 GB/s                      25ms
> 1 GB/s                               10ms

Implementation Note: The v2.0 spec described the interval as based on "current effective send rate" inferred from packet arrival rate. The implementation initially used cumulative bytes written to disk, which gave incorrect results (e.g., passing 136 MB to a function expecting bytes/sec). The correct approach is to track the NetworkDeliveryRate computed in each heartbeat and use that value directly for interval selection.

B. Heartbeat Payload

The Heartbeat payload contains dual metrics, RTT echo, and a NACK array. All multi-byte fields below are big-endian (network byte order) — C implementations must serialize/deserialize with htonl/ntohl and htonll/ntohll:

Field Size Description
NetworkDeliveryRate 4 Bytes Bytes per second successfully received from the socket into the ring buffer during the last heartbeat interval. Reflects network path capacity.
StorageFlushRate 4 Bytes Bytes per second flushed from the ring buffer to disk during the last heartbeat interval. Reflects receiver I/O capacity.
LossRate 2 Bytes Packet loss percentage for the current reporting window, encoded as basis points (e.g., 150 = 1.50%). Primary signal for congestion control.
EchoTimestampNs 8 Bytes The verbatim value of SenderTimestampNs from the most recently received DATA or PARITY packet header. The sender computes RTT = now_ns − EchoTimestampNs using only its own monotonic clock, eliminating cross-machine clock-skew error entirely. If no DATA/PARITY packets were received in the interval, the receiver echoes the previous value unchanged (the sender's frozen-timestamp guard ignores stale repeats — see §6B).
DispersionNs 8 Bytes Calibration burst dispersion measurement: the time (nanoseconds) between the first and last received calibration-flagged packet. Zero outside of calibration. The sender uses this to compute bottleneck bandwidth: BW = (BurstSize − 1) × MaxPayload / DispersionNs (where MaxPayload = 1368). See §4C.
HighestContiguous 8 Bytes The highest SequenceNum such that all packets 0..N have been received or FEC-recovered. Allows sender to track receiver progress.
NACKCount 2 Bytes Number of unrecoverable sequence numbers in the NACK array that follows.
NACKArray 8 Bytes × N Array of 64-bit SequenceNum values that were not recoverable via FEC and require retransmission. Bounded to sequences between HighestContiguous+1 and the highest received sequence number (never NACKs packets the sender hasn't transmitted yet). The array is physically limited to fit within one packet: MaxNACKs = floor((MaxPayload − HeartbeatFixedSize) / 8) = floor((1368 − 36) / 8) = 166. Implementations must not exceed this computed limit. If more than 166 sequences are pending, the receiver sends the highest-priority subset and the remainder appear in a subsequent heartbeat.

RTT Measurement — Same-Clock Design

Each DATA and PARITY packet carries a sender timestamp in the fixed SenderTimestampNs header field (offset 0x18, 8 bytes, unix nanoseconds set at packet-build time). Non-data control packets leave this field zero.

Problem with prior versions: v3.x had the receiver set EchoTimestampNs = receiver's time.Now(). The sender then computed RTT = sender_now − receiver_now, a cross-machine clock comparison. A 4-second clock skew between machines produced a 5-second RTT estimate, permanently locking the NACK cooldown above the receiver's 5-second inactivity timeout and killing the transfer.

Fix: The receiver now echoes the sender's own timestamp verbatim: EchoTimestampNs = pkt.Header.SenderTimestampNs. The sender computes RTT = now_ns − EchoTimestampNs using only its own monotonic clock. Cross-clock error is eliminated entirely.

Frozen-Timestamp RTT Guard

Problem: When the sender is idle (honouring a NACK cooldown), the receiver keeps echoing the same stale SenderTimestampNs from the last packet it received. Each heartbeat makes RTT = now − staleTs grow by one heartbeat interval. After enough heartbeats the RTT inflates past 5 seconds, the NACK cooldown exceeds the receiver's inactivity timeout, and the transfer dies.

Fix: The TokenBucket tracks lastEchoNs — the highest EchoTimestampNs it has processed. RTT is updated only when echoNs > lastEchoNs. Stale repeated echoes are silently ignored; the RTT estimate stays locked at its last valid measurement until the sender transmits a new packet and the receiver reflects a fresh timestamp.
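The guard is a few lines of state. A Go sketch (the TokenBucket name comes from the spec; the method name and fields are assumed):

```go
package main

import "fmt"

type TokenBucket struct {
	lastEchoNs uint64 // highest EchoTimestampNs processed so far
	rttNs      uint64 // last valid RTT estimate
}

// OnHeartbeat updates RTT only when the echo is fresh. Stale repeats from
// an idle sender are ignored, so the RTT estimate stays locked instead of
// inflating by one heartbeat interval per heartbeat.
func (tb *TokenBucket) OnHeartbeat(echoNs, nowNs uint64) {
	if echoNs == 0 || echoNs <= tb.lastEchoNs {
		return // no data packets echoed yet, or stale repeat
	}
	tb.lastEchoNs = echoNs
	tb.rttNs = nowNs - echoNs // same-clock subtraction: no skew error
}

func main() {
	tb := &TokenBucket{}
	tb.OnHeartbeat(1_000_000, 41_000_000)  // fresh echo: RTT = 40ms
	tb.OnHeartbeat(1_000_000, 141_000_000) // stale repeat: ignored
	fmt.Println(tb.rttNs)
}
```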

C. Loss-Driven Rate Adjustment Algorithm

The sender adjusts its rate based on the LossRate reported in each Heartbeat. The effective delivery rate (rawEffective = NetworkDeliveryRate) serves as a ceiling for decreases. StorageFlushRate is still reported in heartbeats for observability but is no longer an input to the rate controller: the receiver uses a pre-allocated full-file ring buffer, so disk lag cannot cause packet loss. Including StorageFlushRate in the minimum caused out-of-order packets to stall the contiguous flush frontier to 0, making StorageFlushRate ≈ 0 and falsely triggering the delivery-collapse guard on every heartbeat. Rate increases are gated to once per RTT to prevent the sender from making dozens of upward adjustments before any feedback arrives (critical on high-latency links like satellite).

RTT-Aware Rate Gating

The sender tracks the time of its last rate increase. When a heartbeat signals an increase, the sender checks whether at least one RTT has elapsed since the previous increase. If not, the increase is suppressed (treated as a hold). Decreases are not gated — they are applied immediately upon consecutive loss signals to clear the pipe as fast as possible.

Phased Growth Model

The congestion controller operates in two phases, similar to TCP's slow start and congestion avoidance but adapted for loss-driven UDP with FEC:

Phase 1 — Probe (Multiplicative Increase): While loss is < 1%, the sender has never observed the link ceiling. It probes aggressively with multiplicative increase, applied once per RTT:

S_new = S_current × 1.25

This is more conservative than the v3.0 1.5× multiplier but is applied per RTT rather than per heartbeat, giving the network time to signal back before each step.

Phase 2 — Congestion Avoidance (Additive Increase): Once the sender observes loss entering the 1%–5% hold zone for the first time, it has found the approximate ceiling of the link. The controller permanently transitions to Phase 2 and never returns to Phase 1 for this session. Probing uses additive increase, applied once per RTT:

S_new = S_current + (MaxPayload / RTT)

This adds approximately one packet per RTT of additional bandwidth, gently probing for headroom without risking a burst of loss.

Decision Logic

Let L = reported LossRate in basis points, E = effective delivery rate (NetworkDeliveryRate — see §6C rationale above), S = current send rate. On each Heartbeat reception, the delivery-collapse guard is checked first:

Condition Action Rationale
NACKCount > 0 AND E < S × 0.25 Hold + permanently transition to Phase 2. Evaluated before loss thresholds. OS socket buffer overflow. Packets are dropped before reaching the receiver application — reported LossRate stays 0% (no FEC failures counted) while delivery collapses. NACKs confirm real packet loss. Entering Phase 2 fires the 1.5× ceiling immediately, cutting the target rate to near actual link capacity. Threshold lowered from 50% to 25%: on high-latency paths (50ms+ RTT) approximately 50% of packets are legitimately in-flight during warm-up, causing the old 50% threshold to fire prematurely on measurement lag.
L < 100 (loss < 1%) Increase (once per RTT): Phase 1: S × 1.25. Phase 2: S + MaxPayload/RTT. Link has headroom. FEC absorbs transient loss. RTT gating prevents runaway probing on high-latency paths.
100 ≤ L ≤ 500 (1% – 5%) Hold: S = S. If first time entering this zone, transition to Phase 2 permanently. FEC is handling the loss. The link ceiling has been discovered. Switch to additive probing from this point forward.
L > 500 (loss > 5%), consecutive confirmation Decrease: S = smoothed(E) × 0.85 Drop to 85% of the EWMA-smoothed effective delivery rate. The 15% undershoot allows router queues to drain; FEC bridges the gap during recovery. Requires two consecutive above-threshold heartbeats to trigger (see below).
Design Rationale (v3.1 / v4.0): Five cumulative changes. (1) The v3.0 decrease formula E × 1.05 set the new rate above the rate that just caused severe loss, sustaining congestion. Dropping to E × 0.85 gives queues time to drain. (2) The 1.5× multiplicative increase per heartbeat was replaced with 1.25× per RTT — on a Starlink link with 100ms heartbeats and 40ms RTT, the old algorithm made 2.5 increases per RTT, compounding to ~1.95× per RTT. (3) The permanent transition to additive increase after discovering the link ceiling prevents repeated boom-bust oscillation at the capacity boundary. (4) StorageFlushRate removed from effective-rate formula (v4.0): pre-allocated ring buffers always stall flush at 0 for out-of-order arrivals, making min(NetworkDeliveryRate, StorageFlushRate) ≈ 0 and permanently tripping the delivery-collapse guard. (5) Delivery-collapse threshold lowered 50%→25% (v4.0): legitimate in-flight packets on 50ms+ RTT paths account for ~50% of the window, causing false collapses during ramp-up with the old threshold.
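The decision table condenses into a small state machine. A Go sketch (the Controller struct, field names, and method shape are assumptions; thresholds and formulas follow §6C, and RTT gating of increases is elided for brevity):

```go
package main

import "fmt"

const MaxPayload = 1368.0

type Controller struct {
	phase2         bool    // permanent once entered
	decreaseStreak int     // consecutive >5%-loss heartbeats
	rttSec         float64 // current RTT estimate, seconds
}

// OnHeartbeat returns the new send rate given loss (basis points), NACK
// count, the raw and EWMA-smoothed effective delivery rate E, and the
// current send rate s (all rates in bytes/sec).
func (c *Controller) OnHeartbeat(lossBps uint16, nacks int, rawE, smoothedE, s float64) float64 {
	// Delivery-collapse guard: evaluated before the loss thresholds.
	if nacks > 0 && rawE < s*0.25 {
		c.phase2 = true
		c.decreaseStreak = 0
		return s // hold; the Phase 2 1.5x ceiling applies elsewhere
	}
	switch {
	case lossBps < 100: // < 1% loss: increase (once per RTT)
		c.decreaseStreak = 0
		if c.phase2 {
			return s + MaxPayload/c.rttSec // additive: ~1 packet per RTT
		}
		return s * 1.25 // multiplicative probe
	case lossBps <= 500: // 1%–5%: hold; the ceiling has been found
		c.phase2 = true
		c.decreaseStreak = 0
		return s
	default: // > 5%: decrease only on the second consecutive signal
		c.decreaseStreak++
		if c.decreaseStreak >= 2 {
			c.decreaseStreak = 0
			return smoothedE * 0.85
		}
		return s
	}
}

func main() {
	c := &Controller{rttSec: 0.04}
	fmt.Println(c.OnHeartbeat(0, 0, 90, 90, 100)) // Phase 1 probe: 125
}
```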

EWMA Smoothing

Raw delivery rate measurements from individual heartbeats are noisy due to timing jitter, especially on high-latency links. The sender maintains an exponentially weighted moving average (EWMA) of the effective delivery rate:

smoothed = α × raw_sample + (1 − α) × smoothed_previous

The default smoothing factor is α = 0.3, which provides moderate dampening (converges in ~3 samples). The smoothed rate is used as the target when decreasing, preventing single-heartbeat noise from crashing the rate.
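The update is a one-liner; a Go sketch with the default α = 0.3, showing convergence over a few samples:

```go
package main

import "fmt"

const alpha = 0.3 // default smoothing factor (§6C)

// Smooth applies one EWMA update: alpha weights the new sample,
// (1 - alpha) carries the history.
func Smooth(raw, prev float64) float64 {
	return alpha*raw + (1-alpha)*prev
}

func main() {
	s := 100.0
	for _, raw := range []float64{50, 50, 50} {
		s = Smooth(raw, s) // drops toward 50: 85, 74.5, 67.15
	}
	fmt.Println(s)
}
```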

Consecutive Decrease Requirement

A single heartbeat reporting > 5% loss may be a transient spike (e.g., a router briefly queuing). The sender requires two consecutive above-threshold heartbeats before executing a decrease. The first signal starts a "decrease streak" counter; the second confirms it. Any increase or hold resets the streak to zero.

Auto-Ceiling

The ceiling is two-tiered based on phase:

Phase 1:  if rate > peakDeliveryRate × 4.0: rate = peakDeliveryRate × 4.0
Phase 2:  if rate > peakDeliveryRate × 1.5: rate = peakDeliveryRate × 1.5

Phase 1 (4× — runaway prevention): During the multiplicative probe, delivery-rate measurements lag the send rate because the sender increases 25% per heartbeat and the receiver's measurement window has not stabilised. For example, at a 7.63 MB/s send rate the receiver may report only 4.19 MB/s delivery. A tight multiplier like 1.5× would fire immediately, giving a 6.28 MB/s ceiling below the current rate and locking the sender at ~5.68 MB/s for the entire transfer on a 110 MB/s Gigabit link. The generous 4× multiplier prevents this while still bounding the exponential: on a clean link where FEC absorbs all drops (LossRate remains 0% throughout), Phase 2 is never entered and without any Phase 1 ceiling the target rate grows without bound (observed: 345 trillion MB/s). With 4×, the target is capped at ~400 MB/s on a Gigabit LAN — effectively the same as nodelay (pacing is disabled at that rate anyway), but without the absurd log output.

Phase 2 (1.5× — avoidance bound): Once Phase 2 is entered, the delivery rate was measured near actual link capacity — the loss event that triggered the transition occurred at or near the ceiling — so 1.5× is a tight and reliable upper bound for the additive probing that follows.

Implementation Note: Earlier revisions gated the ceiling on a fixed warmup period (originally 3 heartbeats, later extended to 5) to avoid locking in a low peakRate from cold-start measurements. This approach was superseded by Phase 2 gating, which is more principled: the ceiling is irrelevant during Phase 1 (the sender is still discovering the link ceiling) and correct during Phase 2 (the delivery rate was measured near capacity). The warmup constant no longer exists in the implementation.

D. Deficit-Accumulator Pacing

The sender must convert the target rate (bytes/sec) into inter-packet timing. The naive approach — computing a per-packet interval and sleeping for that duration — fails in practice because OS timer granularity (~1ms on Windows, ~100µs on Linux) and language-runtime preemption (e.g., Go's asynchronous goroutine preemption, or signal-based preemption in C with certain threading models) make sub-millisecond sleeps unreliable.

Instead, the sender uses a deficit accumulator:

This produces the correct long-term average rate without relying on sub-millisecond timer precision.
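The accumulator can be sketched as a pure step function, which also makes it testable without real clocks: each pass converts elapsed wall time into a byte budget, sends whole packets while the budget covers them, and carries the remainder forward. Names are illustrative; the real sender wraps this in a loop with one coarse sleep per pass.

```go
package main

import "fmt"

type Pacer struct {
	rate    float64 // target bytes/sec
	deficit float64 // accumulated byte budget (the "timing debt")
}

// Step credits elapsedSec seconds of budget and returns how many packets
// of size pktSize may be sent now; the remainder persists to the next pass,
// so the long-term average rate is exact despite coarse timers.
func (p *Pacer) Step(elapsedSec float64, pktSize int) int {
	p.deficit += p.rate * elapsedSec
	n := 0
	for p.deficit >= float64(pktSize) {
		p.deficit -= float64(pktSize)
		n++
	}
	return n
}

func main() {
	p := &Pacer{rate: 1_000_000} // 1 MB/s target
	// A coarse 10ms tick credits 10,000 bytes: 7 full 1368-byte packets,
	// with the ~424-byte remainder rolling into the next tick.
	fmt.Println(p.Step(0.010, 1368), p.deficit)
}
```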

C Implementation Note: Linux clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME) provides ~50–100µs precision on default kernels, significantly better than Go's ~1ms floor. For sub-100µs precision (useful at 10 GbE+ rates), run the sender thread under SCHED_FIFO real-time scheduling via sched_setscheduler(). The deficit accumulator is still the correct architecture — even with precise timers, per-packet nanosleep at 67,000 pps would burn CPU on syscall overhead. The accumulator batches the timing debt and issues one sleep per deficit threshold crossing, regardless of timer precision.
Design Rationale: The v2.0 spec described a "dynamic Token Bucket" with "microsecond interval between packet dispatches." The Go prototype initially used runtime.Gosched()-based busy-wait spin loops for sub-ms pacing. Go 1.14+ introduces asynchronous goroutine preemption that signals goroutines at safe points — even inside tight loops — causing the spin to overshoot to ~1ms per packet. At 1400 bytes/1ms = 1.4 MB/s, this created an artificial throughput ceiling regardless of the target rate. The deficit accumulator was developed to solve this without platform-specific timer hacks.

E. NACK-Driven Retransmission

When the sender receives a NACK array in a Heartbeat, it queues the identified packets for retransmission. Retransmitted packets carry their original SequenceNum and BlockGroup.

Retransmissions are interleaved with forward progress: the sender processes at most 3 NACKed packets per send-loop iteration, then sends the next new data packet. This prevents NACK storms (e.g., 169 NACKs on a satellite link) from monopolizing bandwidth and stalling seqNum advancement.

The pending NACK set must be maintained as a deduplicated set (hash set or bitset), not a FIFO queue. Each heartbeat may report the same sequence numbers as the previous one (the receiver keeps NACKing until the packet arrives). If the same sequence is appended to a plain list on every heartbeat, the retransmit queue grows without bound. A set ensures each sequence is queued at most once regardless of how many heartbeats repeat it. If a NACK arrives for a sequence that has already been pruned from the sender's chunk cache (because HighestContiguous advanced past it via FEC recovery), the sender silently skips it — the receiver has already recovered the packet and the NACK is stale.
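A minimal sketch of the deduplicated pending set with FIFO ordering — the type and method names here are illustrative, not the prototype's:

```go
package main

import "fmt"

// nackQueue holds pending retransmit sequences as a deduplicated set plus
// a FIFO order, so repeated heartbeats cannot grow the queue without bound.
type nackQueue struct {
	pending map[uint64]struct{}
	order   []uint64
}

func newNackQueue() *nackQueue {
	return &nackQueue{pending: make(map[uint64]struct{})}
}

// Add queues seq at most once, no matter how many heartbeats repeat it.
func (q *nackQueue) Add(seq uint64) {
	if _, dup := q.pending[seq]; dup {
		return
	}
	q.pending[seq] = struct{}{}
	q.order = append(q.order, seq)
}

// PopN removes up to n sequences (n = the 3-per-iteration retransmit cap).
func (q *nackQueue) PopN(n int) []uint64 {
	if n > len(q.order) {
		n = len(q.order)
	}
	out := q.order[:n]
	q.order = q.order[n:]
	for _, s := range out {
		delete(q.pending, s)
	}
	return out
}

func main() {
	q := newNackQueue()
	for _, seq := range []uint64{101, 102, 101, 102, 101} { // repeated heartbeats
		q.Add(seq)
	}
	fmt.Println(q.PopN(3)) // duplicates collapsed: [101 102]
}
```

A popped sequence may legitimately be re-added later if a subsequent heartbeat still reports it missing, which is why `PopN` deletes from the set.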

Implementation Note: The v2.0 spec stated that retransmissions are "injected ahead of new data." In practice, draining the entire NACK queue before sending any new data caused a fatal stall on Starlink: at 0.38 MB/s with 169 NACKs, each send-loop iteration spent ~0.62s on retransmits, and seqNum never advanced. The 3-per-iteration cap ensures the transfer always makes forward progress even under heavy loss.

7. Session Timeout and Failure Recovery

A. Receiver Inactivity Timeout

If the receiver does not receive any packets (data, parity, or retransmission) for a period of 5 consecutive expected heartbeat intervals, with a minimum floor of 5 seconds, it declares the session dead:

Implementation Note: The 5-second floor was added after integration testing showed that at the lowest heartbeat tier (100ms), the computed timeout of 500ms was too aggressive — the natural gap between the sender finishing its data blast and NACK retransmissions arriving through a lossy proxy would trigger a false timeout. The floor gives the NACK retransmission cycle time to recover.

B. Sender Inactivity Timeout

If the sender does not receive a Heartbeat for 5 consecutive expected heartbeat intervals, it assumes the receiver has failed or the return path is broken:

8. Graceful Teardown (The Final Handshake)

To prevent "tail drop" issues where the final packets are lost and the connection deadlocks, the protocol implements a synchronized teardown with timeout-driven cleanup on both sides.

A. Normal Teardown Sequence

B. Teardown NACK Handling

During the teardown wait (after all data is sent, before TRANSFER_COMPLETE arrives), the sender continues to process Heartbeat packets synchronously. If a Heartbeat contains NACKs, the sender retransmits the requested packets from the sliding window ring buffer and resets the read deadline. This ensures the receiver can complete even if some late packets were lost.

Teardown retransmits are paced through the token bucket at the congestion controller's current rate. Without pacing, a backlog of queued Heartbeats (e.g., after a brief receive gap) can be drained all at once, causing the sender to fire hundreds of retransmit packets in tens of milliseconds — a burst that overwhelms the same congested link that caused the NACKs in the first place. See Lesson J for the observed failure case.

RTT-Aware NACK Cooldown

Problem: The teardown retransmit loop had no cooldown. On a 50ms-RTT path with a 50ms heartbeat interval, every in-flight heartbeat triggered a redundant retransmit of the same lost packets. The retransmit flood caused fresh congestion, which caused more NACKs, creating a self-reinforcing spiral. Observed: 59,908 reported NACKs for approximately 780 actual losses.

Fix: A nackCooldown map[uint64]time.Time gates each sequence number to at most one retransmit per RTT × 1.25 (RTT plus 25% margin). The map is seeded with all NACKs outstanding at the moment the main send loop ends. On each teardown heartbeat, a sequence is only retransmitted if its cooldown timestamp has elapsed; otherwise it is silently skipped until the next eligible window.

Tail-Drop Deadlock Prevention

Problem: The receiver's NACK scan window is bounded by HighestReceived. If the last packets of the file are dropped, HighestReceived never advances to the end of the file, so the receiver's NACK list is empty. The sender sees 0 NACKs, sends nothing, and the receiver hits its 5-second inactivity timeout — a deadlock neither side can break without external intervention.

Fix: In the teardown loop, if a heartbeat arrives with NACKCount == 0 but hb.HighestContiguous < totalChunks−1, the sender proactively computes up to 167 missing tail sequences (from HighestContiguous+1 through totalChunks−1) and injects them into the retransmit pipeline. These injected sequences flow through the NACK cooldown gate exactly like receiver-reported NACKs, preventing the same tail sequences from being re-injected on every heartbeat.
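The tail computation itself is simple; a hedged sketch (function name is illustrative, and handling of a never-advanced HighestContiguous sentinel is left to the caller):

```go
package main

import "fmt"

// missingTail computes the sequences the receiver cannot NACK: everything
// from highestContiguous+1 through totalChunks-1, capped at limit (the
// spec's 167-per-heartbeat injection limit).
func missingTail(highestContiguous, totalChunks uint64, limit int) []uint64 {
	var seqs []uint64
	for seq := highestContiguous + 1; seq < totalChunks && len(seqs) < limit; seq++ {
		seqs = append(seqs, seq)
	}
	return seqs
}

func main() {
	fmt.Println(missingTail(96, 100, 167)) // the dropped tail: [97 98 99]
}
```

The returned sequences are then fed through the same cooldown gate as receiver-reported NACKs, so re-injection on every heartbeat is harmless.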

Teardown Micro-Burst Prevention

Problem: At high transfer speeds (100+ MB/s) the TokenBucket's 2ms burst allowance is approximately 200 KB. A full 167-packet retransmit batch is approximately 232 KB at MaxPayload = 1368 bytes — it fires as a near-simultaneous burst, flooding the OS UDP socket buffer and the serve daemon's 256-slot receive channel, causing most retransmits to be silently dropped.

Fix: Teardown retransmits are chunked into batches of 10 packets with a 2ms sleep between batches. 167 packets are spread over approximately 34ms — invisible to the user, well within any heartbeat interval, and guaranteed to fit through any buffer in the path.

C. Hash Mismatch

If the receiver completes all data reception but the xxHash64 does not match the expected value from SESSION_REQ:

9. Development Phases

Progress Bar — Repair State

Once the main send loop completes and the sender enters the teardown loop (§8B), the progress bar changes from the normal 100% | 40.0 MB/s | NACKs: N display to 100% | Repairing... | NACKs: N. This tells the user the network is actively recovering dropped tail packets rather than hanging. The repair state persists until TRANSFER_COMPLETE is received or the teardown timeout expires.

  1. Phase 1 (Go Prototype) — COMPLETE: Validated FEC mathematics (GF(2⁸) Reed-Solomon), Heartbeat state machine, adaptive FEC tuning, wire-speed calibration burst, loss-driven congestion control with deficit-accumulator pacing, NACK retransmission with forward-progress interleaving. Proved 100% file integrity under simulated packet loss at 0%, 1%, 5%, 10%, 15%. Tested on LAN (1 Gbps Ethernet, 41 MB/s throughput) and WAN (Starlink satellite, variable latency/loss). 86 unit tests across protocol, sender, and receiver packages. Phase 1 throughput bottleneck identified: per-packet conn.Write() syscall overhead limits Go to ~30–41 MB/s on LAN versus 93 MB/s for FTP (which uses kernel-level TCP segmentation with a single large write()).
  2. Phase 2 (C Productionization): Translate the validated logic into C. Key optimizations:
Phase 1 Throughput Analysis: FTP achieves 93 MB/s on the same LAN because TCP's write() pushes megabytes at once — the kernel handles segmentation into ~1500-byte frames internally. HP-UDP's per-packet conn.Write() makes ~67,000 syscalls/sec at full speed, each costing a user-kernel context switch. This is not a fundamental protocol limitation — it's a syscall-overhead problem that sendmmsg() batching in Phase 2 will eliminate.

10. Lessons Learned (Phase 1)

The following empirical findings emerged during Phase 1 implementation and testing. They are documented here to guide the Phase 2 C port and future protocol revisions.

A. Delivery-Rate-Ratio CC Is Fundamentally Broken

The v2.0 algorithm increased the rate only when EffectiveRate ≥ 0.95 × SendRate. Since delivery rate is bounded by send rate and measurement windows never align perfectly, this ratio consistently falls below 0.95 even on a lossless link. The sender spirals to the rate floor. Loss rate is the correct primary signal.

B. Sub-Millisecond Pacing Is Unreliable in Userspace

Both time.Sleep(<1ms) and busy-wait spin loops fail for sub-millisecond pacing on Windows (minimum ~1ms granularity) and on any platform using Go 1.14+ (asynchronous goroutine preemption interrupts tight loops at ~1ms intervals). The deficit accumulator sidesteps this entirely by sleeping only when the accumulated deficit justifies a ≥1ms sleep. The C port should use clock_nanosleep() or similar, but should still avoid relying on sub-millisecond precision for correctness.

C. Calibration Burst Must Not Flood the Link

A 50 MB/s starting rate on a Starlink connection (~10 Mbps effective uplink) caused massive packet loss during the first 100ms, which poisoned the peakRate measurement and locked the auto-ceiling at ~0.38 MB/s for the entire transfer. The starting rate must be conservative enough for the worst expected link (2 MB/s default), while the calibration burst itself runs at wire speed to discover the actual capacity.

D. NACK Storms Stall Forward Progress

On a satellite link with ~30ms RTT and 5% loss, each heartbeat reported ~169 NACKs. Processing all NACKs before each new data packet caused the send loop to spend its entire bandwidth on retransmissions, preventing seqNum from advancing. Capping retransmissions at 3 per iteration restored forward progress.

E. Early Delivery-Rate Measurements Are Unreliable

The first few heartbeats arrive during or immediately after the calibration burst, when the receiver is still allocating buffers and the network path hasn't stabilized. More broadly, during Phase 1 ramp-up the receiver's measurement window hasn't caught up to the sender's current rate — at a 7.63 MB/s send rate the receiver may only report 4.19 MB/s delivery because the sender had only been at that rate for one 100ms heartbeat interval. Any ceiling derived from these measurements will be artificially low. The correct solution is to gate the auto-ceiling on Phase 2 entry rather than a fixed warmup period: by the time Phase 2 is entered, the sender has been near the link ceiling long enough for delivery measurements to be meaningful.

F. Socket Ownership During Teardown

The sender's heartbeat listener goroutine and the teardown synchronous read loop compete for the same socket. If the goroutine is still running when the sender enters teardown, it consumes packets (including TRANSFER_COMPLETE) that the teardown loop needs. The goroutine must be stopped before entering teardown, and any queued NACKs must be drained synchronously.

G. Per-Packet Syscall Overhead Is the Phase 1 Bottleneck

FTP achieves 93 MB/s on the same Gigabit LAN where HP-UDP reaches ~30–41 MB/s. The difference is not the protocol or the language — it's the syscall pattern. FTP writes large buffers to a TCP socket; the kernel segments them into packets internally. HP-UDP calls conn.Write() for every 1400-byte packet, requiring ~67,000 user-kernel context switches per second at full speed. Each syscall costs ~15µs of overhead, consuming nearly 100% of available CPU time at target throughput. The Phase 2 C port must use sendmmsg()/recvmmsg() or equivalent batching to amortize this cost across 16–64 packets per syscall.

H. Go-Specific Memory Pressure

Three allocation patterns in the Go prototype created significant memory and CPU overhead that is not visible in the protocol design but directly impacted measured throughput. All three were fixed in Phase 1 and their equivalents must be avoided in the Phase 2 C port.

  1. FEC Encoder Construction Cost. NewRSEncoder(k, m) builds a k×k Vandermonde matrix and inverts it using O(k³) GF(2⁸) operations. For the default block size of k=100 this is ~2 million GF operations, measured at ~4 ms per call. With ~1,720 FEC blocks in a 237 MB transfer, constructing a fresh encoder for each block costs ~6.9 seconds of CPU time — roughly equal to the entire transfer duration at 35 MB/s. The fix is to cache the encoder keyed on (dataShards, parityShards) and reuse it across blocks. The matrix is deterministic for a given (k, m) pair and encoding only reads from it (no mutation), so the cached instance is safe for concurrent use. The Phase 2 C port must pre-build encoder matrices at session start and reuse them.
  2. FEC Shard Buffer Allocations. Each data packet that enters the FEC encoder requires a MaxPayload-sized buffer padded to equal length before RS encoding. At the default block size of 100 and a 35 MB/s transfer rate, this produces ~25,000 allocations per second, generating ~470 MB of heap churn per 237 MB transfer and placing the garbage collector on the critical path. The fix is a sync.Pool of pre-allocated MaxPayload-sized buffers checked out at encode time and returned immediately after the parity computation completes. The Phase 2 C port should maintain a fixed pool of shard-sized stack or heap buffers reused across blocks.
  3. Unbounded Retransmit Cache Growth. The original sender cached every transmitted chunk in a map[uint64][]byte to service NACK retransmissions. Without eviction, this map retained all ~172,000 chunks for a 237 MB transfer, consuming ~470 MB of heap memory for the entire session duration. Fixed in Phase 1: replaced with a bounded SlidingWindow ring buffer (50,000 slots, ~68 MB peak). Entries are evicted when HighestContiguous advances (received from each heartbeat), since the receiver has already confirmed contiguous receipt up to that point and will never NACK those sequences. When the window is full (all 50,000 un-acknowledged slots occupied), the sender pauses sending new packets until the receiver's HighestContiguous advances and frees slots — providing memory-bounded backpressure. The Phase 2 C port should use the same ring-buffer pattern with a power-of-2 slot count (65,536 recommended, ~89 MB peak) so that index wrapping uses a bitmask (idx & 0xFFFF) instead of a modulo operation — eliminating a division on the hot path for every packet sent and every HighestContiguous advance.
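The power-of-2 ring-buffer indexing recommended in item 3 can be sketched as follows. The type and field names are illustrative, and the `hc` handling is simplified (the MaxUint64 sentinel discussed in Lesson K is elided):

```go
package main

import "fmt"

// A power-of-2 sliding window: index wrapping is a bitmask (seq & mask)
// instead of a modulo, removing a division from the per-packet hot path.
const (
	windowSlots = 1 << 16 // 65,536, Appendix A's recommendation for C ports
	windowMask  = windowSlots - 1
)

type slidingWindow struct {
	slots [windowSlots][]byte
	hc    uint64 // highest contiguous sequence confirmed by the receiver
}

func (w *slidingWindow) Store(seq uint64, chunk []byte) {
	w.slots[seq&windowMask] = chunk // idx & 0xFFFF — no modulo
}

func (w *slidingWindow) Get(seq uint64) []byte {
	return w.slots[seq&windowMask]
}

// IsFull reports whether seq would overwrite a slot the receiver has not
// yet confirmed; when true the sender pauses (memory-bounded backpressure).
func (w *slidingWindow) IsFull(seq uint64) bool {
	return seq >= w.hc+windowSlots
}

func main() {
	w := &slidingWindow{}
	w.Store(70000, []byte("chunk"))
	fmt.Println(string(w.Get(70000)), w.IsFull(65535), w.IsFull(65536))
}
```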

I. Auto-Ceiling Overshoot Causes Persistent NACK Storms

With the original auto-ceiling multiplier of 4×, the target rate on a Gigabit LAN reached 396 MB/s (4× a measured peak delivery of ~99 MB/s). At this target the token bucket's pacing budget was so large that no sleep was ever triggered — the sender fired packets as fast as the CPU allowed. The OS NIC queue and the receiver's socket buffer were overwhelmed, systematically dropping the same ~167 packets per heartbeat interval. Because these drops were clustered within FEC blocks (the sender had burst through several consecutive blocks before the receiver could drain them), the parity overhead was insufficient to recover them. The result: every heartbeat for the entire transfer carried the same 167-entry NACK list, the sender retransmitted them repeatedly, and the teardown phase required ~3.5 seconds of retransmit cycles before the receiver could recover all blocks and issue TRANSFER_COMPLETE.

Two fixes address this together:

  1. Two-tier ceiling (Phase 1: 4×, Phase 2: 1.5×). The fix went through two iterations. Applying a uniform 1.5× ceiling during Phase 1 ramp-up caused the opposite problem: with a 4.19 MB/s peak delivery measurement at heartbeat 6, a 1.5× ceiling produced a 6.28 MB/s cap — below the current send rate — locking the sender at ~5.68 MB/s for a 42-second transfer on a 110 MB/s Gigabit link. Making the ceiling Phase 2-only solved the ramp-up problem but exposed a new failure: on a clean Gigabit LAN where FEC absorbs all OS socket buffer drops, LossRate remains 0% throughout, Phase 2 is never entered, and without any ceiling the multiplicative probe compounded to 345 trillion MB/s (observed). The two-tier ceiling resolves both: Phase 1 uses 4× as a loose runaway brake (capping at ~400 MB/s on a Gigabit LAN, which effectively disables pacing without absurd log output), and Phase 2 uses 1.5× as a tight avoidance bound (by Phase 2 entry the delivery rate was measured near actual link capacity, so 1.5× is meaningful).
  2. NACK deduplication. The sender's retransmit queue was a plain slice. Each heartbeat appended the full NACK list again (167 entries × every ~25 ms = the queue grew without bound). The fix is to use a set (hash map of pending sequence numbers): a sequence already waiting for retransmission is not added a second time. This prevents the queue from accumulating thousands of stale duplicate entries and ensures the 3-retransmits-per-iteration cap is spent on distinct missing packets. The Phase 2 C port should maintain a fixed-size bitset or hash set of pending NACK sequences rather than a FIFO queue.

J. OS Socket Buffer Drops Are Invisible to the Loss Rate Signal (LFN)

On a real Long Fat Network (1 GB file sent over a ~20 MB/s WAN link), Phase 1 multiplicative probing ramped the sender from 2 MB/s to 71 MB/s (the 4× Phase 1 ceiling) in under 5 seconds. The link could only sustain ~20 MB/s. The excess was absorbed by OS socket buffers, which then overflowed. From that point, packets were dropped at the OS layer before reaching the receiver application. The observed effect:

Additionally, a secondary failure compounded the problem during teardown: a 7-second gap in heartbeat reception caused a backlog to accumulate. When heartbeats resumed, the sender drained the entire queue at once — 39 calls to the retransmit function in 45ms, each firing all 167 packets at wire speed (~199 MB/s burst on a 20 MB/s link).

Two fixes address this:

  1. Delivery-collapse guard. Before evaluating the loss-rate thresholds, OnHeartbeat checks: NACKCount > 0 AND E < S × 0.25 (threshold lowered from 0.5 in v4.0 — see §6C). If both are true, the sender holds and permanently enters Phase 2. The Phase 2 ceiling (1.5× peak delivery) fires immediately, cutting the target from 71 MB/s to ~26 MB/s. With the rate near actual link capacity, the normal loss signals take over and back it down further. The NACKCount condition is critical — it prevents false holds on cold-start windows where delivery is transiently near zero but the link is healthy.
  2. Paced teardown retransmits. The retransmit function now accepts the token bucket and calls Pace() for each packet. A backlog of 39 queued heartbeats retransmitting 167 packets each is spread over ~600ms at 15 MB/s rather than firing in 45ms, giving the receiver time to process them and advance HighestContiguous.
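The delivery-collapse guard from fix 1 is a single predicate; a hedged sketch (function name is illustrative, the 25% threshold and NACKCount condition are from the spec):

```go
package main

import "fmt"

// deliveryCollapsed is the Lesson J guard: hold (and enter Phase 2
// permanently) when NACKs are present AND measured delivery has fallen
// below 25% of the send rate. Requiring nackCount > 0 prevents false
// holds on cold-start windows where delivery is transiently near zero
// on a healthy link.
func deliveryCollapsed(nackCount int, effectiveRate, sendRate float64) bool {
	return nackCount > 0 && effectiveRate < sendRate*0.25
}

func main() {
	// The observed LFN failure shape: 71 MB/s target, ~15 MB/s delivered.
	fmt.Println(deliveryCollapsed(167, 15e6, 71e6)) // true: hold, enter Phase 2
	fmt.Println(deliveryCollapsed(0, 0.5e6, 71e6))  // false: cold-start window
}
```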

K. Backpressure Starves NACK Retransmits (Window-Full Deadlock)

Problem: When the sender's sliding window fills (50,000 un-acknowledged slots), the main send loop must pause to avoid growing memory without bound. The original implementation used a bare for sw.IsFull(seqNum) { time.Sleep(1ms) } inner loop. While spinning there, the outer loop never returned to its NACK-processing step at the top. If the very first DATA packet was dropped, HighestContiguous on the receiver stayed at 0. Because Advance(0) is a no-op (guarding against the zero-value case), hc in the sliding window stayed at its sentinel value (MaxUint64). With that sentinel, IsFull returned true at exactly seq 50,000, which corresponds to ~68 MB — roughly 6% of a 1 GB file. The sender froze, NACKs queued in nackPending went unserviced, and the receiver detected no incoming packets for 5 seconds and declared an inactivity timeout.

Fix: Replace for sw.IsFull(seqNum) { sleep } with if sw.IsFull(seqNum) { sleep; continue }. The continue jumps back to the top of the outer for seqNum < totalChunks loop, so every backpressure iteration still drains nackPending and retransmits pending sequences before sleeping. The retransmit of the lost first packet allows HighestContiguous to advance on the next heartbeat, which unblocks IsFull and resumes the main send loop normally.

Observed signature: Transfer ramps to full speed (100+ MB/s), freezes at approximately 6% of a 1 GB file, receiver reports inactivity timeout 5 seconds later. NACKs counter shows 0 during the freeze (the NACK retransmit loop never ran). The freeze point scales exactly with window size: windowSlots × MaxPayload / fileSize.
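The corrected loop shape can be demonstrated with a toy model. Everything here is illustrative (tiny 8-slot window, a heartbeat every 16 iterations, retransmits that always succeed) — the point is purely structural: because backpressure uses `continue` back to the top of the loop, the NACK drain still runs while the window is full, so the stall resolves:

```go
package main

import "fmt"

const slots = 8 // tiny window so the toy stalls quickly

// runTransfer simulates a transfer where packet 0 is dropped on first
// transmission. It returns the count of delivered packets and the number
// of backpressure iterations survived.
func runTransfer(total uint64) (deliveredCount uint64, stalls int) {
	delivered := make(map[uint64]bool)
	nackPending := make(map[uint64]bool)
	var hc uint64
	hasHC := false
	advance := func() { // receiver-side HighestContiguous recomputation
		for {
			next := uint64(0)
			if hasHC {
				next = hc + 1
			}
			if !delivered[next] {
				return
			}
			hc, hasHC = next, true
		}
	}
	isFull := func(seq uint64) bool {
		if !hasHC {
			return seq >= slots // sentinel case: nothing confirmed yet
		}
		return seq >= hc+1+slots
	}

	var seqNum uint64
	for iter := 1; seqNum < total && iter < 1000; iter++ {
		if iter%16 == 0 && !delivered[0] {
			nackPending[0] = true // heartbeat reports the lost first packet
		}
		for seq := range nackPending { // NACK drain: runs on EVERY iteration
			delivered[seq] = true // retransmit succeeds in this model
			delete(nackPending, seq)
		}
		advance()
		if isFull(seqNum) {
			stalls++ // real sender: time.Sleep(1 * time.Millisecond)
			continue // the fix: back to the top, NACK service keeps running
		}
		if seqNum != 0 { // packet 0 is dropped on its first transmission
			delivered[seqNum] = true
		}
		seqNum++
	}
	return uint64(len(delivered)), stalls
}

func main() {
	n, stalls := runTransfer(32)
	fmt.Printf("delivered %d/32 after %d backpressure stalls\n", n, stalls)
}
```

Replacing the `continue` with an inner `for isFull { sleep }` loop reproduces the deadlock: the drain step never runs again and the iteration cap is exhausted.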

11. Serve Daemon — Bidirectional File Hub

The serve daemon is a persistent single-lane UDP server that manages a file manifest and services both pull requests (clients fetch files) and push requests (clients deposit files). It listens on a single control port and handles one transfer at a time — concurrent requests receive SERVER_BUSY and may retry.

A. SESSION_REJECT Reason Codes

All SESSION_REJECT packets carry a single reason-code byte in the payload:

Code   Name                     Meaning
0x01   SESSION_ID_COLLISION     The submitted SessionID is already active on the server.
0x02   HASH_MISMATCH            Received file hash does not match the value declared in SESSION_REQ.
0x03   SERVER_BUSY              A transfer is already in progress; try again later.
0x04   FILE_NOT_FOUND           The requested filename is not in the serve manifest.
0x05   FILE_EXISTS              A push was rejected because the filename already exists on the server (no-overwrite policy).
0x06   ENCRYPTION_UNSUPPORTED   The receiver does not support encryption and received a request with the Encrypted flag (0x04) set.

B. PULL_REQ — Client-Initiated Pull (NAT Traversal)

The PULL_REQ mechanism allows a client behind NAT to retrieve a file from a serve daemon that has a public IP address, without any port-forwarding configuration on the client side.

Wire format: Packet type 0x07. Payload is a null-terminated UTF-8 filename with no fixed-size prefix.

Flow:

  1. Client generates a random SessionID (same CSPRNG path as §4A), binds a local UDP socket on an OS-assigned ephemeral port, and sends PULL_REQ to the server's control port. This outbound packet punches the NAT hole: the NAT mapping records client-ip:ephemeral-port → server-ip:control-port.
  2. Server receives PULL_REQ. If busy or filename not in manifest, sends SESSION_REJECT back to the client's address. Otherwise, it fires a normal SESSION_REQ to the client's address from a new outbound connection. Because the server's IP was the destination of the punching packet, restricted-cone NAT routers (which filter inbound traffic by source IP only — a common home router class) allow this inbound from the same IP on any source port.
  3. Client receives the SESSION_REQ on its bound socket, records the sender's address (server ephemeral port), and enters the normal receiver flow using its existing socket — no rebind required. The same socket that sent PULL_REQ carries the entire transfer.
  4. The server's sender thread uses the SessionID supplied in the PULL_REQ header, eliminating an extra round-trip for ID assignment.

C. PUSH_REQ / PUSH_ACCEPT — Client-Initiated Push

The push flow allows a client to deposit a file into the serve daemon's directory. Three security invariants are always enforced:

  1. Base-name-only rule: The filename from the PUSH_REQ payload is sanitized to its base name — the substring after the last / or \ separator (Go: filepath.Base(); C: manual reverse scan for both separators). Path traversal sequences such as ../../etc/passwd or /absolute/path are reduced to just the final filename component before any further processing.
  2. No-overwrite rule: If a file with the sanitized name already exists in the serve directory, the request is rejected with reason code FILE_EXISTS (0x05).
  3. Post-hash manifest promotion: The incoming file is written to filename.tmp during the transfer. Only after a successful TRANSFER_COMPLETE (xxHash64 verified) is the .tmp renamed to its final path and added to the live manifest. A failed or interrupted transfer leaves no partial file in the manifest.
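The base-name-only rule from invariant 1 can be sketched with the manual reverse scan the spec suggests for C (Go's filepath.Base only handles the native separator, so scanning both is safer for client-supplied names). `validPushName` is an illustrative helper, not the daemon's actual API; the rejection of empty, ".", and ".." results is an added safety assumption:

```go
package main

import "fmt"

// baseName reduces a client-supplied filename to the substring after the
// last '/' or '\' separator, neutralizing path traversal sequences.
func baseName(name string) string {
	for i := len(name) - 1; i >= 0; i-- {
		if name[i] == '/' || name[i] == '\\' {
			return name[i+1:]
		}
	}
	return name
}

// validPushName applies the base-name-only rule and rejects names where
// nothing usable survives sanitization (assumption: empty, ".", "..").
func validPushName(name string) (string, bool) {
	base := baseName(name)
	if base == "" || base == "." || base == ".." {
		return "", false
	}
	return base, true
}

func main() {
	fmt.Println(validPushName("../../etc/passwd")) // reduced to: passwd true
	fmt.Println(validPushName(`..\..\boot.ini`))   // reduced to: boot.ini true
	fmt.Println(validPushName("uploads/"))         // rejected: empty base
}
```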

PUSH_REQ wire format (type 0x08): 8-byte big-endian FileSize followed by a null-terminated filename. Total payload: 9 bytes minimum.

PUSH_ACCEPT wire format (type 0x09): 2-byte big-endian Port — the ephemeral UDP port the server has bound for the incoming data transfer.
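The two payloads above are small enough to encode directly. A minimal sketch using the wire formats as specified (function names are illustrative; packet-type bytes and headers are handled elsewhere):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodePushReq builds the PUSH_REQ payload: 8-byte big-endian FileSize
// followed by a null-terminated filename (9 bytes minimum).
func encodePushReq(fileSize uint64, filename string) []byte {
	buf := make([]byte, 8+len(filename)+1) // final byte is the 0x00 terminator
	binary.BigEndian.PutUint64(buf[:8], fileSize)
	copy(buf[8:], filename)
	return buf
}

// decodePushAccept extracts the 2-byte big-endian ephemeral data port
// from a PUSH_ACCEPT payload.
func decodePushAccept(payload []byte) (uint16, error) {
	if len(payload) < 2 {
		return 0, errors.New("PUSH_ACCEPT payload shorter than 2 bytes")
	}
	return binary.BigEndian.Uint16(payload[:2]), nil
}

func main() {
	req := encodePushReq(1<<20, "a.bin")
	fmt.Println(len(req)) // 8 (size) + 5 (name) + 1 (terminator) = 14
	port, _ := decodePushAccept([]byte{0x1f, 0x90})
	fmt.Println(port) // 8080
}
```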

Flow:

  1. Client generates a SessionID, sends PUSH_REQ to the control port containing the filename and declared file size.
  2. Server validates (not busy, base-name safe, no existing file), binds an ephemeral UDP data socket on port 0 (Go: net.ListenUDP(":0"); C: bind() with sin_port = 0, then getsockname() to retrieve the assigned port), and replies with PUSH_ACCEPT containing the assigned data port.
  3. Client extracts the data port, constructs server-host:data-port, and starts a normal HP-UDP sender using the same SessionID from step 1.
  4. Server runs a normal HP-UDP receiver on the data socket writing to filename.tmp. On success: atomic rename (rename()) from tmp to final path + manifest promotion under a write-lock. On failure: tmp is deleted and the slot is cleared.

D. Manifest Lifecycle

The manifest is a filename-to-absolute-path map built at daemon startup by a non-recursive directory scan. Symlinks and directories are excluded. Files added to the directory after startup are invisible until the daemon is restarted — this is an intentional security boundary. Successful push transfers atomically add their promoted file to the in-memory manifest under a write-lock (Go: sync.RWMutex; C: pthread_rwlock_t). PULL_REQ handlers acquire a read-lock when consulting the manifest.

Appendix A: Protocol Constants (Defaults)

Parameter | Default Value | Configurable
MTU Hard Cap | 1400 bytes (total) | No
Header Size | 32 bytes (4 × 64-bit aligned, includes SenderTimestampNs) | No
Max Payload | 1368 bytes unencrypted (MTUHardCap(1400) − HeaderSize(32)); 1352 bytes encrypted (1368 − GCM_TagSize(16)) | No
FEC Block Size | 100 data packets | Yes
FEC Initial Parity | 5% | Yes
FEC Tail Min Parity | 2 packets | Yes
Calibration Burst Size | 10 packets (packet train) | Yes
Calibration Burst Spacing | 0 (wire speed) | Yes
Default Starting Rate | 2 MB/s | Yes
EWMA Smoothing Factor (α) | 0.3 | Yes
Loss Threshold: Increase | < 100 bp (1%) | Yes
Loss Threshold: Hold / Phase Transition | 100–500 bp (1–5%) | Yes
Loss Threshold: Decrease | > 500 bp (5%) | Yes
Consecutive Decrease Signals | 2 | Yes
Phase 1 Increase Multiplier | 1.25× per RTT | Yes
Phase 2 Additive Increase | MaxPayload / RTT per RTT | Yes
Decrease Factor | 0.85× smoothed delivery rate | Yes
Auto-Ceiling Multiplier (Phase 1) | 4× peak delivery rate (runaway prevention) | Yes
Auto-Ceiling Multiplier (Phase 2) | 1.5× peak delivery rate (avoidance bound) | Yes
Deficit Accumulator Burst Cap | 2ms of credit | Yes
Deficit Sleep Threshold | ≥ 1ms | No (OS-dependent)
Max NACKs per Send Iteration | 3 | Yes
Rate Floor | 10 KB/s | Yes
Inactivity Timeout | max(5 × heartbeat interval, 5s) | Yes
Sender Probe Interval | 500ms | Yes
Sender Probe Timeout | 10 seconds | Yes
Linger Duration (both sides) | 3 seconds | Yes
Receiver Teardown Retries | 3 | Yes
Stale SessionID Reservation | 10 seconds | Yes
Max SESSION_REQ File Size | 1 TB | Yes
Teardown Batch Size | 10 packets per sleep | Yes
Teardown Batch Sleep | 2 ms | Yes
Delivery-Collapse Threshold | 25% of current send rate (was 50%) | Yes
NACK Cooldown Margin | RTT × 1.25 (RTT + 25%) | Yes
Tail-Drop Injection Limit | 167 sequences per heartbeat | Yes
Sliding Window Slots | 65,536 (2¹⁶, ~89 MB peak). Go prototype uses 50,000; C implementations should use a power-of-2 count for bitmask index wrapping. | Yes
Encryption Cipher | AES-128-GCM (128-bit key, 128-bit auth tag, 96-bit nonce) | No
GCM Tag Size | 16 bytes | No
GCM Nonce Size | 12 bytes (constructed from SessionID + PacketType + UniqueID; not transmitted) | No
Key Exchange | X25519 ephemeral (32-byte public key per side) | No
Key Derivation | HKDF-SHA256, salt = SessionID, info = "hp-udp-aes128-v5" | No

Appendix B: Revision Log