Changelog highlights from recent revisions:

- Encrypted flag (0x04) added to the packet header; it defaults to 0.
- 0x0A SESSION_ACCEPT — carries the receiver's ephemeral public key during the encrypted handshake key exchange.
- The calibration burst is encrypted in encrypted mode.
- 0x06 ENCRYPTION_UNSUPPORTED added as a SESSION_REJECT reason code.
- Byte-order guidance: Go uses binary.BigEndian; C uses htonl/ntohl on every field.
- time.Now().UnixNano() replaced with "monotonic nanosecond timestamp" plus platform guidance for C (clock_gettime(CLOCK_MONOTONIC)) and Go (time.Now().UnixNano()).
- An io_uring submission queue replaces the Go async flush goroutine; the single-threaded epoll event loop architecture eliminates the socket ownership race described in Lesson F.
- Timing guidance: clock_gettime(CLOCK_MONOTONIC) for elapsed time, clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME) for sleeps, optional SCHED_FIFO for sub-100µs precision.
- Ring-buffer indexing via bitmask (idx & 0xFFFF) replaces modulo, eliminating a division per packet on the hot path.
- Filename sanitization scans for both / and \ to extract the base name, replacing Go-specific filepath.Base().
- Locking primitives specified per platform (C: pthread_rwlock_t, Go: sync.RWMutex); ephemeral socket bind in C uses bind() with port 0.
- The sentChunks map[uint64][]byte was replaced by a bounded SlidingWindow ring buffer (50,000 slots, ~68 MB peak). Entries are evicted on each HighestContiguous advance from incoming heartbeats; the tail pointer never holds confirmed data. The sender blocks new packet sends — but continues processing NACK retransmits — when the window is full, providing natural memory-safe backpressure.
- Backpressure bug: the old loop was for sw.IsFull(seqNum) { time.Sleep(1ms) } — a bare spin that never returned to the NACK-processing step at the top of the outer loop. If the first DATA packet was dropped, HighestContiguous stayed at 0 (Advance(0) is a no-op), the window filled at exactly seq 50,000 (~68 MB = ~6% of a 1 GB file), and NACKs in nackPending went unserviced — causing receiver inactivity timeout. Fixed by replacing for IsFull { sleep } with if IsFull { sleep; continue }, so every backpressure iteration still executes NACK retransmits before sleeping.
- SenderTimestampNs (8 bytes, unix nanoseconds) added at offset 0x18. HeaderSize 24→32 bytes; MaxPayload 1376→1368 bytes. The sender timestamp is now in the fixed header (not the payload) for all DATA and PARITY packets.
- The receiver echoes pkt.Header.SenderTimestampNs verbatim as EchoTimestampNs. The sender computes RTT = time.Now().UnixNano() − EchoTimestampNs using only its own clock, eliminating cross-machine clock-skew error entirely.
- The TokenBucket tracks lastEchoNs; RTT is only updated when echoNs > lastEchoNs. Stale repeated echoes (sender idle during NACK cooldown) are silently ignored, preventing RTT inflation past the receiver's 5-second inactivity timeout.
- When HighestContiguous < totalChunks−1, the sender proactively injects up to 167 missing tail sequences into the retransmit pipeline, breaking the deadlock where tail drops prevent HighestReceived from advancing.
- rawEffective = NetworkDeliveryRate only. StorageFlushRate was removed from the rate formula; disk-flush stalls (always 0 due to pre-allocated ring buffer out-of-order writes) no longer falsely trigger the delivery-collapse guard.
- During tail recovery the progress display shows "Repairing..." instead of a speed figure, indicating tail recovery rather than a hang.
- Heartbeats gained the DispersionNs field and the EchoTimestamp field; the sender computes RTT for rate-gating.
- The decrease target changed from E × 1.05 to E × 0.85. The old formula sustained congestion by targeting above the capacity that caused loss; the new formula drops below to drain router queues, relying on FEC to bridge the gap.
- sendmmsg()/recvmmsg() batching, io_uring, and throughput analysis explaining the Phase 1 FTP speed gap (~30 MB/s vs 93 MB/s due to per-packet syscall overhead).

While TCP is the foundational workhorse of the internet, its general-purpose congestion algorithms inherently throttle performance on Long Fat Networks (LFNs).
HP-UDP was built to prove that it is possible to outperform TCP in raw throughput by replacing reactive safety nets with proactive, domain-specific algorithms.
This protocol democratizes high-speed data movement, giving developers and engineers the ability to send and receive massive files cleanly, reliably, and at maximum hardware limits. It is a rigorous demonstration of advanced systems engineering, built to prove what is possible when legacy constraints are stripped away.
HP-UDP is an application-layer file transfer mechanism built on top of UDP. The design is lean, avoids unnecessary overhead, and focuses intently on its primary goal of speed while ensuring the reliability required for production use.
The architecture is built upon five core pillars:
Note on Security Scope: HP-UDP v5.0 adds optional end-to-end encryption via ephemeral X25519 key exchange and AES-128-GCM per-packet encryption (§4.5), providing confidentiality, integrity, and perfect forward secrecy. Encryption is backward-compatible: unencrypted transfers still work when the Encrypted flag is unset. HP-UDP intentionally omits authentication — it does not verify the identity of the remote endpoint. In the target deployment environment (managed networks with known infrastructure behind SDNs), endpoint identity is established at the network layer. Optional certificate or pre-shared-key authentication may be added in a future revision as a separate concern.
The protocol utilizes a tightly packed, fixed-width 32-byte binary header for every datagram. The header is naturally aligned for 64-bit systems (four 8-byte words). The hard MTU cap is 1400 bytes total (header + payload), yielding a maximum payload of 1368 bytes (MTUHardCap(1400) − HeaderSize(32)). This ensures safe passage within standard 1500-byte ethernet MTUs without IP-level fragmentation.
Byte order: All multi-byte fields in the entire protocol — header fields, heartbeat payload fields (§6B), SESSION_REQ payload fields (§4C), PUSH_REQ/PUSH_ACCEPT payload fields (§11C), and NACK arrays — are in big-endian (network byte order). C implementations must use htonl/ntohl (32-bit) and htonll/ntohll (64-bit) or equivalent for every multi-byte field on the wire. Go implementations use binary.BigEndian methods. This applies uniformly; there are no little-endian fields anywhere in the protocol.
| Byte Offset | Size | Field Name | Description |
|---|---|---|---|
| 0x00 | 1 Byte | PacketType | 0x00 SESSION_REQ, 0x01 DATA, 0x02 PARITY, 0x03 HEARTBEAT, 0x04 SESSION_REJECT, 0x05 TRANSFER_COMPLETE, 0x06 ACK_CLOSE, 0x07 PULL_REQ, 0x08 PUSH_REQ, 0x09 PUSH_ACCEPT, 0x0A SESSION_ACCEPT. |
| 0x01 | 4 Bytes | SessionID | Client-generated random identifier for the active transfer (see §4 for collision handling). |
| 0x05 | 8 Bytes | SequenceNum | Strictly incrementing 64-bit chunk identifier. Eliminates rollover concerns up to ~16 EB file sizes. |
| 0x0D | 8 Bytes | BlockGroup | 64-bit identifier for the FEC block this packet belongs to. Aligned with SequenceNum address space. |
| 0x15 | 2 Bytes | PayloadLen | Size of the raw data payload (max 1368 bytes). |
| 0x17 | 1 Byte | Flags | Bitmask: 0x01 = End of File, 0x02 = Calibration Burst, 0x04 = Encrypted (payload is AES-128-GCM ciphertext; see §4.5). |
| 0x18 | 8 Bytes | SenderTimestampNs | The sender's monotonic clock timestamp in nanoseconds at the moment each DATA or PARITY packet is built (C: clock_gettime(CLOCK_MONOTONIC) converted to nanoseconds; Go: time.Now().UnixNano()). Non-data control packets leave this field zero. The receiver echoes this value verbatim as EchoTimestampNs in the Heartbeat payload; the sender computes RTT = now_ns − EchoTimestampNs using only its own clock (§6B). |
| 0x20 | Variable | Payload | Raw file bytes, FEC parity data, or protocol metadata. |
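The layout above maps directly to serialization with binary.BigEndian at the documented offsets. The sketch below is illustrative, not the reference implementation; the struct and method names are assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const HeaderSize = 32

// Header mirrors the 32-byte fixed wire header (all multi-byte fields big-endian).
type Header struct {
	PacketType        byte
	SessionID         uint32
	SequenceNum       uint64
	BlockGroup        uint64
	PayloadLen        uint16
	Flags             byte
	SenderTimestampNs uint64
}

// Marshal packs the header at the documented byte offsets.
func (h *Header) Marshal(buf []byte) {
	buf[0x00] = h.PacketType
	binary.BigEndian.PutUint32(buf[0x01:], h.SessionID)
	binary.BigEndian.PutUint64(buf[0x05:], h.SequenceNum)
	binary.BigEndian.PutUint64(buf[0x0D:], h.BlockGroup)
	binary.BigEndian.PutUint16(buf[0x15:], h.PayloadLen)
	buf[0x17] = h.Flags
	binary.BigEndian.PutUint64(buf[0x18:], h.SenderTimestampNs)
}

// Unmarshal reverses Marshal.
func Unmarshal(buf []byte) Header {
	return Header{
		PacketType:        buf[0x00],
		SessionID:         binary.BigEndian.Uint32(buf[0x01:]),
		SequenceNum:       binary.BigEndian.Uint64(buf[0x05:]),
		BlockGroup:        binary.BigEndian.Uint64(buf[0x0D:]),
		PayloadLen:        binary.BigEndian.Uint16(buf[0x15:]),
		Flags:             buf[0x17],
		SenderTimestampNs: binary.BigEndian.Uint64(buf[0x18:]),
	}
}

func main() {
	h := Header{PacketType: 0x01, SessionID: 0xDEADBEEF, SequenceNum: 42, PayloadLen: 1368, Flags: 0x01}
	buf := make([]byte, HeaderSize)
	h.Marshal(buf)
	fmt.Println(Unmarshal(buf) == h)
}
```

Note that the 0x01 offset of SessionID means the 32- and 64-bit fields are not naturally aligned within the buffer; binary.BigEndian handles unaligned access, and the four-word alignment claim in §3 applies to the overall 32-byte size.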
To eliminate latency before data transmission begins, the protocol uses a Zero Round-Trip Time (0-RTT) handshake with a wire-speed calibration burst to probe link capacity.
The client generates the SessionID as a cryptographically random 32-bit integer (C: getrandom() or /dev/urandom; Go: crypto/rand). This keeps the handshake 0-RTT since no server round-trip is required for ID assignment.
If a SESSION_REQ carries a SessionID that is already in use, the server responds with a SESSION_REJECT (Type 0x04) containing a reason code. The client generates a new random SessionID and retransmits the SESSION_REQ. At typical concurrency levels (hundreds of concurrent transfers), 32-bit random IDs yield negligible collision probability (~1 in 10 million at 200 concurrent sessions).

Before allocating resources, the receiver validates the SESSION_REQ payload:
If validation fails, the receiver sends a SESSION_REJECT and logs a diagnostic warning.
The handshake is 0-RTT when unencrypted and 1-RTT when encrypted (§4.5). Both flows are described below.
Step 1 — Packet 0 (SESSION_REQ). The payload contains: FileSize (8 bytes), xxHash64 checksum (8 bytes), InitialRate (4 bytes, 0 = use calibration mode), and FileName (variable, null-terminated). If the Encrypted flag (0x04) is set in the header Flags field, a 32-byte SenderPublicKey (X25519 ephemeral) is appended after InitialRate and before FileName. The null terminator is appended after the filename bytes and is not counted in PayloadLen. Filenames must not contain embedded null bytes. The receiver strips all path separators (/ and \) from the filename before writing — C implementations must scan for both characters to be platform-neutral.

Step 1.5 — Key exchange (encrypted mode only). If the Encrypted flag is set, the receiver responds with a SESSION_ACCEPT (Type 0x0A) carrying its own 32-byte ephemeral ReceiverPublicKey as the payload. Both sides compute the shared secret via X25519 and derive the session key via HKDF (§4.5). The sender blocks until SESSION_ACCEPT arrives or the sender inactivity timeout fires. In unencrypted mode, this step is skipped entirely — the sender proceeds directly to Step 2.

Step 2 — Calibration burst. The sender transmits DATA packets with the Calibration flag (0x02) set. In unencrypted mode, this starts immediately after Step 1 (0-RTT). In encrypted mode, this starts after receiving SESSION_ACCEPT (1-RTT). The burst consists of 10 packets sent back-to-back at wire speed. This small packet train probes the link without flooding router buffers, even on constrained links like satellite. The token bucket is initialized at a default starting rate of 2 MB/s. In encrypted mode, calibration DATA packets are encrypted — there are no plaintext data packets on the wire once the key exchange completes. Only burst packets carry the Calibration flag.

Step 3 — Dispersion measurement. The receiver measures the time between the first and last calibration packet and derives the bottleneck bandwidth; with 10 ms of dispersion, for example: (10 − 1) × 1368 / 0.010 = 1.23 MB/s (or (10 − 1) × 1352 / 0.010 in encrypted mode). The receiver reports this as the DispersionNs field in the first heartbeat, giving the sender a direct measurement of the path's bottleneck capacity before the first rate adjustment. The sender can use this to seed the CC's peak rate estimate.
If the heartbeat includes a valid DispersionNs, the sender uses the derived bandwidth as the initial peakRate estimate. The loss-driven congestion controller (§6) governs the sending rate from this point forward.

If the InitialRate field in SESSION_REQ is non-zero, the sender skips calibration mode and begins transmitting at the specified bytes-per-second rate immediately. This is intended for known environments (e.g., a dedicated 10 Gbps LAN) where the operator can confidently set the initial rate. The adaptive congestion controller still takes over after the first Heartbeat.
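The dispersion-to-bandwidth conversion reduces to a few lines. BurstSize, MaxPayload, and the formula come from this section; the function name is an illustrative assumption, not the reference implementation.

```go
package main

import "fmt"

const (
	BurstSize  = 10   // calibration packets sent back-to-back at wire speed
	MaxPayload = 1368 // bytes per unencrypted DATA payload
)

// bandwidthFromDispersion converts the receiver-reported dispersion (ns between
// the first and last calibration packet) into a bottleneck-bandwidth estimate
// in bytes/second: (BurstSize − 1) payload gaps divided by the elapsed time.
func bandwidthFromDispersion(dispersionNs uint64) float64 {
	if dispersionNs == 0 {
		return 0 // no calibration measurement in this heartbeat
	}
	return float64((BurstSize-1)*MaxPayload) / (float64(dispersionNs) / 1e9)
}

func main() {
	// 10 ms of dispersion across the 10-packet burst ≈ 1.23 MB/s bottleneck.
	fmt.Printf("%.2f MB/s\n", bandwidthFromDispersion(10_000_000)/1e6)
}
```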
HP-UDP optionally encrypts all DATA and PARITY payloads using AES-128-GCM with ephemeral X25519 key exchange. Encryption is negotiated during the handshake (§4C Step 1.5) and is all-or-nothing for a session — once the Encrypted flag is set, every DATA and PARITY packet in the session is encrypted. Control packets (HEARTBEAT, TRANSFER_COMPLETE, ACK_CLOSE) are not encrypted; their payloads contain only protocol metadata, not file content.
Both sides generate a fresh X25519 keypair (32-byte public key, 32-byte private key) at the start of each session. Private keys exist only in memory for the duration of the transfer and are securely zeroed on session teardown. This provides perfect forward secrecy: there is no persistent key material that could decrypt recorded traffic after the session ends.
Key exchange is embedded in the existing handshake flow with no additional round trips beyond the 1-RTT SESSION_ACCEPT:
| Flow | Sender Key In | Receiver Key In | Added RTTs |
|---|---|---|---|
| Direct send/recv | SESSION_REQ payload | SESSION_ACCEPT (0x0A) payload | +1 (0-RTT → 1-RTT) |
| Serve daemon push | PUSH_REQ payload | PUSH_ACCEPT payload (extended) | +0 (already 1-RTT) |
| Serve daemon pull | PULL_REQ payload | SESSION_REQ payload (server is sender) | +0 (already 1-RTT) |
For push and pull via the serve daemon, the existing round trip already accommodates the key exchange — no additional latency is introduced. Only the basic send/recv flow gains one round trip.
Both sides independently derive the same symmetric key:
shared_secret = X25519(my_private_key, their_public_key) // 32 bytes
session_key = HKDF-SHA256(
ikm = shared_secret,
salt = SessionID (4 bytes, big-endian),
info = "hp-udp-aes128-v5",
len = 16 // AES-128 key
)
The SessionID salt ensures that even if the same ephemeral keypair were accidentally reused (implementation bug), different sessions would derive different keys. The info string binds the key to the protocol version and cipher suite, preventing cross-protocol key reuse.
C implementations: OpenSSL EVP_KDF with OSSL_KDF_NAME_HKDF, or libsodium crypto_kdf_hkdf_sha256_expand. Go: golang.org/x/crypto/hkdf.
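Production Go code would call golang.org/x/crypto/hkdf as noted above. The standard-library-only sketch below spells out the two HKDF-SHA256 steps (extract, then one expand block) so the derivation is explicit; deriveSessionKey is an illustrative name.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// deriveSessionKey performs HKDF-SHA256 (RFC 5869) extract + expand to produce
// the 16-byte AES-128 session key, with salt = big-endian SessionID and
// info = "hp-udp-aes128-v5" as specified above.
func deriveSessionKey(sharedSecret []byte, sessionID uint32) []byte {
	salt := make([]byte, 4)
	binary.BigEndian.PutUint32(salt, sessionID)

	// Extract: PRK = HMAC-SHA256(salt, IKM).
	ext := hmac.New(sha256.New, salt)
	ext.Write(sharedSecret)
	prk := ext.Sum(nil)

	// Expand: T(1) = HMAC-SHA256(PRK, info || 0x01); 16 bytes fit in one block.
	exp := hmac.New(sha256.New, prk)
	exp.Write([]byte("hp-udp-aes128-v5"))
	exp.Write([]byte{0x01})
	return exp.Sum(nil)[:16]
}

func main() {
	secret := make([]byte, 32) // stands in for the X25519 shared secret
	a := deriveSessionKey(secret, 0x1234)
	b := deriveSessionKey(secret, 0x5678)
	// Same secret, different SessionID salt: distinct session keys.
	fmt.Println(len(a) == 16, string(a) != string(b))
}
```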
Each DATA and PARITY packet is encrypted independently using AES-128-GCM. The packet header (32 bytes) is not encrypted — it is passed as Additional Authenticated Data (AAD) so that the receiver can route, reorder, and identify packets before decryption. The header is authenticated by the GCM tag, preventing tampering.
┌──────────────────┬───────────────────────────┬───────────────┐
│  Header (32 B)   │ Ciphertext (PayloadLen B) │ GCM Tag (16 B)│
│  cleartext, AAD  │    AES-128-GCM output     │   auth tag    │
└──────────────────┴───────────────────────────┴───────────────┘

Total ≤ 1400 bytes. PayloadLen ≤ 1352 (= 1368 − 16 tag).
PayloadLen in the header reflects the plaintext length (which equals the ciphertext length in GCM). The receiver reads PayloadLen + 16 bytes from the payload area to get ciphertext + tag. Encrypted MaxPayload = 1352 bytes. Unencrypted transfers retain MaxPayload = 1368.
AES-GCM requires a unique nonce for every packet encrypted under the same key. Nonce reuse completely breaks GCM's confidentiality and authenticity guarantees. The nonce is constructed deterministically from fields already present in the packet header:
| Bytes | Field | Purpose |
|---|---|---|
| 0–3 | SessionID (4 bytes) | Scopes nonce space to this transfer. Redundant with HKDF salt but provides defense-in-depth against implementation errors. |
| 4 | PacketType (1 byte) | Domain separator: 0x01 (DATA) vs 0x02 (PARITY). Prevents nonce collision between DATA and PARITY packets that may share the same SequenceNum value within a block. |
| 5–11 | Packet Unique ID (7 bytes, big-endian) | DATA: lower 56 bits of SequenceNum (globally unique, strictly incrementing). PARITY: lower 56 bits of (BlockGroup << 8) \| ParityIndex, where ParityIndex is the zero-based SequenceNum field of the PARITY packet (0..m−1). |
The composite (BlockGroup << 8) | ParityIndex is therefore unique across the transfer (up to 256 parity packets per block, which exceeds the maximum of 20). The PacketType byte at offset 4 prevents any DATA nonce from colliding with any PARITY nonce. 56 bits of unique ID supports 2^56 packets per session (~98 exabytes at 1368 bytes/packet), well beyond the 1 TB maximum file size.
The nonce is not transmitted on the wire. Both sides compute it deterministically from the packet header fields. This saves 12 bytes per packet compared to an explicit nonce.
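A sketch of the deterministic nonce layout and its use with Go's standard AEAD API (cipher.NewGCM defaults to exactly the 12-byte nonce this scheme requires). The header is passed as AAD as described above; the key, helper name, and example values are illustrative, and a PARITY packet would pass (BlockGroup << 8) | ParityIndex as the unique ID.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// buildNonce derives the 12-byte per-packet nonce from header fields alone:
// SessionID(4) | PacketType(1) | 56-bit unique ID(7), all big-endian.
func buildNonce(sessionID uint32, packetType byte, uniqueID uint64) [12]byte {
	var n [12]byte
	n[0] = byte(sessionID >> 24)
	n[1] = byte(sessionID >> 16)
	n[2] = byte(sessionID >> 8)
	n[3] = byte(sessionID)
	n[4] = packetType // domain separator: DATA vs PARITY
	for i := 0; i < 7; i++ {
		n[5+i] = byte(uniqueID >> uint(8*(6-i))) // lower 56 bits, big-endian
	}
	return n
}

func main() {
	key := make([]byte, 16) // session key from HKDF; zeros only for illustration
	block, _ := aes.NewCipher(key)
	aead, _ := cipher.NewGCM(block) // 12-byte nonce, 16-byte tag

	header := make([]byte, 32) // cleartext header, authenticated as AAD
	plain := []byte("chunk bytes")
	nonce := buildNonce(0xCAFE, 0x01, 42) // DATA packet, SequenceNum 42

	ct := aead.Seal(nil, nonce[:], plain, header)   // ciphertext || 16-byte tag
	pt, err := aead.Open(nil, nonce[:], ct, header) // receiver recomputes nonce
	fmt.Println(err == nil, bytes.Equal(pt, plain), len(ct) == len(plain)+16)
}
```

Flipping any header byte after sealing makes Open fail, which is the tamper protection the AAD provides.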
Encryption is applied after FEC encoding. The FEC encoder operates on plaintext data shards and produces plaintext parity shards. Each shard (DATA or PARITY) is then encrypted independently before transmission. On the receiver side, each packet is decrypted individually, then the plaintext shards are passed to the FEC decoder for reconstruction if needed.
Sender:   file → chunk → FEC encode (plaintext) → encrypt each shard → transmit
Receiver: receive → decrypt each shard → FEC decode (plaintext) → reassemble → disk
This ordering means FEC reconstruction operates on plaintext, which is correct — the Reed-Solomon math must see the original data bytes, not ciphertext (encrypting before FEC would require decrypting all k shards before reconstruction, which is the same work, but reconstructed ciphertext shards would then need the original plaintext to verify, creating a circular dependency).
When the Encrypted flag (0x04) is set, the following payloads are extended with a 32-byte ephemeral public key:
| Packet Type | Standard Payload | Encrypted Payload |
|---|---|---|
| SESSION_REQ | FileSize(8B) + Hash(8B) + InitialRate(4B) + FileName(null-term) | FileSize(8B) + Hash(8B) + InitialRate(4B) + PubKey(32B) + FileName(null-term) |
| SESSION_ACCEPT | (does not exist in unencrypted mode) | PubKey(32B) |
| PUSH_REQ | FileSize(8B) + FileName(null-term) | FileSize(8B) + PubKey(32B) + FileName(null-term) |
| PUSH_ACCEPT | Port(2B) | Port(2B) + PubKey(32B) |
| PULL_REQ | FileName(null-term) | PubKey(32B) + FileName(null-term) |
The receiver determines whether to parse the public key by checking the Encrypted flag in the packet header. If a receiver does not support encryption and receives a request with the flag set, it responds with SESSION_REJECT (reason code 0x06 = ENCRYPTION_UNSUPPORTED).
| Operation | Throughput (AES-NI) | Impact at 100 MB/s wire speed |
|---|---|---|
| AES-128-GCM encrypt | 4–6 GB/s single-thread | <3% CPU |
| AES-128-GCM decrypt | 4–6 GB/s single-thread | <3% CPU |
| X25519 scalar multiply | ~50 µs per session | Negligible |
| HKDF-SHA256 derivation | ~1 µs per session | Negligible |
| Payload reduction (1368→1352) | 1.2% fewer data bytes per packet | ~1.2% more packets for same file |
Net throughput impact: <5%. AES-NI hardware acceleration is present on every x86 CPU manufactured since ~2010 (Intel Westmere / AMD Bulldozer). C implementations should use OpenSSL's EVP_aes_128_gcm (which auto-detects AES-NI) or a SIMD-accelerated library. Go's crypto/aes + crypto/cipher uses AES-NI on amd64 automatically.
Implementations should allocate the AES-GCM cipher context once per session and reuse it for every packet, supplying only a fresh nonce: EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, nonce) (OpenSSL), or calling the cipher.AEAD seal call with a new nonce (Go). Context reuse eliminates ~25,000 allocations/sec at 35 MB/s throughput.
The receiver will never halt reading from the network socket. Incoming data is immediately mapped into a pre-allocated in-memory ring buffer sized to FileSize. Each packet's SequenceNum dictates its exact memory offset, so out-of-order packets are slotted into their correct positions seamlessly. Disk writes are handled by an io_uring submission queue — the main epoll event loop submits write SQEs for contiguous regions and reaps completions without blocking the network read path. This eliminates the flush thread entirely.

To eliminate latency penalties from round-trip retransmissions, the protocol proactively embeds mathematical redundancy that adapts to observed network conditions.
Data packets are organized into sequential BlockGroups. The default block size is 100 data packets per group. The BlockGroup identifier for a DATA packet is computed as:
BlockGroup = SequenceNum / BlockSize (integer division)
PARITY packets use the same BlockGroup value as the data packets they protect. The SequenceNum field of a PARITY packet is its zero-based index within the block (0, 1, 2, …, m−1), not a global sequence number. A C receiver must distinguish PARITY from DATA packets using the PacketType field (0x02) and interpret SequenceNum accordingly.
The parity packet count per block is dynamically adjusted based on the observed packet loss rate, reported via Heartbeat metrics (§6).
| Observed Loss Rate | Parity Ratio | Parity Packets per 100-Packet Block |
|---|---|---|
| < 0.5% | 2% | 2 |
| 0.5% – 2% | 5% | 5 |
| 2% – 5% | 10% | 10 |
| 5% – 10% | 15% | 15 |
| > 10% | 20% | 20 |
The sender initializes at 5% parity during calibration and adjusts after the first Heartbeat containing loss data. Adjustments are applied on block group boundaries — mid-block changes are not permitted, as this would invalidate the Reed-Solomon coding parameters for that group.
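The loss-to-parity mapping is a direct lookup. The table leaves boundary inclusivity open, so the cutoffs below are one reasonable reading; the function name is illustrative.

```go
package main

import "fmt"

// parityPacketsPerBlock maps observed loss (basis points, as reported in the
// heartbeat LossRate field) to the parity count for a 100-packet block,
// following the adaptive table above. Changes apply only on block boundaries.
func parityPacketsPerBlock(lossBps uint16) int {
	switch {
	case lossBps < 50: // < 0.5%
		return 2
	case lossBps < 200: // 0.5% – 2%
		return 5
	case lossBps < 500: // 2% – 5%
		return 10
	case lossBps < 1000: // 5% – 10%
		return 15
	default: // > 10%
		return 20
	}
}

func main() {
	fmt.Println(parityPacketsPerBlock(30), parityPacketsPerBlock(150), parityPacketsPerBlock(1200))
}
```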
Parity packets are generated using Reed-Solomon erasure coding over GF(2^8) with the irreducible polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). For a block of k data packets with m parity packets, any k of the k+m total packets are sufficient to reconstruct the original data. The encoding uses a Vandermonde-derived matrix whose top k rows form an identity matrix, ensuring data packets pass through unchanged and only parity is computed.
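For reference, multiplication in GF(2^8) under the 0x11d polynomial can be sketched with shift-and-XOR. Real encoders precompute log/antilog tables; this version exists only to make the reduction step concrete.

```go
package main

import "fmt"

// gfMul multiplies two elements of GF(2^8) under the 0x11d reduction
// polynomial (x^8 + x^4 + x^3 + x^2 + 1) via Russian-peasant shift-and-add.
func gfMul(a, b uint8) uint8 {
	var p uint8
	for b > 0 {
		if b&1 == 1 {
			p ^= a // addition in GF(2^8) is XOR
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d // reduce: the x^8 term was just shifted out, fold in 0x11d
		}
		b >>= 1
	}
	return p
}

func main() {
	// 2 × 0x80 overflows once and reduces: 0x100 ^ 0x11d = 0x1d.
	fmt.Printf("0x%02x\n", gfMul(2, 0x80))
}
```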
If the receiver detects missing packets within a completed block group (all expected sequence numbers accounted for or timed out), it attempts FEC reconstruction immediately. Only packets that cannot be recovered via FEC are reported as NACKs in the next Heartbeat.
The final block group of a file transfer will almost certainly contain fewer than 100 data packets. The FEC parameters adapt as follows:
- k_tail equals the remaining packet count after the last full block.
- m_tail is calculated using the current adaptive parity ratio, with a minimum of 2 parity packets regardless of block size (even a 1-packet tail block gets 2 parity packets). This ensures the most loss-vulnerable portion of the transfer — the tail — has adequate redundancy.
- The sender sets the EndOfFile flag (0x01) on the final data packet and the final parity packet of the tail block. This signals to the receiver that no further block groups will follow.

The key formulas for computing totals and boundaries are:
TotalChunks = ceil(FileSize / MaxPayload) // number of DATA packets (MaxPayload = 1368)
k_tail = TotalChunks % BlockSize // 0 means last block is full
(if k_tail == 0: k_tail = BlockSize)
FinalPayloadSize = FileSize % MaxPayload // bytes in last DATA packet
(if FinalPayloadSize == 0: FinalPayloadSize = MaxPayload)
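The formulas above can be checked with a small helper. tailParams is an illustrative name; the constants come from this spec.

```go
package main

import "fmt"

const (
	MaxPayload = 1368 // unencrypted DATA payload bytes
	BlockSize  = 100  // data packets per full FEC block
)

// tailParams computes chunk totals and tail-block geometry from FileSize,
// mirroring the spec formulas (k_tail and FinalPayloadSize fold the zero
// cases back to a full block / full payload).
func tailParams(fileSize uint64) (totalChunks, kTail, finalPayload uint64) {
	totalChunks = (fileSize + MaxPayload - 1) / MaxPayload // ceil division
	kTail = totalChunks % BlockSize
	if kTail == 0 {
		kTail = BlockSize // last block is exactly full
	}
	finalPayload = fileSize % MaxPayload
	if finalPayload == 0 {
		finalPayload = MaxPayload
	}
	return
}

func main() {
	// A 1 MiB file: ceil(1,048,576 / 1368) = 767 chunks; the tail block has
	// 67 data packets and the last packet carries 688 bytes.
	fmt.Println(tailParams(1 << 20))
}
```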
The EndOfFile flag is only meaningful on the wire. When the final DATA packet is recovered via FEC reconstruction rather than received directly, the flag is not propagated into the reconstructed shard. Receivers must therefore detect end-of-transfer by checking SequenceNum == TotalChunks − 1 (known from the SESSION_REQ FileSize field) rather than relying solely on the Flags field.
HP-UDP separates congestion control (network path capacity) from flow control (receiver processing capacity). The congestion controller is loss-driven: the primary signal is the observed packet loss rate, not the ratio of delivery rate to send rate. The delivery rate acts as a ceiling, not a decision driver.
Prior versions gated rate increases on EffectiveRate ≥ 0.95 × SendRate. In practice, this threshold was unreachable: delivery rate is bounded by send rate (the receiver cannot report receiving more than was sent), and timing jitter in heartbeat measurement windows made the ratio consistently fall below 0.95 even on a clean link. This caused the sender to spiral to the rate floor. Loss rate is the correct primary signal because it reflects whether the network has headroom independently of the send rate.
The receiver periodically sends a HEARTBEAT (Type 0x03) packet to the sender. The heartbeat interval is rate-proportional based on the last measured NetworkDeliveryRate:
| Last Measured Network Delivery Rate | Heartbeat Interval |
|---|---|
| < 10 MB/s | 100ms |
| 10 – 100 MB/s | 50ms |
| 100 MB/s – 1 GB/s | 25ms |
| > 1 GB/s | 10ms |
The receiver re-selects the interval using the NetworkDeliveryRate computed in each heartbeat, taking that value directly for interval selection.
The Heartbeat payload contains dual metrics, RTT echo, and a NACK array. All multi-byte fields below are big-endian (network byte order) — C implementations must serialize/deserialize with htonl/ntohl and htonll/ntohll:
| Field | Size | Description |
|---|---|---|
| NetworkDeliveryRate | 4 Bytes | Bytes per second successfully received from the socket into the ring buffer during the last heartbeat interval. Reflects network path capacity. |
| StorageFlushRate | 4 Bytes | Bytes per second flushed from the ring buffer to disk during the last heartbeat interval. Reflects receiver I/O capacity. |
| LossRate | 2 Bytes | Packet loss percentage for the current reporting window, encoded as basis points (e.g., 150 = 1.50%). Primary signal for congestion control. |
| EchoTimestampNs | 8 Bytes | The verbatim value of SenderTimestampNs from the most recently received DATA or PARITY packet header. The sender computes RTT = now_ns − EchoTimestampNs using only its own monotonic clock, eliminating cross-machine clock-skew error entirely. If no DATA/PARITY packets were received in the interval, the receiver echoes the previous value unchanged (the sender's frozen-timestamp guard ignores stale repeats — see §6B). |
| DispersionNs | 8 Bytes | Calibration burst dispersion measurement: the time (nanoseconds) between the first and last received calibration-flagged packet. Zero outside of calibration. The sender uses this to compute bottleneck bandwidth: BW = (BurstSize − 1) × MaxPayload / (DispersionNs / 10^9), where MaxPayload = 1368. See §4C. |
| HighestContiguous | 8 Bytes | The highest SequenceNum N such that all packets 0..N have been received or FEC-recovered. Allows the sender to track receiver progress. |
| NACKCount | 2 Bytes | Number of unrecoverable sequence numbers in the NACK array that follows. |
| NACKArray | 8 Bytes × N | Array of 64-bit SequenceNum values that were not recoverable via FEC and require retransmission. Bounded to sequences between HighestContiguous+1 and the highest received sequence number (never NACKs packets the sender hasn't transmitted yet). The array is physically limited to fit within one packet: MaxNACKs = floor((MaxPayload − HeartbeatFixedSize) / 8) = floor((1368 − 36) / 8) = 166. Some rate-limiting contexts in this spec round this to 167 for simplicity; implementations must not exceed the true computed limit of 166. If more sequences are pending, the receiver sends the highest-priority subset and the remainder appear in a subsequent heartbeat. |
Each DATA and PARITY packet carries a sender timestamp in the fixed SenderTimestampNs header field (offset 0x18, 8 bytes, unix nanoseconds set at packet-build time). Non-data control packets leave this field zero.
Problem with prior versions: v3.x had the receiver set EchoTimestampNs = receiver's time.Now(). The sender then computed RTT = sender_now − receiver_now, a cross-machine clock comparison. A 4-second clock skew between machines produced a 5-second RTT estimate, permanently locking the NACK cooldown above the receiver's 5-second inactivity timeout and killing the transfer.
Fix: The receiver now echoes the sender's own timestamp verbatim: EchoTimestampNs = pkt.Header.SenderTimestampNs. The sender computes RTT = now_ns − EchoTimestampNs using only its own monotonic clock. Cross-clock error is eliminated entirely.
Problem: When the sender is idle (honouring a NACK cooldown), the receiver keeps echoing the same stale SenderTimestampNs from the last packet it received. Each heartbeat makes RTT = now − staleTs grow by one heartbeat interval. After enough heartbeats the RTT inflates past 5 seconds, the NACK cooldown exceeds the receiver's inactivity timeout, and the transfer dies.
Fix: The TokenBucket tracks lastEchoNs — the highest EchoTimestampNs it has processed. RTT is updated only when echoNs > lastEchoNs. Stale repeated echoes are silently ignored; the RTT estimate stays locked at its last valid measurement until the sender transmits a new packet and the receiver reflects a fresh timestamp.
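The frozen-timestamp guard reduces to a few lines of state. The type and method names below are illustrative, not the implementation's.

```go
package main

import (
	"fmt"
	"time"
)

// rttEstimator applies the frozen-timestamp guard from §6B: an echo updates
// RTT only if it is newer than the last one processed, so stale repeats from
// an idle sender cannot inflate the estimate.
type rttEstimator struct {
	lastEchoNs int64 // highest EchoTimestampNs processed so far
	rttNs      int64 // last valid RTT measurement
}

// onHeartbeat returns true if the echo was fresh and RTT was updated.
func (r *rttEstimator) onHeartbeat(echoNs, nowNs int64) bool {
	if echoNs <= r.lastEchoNs {
		return false // stale repeat: hold the last valid RTT
	}
	r.lastEchoNs = echoNs
	r.rttNs = nowNs - echoNs // sender's own clock on both ends: no skew
	return true
}

func main() {
	var r rttEstimator
	base := time.Now().UnixNano()
	r.onHeartbeat(base, base+40_000_000)            // fresh echo → RTT 40ms
	updated := r.onHeartbeat(base, base+90_000_000) // same echo again → ignored
	fmt.Println(updated, time.Duration(r.rttNs))
}
```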
The sender adjusts its rate based on the LossRate reported in each Heartbeat. The effective delivery rate (rawEffective = NetworkDeliveryRate) serves as a ceiling for decreases. StorageFlushRate is still reported in heartbeats for observability but is no longer an input to the rate controller: the receiver uses a pre-allocated full-file ring buffer, so disk lag cannot cause packet loss. Including StorageFlushRate in the minimum caused out-of-order packets to stall the contiguous flush frontier to 0, making StorageFlushRate ≈ 0 and falsely triggering the delivery-collapse guard on every heartbeat. Rate increases are gated to once per RTT to prevent the sender from making dozens of upward adjustments before any feedback arrives (critical on high-latency links like satellite).
The sender tracks the time of its last rate increase. When a heartbeat signals an increase, the sender checks whether at least one RTT has elapsed since the previous increase. If not, the increase is suppressed (treated as a hold). Decreases are not gated — they are applied immediately upon consecutive loss signals to clear the pipe as fast as possible.
The congestion controller operates in two phases, similar to TCP's slow start and congestion avoidance but adapted for loss-driven UDP with FEC:
Phase 1 — Probe (Multiplicative Increase): While loss is < 1%, the sender has never observed the link ceiling. It probes aggressively with multiplicative increase, applied once per RTT:
S_new = S_current × 1.25
This is more conservative than the v3.0 1.5× multiplier but is applied per RTT rather than per heartbeat, giving the network time to signal back before each step.
Phase 2 — Congestion Avoidance (Additive Increase): Once the sender observes loss entering the 1%–5% hold zone for the first time, it has found the approximate ceiling of the link. The controller permanently transitions to Phase 2 and never returns to Phase 1 for this session. Probing uses additive increase, applied once per RTT:
S_new = S_current + (MaxPayload / RTT)
This adds approximately one packet per RTT of additional bandwidth, gently probing for headroom without risking a burst of loss.
Let L = reported LossRate in basis points, E = effective delivery rate (NetworkDeliveryRate — see §6C rationale above), S = current send rate. On each Heartbeat reception, the delivery-collapse guard is checked first:
| Condition | Action | Rationale |
|---|---|---|
| NACKCount > 0 AND E < S × 0.25 | Hold + permanently transition to Phase 2. Evaluated before loss thresholds. | OS socket buffer overflow. Packets are dropped before reaching the receiver application — reported LossRate stays 0% (no FEC failures counted) while delivery collapses. NACKs confirm real packet loss. Entering Phase 2 fires the 1.5× ceiling immediately, cutting the target rate to near actual link capacity. Threshold lowered from 50% to 25%: on high-latency paths (50ms+ RTT) approximately 50% of packets are legitimately in-flight during warm-up, causing the old 50% threshold to fire prematurely on measurement lag. |
| L < 100 (loss < 1%) | Increase (once per RTT): Phase 1: S × 1.25. Phase 2: S + MaxPayload/RTT. | Link has headroom. FEC absorbs transient loss. RTT gating prevents runaway probing on high-latency paths. |
| 100 ≤ L ≤ 500 (1% – 5%) | Hold: S = S. If first time entering this zone, transition to Phase 2 permanently. | FEC is handling the loss. The link ceiling has been discovered. Switch to additive probing from this point forward. |
| L > 500 (loss > 5%), consecutive confirmation | Decrease: S = smoothed(E) × 0.85 | Drop to 85% of the EWMA-smoothed effective delivery rate. The 15% undershoot allows router queues to drain; FEC bridges the gap during recovery. Requires two consecutive above-threshold heartbeats to trigger (see below). |
E × 1.05 set the new rate above the rate that just caused severe loss, sustaining congestion. Dropping to E × 0.85 gives queues time to drain. (2) The 1.5× multiplicative increase per heartbeat was replaced with 1.25× per RTT — on a Starlink link with 100ms heartbeats and 40ms RTT, the old algorithm made 2.5 increases per RTT, compounding to ~1.95× per RTT. (3) The permanent transition to additive increase after discovering the link ceiling prevents repeated boom-bust oscillation at the capacity boundary. (4) StorageFlushRate removed from effective-rate formula (v4.0): pre-allocated ring buffers always stall flush at 0 for out-of-order arrivals, making min(NetworkDeliveryRate, StorageFlushRate) ≈ 0 and permanently tripping the delivery-collapse guard. (5) Delivery-collapse threshold lowered 50%→25% (v4.0): legitimate in-flight packets on 50ms+ RTT paths account for ~50% of the window, causing false collapses during ramp-up with the old threshold.
Raw delivery rate measurements from individual heartbeats are noisy due to timing jitter, especially on high-latency links. The sender maintains an exponentially weighted moving average (EWMA) of the effective delivery rate:
smoothed = α × raw_sample + (1 − α) × smoothed_previous
The default smoothing factor is α = 0.3, which provides moderate dampening (converges in ~3 samples). The smoothed rate is used as the target when decreasing, preventing single-heartbeat noise from crashing the rate.
A single heartbeat reporting > 5% loss may be a transient spike (e.g., a router briefly queuing). The sender requires two consecutive above-threshold heartbeats before executing a decrease. The first signal starts a "decrease streak" counter; the second confirms it. Any increase or hold resets the streak to zero.
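The EWMA smoothing and the two-consecutive-confirmation rule can be sketched together. This is a minimal illustration of the scheme described above, not the implementation's code; names like `ccState` and `onHeartbeat` are invented for the example.

```go
package main

import "fmt"

const (
	alpha          = 0.3  // EWMA smoothing factor
	decreaseFactor = 0.85 // drop to 85% of smoothed delivery rate
	lossDecreaseBP = 500  // decrease threshold: >5% loss in basis points
)

type ccState struct {
	smoothed       float64 // EWMA of effective delivery rate (bytes/sec)
	sendRate       float64 // current target send rate (bytes/sec)
	decreaseStreak int     // consecutive above-threshold heartbeats
}

// onHeartbeat folds one raw delivery-rate sample into the EWMA and only
// executes a decrease on the second consecutive above-threshold signal.
func (cc *ccState) onHeartbeat(rawRate float64, lossBP int) {
	if cc.smoothed == 0 {
		cc.smoothed = rawRate // seed on first sample
	} else {
		cc.smoothed = alpha*rawRate + (1-alpha)*cc.smoothed
	}
	if lossBP > lossDecreaseBP {
		cc.decreaseStreak++
		if cc.decreaseStreak >= 2 { // second consecutive signal confirms
			cc.sendRate = cc.smoothed * decreaseFactor
			cc.decreaseStreak = 0
		}
	} else {
		cc.decreaseStreak = 0 // any increase or hold resets the streak
	}
}

func main() {
	cc := &ccState{sendRate: 50e6}
	cc.onHeartbeat(40e6, 600) // first >5% signal: streak starts, rate held
	fmt.Println(cc.sendRate == 50e6) // true
	cc.onHeartbeat(40e6, 600) // second consecutive signal: decrease fires
	fmt.Println(cc.sendRate < 50e6) // true
}
```

A transient single-heartbeat spike leaves the rate untouched; only a sustained signal cuts it, and the cut targets the smoothed rate rather than the noisy raw sample.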
The ceiling is two-tiered based on phase:
Phase 1: if rate > peakDeliveryRate × 4.0: rate = peakDeliveryRate × 4.0
Phase 2: if rate > peakDeliveryRate × 1.5: rate = peakDeliveryRate × 1.5
Phase 1 (4× — runaway prevention): During the multiplicative probe, delivery-rate measurements lag the send rate because the sender increases 25% per heartbeat and the receiver's measurement window has not stabilised. For example, at a 7.63 MB/s send rate the receiver may report only 4.19 MB/s delivery. A tight multiplier like 1.5× would fire immediately, giving a 6.28 MB/s ceiling below the current rate and locking the sender at ~5.68 MB/s for the entire transfer on a 110 MB/s Gigabit link. The generous 4× multiplier prevents this while still bounding the exponential: on a clean link where FEC absorbs all drops (LossRate remains 0% throughout), Phase 2 is never entered and without any Phase 1 ceiling the target rate grows without bound (observed: 345 trillion MB/s). With 4×, the target is capped at ~400 MB/s on a Gigabit LAN — effectively the same as nodelay (pacing is disabled at that rate anyway), but without the absurd log output.
Phase 2 (1.5× — avoidance bound): Once Phase 2 is entered, the delivery rate was measured near actual link capacity — the loss event that triggered the transition occurred at or near the ceiling — so 1.5× is a tight and reliable upper bound for the additive probing that follows.
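The two-tier ceiling reduces to a small pure function. A sketch, assuming the phase flag and peak-delivery figure are tracked elsewhere; the 17.5 MB/s peak in the second example is an illustrative value consistent with the 71 MB/s → ~26 MB/s cut described later in Lesson J.

```go
package main

import "fmt"

// applyCeiling caps the target rate at 4× peak delivery during Phase 1
// (runaway prevention) and at 1.5× once Phase 2 has been entered
// (tight avoidance bound).
func applyCeiling(rate, peakDelivery float64, phase2 bool) float64 {
	mult := 4.0
	if phase2 {
		mult = 1.5
	}
	if ceiling := peakDelivery * mult; rate > ceiling {
		return ceiling
	}
	return rate
}

func main() {
	// Phase 1 on a Gigabit LAN: the exponential probe is bounded at ~400 MB/s
	// instead of growing without limit.
	fmt.Println(applyCeiling(345e12, 99e6, false)) // 3.96e+08
	// Phase 2 after overshooting a ~20 MB/s WAN link: 71 MB/s is cut to ~26 MB/s.
	fmt.Println(applyCeiling(71e6, 17.5e6, true)) // 2.625e+07
}
```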
An earlier revision used a fixed warmup constant to exclude cold-start measurements when seeding peakRate. This approach was superseded by Phase 2 gating, which is more principled: the ceiling is irrelevant during Phase 1 (the sender is still discovering the link ceiling) and correct during Phase 2 (the delivery rate was measured near capacity). The warmup constant no longer exists in the implementation.
The sender must convert the target rate (bytes/sec) into inter-packet timing. The naive approach — computing a per-packet interval and sleeping for that duration — fails in practice because OS timer granularity (~1ms on Windows, ~100µs on Linux) and language-runtime preemption (e.g., Go's asynchronous goroutine preemption, or signal-based preemption in C with certain threading models) make sub-millisecond sleeps unreliable.
Instead, the sender uses a deficit accumulator:
- A tokens balance (in bytes) accrues credit at the target rate over elapsed wall-clock time. Elapsed time must use a monotonic clock source (C: clock_gettime(CLOCK_MONOTONIC); Go: time.Now(), which uses the monotonic component internally).
- Each packet sent debits tokens by the packet size.
- The balance is capped at max_tokens = rate_bytes_per_sec × 0.002 (2ms of credit). This prevents idle periods from banking enough credit to burst a large backlog of packets instantly.
- When the accumulated deficit justifies a sleep of at least 1ms, the sender sleeps. Go: time.Sleep(). C: clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &target_ts, NULL) — the TIMER_ABSTIME flag prevents drift from accumulating across consecutive sleeps.

This produces the correct long-term average rate without relying on sub-millisecond timer precision.
clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME) provides ~50–100µs precision on default kernels, significantly better than Go's ~1ms floor. For sub-100µs precision (useful at 10 GbE+ rates), run the sender thread under SCHED_FIFO real-time scheduling via sched_setscheduler(). The deficit accumulator is still the correct architecture — even with precise timers, per-packet nanosleep at 67,000 pps would burn CPU on syscall overhead. The accumulator batches the timing debt and issues one sleep per deficit threshold crossing, regardless of timer precision.
An earlier prototype used runtime.Gosched()-based busy-wait spin loops for sub-ms pacing. Go 1.14+ introduces asynchronous goroutine preemption that signals goroutines at safe points — even inside tight loops — causing the spin to overshoot to ~1ms per packet. At 1400 bytes/1ms = 1.4 MB/s, this created an artificial throughput ceiling regardless of the target rate. The deficit accumulator was developed to solve this without platform-specific timer hacks.
When the sender receives a NACK array in a Heartbeat, it queues the identified packets for retransmission. Retransmitted packets carry their original SequenceNum and BlockGroup.
Retransmissions are interleaved with forward progress: the sender processes at most 3 NACKed packets per send-loop iteration, then sends the next new data packet. This prevents NACK storms (e.g., 169 NACKs on a satellite link) from monopolizing bandwidth and stalling seqNum advancement.
The pending NACK set must be maintained as a deduplicated set (hash set or bitset), not a FIFO queue. Each heartbeat may report the same sequence numbers as the previous one (the receiver keeps NACKing until the packet arrives). If the same sequence is appended to a plain list on every heartbeat, the retransmit queue grows without bound. A set ensures each sequence is queued at most once regardless of how many heartbeats repeat it. If a NACK arrives for a sequence that has already been pruned from the sender's chunk cache (because HighestContiguous advanced past it via FEC recovery), the sender silently skips it — the receiver has already recovered the packet and the NACK is stale.
Without the cap, a heavy-loss link could spend the entire send loop on retransmissions, and seqNum never advanced. The 3-per-iteration cap ensures the transfer always makes forward progress even under heavy loss.
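The deduplicated pending set and the per-iteration cap can be sketched together. A minimal illustration of the structure described above; `retransmitQueue` and its methods are invented names, and the real implementation also checks the chunk cache for pruned sequences before resending.

```go
package main

import "fmt"

const maxNACKsPerIter = 3 // retransmit cap per send-loop iteration

// retransmitQueue holds pending NACKed sequences as a set, so the same
// sequence repeated across heartbeats occupies exactly one slot.
type retransmitQueue struct {
	pending map[uint64]struct{}
}

func newRetransmitQueue() *retransmitQueue {
	return &retransmitQueue{pending: make(map[uint64]struct{})}
}

// addNACKs merges one heartbeat's NACK list into the set; repeats from
// earlier heartbeats collapse instead of growing a queue without bound.
func (q *retransmitQueue) addNACKs(seqs []uint64) {
	for _, s := range seqs {
		q.pending[s] = struct{}{}
	}
}

// take removes and returns up to maxNACKsPerIter pending sequences for
// retransmission before the send loop emits its next new data packet.
func (q *retransmitQueue) take() []uint64 {
	out := make([]uint64, 0, maxNACKsPerIter)
	for s := range q.pending {
		out = append(out, s)
		delete(q.pending, s)
		if len(out) == maxNACKsPerIter {
			break
		}
	}
	return out
}

func main() {
	q := newRetransmitQueue()
	q.addNACKs([]uint64{7, 9, 12})
	q.addNACKs([]uint64{7, 9, 12, 15}) // next heartbeat repeats the NACKs
	fmt.Println(len(q.pending))        // 4, not 7: duplicates collapsed
	fmt.Println(len(q.take()))         // 3: per-iteration cap
	fmt.Println(len(q.pending))        // 1 still pending
}
```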
If the receiver does not receive any packets (data, parity, or retransmission) for a period of 5 consecutive expected heartbeat intervals, with a minimum floor of 5 seconds, it declares the session dead:
- The partially written file is deleted (or kept with a .partial suffix if configured for resume support in a future version).
- The last received SequenceNum and HighestContiguous are logged for diagnostics.

If the sender does not receive a Heartbeat for 5 consecutive expected heartbeat intervals, it assumes the receiver has failed or the return path is broken, and it aborts the session.
To prevent "tail drop" issues where the final packets are lost and the connection deadlocks, the protocol implements a synchronized teardown with timeout-driven cleanup on both sides.
1. The sender sets the EndOfFile flag (0x01) on the final data packet and the final parity packet of the tail block. It stops reading new data but keeps the socket open, listening for Heartbeats. Socket ownership must be exclusive before entering teardown: Go implementations must stop the heartbeat listener goroutine to prevent it from consuming packets the teardown loop needs (see Lesson F). C implementations using a single-threaded epoll event loop have exclusive socket access by construction — no action is needed.
2. The receiver computes a running xxHash64 as contiguous blocks are flushed from the ring buffer to disk. Upon receiving the EOF-flagged packets and verifying the completed hash against the SESSION_REQ metadata, it sends a TRANSFER_COMPLETE (Type 0x05) packet.
3. Upon receiving TRANSFER_COMPLETE, the sender responds with ACK_CLOSE (Type 0x06), frees heavy memory allocations (including the sliding window ring buffer), and enters a 3-second Linger state. Duplicate TRANSFER_COMPLETE packets arriving in this window are answered with a repeat ACK_CLOSE.
4. After sending TRANSFER_COMPLETE, the receiver enters its own 3-second Linger state. If ACK_CLOSE is not received within this window, the receiver retransmits TRANSFER_COMPLETE up to 3 times at 1-second intervals. If no ACK_CLOSE is received after all retries, the receiver considers the transfer successful (the file hash was verified) and performs a unilateral teardown.
5. During the teardown wait (after all data is sent, before TRANSFER_COMPLETE arrives), the sender continues to process Heartbeat packets synchronously. If a Heartbeat contains NACKs, the sender retransmits the requested packets from the sliding window ring buffer and resets the read deadline. This ensures the receiver can complete even if some late packets were lost.
Teardown retransmits are paced through the token bucket at the congestion controller's current rate. Without pacing, a backlog of queued Heartbeats (e.g., after a brief receive gap) can be drained all at once, causing the sender to fire hundreds of retransmit packets in tens of milliseconds — a burst that overwhelms the same congested link that caused the NACKs in the first place. See Lesson J for the observed failure case.
Problem: The teardown retransmit loop had no cooldown. On a 50ms-RTT path with a 50ms heartbeat interval, every in-flight heartbeat triggered a redundant retransmit of the same lost packets. The retransmit flood caused fresh congestion, which caused more NACKs, creating a self-reinforcing spiral. Observed: 59,908 reported NACKs for approximately 780 actual losses.
Fix: A nackCooldown map[uint64]time.Time gates each sequence number to at most one retransmit per RTT × 1.25 (RTT plus 25% margin). The map is seeded with all NACKs outstanding at the moment the main send loop ends. On each teardown heartbeat, a sequence is only retransmitted if its cooldown timestamp has elapsed; otherwise it is silently skipped until the next eligible window.
Problem: The receiver's NACK scan window is bounded by HighestReceived. If the last packets of the file are dropped, HighestReceived never advances to the end of the file, so the receiver's NACK list is empty. The sender sees 0 NACKs, sends nothing, and the receiver hits its 5-second inactivity timeout — a deadlock neither side can break without external intervention.
Fix: In the teardown loop, if a heartbeat arrives with NACKCount == 0 but hb.HighestContiguous < totalChunks−1, the sender proactively computes up to 167 missing tail sequences (from HighestContiguous+1 through totalChunks−1) and injects them into the retransmit pipeline. These injected sequences flow through the NACK cooldown gate exactly like receiver-reported NACKs, preventing the same tail sequences from being re-injected on every heartbeat.
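Computing the injected tail sequences is a bounded range walk. A sketch under the assumptions above (`missingTail` is an invented name; the real path additionally routes the result through the NACK cooldown gate):

```go
package main

import "fmt"

const tailInjectLimit = 167 // max sequences injected per heartbeat

// missingTail returns up to tailInjectLimit sequence numbers in
// (highestContiguous, totalChunks), i.e. the tail the receiver can no
// longer NACK because HighestReceived never advanced past the drop.
func missingTail(highestContiguous, totalChunks uint64) []uint64 {
	var out []uint64
	for seq := highestContiguous + 1; seq < totalChunks && len(out) < tailInjectLimit; seq++ {
		out = append(out, seq)
	}
	return out
}

func main() {
	// Receiver stuck 5 chunks short of a 1,000-chunk file.
	fmt.Println(missingTail(994, 1000)) // [995 996 997 998 999]
	// A large gap is still capped at the per-heartbeat injection limit.
	fmt.Println(len(missingTail(0, 1000))) // 167
}
```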
Problem: At high transfer speeds (100+ MB/s) the TokenBucket's 2ms burst allowance is approximately 200 KB. A full 167-packet retransmit batch is approximately 232 KB at MaxPayload = 1368 bytes — it fires as a near-simultaneous burst, flooding the OS UDP socket buffer and the serve daemon's 256-slot receive channel, causing most retransmits to be silently dropped.
Fix: Teardown retransmits are chunked into batches of 10 packets with a 2ms sleep between batches. 167 packets are spread over approximately 34ms — invisible to the user, well within any heartbeat interval, and guaranteed to fit through any buffer in the path.
If the receiver completes all data reception but the xxHash64 does not match the expected value from SESSION_REQ:
- The receiver sends SESSION_REJECT (Type 0x04) with a reason code indicating hash failure.

Once the main send loop completes and the sender enters the teardown loop (§8B), the progress bar changes from the normal 100% | 40.0 MB/s | NACKs: N display to 100% | Repairing... | NACKs: N. This tells the user the network is actively recovering dropped tail packets rather than hanging. The repair state persists until TRANSFER_COMPLETE is received or the teardown timeout expires.
Planned Phase 2 work:

- Per-packet conn.Write() syscall overhead limits Go to ~30–41 MB/s on LAN versus 93 MB/s for FTP (which uses kernel-level TCP segmentation with a single large write()).
- Use sendmmsg() (Linux) or GSO (Generic Segmentation Offload) to submit multiple UDP packets per kernel transition. At 1376 bytes per packet and a 93 MB/s target, the sender must dispatch ~67,000 packets/sec. Per-packet sendto() incurs ~15µs of context-switch overhead each, consuming ~1 second of CPU per second of transfer. sendmmsg() with batches of 16–64 packets amortizes this to ~1,000–4,000 syscalls/sec. On the receiver side, recvmmsg() provides the same benefit.
- Use an mmap()-backed ring buffer with PACKET_RX_RING/AF_XDP to avoid the kernel-to-userspace copy on receive.

For comparison, FTP's single large write() pushes megabytes at once — the kernel handles segmentation into ~1500-byte frames internally. HP-UDP's per-packet conn.Write() makes ~67,000 syscalls/sec at full speed, each costing a user-kernel context switch. This is not a fundamental protocol limitation — it's a syscall-overhead problem that sendmmsg() batching in Phase 2 will eliminate.
The following empirical findings emerged during Phase 1 implementation and testing. They are documented here to guide the Phase 2 C port and future protocol revisions.
The v2.0 algorithm increased the rate only when EffectiveRate ≥ 0.95 × SendRate. Since delivery rate is bounded by send rate and measurement windows never align perfectly, this ratio consistently falls below 0.95 even on a lossless link. The sender spirals to the rate floor. Loss rate is the correct primary signal.
Both time.Sleep(<1ms) and busy-wait spin loops fail for sub-millisecond pacing on Windows (minimum ~1ms granularity) and on any platform using Go 1.14+ (asynchronous goroutine preemption interrupts tight loops at ~1ms intervals). The deficit accumulator sidesteps this entirely by sleeping only when the accumulated deficit justifies a ≥1ms sleep. The C port should use clock_nanosleep() or similar, but should still avoid relying on sub-millisecond precision for correctness.
A 50 MB/s starting rate on a Starlink connection (~10 Mbps effective uplink) caused massive packet loss during the first 100ms, which poisoned the peakRate measurement and locked the auto-ceiling at ~0.38 MB/s for the entire transfer. The starting rate must be conservative enough for the worst expected link (2 MB/s default), while the calibration burst itself runs at wire speed to discover the actual capacity.
On a satellite link with ~30ms RTT and 5% loss, each heartbeat reported ~169 NACKs. Processing all NACKs before each new data packet caused the send loop to spend its entire bandwidth on retransmissions, preventing seqNum from advancing. Capping retransmissions at 3 per iteration restored forward progress.
The first few heartbeats arrive during or immediately after the calibration burst, when the receiver is still allocating buffers and the network path hasn't stabilized. More broadly, during Phase 1 ramp-up the receiver's measurement window hasn't caught up to the sender's current rate — at a 7.63 MB/s send rate the receiver may only report 4.19 MB/s delivery because the sender had only been at that rate for one 100ms heartbeat interval. Any ceiling derived from these measurements will be artificially low. The correct solution is to gate the auto-ceiling on Phase 2 entry rather than a fixed warmup period: by the time Phase 2 is entered, the sender has been near the link ceiling long enough for delivery measurements to be meaningful.
The sender's heartbeat listener goroutine and the teardown synchronous read loop compete for the same socket. If the goroutine is still running when the sender enters teardown, it consumes packets (including TRANSFER_COMPLETE) that the teardown loop needs. The goroutine must be stopped before entering teardown, and any queued NACKs must be drained synchronously.
FTP achieves 93 MB/s on the same Gigabit LAN where HP-UDP reaches ~30–41 MB/s. The difference is not the protocol or the language — it's the syscall pattern. FTP writes large buffers to a TCP socket; the kernel segments them into packets internally. HP-UDP calls conn.Write() for every 1376-byte packet, requiring ~67,000 user-kernel context switches per second at full speed. Each syscall costs ~15µs of overhead, consuming nearly 100% of available CPU time at target throughput. The Phase 2 C port must use sendmmsg()/recvmmsg() or equivalent batching to amortize this cost across 16–64 packets per syscall.
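The arithmetic behind this lesson is worth making explicit. The figures below (93 MB/s target, 1376-byte packets, ~15µs per syscall, batches of 64) are the ones quoted in the text; the program just runs the numbers.

```go
package main

import "fmt"

func main() {
	const (
		targetBps   = 93e6   // FTP-level throughput target, bytes/sec
		packetBytes = 1376.0 // unencrypted packet size on the wire
		syscallSec  = 15e-6  // ~15µs context-switch cost per syscall
		batch       = 64.0   // packets per sendmmsg() batch
	)
	pps := targetBps / packetBytes
	fmt.Printf("packets/sec:              %.0f\n", pps)            // ~67587
	fmt.Printf("CPU sec per transfer sec: %.2f\n", pps*syscallSec) // ~1.01
	fmt.Printf("syscalls/sec at batch=64: %.0f\n", pps/batch)      // ~1056
}
```

One syscall per packet consumes essentially an entire CPU-second per second of transfer, which is why the prototype plateaus well below link capacity; batching divides the syscall count by the batch size without changing the packet rate.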
Three allocation patterns in the Go prototype created significant memory and CPU overhead that is not visible in the protocol design but directly impacted measured throughput. All three were fixed in Phase 1 and their equivalents must be avoided in the Phase 2 C port.
NewRSEncoder(k, m) builds a k×k Vandermonde matrix and inverts it using O(k³) GF(2⁸) operations. For the default block size of k=100 this is ~2 million GF operations, measured at ~4 ms per call. With ~1,720 FEC blocks in a 237 MB transfer, constructing a fresh encoder for each block costs ~6.9 seconds of CPU time — roughly equal to the entire transfer duration at 35 MB/s. The fix is to cache the encoder keyed on (dataShards, parityShards) and reuse it across blocks. The matrix is deterministic for a given (k, m) pair and encoding only reads from it (no mutation), so the cached instance is safe for concurrent use. The Phase 2 C port must pre-build encoder matrices at session start and reuse them.
Each FEC shard was allocated as a fresh MaxPayload-sized buffer padded to equal length before RS encoding. At the default block size of 100 and a 35 MB/s transfer rate, this produces ~25,000 allocations per second, generating ~470 MB of heap churn per 237 MB transfer and placing the garbage collector on the critical path. The fix is a sync.Pool of pre-allocated MaxPayload-sized buffers checked out at encode time and returned immediately after the parity computation completes. The Phase 2 C port should maintain a fixed pool of shard-sized stack or heap buffers reused across blocks.
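A minimal sketch of the pool pattern, with the parity computation itself elided; `encodeBlock` is an invented name and the real code hands the shards to the RS encoder between checkout and return.

```go
package main

import (
	"fmt"
	"sync"
)

const maxPayload = 1368 // fixed shard size (unencrypted MaxPayload)

// shardPool hands out MaxPayload-sized scratch buffers and takes them
// back after encoding, replacing ~25,000 heap allocations/sec with reuse.
var shardPool = sync.Pool{
	New: func() any { return make([]byte, maxPayload) },
}

func encodeBlock(chunks [][]byte) {
	shards := make([][]byte, len(chunks))
	for i, c := range chunks {
		buf := shardPool.Get().([]byte)
		n := copy(buf, c)
		for j := n; j < maxPayload; j++ {
			buf[j] = 0 // zero-pad short tail chunks to equal shard length
		}
		shards[i] = buf
	}
	// ... parity computation over shards would run here ...
	for _, s := range shards {
		shardPool.Put(s) // returned immediately after parity completes
	}
}

func main() {
	encodeBlock([][]byte{{1, 2, 3}, make([]byte, maxPayload)})
	buf := shardPool.Get().([]byte)
	fmt.Println(len(buf)) // 1368: pooled buffers are always full-size
}
```

Explicit zero-padding on checkout matters because a pooled buffer still holds the previous block's bytes.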
The sender originally retained every sent chunk in an unbounded map[uint64][]byte to service NACK retransmissions. Without eviction, this map retained all ~172,000 chunks for a 237 MB transfer, consuming ~470 MB of heap memory for the entire session duration. Fixed in Phase 1: replaced with a bounded SlidingWindow ring buffer (50,000 slots, ~68 MB peak). Entries are evicted when HighestContiguous advances (received from each heartbeat), since the receiver has already confirmed contiguous receipt up to that point and will never NACK those sequences. When the window is full (all 50,000 un-acknowledged slots occupied), the sender pauses sending new packets until the receiver's HighestContiguous advances and frees slots — providing memory-bounded backpressure. The Phase 2 C port should use the same ring-buffer pattern with a power-of-2 slot count (65,536 recommended, ~89 MB peak) so that index wrapping uses a bitmask (idx & 0xFFFF) instead of a modulo operation — eliminating a division on the hot path for every packet sent and every HighestContiguous advance.
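The power-of-2 indexing recommended for the C port looks like this in any language. A storage-only sketch: eviction on HighestContiguous advance and the IsFull backpressure check are omitted, and `slidingWindow` is an invented name.

```go
package main

import "fmt"

const (
	windowSlots = 1 << 16        // 65,536: power of 2, as recommended
	slotMask    = windowSlots - 1 // 0xFFFF
)

// slidingWindow maps a monotonically increasing sequence number onto a
// fixed slot array with a bitmask instead of a modulo — no division on
// the per-packet hot path.
type slidingWindow struct {
	buf [windowSlots][]byte
}

func (w *slidingWindow) store(seq uint64, payload []byte) {
	w.buf[seq&slotMask] = payload
}

func (w *slidingWindow) load(seq uint64) []byte {
	return w.buf[seq&slotMask]
}

func main() {
	w := &slidingWindow{}
	w.store(70000, []byte("chunk")) // wraps past slot 65535
	fmt.Println(70000 & slotMask)   // 4464: same result as 70000 % 65536
	fmt.Println(string(w.load(70000)))
}
```

The bitmask is only equivalent to modulo when the slot count is a power of 2, which is exactly why 65,536 is recommended over the Go prototype's 50,000.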
With the original auto-ceiling multiplier of 4×, the target rate on a Gigabit LAN reached 396 MB/s (4× a measured peak delivery of ~99 MB/s). At this target the token bucket's pacing budget was so large that no sleep was ever triggered — the sender fired packets as fast as the CPU allowed. The OS NIC queue and the receiver's socket buffer were overwhelmed, systematically dropping the same ~167 packets per heartbeat interval. Because these drops were clustered within FEC blocks (the sender had burst through several consecutive blocks before the receiver could drain them), the parity overhead was insufficient to recover them. The result: every heartbeat for the entire transfer carried the same 167-entry NACK list, the sender retransmitted them repeatedly, and the teardown phase required ~3.5 seconds of retransmit cycles before the receiver could recover all blocks and issue TRANSFER_COMPLETE.
Three fixes address this together:
On a real Long Fat Network (1 GB file sent over a ~20 MB/s WAN link), Phase 1 multiplicative probing ramped the sender from 2 MB/s to 71 MB/s (the 4× Phase 1 ceiling) in under 5 seconds. The link could only sustain ~20 MB/s. The excess was absorbed by OS socket buffers, which then overflowed. From that point, packets were dropped at the OS layer before reaching the receiver application. The observed effect:
The receiver computes LossRate as packetsLost / (packetsReceived + packetsLost). When the OS drops all packets in a heartbeat window, packetsReceived = 0 and packetsLost = 0, so totalPackets = 0 and the guard clause produces 0% loss. The CC never detected any congestion.

Additionally, a secondary failure compounded the problem during teardown: a 7-second gap in heartbeat reception caused a backlog to accumulate. When heartbeats resumed, the sender drained the entire queue at once — 39 calls to the retransmit function in 45ms, each firing all 167 packets at wire speed (~199 MB/s burst on a 20 MB/s link).
Two fixes address this:
1. Delivery-collapse guard: OnHeartbeat checks NACKCount > 0 AND E < S × 0.25 (threshold lowered from 0.5 in v4.0 — see §6C). If both are true, the sender holds and permanently enters Phase 2. The Phase 2 ceiling (1.5× peak delivery) fires immediately, cutting the target from 71 MB/s to ~26 MB/s. With the rate near actual link capacity, the normal loss signals take over and back it down further. The NACKCount condition is critical — it prevents false holds on cold-start windows where delivery is transiently near zero but the link is healthy.
2. Teardown retransmits are paced through the token bucket, calling Pace() for each packet. A backlog of 39 queued heartbeats retransmitting 167 packets each is spread over ~600ms at 15 MB/s rather than firing in 45ms, giving the receiver time to process them and advance HighestContiguous.

Problem: When the sender's sliding window fills (50,000 un-acknowledged slots), the main send loop must pause to avoid growing memory without bound. The original implementation used a bare for sw.IsFull(seqNum) { time.Sleep(1ms) } inner loop. While spinning there, the outer loop never returned to its NACK-processing step at the top. If the very first DATA packet was dropped, HighestContiguous on the receiver stayed at 0. Because Advance(0) is a no-op (guarding against the zero-value case), hc in the sliding window stayed at its sentinel value (MaxUint64). With that sentinel, IsFull returned true at exactly seq 50,000, which corresponds to ~68 MB — roughly 6% of a 1 GB file. The sender froze, NACKs queued in nackPending went unserviced, and the receiver detected no incoming packets for 5 seconds and declared an inactivity timeout.
Fix: Replace for sw.IsFull(seqNum) { sleep } with if sw.IsFull(seqNum) { sleep; continue }. The continue jumps back to the top of the outer for seqNum < totalChunks loop, so every backpressure iteration still drains nackPending and retransmits pending sequences before sleeping. The retransmit of the lost first packet allows HighestContiguous to advance on the next heartbeat, which unblocks IsFull and resumes the main send loop normally.
Observed signature: Transfer ramps to full speed (100+ MB/s), freezes at approximately 6% of a 1 GB file, receiver reports inactivity timeout 5 seconds later. NACKs counter shows 0 during the freeze (the NACK retransmit loop never ran). The freeze point scales exactly with window size: windowSlots × MaxPayload / fileSize.
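The structural difference between the broken and fixed loops fits in a skeleton. This is a toy model of the fix described above, not the implementation: the window, NACK drain, and send are injected as closures so the control flow can be exercised in isolation, and the toy window unblocks once the drain has run a few times (standing in for a retransmit landing and HighestContiguous advancing).

```go
package main

import (
	"fmt"
	"time"
)

// sendLoop is the corrected skeleton: the backpressure branch uses
// `continue`, so the NACK drain at the top of the loop runs on every
// iteration even while the window is full. The broken version spun in
// an inner `for isFull { sleep }` and never reached drainNACKs.
func sendLoop(totalChunks uint64, isFull func(uint64) bool, drainNACKs func(), sendData func(uint64)) {
	var seqNum uint64
	for seqNum < totalChunks {
		drainNACKs() // always serviced first, even under backpressure
		if isFull(seqNum) {
			time.Sleep(time.Millisecond)
			continue // back to the top: NACK retransmits keep flowing
		}
		sendData(seqNum)
		seqNum++
	}
}

func main() {
	nackRuns, window := 0, 3 // toy 3-slot window
	sendLoop(5,
		func(seq uint64) bool { return seq >= uint64(window) },
		func() {
			nackRuns++
			if nackRuns >= 5 { // a retransmit "lands": window advances
				window = 5
			}
		},
		func(uint64) {})
	fmt.Println(nackRuns >= 5) // true: NACKs were serviced during the stall
}
```

With the bare spin, the toy transfer above would hang forever at seq 3; with `continue`, the drain runs during the stall, the window advances, and the loop completes.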
The serve daemon is a persistent single-lane UDP server that manages a file manifest and services both pull requests (clients fetch files) and push requests (clients deposit files). It listens on a single control port and handles one transfer at a time — concurrent requests receive SERVER_BUSY and may retry.
All SESSION_REJECT packets carry a single reason-code byte in the payload:
| Code | Name | Meaning |
|---|---|---|
0x01 | SESSION_ID_COLLISION | The submitted SessionID is already active on the server. |
0x02 | HASH_MISMATCH | Received file hash does not match the value declared in SESSION_REQ. |
0x03 | SERVER_BUSY | A transfer is already in progress; try again later. |
0x04 | FILE_NOT_FOUND | The requested filename is not in the serve manifest. |
0x05 | FILE_EXISTS | A push was rejected because the filename already exists on the server (no-overwrite policy). |
0x06 | ENCRYPTION_UNSUPPORTED | The receiver does not support encryption and received a request with the Encrypted flag (0x04) set. |
The PULL_REQ mechanism allows a client behind NAT to retrieve a file from a serve daemon that has a public IP address, without any port-forwarding configuration on the client side.
Wire format: Packet type 0x07. Payload is a null-terminated UTF-8 filename with no fixed-size prefix.
Flow:
1. The client generates a SessionID (same CSPRNG path as §4A), binds a local UDP socket on an OS-assigned ephemeral port, and sends PULL_REQ to the server's control port. This outbound packet punches the NAT hole: the NAT mapping records client-ip:ephemeral-port → server-ip:control-port.
2. The server validates the PULL_REQ. If busy or the filename is not in the manifest, it sends SESSION_REJECT back to the client's address. Otherwise, it fires a normal SESSION_REQ to the client's address from a new outbound connection. Because the server's IP was the destination of the punching packet, port-restricted-cone NAT routers (the most common home router class) allow this inbound from the same IP on any source port.
3. The client receives the SESSION_REQ on its bound socket, records the sender's address (server ephemeral port), and enters the normal receiver flow using its existing socket — no rebind required. The same socket that sent PULL_REQ carries the entire transfer.
4. The session uses the SessionID supplied in the PULL_REQ header, eliminating an extra round-trip for ID assignment.

The push flow allows a client to deposit a file into the serve daemon's directory. Three security invariants are always enforced:
1. Filename sanitization: the filename in the PUSH_REQ payload is sanitized to its base name — the substring after the last / or \ separator (Go: filepath.Base(); C: manual reverse scan for both separators). Path traversal sequences such as ../../etc/passwd or /absolute/path are reduced to just the final filename component before any further processing.
2. No overwrite: if the sanitized filename already exists on the server, the push is rejected with FILE_EXISTS (0x05).
3. Atomic visibility: the incoming file is written as filename.tmp during the transfer. Only after a successful TRANSFER_COMPLETE (xxHash64 verified) is the .tmp renamed to its final path and added to the live manifest. A failed or interrupted transfer leaves no partial file in the manifest.

PUSH_REQ wire format (type 0x08): 8-byte big-endian FileSize followed by a null-terminated filename. Total payload: 9 bytes minimum.
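The base-name sanitization can be sketched as a single forward scan for both separators — the manual equivalent of filepath.Base() that a C port would use; `sanitizeFilename` is an invented name.

```go
package main

import "fmt"

// sanitizeFilename reduces a submitted path to its final component by
// finding the last '/' or '\' separator, neutralizing path-traversal
// and absolute-path payloads from either OS family.
func sanitizeFilename(name string) string {
	last := -1
	for i := 0; i < len(name); i++ {
		if name[i] == '/' || name[i] == '\\' {
			last = i
		}
	}
	return name[last+1:]
}

func main() {
	fmt.Println(sanitizeFilename("../../etc/passwd")) // passwd
	fmt.Println(sanitizeFilename(`C:\temp\evil.exe`)) // evil.exe
	fmt.Println(sanitizeFilename("report.pdf"))       // report.pdf
}
```

Scanning for both separators matters because a Windows-style `\` payload would pass through a Unix-only filepath.Base() untouched.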
PUSH_ACCEPT wire format (type 0x09): 2-byte big-endian Port — the ephemeral UDP port the server has bound for the incoming data transfer.
Flow:
1. The client generates a SessionID and sends PUSH_REQ to the control port containing the filename and declared file size.
2. The server validates the request, binds a new UDP socket on an ephemeral data port (Go: net.ListenUDP(":0"); C: bind() with sin_port = 0, then getsockname() to retrieve the assigned port), and replies with PUSH_ACCEPT containing the assigned data port.
3. The client connects to server-host:data-port and starts a normal HP-UDP sender using the same SessionID from step 1.
4. The server writes the incoming file to filename.tmp. On success: atomic rename (rename()) from tmp to final path + manifest promotion under a write-lock. On failure: the tmp is deleted and the slot is cleared.

The manifest is a filename-to-absolute-path map built at daemon startup by a non-recursive directory scan. Symlinks and directories are excluded. Files added to the directory after startup are invisible until the daemon is restarted — this is an intentional security boundary. Successful push transfers atomically add their promoted file to the in-memory manifest under a write-lock (Go: sync.RWMutex; C: pthread_rwlock_t). PULL_REQ handlers acquire a read-lock when consulting the manifest.
| Parameter | Default Value | Configurable |
|---|---|---|
| MTU Hard Cap | 1400 bytes (total) | No |
| Header Size | 32 bytes (4 × 64-bit aligned, includes SenderTimestampNs) | No |
| Max Payload | 1368 bytes unencrypted (MTUHardCap(1400) − HeaderSize(32)); 1352 bytes encrypted (1368 − GCM_TagSize(16)) | No |
| FEC Block Size | 100 data packets | Yes |
| FEC Initial Parity | 5% | Yes |
| FEC Tail Min Parity | 2 packets | Yes |
| Calibration Burst Size | 10 packets (packet train) | Yes |
| Calibration Burst Spacing | 0 (wire speed) | Yes |
| Default Starting Rate | 2 MB/s | Yes |
| EWMA Smoothing Factor (α) | 0.3 | Yes |
| Loss Threshold: Increase | < 100 bp (1%) | Yes |
| Loss Threshold: Hold / Phase Transition | 100–500 bp (1–5%) | Yes |
| Loss Threshold: Decrease | > 500 bp (5%) | Yes |
| Consecutive Decrease Signals | 2 | Yes |
| Phase 1 Increase Multiplier | 1.25× per RTT | Yes |
| Phase 2 Additive Increase | MaxPayload / RTT per RTT | Yes |
| Decrease Factor | 0.85× smoothed delivery rate | Yes |
| Auto-Ceiling Multiplier (Phase 1) | 4× peak delivery rate (runaway prevention) | Yes |
| Auto-Ceiling Multiplier (Phase 2) | 1.5× peak delivery rate (avoidance bound) | Yes |
| Deficit Accumulator Burst Cap | 2ms of credit | Yes |
| Deficit Sleep Threshold | ≥ 1ms | No (OS-dependent) |
| Max NACKs per Send Iteration | 3 | Yes |
| Rate Floor | 10 KB/s | Yes |
| Inactivity Timeout | max(5 × heartbeat interval, 5s) | Yes |
| Sender Probe Interval | 500ms | Yes |
| Sender Probe Timeout | 10 seconds | Yes |
| Linger Duration (both sides) | 3 seconds | Yes |
| Receiver Teardown Retries | 3 | Yes |
| Stale SessionID Reservation | 10 seconds | Yes |
| Max SESSION_REQ File Size | 1 TB | Yes |
| Teardown Batch Size | 10 packets per sleep | Yes |
| Teardown Batch Sleep | 2 ms | Yes |
| Delivery-Collapse Threshold | 25% of current send rate (was 50%) | Yes |
| NACK Cooldown Margin | RTT × 1.25 (RTT + 25%) | Yes |
| Tail-Drop Injection Limit | 167 sequences per heartbeat | Yes |
| Sliding Window Slots | 65,536 (2^16, ~89 MB peak). Go prototype uses 50,000; C implementations should use a power of 2 for bitmask index wrapping. | Yes |
| Encryption Cipher | AES-128-GCM (128-bit key, 128-bit auth tag, 96-bit nonce) | No |
| GCM Tag Size | 16 bytes | No |
| GCM Nonce Size | 12 bytes (constructed from SessionID + PacketType + UniqueID; not transmitted) | No |
| Key Exchange | X25519 ephemeral (32-byte public key per side) | No |
| Key Derivation | HKDF-SHA256, salt = SessionID, info = "hp-udp-aes128-v5" | No |
Revision summary:

- v5.0 (encryption): SESSION_ACCEPT (0x0A) packet type; Encrypted flag (0x04); 1-RTT handshake when encrypted (0-RTT preserved for unencrypted); deterministic nonce from header fields; encrypt-after-FEC data path; extended payloads for SESSION_REQ, PUSH_REQ, PUSH_ACCEPT, PULL_REQ; ENCRYPTION_UNSUPPORTED reject code; encrypted MaxPayload 1352 bytes; backward compatible with v4.x unencrypted transfers.
- C port guidance: platform-specific primitives (clock_nanosleep, pthread_rwlock_t, manual basename scan, ephemeral bind()+getsockname()); expanded deficit-accumulator C guidance (TIMER_ABSTIME, SCHED_FIFO); architecture-neutral teardown socket ownership (epoll single-thread vs goroutine); power-of-2 sliding window slot count (65,536) for bitmask index wrapping; io_uring receiver disk I/O note.
- Memory and deadlock fixes: bounded sliding-window ring buffer replacing the unbounded sender chunk map; backpressure NACK starvation deadlock fix (if IsFull { continue } replaces bare spin).
- v4.0: SenderTimestampNs header field (HeaderSize 24→32, MaxPayload 1376→1368); frozen-timestamp RTT guard (lastEchoNs); RTT-aware NACK cooldown map; tail-drop deadlock prevention via proactive tail injection; teardown micro-burst prevention (batch-10 / 2ms); StorageFlushRate removed from CC effective-rate; delivery-collapse threshold 50%→25%; progress bar repair state.
- Congestion control rework: heartbeat dispersion field (DispersionNs); sender timestamps for RTT measurement; two-phase congestion controller (Phase 1 multiplicative 1.25×/RTT, Phase 2 additive); decrease formula changed from E × 1.05 to E × 0.85; rate-increase gating to once per RTT; Phase 2 throughput analysis; sendmmsg()/recvmmsg()/io_uring roadmap.