A TCP receive path as an effect tower

This is the chapter where the design has to prove it can handle something genuinely complicated. Networking is the standard test for systems-language expressiveness. TCP in particular is notorious: it has a state machine with eleven states, retransmission logic, congestion control, reassembly buffers, and a flow-control window. Production TCP stacks in Linux and the BSDs run to tens of thousands of lines of carefully managed C, and getting them right took decades.

We’re not going to build a production stack. What we’re going to do is sketch the receive path — packet arrives at the NIC, eventually surfaces as bytes a process can read — and show that the structure the design suggests is dramatically cleaner than the conventional approach. The structure is: each layer of the protocol stack is a handler that implements its own effect for the layer above by performing the effect of the layer below. The full receive path is a tower of handlers.

If this works, it’s evidence that the unification scales: from kernel scheduling to protocol implementation, the same shape applies. If it doesn’t work, it’s evidence the design has a ceiling. Either is useful.

The effect tower

The protocol layers, written out as effect signatures:

effect Link {
  recv_frame : () -> ethernet_frame
  send_frame : ethernet_frame -> ()
}

effect IP {
  recv_datagram : () -> ip_datagram
  send_datagram : ip_datagram -> ()
}

effect TCP {
  open    : { src : port; dst : socket_addr } -> connection
  recv    : connection -> bytes
  send    : connection -> bytes -> ()
  close   : connection -> ()
}

Read this as a stack. Link is the bottom — raw Ethernet frames in and out of the NIC. IP is the middle — IP datagrams, after fragmentation/reassembly. TCP is the top — a connection-oriented byte stream.

A user process performs TCP.recv conn and receives bytes. Under the hood:

  • A TCP handler is the implementation of recv. To do its job, it performs IP.recv_datagram to get the next incoming IP datagram, which it then demultiplexes to the relevant connection.
  • An IP handler implements recv_datagram. To do its job, it performs Link.recv_frame to get the next raw Ethernet frame.
  • A Link handler implements recv_frame. To do its job, it performs Hardware.nic_recv, which blocks until the NIC driver below has a packet to hand up (the NIC itself signals arrivals by firing Hardware.nic_packet).

Each layer translates between the abstraction below and the abstraction above. Each layer has local state (the IP reassembly buffer, the TCP receive window, the per-connection state machine). Each layer is composable: you can plug in a different IP handler (IPv6 vs. IPv4) without changing the TCP handler, or a different Link handler (Ethernet vs. PPP) without changing the IP handler.
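
To make the top of the tower concrete, here is a minimal sketch of a user-level loop sitting above all three handlers; the echo function is illustrative, not part of any interface the chapter defines.

(* A hypothetical user process at the top of the tower. It assumes the
   TCP, IP, and Link handlers are installed in scope below it. *)
let rec echo conn =
  let data = perform TCP.recv conn in   (* blocks until bytes are available *)
  perform TCP.send conn data;           (* write them straight back *)
  echo conn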

type ethernet_frame = {
  src_mac    : mac;
  dst_mac    : mac;
  ethertype  : int;
  payload    : bytes;
}

handler ethernet_link = {
  | Link.recv_frame (), k ->
      let pkt = perform Hardware.nic_recv () in
      match parse_ethernet pkt with
      | Ok frame ->
          if not_for_us frame.dst_mac then
            resume k (perform Link.recv_frame ())   (* not ours: drop it and retry *)
          else
            resume k frame
      | Error _ ->
          resume k (perform Link.recv_frame ())     (* drop malformed and retry *)

  | Hardware.nic_packet pkt, k ->
      (* This shouldn't happen unless we're acting as a low-level
         NIC handler too. In the normal stack, the NIC handler
         is below us. Resume and continue. *)
      resume k ()
}

The Link handler performs Hardware.nic_recv to get the next raw packet (waiting if necessary), parses the Ethernet header, decides whether to keep or drop, and resumes the caller with the parsed frame.

A subtle detail: the Link handler is itself using perform to recursively call its own effect when it wants to skip a frame. This is fine — the recursion goes back through the handler installation, finds itself, and runs again, and whatever frame the retry finally produces is what gets resumed to the original caller. No special looping construct needed.

Even more subtle: Hardware.nic_recv is a synchronous request-response over the NoC. The Link handler doesn’t have its own poll loop; it asks the hardware, blocks, gets a packet. The blocking is implemented as we saw in chapter 12 — the continuation goes into a waiters queue for the NIC device, and the NIC fires an arrival event when a packet is ready. This is all the same mechanism, applied recursively.
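
To make that concrete, here is a rough sketch of what such a NIC driver handler could look like. The nic_state record, its rx_queue and waiter fields, and the queue helpers are placeholders for whatever chapter 12’s kernel actually uses; the point is the shape, not the names.

(* Sketch of a NIC driver handler below the Link layer. A read request
   either gets a buffered packet or parks its continuation; the arrival
   event either wakes the parked reader or buffers the packet. *)
type nic_state = {
  mutable rx_queue : packet queue;
  mutable waiter   : (packet continuation) option;
}

handler nic_driver (dev : nic_state) = {
  | Hardware.nic_recv (), k ->
      (match queue_pop dev.rx_queue with
       | Some pkt -> resume k pkt             (* a packet was already buffered *)
       | None     -> dev.waiter <- Some k)    (* block: park the continuation *)

  | Hardware.nic_packet pkt, k ->
      (match dev.waiter with
       | Some w ->
           dev.waiter <- None;
           resume w pkt                       (* wake the blocked reader *)
       | None ->
           queue_push dev.rx_queue pkt);      (* nobody waiting: buffer it *)
      resume k ()                             (* acknowledge the arrival event *)
}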

The IP handler

type ip_datagram = {
  src_addr  : ip_addr;
  dst_addr  : ip_addr;
  protocol  : int;
  payload   : bytes;
}

type ip_state = {
  mutable reassembly : reassembly_table;
  mutable my_addr    : ip_addr;
}

handler ip_v4 (st : ip_state) = {
  | IP.recv_datagram (), k ->
      let frame = perform Link.recv_frame () in
      if frame.ethertype != ETH_IPV4 then
        resume k (perform IP.recv_datagram ())    (* not IP, try again *)
      else
        match parse_ip frame.payload with
        | Error _ -> resume k (perform IP.recv_datagram ())   (* malformed, try again *)
        | Ok ip ->
            if ip.dst_addr != st.my_addr then
              resume k (perform IP.recv_datagram ())          (* not addressed to us *)
            else if is_fragment ip then begin
              match add_fragment st.reassembly ip with
              | Complete datagram -> resume k datagram
              | Incomplete -> resume k (perform IP.recv_datagram ())
            end else
              resume k ip

  | IP.send_datagram dg, k ->
      let frame = encapsulate_in_ethernet dg in
      perform Link.send_frame frame;
      resume k ()
}

The IP handler holds two pieces of state: a reassembly table for in-flight fragmented datagrams, and our IP address. When IP.recv_datagram is performed, the handler:

  1. Performs Link.recv_frame to get the next Ethernet frame.
  2. Drops anything that isn’t IPv4.
  3. Parses the IP header.
  4. Drops anything not addressed to us.
  5. Handles fragmentation: if the datagram is whole, deliver it; if fragmented, file the fragment and try again.

Notice the structure. The IP handler is exactly an IP packet processor, expressed in terms of the Link layer below and the abstraction it provides above. Nothing in the IP handler knows about TCP. Nothing in the IP handler knows about UDP. It just translates Link frames into IP datagrams.

A real implementation would also handle ICMP, IP options, TTL decrement on forwarding, etc. These are additions to this handler, not changes to the layers above or below.

The TCP handler

This is the interesting one because TCP has real state: per-connection state machines, retransmission buffers, sequence number tracking, etc. The handler is bigger.

type connection = {
  conn_id   : int;
  local     : socket_addr;
  remote    : socket_addr;
  mutable state : tcp_state;
  mutable recv_window : int;
  mutable recv_next   : seqnum;
  mutable recv_buffer : ordered_byte_buffer;
  mutable send_unacked : seqnum;
  mutable send_next    : seqnum;
  mutable send_buffer  : retransmit_buffer;
  mutable read_waiter  : (bytes continuation) option;
}

type tcp_state =
  | Closed
  | Listen
  | SynSent
  | SynReceived
  | Established
  | FinWait1
  | FinWait2
  | CloseWait
  | Closing
  | LastAck
  | TimeWait

type tcp_global = {
  mutable connections : connection table;   (* keyed by (local, remote) tuple *)
  mutable listeners   : listener table;     (* keyed by local port *)
}

handler tcp (g : tcp_global) = {
  | TCP.recv conn, k ->
      (* Try to deliver buffered data immediately if any *)
      let available = ordered_byte_buffer_take conn.recv_buffer in
      match available with
      | Some bytes -> resume k bytes
      | None ->
          (* No data buffered. Save k as the read_waiter; pump the network. *)
          conn.read_waiter <- Some k;
          pump_until_data g conn

  | TCP.send conn bytes, k ->
      (* Add to retransmit buffer, send segment, return *)
      let seg = build_segment conn bytes in
      retransmit_buffer_add conn.send_buffer seg;
      perform IP.send_datagram (wrap_in_ip seg);
      conn.send_next <- conn.send_next + length bytes;
      resume k ()

  | TCP.open params, k ->
      let conn = make_new_connection params in
      g.connections <- table_add g.connections (params.src, params.dst) conn;
      conn.state <- SynSent;
      conn.send_next <- random_initial_seqnum ();
      let syn = build_syn conn in
      perform IP.send_datagram (wrap_in_ip syn);
      (* Wait for SYN-ACK; this is itself a perform on IP.recv_datagram
         in the pump loop *)
      pump_until_state g conn Established;
      resume k conn

  | TCP.close conn, k -> ...
}

and pump_until_data g conn =
  (* Receive IP datagrams, process them as TCP segments, until conn has data *)
  let rec loop () =
    let dg = perform IP.recv_datagram () in
    if dg.protocol == IP_PROTO_TCP then begin
      let seg = parse_tcp dg.payload in
      let target = lookup_conn g (dg.dst_addr, seg.dst_port) (dg.src_addr, seg.src_port) in
      handle_tcp_segment target seg;
      (* If our connection now has data, hand it to the parked reader *)
      if has_data conn then begin
        match conn.read_waiter with
        | Some waiter ->
            conn.read_waiter <- None;
            let bytes = ordered_byte_buffer_take conn.recv_buffer in
            invoke waiter (unwrap bytes)
        | None -> loop ()   (* data arrived but nobody is waiting; keep pumping *)
      end else
        loop ()
    end else
      loop ()
  in loop ()

and handle_tcp_segment conn seg =
  match conn.state, seg with
  | Listen, { syn = true; ack = false } -> ...
  | SynSent, { syn = true; ack = true } -> ...
  | Established, { ack = true; payload } -> 
      if seg.seq == conn.recv_next then begin
        ordered_byte_buffer_add conn.recv_buffer seg.payload;
        conn.recv_next <- conn.recv_next + length seg.payload;
        send_ack conn
      end else
        send_dup_ack conn
  | ...

Long, yes — but look at its shape. The handler is doing what a TCP implementation does: managing per-connection state, processing incoming segments according to the protocol state machine, buffering ordered data, holding waiters who want to read.

Three things to notice.

First: the pump pattern

pump_until_data is the core idiom of a protocol layer. When a TCP.recv is performed and there’s no data buffered, the handler can’t just resume the caller — it has no data to give. It also can’t return failure — that would change the semantics. So it does what blocking implementations always do: it drives the network forward until data is available.

The way it does so is by performing IP.recv_datagram () in a loop, processing whatever datagrams arrive (which may or may not be for the connection that’s waiting), and updating per-connection state until the connection in question has data, at which point it invokes the waiter.

This is the pattern that, in a conventional TCP implementation, lives as “the receive softirq” or “the kernel thread that drains the NIC.” Here it’s just a function inside the handler. It’s structured. It’s blocking (it does perform IP.recv_datagram, which blocks at the Link layer, which blocks at the Hardware layer, all the way down to the NIC interrupt). It composes naturally with the kernel’s existing blocking mechanism.

Second: the waiter pattern

When a TCP.recv finds no data, it stores the caller’s continuation in conn.read_waiter and starts pumping. When the pump eventually fills the buffer, it pulls the waiter out and invokes it with the data.

This is exactly the same pattern as the kernel’s IO.read_byte waiter — store the continuation in a wait slot, resume it from elsewhere. The difference is that the wait slot lives inside the TCP connection state, not in a global scheduler structure. Per-resource wait slots are the natural extension of the kernel-wide wait queue.

The waiter is linear (continuations always are). It’s stored exactly once, retrieved exactly once, and invoked exactly once. The connection cannot accidentally have two readers blocked simultaneously — the type would forbid it (the field is an option, and assigning a new value when one was already there is a programming error that linear types help surface).

Third: each layer is a handler, and handlers compose

The full receive path:

NIC hardware
   ↓ fires Hardware.nic_packet
NIC driver handler
   ↓ delivers via Hardware.nic_recv
Link handler
   ↓ delivers via Link.recv_frame
IP handler
   ↓ delivers via IP.recv_datagram
TCP handler
   ↓ delivers via TCP.recv
user process

Each ↓ is a resume k: the layer above performed an effect, the layer below handled it, and the resume delivers a value back up. Each layer is a handler installed in scope below the user process. The user process does perform TCP.recv conn; control transfers to the TCP handler; the TCP handler does perform IP.recv_datagram; control transfers to the IP handler; and so on. The stack of handlers is the protocol stack.

In a conventional implementation, this layering exists as a structural convention enforced by code review — each layer is a separate file, with its own functions, and the calls between layers are conventional function calls. Here, the layering is enforced by the effect type system: the IP handler can perform Link.recv_frame because the Link handler is in scope below it; if you forgot to install the Link handler, the program wouldn’t typecheck.
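
Concretely, assembling the tower might look something like the sketch below. The with ... handle installation form, the placeholder initial values, and the nic_driver handler (sketched earlier) are assumptions; what matters is the nesting.

(* Each `with ... handle` installs one layer; the innermost expression is
   the user process. Initial values here are placeholders. *)
let run_network_stack user_main =
  with nic_driver nic_dev handle
    with ethernet_link handle
      with ip_v4 { reassembly = empty_table; my_addr = my_ip } handle
        with tcp { connections = empty_table; listeners = empty_table } handle
          user_main ()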

This is more than a stylistic choice. It means:

Testing. You can stub out the Link layer by installing a test handler that returns pre-recorded frames. The IP and TCP handlers above will run unmodified, processing the test frames. No mocking framework required — the test stub is just another handler.
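
A minimal sketch of such a stub, with illustrative names (replay_state, fail); all that matters is that it answers the same Link effects the real handler would.

(* A Link handler that replays pre-recorded frames and captures outgoing
   ones. The IP and TCP handlers above it run unmodified. *)
type replay_state = {
  mutable pending  : ethernet_frame list;   (* frames still to deliver upward *)
  mutable captured : ethernet_frame list;   (* frames the stack tried to send *)
}

handler replay_link (st : replay_state) = {
  | Link.recv_frame (), k ->
      (match st.pending with
       | f :: rest -> st.pending <- rest; resume k f
       | []        -> fail "test ran out of recorded frames")

  | Link.send_frame frame, k ->
      st.captured <- frame :: st.captured;
      resume k ()
}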

Interposition. Want to log every IP datagram for debugging? Install a logging handler between IP and the layer above; pass through everything but log first. Nothing else changes.
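
As a sketch: the interposer handles the IP effects, logs, and re-performs the same effect, so the real IP handler below still does the work. log_datagram is an assumed helper.

(* A pass-through handler installed between the TCP handler and the real
   IP handler. Nothing above or below it changes. *)
handler ip_logger = {
  | IP.recv_datagram (), k ->
      let dg = perform IP.recv_datagram () in   (* answered by the real IP handler below *)
      log_datagram "recv" dg;
      resume k dg

  | IP.send_datagram dg, k ->
      log_datagram "send" dg;
      perform IP.send_datagram dg;
      resume k ()
}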

Composability across deployments. The single-core kernel and the multicore kernel can both use the same TCP handler — the underlying queue and waiter mechanisms differ, but the protocol logic is invariant.

Sending

I focused on receive; sending is structurally the same, going the opposite way. TCP.send conn bytes performs IP.send_datagram (wrap_in_ip seg), which performs Link.send_frame (encapsulate seg), which performs Hardware.nic_send pkt. Each layer adds its header on the way down. The send path is shorter than receive in this design because no demultiplexing is needed — you know exactly where each packet goes.
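
The Link layer’s half of that path is small enough to sketch as one more clause in the ethernet_link handler; serialize_ethernet stands in for whatever actually builds the header.

(* Additional clause for the ethernet_link handler: serialize the frame
   and hand the raw bytes to the NIC. *)
  | Link.send_frame frame, k ->
      let pkt = serialize_ethernet frame in    (* Ethernet header + payload *)
      perform Hardware.nic_send pkt;
      resume k ()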

What’s hard, made explicit

The structural cleanness should not obscure the work TCP demands. The hard parts of a real TCP implementation are not in the layering; they’re inside each handler:

TCP state machine. The eleven-state machine of TCP (Closed → Listen → SynSent → … → Closed) has subtle rules about which transitions are legal under which segment patterns. A real implementation has to handle simultaneous opens, abort sequences, half-close, and the famous time-wait state with its 2*MSL timer. This logic lives inside the TCP handler’s handle_tcp_segment function.

Retransmission. Tracking which bytes have been acknowledged and which need to be resent if no ack arrives in time. The retransmit buffer, the RTO timer, the exponential backoff on consecutive losses.

Congestion control. Slow start, congestion avoidance, fast retransmit, fast recovery, possibly newer algorithms like BBR or CUBIC. All operating on per-connection state.

Flow control. Advertising a receive window, respecting the peer’s window, the silly window syndrome, the Nagle algorithm.

Out-of-order reassembly. When segments arrive out of order, buffering them and delivering bytes in sequence.

None of this goes away. None of it is made easier by being inside a handler. But none of it is made harder, either, and the parts that are made easier — the layering, the testing, the composition — are pure wins.

The bet of this chapter is that the things made easier are the things conventional implementations spend disproportionate effort on. The conventional TCP stack’s complexity has some essential complexity (state machine, retransmission) and some accidental complexity (layering bookkeeping, threading discipline, context-switch management between user-space and kernel-space). The accidental complexity is what the effect-tower structure eliminates.

What this chapter committed to

A TCP receive path as a tower of effect handlers. Each layer translating from the abstraction below to the abstraction above. The pump pattern for driving the network forward when data is needed. Per-resource waiter slots for blocking semantics. The acknowledgment that essential protocol complexity remains, but accidental layering complexity goes away.

The next chapter is the one where the move is most explicitly SICP-shaped: building a database without ever using the word.