The send side, and what PERF really does

The previous chapter traced an event into the language. This chapter traces a request out — what happens when user code does perform UartTx 'A' to send a byte over the UART. The point is to make precise what perform actually does at the hardware level, and to surface a tension that the design has to resolve: PERF for local effects and PERF for remote effects are similar but not identical.

The asymmetry that’s been hiding

We’ve been treating PERF as a single mechanism. But there are two distinct things it might do, and the hardware has to decide which:

  1. Local handler search. Look in the handler CAM, find a handler installed in the current frame tower, transfer control to it. This is what every perform in the cooperative kernel does so far.

  2. Remote dispatch. Send a NoC packet to another device, which will deliver it as a PERF on that device. This is what we want to happen when the kernel sends a byte to UART.

How does the hardware know which? Several answers, with very different consequences.

Design space

Option A: Always search locally first, fall through to NoC on miss.

PERF executes. CAM is searched. If a handler is found, transfer control locally. If no handler is found, consult the effect table. If the effect table entry says “remote, dst=X,” emit a NoC packet to X.

The cleanest unification — one PERF instruction, dispatch decided by configuration. But it has a subtle cost: every PERF destined for a remote device pays the local search cost first. For a UART send happening many times per second, that’s a lot of fruitless CAM lookups.

It also has a confusing corner case: what if a user installs a handler for an effect that’s supposed to go to a remote device? The local handler intercepts, and the remote device never hears about it. This might be a feature (interpose on device traffic for debugging) or a bug (you accidentally captured a write you meant to send to UART). Powerful and dangerous.

Option B: The effect table entry encodes “local” vs. “remote” explicitly.

Each entry in the effect table is one of:

  • Local(handler_pc) — fall back to a handler at this PC if not in CAM
  • Remote(device_addr, op_encoding) — emit a NoC packet, don’t check CAM at all

PERF consults the effect table first to determine the mode. If Local, do CAM search. If Remote, build packet and emit.

More efficient (no wasted CAM lookups for remote effects) and more explicit. The cost is that dispatch is no longer uniform — the language has to know whether an effect is local or remote, which leaks the implementation into the type system unless we hide it well.
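As a rough illustration of Option B, the two entry kinds can be modeled as a tagged union. This is a Python sketch, not the hardware's entry format. The class names mirror Local and Remote from the list above; the function perform_mode, the handler address 0x0400, and the table contents are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Local:
    handler_pc: int  # fallback handler address, used when the CAM misses


@dataclass(frozen=True)
class Remote:
    device_addr: int  # NoC address of the target device
    op_encoding: int  # operation code as the device understands it


Entry = Union[Local, Remote]

# Hypothetical table contents, keyed by (family, op).
effect_table: Dict[Tuple[int, int], Entry] = {
    (1, 0): Local(handler_pc=0x0400),
    (2, 0): Remote(device_addr=0x10, op_encoding=0x01),
}


def perform_mode(family: int, op: int) -> str:
    """Return which path PERF takes under Option B (hypothetical helper)."""
    entry = effect_table[(family, op)]
    if isinstance(entry, Local):
        return "cam-search"   # search the CAM, fall back to handler_pc
    return "emit-packet"      # build a NoC packet, skip the CAM entirely
```

The point of the sketch is that the mode is a property of the table entry, not of the instruction: the same PERF encoding takes either path.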

Option C: Separate instructions for local PERF vs. remote send.

PERF is always local. SEND is always remote. The compiler decides which to emit based on the effect’s binding at compile time.

This is what the very early ISA drafts had, before we unified. It’s the most efficient but it splits the abstraction. The language’s perform is no longer one operation; it’s two, and the compiler bridges them.

Option D: The CAM also holds “remote redirect” entries.

CAM entries can be either LocalHandler(pc) or RemoteRedirect(device_addr, op). PERF searches the CAM as before; a remote-redirect entry emits a NoC packet instead of transferring control locally.

The choice

For this design, Option B with a refinement from D. The effect table is the canonical mapping from (family, op) to destination — local handler or remote device. The CAM holds overrides installed by INST, which are necessarily local (you INST a handler at a PC, not at a device address). When PERF executes:

  1. Consult the CAM. If a local override is found, transfer control.
  2. Otherwise, consult the effect table. If Local(pc), transfer control. If Remote(dst, op), build a NoC packet and emit. If unhandled, trap.

This gives us:

  • One instruction (PERF) at the language level
  • Cheap remote dispatch (the CAM holds only local overrides by construction, so a remote effect costs at most one CAM miss before the effect table routes it)
  • Override capability (a kernel debugger could INST a handler for an effect that’s normally remote, capturing all device traffic)
  • Clean compile target (the compiler emits PERF eid8, va for every effect; the runtime/effect-table-setup decides the rest)
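The two-step dispatch and the override capability can be sketched as a small software model. This is a minimal Python sketch under stated assumptions, not a description of the real execution unit: the Core class, its method names, the string return values, and the handler address 0x800 are all hypothetical, and continuation capture is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Key = Tuple[int, int]  # (family, op)


@dataclass(frozen=True)
class Local:
    pc: int


@dataclass(frozen=True)
class Remote:
    dst: int
    op: int


class UnhandledEffect(Exception):
    """Stands in for the hardware trap."""


class Core:
    def __init__(self, effect_table: Dict[Key, Union[Local, Remote]]):
        self.effect_table = effect_table
        self.cam: Dict[Key, int] = {}                  # overrides installed by INST
        self.noc_out: List[Tuple[int, int, int]] = []  # (dst, op, payload)

    def inst(self, key: Key, handler_pc: int) -> None:
        self.cam[key] = handler_pc

    def perf(self, key: Key, payload: int) -> str:
        if key in self.cam:                       # step 1: local override
            return f"jump {self.cam[key]:#x}"
        entry = self.effect_table.get(key)        # step 2: default routing
        if isinstance(entry, Local):
            return f"jump {entry.pc:#x}"
        if isinstance(entry, Remote):
            self.noc_out.append((entry.dst, entry.op, payload))
            return "sent"
        raise UnhandledEffect(key)


core = Core({(2, 0): Remote(dst=0x10, op=0x01)})
first = core.perf((2, 0), 0x41)    # no override: goes out on the NoC
core.inst((2, 0), 0x800)           # a debugger installs a local handler
second = core.perf((2, 0), 0x42)   # now captured locally, nothing is sent
```

In this model the second PERF never reaches the device, which is exactly the interposition behavior described in the override bullet.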

UART send, end to end

Let’s trace a perform UartTx 'A'. The UART transmitter is device 0x10. At boot, the kernel populated the effect table:

effect_table[(family=2, op=0)] = Remote(device=0x10, op=0x01)

— that is, family 2 op 0 is “UART transmit byte,” routed to device 0x10’s op 0x01. The kernel exposes this to user code via an effect UartTx : byte -> unit declaration.

User code:

perform UartTx 'A'

Compiled to:

LI    v0, 0x41           ; 'A'
PERF  family=2, op=0, v0

Stage 1: PERF executes

The core’s pipeline reaches the PERF instruction. The decoder identifies it; the operand is the 8-bit effect ID (family=2, op=0), and v0 provides the payload.

  1. CAM search. Looking for (2, 0). No local override; miss.
  2. Effect table lookup. Read effect_table[(2,0)] from memory. Entry says Remote(0x10, 0x01).
  3. Decision: remote dispatch. Skip continuation capture (no local handler was found, so there’s no handler frame to transfer to on this core). Instead, build a NoC packet.

This is a meaningful divergence in PERF semantics depending on the dispatch mode, and it deserves explicit acknowledgment. For remote PERF, no continuation is captured on the local core. The PERF site continues execution after the packet is sent. The receiving device handles the message however it does; if it needs to reply, it sends its own packet.

This is exactly the asymmetry between “synchronous local effect” and “fire-and-forget remote send.” A UART transmit is fire-and-forget: send the byte and continue. You don’t block waiting for the byte to be on the wire.

Stage 2: Packet construction

The PERF unit assembles a NoC packet:

Field     Value
dst_id    0x10 (UART)
src_id    0x00 (this core)
op        0x01 (from effect table)
flags     0x0
payload   0x00000041 (the byte, zero-extended)

The packet is constructed in a NoC output staging register inside the PERF execution unit. This is a multi-cycle operation, but that’s fine — multi-cycle is the norm for anything involving the NoC.
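To make the packet fields concrete, here is a Python sketch that packs them into bytes. The text does not specify field widths or byte order, so the layout below (four one-byte header fields followed by a 32-bit big-endian payload) is purely an assumption for illustration, as are the names NocPacket and build_uart_tx.

```python
import struct
from typing import NamedTuple

# Assumed wire layout (NOT specified in the text): four 1-byte header
# fields followed by a 32-bit big-endian payload.
PACKET_FORMAT = ">BBBBI"


class NocPacket(NamedTuple):
    dst_id: int
    src_id: int
    op: int
    flags: int
    payload: int

    def pack(self) -> bytes:
        return struct.pack(PACKET_FORMAT, *self)

    @classmethod
    def unpack(cls, raw: bytes) -> "NocPacket":
        return cls(*struct.unpack(PACKET_FORMAT, raw))


def build_uart_tx(byte: int) -> NocPacket:
    # Zero-extend the byte into the 32-bit payload field.
    return NocPacket(dst_id=0x10, src_id=0x00, op=0x01, flags=0x0,
                     payload=byte & 0xFF)


raw = build_uart_tx(ord("A")).pack()
```

Under the assumed layout, raw is eight bytes long and round-trips through NocPacket.unpack to the same field values.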

Stage 3: Handoff to the NoC interface

The packet is handed to the core’s NoC output port, which feeds into the mesh router. The PERF unit’s job is done; the rest is asynchronous.

Stage 4: Backpressure semantics

What does PERF do if the NoC output FIFO is full?

  1. Stall. PERF blocks the pipeline until the FIFO has room. Simple. Means PERF can take unbounded cycles. Preemption can still happen (since stall is a preemptable wait state).
  2. Trap. PERF raises an effect indicating send failure. Handler can retry or report.
  3. Drop. PERF silently drops the packet. Bad. Don’t.

(1) is right. Stall is the default and composes naturally with preemption: if the stall takes too long, the timer eventually fires and the scheduler kicks in. The currently running process pays the cost, but only for the time it actually spends stalled, and never more than its fair share of the core.
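The stall behavior can be sketched with a toy cycle model. This is a Python sketch under assumptions that are not in the design: the FIFO depth, the fixed drain rate, and the class name NocOutputFifo are invented for the example, and preemption is not modeled.

```python
from collections import deque


class NocOutputFifo:
    """Bounded FIFO; the router removes one packet every `drain_every` cycles."""

    def __init__(self, depth: int, drain_every: int):
        self.depth = depth
        self.drain_every = drain_every
        self.slots = deque()
        self.cycle = 0

    def tick(self) -> None:
        self.cycle += 1
        if self.slots and self.cycle % self.drain_every == 0:
            self.slots.popleft()

    def send(self, packet) -> int:
        """Enqueue a packet, stalling while full. Returns the stall cycles."""
        stalled = 0
        while len(self.slots) >= self.depth:
            self.tick()       # pipeline is stalled; only time advances
            stalled += 1
        self.slots.append(packet)
        self.tick()           # the PERF itself occupies one cycle
        return stalled


fifo = NocOutputFifo(depth=2, drain_every=4)
stalls = [fifo.send(n) for n in range(4)]
```

With these made-up parameters the first two sends complete without stalling and the later ones wait for the router to drain a slot. No packet is dropped; the sender simply takes longer.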

Stage 5: Routing

The packet enters the mesh, follows the routing algorithm, exits at the UART’s NoC input port. A handful of cycles.

Stage 6: UART receives and acts

The UART’s NoC interface receives the packet, decodes it, notes op=0x01 and payload’s low byte = 0x41. It hands the byte to its internal transmit FIFO. The serial-out logic shifts the byte out the TX pin.

The byte travels down the wire, and the SoC’s involvement is complete.
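The receiving side is simple enough to model in a few lines. This Python sketch assumes a hypothetical UartModel class; only op 0x01 and the use of the payload's low byte come from the text, and the serial shifting is reduced to appending to a buffer.

```python
from collections import deque

OP_TX_BYTE = 0x01  # device op named in the effect table entry


class UartModel:
    def __init__(self):
        self.tx_fifo = deque()
        self.wire = bytearray()   # stands in for the TX pin

    def receive(self, op: int, payload: int) -> None:
        if op == OP_TX_BYTE:
            self.tx_fifo.append(payload & 0xFF)  # only the low byte is used
        # Other ops are ignored in this sketch.

    def shift_out(self) -> None:
        if self.tx_fifo:
            self.wire.append(self.tx_fifo.popleft())


uart = UartModel()
uart.receive(0x01, 0x00000041)
uart.shift_out()
```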

The interesting semantic question: synchronous request-response

I glossed past something. For local PERF, the continuation is captured. For remote fire-and-forget PERF, it isn’t. But what about a remote PERF where the caller wants a response?

Consider effect DiskRead : sector -> bytes. The caller wants to send a read request and then receive the data. The continuation after perform (DiskRead 42) expects to be resumed with the bytes.

Two real choices:

B1: Pipeline blocks waiting for a response. PERF emits the request packet, then blocks the pipeline until a response packet arrives, at which point it delivers the payload to the PERF site and continues. Simple semantics, but the core is blocked for the whole disk operation, which is forever in CPU terms. Bad.

B2: Capture the continuation; resume it when the response arrives. PERF captures the continuation (like a local PERF) and files it somewhere keyed by the request ID. The PERF site doesn’t continue; control returns to the scheduler. When the response packet arrives, it’s matched against the filed continuation and resumed with the response.

(B2) is what you want. Notice: this means the kernel’s ReadByte logic and a hypothetical disk read are doing the same thing at different levels. The kernel-level ReadByte files the continuation in waiters and runs a different process. A hardware-level synchronous PERF would similarly file the continuation, awaiting a matching response.

The question is whether the hardware does this automatically (capture + file + resume on response) or whether it leaves it to software. Two options:

Option α: The hardware doesn’t know about request-response. DiskRead compiles to a fire-and-forget PERF that sends a request packet with a continuation handle as part of the payload. The kernel’s response handler unpacks the handle from the response and resumes it. Continuation handling is entirely software.

Option β: The hardware has a synchronous PERF mode. Captures the continuation, files it in a hardware table indexed by request ID, emits the packet, yields. When a response packet with that request ID arrives, the hardware automatically resumes the continuation.

(α) is simpler and more flexible; (β) is faster but commits the hardware to a specific request-response pattern.

For the teaching ISA, (α) is the right answer. It keeps the ISA minimal and pushes request-response into kernel code where it can be inspected and modified. The kernel writes a small “remote-effect helper” library that wraps fire-and-forget PERFs with continuation tracking. Higher-level effects (DiskRead) compile to calls into this library.

This means at the ISA level: PERF for remote effects is always fire-and-forget. Synchronous request-response is built in software on top of the primitive.
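A software helper of the kind Option α calls for might look roughly like the following Python sketch, which uses generators to stand in for captured continuations. Everything named here is hypothetical: the RemoteHelper class, the (dst, op, handle, arg) packet layout, and the disk device address 0x20 are assumptions for illustration, not part of the ISA or the kernel.

```python
from typing import Callable, Dict, Generator, Tuple

Packet = Tuple[int, int, int, int]  # (dst, op, handle, arg); assumed layout


class RemoteHelper:
    def __init__(self, send: Callable[[Packet], None]):
        self.send = send                          # the fire-and-forget PERF
        self.pending: Dict[int, Generator] = {}   # handle -> suspended caller
        self.next_handle = 0

    def start(self, task: Generator) -> None:
        """Run a task until it yields a request (dst, op, arg) or finishes."""
        self._step(task, None)

    def on_response(self, handle: int, value) -> None:
        """Kernel response handler: find the continuation and resume it."""
        self._step(self.pending.pop(handle), value)

    def _step(self, task: Generator, value) -> None:
        try:
            dst, op, arg = task.send(value)
        except StopIteration:
            return
        handle = self.next_handle
        self.next_handle += 1
        self.pending[handle] = task               # file the continuation
        self.send((dst, op, handle, arg))         # request carries the handle


sent, results = [], []
helper = RemoteHelper(sent.append)


def reader():
    data = yield (0x20, 0x01, 42)   # plays the role of perform (DiskRead 42)
    results.append(data)


helper.start(reader())                 # request goes out; caller is suspended
helper.on_response(0, b"sector-42")    # response arrives; caller resumes
```

The hardware only ever sees the fire-and-forget send. The matching of responses to waiting callers lives entirely in the pending table, where it can be inspected and changed.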

PERF semantics, summarized

The PERF execution unit, in pseudocode:

on PERF (family, op), v_payload:
  if CAM hits on (family, op):
    capture continuation
    transfer control to local handler
  elif effect_table[(family, op)] == Local(pc):
    capture continuation
    transfer control to pc
  elif effect_table[(family, op)] == Remote(dst, dev_op):
    build packet { dst, src=this_core, op=dev_op, payload=v_payload }
    send packet (may stall on backpressure)
    continue at next instruction (no continuation capture)
  else:
    trap (unhandled effect)

This is the unified semantics. The dispatch decision is made at the effect-table lookup. The CAM is for fast local overrides; the effect table is for default routing; the trap is for unhandled cases.

The compiler doesn’t need to know which mode a PERF will be in. It always emits PERF eid8, v_payload. The kernel, by populating the effect table at boot, decides which effects are local and which are remote. User code is uniform.

What this chapter committed to

The send side of PERF. The decision to unify local and remote dispatch under one instruction with effect-table-driven routing. Fire-and-forget semantics for remote PERF, with synchronous request-response built in software on top. Backpressure as a pipeline stall.

Part II is complete. We have an ISA whose primitives — type-checked arithmetic, hardware-aware pattern matching, perform-as-effect, continuation-as-value, NoC-as-effect-routing — fit the language we want to compile. Part III applies all of this to a kernel.