Coroutines, or: discovering you’ve built the runtime
This is the chapter where the design’s recursion shows. We’ve built a kernel out of continuations and handlers. We’ve built a TCP stack out of continuations and handlers. We’ve built a database out of continuations and handlers. In each case, the kernel/stack/database is not a special-purpose runtime — it’s a use of the language’s general-purpose primitives, arranged into the right shape.
Now we’re going to build something whose conventional implementation is a special-purpose runtime: coroutines. Co-routines, generators, async/await, fibers — call them what you want, they’re “stop running this thing here, resume it later, possibly returning a value.” Every modern language has a runtime to support this, written in C or Rust or assembly, hooked into the language as a special feature.
We’re going to write one in CML, in about a hundred lines, using effects. And we’re going to discover that what we’ve written is exactly the same shape as what the kernel does. They’re the same construction. The kernel is a particular use of the coroutine machinery. The coroutine machinery is a generalization of what the kernel needed.
What a coroutine is
A coroutine is a function that can pause itself, return a value to whoever invoked it, and be resumed later from where it paused. Compared to an ordinary function: an ordinary function runs to completion before returning; a coroutine can return many times, with state preserved between returns.
The canonical use case is generators:
let fib_gen () =
  let a = ref 0 in
  let b = ref 1 in
  loop {
    yield !a;
    let next = !a + !b in
    a := !b;
    b := next
  }
Each yield returns a value to the consumer and pauses the generator. The next time the consumer asks for a value, the generator resumes at the line after the yield, with its local state (a, b) intact.
In Python this is the yield statement, with the runtime managing the magic. In JavaScript it's a generator function (function* syntax plus the yield keyword), again with the runtime managing the magic. In C++ it's coroutines (the Coroutines TS, merged into C++20), with the compiler doing the CPS transform and managing state allocation. In every case, the runtime is a non-trivial piece of language-implementation engineering.
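For concreteness, here is the fib_gen generator above written in Python, where the runtime supplies the pause/resume machinery that we are about to build by hand:

```python
from itertools import islice

def fib_gen():
    """Infinite Fibonacci generator: each `yield` pauses the function,
    hands a value to the consumer, and preserves a and b until resumed."""
    a, b = 0, 1
    while True:
        yield a          # pause here; next() resumes at this point
        a, b = b, a + b

# The consumer drives the generator one value at a time.
first_eight = list(islice(fib_gen(), 8))
# first_eight == [0, 1, 1, 2, 3, 5, 8, 13]
```

Nothing in the function body mentions frames or continuations; the runtime hides all of it, which is exactly the machinery the handler version makes explicit.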
Here, we’re going to write it as a handler.
The naive version
The simplest possible coroutine is just an effect:
effect Coro a {
  yield : a -> ()
}
A generator is a function that performs Coro.yield operations. A consumer installs a handler that captures each yield, stores the continuation, and returns the value:
type 'a generator =
  | Done
  | Yielded of 'a * (unit -> 'a generator)

handler run_coro (body : () -> ()) : 'a generator =
  match body () with
  | () -> Done
  with
  | Coro.yield v, k ->
      Yielded (v, fun () -> resume_handler run_coro k)
run_coro installs a handler for Coro.yield and runs the body. If the body completes, return Done. If the body yields a value, return Yielded (v, k_resume) where v is the yielded value and k_resume is a thunk that, when called, will run the body further until the next yield or completion.
The consumer pattern-matches on the generator:
let rec consume gen =
  match gen with
  | Done -> ()
  | Yielded (v, k_next) ->
      print_int v;
      consume (k_next ())
This is a cooperative coroutine: the generator and consumer alternate. The generator yields; the handler returns a value and a “how to continue”; the consumer extracts the value, decides whether to ask for more, and if so calls the continuation to resume the generator.
This is exactly the shape that Python generators have. The runtime implementation is different (Python uses frame objects, stack manipulation, and a C-level generator type) but the semantics are identical.
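That claim is easy to check: a Python generator can be wrapped into the same Done/Yielded protocol. A sketch — the names run_coro, DONE, and consume mirror the CML code above and are not any Python API:

```python
# Mimic the CML `'a generator` type: Done, or Yielded(value, resume_thunk).
DONE = ("Done",)

def run_coro(gen):
    """Drive a Python generator one step, returning either DONE or
    ('Yielded', value, resume_thunk) — the same shape as the handler above."""
    try:
        v = next(gen)
    except StopIteration:
        return DONE
    # The resumption is a thunk that advances the same generator again.
    return ("Yielded", v, lambda: run_coro(gen))

def consume(step, out):
    """Pattern-match on the generator state, like the CML `consume`."""
    while step != DONE:
        _, v, resume = step
        out.append(v)
        step = resume()

def count_to(n):
    for i in range(n):
        yield i

collected = []
consume(run_coro(count_to(4)), collected)
# collected == [0, 1, 2, 3]
```

The only difference is who stores the suspended state: Python's frame object here, a captured continuation in the handler version.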
Generalizing: a coroutine system
Generators are useful, but the more interesting move is to support many coroutines that can be scheduled however the user wants. This is where it starts to resemble the kernel.
effect Coro {
  spawn : (() -> ()) -> coro_id
  yield : () -> ()
  exit : () -> empty
  await : coro_id -> ()
}

type coro = {
  id : int;
  k : unit continuation;
}

type coro_state = {
  mutable ready : coro queue;
  mutable waiters : (coro_id, coro list) table; (* waiting on completion *)
  mutable next_id : int;
}
handler coroutines (s : coro_state) = {
  | Coro.spawn body, k ->
      let id = s.next_id in
      s.next_id <- id + 1;
      let cb = fun () -> (body (); perform Coro.exit ()) in
      let cont = reify cb in
      Queue.push s.ready { id; k = cont };
      resume k id
  | Coro.yield (), k ->
      Queue.push s.ready { id = currently_running (); k };
      schedule_next s
  | Coro.exit (), _k ->
      let me = currently_running () in
      wake_waiters_for me s;
      schedule_next s
  | Coro.await target, k ->
      let me = currently_running () in
      add_waiter s target { id = me; k };
      schedule_next s
  | return v -> v
}

and schedule_next s =
  match Queue.pop s.ready with
  | None -> () (* all coroutines done *)
  | Some c ->
      set_currently_running c.id;
      invoke c.k ()
Notice anything? This is the kernel. The structure is identical:
- A coro_state record with a ready queue and a waiters table → just like the kernel's sched_state
- A spawn effect that creates new units of execution → just like ProcLifecycle.spawn
- A yield effect that puts the current unit at the back of the queue → just like ProcLifecycle.yield
- An exit effect that drops the current unit and runs the next → just like ProcLifecycle.exit
- An await effect that blocks the current unit until something happens → just like IO.read_byte waiting for input
- A scheduler loop that pops from ready and invokes → just like the kernel's scheduler
The “coroutines runtime” we just wrote is the kernel. The only differences are that coroutines run in user-space (no preemption, no devices, no NoC) and that the names are different (Coro.spawn vs. ProcLifecycle.spawn).
You could, in fact, implement this coroutines runtime as a use of the kernel. A coroutine is just a process. Coro.yield is ProcLifecycle.yield. The handler is just an alias. We didn’t build a new runtime; we re-discovered the one we already had.
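If the correspondence still feels abstract, the whole coroutine system can be sketched in ordinary Python, with generators standing in for reified continuations and a deque as the ready queue. The request tuples ('spawn', …), ('yield',), ('await', …) are illustrative stand-ins for the Coro effects, not the CML API:

```python
from collections import deque

def scheduler(main):
    """Minimal cooperative scheduler: each coroutine is a generator that
    yields requests, and this loop interprets them — same shape as the
    CML handler, with gen.send playing the role of `invoke`/`resume`."""
    ready = deque()        # (coro_id, generator, value_to_resume_with)
    waiters = {}           # coro_id -> list of (id, gen) awaiting its exit
    done = set()
    counter = [0]

    def new(fn):
        cid = counter[0]; counter[0] += 1
        waiters[cid] = []
        return cid, fn()

    first_id, first_gen = new(main)
    ready.append((first_id, first_gen, None))
    while ready:
        cid, gen, send_val = ready.popleft()
        try:
            req = gen.send(send_val)
        except StopIteration:           # Coro.exit: wake the waiters
            done.add(cid)
            for wid, wgen in waiters[cid]:
                ready.append((wid, wgen, None))
            continue
        if req[0] == "spawn":
            child_id, child_gen = new(req[1])
            ready.append((child_id, child_gen, None))
            ready.append((cid, gen, child_id))   # resume spawner with the id
        elif req[0] == "yield":
            ready.append((cid, gen, None))       # back of the queue
        elif req[0] == "await":
            target = req[1]
            if target in done:
                ready.append((cid, gen, None))
            else:
                waiters[target].append((cid, gen))

log = []

def child():
    log.append("child: step 1")
    yield ("yield",)
    log.append("child: step 2")

def main():
    cid = yield ("spawn", child)
    log.append("main: spawned")
    yield ("await", cid)
    log.append("main: child done")

scheduler(main)
# log == ["child: step 1", "main: spawned", "child: step 2", "main: child done"]
```

The ready deque, the waiters table, and the interpret-and-requeue loop are the kernel's structure transplanted into thirty lines of user-space Python.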
What about actors?
You can play the same game with actors. An actor system is “many independent units of execution, each with private state, communicating by message-passing.” Let’s build one as a handler:
effect Actor a {
  spawn : (mailbox a -> ()) -> actor_id
  send : actor_id -> a -> ()
  recv : () -> a
  exit : () -> empty
}

type 'a actor = {
  id : int;
  mailbox : 'a queue;
  k : unit continuation;
}
handler actors (s : actor_state) = {
  | Actor.spawn body, k ->
      let id = s.next_id in
      s.next_id <- id + 1;
      let mb = Queue.empty () in
      let new_a = { id; mailbox = mb; k = reify (fun () -> body mb) } in
      Hash.insert s.actors id new_a;
      Queue.push s.ready new_a;
      resume k id
  | Actor.send target msg, k ->
      let a = Hash.find s.actors target in
      Queue.push a.mailbox msg;
      (match Hash.pop s.recv_waiters target with
       | Some waiter_k ->
           (* target was blocked in recv: deliver immediately,
              requeue the sender, switch to the receiver *)
           let m = Queue.pop a.mailbox in
           let me = currently_running_actor () in
           Queue.push s.ready { me with k };
           invoke waiter_k (unwrap m)
       | None ->
           resume k ())
  | Actor.recv (), k ->
      let me = currently_running_actor () in
      (match Queue.pop me.mailbox with
       | Some msg -> resume k msg
       | None ->
           Hash.insert s.recv_waiters me.id k;
           schedule_next s)
  | Actor.exit (), _k ->
      cleanup_actor (currently_running_actor ());
      schedule_next s
}
Once again: this looks exactly like the kernel with the addition of per-unit mailboxes. The mailbox is what makes it an actor system rather than a generic coroutine system — units communicate via messages, with sends and receives mediated by mailbox queues.
But again, structurally: there’s a ready queue, a per-unit state record (now including a mailbox), a spawn/yield/exit interface, and the same handler shape. Erlang spent considerable engineering effort on its actor runtime (the BEAM VM with its lightweight processes), and we’ve sketched its essentials in about 60 lines.
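The same Python-generator trick from the coroutine sketch extends to actors: add a per-actor mailbox deque and a 'recv' request that parks the actor until a message arrives. Again, the request tuples and names (run_actors, blocked) are illustrative, not the CML API:

```python
from collections import deque

def run_actors(main):
    """Actor-system sketch: the coroutine scheduler plus per-actor mailboxes.
    Actors are generators yielding ('spawn', fn), ('send', id, msg), ('recv',)."""
    ready = deque()        # (actor_id, value_to_resume_with)
    actors = {}            # actor_id -> {"gen": generator, "mailbox": deque}
    blocked = set()        # actor ids parked inside a recv
    counter = [0]

    def spawn(fn):
        aid = counter[0]; counter[0] += 1
        actors[aid] = {"gen": fn(), "mailbox": deque()}
        ready.append((aid, None))
        return aid

    spawn(main)
    while ready:
        aid, val = ready.popleft()
        try:
            req = actors[aid]["gen"].send(val)
        except StopIteration:             # Actor.exit
            continue
        if req[0] == "spawn":
            ready.append((aid, spawn(req[1])))
        elif req[0] == "send":
            _, target, msg = req
            actors[target]["mailbox"].append(msg)
            if target in blocked:         # receiver was waiting: deliver now
                blocked.discard(target)
                ready.append((target, actors[target]["mailbox"].popleft()))
            ready.append((aid, None))     # requeue the sender
        elif req[0] == "recv":
            mb = actors[aid]["mailbox"]
            if mb:
                ready.append((aid, mb.popleft()))
            else:
                blocked.add(aid)          # park until a send arrives

received = []

def echo():
    while True:
        msg = yield ("recv",)
        received.append(msg)

def main():
    worker = yield ("spawn", echo)
    yield ("send", worker, "ping")
    yield ("send", worker, "pong")

run_actors(main)
# received == ["ping", "pong"]
```

Only the 'send'/'recv' arms are new relative to the coroutine scheduler; the ready queue and the interpret loop are unchanged, which is the structural point.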
The realization
You haven’t built a kernel and then separately built a coroutines runtime and separately built an actor system. You’ve built one general-purpose mechanism — handlers acting on continuations — and applied it three times.
This is the deepest claim of the design: the same mechanism, applied recursively, gives you arbitrary stateful runtimes for free. A scheduler is a special handler. An actor system is a special handler with mailboxes. A coroutine generator is a special handler that returns control to a consumer. The distinction between “kernel” and “user-space runtime” dissolves; both are configurations of the same primitive.
Compare this to how things work in Linux. Linux has a kernel scheduler (CFS, EEVDF, or whatever) implemented in C. On top of that, libraries like libuv or asyncio provide user-space coroutine runtimes, implemented in C or Python, using epoll/io_uring/kqueue to multiplex I/O. On top of that, application code uses the coroutines. Three distinct runtimes, written by three different teams, in three different languages, with different invariants, different debugging tools, different performance characteristics.
In our system, there’s one runtime. The kernel handler stack and the user-space handler stack are written in the same language with the same primitives. A debugger can see across the layers. The performance characteristics are uniform. The invariants compose.
What this buys, in practice
A few practical consequences:
Custom schedulers are cheap. Want round-robin? Use a FIFO queue. Want priority-based? Use a heap. Want fair-share? Track CPU time per process. The scheduler is one handler; swap it out at the deployment site without touching anything else. In Linux you write a kernel module; here you write a handler.
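The "swap the queue, swap the policy" claim is concrete: if the scheduler only ever pushes and pops through a policy object, round-robin versus priority is a one-line substitution at the call site. A Python sketch with illustrative names (FifoPolicy, PriorityPolicy):

```python
import heapq
from collections import deque

class FifoPolicy:
    """Round-robin: a plain FIFO queue; priority is ignored."""
    def __init__(self): self.q = deque()
    def push(self, prio, task): self.q.append(task)
    def pop(self): return self.q.popleft()
    def __bool__(self): return bool(self.q)

class PriorityPolicy:
    """Priority-based: a heap keyed on priority (lower runs first);
    a counter breaks ties in arrival order."""
    def __init__(self): self.q = []; self.n = 0
    def push(self, prio, task):
        heapq.heappush(self.q, (prio, self.n, task)); self.n += 1
    def pop(self): return heapq.heappop(self.q)[2]
    def __bool__(self): return bool(self.q)

def run(policy, tasks):
    """The 'scheduler' just pops from whatever policy it was handed."""
    order = []
    for prio, name in tasks:
        policy.push(prio, name)
    while policy:
        order.append(policy.pop())
    return order

tasks = [(2, "low"), (0, "high"), (1, "mid")]
fifo_order = run(FifoPolicy(), tasks)       # ['low', 'high', 'mid']
prio_order = run(PriorityPolicy(), tasks)   # ['high', 'mid', 'low']
```

Nothing outside the policy object changes; in the handler formulation, the policy object is just a field of the handler's state.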
Custom concurrency models are cheap. Actors, CSP-style channels, futures, software transactional memory — all of them are handler patterns. None require new language features. None require runtime support beyond the universal primitive.
Debugger and profiler hooks are cheap. A profiler is “a handler that counts effects, then forwards.” A debugger trace is “a handler that logs effects, then forwards.” Adding observability is adding a handler.
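"A handler that counts effects, then forwards" is almost literally a wrapping function. A Python sketch in the same effects-as-yielded-tuples style as the scheduler sketches (counting and body are illustrative names):

```python
from collections import Counter

def counting(coro_fn):
    """Wrap a coroutine body: tally each effect it performs, then forward
    the request unchanged to whatever handler sits above."""
    counts = Counter()
    def wrapped():
        inner = coro_fn()
        val = None
        while True:
            try:
                req = inner.send(val)
            except StopIteration:
                return
            counts[req[0]] += 1   # observe the effect...
            val = yield req       # ...then forward it untouched
    return wrapped, counts

def body():
    yield ("yield",)
    yield ("yield",)
    yield ("log", "hi")

wrapped, counts = counting(body)
for _ in wrapped():   # a trivial outer handler that discards the requests
    pass
# counts now records {"yield": 2, "log": 1}
```

The wrapped coroutine is indistinguishable from the original to the handler above it; observability is added without touching either side.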
Testing is cheap. A test harness installs handlers that record/replay/stub. The system under test is the same code as in production; only the handlers differ.
What this chapter committed to
A coroutines runtime in CML, in well under a hundred lines. The realization that it's structurally identical to the kernel. A sketch of an actor system, same shape. The argument that effect handlers + continuations give you, by composition, the runtimes that conventional languages bake in as special features.
The next chapter is the honest assessment: what this design is for, and what it isn’t.