When we started building what would eventually become cos2, the fastest path looked obvious. Wrap Codex. Wrap Claude Code. Ship a thin product layer on top of someone else’s agent loop and call it a harness. Every other team in this space was doing some version of that.
We chose the harder path. We built our own runtime. We picked a language — Go — that almost no one else in AI tooling was using. We wrote the tool execution layer, the streaming loop, the checkpoint system, the sandbox bridge, the LSP integration, the MCP integration, and the multi-agent orchestrator ourselves. We built a gvisor-sandboxed Kubernetes substrate to run it on. And then we compiled the same harness down to WebAssembly and dropped it inside our VSCode extension as a first-class citizen.
This post is about why we made that call, what it costs, and what it bought us.
The founding principle: own the stack you depend on
Cosine’s founding principle is that a product company shipping at the frontier of AI has to control its own runtime. Not because we’re opposed to open source — we ship plenty of it — but because the systems that matter most to our users live at the seams between layers. The agent loop. The sandbox boundary. The tool execution contract. The way files stream into context. These seams are where correctness, speed, and safety are actually decided.
If those seams live in someone else’s binary, you’re a customer of their roadmap.
The product-critical stack is ours where behaviour is decided. Cloud, Desktop and CLI are separate surfaces, but they run against the same harness and tool layer. ChonkyLLM sits on the model boundary: our own Genie-family models, post-trained open-source models, and frontier APIs all come through the same normalised routing contract. Kubernetes and gVisor are open-source substrate; the harness, tool layer, Keef runtime image, and ChonkyLLM routing path are the parts we need to patch and ship ourselves. When a user hits a tool timeout in a Keef pod and needs us to ship a fix, that fix travels through code we wrote, compiled by a pipeline we run, into a binary we signed, scheduled onto Kubernetes, and called by a harness we maintain. No ticket queues. No blocked-on-upstream. No “we’ll get to it next release.”
Why Go — the story starts with filesystems
Our first product was a VSCode extension. We shipped it in March 2023 as part of YC W23, and we wrote it in TypeScript because that’s what VSCode extensions are written in. It was the right call at the time, and the extension taught us a lot of things — but the most important lesson was a quiet one: the filesystem was always the bottleneck.
Every interesting agent operation involves the filesystem. Walking the repo to build context. Reading hundreds of candidate files before narrowing down to one. Streaming chunks into a tool result that the LLM can actually reason about. Watching mtimes to invalidate caches. We kept hitting the same wall, over and over: fs.promises plus worker threads was not going to scale to what we wanted agents to do.
The event loop stalled on large sync reads. V8 needed a JIT warmup before throughput stabilized. Every file produced a fresh JavaScript object, which meant GC pressure that grew with repo size. On repos bigger than a few thousand files, cold-start walk times crossed the line where the user could feel them.
We learned Go the hard way — by replacing our fast path, then our slow path, then the whole ingestion pipeline. The difference wasn’t a percentage. It was a category change.
Node, in practice:
- event loop stalls on large sync reads
- V8 JIT warmup tax
- per-file JS object churn → GC pressure

Go, by contrast:
- syscalls scheduled onto OS threads
- zero-copy streaming into tool results
- static binary · no runtime warmup
That comparison is illustrative of what we measured walking a medium monorepo, not a benchmark you should quote back at us, but it represents something real: a statically-linked Go binary walking files with goroutines, streaming straight into tool results, comfortably does an order of magnitude more than our old Node pipeline could. The architectural reason is simple. Go’s runtime schedules blocking syscalls onto OS threads automatically. Node doesn’t. Go has zero warmup. Node doesn’t. Go’s GC pauses are measured in microseconds at our allocation rates. Node’s aren’t.
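To make that concrete, here is a minimal sketch of the shape of that pipeline: a bounded pool of goroutines draining a directory walk and streaming file contents onward. It is not our production walker (no ignore rules, no mtime cache, no semantic chunking), just the stdlib pattern the architecture leans on.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"runtime"
	"sync"
)

// fileChunk is one unit streamed toward a tool result.
type fileChunk struct {
	Path string
	Data []byte
}

// walkRepo streams every regular file under root. Blocking reads run on a
// pool of goroutines; the Go runtime parks blocked syscalls on OS threads,
// so no single event loop ever stalls.
func walkRepo(root string) <-chan fileChunk {
	paths := make(chan string, 256)
	out := make(chan fileChunk, 256)

	// Producer: walk the tree and emit candidate paths.
	go func() {
		defer close(paths)
		filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
			if err == nil && d.Type().IsRegular() {
				paths <- p
			}
			return nil
		})
	}()

	// Consumers: a bounded pool of readers feeding the output stream.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU()*4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range paths {
				data, err := os.ReadFile(p)
				if err != nil {
					continue
				}
				out <- fileChunk{Path: p, Data: data}
			}
		}()
	}
	go func() { wg.Wait(); close(out) }()
	return out
}

func main() {
	files, total := 0, 0
	for c := range walkRepo(".") {
		files++
		total += len(c.Data)
	}
	fmt.Printf("read %d files, %d bytes\n", files, total)
}
```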
You can watch other agents discovering this the hard way, in public, in real time. Claude Code is the most visible example. It’s a Node-based harness, and its grep tool is backed by the @vscode/ripgrep npm package — a Node wrapper around a native binary. That wrapper has measurable overhead on large repos, to the point where the community’s standard advice is to set USE_BUILTIN_RIPGREP=0 and route through a locally-installed system rg for a 5–10× speedup. Anthropic’s own /doctor diagnostic admits three separate ripgrep modes — system, builtin, and embedded — depending on how the user installed the tool, because the Node path is slow enough that the native-installation path has to compile ripgrep directly into the Single Executable Application binary to claw the speed back. What you’re watching is a Node harness slowly, painfully, one subsystem at a time, turn itself into a native harness. We made that decision on day one.
When we rebuilt our own search stack a year later, we hit the same performance wall and solved it the way the underlying physics wanted us to. We wrote about that journey in 1.16: Go for the product integration, Rust via FFI for the performance-critical keyword search (we call it ggrep), OS pipes for streaming, static linking for single-binary distribution. No bundled-vs-system-vs-embedded matrix for the user to reason about. One binary, one code path, ripgrep-class performance on every install. Go sat at the centre because it was the language that could hold every part of the stack together.
The harness itself — a turn-based execution engine
The agent loop is the heart of everything. It’s a turn-based execution engine that implements the full OpenAI Responses API with streaming, tool parallelism, and a compaction layer that keeps the working window inside the model’s budget.
Prune and summarise prior turns if the window is close to the model budget:

```go
if window.tokens > budget*0.8 {
    window = compact(window)
}
```

What matters isn’t the loop itself — every harness has one of these — it’s the properties we get out the other side because the harness is ours:
Deterministic replay. Every tool call, every streaming event, every budget decision is logged with enough structure to replay the whole session. When a user hits a weird result, we don’t ask them to reproduce. We replay.
Checkpoints that are real. Every rollout boundary creates a git commit on a scratch branch. Rewinding isn’t a simulation. It’s git reset. The state the user reverts to is the state their working tree was in.
Honest tool parallelism. Tool calls dispatch concurrently with per-tool rate limits and timeouts. We don’t quietly serialize work the model asked for in parallel: if the model asks for seven reads, we do seven reads at once. (A minimal sketch of the dispatch shape follows below.)
EventBus as first-class infrastructure. Observers subscribe to every meaningful event — memory writes, tip surfacing, checkpoint creation, subagent lifecycle. Downstream integrations, the TUI, the desktop overlays, the VSCode extension all plug into the same bus.
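The tool-dispatch sketch promised above, minus the real machinery (rate limiting, retries, streaming results back into the turn); the type names here are illustrative rather than our actual API:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type ToolCall struct {
	Name string
	Args string
}

type ToolResult struct {
	Name   string
	Output string
	Err    error
}

// dispatchParallel runs every tool call the model asked for concurrently,
// each under its own timeout. Results come back in call order so the
// transcript stays deterministic.
func dispatchParallel(ctx context.Context, calls []ToolCall,
	run func(context.Context, ToolCall) (string, error)) []ToolResult {

	results := make([]ToolResult, len(calls))
	var wg sync.WaitGroup
	for i, c := range calls {
		wg.Add(1)
		go func(i int, c ToolCall) {
			defer wg.Done()
			// Per-tool timeout: a slow read doesn't hold the whole turn hostage.
			cctx, cancel := context.WithTimeout(ctx, 30*time.Second)
			defer cancel()
			out, err := run(cctx, c)
			results[i] = ToolResult{Name: c.Name, Output: out, Err: err}
		}(i, c)
	}
	wg.Wait()
	return results
}

func main() {
	calls := []ToolCall{{Name: "read_file", Args: "auth.ts"}, {Name: "grep", Args: "session"}}
	res := dispatchParallel(context.Background(), calls,
		func(ctx context.Context, c ToolCall) (string, error) {
			return "ok: " + c.Args, nil // stand-in for the real tool runner
		})
	for _, r := range res {
		fmt.Println(r.Name, r.Output, r.Err)
	}
}
```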
Here is what a single session looks like when you watch the wire:
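A reconstructed flavour of it, anyway; the event kinds are the ones the bus carries, but the names and payloads below are illustrative rather than the literal wire format:

```go
package main

import "fmt"

// One turn of a session, replayed as the events a bus subscriber would see.
// Kinds mirror what the harness emits (tool calls, checkpoints, memory
// writes, budget decisions); the payload strings are purely illustrative.
type event struct {
	kind   string
	detail string
}

func main() {
	turn := []event{
		{"turn.start", "user: add rate limiting to the login endpoint"},
		{"tool.call", "read_file(auth.ts) and grep(\"rate limit\") dispatched in parallel"},
		{"tool.result", "read_file: 212 lines streamed"},
		{"tool.result", "grep: 14 matches"},
		{"model.delta", "assistant tokens streaming"},
		{"tool.call", "apply_patch(auth.ts)"},
		{"checkpoint.created", "commit on scratch branch"},
		{"memory.write", "note: project already wires a rate-limit middleware"},
		{"turn.end", "window within budget, no compaction needed"},
	}
	for _, e := range turn {
		fmt.Printf("%-20s %s\n", e.kind, e.detail)
	}
}
```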
Go as the “everything” language
The decision to keep building in Go paid off in a way we didn’t fully anticipate. It wasn’t that Go made the CLI faster — though it did. It was that the same Go source tree started serving as the runtime for every surface we ship.
Start with the artifact itself: one statically-linked binary, no runtime, no install-time npm resolution. It ships via brew, winget, and curl | bash, and starts in under 100 ms on every machine we have benchmarked.
```
$ file $(which cos)
cos: Mach-O 64-bit executable arm64

$ ldd $(which cos) 2>/dev/null || otool -L $(which cos)
    /usr/lib/libSystem.B.dylib
    /usr/lib/libc++.1.dylib
```

The same Go code, cross-compiled five different ways, becomes five different products:
The CLI is the obvious one. One statically-linked binary per OS/arch pair, installed by brew, winget, or a curl pipe. No npm resolution, no runtime environment assumptions, no “it works on my node.”
The VSCode extension is where it gets interesting. GOOS=js GOARCH=wasm go build takes the same harness package and produces a .wasm module. The extension loads it, binds a filesystem and network bridge in JavaScript, and calls into the agent loop directly. There is no sidecar CLI subprocess. There is no JSON-RPC shim talking to a local server. The agent loop is the extension. Five years ago this would have been exotic; today it’s the only sensible way to put a serious runtime inside a web-ish host. It also means that when we ship a fix to the CLI, the same fix flows through the same test suite into the WASM build, and the VSCode extension picks it up on the next release.
We learned what not to do here from our original VSCode extension. That extension talked to a language-server-style backend over IPC, and the IPC boundary became the bottleneck — for correctness, for latency, for every new feature we wanted to ship. The WASM-embedded approach removes the boundary entirely. Tool calls that used to round-trip through a subprocess now call a Go function.
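A stripped-down sketch of what that embedding looks like on the Go side. The exported entry point here ("cosRunTurn") is a hypothetical name for illustration; the real extension binds a richer filesystem and network bridge.

```go
//go:build js && wasm

// Built with: GOOS=js GOARCH=wasm go build -o harness.wasm
// The extension loads the module, binds its bridges, and calls straight in.
package main

import "syscall/js"

func main() {
	// Expose one entry point on the JS global object.
	js.Global().Set("cosRunTurn", js.FuncOf(func(this js.Value, args []js.Value) any {
		prompt := args[0].String()
		onEvent := args[1] // a JS callback that receives each streaming event
		go func() {
			// The real agent loop runs here; this stub just shows the shape
			// of the boundary: a direct function call, no subprocess, no RPC.
			onEvent.Invoke("turn.start: " + prompt)
			onEvent.Invoke("turn.end")
		}()
		return nil
	}))
	select {} // keep the Go runtime alive so JS can keep calling in
}
```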
The API server imports the harness as a Go package. When a service needs to spawn a session on behalf of a remote user, it doesn’t shell out to cos. It constructs a harness.Session in-process. No serialization tax, no container boundary.
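The call-site difference is easy to show. The stand-in types below are illustrative (the real harness API is internal), but the shape is the point: a remote user's request becomes an ordinary Go value, and events come back on a channel rather than over a pipe.

```go
package main

import (
	"context"
	"fmt"
)

// Stand-ins for the real harness package; the actual API is internal,
// so these names and signatures are illustrative only.
type Session struct{ workdir string }

func NewSession(workdir string) *Session { return &Session{workdir: workdir} }

func (s *Session) Run(ctx context.Context, prompt string) <-chan string {
	out := make(chan string, 2)
	go func() {
		defer close(out)
		out <- "turn.start"
		out <- "turn.end"
	}()
	return out
}

// The API-server call site: no exec.Command("cos", ...), no JSON over a pipe.
func handleRemoteSession(ctx context.Context, prompt string) {
	sess := NewSession("/workspace")
	for ev := range sess.Run(ctx, prompt) {
		fmt.Println(ev)
	}
}

func main() {
	handleRemoteSession(context.Background(), "fix the failing auth test")
}
```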
The Keef pod is a Kubernetes pod whose entire image is FROM scratch plus our single binary. That matters. No base Linux userland. No Python runtime. No Node.js. The attack surface inside the sandbox is effectively our own code plus gvisor’s userspace kernel. We’ve gotten pod images down to about 24 MB. Cold starts live comfortably under 500 ms because there’s nothing to warm up.
The desktop app runs on wails v3 alpha. The Go harness is the main process; the UI is a native webview. We wrote a custom browser-view service that lets the agent drive a real Chromium embedded in the same process as the runtime — the agent can take screenshots, click elements, fill forms, and feed all of that straight back into the harness without any cross-process boundary. The whole app is the harness plus a view layer.
This is the thing we keep trying to explain and that doesn’t quite land until people see it running: the harness is not a CLI that has been ported to other places. It is the runtime, and the CLI, the extension, the desktop app, the API server, and the sandbox are all different shapes we compile it into. One source tree, one test suite, one release pipeline.
Why Go specifically, not Rust
The obvious question is why Go rather than Rust. We like Rust. We use it for the parts of our stack where we need to be ruthless about latency — the ggrep engine is Rust, and it’s faster than what we would have written in Go.
But Go has a specific property that Rust doesn’t: it stays out of your way while you’re figuring out what you’re building. The compile loop is measured in single-digit seconds. The concurrency story is goroutines, not async colour-of-function gymnastics. The standard library is complete enough that most of our harness doesn’t pull a dependency at all. And we ship a single binary at the end. There is no “which runtime does this user have installed” question. There never has been, for us.
Picking Go was a bet on iteration speed without giving up the ability to compile down to native binaries on every platform we care about — including arm64, x86_64, and wasm32. That bet has aged well.
gvisor-sandboxed Kubernetes: Keef
The harness is one half of the story. The other half is where we run it.
When a user hits the limit of what their laptop can do — five parallel agents, a long background task, a run they want to leave going while they sleep — they promote the session to remote. That promotion flips a bit in the TUI and suddenly the same harness is running in a Keef pod on our infrastructure, streaming events back over a WebSocket.
The interface that never leaves your machine. Streams events in, sends your keystrokes out. Works identically whether the work is running locally or on a Keef pod.
- Keyboard-first UI
- Ctrl+T terminal overlay
- Checkpoint timeline
The physical stack is four layers deep: TUI, API gateway, Kubernetes, Keef pod. Each boundary has a deliberate protocol. WebSocket for the client, because it works through every corporate proxy and firewall we’ve encountered. Kubernetes, the open-source piece, handles pod orchestration and lifecycle. Length-delimited protobuf over TCP for the pod boundary, because it’s cheap, simple, and perfectly adequate for the traffic shape we actually have.
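The pod-boundary framing is nothing exotic, which is the point. A sketch of the idea (the payload would be a marshalled protobuf message; here it is just bytes):

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeFrame prefixes a message with its length as a varint, the classic
// length-delimited framing, so the reader knows exactly how many bytes
// belong to the next message on the stream.
func writeFrame(w io.Writer, msg []byte) error {
	var hdr [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(hdr[:], uint64(len(msg)))
	if _, err := w.Write(hdr[:n]); err != nil {
		return err
	}
	_, err := w.Write(msg)
	return err
}

// readFrame reads one length-delimited message back off the stream.
func readFrame(r *bufio.Reader) ([]byte, error) {
	size, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	buf := make([]byte, size)
	_, err = io.ReadFull(r, buf)
	return buf, err
}

func main() {
	var conn bytes.Buffer // stands in for the TCP connection to the pod
	writeFrame(&conn, []byte("marshalled protobuf bytes go here"))
	msg, _ := readFrame(bufio.NewReader(&conn))
	fmt.Printf("%s\n", msg)
}
```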
Inside the pod, gvisor’s runsc runtime gives us a userspace kernel per sandbox. That’s the property that matters. A host kernel bug doesn’t automatically become a cross-tenant escape. Combined with per-pod network policies (no cross-tenant egress, no reach into our control plane) and ephemeral storage, we can confidently hand an LLM a shell.
The architectural payoff is on the cost side. Keef pods are micro VMs by construction — they boot from the scratch image in under a second and are torn down when the session ends. Competitors who rent large, always-warm VMs for the same workload spend a lot of money keeping idle cycles ready for users who aren’t there.
Run your own mental math; this isn’t a price list. The architectural point is what matters: boot-on-demand sandboxes per session are fundamentally cheaper than provisioned VMs, and that cost advantage compounds as a user spins up more agents in parallel.
Parallelism is a primitive, not an afterthought
The other thing owning the harness buys us is multi-agent orchestration that actually works.
Swarm mode is a first-class operating mode of the harness. When a user or an orchestrator spawns subagents, the harness tracks them in a live topology, budgets their spawns, enforces file locks to prevent conflicting edits, and emits events every time the tree changes shape.
Every node in that topology is a goroutine running a copy of the same harness code — sometimes in-process, sometimes in a sibling Keef pod depending on the mode. The EventBus is how the TUI renders the live tree. The file lock registry is how two subagents trying to edit auth.ts at the same time negotiate rather than race. The budget manager is how we stop a runaway subplanner from spawning its way through your token budget.
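The lock registry itself is conceptually tiny; what matters is that it lives inside the loop, so every edit a subagent proposes passes through it. A sketch of the idea, with names that are illustrative rather than our actual types:

```go
package main

import (
	"fmt"
	"sync"
)

// lockRegistry: before a subagent edits a file it must acquire the path.
// A second subagent asking for the same path is told who holds it and can
// wait, re-plan, or ask the orchestrator to serialise the edits.
type lockRegistry struct {
	mu    sync.Mutex
	holds map[string]string // path -> subagent ID
}

func newLockRegistry() *lockRegistry {
	return &lockRegistry{holds: make(map[string]string)}
}

// TryAcquire reports whether agent now holds path, and who the holder is.
func (r *lockRegistry) TryAcquire(path, agent string) (bool, string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if holder, ok := r.holds[path]; ok && holder != agent {
		return false, holder
	}
	r.holds[path] = agent
	return true, agent
}

func (r *lockRegistry) Release(path, agent string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.holds[path] == agent {
		delete(r.holds, path)
	}
}

func main() {
	reg := newLockRegistry()
	ok, _ := reg.TryAcquire("auth.ts", "subagent-1")
	fmt.Println("subagent-1 got auth.ts:", ok)
	ok, holder := reg.TryAcquire("auth.ts", "subagent-2")
	fmt.Println("subagent-2 got auth.ts:", ok, "held by", holder)
	reg.Release("auth.ts", "subagent-1")
}
```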
You cannot bolt this onto a harness you don’t own. The concurrency model has to run through the loop from the beginning.
The ML flywheel: the more we own, the more we can extract
Everything up to this point is an engineering argument. The next one is a research argument, and it’s the reason we keep pulling the harness deeper into our own stack rather than pushing it out.
We are a lab. We have an in-house ML team whose job is to get the most out of every model we ship against — frontier models through their APIs, open-weight models we post-train ourselves, and our own Genie family that we trained from the ground up on software engineering trajectories. Every one of those efforts runs through the harness. The harness is the environment. The harness is the sampler. The harness is the reward signal. The harness is where the gradient actually comes from.
And that changes the calculus completely. When you own the harness, every part of it becomes a place you can turn a knob:
The tool surface is a knob. Which tools exist, what their signatures look like, how their results are formatted, how errors come back — all of that shapes what the model learns to do. If read_file returns a 300-line file with line numbers prefixed, the model learns one set of patterns. If it streams the file in semantic chunks keyed by symbol, it learns a different set. That’s not a UX decision, that’s a training decision. We can change it and watch the downstream effect.
The context shape is a knob. How we compact old turns, what we pin vs. drop, how we summarize tool outputs, whether diagnostics show before or after the diff — every one of these is a variable in our training pipeline. Someone wrapping Codex gets the context shape the vendor decided on. We get to test six of them in a weekend.
The reward signal is a knob. Did the tests pass? Did the LSP diagnostics go down? Did the checkpoint survive? Did a human approve the PR? Did a second-pass QA agent flag regressions? These are all events the EventBus already emits. For an RL pipeline, they are the reward. We don’t need to scrape logs or screen-scrape a CLI to find out whether a trajectory succeeded — the harness tells us, structurally, at the end of every rollout.
The rollout itself is a knob. Because the harness can run in-process as a Go library, our training pipeline can spin up thousands of parallel rollouts across a Kubernetes cluster without shelling out to a CLI subprocess per sample. Each rollout has its own Keef-style sandbox, its own deterministic replay log, and its own EventBus stream. We can sample at the throughput our GPU budget can actually consume.
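A toy version of that pipeline, with the sandboxed session stubbed out. The point is the shape: rollouts are goroutines calling a library, and reward falls out of structured events rather than log scraping. Event kinds and weights here are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-in types: a real rollout runs a full harness session in its own
// sandbox; here each "rollout" just returns the events a trainer would score.
type event struct{ kind string }

func runRollout(task string) []event {
	return []event{{"tests.passed"}, {"diagnostics.decreased"}, {"checkpoint.created"}}
}

// reward turns structural events into the scalar an RL trainer consumes.
func reward(evs []event) float64 {
	r := 0.0
	for _, e := range evs {
		switch e.kind {
		case "tests.passed":
			r += 1.0
		case "diagnostics.decreased":
			r += 0.25
		case "qa.flagged_regression":
			r -= 1.0
		}
	}
	return r
}

func main() {
	tasks := []string{"fix-auth-test", "add-rate-limit", "upgrade-dep"}
	rewards := make([]float64, len(tasks))
	var wg sync.WaitGroup
	for i, t := range tasks {
		wg.Add(1)
		go func(i int, t string) { // one in-process rollout per goroutine
			defer wg.Done()
			rewards[i] = reward(runRollout(t))
		}(i, t)
	}
	wg.Wait()
	fmt.Println(rewards)
}
```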
This is the part that compounds. In a supervised setting, a 1% improvement in any one of those knobs is a 1% improvement. In a reinforcement learning setting, where the policy is training against the harness, a 1% improvement anywhere in the loop stacks multiplicatively with every other improvement. Better tool results produce better trajectories produce better reward signal produce better gradients produce better models — which then produce better trajectories on the same harness on the next iteration.
If you’re wrapping someone else’s agent loop, almost none of this is accessible to you. You can post-train against their outputs, but you can’t change the environment the policy lives in, because the environment isn’t yours. You’re doing imitation learning against a sealed box. The frontier labs understand this completely — it’s why they each build their own harness, why OpenAI ships Codex and Anthropic ships Claude Code. The harness is the environment the model is trained to act in. Letting someone else own it is letting someone else own your gradient.
We are a small lab competing with enormous ones. The only way that math works is if every part of the stack we touch is a place we can extract marginal gain — and every marginal gain compounds with every other one. That’s what owning the harness actually buys us.
What owning it actually costs
We don’t pretend that building your own runtime is the right move for every product. Our bill for this approach is real:
We employ a full team on the harness. We maintain our own cross-compilation matrix. We write and run our own integration tests across five deployment surfaces. We ship fixes through a release pipeline we built. When a new frontier model lands, we’re the ones wiring up its Responses API quirks. When a VSCode API changes, we’re the ones keeping the WASM bridge stable.
In exchange we get a product where the seams are ours. That’s the trade. It lets us ship a harness that behaves identically on your laptop, in a VSCode extension, in a desktop app, and inside a gvisor-sandboxed pod a continent away. It lets us change the agent loop on a Tuesday and ship it on Thursday. It lets us be accountable for the whole experience.
What comes next
The harness is the foundation, not the ceiling. The same architecture that lets us run agents in a pod also lets us run them as a long-lived daemon on a developer’s machine, listening to Slack and GitHub, spawning proposals into an approval queue. The same WASM build that powers the extension opens the door to running the harness in the browser. The same tool layer that talks to LSP and MCP can talk to whatever protocol comes next.
None of that is possible if you’re a wrapper around someone else’s loop.
We believe — and spend our days proving — that you can extract a lot more out of the models this industry has access to by being more opinionated about how they operate, not less. The gains are in the tight coupling between good tooling, good practice, and real control over the loop the model lives inside. That’s where the next generation of coding agents will be won, and we’d rather be the ones doing the winning.
We’re hiring
If this was your kind of deep dive, come build the next one with us.
We’re a small team working at the intersection of systems engineering and applied ML — writing Go, training models, running Kubernetes, shipping products. If that combination sounds like the right problem to spend the next few years on, we want to hear from you.
See open roles
— Pandelis
If you want to go deeper, read the runtime deep-dive, our lessons from the original VSCode product, and the ggrep engineering post for the Rust/Go FFI story.
@PandelisZ


