KiloClaw: Hosting Thousands of Claws

Author: Daniel Botha

A lot of people trying out a new agent will run it in a Docker container on whatever machine they have handy. For a while, that was the best balance between simplicity and security. But it comes with issues that don’t make your agent feel like the always-online, all-knowing assistant you want it to be. Laptops sleep. VPS disks fill up with logs. Processes die at 2am and don’t restart. OS updates break dependencies.

Self-hosting an agent is straightforward until you need it to be reliable 24/7. KiloClaw is Kilo’s answer to that reliability gap, and the downright crazy speed at which it has scaled confirms that having an OpenClaw that “just works” is more exciting to a lot of people than building and maintaining their own.

What OpenClaw Actually Is

OpenClaw is an open-source AI agent that runs 24/7 and connects to chat platforms like Slack, Discord, and Telegram. It maintains outbound connections, runs scheduled tasks via built-in cron, and takes real actions: sending emails, managing calendars, browsing the web, running shell commands.

Who’s Using KiloClaw

The team built KiloClaw for developers. Then other people showed up. Like… a lot of them.

Airbnb hosts are managing multiple listings: bulk price changes, responding to guest messages, coordinating cleaners. Parents are using it as a family command center: school schedules, meal planning, coordinating who’s picking up which kid. One user said their Claw knows their family’s calendar better than they do.

Small business owners who aren’t technical and don’t care what “agentic AI” means are finding that just being able to say “find all invoices from last month and tell me which ones are unpaid” is life-changing when it just works.

The Self-Hosting Gap

OpenClaw is relatively easy to set up, and once all your connectors are plugged in, it’s easy to become reliant on it. It only starts to feel like work when you need it and something breaks. Computers are involved here, so things will break. KiloClaw aims to extend that initial honeymoon phase indefinitely by taking a few key jobs off your plate.

Memory and state management. OpenClaw doesn’t remember yesterday unless you configure persistence. A lot of people deploy it, love it, and then ask “why have you forgotten everything?!?!?!” a week later. You need to decide what to store, handle edge cases, force it to write things down more often.

Monitoring and recovery. Your agent will crash. Your VM will restart. Your API keys will expire. And of course this will happen at the exact moment you really need your OpenClaw to work.

Security surface. OpenClaw running on your machine has access to … your machine. Your files, your shell, your credentials. In this situation “what happens if someone tricks my agent into doing something I didn’t ask for” hits different.

The KiloClaw team sees the gap as operational maturity, not technical skill. It’s the same gap between “I can write a web app” and “I can run a web app that’s up at 3am on Christmas.”

The Stack

The control plane sits on Cloudflare, with a Durable Object per deployment coordinating provisioning, lifecycle management, auth token rotation, and access gating.

The compute runs on Fly. Every customer gets their own Fly app, their own volume, their own machine. One app per tenant, one machine per app.

Traffic is proxied through multiple layers to avoid exposing the Fly Machine directly. The Durable Object handles both control plane operations and access gating, so users interact with OpenClaw through the Cloudflare layer, never touching the machine’s network directly.

There’s an architecture diagram if you want the 10,000-foot view.

Why Fly

Three reasons.

The API was easy to use. The machines are always-on without needing to shut down during idle periods to stay cost-effective, which matters when your workload assumes it’s perpetually running.

Everything at Kilo moves at “Kilo Speed” and the KiloClaw launch was no different. The team had about a week to ship, so a week of sales calls to feel out capacity support was not an option. Fly looked ready to go, so go they did.

Provisioning

User signs up, hits the KiloClaw feature in the UI, clicks provision. That instantly creates a Durable Object representing the deployment. The Cloudflare Worker then provisions a Fly app, volume, and machine.

The longest part is retrying when capacity issues come up, but it’s still fast. Couple of minutes.
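The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not KiloClaw's actual pipeline: `CapacityError`, `Deployment`, `provision`, and the `create_in_region` callback are all hypothetical names standing in for whatever the Worker really calls, and the region names are placeholders.

```python
import time
from dataclasses import dataclass


class CapacityError(Exception):
    """Hypothetical error: the provider has no room in the requested region."""


@dataclass
class Deployment:
    app: str
    region: str
    volume_id: str
    machine_id: str


def provision(tenant_id, regions, create_in_region, max_rounds=3, backoff_s=2.0):
    """Try each region in order; on capacity errors, back off and retry the list."""
    for round_no in range(max_rounds):
        for region in regions:
            try:
                # Assumed to create the app, volume, and machine in one step.
                return create_in_region(tenant_id, region)
            except CapacityError:
                continue  # this region is full, try the next one
        time.sleep(backoff_s * (2 ** round_no))  # exponential backoff between rounds
    raise RuntimeError(f"no capacity for {tenant_id} after {max_rounds} rounds")


# Fake provider for illustration only: one region is always full.
def fake_create(tenant_id, region):
    if region == "full-region":
        raise CapacityError(region)
    return Deployment(f"claw-{tenant_id}", region, "vol_demo", "machine_demo")


result = provision("t1", ["full-region", "open-region"], fake_create, backoff_s=0)
print(result.region)
```

Spreading attempts over a list of regions is one way to trade placement control for a higher chance that provisioning succeeds on the first pass.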

Isolation

Every customer gets a Machine, a dedicated Firecracker microVM with its own kernel. Not a shared container. Not a slice of someone else’s runtime.

Five isolation layers run simultaneously: identity-based routing (destination derived from authenticated identity, not user input), dedicated Fly app per tenant, isolated WireGuard network mesh per app, the Firecracker VM boundary, and dedicated encrypted storage volumes.
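To make the first layer concrete, here is a small sketch of identity-based routing. It assumes a signed token of the form `user_id.signature`; the signing key, the `ROUTES` table, and the function names are hypothetical and only illustrate the idea that the upstream is looked up from the verified identity, never from anything the caller supplies.

```python
import hashlib
import hmac

SIGNING_KEY = b"demo-signing-key"  # hypothetical; a real key would come from a secret store

# Hypothetical mapping from authenticated user to that tenant's private upstream.
ROUTES = {
    "user-a": "claw-user-a.internal",
    "user-b": "claw-user-b.internal",
}


def sign(user_id: str) -> str:
    return hmac.new(SIGNING_KEY, user_id.encode(), hashlib.sha256).hexdigest()


def issue_token(user_id: str) -> str:
    return f"{user_id}.{sign(user_id)}"


def resolve_upstream(token: str, requested_host: str = "") -> str:
    """Return the upstream for the token's owner. `requested_host` is ignored on purpose."""
    user_id, _, signature = token.rpartition(".")
    if not hmac.compare_digest(signature, sign(user_id)):
        raise PermissionError("invalid token")
    upstream = ROUTES.get(user_id)
    if upstream is None:
        raise PermissionError("no deployment for this identity")
    return upstream


token_a = issue_token("user-a")
# Even if user A asks for user B's host, the route comes from A's identity.
routed = resolve_upstream(token_a, requested_host="claw-user-b.internal")
print(routed)
```

Because the destination never depends on request input, there is no hostname or path parameter for an attacker to tamper with.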

KiloClaw commissioned an independent security assessment in February 2026 that included 35 adversarial tenant-isolation tests, 8 live cross-tenant network tests, and dozens of injection payloads. No cross-tenant access path was found during the assessment. Fly’s own infrastructure is tested annually by firms like Atredis, Doyen, and Tetrel.

Customer API keys and chat tokens are encrypted at rest and only decrypted when the customer’s VM starts.

Sizing

Initially, each instance ran on 2 shared vCPUs with 3 GB RAM and a 10 GB persistent SSD. This was recently bumped to 1 performance CPU and 4 GB RAM, as newer versions of OpenClaw have become resource-hungry at boot time.

A future “pro” tier will likely deploy to higher-specced infra for heavier workloads. Offloading coding tasks, more intensive automation, things that extend the “it just works” philosophy to power users.

Always On

OpenClaw is more stable when it runs on an “always on” machine. Channels maintain outbound connections to Slack, Discord, Telegram. The built-in cron schedules tasks throughout the day. In many ways, shutting down between requests breaks the fundamental operating model.

This is the opposite of the typical “run the agent in an ephemeral sandbox” pattern. The agent isn’t a function that responds to events. It’s a process that lives somewhere, holds connections open, and acts on its own schedule.

The 3am Question

Everyone knows the most critical time for your agent to be working is when you have a question at 3am. I asked the KiloClaw team to compare the experience of something breaking at 3am on a self-hosted OpenClaw vs. on KiloClaw.

Self-hosted: A process dies silently. Maybe you have monitoring. If you do, you get paged, groggily SSH in, check logs, restart. If you don’t, you try to use your agent, wonder why it’s dead. Get sad. Begin debug cycle: is it the VPS? Docker? OOM? Expired API key? Best case, 15 minutes to recovery. Realistic case, you fix it tomorrow because you’re tired.

KiloClaw, 3am crash: a custom service wrapper detects the issue immediately. It controls OpenClaw directly and reports status independently of the agent itself. Agent restarts automatically with state preserved. You groggily ask your agent the question, get an answer and sleep like a baby.

The key here is that KiloClaw doesn’t rely on the agent to tell Kilo it’s healthy. The wrapper monitors from outside, so even a completely frozen agent gets detected and recovered.
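A rough sketch of that outside-in monitoring idea, under stated assumptions: the `Watchdog` class, its three callbacks, and the three-strikes threshold are all hypothetical and not taken from KiloClaw's wrapper. The point it illustrates is that liveness is judged by the supervisor's own checks, so a process that is running but frozen still gets restarted.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Watchdog:
    is_running: Callable[[], bool]  # e.g. "has the child process exited?"
    probe: Callable[[], bool]       # external check, e.g. a request with a timeout
    restart: Callable[[], None]     # restarts the agent; state lives on the volume
    max_failed_probes: int = 3      # hypothetical threshold
    failed_probes: int = 0
    events: List[str] = field(default_factory=list)

    def tick(self) -> str:
        """Run one monitoring cycle and return the resulting status."""
        if not self.is_running():
            return self._recover("exited")
        if self.probe():
            self.failed_probes = 0
            return "healthy"
        self.failed_probes += 1
        if self.failed_probes >= self.max_failed_probes:
            return self._recover("unresponsive")
        return "degraded"

    def _recover(self, reason: str) -> str:
        self.restart()
        self.failed_probes = 0
        self.events.append(reason)
        return f"restarted:{reason}"


# Simulate a frozen agent: the process is alive but never answers the probe.
restarts = []
dog = Watchdog(
    is_running=lambda: True,
    probe=lambda: False,
    restart=lambda: restarts.append("restart"),
)
statuses = [dog.tick() for _ in range(3)]
print(statuses)
```

In a real wrapper, `tick` would run on a timer and the status would be reported to the control plane independently of the agent.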

Growing Pains

Capacity. Interest in KiloClaw exploded, and getting instances online across regions, including figuring out whether to use meta regions, was a challenge. But not an unexpected one: they knew growth would be fast and would likely hit constraints.

The provisioning pipeline isn’t perfect yet. Failed deployments sometimes leave orphaned volumes that need manual cleanup. At the pace they shipped, also expected.
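One way such cleanup could eventually be automated is a periodic reconciliation pass that compares the volumes that exist against the volumes that live deployments reference. The sketch below is illustrative only: the record shapes and the `find_orphaned_volumes` helper are hypothetical, and a real job would fetch both lists from the provider and the control plane.

```python
def find_orphaned_volumes(volumes, deployments):
    """Return IDs of volumes that no known deployment references."""
    in_use = {d["volume_id"] for d in deployments}
    return sorted(v["id"] for v in volumes if v["id"] not in in_use)


# Hypothetical inventory: vol_b was left behind by a failed deployment.
volumes = [{"id": "vol_a"}, {"id": "vol_b"}, {"id": "vol_c"}]
deployments = [
    {"tenant": "t1", "volume_id": "vol_a"},
    {"tenant": "t2", "volume_id": "vol_c"},
]

orphans = find_orphaned_volumes(volumes, deployments)
print(orphans)
```

A cautious version would only flag candidates for review rather than delete them, since a volume may belong to a deployment that is still mid-provisioning.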

In an ideal world they’d deploy into a handful of cheap regions and let users pick “US” or “EU.” Right now they are spread across all regions to minimize provisioning failures.

Fleet Updates

Since each deployment is backed by a Durable Object, fleet-wide operations are driven from the Cloudflare side. When an auth token is about to expire, the Durable Object orchestrates the refresh. Config changes, image updates, lifecycle events: all coordinated by the control plane, pushed to individual machines.
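As an illustration of the "about to expire" decision, here is a minimal sketch of the timing logic a per-deployment coordinator might use. The one-hour lead time and the function names are assumptions, not details from the KiloClaw team; the source only says the Durable Object orchestrates the refresh.

```python
from datetime import datetime, timedelta, timezone

REFRESH_LEAD = timedelta(hours=1)  # hypothetical: refresh this long before expiry


def needs_refresh(expires_at: datetime, now: datetime, lead: timedelta = REFRESH_LEAD) -> bool:
    """True once the token is inside the refresh window (or already expired)."""
    return expires_at - now <= lead


def next_wakeup(expires_at: datetime, lead: timedelta = REFRESH_LEAD) -> datetime:
    """When the coordinator should next wake up to rotate this token."""
    return expires_at - lead


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
soon = now + timedelta(minutes=30)
later = now + timedelta(hours=2)

print(needs_refresh(soon, now), needs_refresh(later, now), next_wakeup(later))
```

Scheduling one wakeup per deployment keeps the refresh logic next to the state it protects, instead of in a fleet-wide sweep.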

Persistent packages shipped in March. pip install and npm install -g now survive restarts. Previously everything installed at runtime would vanish on reboot. Package directories write to the durable volume, so the machine can cycle without losing user-installed dependencies.
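A sketch of how package directories can be pointed at a durable volume. `PYTHONUSERBASE`, `PIP_USER`, and `NPM_CONFIG_PREFIX` are standard Python, pip, and npm settings, but whether KiloClaw uses these exact variables is an assumption, as are the `/data` mount point and the `persistent_env` helper.

```python
from pathlib import PurePosixPath


def persistent_env(volume_mount="/data", base_env=None):
    """Build an environment where pip and npm global installs land on the volume."""
    env = dict(base_env or {})
    root = PurePosixPath(volume_mount) / "packages"  # hypothetical layout
    py_base = root / "python"
    npm_prefix = root / "npm"

    env["PYTHONUSERBASE"] = str(py_base)      # base directory for user-site installs
    env["PIP_USER"] = "1"                     # same effect as `pip install --user`
    env["NPM_CONFIG_PREFIX"] = str(npm_prefix)  # where `npm install -g` writes

    # Put the persisted bin directories ahead of the system ones.
    env["PATH"] = ":".join(
        [str(py_base / "bin"), str(npm_prefix / "bin"), env.get("PATH", "/usr/bin:/bin")]
    )
    return env


env = persistent_env("/data", {"PATH": "/usr/bin"})
print(env["PATH"])
```

One caveat with this approach: pip rejects user installs inside a virtual environment, so it suits a system interpreter better than a venv.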

The Security Surprise

The biggest thing keeping users around is, unsurprisingly, also the biggest concern among OpenClaw holdouts: the security model. KiloClaw security is invisible when it works, and it’s the thing the team spends the most time on with early adopters.

Hosting OpenClaw for thousands of users comes with an entirely different set of security responsibilities and concerns than running your own instance on your Mac.

So the KiloClaw team has been drawing lines with users directly. What actions need confirmation vs. happen automatically? Sending an email? Probably confirm. Checking your calendar? Just do it. But where’s the line between “look up this contact” and “email this contact”?

What data should the agent access vs. what’s off-limits? Your inbox is useful. Your password manager is not.

What happens if someone tricks your agent in a group chat? Prompt injection is what happens. The exec tool requires explicit user approval by default, enforced by the platform itself, not the agent. Even in a worst case, the blast radius stays inside the customer’s own VM.
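The "enforced by the platform, not the agent" part can be sketched as a gate that sits between the agent and its tools. Everything here is hypothetical: the tool names, the two policy sets, and the `gate` function are illustrative, and the real policy is something the team is still working out with users.

```python
from dataclasses import dataclass

# Hypothetical policy tables; unknown tools are denied by default.
AUTO_ALLOWED = {"calendar.read", "contacts.lookup"}
NEEDS_APPROVAL = {"exec", "email.send"}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: str
    # Set by the platform from an explicit user action, never from agent output.
    approved_by_user: bool = False


def gate(call: ToolCall) -> str:
    """Decide what the platform does with a tool call the agent requested."""
    if call.tool in AUTO_ALLOWED:
        return "run"
    if call.tool in NEEDS_APPROVAL:
        return "run" if call.approved_by_user else "ask_user"
    return "deny"


decisions = [
    gate(ToolCall("calendar.read", "today")),
    gate(ToolCall("exec", "ls /tmp")),
    gate(ToolCall("exec", "ls /tmp", approved_by_user=True)),
    gate(ToolCall("unknown.tool", "")),
]
print(decisions)
```

Since the approval flag lives outside the agent, an injected prompt can make the agent ask for `exec`, but it cannot make the platform skip the confirmation step.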

The team is putting a huge amount of work into getting this right and learning that the only way to earn users’ trust is by building the security model with them, not for them. Users push back when it’s too restrictive (“why can’t it just send the email?”) and when it’s too permissive (“wait, it can do what without asking?”). That feedback loop is something you don’t get anywhere else.
