
This whole paragraph is just one long sentence. God I love just random-ass blogging again.
This bit by Geoffrey Huntley is super interesting to me and, despite calling out that LLM-driven development agents like Cursor have something like a 40% success rate at actually building anything that passes acceptance criteria, makes me think that more of the future of our field belongs to people who figure out how to use this weird bags of model weights than any of us are comfortable with.
I’ve been dinking around with Cursor for a week now (if you haven’t, I think it’s something close to malpractice not to at least take it — or something like it — for a spin) and am just now from this post learning that Cursor has this rules feature.
The important thing for me is not how Cursor rules work, but rather how Huntley uses them. He turns them back on themselves, writing rules to tell Cursor how to organize the rules, and then teach Cursor how to write (under human supervision) its own rules.
Cursor kept trying to get Huntley to use Bazel as a build system. So he had cursor write a rule for itself: “no bazel”. And there was no more Bazel. If I’d known I could do this, I probably wouldn’t have bounced from the Elixir project I had Cursor doing, where trying to get it to write simple unit tests got it all tangled up trying to make Mox work.
But I’m burying the lead.
Security people have been for several years now somewhat in love with a tool called Semgrep. Semgrep is a semantics-aware code search tool; using symbolic variable placeholders and otherwise ordinary code, you can write rules to match pretty much arbitary expressions and control flow.
If you’re an appsec person, where you obviously go with this is: you build a library of Semgrep searches for well-known vulnerability patterns (or, if you’re like us at Fly.io, you work out how to get Semgrep to catch the Rust concurrency footgun of RWLocks inside if-lets).
The reality for most teams though is “ain’t nobody got time for that”.
But I just checked and, unsurprisingly, 4o seems to do reasonably well at generating Semgrep rules? Like: I have no idea if this rule is actually any good. But it looks like a Semgrep rule?
What interests me is this: it seems obvious that we’re going to do more and more “closed-loop” LLM agent code generation stuff. By “closed loop”, I mean that the thingy that generates code is going to get to run the code and watch what happens when it’s interacted with. You’re just a small bit of glue code and a lot of system prompting away from building something like that right now: Chris McCord is building a thingy that generates whole Elixir/Phoenix apps and runs them as Fly Machines. When you deploy these kinds of things, the LLM gets to see the errors when the code is run, and it can just go fix them. It also gets to see errors and exceptions in the logs when you hit a page on the app, and it can just go fix them.
With a bit more system prompting, you can get an LLM to try to generalize out from exceptions it fixes and generate unit test coverage for them.
With a little bit more system prompting, you can probably get an LLM to (1) generate a Semgrep rule for the generalized bug it caught, (2) test the Semgrep rule with a positive/negative control, (3) save the rule, (4) test the whole codebase with Semgrep for that rule, and (5) fix anything it finds that way.
That is a lot more interesting to me than tediously (and probably badly) trying to predict everything that will go wrong in my codebase a priori and Semgrepping for them. Which is to say: Semgrep — which I have always liked — is maybe a lot more interesting now? And tools like it?