From a record-setting benchmark to your next AI teammate: how Genie became our most productive engineer.
Genie can fix bugs, build features, refactor code, and handle everything in between, either fully autonomously or paired with you. It is like working with a colleague, not just a copilot.
We took everything we learned from the dogfooding process and applied it to building Genie 2. We focused on making the model more reliable and the product more intuitive. We also incorporated user feedback to improve the overall experience.
In this post-IDE era, Genie lets you assign any ticket — or even your entire backlog — and works fully asynchronously. You can return later to review, make small tweaks, and merge, all without opening your IDE. It writes unit tests, runs your CI, and when you want to collaborate in real time, Genie takes the lead while you copilot.
Genie is completely headless, but even the best code sometimes needs quick, on-the-spot adjustments. In the past, when a Genie-generated pull request was 99% right, you could still feel stuck. That is why you now have access to the same editor that Genie uses, so you can lean over its shoulder and make a change yourself instead of having to prompt it. It’s the closest experience we’ve had to working with a truly asynchronous teammate. We’re not claiming Genie can do everything for everyone. But it has accelerated our own team’s velocity, and we know it can do the same for yours.
When we started using Genie 1 in real workflows, it became clear that several of the assumptions behind how we had built the model and product were off. Our benchmark scores were strong, but they didn't reflect many of the challenges that show up in real-world use.
We learned the hard way that vibe coding doesn't scale, especially when working autonomously across large, existing codebases. So our approach had to become more nuanced.
By this time, Genie 1 had already become our most productive developer internally.
Last summer, we made real progress.
We achieved the biggest score jump in SWE-Bench history and figured out how to generate billions of tokens of synthetic data that mimicked human reasoning, long before reasoning models were making headlines.
We were onto something big, but we celebrated too early.