How Codex is built
👋 Hi, this is Gergely with a subscriber-only issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and startups through the lens of engineering managers and senior engineers. If you’ve been forwarded this email, you can subscribe here.

A deepdive into how OpenAI’s Codex team builds its coding agent, how engineers use it, and what it could mean for the future of software engineering. Exclusive.

More than a million developers use OpenAI’s command-line coding tool, Codex, every week, and usage has increased 5x since the start of January, the team tells me. In the first week of February, OpenAI launched the Codex desktop app, a macOS application that CEO Sam Altman calls “the most loved internal product we’ve ever had”. A few days later, OpenAI shipped GPT-5.3-Codex, which they describe as the first model that helped create itself.

Personally, I’ve been warming up to Codex since interviewing Peter Steinberger, the creator of OpenClaw, for The Pragmatic Engineer podcast, in which he revealed that he writes all of OpenClaw with Codex, preferring longer-running agentic loops. Update: on Monday, Peter announced he is joining OpenAI to work on building next-generation agents. It’s a major win for OpenAI and the Codex team, while OpenClaw remains independent and open source. Check out my podcast with Peter: his first in-depth interview, recorded around the time OpenClaw (back then: Clawd) was gaining massive momentum.

To find out how Codex was built, how teams at OpenAI use it, and what effect it’s having on software engineering practices at the ChatGPT maker, I spoke with three people at OpenAI:
This deep dive covers:
Last week’s debut Pragmatic Summit, in San Francisco, featured a fireside chat between Tibo, myself, and the audience, which surfaced new details about how Codex is built. Paid subscribers can watch this recording now. Free subscribers will get access to all videos from the Pragmatic Summit in a couple of weeks.
Longtime readers might recall the deepdive How Claude Code is built, based on interviews with the founding engineers of Claude Code. Some comparisons with today’s topic are obvious: Codex and Claude Code have each made bets that seem to be paying off. I was initially skeptical when I talked with the Codex team last October, because the cloud-first, long-running task approach didn’t click with me. But I’ve since changed my mind.

1. How it started

In 2024, OpenAI was experimenting with various approaches for building a software agent. That fall, the company declared that building an aSWE (Autonomous Software Engineer) would be a top-line goal for 2025. This vision came from the top: Greg Brockman and Sam Altman believed they should have an autonomous software engineer working alongside teams. Tibo describes the thinking:
A number of folks who’d worked on earlier prototypes were pulled into the effort, which included:
OpenAI had two teams tackle different segments of the problem space: Codex Web would focus on an async, cloud-based solution, while Codex CLI targeted iterative, local development. Both products launched in the spring: Codex CLI was announced in April 2025, and Codex in ChatGPT was introduced in May.

2. Technology and architecture choices

An obvious difference between Codex and Claude Code is the programming language. Claude Code is written in TypeScript, which is “on distribution” – meaning the underlying model has seen plenty of it in training, so generating it plays to the model’s strengths. Meanwhile, the Codex CLI is written in Rust. Tibo explains why:
There was also a practical concern about dependencies. Choosing TypeScript means using the npm package manager, and npm often means building on top of packages that may not be fully understood – which could clearly be problematic. By going with Rust, the team has very few dependencies and can thoroughly audit each one. They also want to eventually run the Codex agent in all sorts of environments – not just laptops and data centers, but even places like embedded systems. Rust makes this more achievable from a performance perspective than TypeScript or Go. Tibo tells me that while the model initially generated less impressive Rust than TypeScript, the team expected it to catch up. Plus, choosing Rust gave them one more engineering challenge to work with.

The Codex team also hired the maintainer of Ratatui – the Rust library for building terminal user interfaces (TUIs). He’s now full-time on the Codex team, doing open source work. The core agent and CLI are fully open source on GitHub.

How Codex works

The core loop is a state machine, and the agent loop is the core logic in the Codex CLI. This loop orchestrates the interaction between the user, the model, and the tools the model uses. This “agent loop” is something every AI agent uses, not just Codex, and below is how Codex implements it, at a high level:
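To make the shape of that loop concrete, here is a minimal, self-contained sketch of a generic agent loop in Rust. It is illustrative only – the model call and tool execution are stubbed, and none of these names come from the actual Codex codebase:

```rust
// A minimal sketch of an agent loop as a state machine.
// Illustrative only: call_model and run_tool are stubs, not
// the real Codex implementation.

enum ModelOutput {
    // The model wants a tool run (e.g. a shell command).
    ToolCall { name: String, args: String },
    // The model considers the task complete.
    FinalAnswer(String),
}

struct Conversation {
    history: Vec<String>, // user turns, model turns, tool results
}

// Stand-in for a real model API call (hypothetical).
fn call_model(conv: &Conversation) -> ModelOutput {
    // Pretend the model asks for one tool run, then finishes.
    if conv.history.iter().any(|m| m.starts_with("tool:")) {
        ModelOutput::FinalAnswer("Done: the repo contains Cargo.toml and src/.".to_string())
    } else {
        ModelOutput::ToolCall { name: "shell".to_string(), args: "ls".to_string() }
    }
}

// Stand-in for sandboxed tool execution (hypothetical).
fn run_tool(name: &str, args: &str) -> String {
    format!("tool:{name}({args}) -> Cargo.toml src/")
}

fn agent_loop(user_prompt: &str) -> String {
    let mut conv = Conversation {
        history: vec![format!("user:{user_prompt}")],
    };
    loop {
        match call_model(&conv) {
            ModelOutput::ToolCall { name, args } => {
                // Execute the tool, append the result to the history,
                // and hand control back to the model for another turn.
                let result = run_tool(&name, &args);
                conv.history.push(result);
            }
            ModelOutput::FinalAnswer(text) => return text,
        }
    }
}

fn main() {
    println!("{}", agent_loop("list the files in this repo"));
}
```

A production loop layers streaming output, user approvals, and sandboxed execution on top of this skeleton, but the state machine – model turn, tool turn, repeat until a final answer – is the core shape.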
Compaction is an important technique for running agents efficiently. As conversations grow lengthy, the context window fills up. Codex uses a compaction strategy: once the conversation exceeds a certain token count, it calls a special Responses API endpoint, which generates a smaller representation of the conversation history. This smaller version replaces the old input and avoids quadratic inference costs. We covered how self-attention scales quadratically in our 2024 ChatGPT deepdive.

Safety is an important consideration, because LLMs are nondeterministic. Codex runs in a sandbox environment that restricts network access and filesystem access by default. Tibo reflects on this choice:
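Returning to compaction for a moment, below is a minimal sketch of the token-threshold idea. Everything in it is hypothetical: a crude tokenizer stand-in, an illustrative threshold, and a stubbed summarize() call in place of the real Responses API endpoint:

```rust
// Minimal sketch of token-threshold compaction. All names are
// hypothetical; the real implementation calls a dedicated
// Responses API endpoint instead of the summarize() stub below.

struct Turn {
    role: &'static str, // "user", "assistant", "tool", "summary"
    text: String,
}

// Crude stand-in for a real tokenizer (~4 chars per token).
fn count_tokens(turns: &[Turn]) -> usize {
    turns.iter().map(|t| t.text.len() / 4).sum()
}

// Stand-in for the server-side call that produces a smaller
// representation of older history.
fn summarize(turns: &[Turn]) -> Turn {
    Turn {
        role: "summary",
        text: format!("[compacted {} earlier turns]", turns.len()),
    }
}

// Illustrative threshold; the real limit depends on the model's context window.
const COMPACTION_THRESHOLD: usize = 200_000;

fn maybe_compact(history: Vec<Turn>) -> Vec<Turn> {
    if count_tokens(&history) < COMPACTION_THRESHOLD {
        return history; // still fits comfortably, keep as-is
    }
    // Keep the most recent turns verbatim and replace everything
    // older with one compact summary turn, so the input stops growing.
    let keep_from = history.len().saturating_sub(4);
    let summary = summarize(&history[..keep_from]);
    let mut compacted = vec![summary];
    compacted.extend(history.into_iter().skip(keep_from));
    compacted
}

fn main() {
    let history: Vec<Turn> = (0..100)
        .map(|i| Turn { role: "assistant", text: format!("turn {i}: {}", "x".repeat(10_000)) })
        .collect();
    let compacted = maybe_compact(history);
    println!("{} turns after compaction, first role: {}", compacted.len(), compacted[0].role);
}
```

The trade-off is a one-off summarization call in exchange for every subsequent model call running on a much shorter input.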
There are several releases per week. Internally, the team ships a new version of Codex up to three or four times a day. Externally, new releases are cut every few days and are distributed via package managers such as Homebrew and npm. Michael Bolin’s recent blog post, “Unrolling the Codex Agent Loop”, lays out the internals of how the agent loop works.

3. How Codex builds itself

More than ninety percent of the Codex app’s code was generated by Codex itself, the team estimates. That happens to be roughly in line with what Anthropic has reported for Claude Code, according to what its creator Boris Cherny told me. Both AI labs share the same meta-circularity: the coding tools are used to write their own code. Tibo tells me that a typical engineer on the Codex team runs between four and eight parallel agents, which handle any one of a number of tasks:
Codex engineers are now “agent managers”, no longer just code writers. Tibo says it’s common for an engineer to walk into the office with several tabs open on their laptop: a code review running in one, a feature being implemented in another, a security audit in a third, and a codebase summary being generated in a fourth. He says:
Frequently-used “skills”

“Agent Skills” are ways to extend Codex with task-specific capabilities – pretty much the same concept as Claude Code’s skills. Internally, the Codex team has built 100+ skills to share and choose from. Three interesting examples:
The way Tibo thinks about skills is that they help steer the model toward more specific behaviors, and they can also be combined. Skills are continuously published internally, and team members copy them from each other.

Tiered code review

The team has set up AI code review, and it always runs. They trained a bespoke model for code review, optimizing it for signal over noise. Around nine out of 10 comments point out valid issues, says Tibo, which is equal to or slightly better than human reviewers. AI reviews run automatically whenever a pull request moves from “draft” to “in-review” state, triggered via a GitHub webhook. Following an AI review, there are two possible next steps:
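As a side note on plumbing, the trigger for these reviews is easy to sketch. Below is a hypothetical check – not the team’s actual code, and it assumes the serde_json crate – for deciding when a GitHub webhook delivery should kick off a review. GitHub sends a pull_request event with the action "ready_for_review" when a draft PR is marked ready:

```rust
// Hypothetical webhook trigger logic for "run AI review when a PR
// leaves draft state". Not the Codex team's implementation.
// Depends on the serde_json crate.

use serde_json::Value;

// GitHub delivers a `pull_request` event with action
// "ready_for_review" when a draft PR is marked ready.
fn should_trigger_review(event_name: &str, payload: &Value) -> bool {
    event_name == "pull_request" && payload["action"] == "ready_for_review"
}

fn main() {
    let payload: Value = serde_json::from_str(
        r#"{ "action": "ready_for_review", "number": 42 }"#,
    )
    .expect("valid JSON");

    if should_trigger_review("pull_request", &payload) {
        // Here a real system would enqueue a review job for the
        // bespoke code-review model.
        println!("queue AI review for PR #{}", payload["number"]);
    }
}
```

In a real deployment, this check would sit behind an HTTP endpoint that also verifies GitHub’s webhook signature before enqueueing the review job; the interesting engineering lives in the review model itself, not the trigger.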
Other engineering practices on the team

Here are more practices that Tibo rates as helpful for the Codex team:
Using Codex to debug Codex

The Codex team holds meetings to discuss Codex, during which it’s common for engineers to fire off a thread within Codex to see if it can come back with some information. This January, something interesting began happening, Tibo told me:
This feels like another meta-circularity: Codex debugging itself – or at least the systems that power it!

4. Research

Codex is built by researchers as well as software engineers, and one of them is SQ Mah. He moved into research from software engineering by competing in the Vesuvius Challenge: a contest to read millennia-old carbonized scrolls from the ancient Roman town of Herculaneum, buried by the cataclysmic eruption of Mount Vesuvius in AD 79. SQ finished second in the contest by renting GPUs on Google Cloud, training models, and turning research ideas into useful algorithms. So, what is “research” like at OpenAI? SQ’s take: