Building With Agents
Personal infrastructure for running multi-agent development workflows — claude-docs, worknotes, timesheet-entry, and what I've observed running Claude Code and Codex CLI in parallel.
Why I run multiple agents
I run multiple frontier coding agents in parallel — Claude Code and Codex CLI as the primary drivers, with Claude.ai (web), the Codex app, and Cursor in rotation. Currently about 70/30 Claude/Codex. The intent isn't to pick a winner; it's to stay calibrated on what each tool is actually good at and to avoid lock-in as model quality shifts.
What each tool actually does in my stack
Claude Code CLI is the primary driver. Most real work happens here — multi-file edits, plan-then-execute flows, custom skills, the development surface where I spend the bulk of my agent time.
Claude.ai (web) is the long-form-thinking and planning surface. When I'm away from a repo — on mobile, in a coffee shop, or in early scoping conversations before there's any code to point at — this is where I work.
Codex CLI is in rotation for variety and calibration. The kinds of tasks I send it are the same kinds I send Claude Code — explaining unfamiliar codebases, adding tests, shipping small features. The point is parallel evaluation, not specialization.
Codex app I'm honestly still figuring out. It's an alternate surface on the OpenAI side and I haven't yet developed a strong sense of where it slots in relative to Codex CLI. That's a real gap; I'd rather name it than paper over it.
Cursor IDE is the IDE-resident layer. For the moments where staying in the editor beats opening a CLI — quick refactors, exploratory reading where inline completions help — Cursor is the right surface.
One observation from three weeks of parallel use
In three weeks of running both Claude Code and Codex CLI on equivalent work — explaining unfamiliar codebases, adding tests, shipping small features across three projects — the most consistent difference I've seen is in definition-of-done discipline. Claude reliably runs lint and tests as part of completing a task; it treats CI-passing as part of the work, not as the next thing to do. Codex tends to declare success earlier, which has bitten me in CI more than once — the diff looks reasonable, the task is marked complete, but the lint hadn't been run and the merge failed.
I'm not ready to publicly stake a position on which tool is better overall — I'm 70/30 Claude/Codex, I haven't run matched-prompt experiments, and three weeks isn't long enough to draw a sharp line. But the validation-discipline gap is the one specific differentiation I'd flag to anyone weighing them.
The workflow I'm moving toward, but haven't fully practiced yet, is cross-tool review: use one agent to review the other's recent commits before merge. Treat the second agent as a second opinion on the first one's work. It's the obvious extension of the parallel-evaluation practice, and the right way to convert the diversity of agents into actual quality lift rather than just diversity of failure modes.
The artifacts I've built
Three small pieces of personal infrastructure that solve adjacent problems in running agentic development across multiple projects.
claude-docs
Problem: Agent context decays between sessions and across context resets. When I'm running multiple projects in parallel and some stretch across months, "what did we decide three sessions ago" matters — to me, and to any agent picking up the work cold. Memory is the gap.
Design: A per-project archive layout. Each project gets a directory. Each story or feature gets its own subdirectory with a session-state.md carrying a six-field front-matter (status, project, story, title, branch, last_touched) and a canonical body shape (current status, files changed, verification done, suggested commit message, open items). An index script crawls the archive and rebuilds top-level and per-scope discoverability indexes.
The convention is small but load-bearing — agents and humans can resume cold off any story's session-state.md because the structure is the same everywhere. The layout is the data shape; an authored skill drives the lifecycle on top of it — start scaffolds a new story, resume briefs an agent cold by auto-detecting the right story from branch and working directory, update appends a session block, and ship flips status and rebuilds the indexes.
Where to look: github.com/arshrangi/building-with-agents → claude-docs/ for the layout spec, templates, and index builder, and skills/claude-docs/ for the orchestration skill.
worknotes
Problem: I was spending close to 10 minutes a day reconstructing "what did I actually do today" for personal memory, end-of-year SR&ED interviews, and PM-facing standups. Working without a durable log left me re-doing the mental work every time.
Design: A Claude skill that runs against a session's real work — files touched, decisions made, verification done — and produces a structured markdown work-note in one invocation. The output is durable; I can grep my work-notes archive months later and recover specifics I'd otherwise forget.
Where to look: the same repo → skills/worknotes/SKILL.md.
timesheet-entry
Problem: Same session source material, very different audiences. SR&ED documentation and PM standups don't want the technical log — they want a one-line headline and a few plain-language bullets. Re-translating manually after worknotes was annoying enough that I built a second skill.
Design: Companion to worknotes. Same input session, produces a one-line headline plus two-to-four bullets in plain-language voice — written for someone who doesn't need the file paths and just wants to know what got done. Same input, different audience, zero extra thinking.
Where to look: same repo → skills/timesheet-entry/SKILL.md.
A real session, briefly
Here's what one session looks like in the claude-docs archive. This is excerpted from the actual session-state.md for the dot-me portfolio's own mobile-performance pass, 2026-05-17. The shape — headline, status, files changed, verification, suggested commit message, open items — is the same across every project in the archive. That consistency is the point.
2026-05-17 — Mobile Lighthouse perf pass: self-host Roboto Mono + About image to WebP
Started attacking the mobile Lighthouse score in response to evaluating the Yelp/RepairPal "Full Stack Engineer" posting (the JD explicitly screens for "SEO-optimized, high-performance pages"). Fresh baseline: Lighthouse mobile against the live site now scores Perf 79, FCP 3.1s, LCP 4.0s, TBT 20ms, CLS 0.09 — significantly better than the 15-day-old session-state numbers (Perf 67, LCP 8.4s).
Lighthouse identified the LCP element as the hero subtitle paragraph, NOT the H1 or the About image. Phase breakdown: TTFB 587ms (15%), Render Delay 3,428ms (85%) — the entire bottleneck is render-delay, with two render-blocking resources accounting for it: the external Google Fonts stylesheet (976ms wasted) and the ScrollReveal CSS (290ms wasted).
Font self-hosting. Installed
@fontsource/roboto-monoand imported the four weights (400/500/600/700) at the top ofassets/css/tailwind.cssbefore the Tailwind layers. Removed the external<link rel="stylesheet">to fonts.googleapis.com plus its twopreconnectlinks fromnuxt.config.ts. Confirmed Inter and JetBrains Mono are completely unreferenced anywhere in the codebase, so they're safe to drop entirely rather than just trim.Image optimization. Source About-section JPEG is 542KB at 896×1200. Converted to WebP at 1120×1500 with
magick … -quality 82, output 49KB — an 11× reduction. Updated the consuming component with explicit width/height for CLS prevention,loading="lazy", anddecoding="async".
Every session block in the archive looks like this — a headline, two-to-three substantive paragraphs of what got done and why, a structured tail. The skills that produce work-notes and timesheet entries work off the same source material, just rendered for different audiences.
Where this is going
This is a three-week-old practice, not a polished system. The cross-tool review pattern is intent, not yet running. Matched-prompt experiments between Claude Code and Codex CLI haven't been done; my observations are working impressions, not data. The artifacts themselves are being refined as I use them — the claude-docs layout has had three small revisions in the last month as new edge cases surfaced.
What's stable is the parallel-evaluation discipline and the three small pieces of infrastructure that make running multi-project agentic work bearable. If you're hiring for an engineer who can ship in this kind of workflow — multi-agent, archive-aware, validation-disciplined — let's talk.