About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

The Leverage Factor: Measuring AI-Assisted Engineering Output generated more direct messages than anything else I have published. Some of the feedback was enthusiastic. A significant portion was hostile. "Exaggerated." "False." "No way those numbers are real." Fair enough. I published extraordinary claims with data but without enough context for readers to evaluate the methodology. This article fills that gap. I am going to take specific time records, break them apart, defend the human estimates with engineering detail, and then show that the original leverage calculation actually understates the real multiplier.
The Criticism
The pushback falls into three categories.
"The human estimates are inflated." This is the most common objection. A reader sees "120 human-equivalent hours, completed in 12 minutes" and concludes the estimate must be wrong. Eighty hours for a task? Nobody works 80 hours on a single thing. I will take several of these records apart below and walk through exactly what a senior engineer would need to do, step by step, to produce the same output.
"AI coding tools don't actually help." This objection typically cites the METR study from July 2025, which found that experienced open-source developers using AI tools completed tasks 19% slower than without AI. That study measured something fundamentally different from what I measure. More on that below.
"You're cherry-picking the best results." I publish every single leverage record. Every day. The daily time records include the 6x tasks alongside the 200x tasks. The weighted average across 140+ tasks over the first two weeks of tracking is 45x. That average includes everything: the slow debugging sessions, the I/O-bound batch jobs, the tedious mechanical work. Nothing is filtered out.
Why the METR Study Does Not Apply Here
The METR study is real science and I respect the methodology. Sixteen experienced developers. Randomized controlled trial. 246 real issues. The finding that AI made experienced developers 19% slower on their own repositories deserves attention.
It also measured the exact opposite of what I do.
| Factor | METR Study | My Workflow |
|---|---|---|
| Task type | Bug fixes and maintenance on mature codebases | New features and iterative feature development |
| Codebase age | 10+ years, 1M+ lines of code | Days to months old |
| AI tool | Cursor Pro (autocomplete-style) | Claude Code (agentic, autonomous) |
| Autonomy | Developer drives, AI suggests | AI drives, developer reviews |
| Context | Developer intimately familiar with codebase | AI builds its own understanding from specs |
| Task scope | Individual issues (~2 hours each) | Multi-file systems and full features |
The METR authors explicitly state their findings do not generalize to junior developers, new projects, or unfamiliar codebases. My work sits squarely in the categories they excluded.
What These Numbers Actually Measure
Every leverage record published on this site comes from The Leverage Factor: Measuring AI-Assisted Engineering Output. Here is what the numbers capture and what they do not.
I run multiple Claude Code sessions simultaneously throughout the day. While one session builds a backend API, another writes domain specification documents, and a third drafts an architecture article. The leverage records capture each session independently. They do not capture the parallelism. Five sessions running concurrently for 20 minutes each log as 100 minutes of Claude time, not 20 minutes of wall clock. The actual throughput is higher than what the individual records suggest.
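The arithmetic behind that distinction is simple. A minimal sketch, using the five-session example above:

```python
def logged_vs_wall_clock(sessions: int, minutes_each: float) -> tuple[float, float]:
    """Concurrent sessions log their runtimes independently, so summed
    'Claude time' overstates elapsed wall-clock time by the concurrency factor."""
    logged = sessions * minutes_each  # what the individual leverage records sum to
    wall_clock = minutes_each         # what actually elapsed
    return logged, wall_clock

logged, wall = logged_vs_wall_clock(sessions=5, minutes_each=20)
print(f"{logged:.0f} min logged across {wall:.0f} min of wall clock")
# 100 min logged across 20 min of wall clock
```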
The human estimate is calibrated against my own experience and the experience of engineers I have worked with over a 30-year career. When I estimate that a task would take a senior engineer 40 hours, I mean a competent engineer who already knows the technology stack, already understands the domain, and does not need to ramp up. I am not estimating training time or meeting time or the overhead that organizations layer onto individual work. I am estimating hands-on-keyboard engineering time for someone who knows what they are doing.
Dissecting the Human Estimate
Let me take four specific records from the log and break down exactly what a human engineer would need to do.
Record 1: Production API (120 human-hours, 90 Claude-minutes, 80x)
From the February 25 leverage record.
This task built a complete production API across six phases: session management loop, stub endpoints, specification endpoints, event system, origin integration, and demo migration. The final output included dozens of files, database models, API routes, WebSocket handlers, test coverage, and documentation.
Here is the phase-by-phase human breakdown:
| Phase | What It Involves | Human Time |
|---|---|---|
| Session management | Auth middleware, session store, JWT handling, WebSocket session tracking | 16h |
| Stub endpoints | Route scaffolding, request validation, response schemas, error handling | 8h |
| Specification endpoints | Domain-specific business logic, query builders, pagination, filtering | 24h |
| Event system | WebSocket event routing, pub/sub, state synchronization, reconnection logic | 20h |
| Origin integration | External service client, retry logic, data transformation, caching layer | 24h |
| Demo migration | Port existing demo features, schema migration, data backfill, verify parity | 16h |
| Testing and verification | Unit tests, integration tests, manual verification, bug fixes | 12h |
| Total | | 120h |
A senior engineer writes roughly 200 lines of production-quality code per day (industry average for complex systems, consistent with studies by Steve McConnell and others). This API contained thousands of lines across dozens of files. At 200 lines per day, the code alone takes 15+ working days. Add testing, debugging, and documentation, and three weeks is a conservative estimate.
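The 200-lines-per-day figure turns into a quick back-of-the-envelope calculation. The line count below is an illustrative round number, not the exact figure from the record:

```python
import math

LINES_PER_DAY = 200  # McConnell-style industry average for complex systems

def coding_days(total_lines: int, lines_per_day: int = LINES_PER_DAY) -> int:
    """Working days of pure coding, before testing, debugging, and docs."""
    return math.ceil(total_lines / lines_per_day)

print(coding_days(3000))  # 15 working days of coding alone, roughly three weeks
```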
Claude built it in 90 minutes.
Record 2: Media Export Pipeline (24 human-hours, 12 Claude-minutes, 120x)
From the February 25 leverage record.
This task produced 27 files and 2,481 lines of code with 38 passing tests. The system exports media files from a third-party API to S3, handling pagination, rate limiting, authentication, file format conversion, and error recovery.
The human breakdown:
| Step | What It Involves | Human Time |
|---|---|---|
| API research | Read vendor API docs, understand auth model, pagination scheme, rate limits | 2h |
| Architecture design | Design pipeline: ingestion, transformation, storage, error handling | 2h |
| API client | HTTP client with retry logic, rate limit handling, pagination, auth refresh | 4h |
| Pipeline implementation | Recording discovery, download, format conversion, S3 upload, state tracking | 6h |
| Error recovery | Checkpoint/resume, dead letter queue, alerting, idempotency | 4h |
| Test suite | 38 unit tests covering happy paths, error cases, edge cases | 4h |
| Integration testing | End-to-end verification against sandbox API | 2h |
| Total | | 24h |
Three full working days for a senior engineer. Claude produced the same output, with tests, in 12 minutes.
Record 3: Resume Generator (40 human-hours, 20 Claude-minutes, 120x)
From the February 28 leverage record.
This task built a complete resume generation system: data schemas, a CLI interface, document parsers and importers, multi-source deduplication, LLM-powered content normalization, HTML and DOCX renderers with customizable templates, and a portfolio website generator.
| Step | What It Involves | Human Time |
|---|---|---|
| Schema design | Data models for work history, skills, education, certifications, projects | 4h |
| CLI and parsers | Command-line interface, JSON/YAML/DOCX/HTML import parsers | 8h |
| Deduplication | Multi-signal matching across imported sources, conflict resolution | 6h |
| LLM normalization | Prompt engineering for consistent formatting, skill extraction, title standardization | 6h |
| Renderers | HTML and DOCX output with multiple templates, CSS styling, print layout | 8h |
| Portfolio site | Static site generator for online portfolio with responsive design | 4h |
| Testing | Unit tests across all phases, sample data fixtures, edge case coverage | 4h |
| Total | | 40h |
A full work week. The system touches multiple file format parsers, LLM integration, template rendering, and web generation. Claude completed it in 20 minutes because it could hold the entire data flow from import through normalization to multi-format output in a single context.
Record 4: Interactive Training Application (40 human-hours, 25 Claude-minutes, 96x)
From the February 26 leverage record.
This session built a complete interactive training frontend with seven phases: quiz engine, timed assessment mode, scoring system, a real-time analytics dashboard, progress visualization with animated charts, three additional interaction modes, and a full Playwright E2E test suite. Twenty-four files total, all React with TypeScript.
| Step | What It Involves | Human Time |
|---|---|---|
| Quiz engine | Question rendering, answer validation, feedback display, state management | 6h |
| Assessment mode | Timed sessions, question pooling, scoring, results summary | 4h |
| Scoring system | Rating algorithm, difficulty calibration, score history tracking | 4h |
| Dashboard | Real-time stats display, progress charts, focus area identification | 6h |
| Progress charts | Canvas-based animated charts, responsive layout, theme support | 4h |
| Interaction modes | Card-based review, timed recall, drag-and-drop exercises | 8h |
| E2E tests | 11 Playwright specs, data-testid attributes, error boundary, onboarding flow | 8h |
| Total | | 40h |
A full work week of focused React development. Claude produced all 24 files in 25 minutes because each phase built on the patterns established in previous phases. The component architecture, state management approach, and styling conventions carried forward automatically.
The Real Leverage: Supervisory Time
Here is where the original article's methodology understates the actual multiplier.
The published leverage factor compares Claude's wall-clock time to estimated human engineering time:
Published Leverage = (Human-Equivalent Hours × 60) / Claude Minutes
This answers the question: "How much output did Claude produce relative to what a human could produce in the same elapsed time?" It is a valid measure of output expansion. But it is not a measure of my personal productivity multiplier.
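As a sketch, the published formula applied to the Production API record above:

```python
def published_leverage(human_hours: float, claude_minutes: float) -> float:
    """Published Leverage = (human-equivalent hours * 60) / Claude minutes."""
    return human_hours * 60 / claude_minutes

print(published_leverage(human_hours=120, claude_minutes=90))  # 80.0
```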
The question most readers actually care about is: "How much engineering output do you get for every minute you personally spend?" My time on each task consists of writing the prompt (typically 1-3 minutes), reviewing the output when Claude finishes (typically 2-5 minutes), and occasionally providing a follow-up correction (1-2 minutes). For most tasks, my total involvement is under 10 minutes.
Here is the same dataset recalculated with estimated supervisory time:
| Task | Human Eq. | Claude Time | My Time | Published Factor | Supervisory Factor |
|---|---|---|---|---|---|
| Production API | 120h | 90min | 15min | 80x | 480x |
| Media Export Pipeline | 24h | 12min | 5min | 120x | 288x |
| Resume Generator | 40h | 20min | 5min | 120x | 480x |
| Training Application | 40h | 25min | 8min | 96x | 300x |
| Distinction Engine | 40h | 45min | 10min | 53x | 240x |
| Electron App Phase 1 | 40h | 25min | 8min | 96x | 300x |
| MCP Server + LLM Endpoint | 16h | 12min | 5min | 80x | 192x |
The "Published Factor" column is what appears in the daily leverage records. The "Supervisory Factor" column divides human-equivalent hours by my actual time investment. The supervisory factor is 3-6x higher than the published factor because Claude runs autonomously for the vast majority of each task.
The published factor of 80x for the Production API means Claude produced 80 minutes of human-equivalent work for every minute it ran. The supervisory factor of 480x means I received 480 minutes of human-equivalent output for every minute I personally invested. Both numbers are accurate. They measure different things.
I chose to publish the more conservative number. The leverage factor as defined compares AI output to human output on a capability basis. The supervisory factor measures my personal return on time invested. The latter is the number that matters for anyone evaluating whether to adopt this workflow, but the former is the number I can defend without having to prove how long I spent on each prompt.
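The two factors differ only in the denominator. A short sketch recomputing both columns for four of the records above:

```python
# (human-equivalent hours, Claude minutes, my supervisory minutes)
records = {
    "Production API":        (120, 90, 15),
    "Media Export Pipeline": (24, 12, 5),
    "Resume Generator":      (40, 20, 5),
    "Training Application":  (40, 25, 8),
}

for name, (human_h, claude_min, my_min) in records.items():
    published = human_h * 60 / claude_min   # output vs. Claude's runtime
    supervisory = human_h * 60 / my_min     # output vs. my hands-on time
    print(f"{name}: {published:.0f}x published, {supervisory:.0f}x supervisory")
```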
Why My Numbers Are Higher Than Yours
I hear the same story constantly. "I tried AI and it generated garbage." "It can't do my job." "It doesn't work." Then the person pulls out their phone and shows me the ChatGPT app running on the free tier, or a Claude browser session with no subscription.
The gap between my results and the average experience is not mysterious. It comes from specific, compounding advantages that I have built over nearly a year of daily practice. None of these are secrets. All of them are reproducible. Most people will not bother.
Use the Best Available Models
This is the single biggest factor and the easiest to fix.
The free version of ChatGPT runs a limited model with a 16K context window and restricted reasoning capabilities. The free tier of Claude gives you Sonnet with usage caps. These are not the same tools I use. The difference between the free tier and the best paid tier of any major AI provider is not incremental. It is an order of magnitude in what the model can do competently.
| Tier | Model Quality | Context Window | Reasoning | Monthly Cost |
|---|---|---|---|---|
| ChatGPT Free | GPT-5 Mini (limited) | 16K tokens | Limited | $0 |
| ChatGPT Plus | GPT-5 | 32K tokens | Standard | $20 |
| ChatGPT Pro | GPT-5 Pro | 128K tokens | Maximum | $200 |
| Claude Free | Sonnet (capped) | 200K tokens | Standard | $0 |
| Claude Max | Opus 4.6 | 200K tokens | Maximum | $100-200 |
| Gemini Pro | Gemini 2.5 Pro | 1M tokens | Maximum | $20 |
| Grok | Grok 3 | 128K tokens | Maximum | $30 |
| Genspark.ai | Multi-model graphical | Varies | Varies | $20 |
I personally pay for the maximum usage tier of every major AI provider: Claude Max, ChatGPT Pro, Gemini Pro, Grok, and Genspark.ai for more graphical output. This is not casual experimentation. I invest hundreds of dollars per month in AI subscriptions because the difference between free and paid tiers is the difference between a novelty and a production tool.
I ran all of my Claude Code work last fall on Sonnet 4.5. It was capable but had obvious limitations. When Opus 4.6 became available, I upgraded my subscription to get the highest usage limits so I could count on using it for everything. The difference is not subtle. Opus 4.6 in Claude Code operates like pair programming with a computer science genius who has a strong work ethic, limitless energy, and no ego. It reads code, understands architecture, reasons about trade-offs, writes tests, fixes its own mistakes, and iterates until the work is done. Sonnet 4.5 could do some of this. Opus 4.6 does all of it, reliably, across sessions that run for hours.
Even Sonnet 4.6 improved dramatically over 4.5 because it has access to a 1M token context window (in beta), which allows it to reason over a much larger body of code simultaneously. Context window size is not a vanity metric. It directly determines how much of a codebase the model can hold in its working memory while making changes across multiple files.
If you are evaluating AI coding tools using a free tier or a model older than 3 months, you are testing a bicycle and concluding that motorized transport is overrated.
Keep Documentation Current
When Claude Code starts a task, it reads the project's documentation, code comments, and configuration files. If these are stale, you are feeding it contradictory information from the first keystroke. It reads the README, then reads the source files, and now it has two conflicting accounts of how the system works. It will waste time reconciling them or, worse, build on the wrong assumptions.
AI is excellent at writing documentation. I never write documentation myself. On every commit, Claude updates every README, every docstring, and every configuration comment to reflect the current state of the code. The cost is a few thousand tokens per commit. The return is that every future session starts with accurate context. Burn the tokens. Let it handle this.
Claude also handles all of my git commits. This matters more than it sounds. Claude writes detailed, descriptive commit messages that explain not just what changed but why it changed. The repository history becomes a form of living documentation. When I need to understand why a particular design decision was made three months ago, I point Claude at the git log and it traces back through its own commit messages to find the reasoning. Occasionally it surfaces context I had forgotten entirely. Verbose commit descriptions cost nothing and pay dividends every time you revisit old code.
A Year of Daily Practice
I started coding with AI in the spring of 2025. Nearly a year of daily use, across real projects, for hours every day. There is a meaningful difference between someone who has done something a thousand times and someone who has done it a few times.
I know how to write prompts that give Claude permission to operate autonomously while providing enough constraint to keep it on track. I know which tasks to batch together, when to intervene, and when to let it run. I know the failure modes and how to recover from them. This knowledge comes from volume. I work mornings, evenings, weekends, and holidays. That is my personality and I know it is unusual. What it means for this article is that I work almost every day using AI to build real projects, for hours, and have for the better part of a year. That accumulated experience directly boosts leverage. Every hour of practice compresses the next hour of output.
Stack Efficiency
Every application I build uses the same technology stack: React with TypeScript for the frontend, Python with Flask or FastAPI for the backend, WebSockets for real-time communication, Redis or Valkey for caching, Milvus for vector databases, and PostgreSQL for the database. Everything runs containerized in Docker on my laptop during development.
This consistency compounds. I can point Claude at an existing application and say "copy this architecture, but implement this other functionality" and it has a solid starting point. When I need a change that touches the data layer, the backend API, the frontend components, the unit tests, and the E2E tests, Claude makes all of those changes in a single pass. There is no division of labor where the DBA updates the database, the Python developer updates the backend, the React developer updates the frontend, and QA updates the test automation. A change that would require coordination across four specialists and three meetings happens atomically in one Claude session.
That division of labor across specialists is precisely what organizations need to move past. The traditional model of siloed expertise with handoffs and coordination overhead exists because no single human can maintain expert-level proficiency across an entire stack. AI eliminates that constraint. One person with AI can operate across the full stack with consistency that a four-person team struggles to achieve, because the AI applies the same architectural patterns and coding conventions from database to browser without the translation loss that happens at team boundaries.
I Am the Product Owner
This is a major advantage with personal projects that most people overlook. I know exactly what I want to build. I do not need to schedule a meeting, conduct a customer study, get stakeholder approval, or wait for a product manager to prioritize the backlog. The first step for any new feature is a collaborative requirements session with Claude where we hash out the specification together. Then Claude builds it, and I have something to evaluate in minutes. If it is wrong, I throw it out. If it is close, I refine it. I can iterate from concept to finished feature faster than most teams can schedule the kickoff meeting.
This advantage is specific to personal projects and small teams. In an enterprise setting, the requirement-gathering overhead is real and necessary. But for solo builders, AI eliminates the execution bottleneck entirely, leaving only the creative decision of what to build next.
Greenfield Work Dominates
A large portion of the work in my leverage records is creating things from scratch, not maintaining existing applications. Greenfield work produces the highest leverage because Claude can build without navigating legacy constraints, implicit dependencies, or institutional knowledge embedded in years of accumulated code.
When I start a greenfield project, Claude stays busy for extended periods, building entire subsystems with tests and documentation. The leverage factors for this kind of work routinely exceed 80x. If I were only asking Claude to implement one function at a time in an existing codebase, the numbers would be dramatically lower.
On existing applications, I work in features, not fragments. I do not prompt Claude to build one small piece, then prompt again for the next piece. I describe the entire feature I want, ask for a design, review the design for gaps, and then let it execute the full implementation. Batch scope drives leverage.
Automation Compounds
Over months of daily Claude Code use, I have built a deep layer of automation that accelerates every session. Most people know about the project-level CLAUDE.md file that governs a single codebase. Fewer people realize there is a global CLAUDE.md that governs all projects. That is where I put cross-project automations, preferences, and workflows.
I never edit these files manually. I say "take a note about how we just handled that so we remember it globally next time" and Claude adds the entry. Every future session in every project picks up that knowledge automatically.
Examples: Claude speaks task completions aloud through ElevenLabs voice synthesis. Every commit triggers an automatic documentation update pass. Every new article runs through an AI detection scan before deployment. Every session announces its progress in a dedicated Slack thread. Each of these automations was set up once, in a single prompt, and runs automatically in every subsequent session. The compound effect of dozens of these small automations is substantial. My sessions are more autonomous, more consistent, and more productive than they were six months ago because the automation layer keeps growing.
Belief
This is the factor nobody talks about because it sounds unscientific. But it is the most important one.
I choose to believe that AI can get the job done until it proves otherwise. I know that Claude is a more capable coder, architect, data analyst, and infrastructure engineer than I will ever be at the implementation level. My job is to understand how to direct that capability toward building the right things. "Ask and you shall receive" applies directly. The prompts that produce 200x leverage are the ones that ask for something ambitious and give Claude the space to deliver.
When it goes wrong, I try again. When the second attempt goes wrong, I try a different angle. When it figures out the right approach, I have it record the solution so the next session benefits from the lesson. Most people give up after the first failure and conclude the technology does not work. Most people also never attempt using AI for anything beyond simple chat questions because they do not believe it can handle real engineering work.
I understand why. A lot of engineers are not just skeptical; they are scared. If they acknowledge that AI can do the implementation work of a senior engineer, the logical next question is whether their own job is secure. That fear is rational. Jobs will change. Some will disappear. But the engineers who will remain valuable are the ones who learn to maximize their leverage: the ones who can direct AI to build the right things, catch the mistakes, and iterate faster than anyone working without it. Resisting the technology does not protect your career. Learning to wield it does.
It is also genuinely difficult to put faith in something that operates as a complete black box. Nobody using Claude Code actually understands how it produces what it produces. Neither do the people who built it. As the New Yorker reported in "What Is Claude? Anthropic Doesn't Know, Either", Anthropic's own researchers describe large language models as black boxes: "We don't really understand how they work." The company has an entire interpretability team dedicated to examining Claude's neurons and running it through psychology experiments, and they still cannot fully explain its behavior. I cannot explain why it writes correct WebSocket reconnection logic or why it knows the right way to structure a Redux store. I just know that it does, consistently, and that the output holds up under testing. Trusting a tool that nobody on the planet can fully explain requires a different kind of confidence than trusting your own hands-on-keyboard skills. That transition is uncomfortable. It is also necessary.
Last spring, that skepticism was often justified. Models were weaker. Context windows were shorter. Agentic tooling barely existed. The landscape in early 2026 is categorically different. The capabilities are there. The question is whether you believe enough to push through the learning curve and find out.
Key Takeaways
The leverage factor numbers are real. They are also the conservative version of the story. The supervisory factor, measuring output against my actual time investment rather than Claude's runtime, runs 2.4x to 6x higher than what I publish.
The METR study measured something valid: autocomplete-style AI tools slow down experienced developers working on mature codebases they already know. My workflow is the inverse: agentic AI building greenfield systems from specifications. Different tool, different task type, different result.
The gap between my numbers and the average developer's experience comes from compounding advantages: best-tier models, current documentation, a year of daily practice, stack consistency, product ownership, greenfield scope, accumulated automation, and the willingness to push the tools further than most people attempt.
None of these advantages are proprietary. Every one of them is available to anyone willing to invest the time and the subscription fees. The technology is ready. The question is whether you are.
This experiment is still early. I have been tracking leverage records daily for only a few weeks. New data accrues every day. I will report again after a month of solid data, with enough volume to show trends, outliers, and how the numbers evolve as I push the workflow into new problem domains.
Additional Resources
- The Leverage Factor: Measuring AI-Assisted Engineering Output for the original methodology and dataset
- Daily Leverage Records published on this site
- Overlooked Productivity Boosts with Claude Code for practical Claude Code productivity techniques
- Giving Claude Code a Voice with ElevenLabs for the voice automation workflow
- METR Study: Measuring the Impact of Early AI Assistance for the counterargument
- Faros AI: The AI Productivity Paradox for enterprise-level data
- Anthropic: How AI Is Transforming Work for Anthropic's internal productivity data
- MIT Sloan: The Hidden Costs of Coding with Generative AI for the comprehension-performance gap
- The New Yorker: What Is Claude? Anthropic Doesn't Know, Either on the black box nature of LLMs
- Single Serving Applications - The Clones for examples of high-leverage application generation
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.