About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

The Leverage Factor: Measuring AI-Assisted Engineering Output generated more direct messages than anything else I have published. Some of the feedback was enthusiastic. A significant portion was hostile. "Exaggerated." "False." "No way those numbers are real." Fair enough. I published extraordinary claims with data but without enough context for readers to evaluate the methodology. This article fills that gap. I am going to take specific time records, break them apart, defend the human estimates with engineering detail, and then show that the original leverage calculation actually understates the real multiplier.
The Criticism
The pushback falls into three categories.
"The human estimates are inflated." This is the most common objection. A reader sees "120 human-equivalent hours, completed in 12 minutes" and concludes the estimate must be wrong. Eighty hours for a task? Nobody works 80 hours on a single thing. I will take several of these records apart below and walk through exactly what a senior engineer would need to do, step by step, to produce the same output.
"AI coding tools don't actually help." This objection typically cites the METR study from July 2025, which found that experienced open-source developers using AI tools completed tasks 19% slower than without AI. That study measured something fundamentally different from what I measure. More on that below.
"You're cherry-picking the best results." I publish every single leverage record. Every day. The daily time records include the 6x tasks alongside the 200x tasks. The weighted average across 140+ tasks over the first two weeks of tracking is 45x. That average includes everything: the slow debugging sessions, the I/O-bound batch jobs, the tedious mechanical work. Nothing is filtered out.
Why the METR Study Does Not Apply Here
The METR study is real science and I respect the methodology. Sixteen experienced developers. Randomized controlled trial. 246 real issues. The finding that AI made experienced developers 19% slower on their own repositories deserves attention.
It also measured the exact opposite of what I do.
| Factor | METR Study | My Workflow |
|---|---|---|
| Task type | Bug fixes and maintenance on mature codebases | New features and iterative feature development |
| Codebase age | 10+ years, 1M+ lines of code | Days to months old |
| AI tool | Cursor Pro (autocomplete-style) | Claude Code (agentic, autonomous) |
| Autonomy | Developer drives, AI suggests | AI drives, developer reviews |
| Context | Developer intimately familiar with codebase | AI builds its own understanding from specs |
| Task scope | Individual issues (~2 hours each) | Multi-file systems and full features |
The METR authors explicitly state their findings do not generalize to junior developers, new projects, or unfamiliar codebases. My work sits squarely in the categories they excluded.
What These Numbers Actually Measure
Every leverage record published on this site comes from The Leverage Factor: Measuring AI-Assisted Engineering Output. Here is what the numbers capture and what they do not.
I run multiple Claude Code sessions simultaneously throughout the day. While one session builds a backend API, another writes domain specification documents, and a third drafts an architecture article. The leverage records capture each session independently. They do not capture the parallelism. Five sessions running concurrently for 20 minutes each log as 100 minutes of Claude time, not 20 minutes of wall clock. The actual throughput is higher than what the individual records suggest.
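The arithmetic behind that distinction is simple. A minimal sketch, using the five-session example above:

```python
def logged_vs_wall_clock(sessions: int, minutes_each: float) -> tuple[float, float]:
    """Concurrent sessions log their runtimes independently, so summed
    'Claude time' overstates elapsed wall-clock time by the concurrency factor."""
    logged = sessions * minutes_each  # what the individual leverage records sum to
    wall_clock = minutes_each         # what actually elapsed
    return logged, wall_clock

logged, wall = logged_vs_wall_clock(sessions=5, minutes_each=20)
print(f"{logged:.0f} min logged across {wall:.0f} min of wall clock")
# 100 min logged across 20 min of wall clock
```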
The human estimate is calibrated against my own experience and the experience of engineers I have worked with over a 30-year career. When I estimate that a task would take a senior engineer 40 hours, I mean a competent engineer who already knows the technology stack, already understands the domain, and does not need to ramp up. I am not estimating training time or meeting time or the overhead that organizations layer onto individual work. I am estimating hands-on-keyboard engineering time for someone who knows what they are doing.
Dissecting the Human Estimate
Let me take four specific records from the log and break down exactly what a human engineer would need to do.
Record 1: Production API (120 human-hours, 90 Claude-minutes, 80x)
From the February 25 leverage record.
This task built a complete production API across six phases: session management loop, stub endpoints, specification endpoints, event system, origin integration, and demo migration. The final output included dozens of files, database models, API routes, WebSocket handlers, test coverage, and documentation.
Here is the phase-by-phase human breakdown:
| Phase | What It Involves | Human Time |
|---|---|---|
| Session management | Auth middleware, session store, JWT handling, WebSocket session tracking | 16h |
| Stub endpoints | Route scaffolding, request validation, response schemas, error handling | 8h |
| Specification endpoints | Domain-specific business logic, query builders, pagination, filtering | 24h |
| Event system | WebSocket event routing, pub/sub, state synchronization, reconnection logic | 20h |
| Origin integration | External service client, retry logic, data transformation, caching layer | 24h |
| Demo migration | Port existing demo features, schema migration, data backfill, verify parity | 16h |
| Testing and verification | Unit tests, integration tests, manual verification, bug fixes | 12h |
| Total | | 120h |
A senior engineer writes roughly 200 lines of production-quality code per day (industry average for complex systems, consistent with studies by Steve McConnell and others). This API contained thousands of lines across dozens of files. At 200 lines per day, the code alone takes 15+ working days. Add testing, debugging, and documentation, and three weeks is a conservative estimate.
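The 200-lines-per-day figure turns into a quick back-of-the-envelope calculation. The line count below is an illustrative round number, not the exact figure from the record:

```python
import math

LINES_PER_DAY = 200  # McConnell-style industry average for complex systems

def coding_days(total_lines: int, lines_per_day: int = LINES_PER_DAY) -> int:
    """Working days of pure coding, before testing, debugging, and docs."""
    return math.ceil(total_lines / lines_per_day)

print(coding_days(3000))  # 15 working days of coding alone, roughly three weeks
```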
Claude built it in 90 minutes.
Record 2: Media Export Pipeline (24 human-hours, 12 Claude-minutes, 120x)
From the February 25 leverage record.
This task produced 27 files and 2,481 lines of code with 38 passing tests. The system exports media files from a third-party API to S3, handling pagination, rate limiting, authentication, file format conversion, and error recovery.
The human breakdown:
| Step | What It Involves | Human Time |
|---|---|---|
| API research | Read vendor API docs, understand auth model, pagination scheme, rate limits | 2h |
| Architecture design | Design pipeline: ingestion, transformation, storage, error handling | 2h |
| API client | HTTP client with retry logic, rate limit handling, pagination, auth refresh | 4h |
| Pipeline implementation | Recording discovery, download, format conversion, S3 upload, state tracking | 6h |
| Error recovery | Checkpoint/resume, dead letter queue, alerting, idempotency | 4h |
| Test suite | 38 unit tests covering happy paths, error cases, edge cases | 4h |
| Integration testing | End-to-end verification against sandbox API | 2h |
| Total | | 24h |
Three full working days for a senior engineer. Claude produced the same output, with tests, in 12 minutes.
Record 3: Resume Generator (40 human-hours, 20 Claude-minutes, 120x)
From the February 28 leverage record.
This task built a complete resume generation system: data schemas, a CLI interface, document parsers and importers, multi-source deduplication, LLM-powered content normalization, HTML and DOCX renderers with customizable templates, and a portfolio website generator.
| Step | What It Involves | Human Time |
|---|---|---|
| Schema design | Data models for work history, skills, education, certifications, projects | 4h |
| CLI and parsers | Command-line interface, JSON/YAML/DOCX/HTML import parsers | 8h |
| Deduplication | Multi-signal matching across imported sources, conflict resolution | 6h |
| LLM normalization | Prompt engineering for consistent formatting, skill extraction, title standardization | 6h |
| Renderers | HTML and DOCX output with multiple templates, CSS styling, print layout | 8h |
| Portfolio site | Static site generator for online portfolio with responsive design | 4h |
| Testing | Unit tests across all phases, sample data fixtures, edge case coverage | 4h |
| Total | | 40h |
A full work week. The system touches multiple file format parsers, LLM integration, template rendering, and web generation. Claude completed it in 20 minutes because it could hold the entire data flow from import through normalization to multi-format output in a single context.
Record 4: Interactive Training Application (40 human-hours, 25 Claude-minutes, 96x)
From the February 26 leverage record.
This session built a complete interactive training frontend with seven phases: quiz engine, timed assessment mode, scoring system, a real-time analytics dashboard, progress visualization with animated charts, three additional interaction modes, and a full Playwright E2E test suite. Twenty-four files total, all React with TypeScript.
| Step | What It Involves | Human Time |
|---|---|---|
| Quiz engine | Question rendering, answer validation, feedback display, state management | 6h |
| Assessment mode | Timed sessions, question pooling, scoring, results summary | 4h |
| Scoring system | Rating algorithm, difficulty calibration, score history tracking | 4h |
| Dashboard | Real-time stats display, progress charts, focus area identification | 6h |
| Progress charts | Canvas-based animated charts, responsive layout, theme support | 4h |
| Interaction modes | Card-based review, timed recall, drag-and-drop exercises | 8h |
| E2E tests | 11 Playwright specs, data-testid attributes, error boundary, onboarding flow | 8h |
| Total | | 40h |
A full work week of focused React development. Claude produced all 24 files in 25 minutes because each phase built on the patterns established in previous phases. The component architecture, state management approach, and styling conventions carried forward automatically.
The Real Leverage: Supervisory Time
Here is where the original article's methodology understates the actual multiplier.
The published leverage factor compares Claude's wall-clock time to estimated human engineering time:
Published Leverage = (Human-Equivalent Hours × 60) / Claude Minutes
This answers the question: "How much output did Claude produce relative to what a human could produce in the same elapsed time?" It is a valid measure of output expansion. But it is not a measure of my personal productivity multiplier.
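As a sketch, the published formula applied to the Production API record above:

```python
def published_leverage(human_hours: float, claude_minutes: float) -> float:
    """Published Leverage = (human-equivalent hours * 60) / Claude minutes."""
    return human_hours * 60 / claude_minutes

print(published_leverage(human_hours=120, claude_minutes=90))  # 80.0
```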
The question most readers actually care about is: "How much engineering output do you get for every minute you personally spend?" My time on each task consists of writing the prompt (typically 1-3 minutes), reviewing the output when Claude finishes (typically 2-5 minutes), and occasionally providing a follow-up correction (1-2 minutes). For most tasks, my total involvement is under 10 minutes.
Here is the same dataset recalculated with estimated supervisory time:
| Task | Human Eq. | Claude Time | My Time | Published Factor | Supervisory Factor |
|---|---|---|---|---|---|
| Production API | 120h | 90min | 15min | 80x | 480x |
| Media Export Pipeline | 24h | 12min | 5min | 120x | 288x |
| Resume Generator | 40h | 20min | 5min | 120x | 480x |
| Training Application | 40h | 25min | 8min | 96x | 300x |
| Distinction Engine | 40h | 45min | 10min | 53x | 240x |
| Electron App Phase 1 | 40h | 25min | 8min | 96x | 300x |
| MCP Server + LLM Endpoint | 16h | 12min | 5min | 80x | 192x |
The "Published Factor" column is what appears in the daily leverage records. The "Supervisory Factor" column divides human-equivalent hours by my actual time investment. The supervisory factor is 3-6x higher than the published factor because Claude runs autonomously for the vast majority of each task.
The published factor of 80x for the Production API means Claude produced 80 minutes of human-equivalent work for every minute it ran. The supervisory factor of 480x means I received 480 minutes of human-equivalent output for every minute I personally invested. Both numbers are accurate. They measure different things.
I chose to publish the more conservative number. The leverage factor as defined compares AI output to human output on a capability basis. The supervisory factor measures my personal return on time invested. The latter is the number that matters for anyone evaluating whether to adopt this workflow, but the former is the number I can defend without having to prove how long I spent on each prompt.
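The two factors differ only in the denominator. A short sketch recomputing both columns for four of the records above:

```python
# (human-equivalent hours, Claude minutes, my supervisory minutes)
records = {
    "Production API":        (120, 90, 15),
    "Media Export Pipeline": (24, 12, 5),
    "Resume Generator":      (40, 20, 5),
    "Training Application":  (40, 25, 8),
}

for name, (human_h, claude_min, my_min) in records.items():
    published = human_h * 60 / claude_min   # output vs. Claude's runtime
    supervisory = human_h * 60 / my_min     # output vs. my hands-on time
    print(f"{name}: {published:.0f}x published, {supervisory:.0f}x supervisory")
```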
Why My Numbers Are Higher Than Yours
I hear the same story constantly. "I tried AI and it generated garbage." "It can't do my job." "It doesn't work." Then the person pulls out their phone and shows me the ChatGPT app running on the free tier, or a Claude browser session with no subscription.
The gap between my results and the average experience is not mysterious. It comes from specific, compounding advantages that I have built over nearly a year of daily practice. None of these are secrets. All of them are reproducible. Most people will not bother.
Use the Best Available Models
This is the single biggest factor and the easiest to fix.
The free version of ChatGPT runs a limited model with a 16K context window and restricted reasoning capabilities. The free tier of Claude gives you Sonnet with usage caps. These are not the same tools I use. The difference between the free tier and the best paid tier of any major AI provider is not incremental. It is an order of magnitude in what the model can do competently.
| Tier | Model Quality | Context Window | Reasoning | Monthly Cost |
|---|---|---|---|---|
| ChatGPT Free | GPT-5 Mini (limited) | 16K tokens | Limited | $0 |
| ChatGPT Plus | GPT-5 | 32K tokens | Standard | $20 |
| ChatGPT Pro | GPT-5 Pro | 128K tokens | Maximum | $200 |
| Claude Free | Sonnet (capped) | 200K tokens | Standard | $0 |
| Claude Max | Opus 4.6 | 200K tokens | Maximum | $100-200 |
| Gemini Pro | Gemini 2.5 Pro | 1M tokens | Maximum | $20 |
| Grok | Grok 3 | 128K tokens | Maximum | $30 |
| Genspark.ai | Multi-model graphical | Varies | Varies | $20 |
I personally pay for the maximum usage tier of every major AI provider: Claude Max, ChatGPT Pro, Gemini Pro, Grok, and Genspark.ai for more graphical output. This is not casual experimentation. I invest hundreds of dollars per month in AI subscriptions because the difference between free and paid tiers is the difference between a novelty and a production tool.
I ran all of my Claude Code work last fall on Sonnet 4.5. It was capable but had obvious limitations. When Opus 4.6 became available, I upgraded my subscription to get the highest usage limits so I could count on using it for everything. The difference is not subtle. Opus 4.6 in Claude Code operates like pair programming with a computer science genius who has a strong work ethic, limitless energy, and no ego. It reads code, understands architecture, reasons about trade-offs, writes tests, fixes its own mistakes, and iterates until the work is done. Sonnet 4.5 could do some of this. Opus 4.6 does all of it, reliably, across sessions that run for hours.
Even Sonnet 4.6 improved dramatically over 4.5 because it has access to a 1M token context window (in beta), which allows it to reason over a much larger body of code simultaneously. Context window size is not a vanity metric. It directly determines how much of a codebase the model can hold in its working memory while making changes across multiple files.
If you are evaluating AI coding tools using a free tier or a model older than 3 months, you are testing a bicycle and concluding that motorized transport is overrated.
Keep Documentation Current
When Claude Code starts a task, it reads the project's documentation, code comments, and configuration files. If these are stale, you are feeding it contradictory information from the first keystroke. It reads the README, then reads the source files, and now it has two conflicting accounts of how the system works. It will waste time reconciling them or, worse, build on the wrong assumptions.
AI is excellent at writing documentation. I never write documentation myself. On every commit, Claude updates every README, every docstring, and every configuration comment to reflect the current state of the code. The cost is a few thousand tokens per commit. The return is that every future session starts with accurate context. Burn the tokens. Let it handle this.
Claude also handles all of my git commits. This matters more than it sounds. Claude writes detailed, descriptive commit messages that explain not just what changed but why it changed. The repository history becomes a form of living documentation. When I need to understand why a particular design decision was made three months ago, I point Claude at the git log and it traces back through its own commit messages to find the reasoning. Occasionally it surfaces context I had forgotten entirely. Verbose commit descriptions cost nothing and pay dividends every time you revisit old code.
A Year of Daily Practice
I started coding with AI in the spring of 2025. Nearly a year of daily use, across real projects, for hours every day. There is a meaningful difference between someone who has done something a thousand times and someone who has done it a few times.
I know how to write prompts that give Claude permission to operate autonomously while providing enough constraint to keep it on track. I know which tasks to batch together, when to intervene, and when to let it run. I know the failure modes and how to recover from them. This knowledge comes from volume. I work mornings, evenings, weekends, and holidays. That is my personality and I know it is unusual. What it means for this article is that I work almost every day using AI to build real projects, for hours, and have for the better part of a year. That accumulated experience directly boosts leverage. Every hour of practice compresses the next hour of output.
Stack Efficiency
Every application I build uses the same technology stack: React with TypeScript for the frontend, Python with Flask or FastAPI for the backend, WebSockets for real-time communication, Redis or Valkey for caching, Milvus for vector databases, and PostgreSQL for the database. Everything runs containerized in Docker on my laptop during development.
This consistency compounds. I can point Claude at an existing application and say "copy this architecture, but implement this other functionality" and it has a solid starting point. When I need a change that touches the data layer, the backend API, the frontend components, the unit tests, and the E2E tests, Claude makes all of those changes in a single pass. There is no division of labor where the DBA updates the database, the Python developer updates the backend, the React developer updates the frontend, and QA updates the test automation. A change that would require coordination across four specialists and three meetings happens atomically in one Claude session.
That division of labor across specialists is precisely what organizations need to move past. The traditional model of siloed expertise with handoffs and coordination overhead exists because no single human can maintain expert-level proficiency across an entire stack. AI eliminates that constraint. One person with AI can operate across the full stack with consistency that a four-person team struggles to achieve, because the AI applies the same architectural patterns and coding conventions from database to browser without the translation loss that happens at team boundaries.
I Am the Product Owner
This is a major advantage with personal projects that most people overlook. I know exactly what I want to build. I do not need to schedule a meeting, conduct a customer study, get stakeholder approval, or wait for a product manager to prioritize the backlog. The first step for any new feature is a collaborative requirements session with Claude where we hash out the specification together. Then Claude builds it, and I have something to evaluate in minutes. If it is wrong, I throw it out. If it is close, I refine it. I can iterate from concept to finished feature faster than most teams can schedule the kickoff meeting.
This advantage is specific to personal projects and small teams. In an enterprise setting, the requirement-gathering overhead is real and necessary. But for solo builders, AI eliminates the execution bottleneck entirely, leaving only the creative decision of what to build next.
Greenfield Work Dominates
A large portion of the work in my leverage records is creating things from scratch, not maintaining existing applications. Greenfield work produces the highest leverage because Claude can build without navigating legacy constraints, implicit dependencies, or institutional knowledge embedded in years of accumulated code.
When I start a greenfield project, Claude stays busy for extended periods, building entire subsystems with tests and documentation. The leverage factors for this kind of work routinely exceed 80x. If I were only asking Claude to implement one function at a time in an existing codebase, the numbers would be dramatically lower.
On existing applications, I work in features, not fragments. I do not prompt Claude to build one small piece, then prompt again for the next piece. I describe the entire feature I want, ask for a design, review the design for gaps, and then let it execute the full implementation. Batch scope drives leverage.
Automation Compounds
Over months of daily Claude Code use, I have built a deep layer of automation that accelerates every session. Most people know about the project-level CLAUDE.md file that governs a single codebase. Fewer people realize there is a global CLAUDE.md that governs all projects. That is where I put cross-project automations, preferences, and workflows.
I never edit these files manually. I say "take a note about how we just handled that so we remember it globally next time" and Claude adds the entry. Every future session in every project picks up that knowledge automatically.
Examples: Claude speaks task completions aloud through ElevenLabs voice synthesis. Every commit triggers an automatic documentation update pass. Every new article runs through an AI detection scan before deployment. Every session announces its progress in a dedicated Slack thread. Each of these automations was set up once, in a single prompt, and runs automatically in every subsequent session. The compound effect of dozens of these small automations is substantial. My sessions are more autonomous, more consistent, and more productive than they were six months ago because the automation layer keeps growing.
Belief
This is the factor nobody talks about because it sounds unscientific. But it is the most important one.
I choose to believe that AI can get the job done until it proves otherwise. I know that Claude is a more capable coder, architect, data analyst, and infrastructure engineer than I will ever be at the implementation level. My job is to understand how to direct that capability toward building the right things. "Ask and you shall receive" applies directly. The prompts that produce 200x leverage are the ones that ask for something ambitious and give Claude the space to deliver.
When it goes wrong, I try again. When the second attempt goes wrong, I try a different angle. When it figures out the right approach, I have it record the solution so the next session benefits from the lesson. Most people give up after the first failure and conclude the technology does not work. Most people also never attempt using AI for anything beyond simple chat questions because they do not believe it can handle real engineering work.
I understand why. A lot of engineers are not just skeptical; they are scared. If they acknowledge that AI can do the implementation work of a senior engineer, the logical next question is whether their own job is secure. That fear is rational. Jobs will change. Some will disappear. But the engineers who will remain valuable are the ones who learn to maximize their leverage: the ones who can direct AI to build the right things, catch the mistakes, and iterate faster than anyone working without it. Resisting the technology does not protect your career. Learning to wield it does.
It is also genuinely difficult to put faith in something that operates as a complete black box. Nobody using Claude Code actually understands how it produces what it produces. Neither do the people who built it. As the New Yorker reported in "What Is Claude? Anthropic Doesn't Know, Either", Anthropic's own researchers describe large language models as black boxes: "We don't really understand how they work." The company has an entire interpretability team dedicated to examining Claude's neurons and running it through psychology experiments, and they still cannot fully explain its behavior. I cannot explain why it writes correct WebSocket reconnection logic or why it knows the right way to structure a Redux store. I just know that it does, consistently, and that the output holds up under testing. Trusting a tool that nobody on the planet can fully explain requires a different kind of confidence than trusting your own hands-on-keyboard skills. That transition is uncomfortable. It is also necessary.
Last spring, that skepticism was often justified. Models were weaker. Context windows were shorter. Agentic tooling barely existed. The landscape in early 2026 is categorically different. The capabilities are there. The question is whether you believe enough to push through the learning curve and find out.
Key Takeaways
The leverage factor numbers are real. They are also the conservative version of the story. The supervisory factor, measuring output against my actual time investment rather than Claude's runtime, runs 2.4x to 6x higher than what I publish.
The METR study measured something valid: autocomplete-style AI tools slow down experienced developers working on mature codebases they already know. My workflow is the inverse: agentic AI building greenfield systems from specifications. Different tool, different task type, different result.
The gap between my numbers and the average developer's experience comes from compounding advantages: best-tier models, current documentation, a year of daily practice, stack consistency, product ownership, greenfield scope, accumulated automation, and the willingness to push the tools further than most people attempt.
None of these advantages are proprietary. Every one of them is available to anyone willing to invest the time and the subscription fees. The technology is ready. The question is whether you are.
This experiment is still early. I have been tracking leverage records daily for only a few weeks. New data accrues every day. I will report again after a month of solid data, with enough volume to show trends, outliers, and how the numbers evolve as I push the workflow into new problem domains.
Additional Resources
- The Leverage Factor: Measuring AI-Assisted Engineering Output for the original methodology and dataset
- Daily Leverage Records published on this site
- Overlooked Productivity Boosts with Claude Code for practical Claude Code productivity techniques
- Giving Claude Code a Voice with ElevenLabs for the voice automation workflow
- METR Study: Measuring the Impact of Early AI Assistance for the counterargument
- Faros AI: The AI Productivity Paradox for enterprise-level data
- Anthropic: How AI Is Transforming Work for Anthropic's internal productivity data
- MIT Sloan: The Hidden Costs of Coding with Generative AI for the comprehension-performance gap
- The New Yorker: What Is Claude? Anthropic Doesn't Know, Either on the black box nature of LLMs
- Single Serving Applications - The Clones for examples of high-leverage application generation
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.