Every week I'd hit Claude Code's quota wall by Thursday. Here's what the audit turned up.

Four changes that tripled my weekly Claude Code quota: opusplan, Context Mode, SocratiCode, local LLM delegation. Measured over 24 days, with open-source telemetry scripts.

Arturo Cárdenas
Founder & Chief Data Analytics & AI Officer
May 18, 2026 · 19 min read
Key Takeaway

For weeks I was hitting Claude Max's weekly limit by Thursday. I audited 24 days of JSONL transcripts and found 99% of work routing to Opus 4.x by default. Four changes brought the week back with headroom. opusplan did 94% of the gain. Context Mode, SocratiCode, and local LLM delegation finished the job.

Thursday afternoon. Quota banner: Weekly usage limit reached. Two days early. Again.

(My quota resets Saturday at 1am, which is why the wall lands Wednesday or Thursday — five or six days into a seven-day cycle. Claude Max anchors your week to your subscription date, so yours could reset on any day. The shape is the same: out N days early, every week.)

I know the pattern by now. When something breaks at the same hour every week, the problem isn't the limit. It's the approach.

So instead of trying to dodge the quota, I ran cost-per-day.sh against my Claude Code JSONL and went to see what was actually going on.

[Figure: Cumulative weekly credit burn, before and after the full stack]

This is what "running out of quota" looked like every week. Three days of sprint still ahead, banner up, work on hold. The teal line is the same workload with all four levers active. Full week, with headroom.


The short version

This is a Max 200 story. The $200/month invoice doesn't change. What changes is how much work fits in the same weekly window.

Quick heads-up on the math: every $/day number below is an API-equivalent estimate computed from JSONL token counts at Anthropic's published rates. On a flat-rate Max plan those dollars don't exist on any invoice — they're the cleanest proxy I have for quota consumption. Full formula in the methodology block under "The audit."

Four changes produced a 3× improvement in quota capacity. Here they are:

| Lever | What it is | Effort |
| --- | --- | --- |
| ANTHROPIC_MODEL=opusplan | Sonnet by default; Opus only in Plan Mode | One environment variable |
| Context Mode MCP | Sandboxes tool output in a local index; only what you search for enters context | One .mcp.json block + hooks |
| SocratiCode (semantic codebase search) | Returns ~5 ranked chunks per query instead of grep/Read dumping full files | Local MCP + a nudge hook on grep/rg/find |
| Local LLM delegation | Files never enter Claude's context. Path in, result out | Requires a non-Claude executor |

And the results:

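| Period | What changed | $/day | $/turn | Value mult |
| --- | --- | --- | --- | --- |
| W0 Apr 25–30 | Baseline (99% Opus) | $858 | $0.215 | |
| W1 May 1–7 | opusplan + Context Mode wiring | $356 | $0.108 | 2.4× |
| W2 May 8–14 | Stack stabilized | $283 | $0.090 | 3.0× |
| W3 May 15–18 | Steady state | $326 | $0.095 | 2.6× |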

A single change did about 94% of the work. The other three did the remaining 6% — and I almost didn't ship them.

This is the individual / Max-plan version of what teams have been figuring out at API scale, with measurement scripts you can run on your own JSONL. The repo lives at clarivant/claude-credits-multiplier. Everything below is reproducible.


The audit

Claude Code writes one JSONL file per project at ~/.claude/projects/<hash>/*.jsonl. One entry per assistant turn, with the model, the token counts (input / cache-creation / cache-read / output), and a timestamp. I wrote a script, cost-per-day.sh, that walks the log and breaks down spend by day and model tier. It's in the repo along with everything else in this post.
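A minimal sketch of the kind of walk described above, written in Python rather than shell. It is not the repo's cost-per-day.sh. The field names (`type`, `timestamp`, `message.model`) and the sample model ids are assumptions about the transcript layout, so check them against your own files before trusting the counts.

```python
import json
from collections import Counter, defaultdict

def model_tier(model_name):
    """Map a model id to a coarse tier by substring (assumed naming)."""
    name = model_name.lower()
    for tier in ("opus", "sonnet", "haiku"):
        if tier in name:
            return tier
    return "other"

def turns_by_day(jsonl_lines):
    """Count assistant turns per calendar day and model tier."""
    days = defaultdict(Counter)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("type") != "assistant":
            continue  # skip user turns and other record types
        day = entry.get("timestamp", "")[:10]  # ISO date prefix, YYYY-MM-DD
        model = entry.get("message", {}).get("model", "")
        days[day][model_tier(model)] += 1
    return days

def opus_share(counter):
    """Fraction of a day's turns that ran on an Opus-tier model."""
    total = sum(counter.values())
    return counter["opus"] / total if total else 0.0

# Hypothetical transcript lines, for illustration only.
sample = [
    '{"type": "assistant", "timestamp": "2026-05-02T10:00:00Z", "message": {"model": "claude-opus-x"}}',
    '{"type": "assistant", "timestamp": "2026-05-02T10:01:00Z", "message": {"model": "claude-sonnet-x"}}',
    '{"type": "assistant", "timestamp": "2026-05-02T10:02:00Z", "message": {"model": "claude-sonnet-x"}}',
    '{"type": "assistant", "timestamp": "2026-05-02T10:03:00Z", "message": {"model": "claude-sonnet-x"}}',
    '{"type": "user", "timestamp": "2026-05-02T10:04:00Z", "message": {}}',
]
days = turns_by_day(sample)
```

To run it on real data, feed it the lines of each file under ~/.claude/projects/ and print `opus_share` per day.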

Methodology in one block. The numbers come from JSONL transcripts running April 25 to May 18: 24 days, 82,973 turns. $/turn is computed as (input × $/MTok_input + output × $/MTok_output) / turns at Anthropic's published April 2026 rates ($25/MTok for Opus output, $15/MTok for Sonnet output). I tag numbers as Hard when measured directly and as Estimated when computed from a declared formula. The tags appear on every figure where the distinction matters.
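The $/turn formula above, sketched as a small function. The output rates are the ones quoted in this post. The input rate is left as a required parameter because the post does not state it, and the function name and signature are illustrative rather than taken from the repo.

```python
OPUS_OUTPUT_USD_PER_MTOK = 25.0    # quoted in the post
SONNET_OUTPUT_USD_PER_MTOK = 15.0  # quoted in the post

def cost_per_turn(input_tokens, output_tokens, turns,
                  usd_per_mtok_in, usd_per_mtok_out):
    """(input x input rate + output x output rate) / turns, rates per million tokens."""
    if turns <= 0:
        raise ValueError("turns must be positive")
    dollars = (input_tokens * usd_per_mtok_in
               + output_tokens * usd_per_mtok_out) / 1_000_000
    return dollars / turns

# Output-only illustration: 1,000 Sonnet turns averaging 790 output tokens each.
sonnet_output_only = cost_per_turn(
    input_tokens=0, output_tokens=790_000, turns=1_000,
    usd_per_mtok_in=0.0, usd_per_mtok_out=SONNET_OUTPUT_USD_PER_MTOK,
)
```

The example leaves input tokens at zero on purpose, so it shows only the output-side contribution. Supply your own input rate to get a full estimate.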

Sample output:

$ ./cost-per-day.sh
2026-04-28  $1267  Opus:99% Sonnet:1%   turns:4892
2026-05-02  $203   Opus:14% Sonnet:86%  turns:3201  ← opusplan activated
2026-05-09  $102   Opus:11% Sonnet:88%  turns:2987
2026-05-14  $78    Opus:12% Sonnet:87%  turns:2654

What came out when I ran it:

  • 99% of turns on Opus 4.x. Every single one, no matter the task.
  • Average cost per turn: $0.2568 at API-equivalent pricing.
  • Opus costs 1.67× as much as Sonnet at the output rate ($25/MTok Opus vs $15/MTok Sonnet 4.6, per Anthropic's April 2026 pricing; same rate across Opus 4.6 and 4.7). Every turn was paying the Opus premium.
  • Weekly trajectory: out of quota by Thursday or Friday.

Before going further, how the quota actually works. This matters for understanding the problem. Claude Max 200 has a total pool of weekly credits shared across all models. Sonnet also has its own per-model cap, separate from the total. If you only use Sonnet, you can exhaust the Sonnet cap without touching the total pool, and your Opus and Haiku credits stay intact. If you only use Opus, you drain the total pool entirely and the Sonnet cap never moves. I was doing the second one. Every one of those 99% Opus turns drained the shared pool. Sonnet's cap stayed full. I was burning the only resource that decides Thursday. (Anthropic documents the two-layer structure officially. Different layer, same wall.)

[Figure: Two-pool quota mechanic, depleted total pool vs balanced usage]

Two scenarios, same starting credits. Left: Opus drains the shared pool fast while Sonnet's credits sit untouched. Right: Sonnet handling 84% of turns means both pools have headroom on Sunday. The gap between the two right-side fill levels is the 3× multiplier in visual form.

The obvious thing, said plainly: every task, however mechanical, was routing to the most expensive model by default. A bash command: Opus. A commit message: Opus. A three-line diff review: Opus. (Yes, all of them.)

And it kept stacking up. A plugin I rely on was dispatching three subagents per task: one implementer and two reviewers. All of them inheriting Opus from the parent session. Every planning turn was firing off a small Opus cluster I couldn't even see.

I had no telemetry until I wrote some. That's the actual lesson here. Before you can fix quota consumption, you need to be able to see it.


I had this in the wrong order

For weeks I'd been burning through my credit two, three, four days before it should reset. I was getting frustrated. The first lever I built was local LLM delegation — figured that's where the win had to be, since that's where the biggest tokens were going to Claude. And no. I was saving a couple of dollars a week.

So I built out the rest of the stack: opusplan, Context Mode, SocratiCode, the LiteLLM proxy. Per-call dollar savings barely moved. The marginal effect on top of local delegation was significant in percentage but small in absolute terms — under 1.3% of the call cost. Looking at the per-call number, I almost didn't ship any of this.

What I finally noticed, after sitting with the data longer, was that my credits were lasting longer. I was reaching the end of the week without hitting the wall. That's when it clicked. It wasn't just the tokens that weren't being sent because the local model processed them. It was the tokens being saved on every call, every turn, compounding across the rest of the session. Saving and saving and saving.

That's why this post goes in order of impact — opusplan first, Context Mode second, SocratiCode third, local LLM fourth. It isn't the order I figured them out. The order of discovery was exactly reversed.


Lever 1: opusplan — one env var, ~94% of the gain

[Figure: Daily API-equivalent spend, opusplan activated May 2]

Daily API-equivalent spend, computed from JSONL token counts at April 2026 Opus/Sonnet rates. The W0 baseline (magenta) averaged $858/day. The W2 stretch (teal-shaded), with all four levers active, averaged $283. The $1,267 spike on April 28 turned into May 14's $78. 16× cheaper day-to-day.

The claude binary supports a routing mode called opusplan. You activate it as a shell environment variable:

export ANTHROPIC_MODEL=opusplan

When it's active, Opus only runs in Plan Mode (Shift+Tab). Everything else routes to Sonnet: execution turns, tool use, subagent dispatches.

Before relying on it, I verified it was actually wired up:

$ strings $(which claude) | grep opusplan
opusplan   # appears 5× in the binary

The flag is officially documented, but it doesn't appear in the standard /model picker. That's why most of the devs I've talked to describe it as "hidden." The grep above is reproducible verification you can run on your own install. It's been stable across multiple Claude Code releases through May 2026.

I set the variable on May 2. That day Opus dropped from 99% to 14.2% of turns. The week that started the following Monday was the first full sprint week that didn't hit a quota wall.

Where the 94% comes from. opusplan owns every non-Plan routing decision, which is 99% of all turns. The other three levers compound on top via the cache-read multiplier, and their savings show up as quota stretch rather than per-turn dollars — harder to attribute cleanly. The 94% is the share of $/turn reduction attributable to model routing alone. The remaining 6% is Context Mode, SocratiCode, and local LLM delegation acting together. (You won't find that exact decomposition derived line-by-line in the post — the dollars don't add up linearly because the levers interact through cache reads. The number reflects routing's dominance of the dollar-cost lever, not a 94/6 split of the multiplier.)

The 2.4× vs 3.0× distinction. Fair objection: didn't I just do less work in W2? The per-turn number answers that on its own. W0 = $0.215/turn, W2 = $0.090/turn. 2.4× cheaper per unit of work, independent of volume. The 3.0× $/day number includes a slight volume drop (3,164 turns/day vs W0's 3,996). Both numbers hold.
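The two ratios can be recomputed straight from the results table. This is a small arithmetic check using only the rounded figures printed in this post, so the implied turns/day land near, not exactly on, the 3,164 and 3,996 quoted above.

```python
# (dollars per day, dollars per turn) as reported in the results table
WEEKS = {
    "W0": (858, 0.215),
    "W1": (356, 0.108),
    "W2": (283, 0.090),
    "W3": (326, 0.095),
}

def ratios(week, baseline="W0"):
    """Compare a week against the baseline on both cost axes."""
    base_day, base_turn = WEEKS[baseline]
    day, turn = WEEKS[week]
    return {
        "per_day": base_day / day,            # includes any change in volume
        "per_turn": base_turn / turn,         # independent of volume
        "implied_turns_per_day": day / turn,  # from rounded inputs, so approximate
    }

w2 = ratios("W2")
```

The per-turn ratio is the one that survives the "you just did less work" objection, because volume cancels out of it.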

Turn-share isn't token-share. Across all turns after activating opusplan, Opus settled at 15.8%. That single May-2 day was 14.2%, but the running average is what matters for the multiplier. That 15.8% of turns translates to 30.5% of output tokens, because Plan Mode responses run longer (1,838 tokens vs 790 for Sonnet). The savings come from Sonnet at $15/MTok handling the 84% of turns that don't need architectural judgment. Not from Opus doing less work.
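A quick check of that turn-share to token-share conversion, using the averages quoted above. The function name is illustrative. With these rounded inputs it gives about 30.4%, which matches the 30.5% figure to within rounding.

```python
def output_token_share(turn_share, avg_tokens, other_avg_tokens):
    """Share of output tokens for a model with the given share of turns."""
    own = turn_share * avg_tokens
    other = (1 - turn_share) * other_avg_tokens
    return own / (own + other)

# Opus: 15.8% of turns at 1,838 tokens; Sonnet: the rest at 790 tokens.
opus_token_share = output_token_share(0.158, 1838, 790)
```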

Subagent inheritance. The obvious question: if Plan Mode fires on Opus and dispatches subagents, do those subagents inherit Opus? They don't. ANTHROPIC_MODEL=opusplan is a shell environment variable. Subprocesses spawned by Agent inherit the parent's full environment. Every dispatch follows the same routing. I audited the JSONL specifically for this. Zero Opus turns marked as subagent (isSidechain=True) across 9,298 Opus calls after May 2. The 15.8% is honest across orchestrator and every subagent alike.
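The subagent audit amounts to counting Opus turns that carry the sidechain flag. A minimal sketch, assuming each JSONL entry exposes `type`, `message.model`, and a boolean `isSidechain` (the flag name comes from this post; the rest of the layout and the sample model ids are assumptions):

```python
import json

def opus_sidechain_counts(jsonl_lines):
    """Return (opus_turns, opus_turns_flagged_as_sidechain)."""
    opus_total = 0
    opus_sidechain = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("type") != "assistant":
            continue
        model = entry.get("message", {}).get("model", "")
        if "opus" not in model.lower():
            continue
        opus_total += 1
        if entry.get("isSidechain") is True:
            opus_sidechain += 1
    return opus_total, opus_sidechain

# Hypothetical entries, for illustration only.
audit_sample = [
    '{"type": "assistant", "isSidechain": false, "message": {"model": "claude-opus-x"}}',
    '{"type": "assistant", "isSidechain": true, "message": {"model": "claude-sonnet-x"}}',
    '{"type": "assistant", "isSidechain": false, "message": {"model": "claude-opus-x"}}',
]
total, flagged = opus_sidechain_counts(audit_sample)
```

A nonzero second number would mean some subagent dispatches are still landing on Opus.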

Your number will be different. The 15.8% average mixes projects with very different workloads:

| Work type | Project (anonymized) | Post-activation Opus% |
| --- | --- | --- |
| Architecture + analysis | llm-infra | 34.4% |
| Web + content strategy | consultancy-site | 32.4% |
| Feature implementation | fintech-tool | 15.2% |
| Content only / no Plan Mode | client-cms | 0% |

If your work is mostly architectural planning, expect somewhere between 25 and 40%. If it's mostly implementation, expect between 5 and 15%.


Lever 2: Context Mode and the cache-read multiplier

This lever shows up less in the data, but it compounds harder.

Every turn in a Claude Code session re-reads the session cache. The cache hit rate ran at 96.1% across 82,973 turns over 24 days. A token that enters context early gets re-read on practically every subsequent turn until the session closes.

The math is annoying but matters. A token that enters context at turn 50 of a 500-turn session gets re-read about 450 times. Keep it out and you've avoided 450 reads, not one. Averaged over every possible entry point in a session of M turns, that comes to roughly M/2 reads avoided per token you divert.
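The M/2 figure can be checked by brute force. This sketch assumes the simple model used in the paragraph above: a token entering at turn t is re-read once on every later turn, and entry points are spread evenly across the session.

```python
def rereads_avoided(entry_turn, session_turns):
    """Re-reads saved by keeping one token out, if it would have entered at entry_turn."""
    return max(session_turns - entry_turn, 0)

def mean_rereads_avoided(session_turns):
    """Average over every possible entry turn, 1..session_turns."""
    total = sum(rereads_avoided(t, session_turns)
                for t in range(1, session_turns + 1))
    return total / session_turns

turn_50_of_500 = rereads_avoided(50, 500)   # the worked example from the text
average_for_500 = mean_rereads_avoided(500)  # close to 500 / 2
```

The exact average under this model is (M − 1) / 2, which is why "roughly M/2" is a fair shorthand for long sessions.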

Context Mode MCP (maintained by mksglu, installed via npx context-mode) takes tool output and sandboxes it into a local FTS5 index on your machine. Web fetches, large file reads, command output. All of it goes to the sandbox. Only what you explicitly search for with ctx_search enters Claude's context. The maintainer reports about 98% context reduction on typical sessions. My own 15-day numbers are below.

Setup is one block in .mcp.json:

{
  "mcpServers": {
    "context-mode": {
      "command": "npx",
      "args": ["-y", "context-mode"]
    }
  }
}

And then enforce it with hooks. Any Read over 200 lines without a limit parameter gets hard-blocked and rerouted to ctx_execute_file. Any web fetch routes through ctx_fetch_and_index. Without the hooks, Context Mode exists but applies inconsistently.
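A sketch of the decision logic such a Read hook might apply. This is not the hook from the repo. It assumes the hook receives a JSON payload with `tool_name` and `tool_input` (with `file_path` and an optional `limit`), and that exiting with code 2 blocks the call and surfaces stderr to Claude, which is my understanding of the PreToolUse contract. Verify both against the hooks documentation for your version.

```python
import json

MAX_LINES = 200  # threshold described in the post

def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh)

def should_block(payload, max_lines=MAX_LINES):
    """True for a Read of a long file that passes no explicit limit."""
    if payload.get("tool_name") != "Read":
        return False
    tool_input = payload.get("tool_input", {})
    if tool_input.get("limit") is not None:
        return False  # caller already bounded the read
    try:
        return count_lines(tool_input.get("file_path", "")) > max_lines
    except OSError:
        return False  # unreadable or missing path: let the tool report it

def run_hook(stdin, stderr):
    """Entry point: returns the exit code the hook process should use."""
    payload = json.load(stdin)
    if should_block(payload):
        stderr.write("Read blocked: file exceeds 200 lines; use ctx_execute_file instead.\n")
        return 2  # assumed to mean "block this tool call"
    return 0
```

Wired up as a script, the last line would be `sys.exit(run_hook(sys.stdin, sys.stderr))`.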

15-day results (May 3 → May 18):

| Metric | Value | Confidence |
| --- | --- | --- |
| Tokens kept out of Claude context | 8,109,886 | Hard (via ctx_stats per session) |
| Direct savings at Opus input rate | $121.65 | Hard |
| Total ctx_* calls across 58 sessions | 2,145 | Hard |
| Avoided cache reads (10-day est.) | ~1.69 billion | Estimated |
| Estimated cache-read multiplier value | ~$2,528 | Estimated |

The Estimated numbers use tokens_kept_out × (session_turns / 2). I keep the Hard/Estimated labels throughout the post. The distinction matters.
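The Estimated formula, sketched per session and then summed. The session tuples below are made up for illustration. The $15/MTok rate in the second example is not stated in this post; it is inferred by dividing the $121.65 figure by the 8,109,886 tokens in the table, so treat it as an assumption.

```python
def estimated_avoided_reads(sessions):
    """sessions: iterable of (tokens_kept_out, session_turns) pairs."""
    return sum(tokens * turns / 2 for tokens, turns in sessions)

def dollars(tokens, usd_per_mtok):
    """Convert a token count to dollars at a per-million-token rate."""
    return tokens * usd_per_mtok / 1_000_000

# Hypothetical sessions: (tokens kept out, turns in that session).
example_sessions = [(10_000, 400), (5_000, 100)]
avoided = estimated_avoided_reads(example_sessions)

# Reproduces the Hard "direct savings" row under the inferred rate.
direct_savings = round(dollars(8_109_886, 15), 2)
```

Summing per session matters: a long session multiplies its own diverted tokens far more than a short one does, so a single global average would blur the estimate.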

The spike days, May 9 (2.1M tokens sandboxed) and May 11 (2.2M), were multi-file sprint sessions where tool output would have saturated the context window. Those are the days where Context Mode shows up most clearly in the data.

One caveat on quality. No controlled A/B test here. The claim is that sandboxing tool output and retrieving relevant chunks doesn't degrade response quality. In practice it held across every session. Code compiled, searches returned useful results, tasks completed correctly. But this is anecdotal verification, not a benchmark. A failure case would be hard to tell apart from a normal mistake.


Lever 3: SocratiCode — semantic codebase search

The other place Claude's context fills up is when you grep a codebase and end up reading the top 20 matches.

SocratiCode is a local MCP server (Docker-based, fully offline) that indexes the project via local embeddings (nomic-embed-text on GPU 1 → Qdrant) and returns ranked chunks via hybrid semantic + BM25 search. Each codebase_search call returns ~5 ranked chunks (about 500 lines total) instead of grep/Read dumping full files into Claude's context.

The savings compound through the same cache-read mechanic as Lever 2. A chunk that never enters context at turn N doesn't get re-read on every subsequent turn for the rest of the session. Same mechanism, different path.

Operational numbers (May 2 → May 18, 17 days post-fix):

| Metric | Pre-fix (Apr 26–28 sprint days) | Post-fix | Lift |
| --- | --- | --- | --- |
| codebase_search calls per day | 4–7 | 41.8 | 7.8× |
| Total codebase_search calls | | 711 (May 2 → May 18) | |
| Indexing tokens to local GPU (never touched Claude) | | 29.4M (Apr 25 → May 18) | |

Why there isn't a Hard token-savings number here. Unlike Context Mode's ctx_stats, SocratiCode doesn't expose a per-call metrics hook. Each search replaces an unknowable count of grep+Read sequences that would have flooded context. The 711 call count is the cleanest measurement of how much I'm using it. The tokens kept out are estimated downstream from the cache-read multiplier math, same mechanic as Lever 2, just unmeasured per-call.

Why the pre-fix → post-fix jump was 7.8×. The wiring was broken in two ways at once. The base64 embedding bug (covered in "What didn't work" below) meant the index was effectively unusable. And CLAUDE.md had the wrong MCP tool name (socraticode_* instead of mcp__socraticode__*), so Claude Code couldn't reliably reach the tool. Both got fixed May 2, along with socraticode-nudge.sh, a PreToolUse hook that fires on grep / rg / find and redirects to semantic search. After all three landed, usage jumped from 4–7 calls/day on heavy sprint days to 41.8/day.
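A sketch of the detection step a nudge hook like this needs: spotting grep, rg, or find in command position inside a shell command string. This is not socraticode-nudge.sh itself. The function name is hypothetical, and the heuristic is deliberately simple (it looks only at the first word and at words that follow a pipe or command separator).

```python
import shlex

SEARCH_TOOLS = {"grep", "rg", "find"}
SEPARATORS = {"|", "||", "&&", ";", "&", "("}

def search_tool_in(command):
    """Return the first grep/rg/find found in command position, else None."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    try:
        tokens = list(lexer)
    except ValueError:  # unbalanced quotes and similar
        return None
    previous = None
    for token in tokens:
        at_command_position = previous is None or previous in SEPARATORS
        if at_command_position and token in SEARCH_TOOLS:
            return token
        previous = token
    return None
```

Checking position rather than plain substring keeps `echo find` or a file named grep.txt from triggering the nudge.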

For exact strings (brand names, CSS classes, constants), grep stays the right tool. SocratiCode is for "where is X defined / who calls Y / where does config Z get read."


Lever 4: Local LLM delegation

I run two RTX 3090s with a LiteLLM proxy in front (Qwen3.6-27B on GPU 0, Qwen3-14B on GPU 1, embeddings on GPU 0 alongside the larger model). The hardware matters less than the protocol. code_task_files passes a path to the local model instead of the content. The model reads from disk, writes the output to disk, and Claude gets the result path. The file content never enters Claude's context.

| Method | Tokens to Claude |
| --- | --- |
| Inline file content (code_task) | ~7,500 |
| File path via code_task_files | ~250 |

97% fewer Claude tokens per call. In absolute terms across the week it's the smallest of the four levers, but the per-call ratio is the most dramatic.
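The path-in, result-out pattern can be sketched in a few lines. This is an illustration of the protocol, not the code_task_files implementation. The `executor` argument is a hypothetical callable standing in for whatever local or remote model you delegate to, and the output naming is my own choice.

```python
from pathlib import Path

def delegate_by_path(task, input_path, executor, out_dir):
    """Run a task on a file and return only the path of the result.

    The file content and the model output both stay on disk; the caller
    (and therefore the orchestrating model) sees a short path string.
    """
    source = Path(input_path).read_text(encoding="utf-8")
    result = executor(task, source)  # hypothetical: calls the non-Claude model
    out_path = Path(out_dir) / (Path(input_path).stem + ".result.txt")
    out_path.write_text(result, encoding="utf-8")
    return str(out_path)
```

The design choice is that the return value is always small and roughly constant in size, no matter how large the input file is.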

The honest part. This needs a non-Claude executor. 890 calls, 91% offload ratio from May 1 to May 18. Zero errors in the most recent 389-call LiteLLM-proxied reconciliation. Without local hardware, you'd delegate to an external API and the savings move to a different bill. The pattern is valuable either way. The economics depend on what your executor is.


What didn't work

Three things I tried and had to revert or fix. (This is the section that lets me sleep at night publishing the numbers above.)

nomic-embed via LiteLLM. I tried routing embedding calls through the LiteLLM proxy for unified telemetry. Cosine equivalence failed silently. Root cause: nomic-embed-text needs task-prefix injection (search_query: / search_document:), which the LiteLLM → Ollama adapter drops without warning. Results looked plausible. They weren't equivalent. I stayed on Ollama direct (port 11434). Lesson: test cosine similarity before trusting any embedding adapter.
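A minimal version of that cosine check, in plain Python so it runs anywhere. Embed the same text through both paths and compare the vectors. The 0.999 threshold is an assumption I picked for illustration; choose one that fits your model and precision.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)

def embeddings_equivalent(direct, proxied, threshold=0.999):
    """True if the two paths produced effectively the same vector."""
    return cosine(direct, proxied) >= threshold
```

The explicit length check doubles as a guard against shape bugs: mismatched dimensions raise immediately instead of producing a misleading score.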

The base64 dimension bug (and a vendored patch). Lesson upfront: when a tool returns wrong-shaped embeddings, the bug is almost always upstream in the SDK + adapter chain. Not in the tool you're holding. This wasn't a SocratiCode failure (the same bug had landed in three other projects: langchain-ai/langchainjs#8221, mem0ai/mem0#4057, getzep/graphiti#868). I'm still running SocratiCode in production. The patch is one line.

What happened: SocratiCode started returning 192-dimensional embeddings against a 768-dim Qdrant collection. Searches throwing 400s, upserts failing, total confusion until I checked the actual embedding dimensions.

Root-cause chain: OpenAI Node SDK 6.34.0 sends encoding_format: "base64" by default (since openai-node #1310). LiteLLM 1.83.13's Ollama adapter ignores the flag and returns raw float arrays. The SDK mis-decodes them as a Float32Array that's exactly 4× too short (768 → 192 dim).

Fix: vendored SocratiCode 1.6.1 with a one-line patch forcing encoding_format: "float".
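The 4× shrink is what you would expect if a 768-element array gets treated as 768 bytes and then reinterpreted as 4-byte floats. The sketch below reproduces only that arithmetic in Python. I have not traced the SDK's actual code path, so read `misdecode_float_list` as an illustration of the failure shape, not a description of the SDK.

```python
import base64
import struct

def decode_base64_embedding(b64_payload):
    """Correct path: base64 -> bytes -> little-endian float32 values."""
    raw = base64.b64decode(b64_payload)
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw[:count * 4]))

def misdecode_float_list(values):
    """Illustrative failure: one byte per element, then read as float32."""
    raw = bytes(int(v) & 0xFF for v in values)
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw[:count * 4]))

floats = [0.5] * 768
payload = base64.b64encode(struct.pack("<768f", *floats))

good = decode_base64_embedding(payload)  # 768 values, as intended
bad = misdecode_float_list(floats)       # a quarter of the length
```

Checking `len()` of a returned embedding against the collection's configured dimension is the fastest way to catch this class of bug.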

The "40% slower at 280W" power cap. My initial docs claimed that capping the GPUs at 280W made the stack 40% slower than stock. The benchmark showed 9%. This workload is memory-bandwidth bound. More power adds FLOPS, but VRAM bandwidth is the bottleneck. The 40% figure came from forum posts about a completely different workload class. Lesson: don't import benchmarks without checking whether the workload matches yours.


What this is and isn't

On Max flat-rate, none of this lowers the invoice. The $200/month bill doesn't change. The value lives entirely in quota capacity. More sprint weeks before the cap, fewer Thursday afternoon pauses.

One nuance worth flagging. After opusplan, most turns route to Sonnet. If your workload is extremely execution-heavy, you might hit the Sonnet per-model cap before the total pool. A different Thursday wall. The telemetry tells you which one you're approaching. The fix is the same either way: measure first.

On usage-based API billing, the savings are direct dollar reductions. opusplan moves most work from Opus ($25/MTok output) to Sonnet ($15/MTok). That's roughly 40-60% off your API bill depending on your current model mix. Context Mode adds more on top. The 3× multiplier is Max flat-rate math. Your usage-based equivalent will be a different number, but the direction is the same.

None of this is visible without telemetry. Read your own JSONL or you're guessing.


Frequently asked questions

Do I need local GPUs to get most of the benefit?

No. The two biggest levers (opusplan and Context Mode) are fully hardware-agnostic. SocratiCode and local LLM delegation both need a non-Claude executor — local GPU works, but a remote embeddings API or remote LLM also works. The 94% gain from opusplan alone is one env var, no infrastructure required. Start there.

What's the simplest place to start?

export ANTHROPIC_MODEL=opusplan in your shell profile. One line, nothing else. Confirm it took: look for "model": "claude-sonnet" in your JSONL on the next session, or run opusplan-validation.sh from the claude-credits-multiplier repo. Measure one baseline week, set the variable, measure the next week. The inflection shows up plainly in the data.

Will Anthropic change or remove opusplan without notice?

The flag is officially documented but doesn't appear in the standard /model picker. That's why most practitioners describe it as "hidden." It's been stable across multiple Claude Code releases through May 2026. To confirm it's wired on your install:

strings $(which claude) | grep opusplan

If it disappears after an update, the feature was removed. Official documentation gives me more confidence than "found only in the binary" would. But I wouldn't treat any specific model-routing flag as a permanent guarantee.


Closing

What surprised me most: 94% of the quota improvement came from a single shell environment variable that changed which model handles which kind of turn. The GPU rig is genuinely useful infrastructure. It also contributed the smallest fraction of the weekly quota improvement. (That bothered me a little.)

The infrastructure that actually mattered was the measurement. Without cost-per-day.sh surfacing what the JSONL was really saying, none of this would have been visible until I ran out of quota on a Thursday again.

The telemetry scripts live in claude-credits-multiplier: cost-per-day.sh for per-day spend by model, opusplan-validation.sh to confirm your binary supports the routing flag, ctx-stats-parser.sh to summarize Context Mode savings. Run them against your own JSONL. Your baseline will be different, your gaps will be different, your levers will be different.

The shape of this work — measure the real cost of something opaque, find the lever that does 94% of the work, ship a measurement script so the gain is verifiable — is the same shape I use on data pipeline cost, analytics trust, and AI infrastructure at Clarivant. Different surface, same engineering discipline. If you want to understand where your quota is actually going (or why your analytics warehouse bill keeps creeping up), let's talk.

Topics

claude code, claude max, opusplan, context mode, quota optimization, token efficiency, ai cost, anthropic, local llm delegation, jsonl telemetry

Arturo is a senior analytics and AI consultant helping mid-market companies cut through data chaos to unlock clarity, speed, and measurable ROI.
