Stop Winging It: How I Use OpenSpec to Keep AI Changes Structured and Costs Sane

Nigel Holder

I’ve been burned enough times by AI-generated code that “solved” the wrong problem to know that the issue usually isn’t the model — it’s the absence of any agreement on what we’re building before the first file gets touched. The AI goes, I approve changes that look reasonable, and somewhere around the third follow-up session I realize we drifted from the original intent two sessions ago.

OpenSpec is what I use to prevent that. It’s spec-driven development for AI coding assistants — a lightweight layer that forces alignment on what before anyone writes how.

This post covers the core workflow, how to customize what the spec generates, and a lesson I learned the hard way about model selection that cost me more than it should have.

The Core Idea

OpenSpec is installed as a global npm package:

npm install -g @fission-ai/openspec@latest
cd your-project
openspec init

After init, your AI coding assistant auto-detects a new set of slash commands. The standard loop is three steps:

/opsx:propose add-dark-mode   →  creates a change folder with proposal, specs, design, tasks
/opsx:apply                   →  works through the task list, checking off as it goes
/opsx:archive                 →  moves the completed change to archive, syncs specs

Each change gets its own folder under openspec/changes/:

openspec/changes/add-dark-mode/
├── proposal.md    — why we're doing this
├── specs/         — requirements and scenarios
├── design.md      — technical approach
└── tasks.md       — implementation checklist

The idea is that you and the AI agree on the contents of those files before /opsx:apply runs. You can edit them. You can push back on the proposal. You can say “the design is wrong, here’s why” and regenerate. The code doesn’t start until the spec is right.

That’s the piece that was missing from my workflow before. Not the code generation — I had plenty of that. The checkpoint.

Customizing What Gets Generated

Out of the box, OpenSpec generates reasonable artifacts. But “reasonable for everyone” and “right for my project” are not the same thing. OpenSpec lets you customize at two levels: project config and custom schemas.

Project Config

openspec/config.yaml is the fastest way to inject project context and add per-artifact rules. Here’s a condensed version of mine:

# openspec/config.yaml
schema: spec-driven

context: |
  Tech stack: TypeScript, Node.js, Fastify, Knex, Postgres, Redis, MinIO, Jest.
  Work is organized in waves under planning/agent-tickets/ and changes should stay
  aligned with that dependency order.

  Code conventions:
  - Class-based design: services, repositories, and engines are classes with
    constructor-injected dependencies.
  - Use TypeScript enum (not string unions) for any closed set of values.
  - No any, no as-any casts.
  - PHI must never appear in operational logs; correlationId ULID on every log line.

  Model selection per task:
  - Use lighter models (haiku-4-5, gpt-4o-mini) for audits, doc-only changes,
    and narrow validation.
  - Use standard models (sonnet-4-6, gpt-4o) for feature implementation with
    clear spec.
  - Use stronger models (opus-4-7, o3) for cross-cutting architecture,
    concurrency logic, or correctness-critical code.

rules:
  changes:
    - Each ticket (SIS-NNN-NN) gets its own OpenSpec change directory. Do not
      group multiple tickets unless they are logically inseparable.
    - Every task entry in tasks.md must include a Fibonacci effort score and two
      model options — one Anthropic, one alternative provider.
      Format: "Effort: 5 | Models: claude-sonnet-4-6 / gpt-4o".
      Fibonacci scale 1–13 (1=trivial, 2=small, 3=medium-small, 5=medium,
      8=large, 13=very large).
  tasks:
    - Keep tasks aligned to wave dependencies so independent tickets can run
      in parallel.
    - Include a Fibonacci effort score and two model options (Anthropic +
      alternative) on every task.

There’s a lot happening in that file. The three blocks do different things:

  • context is injected into every artifact — so the AI knows the tech stack and code conventions when writing a proposal just as much as when writing tasks. The model selection guidelines live here so they’re always visible.
  • rules are artifact-specific. The tasks rules only fire when generating tasks.md. That’s where I enforce the format.
  • The double model options (one Anthropic, one alternative) are intentional. I don’t want to be locked into one provider, and for lighter tasks I’ll often reach for whichever has better availability or pricing that day.
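
The code conventions in that context block are easier to see than to describe. Here is a minimal sketch of the style they ask for, with hypothetical names rather than code from the actual project (it assumes the ulid package for correlation IDs):

// A sketch of the conventions: class-based design, constructor injection,
// enums for closed value sets, no PHI in logs, correlationId on every line.
import { ulid } from "ulid"; // assumed dependency for ULID correlation IDs

// Closed set of values → TypeScript enum, not a string union.
enum WebhookEventType {
  PatientCreated = "PATIENT_CREATED",
  PatientUpdated = "PATIENT_UPDATED",
}

interface Logger {
  info(fields: Record<string, unknown>, msg: string): void;
}

interface WebhookEventRepository {
  insertEvent(type: WebhookEventType, receivedAt: Date): Promise<void>;
}

// Services are classes with constructor-injected dependencies.
class WebhookEventService {
  constructor(
    private readonly repo: WebhookEventRepository,
    private readonly logger: Logger,
  ) {}

  async record(type: WebhookEventType): Promise<void> {
    const correlationId = ulid();
    await this.repo.insertEvent(type, new Date());
    // Operational metadata only — no PHI — with a correlationId on the line.
    this.logger.info({ correlationId, eventType: type }, "webhook event recorded");
  }
}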

Custom Schemas

If config rules aren’t enough, you can fork a built-in schema and own the templates entirely:

openspec schema fork spec-driven my-workflow

That drops the full schema into openspec/schemas/my-workflow/schema.yaml plus markdown templates for each artifact. Editing the tasks.md template directly gives you full control over the structure every generated task list follows. Between the config rules and the template, the AI has no excuse for producing a task list that’s missing effort scores or model suggestions.
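
Whichever level you customize at, the output I'm after is a tasks.md where every entry carries the metadata. Illustratively, this is the shape the config rules ask for, not output from any real run:

- [ ] 1.1 Create webhook route handler skeleton (Fastify plugin)
      Effort: 3 | Models: claude-haiku-4-5 / gpt-4o-mini
- [ ] 1.2 Add HMAC signature verification with timing-safe compare
      Effort: 5 | Models: claude-sonnet-4-6 / gpt-4o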

Why the Model Column Matters

This is the part I learned the hard way.

When /opsx:apply runs through a task list, it can fan out independent tasks to subagents. That’s genuinely useful — tasks that don’t depend on each other can run in parallel, and wall-clock time drops significantly. I covered the fan-out pattern in detail in my GitHub repo audit post.

What I didn’t think through early on: by default, every subagent inherits the same model as the parent session. If I’m running the parent in Opus, every subagent spins up in Opus. For a task list with eight independent tasks, that’s eight simultaneous Opus sessions running in parallel.

I had a session where I was implementing a medium-complexity feature — a few API routes, some UI updates, a migration. Twelve tasks, seven marked parallel. I ran /opsx:apply, went to make coffee, came back expecting a finished implementation.

Instead I came back to stalled progress and terminals going yellow. Usage warnings everywhere. I upped my limits to keep it going, and while I was doing that I started actually reading the subagent outputs — following the thinking on each one, noting the models. A form component had burned through seven iterations of reasoning before settling on an answer. “You didn’t need to think that hard, bro.” The same refrain came up again and again as I scrolled through. Every subagent. Every task. All Opus. I was shook.

The thing is, most of those tasks didn’t need Opus. Writing a migration file? Haiku handles that. Wiring up a form component? Sonnet is more than capable. Opus earns its place on tasks that need genuine reasoning — complex business logic, tricky architectural decisions, anything where getting it subtly wrong has downstream consequences. Boilerplate doesn’t need Opus-level reasoning. It just runs on Opus because nobody told it not to.

Enforcing Model Selection in Practice

The model selection guidelines in the context block are the spec-side piece — the AI sees them when it’s generating tasks.md and assigns models accordingly. But context alone doesn’t enforce anything at runtime. The enforcement piece lives in CLAUDE.md, which Claude Code reads at session start:

## Agent Model Selection

When spawning subagents to implement tasks in parallel, respect the model
assigned in tasks.md. Never default all subagents to the parent session's model.

Three tiers:
- **Lighter** (haiku-4-5, gpt-4o-mini): audits, doc-only changes, narrow
  validation, boilerplate generation.
- **Standard** (sonnet-4-6, gpt-4o): feature implementation with a clear spec.
- **Stronger** (opus-4-7, o3): cross-cutting architecture, concurrency logic,
  correctness-critical code where subtle errors have downstream consequences.

If a task is not explicitly assigned a stronger model, do not use one.

Between these two pieces — the config rules that shape what gets generated, and the CLAUDE.md instructions that govern runtime behavior — I have a predictable cost profile. Before I run /opsx:apply I can look at the task table, tally the model column, and have a reasonable sense of what the session will cost.
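
Because that format is machine-readable, the tally can even be scripted. Here is a rough sketch, not part of OpenSpec: it assumes the "Effort: N | Models: a / b" line format from the config rules, and the tier mapping is mine:

// tally-models.ts — sum Fibonacci effort per model tier from a tasks.md.
// A sketch, not part of OpenSpec; assumes the "Effort: N | Models: a / b" format.
import { readFileSync } from "node:fs";

const tierFor = (model: string): string => {
  if (model.includes("haiku") || model.includes("mini")) return "lighter";
  if (model.includes("opus") || model === "o3") return "stronger";
  return "standard";
};

const text = readFileSync(process.argv[2], "utf8"); // path to a change's tasks.md
const totals: Record<string, number> = {};

for (const match of text.matchAll(/Effort:\s*(\d+)\s*\|\s*Models:\s*(\S+)/g)) {
  const effort = Number(match[1]);
  const tier = tierFor(match[2]); // the first (Anthropic) model decides the tier
  totals[tier] = (totals[tier] ?? 0) + effort;
}

console.table(totals); // e.g. lighter: 10, standard: 23

Point it at a change's tasks.md and you get effort totals per tier before anything runs.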

What the Workflow Actually Looks Like

Here’s a realistic task table from a recent change — adding a webhook handler for a third-party integration. The format matches what my config rules enforce: Fibonacci effort score, two model options (Anthropic + alternative):

| #   | Task                                                                | Effort | Models                  |
|-----|---------------------------------------------------------------------|--------|-------------------------|
| 1.1 | Create webhook route handler skeleton (Fastify plugin)              | 3      | haiku-4-5 / gpt-4o-mini |
| 1.2 | Add HMAC signature verification with timing-safe compare            | 5      | sonnet-4-6 / gpt-4o     |
| 2.1 | Write event type discriminator + handler dispatch                   | 5      | sonnet-4-6 / gpt-4o     |
| 2.2 | Add Knex writes for each event type                                 | 3      | haiku-4-5 / gpt-4o-mini |
| 2.3 | Add structured error logging (no PHI, correlationId on every line)  | 2      | haiku-4-5 / gpt-4o-mini |
| 3.1 | Unit tests for signature verification                               | 5      | sonnet-4-6 / gpt-4o     |
| 3.2 | Integration tests for each event handler                            | 8      | sonnet-4-6 / gpt-4o     |
| 4.1 | Update API documentation                                            | 2      | haiku-4-5 / gpt-4o-mini |

Tasks 2.1 through 4.1 are in the same wave — no blocking dependencies between them. When /opsx:apply hits that group, it fans them out as parallel subagents. They're split between haiku and sonnet tiers; none of them need Opus. The implementation runs in parallel, finishes fast, and costs a fraction of what it would have if every subagent defaulted to the parent session's model.
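
To make one of the standard-tier tasks concrete, task 1.2's timing-safe signature check comes out roughly like this. A sketch only: the hex encoding, header handling, and secret source are assumptions, not the real integration:

// Verify an incoming webhook's HMAC signature using a timing-safe compare.
import { createHmac, timingSafeEqual } from "node:crypto";

export function verifyWebhookSignature(
  rawBody: string,
  signatureHeader: string, // assumed hex-encoded signature from the provider
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(signatureHeader, "hex");
  // timingSafeEqual throws when lengths differ, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}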

A Few Things Worth Knowing

The parallelism flag is advisory, not enforced. Wave dependency order shapes which tasks can run concurrently, but it’s the AI that decides to fan them out. The CLAUDE.md instructions are what make that behavior consistent. Both pieces need to be in place.

Fibonacci scores are calibration, not precision. The main value is catching scope creep early — if a “small” feature produces a task table with three 8s and a 13, that’s worth interrogating before implementation starts. The scores also let you estimate session cost before you commit: a column of haiku-tier 3s costs very differently from a column of opus-tier 8s.

Dual model options give you flexibility. I always ask for one Anthropic model and one alternative per task. If Anthropic is having a bad latency day, or a particular task genuinely runs better on GPT or o3, the guidance is already there. I don’t have to think about it mid-session.

Custom schemas are version-controlled. The openspec/schemas/ folder lives in your repo. If a teammate clones it, they get the same templates and the same output shape. Project config and schema together replace a whole category of “how does the AI know to format it this way?” conversations.

OpenSpec works with 25+ AI tools, not just Claude Code. The config and schema travel with you if you switch.

The One-Line Version

The spec layer costs you five minutes of alignment. The alternative is discovering in review that you built the wrong thing, or discovering your billing dashboard after an hour of parallel Opus sessions. The five minutes is the better deal.

If you want to try it: npm install -g @fission-ai/openspec@latest, then openspec init in your project. The Discord is active if you get stuck.
