Spec-Driven Development with AI: A Spec Kit + Claude Code Case Study

Here is the number that gets attention: 195 tasks. A functional backend and first-pass mobile frontend generated in roughly two hours of recorded code-generation time.

Here is the number that keeps you honest: testing, integration debugging, database seeding, senior review, and stabilization took four days.

That gap is what this post is about.

In our previous post on agentic programming, the lesson was that agents compress implementation, not judgment. This experiment picks up there: what happens when the specification becomes the operating layer for the build?

In early 2026, we ran a structured experiment using Spec-Driven Development with Spec Kit inside Claude Code to build a full-stack mobile app from existing documentation and OrangeLoops base templates, without visual designs in the initial phases. This is what we found, including the parts that did not go as planned.

In Short

Spec-driven development with AI compressed the code-generation phase dramatically in this experiment: 195 structured tasks, a functional backend, and a mobile frontend in roughly two hours of recorded generation time. But stabilization, integration debugging, design fidelity, database seeding, and human review still took four days. The workflow works best when scope, API contracts, design requirements, and acceptance criteria are explicit before generation starts.

What Is Spec-Driven Development – and What Is Spec Kit?

Spec-Driven Development (SDD) is a methodology where you begin with a formal, structured specification of the system before writing code. Instead of starting from loose task descriptions, you produce a scope document that defines features, data models, API contracts, and implementation requirements at a granular level.

The AI does not simply respond to ad-hoc prompts. The implementation is driven by the specification, then refined through planning, task breakdown, implementation, and validation.

Spec Kit is GitHub’s open-source toolkit for Spec-Driven Development. In this experiment, we used its Claude Code integration and workflow commands to move from specification to plan, tasks, and implementation. The key distinction from simply asking Claude to write code: the structure lives in the spec, not in the developer’s prompts. The process is more reproducible, more auditable, and closer to how engineers plan before they code.

The Experiment: Building a Mobile App with Spec Kit and Claude Code

We wanted to answer a direct question: can SDD with Spec Kit realistically build a complete functional app – backend and frontend – from documentation and base templates, without visual designs at the start?

The project was Halo (a pseudonym), a mobile application with separate backend and frontend repositories.

Setup:

  • Tools: Claude Code + Spec Kit
  • Model: Claude Sonnet 4.6
  • Base templates: OrangeLoops NestJS + Expo/React Native
  • Starting point: existing project documentation; no Figma designs in the initial phases

We ran it in four sequential phases.

How It Worked – The Four Phases

Phase 1 – Backend (NestJS)

We used Spec Kit’s specification workflow to pass the complete backend scope. Claude generated 94 tasks and implemented the backend in approximately 1 hour 6 minutes, using roughly 95% of the Claude Code session context during that phase.

The result: a functional backend for the defined scope, with implemented endpoints documented and accessible via Swagger.

Phase 2 – Frontend Mobile (Expo)

We created a new Spec Kit configuration for the Expo/React Native frontend. One deliberate decision: we used the backend’s OpenAPI/Swagger contract as the frontend service-layer reference.

This let Claude generate frontend service code consistent with the API contracts without directly accessing the backend source code. That boundary worked well, but it did not eliminate all integration risk.
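A minimal sketch of that boundary, assuming a hypothetical `/sessions` endpoint and `SessionDto` schema (neither is from the Halo project). The `HttpGet` transport type is also an illustrative assumption. The point is that the service function depends only on the contract shape, never on backend source:

```typescript
// Hypothetical DTO as it might appear in an OpenAPI schema; not from the real project.
interface SessionDto {
  id: string;
  title: string;
  startsAt: string; // ISO 8601 date-time string
}

// Minimal transport abstraction so the sketch does not depend on a specific HTTP client.
type HttpGet = (path: string) => Promise<unknown>;

// Runtime check that a value matches the contract shape.
function isSessionDto(value: unknown): value is SessionDto {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["id"] === "string" &&
    typeof v["title"] === "string" &&
    typeof v["startsAt"] === "string"
  );
}

// Service-layer function: it knows the contract, not the backend implementation.
async function listSessions(get: HttpGet): Promise<SessionDto[]> {
  const body = await get("/sessions");
  if (!Array.isArray(body) || !body.every(isSessionDto)) {
    throw new Error("Response does not match the SessionDto contract");
  }
  return body;
}
```

A runtime check like this catches contract drift in the response shape. It would not catch the authentication problem described later, which sits in the transport rather than in the types.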

Claude generated 101 tasks and completed the recorded frontend implementation in approximately 1 hour. One caveat worth naming: during phase regeneration, agent-managed timing files were overwritten, and we lost timing data for 4 of the 12 implementation phases. The roughly one-hour figure covers only the 8 phases with recorded data, so the real frontend generation time was longer. This was our first encounter with a metric we could not fully trust.

Phase 3 – Applying Figma Designs

With a functional app built, we applied Figma designs after the fact, one feature at a time to avoid overloading Claude’s context window. Total time: approximately 4 hours.

What we learned: Claude replicated layout and positioning accurately, but not visual styles. Font weights, color application, and spacing required explicit prompting on every pass.

Design fidelity is not a polish step in an agentic workflow. It has to be part of the specification.

We also realized mid-process that parallel subagents might have reduced elapsed time, provided the screens were independent and worktree or session isolation was configured. We did not measure that in this run, so we are treating it as a likely improvement, not a proven result.

Phase 4 – Testing, Bug Fixes, and Database Seeding

End-to-end functional testing across the full stack, integration bug fixes, realistic test data for the database, senior review, and stabilization required the most sustained manual intervention.

This phase took four days. That is considerably longer than any code-generation phase, and it was the primary driver of elapsed time in the project. Human review and hands-on testing were included in this stabilization window; we did not track them as separate time categories.

The Real Numbers

| Item | Value | Notes |
|---|---|---|
| Total tasks generated | 195 | 94 backend + 101 frontend |
| Recorded code-generation time | Roughly 2 hours | Frontend timing is partial because 4 of 12 phase logs were lost |
| Design application time | Roughly 4 hours | Applied after functional implementation |
| Testing, bugs, seeding, senior review, and stabilization | 4 days | Human review and QA/debugging effort were included here, not tracked separately |
| Claude Code session context usage | Roughly 95% backend + 92% frontend | Based on the session context used in each phase |
| Conventional build planning baseline | About 6 weeks with 2 developers | Internal estimate, not a controlled benchmark |
| AI-assisted workflow elapsed timeline | Under 2 weeks | Includes the 4-day stabilization phase |

Note: The frontend implementation time is a partial figure. Tracking data for 4 of 12 phases was lost when agent-managed files were overwritten mid-session.
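To put the figures above side by side, here is a rough arithmetic sketch. The task counts and durations come from this post. The 8-hour working day used to convert days into hours is our assumption for illustration, not something we tracked:

```typescript
// Figures reported in this post.
const backendTasks = 94;
const frontendTasks = 101;
const backendMinutes = 66;  // approximately 1 hour 6 minutes
const frontendMinutes = 60; // approximately 1 hour, covering 8 of 12 phases only
const designHours = 4;
const stabilizationDays = 4;

// Assumption (not from the experiment data): one working day is 8 hours.
const HOURS_PER_DAY = 8;

const totalTasks = backendTasks + frontendTasks;
const generationHours = (backendMinutes + frontendMinutes) / 60;
const stabilizationHours = stabilizationDays * HOURS_PER_DAY;
const totalHours = generationHours + designHours + stabilizationHours;

// Share of the tracked effort that was spent generating code.
const generationShare = generationHours / totalHours;

console.log(`Total tasks: ${totalTasks}`);
console.log(`Recorded generation time: ${generationHours.toFixed(1)} h`);
console.log(`Generation share of tracked effort: ${(generationShare * 100).toFixed(1)}%`);
```

Under that assumption, recorded code generation is a small single-digit percentage of the tracked effort. Because the frontend timing is partial, the real share is likely somewhat higher, but not enough to change the picture.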

What Went Wrong

Authentication errors post-generation

After both layers were built, authentication consistently failed across endpoints. In this run, the generated client did not carry over the token-header behavior from OrangeLoops’ base template: it did not attach the auth token to request headers.

The result was a steady stream of UNAUTHORIZED responses. Fixing it required multiple iterative manual passes, not a single targeted fix.

This is the kind of issue that separates generated code from delivered software. The generated implementation was useful, but it still required human-led integration testing before it could be treated as release-ready.
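For readers unfamiliar with this class of bug, the sketch below shows the kind of behavior that went missing: a single wrapper that attaches the token to every outgoing request. It is not the OrangeLoops template code. The names `HeaderMap`, `FetchLike`, `TokenProvider`, and `createAuthedClient` are hypothetical, and the `Authorization: Bearer` scheme is an assumption for illustration:

```typescript
// Hypothetical types: a minimal fetch-like transport and a token source.
type HeaderMap = Record<string, string>;
type FetchLike = (
  url: string,
  init: { method: string; headers: HeaderMap }
) => Promise<{ status: number }>;
type TokenProvider = () => string | null;

// Returns a copy of the headers with the bearer token attached when one is available.
function withAuthHeader(headers: HeaderMap, token: string | null): HeaderMap {
  if (!token) return { ...headers };
  return { ...headers, Authorization: `Bearer ${token}` };
}

// Wraps a transport so every request goes through the same header logic.
function createAuthedClient(transport: FetchLike, getToken: TokenProvider) {
  return (url: string, method = "GET", headers: HeaderMap = {}) =>
    transport(url, { method, headers: withAuthHeader(headers, getToken()) });
}
```

Centralizing the header logic in one wrapper also makes it a single, testable line item that can be named in the specification, rather than behavior the agent has to infer from a template.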

Debugging across two separate repositories

When a bug crosses the frontend/backend boundary, which integration bugs almost always do, separate repositories create real friction.

Diagnosing what is missing on the frontend means either manually extracting context from the backend session or explicitly instructing Claude to open the other folder. When a fix requires changes on both sides, you are coordinating two separate Claude Code sessions and bridging context between them by hand.

That slows down exactly the phase where speed matters most.

What We Would Do Differently

Six concrete changes:

  1. Include Figma designs from the start. Applying them after a functional build created avoidable rework. Designs should be incorporated into the specification phase, even if they are incomplete.
  2. Prompt explicitly for visual styles. Claude replicated structure reliably; it did not automatically replicate visual style. Font weights, spacing, color usage, and visual states need to be specified from the beginning.
  3. Use parallel subagents for design application where the work is independent. Multiple screens may be handled in parallel if each subagent has enough context and isolation to avoid file conflicts. We did not measure this in the run, so we would treat it as an experiment to validate.
  4. For tightly coupled frontend/backend builds, either use a monorepo or create an explicit context bridge between repositories before implementation starts. The separate-repo setup made sense structurally, but it created real cost during stabilization.
  5. Set up persistent metric logging outside the agent-edited workspace. In this run, agent-managed timing files were overwritten. Future experiments should log metrics somewhere the implementation agent will not rewrite.
  6. Track human review, debugging, and QA separately. In this experiment, that work was included inside the four-day stabilization phase. For future planning, those categories should be visible on their own.
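As one possible shape for the fifth change, the sketch below logs phase timings as append-only JSON lines and refuses any log path inside the agent's workspace. Everything here is hypothetical: the `PhaseTiming` fields, the function names, and the injected `append` callback are illustrative assumptions, and the path check assumes absolute POSIX-style paths with no `..` segments:

```typescript
// Hypothetical record for one implementation phase.
interface PhaseTiming {
  phase: string;      // phase label; the naming scheme is up to the team
  startedAt: string;  // ISO 8601 timestamp
  finishedAt: string; // ISO 8601 timestamp
}

// Duration in whole seconds between the two timestamps.
function durationSeconds(t: PhaseTiming): number {
  return Math.round((Date.parse(t.finishedAt) - Date.parse(t.startedAt)) / 1000);
}

// True when logPath sits inside workspaceRoot.
// Assumes absolute POSIX-style paths with no ".." segments.
function isInsideWorkspace(logPath: string, workspaceRoot: string): boolean {
  const root = workspaceRoot.endsWith("/") ? workspaceRoot : workspaceRoot + "/";
  return logPath === workspaceRoot || logPath.startsWith(root);
}

// Injected writer, so the sketch stays independent of any file-system API.
type AppendLine = (path: string, line: string) => void;

// Builds a logger that only ever appends, and only outside the workspace.
function createTimingLogger(logPath: string, workspaceRoot: string, append: AppendLine) {
  if (isInsideWorkspace(logPath, workspaceRoot)) {
    throw new Error(`Refusing to log inside the agent workspace: ${logPath}`);
  }
  return (t: PhaseTiming): void => {
    append(logPath, JSON.stringify({ ...t, seconds: durationSeconds(t) }) + "\n");
  };
}
```

In a Node.js script, `append` could be wired to `fs.appendFileSync`. Appending rather than rewriting means a regenerated phase adds a new line instead of replacing the old measurement.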

What This Means for Delivery Planning

The useful question is not whether AI can generate code quickly. It can.

The real question is what needs to be specified, reviewed, integrated, tested, and stabilized so the output can become usable software.

At OrangeLoops, we see spec-driven development as part of AI-native product engineering: built with agents, governed by engineers. Agents can compress implementation time, but architecture, integration, testing, and release readiness still require engineering judgment.

The risk is not that AI fails to generate code. The risk is mistaking generated code for delivered software.

The Verdict

SDD with Spec Kit and Claude Code can compress the code-generation phase of a project dramatically. In this experiment, 195 structured tasks, a functional backend, and a mobile frontend were generated in roughly two hours of recorded AI-assisted implementation time.

But the ROI calculation has to account for what comes after code generation. Integration debugging, design application, database seeding, human review, and stabilization do not compress at the same rate.

Phase 4 alone took four days, longer than the entire recorded code-generation effort combined. That is not a failure of the methodology. It is a characteristic of the methodology, and it needs to be in the project plan from day one.

This approach works best when you have complete scope documentation, well-defined API contracts, clear design requirements, and a realistic stabilization budget. It gets expensive when design fidelity matters from day one, or when tight frontend/backend coupling makes cross-repo debugging unavoidable.

Based on the team’s internal estimate, the same documented scope would likely have taken about six weeks with two developers using a conventional workflow. That is a planning baseline, not a controlled benchmark. The AI-assisted workflow completed in under two weeks of elapsed project time, including the four-day stabilization phase.

The compression is real. Most of it concentrates in code generation, not in debugging.

The workflow is productive. The planning assumptions need to match reality.

FAQ

What is spec-driven development?

Spec-driven development is a software development workflow where the specification becomes the operating layer for the build. Instead of starting from loose prompts or isolated tickets, the team defines features, data models, API contracts, requirements, and acceptance criteria before implementation starts.

What is Spec Kit?

Spec Kit is GitHub’s open-source toolkit for Spec-Driven Development. It helps teams turn specifications into plans, task breakdowns, and implementation workflows that AI coding agents can follow more systematically.

Is Spec Kit only for Claude Code?

No. In this experiment, we used Spec Kit through Claude Code, but Spec Kit is designed as an agent-agnostic toolkit for spec-driven workflows.

How long did this experiment take?

The recorded code-generation phases took roughly two hours, but that was not the full delivery effort. Applying designs took about four hours, and testing, integration debugging, database seeding, senior review, and stabilization took four days. The overall AI-assisted workflow completed in under two weeks of elapsed project time.

When is spec-driven development with AI a good fit?

It works best when scope is clear, API contracts are well defined, design expectations are documented, and the team has enough engineering oversight to review, test, and stabilize the generated output. It is especially useful when the problem can be decomposed into structured tasks.

When should teams be careful with this approach?

Teams should be careful when requirements are still ambiguous, design fidelity matters but is not specified, frontend/backend integration is tightly coupled, or the project involves security, compliance, or production-readiness requirements that need explicit human review.

What is the difference between spec-driven development and ad-hoc AI prompting?

Ad-hoc prompting asks an AI tool to generate code from local instructions. Spec-driven development gives the agent a structured source of truth: scope, contracts, tasks, and implementation expectations. The result is easier to audit, repeat, and correct.

Keep in Touch

We are continuing to run these experiments and document what we learn.

If you are planning an AI-assisted build and want to pressure-test scope, specifications, and stabilization effort before committing budget, get in touch with OrangeLoops.
