Skills Evaluation.

How Markdown Skills turned a lower model
into a production engineer — shipping products
that survive deployment, testing, and scale.

Type: Technical Article
Topic: AI Governance
Date: February 2026
Status: Draft
Abstract

Can AI Build Real Products?

This article evaluates a real experiment: the Design Genome Pipeline, a system that uses Markdown-based instruction files called “Skills” to govern AI-generated code — not just for visual polish, but for building production-ready products.

The experiment tested the same Invoices Dashboard in two ways:

Standard Prompting

“Build me an invoices page” — no rules, no constraints.

Governed Prompting

The same request, but with Skills loaded into context (the AI's working memory).

The test ran on a lower-intelligence, accessible model — deliberately. If Skills make a weaker model produce deployable code, they'll work even better on stronger ones. We tested the floor, not the ceiling.

Key Question: Do Skills transform AI from a “UI sketch tool” into a “production engineering assistant” — one that handles responsiveness, accessibility, backend integration, performance, testing, and deployment?
The Problem

AI Builds Screens, Not Products

Every current AI code tool can generate a good-looking UI. The problem isn't appearance — it's everything else:

Responsiveness
Does it work on phones and tablets, not just desktop?
Accessibility
Do colors pass contrast ratios for vision-impaired users?
Backend Readiness
Can the frontend connect to an API without hallucinated endpoints?
Performance
Does it load fast? Are images optimized? Is code split?
Testing
Can the code be tested automatically without manual clicking?
Deployment
Can it go from laptop to live server without breaking?

Standard AI prompting handles approximately one of these (appearance). Skills aim to handle all six.

Definition

What Are “Skills”?

Skills are structured Markdown files that are loaded into an AI model's context window (its working memory for the current session). They act as runtime instructions — not permanent training, not database lookups, but literal rules the AI reads before generating code.

Think of Skills as an employee handbook for AI. Instead of hoping the AI “figures it out,” you hand it a rulebook. If a rule doesn't cover something, the AI stops and asks — instead of improvising.
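For illustration, a minimal Skill might look like the following. This is a hypothetical example written for this article, not one of the Skills actually tested:

```markdown
# Skill: governing-buttons

## Scope
Applies to every interactive button the AI generates.

## Rules
- NEVER invent a new button variant. Use only: primary, secondary, ghost.
- Minimum touch target: 44 × 44px.
- If a requested state is not covered here, STOP and ask the user.
```

The key properties are visible even at this scale: explicit "never" rules, concrete numeric constraints, and a hard stop instead of improvisation.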

The Skills Tested (1st Generation)

| Skill | Role | Lines | Tokens |
|---|---|---|---|
| filtering-design-systems | The Gatekeeper — controls what components can exist | 186 | ~2,800 |
| governing-accessibility | The Contrast Police — WCAG compliance | 97 | ~1,500 |
| governing-layouts | The Structural Engineer — prevents layout drift | 147 | ~1,200 |
| governing-responsiveness | The Adaptive Engine — breakpoint transformations | 143 | ~1,400 |
| **Total Skill Overhead** | | 573 | ~6,900 |

Note on 1st Gen Skills: These original Skills were exploratory — intentionally exhaustive. For later projects, these were optimized to consume 40-60% fewer tokens by removing redundancy and merging related rules. On frontier models like Claude Opus, even leaner Skills can be used.

Application

Skills for Products, Not Just Pixels

Skills are not a design tool. They are a production engineering framework. The same pattern — structured rules injected into AI context — works across the entire product lifecycle:

Frontend (What Was Tested)

| Frontend Skill | Rule |
|---|---|
| Responsiveness Skills | Explicit rules for how each component adapts at each breakpoint. |
| Accessibility Skills | Contrast ratios, keyboard navigation, screen reader compatibility. |
| Layout Lock Skills | Prevents structural drift over long editing sessions. |
| Component Registry | The AI can only use components that actually exist in the codebase. |

Backend Integration

| Backend Skill | Rule |
|---|---|
| API Contract Skills | Define exact endpoint shapes, preventing hallucinated endpoints. |
| State Management Skills | Enforce where data lives (server vs. client) and how it's cached. |
| Authentication Skills | Lock down auth patterns (JWT, OAuth) to prevent insecure shortcuts. |
| Database Query Skills | Constrain query patterns to prevent N+1 problems. |
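An API Contract Skill pins down both the data shape and the set of endpoints the AI may call. The sketch below illustrates the idea in TypeScript; the endpoint names and fields are invented for this article, not taken from the actual project:

```typescript
// Hypothetical sketch of an API contract a Skill might enforce.
interface Invoice {
  id: string;
  amount: number; // minor units (cents)
  status: "draft" | "sent" | "paid" | "overdue";
}

// The contract allows exactly these endpoints; anything else is a hallucination.
const ALLOWED_ENDPOINTS = ["/api/invoices", "/api/invoices/:id"] as const;

function isAllowedEndpoint(path: string): boolean {
  // Normalize a concrete path like /api/invoices/42 to its template form.
  const template = path.replace(/\/\d+$/, "/:id");
  return (ALLOWED_ENDPOINTS as readonly string[]).includes(template);
}
```

In a governed session the AI checks every fetch it generates against this list, so a made-up `/api/customers` call is rejected instead of shipped.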

Bug Testing & Restoration

| Testing Skill | Rule |
|---|---|
| Test Coverage Skills | Every new component must include tests for all interaction states. |
| Regression Prevention | Before modifying any component, verify existing tests still pass. |
| Restoration Skills | When rolling back, remove only added code, never modify pre-existing logic. |
| Error Boundary Skills | Exactly how errors should be caught, logged, and reported. |

Website Speed & Performance

| Performance Skill | Rule |
|---|---|
| Bundle Size Governance | No component import may exceed 50KB gzipped. If too large, use dynamic import. |
| Image Optimization | All images must use optimized formats (WebP). No unoptimized raw images allowed. |
| Core Web Vitals | No layout shift above 0.1. Page must be interactive within 200ms. |
| Code Splitting | Route-level splitting mandatory — each page only loads its own code. |
| Font Loading | Use font-display: swap. Maximum 2 font families per page. |
| Caching | Static assets must use content-hash caching for permanent browser storage. |
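The bundle-size rule can be stated as a tiny decision function. This is an illustrative sketch of how such a Skill rule reduces to code, with the 50KB figure taken from the rule above:

```typescript
// Sketch of the 50KB-gzipped bundle budget rule.
const BUDGET_BYTES = 50 * 1024;

// Decide how a component should be imported given its gzipped size.
function importStrategy(gzippedBytes: number): "static" | "dynamic" {
  // Within budget: plain static import. Over budget: lazy-load via dynamic import.
  return gzippedBytes <= BUDGET_BYTES ? "static" : "dynamic";
}
```

The point is that the rule is mechanical: the AI never has to judge whether a component "feels" heavy, it just compares a number against a budget.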

Deployment Pipelines

| Deployment Skill | Rule |
|---|---|
| Pre-Deployment Skills | Run linting, type checking, and tests before any build. If any fails, halt. |
| Environment Skills | Never hardcode environment variables. All configs must reference .env files. |
| Migration Skills | Database schema changes must be reversible. Every up migration needs a down. |
| Monitoring Skills | Every deployed endpoint must include health checks. Error rates above 1% trigger automated rollback. |
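The pre-deployment rule ("run checks in order, halt on first failure") can be sketched as a simple gate. This is a hypothetical illustration; in a real pipeline each check would shell out to the actual lint, typecheck, and test commands:

```typescript
// Sketch of a pre-deployment gate: run checks in order, halt on the first failure.
type Check = { name: string; run: () => boolean };

function preDeployGate(checks: Check[]): { passed: boolean; failedAt?: string } {
  for (const check of checks) {
    if (!check.run()) {
      return { passed: false, failedAt: check.name }; // halt the build here
    }
  }
  return { passed: true };
}
```

Because the gate halts at the first failure, a broken typecheck never reaches the test or build stage, which is exactly the "if any fails, halt" rule.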

Why Predictability Beats Intelligence

A governed lower model produces reliably deployable code. An ungoverned higher model produces brilliant but unpredictable code that requires human review of every line. Skills make the difference.

Skills as Guardrails for Humans, Not Just AI

Here's what most people miss: Skills don't just constrain the AI — they constrain the human developer too. And that's a feature, not a limitation.

A junior developer who has never handled WCAG accessibility, database Row-Level Security, or Core Web Vitals optimization gets those standards enforced automatically through Skills — without needing to study them first. The governing-accessibility Skill doesn't just tell the AI to follow WCAG — it teaches the developer what matters and why.

- Cross-Model Transfer: Written by one AI (e.g., Claude), consumed by another (e.g., Gemini or GPT). The Skill becomes a bridge between models.
- Senior → Junior: A lead developer writes the deployment Skill once, encoding years of experience. Every junior benefits.
- Specialist → Generalist: A security expert writes authentication patterns; a performance engineer writes load optimization. A solo developer uses all simultaneously.
Skills are not just an AI governance tool. They are a knowledge distribution system that raises the floor for every developer who uses them — human or artificial.
Infrastructure

The Supporting Infrastructure

Skills reference a constellation of supporting documents that together form the pipeline:

Contracts
Markdown files defining the exact API surface of each UI component — properties, types, visual variants. Prevents hallucinated features.
Locked Components
17 production-tested components with layout primitives sourced from the Titan UI Kit — the AI must import and use them, not rewrite them. This shifts the AI from Creator Mode to Assembly Mode.
Policies
Machine-interpretable rules defining when each component can be used, where it can be placed, and how many are allowed per region.
Registry
Exhaustive list of valid components. If it's not in the registry, it does not exist. The AI cannot approximate or invent.
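The Registry rule is binary: a component either exists or it does not. A minimal sketch of that enforcement, with illustrative component names standing in for the real 17-component registry:

```typescript
// Sketch of the Registry rule: a component either exists or it doesn't.
// Component names here are illustrative, not the actual registry contents.
const REGISTRY = new Set(["PageShell", "DataTable", "FilterBar", "StatusBadge"]);

function resolveComponent(name: string): string {
  if (!REGISTRY.has(name)) {
    // No approximation, no invention: an unknown component is a hard error.
    throw new Error(`Unknown component "${name}": not in Registry`);
  }
  return name;
}
```

Turning "does not exist" into a thrown error is what makes hallucinated components impossible rather than merely discouraged.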

Titan UI Kit — The Layout Foundation

The AI doesn't design layouts from scratch; it assembles them from Titan's pre-defined structural components, which I created by converting Figma designs to code via the Figma MCP and Antigravity.

Page Shell
Sidebar + header + content area with governed breakpoint behavior
Data Table
Sortable columns, row selection, pagination — locked structure the AI cannot reinvent
Filter Bar
Search, dropdowns, date pickers — with defined responsive collapse rules
Card Layouts
Mobile-first stacked cards that replace table rows below 768px
Navigation
Sidebar with icon-only collapse at tablet, hamburger at mobile
Status System
Consistent badge colors and indicator patterns across all views

Pipeline Governance — 6-Step Execution Sequence

1. Screen Decomposition
2. Layout Lock Validation
3. Component Policy Compliance
4. Responsive Governance
5. Accessibility Check
6. Final Decision
Experiment

What AI Tried and What It Achieved

An Invoices Dashboard — a realistic SaaS screen with sidebar navigation, page-level actions, data filters, a data table, status indicators, and pagination. This type of screen combines layout, data display, user controls, and state management.

The Failed Experiment (Without Skills)

The failed experiment was not incompetent. The AI genuinely tried to build a responsive, production-ready page:

What It Got Right
Used isMobileMenuOpen state for mobile sidebar
Implemented backdrop overlay with blur
Added hamburger button visible only on mobile
Used Tailwind responsive prefixes throughout
Imported and used the locked component library
Where It Fell Apart
Tailwind-dependent: no system for what should change at each breakpoint
Table: horizontal scroll instead of card transformation
No sidebar collapse for tablets — all or nothing
Spacing values drifted from design tokens over iterations
Color leaking — values not in the token system appeared

The Passed Experiment (With Skills)

Rule-based responsiveness — Sidebar collapses at tablet, converts to hamburger at mobile
Table adaptation — Data table transforms into stacked card layout on mobile
Consistent tokens — All spacing, colors, typography stayed within the defined system
No hallucination — Zero invented components. Every element traced to the Registry
Stable across sessions — Re-running the same prompt produced structurally identical output
Production-deployable — Could be deployed live without manual responsive fixes

The model followed governing-responsiveness literally. It didn't guess. It followed instructions.

Proof of Work

Interact with both dashboards live — resize your browser to see governance in action.

Evaluation

Skill-by-Skill Evaluation (Unbiased)

filtering-design-systems
The Gatekeeper
Strengths
+ Phased progression (Phase 0→8) creates a clear mental model
+ Token Integrity Enforcement prevents inventing new design values
+ Hard Stop Rule prevents scope creep
Weaknesses
Tries to do too much — construction guide AND enforcement tool at once
Could be ~40% shorter without losing effectiveness
governing-accessibility
The Contrast Police
Strengths
+ Pure Color-Only Change Rule prevents structural "fixes"
+ Nested Component Depth catches edge cases most AI misses
Weaknesses
Some sections show signs of mid-session editing
Could benefit from concrete before/after examples
governing-layouts
The Structural Engineer
Strengths
+ 3-word motto is highly effective AI guidance
+ Clean separation of content vs. structural changes
Weaknesses
Depends on existence of a layout.md reference file
governing-responsiveness
The Adaptive Engine
Strengths
+ Created the largest measurable difference between passed and failed experiments
+ Priority cascade: PRESERVE > SHRINK > COLLAPSE > OVERFLOW
Weaknesses
Breakpoint values are hardcoded — should be tokens
SaaS-specific; needs adaptation for other product types
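The PRESERVE > SHRINK > COLLAPSE > OVERFLOW cascade can be sketched as a decision function. The structure mirrors the Skill's priority order, but the width thresholds and parameter names here are assumptions made for illustration:

```typescript
// Illustrative sketch of the responsive priority cascade:
// try to PRESERVE the layout, then SHRINK it, then COLLAPSE it,
// and only OVERFLOW (scroll) as a last resort.
type Strategy = "PRESERVE" | "SHRINK" | "COLLAPSE" | "OVERFLOW";

function responsiveStrategy(
  availablePx: number, // width actually available
  naturalPx: number,   // width the component wants
  minPx: number        // smallest width it remains usable at
): Strategy {
  if (availablePx >= naturalPx) return "PRESERVE"; // fits at natural size
  if (availablePx >= minPx) return "SHRINK";       // fits if compressed
  if (availablePx >= minPx / 2) return "COLLAPSE"; // transform (e.g. table → cards)
  return "OVERFLOW";                               // last resort: scroll
}
```

The failed experiment effectively jumped straight to OVERFLOW (horizontal scroll); the governed run walked the cascade and landed on COLLAPSE for the mobile table.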
Token Economy

Is It Worth It?

Tokens (roughly 4 characters each) are the currency of AI interaction. Skills consume tokens from the context window.
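Under that rough 4-characters-per-token heuristic, the overhead figures can be estimated directly. A minimal sketch, assuming the heuristic holds:

```typescript
// Rough token estimate using the ~4 characters/token heuristic.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Percentage of a context window consumed by a given overhead.
function contextShare(overheadTokens: number, windowTokens: number): number {
  return (overheadTokens / windowTokens) * 100;
}
```

For example, contextShare(15_400, 128_000) gives roughly 12%, which is how the pipeline-overhead percentage in the cost analysis is derived.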

Cost Analysis

| Resource | Tokens | % of 128K |
|---|---|---|
| 4 Governance Skills | ~6,900 | 5.4% |
| Contracts (9 files) | ~3,500 | 2.7% |
| Registry + Policies | ~4,200 | 3.3% |
| Pipeline Governance | ~800 | 0.6% |
| **Total Pipeline Overhead** | ~15,400 | 12.0% |

1st Gen vs. Optimized Skills

| Version | Tokens | Context | Notes |
|---|---|---|---|
| 1st Generation | ~15,400 | 12.0% | Verbose, exploratory |
| Optimized (2nd Gen) | ~8,000–9,000 | ~7% | Merged rules, shorthand |
| Frontier-Optimized (3rd Gen) | ~4,000–5,000 | ~4% | Minimal rules for strong models |

Return on Investment

| Metric | Without Skills | With Skills |
|---|---|---|
| Responsive behavior | Attempted but fragile | Rule-based and stable |
| Layout stability | Degrades visibly | Consistent |
| Component hallucination | Present in long sessions | Eliminated |
| Accessibility | Inconsistent | Rules-enforced |
| Correction rounds | 3–4 (50K+ tokens each) | 0–1 |
| Net token cost | Higher (corrections) | Lower (upfront) |
Guardrails

How Skills Act as Guardrails: The Session Effect

What Works Well
Anti-Hallucination
Explicit "Never" lists measurably reduce AI improvisation.
Temporal Stability
Rules are re-anchored on every turn, preventing drift over long sessions.
Behavioral Enforcement
Skills enforce logic, not just output — forcing state management the AI would skip.
Production Readiness
Output transforms from "prototype code" into "deployable code that needs review."
What Doesn't Work Well
Context Window Saturation
At 15K tokens (1st gen), the pipeline can crowd out nuanced user instructions in smaller models.
Rigidity
Skills are binary — no "soft guidance." Creative exploration becomes harder.
Maintenance Burden
Skills are static documents. A new component requires updates to Registry, Policies, Contracts, AND Skills.
Case Study

Living Proof: Digihive — Skills Govern a Deployed Product

The Design Genome Pipeline was a controlled test. Digihive answers the real question: do Skills survive weeks of development on a real, deployed product?

Supabase Backend
Row-Level Security, real-time sync, OAuth authentication
1,426 Lines
Canvas panning, zoom, drag-and-drop, session saves, undo history, debounced cloud sync
Fetch APIs
Song and book metadata extraction to populate physical components — vinyl sleeves, book covers, journals

Context Amnesia

Digihive was built across hundreds of prompting rounds over weeks. By prompt 50, a standard AI starts “forgetting” how the sidebar works. By prompt 100, it reinvents the deletion pipeline. By prompt 150, spacing values drift. Skills fix this by encoding critical decisions outside the conversation history.

An excerpt from one of Digihive's Skills shows how such decisions are encoded:

## 2. ZONE DEFINITIONS

### Canvas Zone
- Position: relative
- Dimensions: 100vw × 100vh viewport, 3000 × 2000px inner canvas
- Scrolling: Disabled (pan via mouse drag)
- Zoom: 0.5x – 2x range
- Background: #F3F1E7 (canvas color)

The conversation is ephemeral; the Skills are permanent. This is the real power of Skills: they are the product's institutional memory, not the AI's.

Production Skills

| Skill | Lines | Purpose |
|---|---|---|
| error-handling-patterns | 642 | Circuit breakers, retry with backoff, graceful degradation |
| performance-optimization | 218 | CDN strategy, bundle budgets, Core Web Vitals targets |
| react-best-practices | 70 | Waterfall elimination, bundle optimization, re-render prevention |

Notice the pattern: Skills get leaner with practice. The react-best-practices Skill is only 70 lines — proof that Skills can be compact and still effective.

Design Genome vs. Digihive

| Dimension | Design Genome | Digihive |
|---|---|---|
| Scope | 1 screen | Full application |
| Duration | Single session (~2h) | Weeks of development |
| Backend | None (static UI) | Supabase: Auth, DB, RLS, Sync |
| Files | ~5 | 70+ |
| Production Skills | 4 (UI governance) | 7 (UI + backend + perf) |
| Deployment | Local preview | Live at digihive.space |
The Design Genome experiment proved Skills work for quality. Digihive proves Skills work for longevity.

Token Savings at Scale

~9,300 tokens upfront vs. 200,000+ tokens in corrections over a long project. The larger the project, the higher the return on the Skill investment.

Beyond

Skill Areas Beyond These Projects

Skills are a pattern, not a product. The same structure works anywhere AI generates code:

| Area | Example Skill | Impact |
|---|---|---|
| SEO | Unique titles, meta descriptions, single h1, structured data | Prevents SEO-blind pages |
| Security | No plain-text passwords. All input sanitized. Parameterized SQL. | Prevents common vulnerabilities |
| Internationalization | No hardcoded strings. All text references translation files. | Ensures a translatable app |
| Analytics | Every user action triggers a tracking event with defined properties. | Consistent usage data |
| Documentation | Every exported function includes JSDoc with types and examples. | Docs as a byproduct of dev |
Conclusion

The Core Insight

The Design Genome Pipeline was the experiment. Digihive is the proof. Together, they prove that Skills are not a research curiosity but a production engineering framework that works at scale.

1. Skills work for production. Even on lower-intelligence models, structured governance produces deployable code.
2. The AI tried without Skills — and partially succeeded. The difference: consistency and reliability.
3. The cost is real but amortized. 12% overhead (1st gen) pays for itself by preventing corrections. Optimized: ~4%.
4. Skills shine brightest in long-lived projects — fighting context amnesia across hundreds of prompting rounds.
5. Skills scale across the entire product stack — backend, bug fixing, performance, and responsive behavior.
6. Skills get leaner as you learn. 186 lines → 70 lines without losing effectiveness.
7. Skills are the product's institutional memory. The conversation is ephemeral; the Skills are permanent.

What Remains Unproven

- Frontier model performance with 3rd-gen Skills: how much can Skills be reduced?
- Multi-model consistency: would the same Skills produce identical output across Claude, Gemini, and GPT?
- Long-term maintenance: how does the documentation-code sync burden scale?
- Creative loss quantification: how much creative potential is sacrificed for predictability?
- Team adoption: can developers who didn't write the Skills effectively maintain them?
Skills don't make AI smarter. They make AI more predictable. And predictability — not intelligence — is what production systems actually need.

Visit Projects

Products built and governed using this workflow:

This evaluation was conducted as an unbiased analysis. The author acknowledges that the experiment was designed by the same team that built the pipeline. The failed experiment's genuine responsiveness attempts are documented fairly. Digihive is referenced as a deployed product that used the same governance methodology. An independent replication with adversarial test selection would strengthen these findings.
