Skills Evaluation.

How Markdown Skills turned a lower model
into a production engineer — shipping products
that survive deployment, testing, and scale.

Type: Technical Article
Topic: AI Governance
Date: February 2026
Status: Draft
Abstract

Can AI Build Real Products?

This article evaluates a real experiment: the Design Genome Pipeline, a system that uses Markdown-based instruction files called “Skills” to govern AI-generated code — not just for visual polish, but for building production-ready products.

The experiment tested the same Invoices Dashboard in two ways:

Standard Prompting

“Build me an invoices page” — no rules, no constraints.

Governed Prompting

The same request, but with Skills loaded into context (the AI's working memory).

The test ran on a lower-intelligence, accessible model — deliberately. If Skills make a weaker model produce deployable code, they'll work even better on stronger ones. We tested the floor, not the ceiling.

Key Question: Do Skills transform AI from a “UI sketch tool” into a “production engineering assistant” — one that handles responsiveness, accessibility, backend integration, performance, testing, and deployment?
The Problem

AI Builds Screens, Not Products

Every current AI code tool can generate a good-looking UI. The problem isn't appearance — it's everything else:

Responsiveness
Does it work on phones and tablets, not just desktop?
Accessibility
Do colors pass contrast ratios for vision-impaired users?
Backend Readiness
Can the frontend connect to an API without hallucinated endpoints?
Performance
Does it load fast? Are images optimized? Is code split?
Testing
Can the code be tested automatically without manual clicking?
Deployment
Can it go from laptop to live server without breaking?

Standard AI prompting handles approximately one of these (appearance). Skills aim to handle all six.

Definition

What Are “Skills”?

Skills are structured Markdown files that are loaded into an AI model's context window (its working memory for the current session). They act as runtime instructions — not permanent training, not database lookups, but literal rules the AI reads before generating code.

Think of Skills as an employee handbook for AI. Instead of hoping the AI “figures it out,” you hand it a rulebook. If a rule doesn't cover something, the AI stops and asks — instead of improvising.
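For illustration, a minimal Skill might look like the following. This is a hypothetical example written for this article, not one of the Skills actually tested:

```markdown
# Skill: governing-buttons

## Scope
Applies to every interactive button the AI generates.

## Rules
- NEVER invent a new button variant. Use only: primary, secondary, ghost.
- Minimum touch target: 44 × 44px.
- If a requested state is not covered here, STOP and ask the user.
```

The key properties are visible even at this scale: explicit "never" rules, concrete numeric constraints, and a hard stop instead of improvisation.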

The Skills Tested (1st Generation)

| Skill | Role | Lines | Tokens |
|---|---|---|---|
| filtering-design-systems | The Gatekeeper — controls what components can exist | 186 | ~2,800 |
| governing-accessibility | The Contrast Police — WCAG compliance | 97 | ~1,500 |
| governing-layouts | The Structural Engineer — prevents layout drift | 147 | ~1,200 |
| governing-responsiveness | The Adaptive Engine — breakpoint transformations | 143 | ~1,400 |
| **Total Skill Overhead** | | 573 | ~6,900 |

Note on 1st Gen Skills: These original Skills were exploratory — intentionally exhaustive. For later projects, these were optimized to consume 40-60% fewer tokens by removing redundancy and merging related rules. On frontier models like Claude Opus, even leaner Skills can be used.

Application

Skills for Products, Not Just Pixels

Skills are not a design tool. They are a production engineering framework. The same pattern — structured rules injected into AI context — works across the entire product lifecycle:

Frontend (What Was Tested)

| Frontend Skill | Rule |
|---|---|
| Responsiveness Skills | Explicit rules for how each component adapts at each breakpoint. |
| Accessibility Skills | Contrast ratios, keyboard navigation, screen reader compatibility. |
| Layout Lock Skills | Prevents structural drift over long editing sessions. |
| Component Registry | The AI can only use components that actually exist in the codebase. |

Backend Integration

| Backend Skill | Rule |
|---|---|
| API Contract Skills | Define exact endpoint shapes, preventing hallucinated endpoints. |
| State Management Skills | Enforce where data lives (server vs. client) and how it's cached. |
| Authentication Skills | Lock down auth patterns (JWT, OAuth) to prevent insecure shortcuts. |
| Database Query Skills | Constrain query patterns to prevent N+1 problems. |
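An API Contract Skill pins down both the data shape and the set of endpoints the AI may call. The sketch below illustrates the idea in TypeScript; the endpoint names and fields are invented for this article, not taken from the actual project:

```typescript
// Hypothetical sketch of an API contract a Skill might enforce.
interface Invoice {
  id: string;
  amount: number; // minor units (cents)
  status: "draft" | "sent" | "paid" | "overdue";
}

// The contract allows exactly these endpoints; anything else is a hallucination.
const ALLOWED_ENDPOINTS = ["/api/invoices", "/api/invoices/:id"] as const;

function isAllowedEndpoint(path: string): boolean {
  // Normalize a concrete path like /api/invoices/42 to its template form.
  const template = path.replace(/\/\d+$/, "/:id");
  return (ALLOWED_ENDPOINTS as readonly string[]).includes(template);
}
```

In a governed session the AI checks every fetch it generates against this list, so a made-up `/api/customers` call is rejected instead of shipped.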

Bug Testing & Restoration

| Testing Skill | Rule |
|---|---|
| Test Coverage Skills | Every new component must include tests for all interaction states. |
| Regression Prevention | Before modifying any component, verify existing tests still pass. |
| Restoration Skills | When rolling back, remove only added code, never modify pre-existing logic. |
| Error Boundary Skills | Exactly how errors should be caught, logged, and reported. |

Website Speed & Performance

| Performance Skill | Rule |
|---|---|
| Bundle Size Governance | No component import may exceed 50KB gzipped. If too large, use dynamic import. |
| Image Optimization | All images must use optimized formats (WebP). No unoptimized raw images allowed. |
| Core Web Vitals | No layout shift above 0.1. Page must be interactive within 200ms. |
| Code Splitting | Route-level splitting mandatory — each page only loads its own code. |
| Font Loading | Use font-display: swap. Maximum 2 font families per page. |
| Caching | Static assets must use content-hash caching for permanent browser storage. |
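The bundle-size rule can be stated as a tiny decision function. This is an illustrative sketch of how such a Skill rule reduces to code, with the 50KB figure taken from the rule above:

```typescript
// Sketch of the 50KB-gzipped bundle budget rule.
const BUDGET_BYTES = 50 * 1024;

// Decide how a component should be imported given its gzipped size.
function importStrategy(gzippedBytes: number): "static" | "dynamic" {
  // Within budget: plain static import. Over budget: lazy-load via dynamic import.
  return gzippedBytes <= BUDGET_BYTES ? "static" : "dynamic";
}
```

The point is that the rule is mechanical: the AI never has to judge whether a component "feels" heavy, it just compares a number against a budget.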

Deployment Pipelines

| Deployment Skill | Rule |
|---|---|
| Pre-Deployment Skills | Run linting, type checking, and tests before any build. If any fails, halt. |
| Environment Skills | Never hardcode environment variables. All configs must reference .env files. |
| Migration Skills | Database schema changes must be reversible. Every up migration needs a down. |
| Monitoring Skills | Every deployed endpoint must include health checks. Error rates above 1% trigger automated rollback. |
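The pre-deployment rule ("run checks in order, halt on first failure") can be sketched as a simple gate. This is a hypothetical illustration; in a real pipeline each check would shell out to the actual lint, typecheck, and test commands:

```typescript
// Sketch of a pre-deployment gate: run checks in order, halt on the first failure.
type Check = { name: string; run: () => boolean };

function preDeployGate(checks: Check[]): { passed: boolean; failedAt?: string } {
  for (const check of checks) {
    if (!check.run()) {
      return { passed: false, failedAt: check.name }; // halt the build here
    }
  }
  return { passed: true };
}
```

Because the gate halts at the first failure, a broken typecheck never reaches the test or build stage, which is exactly the "if any fails, halt" rule.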

Why Predictability Beats Intelligence

A governed lower model produces reliably deployable code. An ungoverned higher model produces brilliant but unpredictable code that requires human review of every line. Skills make the difference.

Skills as Guardrails for Humans, Not Just AI

Here's what most people miss: Skills don't just constrain the AI — they constrain the human developer too. And that's a feature, not a limitation.

A junior developer who has never handled WCAG accessibility, database Row-Level Security, or Core Web Vitals optimization gets those standards enforced automatically through Skills — without needing to study them first. The governing-accessibility Skill doesn't just tell the AI to follow WCAG — it teaches the developer what matters and why.

- Cross-Model Transfer: Written by one AI (e.g., Claude), consumed by another (e.g., Gemini or GPT). The Skill becomes a bridge between models.
- Senior → Junior: A lead developer writes the deployment Skill once, encoding years of experience. Every junior benefits.
- Specialist → Generalist: A security expert writes authentication patterns; a performance engineer writes load optimization. A solo developer uses all simultaneously.
Skills are not just an AI governance tool. They are a knowledge distribution system that raises the floor for every developer who uses them — human or artificial.
Infrastructure

The Supporting Infrastructure

Skills reference a constellation of supporting documents that together form the pipeline:

Contracts
Markdown files defining the exact API surface of each UI component — properties, types, visual variants. Prevents hallucinated features.
Locked Components
17 production-tested components with layout primitives sourced from the Titan UI Kit — the AI must import and use them, not rewrite them. This shifts the AI from Creator Mode to Assembly Mode.
Policies
Machine-interpretable rules defining when each component can be used, where it can be placed, and how many are allowed per region.
Registry
Exhaustive list of valid components. If it's not in the registry, it does not exist. The AI cannot approximate or invent.
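The Registry rule is binary: a component either exists or it does not. A minimal sketch of that enforcement, with illustrative component names standing in for the real 17-component registry:

```typescript
// Sketch of the Registry rule: a component either exists or it doesn't.
// Component names here are illustrative, not the actual registry contents.
const REGISTRY = new Set(["PageShell", "DataTable", "FilterBar", "StatusBadge"]);

function resolveComponent(name: string): string {
  if (!REGISTRY.has(name)) {
    // No approximation, no invention: an unknown component is a hard error.
    throw new Error(`Unknown component "${name}": not in Registry`);
  }
  return name;
}
```

Turning "does not exist" into a thrown error is what makes hallucinated components impossible rather than merely discouraged.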

Titan UI Kit — The Layout Foundation

The AI doesn't design layouts from scratch; it assembles them from Titan's pre-defined structural components, which I created by converting Figma designs to code via the Figma MCP and Antigravity.

Page Shell
Sidebar + header + content area with governed breakpoint behavior
Data Table
Sortable columns, row selection, pagination — locked structure the AI cannot reinvent
Filter Bar
Search, dropdowns, date pickers — with defined responsive collapse rules
Card Layouts
Mobile-first stacked cards that replace table rows below 768px
Navigation
Sidebar with icon-only collapse at tablet, hamburger at mobile
Status System
Consistent badge colors and indicator patterns across all views

Pipeline Governance — 6-Step Execution Sequence

1. Screen Decomposition
2. Layout Lock Validation
3. Component Policy Compliance
4. Responsive Governance
5. Accessibility Check
6. Final Decision
Experiment

What AI Tried and What It Achieved

An Invoices Dashboard — a realistic SaaS screen with sidebar navigation, page-level actions, data filters, a data table, status indicators, and pagination. This type of screen combines layout, data display, user controls, and state management.

The Failed Experiment (Without Skills)

The failed experiment was not incompetent. The AI genuinely tried to build a responsive, production-ready page:

What It Got Right
Used isMobileMenuOpen state for mobile sidebar
Implemented backdrop overlay with blur
Added hamburger button visible only on mobile
Used Tailwind responsive prefixes throughout
Imported and used the locked component library
Where It Fell Apart
Tailwind-dependent: no system for what should change at each breakpoint
Table: horizontal scroll instead of card transformation
No sidebar collapse for tablets — all or nothing
Spacing values drifted from design tokens over iterations
Color leaking — values not in the token system appeared

The Passed Experiment (With Skills)

Rule-based responsiveness — Sidebar collapses at tablet, converts to hamburger at mobile
Table adaptation — Data table transforms into stacked card layout on mobile
Consistent tokens — All spacing, colors, typography stayed within the defined system
No hallucination — Zero invented components. Every element traced to the Registry
Stable across sessions — Re-running the same prompt produced structurally identical output
Production-deployable — Could be deployed live without manual responsive fixes

The model followed governing-responsiveness literally. It didn't guess. It followed instructions.

Proof of Work

Interact with both dashboards live — resize your browser to see governance in action.

Evaluation

Skill-by-Skill Evaluation (Unbiased)

filtering-design-systems
The Gatekeeper
Strengths
+ Phased progression (Phase 0→8) creates a clear mental model
+ Token Integrity Enforcement prevents inventing new design values
+ Hard Stop Rule prevents scope creep
Weaknesses
Tries to do too much — construction guide AND enforcement tool at once
Could be ~40% shorter without losing effectiveness
governing-accessibility
The Contrast Police
Strengths
+ Pure Color-Only Change Rule prevents structural "fixes"
+ Nested Component Depth catches edge cases most AI misses
Weaknesses
Some sections show signs of mid-session editing
Could benefit from concrete before/after examples
governing-layouts
The Structural Engineer
Strengths
+ 3-word motto is highly effective AI guidance
+ Clean separation of content vs. structural changes
Weaknesses
Depends on existence of a layout.md reference file
governing-responsiveness
The Adaptive Engine
Strengths
+ Created the largest measurable difference between passed and failed experiments
+ Priority cascade: PRESERVE > SHRINK > COLLAPSE > OVERFLOW
Weaknesses
Breakpoint values are hardcoded — should be tokens
SaaS-specific; needs adaptation for other product types
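The PRESERVE > SHRINK > COLLAPSE > OVERFLOW cascade can be sketched as a decision function. The structure mirrors the Skill's priority order, but the width thresholds and parameter names here are assumptions made for illustration:

```typescript
// Illustrative sketch of the responsive priority cascade:
// try to PRESERVE the layout, then SHRINK it, then COLLAPSE it,
// and only OVERFLOW (scroll) as a last resort.
type Strategy = "PRESERVE" | "SHRINK" | "COLLAPSE" | "OVERFLOW";

function responsiveStrategy(
  availablePx: number, // width actually available
  naturalPx: number,   // width the component wants
  minPx: number        // smallest width it remains usable at
): Strategy {
  if (availablePx >= naturalPx) return "PRESERVE"; // fits at natural size
  if (availablePx >= minPx) return "SHRINK";       // fits if compressed
  if (availablePx >= minPx / 2) return "COLLAPSE"; // transform (e.g. table → cards)
  return "OVERFLOW";                               // last resort: scroll
}
```

The failed experiment effectively jumped straight to OVERFLOW (horizontal scroll); the governed run walked the cascade and landed on COLLAPSE for the mobile table.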
Token Economy

Is It Worth It?

Tokens (roughly 4 characters each) are the currency of AI interaction. Skills consume tokens from the context window.
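Under that rough 4-characters-per-token heuristic, the overhead figures can be estimated directly. A minimal sketch, assuming the heuristic holds:

```typescript
// Rough token estimate using the ~4 characters/token heuristic.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Percentage of a context window consumed by a given overhead.
function contextShare(overheadTokens: number, windowTokens: number): number {
  return (overheadTokens / windowTokens) * 100;
}
```

For example, contextShare(15_400, 128_000) gives roughly 12%, which is how the pipeline-overhead percentage in the cost analysis is derived.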

Cost Analysis

| Resource | Tokens | % of 128K |
|---|---|---|
| 4 Governance Skills | ~6,900 | 5.4% |
| Contracts (9 files) | ~3,500 | 2.7% |
| Registry + Policies | ~4,200 | 3.3% |
| Pipeline Governance | ~800 | 0.6% |
| **Total Pipeline Overhead** | ~15,400 | 12.0% |

1st Gen vs. Optimized Skills

| Version | Tokens | Context | Notes |
|---|---|---|---|
| 1st Generation | ~15,400 | 12.0% | Verbose, exploratory |
| Optimized (2nd Gen) | ~8,000–9,000 | ~7% | Merged rules, shorthand |
| Frontier-Optimized (3rd Gen) | ~4,000–5,000 | ~4% | Minimal rules for strong models |

Return on Investment

| Metric | Without Skills | With Skills |
|---|---|---|
| Responsive behavior | Attempted but fragile | Rule-based and stable |
| Layout stability | Degrades visibly | Consistent |
| Component hallucination | Present in long sessions | Eliminated |
| Accessibility | Inconsistent | Rules-enforced |
| Correction rounds | 3–4 (50K+ tokens each) | 0–1 |
| Net token cost | Higher (corrections) | Lower (upfront) |
Guardrails

How Skills Act as Guardrails: The Session Effect

What Works Well
Anti-Hallucination
Explicit "Never" lists measurably reduce AI improvisation.
Temporal Stability
Rules are re-anchored on every turn, preventing drift over long sessions.
Behavioral Enforcement
Skills enforce logic, not just output — forcing state management the AI would skip.
Production Readiness
Output transforms from "prototype code" into "deployable code that needs review."
What Doesn't Work Well
Context Window Saturation
At 15K tokens (1st gen), the pipeline can crowd out nuanced user instructions in smaller models.
Rigidity
Skills are binary — no "soft guidance." Creative exploration becomes harder.
Maintenance Burden
Skills are static documents. A new component requires updates to Registry, Policies, Contracts, AND Skills.
Case Study

Living Proof: Digihive — Skills Govern a Deployed Product

The Design Genome Pipeline was a controlled test. Digihive answers the real question: do Skills survive weeks of development on a real, deployed product?

Supabase Backend
Row-Level Security, real-time sync, OAuth authentication
1,426 Lines
Canvas panning, zoom, drag-and-drop, session saves, undo history, debounced cloud sync
Fetch APIs
Song and book metadata extraction to populate physical components — vinyl sleeves, book covers, journals

Context Amnesia

Digihive was built across hundreds of prompting rounds over weeks. By prompt 50, a standard AI starts “forgetting” how the sidebar works. By prompt 100, it reinvents the deletion pipeline. By prompt 150, spacing values drift. Skills fix this by encoding critical decisions outside the conversation history.

An excerpt from one of Digihive's Skills shows how such decisions are encoded:

## 2. ZONE DEFINITIONS

### Canvas Zone
- Position: relative
- Dimensions: 100vw × 100vh viewport, 3000 × 2000px inner canvas
- Scrolling: Disabled (pan via mouse drag)
- Zoom: 0.5x – 2x range
- Background: #F3F1E7 (canvas color)

The conversation is ephemeral; the Skills are permanent. This is the real power of Skills: they are the product's institutional memory, not the AI's.

Production Skills

| Skill | Lines | Purpose |
|---|---|---|
| error-handling-patterns | 642 | Circuit breakers, retry with backoff, graceful degradation |
| performance-optimization | 218 | CDN strategy, bundle budgets, Core Web Vitals targets |
| react-best-practices | 70 | Waterfall elimination, bundle optimization, re-render prevention |

Notice the pattern: Skills get leaner with practice. The react-best-practices Skill is only 70 lines — proof that Skills can be compact and still effective.

Design Genome vs. Digihive

| Dimension | Design Genome | Digihive |
|---|---|---|
| Scope | 1 screen | Full application |
| Duration | Single session (~2h) | Weeks of development |
| Backend | None (static UI) | Supabase: Auth, DB, RLS, Sync |
| Files | ~5 | 70+ |
| Production Skills | 4 (UI governance) | 7 (UI + backend + perf) |
| Deployment | Local preview | Live at digihive.space |
The Design Genome experiment proved Skills work for quality. Digihive proves Skills work for longevity.

Token Savings at Scale

~9,300 tokens upfront vs. 200,000+ tokens in corrections over a long project. The larger the project, the higher the return on the Skill investment.

Beyond

Skill Areas Beyond These Projects

Skills are a pattern, not a product. The same structure works anywhere AI generates code:

| Area | Example Skill | Impact |
|---|---|---|
| SEO | Unique titles, meta descriptions, single h1, structured data | Prevents SEO-blind pages |
| Security | No plain-text passwords. All input sanitized. Parameterized SQL. | Prevents common vulnerabilities |
| Internationalization | No hardcoded strings. All text references translation files. | Ensures a translatable app |
| Analytics | Every user action triggers a tracking event with defined properties. | Consistent usage data |
| Documentation | Every exported function includes JSDoc with types and examples. | Docs as a byproduct of dev |
Conclusion

The Core Insight

The Design Genome Pipeline was the experiment. Digihive is the proof. Together, they prove that Skills are not a research curiosity but a production engineering framework that works at scale.

1. Skills work for production. Even on lower-intelligence models, structured governance produces deployable code.
2. The AI tried without Skills — and partially succeeded. The difference: consistency and reliability.
3. The cost is real but amortized. 12% overhead (1st gen) pays for itself by preventing corrections. Optimized: ~4%.
4. Skills shine brightest in long-lived projects — fighting context amnesia across hundreds of prompting rounds.
5. Skills scale across the entire product stack — backend, bug fixing, performance, and responsive behavior.
6. Skills get leaner as you learn. 186 lines → 70 lines without losing effectiveness.
7. Skills are the product's institutional memory. The conversation is ephemeral; the Skills are permanent.

What Remains Unproven

- Frontier model performance with 3rd-gen Skills: how much can Skills be reduced?
- Multi-model consistency: would the same Skills produce identical output across Claude, Gemini, and GPT?
- Long-term maintenance: how does the documentation-code sync burden scale?
- Creative loss quantification: how much creative potential is sacrificed for predictability?
- Team adoption: can developers who didn't write the Skills effectively maintain them?
Skills don't make AI smarter. They make AI more predictable. And predictability — not intelligence — is what production systems actually need.

Visit Projects

Products built and governed using this workflow:

This evaluation was conducted as an unbiased analysis. The author acknowledges that the experiment was designed by the same team that built the pipeline. The failed experiment's genuine responsiveness attempts are documented fairly. Digihive is referenced as a deployed product that used the same governance methodology. An independent replication with adversarial test selection would strengthen these findings.
