Edge Case Evaluator.

How I built a system that forces AI
to map every possible failure before
writing a single line of code.

Type: Technical Article
Topic: AI Risk Analysis
Date: February 2026
Commands: 17
Abstract

AI Solves Before It Thinks

AI coding assistants are remarkably good at generating solutions. Give them a problem, and they produce working code in seconds. But there is a fundamental flaw in this workflow:

AI jumps to solutions before understanding the full landscape of what could go wrong.

This is not a capability problem. It is a sequencing problem. The model has the knowledge to identify edge cases, trace consequences, and weigh trade-offs. It just does not do it unless you force it to.

I built ECE (Edge Case Evaluator) to solve this. It is a structured command system that makes AI map the “butterfly effect” of every decision before proposing any solution. Think of it as a pre-flight checklist for problem-solving.

The Problem

What AI Does Not Do

When you ask an AI assistant to solve a design or engineering problem, it typically does three things:

1. Picks the first reasonable approach
2. Generates a solution for it
3. Moves on

What it does not do:

Compare Approaches
Consider 3 to 5 fundamentally different solutions to the same problem.
Trace Failures
Identify what could go wrong with each approach and follow those failures to worst-case outcomes.
Audit Itself
Check whether its own analysis has been influenced by anchoring, confirmation, or user-preference bias.

The result: Solutions that work in the happy path but break in production. Edge cases discovered during QA instead of during design. Approaches chosen because they “felt right,” not because they were systematically evaluated.

Solution

How ECE Works

ECE is a set of 17 structured commands that install into any AI coding environment. Each command forces the AI into a specific mode of thinking, with strict rules about what it can and cannot do.

The core workflow follows a deliberate sequence:

1. Map
2. Expand
3. Validate
4. Audit
5. Optimize
6. Synthesize
7. Review
8. Close

Each step has guardrails. The AI cannot skip ahead. It cannot propose solutions during the mapping phase. It cannot optimize without first checking its own bias.
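
To make the sequencing concrete, here is a minimal TypeScript sketch of how such a phase gate could work. The phase names come from the workflow above; the `PhaseGate` class and its methods are illustrative assumptions, not ECE's actual implementation.

```typescript
// Hypothetical sketch of ECE's phase ordering (not the real implementation).
const PHASES = [
  "map", "expand", "validate", "audit", "optimize", "synthesize", "review", "close",
] as const;

type Phase = (typeof PHASES)[number];

class PhaseGate {
  private completed = new Set<Phase>();

  // A phase may only run once every earlier phase has completed.
  canRun(phase: Phase): boolean {
    const index = PHASES.indexOf(phase);
    return PHASES.slice(0, index).every((p) => this.completed.has(p));
  }

  complete(phase: Phase): void {
    if (!this.canRun(phase)) {
      throw new Error(`Cannot run "${phase}" before earlier phases are complete.`);
    }
    this.completed.add(phase);
  }
}

const gate = new PhaseGate();
gate.complete("map");         // allowed: mapping comes first
// gate.complete("optimize"); // would throw: the audit step has not run yet
```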

Architecture

Spine + Leaf: Token-Efficient Evaluation

ECE uses a Spine + Leaf architecture designed for token efficiency:

The Spine
A compact map file (spine.md) containing:
+ Mermaid diagram of the full decision tree
+ Risk table listing every identified edge case
+ Risk ratings, probability estimates, confidence levels
+ The recommended route (once optimization is done)
The Leaves
Individual edge case files (nodes/<id>.md) containing:
+ Full butterfly effect chain (cause to terminal outcome)
+ Mitigation strategies
+ Cross-branch dependency links
+ Validation results

The key constraint: only one leaf file is loaded at a time. This prevents token bloat and forces focused analysis. The spine is always loaded for context, but detailed work happens one node at a time.
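
To make the split concrete, here is a minimal TypeScript model of the data the spine and leaves carry. The field names are assumptions based on the descriptions above, not ECE's actual file schema.

```typescript
// Hypothetical shape of the spine and leaf data described above (not ECE's real schema).
type RiskLevel = "critical" | "medium" | "low";
type Probability = "almost-certain" | "likely" | "possible" | "unlikely" | "rare";
type Confidence = "high" | "medium" | "low";

// spine.md: always loaded, kept compact.
interface Spine {
  decisionTreeMermaid: string;   // Mermaid source of the full decision tree
  riskTable: RiskTableRow[];     // one row per identified edge case
  recommendedRoute?: string;     // filled in once /ece optimize has run
}

interface RiskTableRow {
  nodeId: string;                // e.g. "B2", pointing to nodes/B2.md
  risk: RiskLevel;
  probability: Probability;
  confidence: Confidence;
}

// nodes/<id>.md: only one of these is loaded at a time during detailed work.
interface LeafNode {
  id: string;
  butterflyChain: string[];      // cause -> ... -> terminal outcome
  mitigations: string[];
  dependsOn: string[];           // cross-branch dependency links
  validation?: string;           // result of /ece validate, if run
}
```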

Commands

The Full Command Set

Core Loop (8 commands)

| Command | Purpose |
| --- | --- |
| /ece map | Creates the initial butterfly-effect tree with 3 to 5 approaches and their immediate edge cases |
| /ece recommend | Analyzes all open nodes and suggests exactly which one to investigate next, with reasoning |
| /ece expand | Deep-dives into a specific node, tracing its downstream effects to a terminal outcome |
| /ece validate | Stress-tests a proposed mitigation to check if the fix itself introduces new risks |
| /ece trace | Follows a complete chain from any node to all its terminal outcomes |
| /ece blindspot | Adversarial self-audit for anchoring bias, confirmation bias, and user influence contamination |
| /ece optimize | Scores all approaches using weighted risk-probability and recommends the safest route |
| /ece synthesize | Generates a concrete solution blueprint with requirements and acceptance criteria |

Solution Verification (2 commands)

| Command | Purpose |
| --- | --- |
| /ece review | Evaluates a UI screenshot against the solution blueprint, checking acceptance criteria and accessibility |
| /ece close | Finalizes the evaluation with a health score, archives results, and generates a history summary |

Prototyping (1 command)

| Command | Purpose |
| --- | --- |
| /ece wireframe | Generates an interactive React/Next.js wireframe from the solution blueprint. Requires prior /ece synthesize. Produces a two-panel layout: a control panel with toggle conditions, priority badges, and acceptance criteria on the left, and the primary interaction zone with state visualization on the right. All conditions derive from solution.md; nothing is invented. |

Session Management (2 commands)

| Command | Purpose |
| --- | --- |
| /ece resume | Detects where a previous evaluation was paused and tells you exactly what to do next |
| /ece status | Full node listing with risk, probability, confidence, and a smart next-action suggestion |

Utilities (4 commands)

| Command | Purpose |
| --- | --- |
| /ece test-gen | Auto-generates Given/When/Then test cases from butterfly effect chains |
| /ece impact | Builds a cross-reference matrix for compound failures between edge cases |
| /ece patterns | Identifies recurring risk categories across completed evaluations |
| /ece reset | Clears all evaluation data with confirmation |

Scoring

Risk Quantification

Every node in the tree gets rated on two dimensions. The combination produces a weighted score that makes risk comparison objective rather than intuitive.

| Risk Level | Points |
| --- | --- |
| Critical (Red) | 10 |
| Medium (Yellow) | 3 |
| Low (Green) | 1 |

| Probability | Multiplier |
| --- | --- |
| Almost Certain | 1.0 |
| Likely | 0.7 |
| Possible | 0.4 |
| Unlikely | 0.2 |
| Rare | 0.05 |

Weighted Score = Risk Level × Probability. A critical edge case that is almost certain to occur (score: 10.0) is treated very differently from a critical edge case that is rare (score: 0.5). During optimization, approaches are scored by summing all weighted scores in their branch.
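
The arithmetic is simple enough to sketch directly. The following TypeScript uses the point values and multipliers from the tables above; the function names are illustrative, not part of ECE itself.

```typescript
// Scoring constants taken from the tables above.
const RISK_POINTS = { critical: 10, medium: 3, low: 1 } as const;
const PROBABILITY_MULTIPLIER = {
  "almost-certain": 1.0,
  likely: 0.7,
  possible: 0.4,
  unlikely: 0.2,
  rare: 0.05,
} as const;

type Risk = keyof typeof RISK_POINTS;
type Probability = keyof typeof PROBABILITY_MULTIPLIER;

interface ScoredNode {
  risk: Risk;
  probability: Probability;
}

// Weighted Score = Risk Level x Probability
function weightedScore(node: ScoredNode): number {
  return RISK_POINTS[node.risk] * PROBABILITY_MULTIPLIER[node.probability];
}

// During optimization, an approach is scored by summing every node in its branch.
function branchScore(branch: ScoredNode[]): number {
  return branch.reduce((sum, node) => sum + weightedScore(node), 0);
}

weightedScore({ risk: "critical", probability: "almost-certain" }); // 10.0
weightedScore({ risk: "critical", probability: "rare" });           // 0.5
```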

Key Feature

The Blindspot System

This is the feature that makes ECE different from any other planning tool. AI models are susceptible to specific biases during extended analysis:

Anchoring Bias
The first approach identified becomes the unconscious favorite.
Confirmation Bias
Once a risk is rated, the AI tends to confirm rather than challenge it.
Sunk Cost Bias
Nodes that have been heavily expanded feel more important even when they are not.
External Contamination
If the user expresses a preference mid-evaluation, the AI may unconsciously weight it.
Recency Bias
The most recently expanded node feels most urgent.
Missing Perspectives
The AI defaults to its most familiar lens, missing security, accessibility, or business angles.

The /ece blindspot command runs an adversarial self-audit with five questions on every node:

1. Risk Accuracy: Would I rate this the same if I had never seen the other nodes?
2. Probability Validity: Is this probability based on evidence or assumption?
3. Anchoring Check: Was this rating influenced by its parent or sibling?
4. Contamination Check: Did user preferences influence this rating?
5. Missing Perspectives: Is there a risk angle I have not considered?

With the --lens flag, it adds a sixth question through a specific expert perspective (security, accessibility, performance, business, or resilience).
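
Conceptually, the audit is a fixed question set plus an optional lens. The sketch below encodes the five questions from the list above; the types and the wording of the lens question are my assumptions, not ECE's internals.

```typescript
// Illustrative encoding of the blindspot question set (not ECE's internals).
const BLINDSPOT_QUESTIONS = [
  "Risk Accuracy: Would I rate this the same if I had never seen the other nodes?",
  "Probability Validity: Is this probability based on evidence or assumption?",
  "Anchoring Check: Was this rating influenced by its parent or sibling?",
  "Contamination Check: Did user preferences influence this rating?",
  "Missing Perspectives: Is there a risk angle I have not considered?",
] as const;

type Lens = "security" | "accessibility" | "performance" | "business" | "resilience";

// With --lens, a sixth question is asked from the chosen expert perspective.
function auditQuestions(lens?: Lens): string[] {
  const questions: string[] = [...BLINDSPOT_QUESTIONS];
  if (lens) {
    questions.push(`Lens (${lens}): What would a ${lens} expert flag in this node?`);
  }
  return questions;
}
```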

ECE requires at least one blindspot check before allowing optimization. Unchecked bias produces unreliable recommendations.

Validation

Real-World Test: Blind User Navigation

To validate ECE, I ran it on a real accessibility problem:

Problem Statement

A navigation app for blind users has two large buttons (Stop and Repeat). The user needs a third “Call for Assistance” button, but the existing buttons are already maximized for accessibility. How do you integrate a third interaction without compromising the experience?

What ECE Mapped

Four fundamentally different approaches:

A. Triple Vertical Split: Add a third button by shrinking all three
B. Long-Press Gesture: Keep two buttons, add a hidden gesture
C. Shake to Call: Use device motion as input
D. Voice Command: Use speech recognition

What the Blindspot Check Caught

Rating Corrections
- "Muscle Memory Conflict" was upgraded from Medium to Critical. Changing the Stop button's position in a safety-critical navigation app is a safety hazard, not a UX issue.
- "Voice Command" reliability was corrected from Low risk to Medium. Blind users navigate in noisy real-world environments.

Missing Perspective
- The entire evaluation was missing a "Network Latency" perspective. What happens if the assistance call fails to connect?

Without the blindspot check, the optimization would have recommended an approach with undetected critical risks.

Process

Building and Testing the 55 Edge Cases

The blind user navigation evaluation did not stop at four approaches with a handful of risks. Using the full ECE workflow, I expanded every critical and medium-risk node, traced each one to its terminal outcome, and validated mitigations. The initial /ece map produced 4 approaches with 12 immediate edge cases. Successive rounds of /ece expand and /ece trace uncovered downstream failures that were invisible at the surface level. By the time the tree was fully mapped, the evaluation contained 55 distinct edge cases across the four approach branches.

How the Count Grew

| Phase | Command Used | Edge Cases Found | What Surfaced |
| --- | --- | --- | --- |
| Initial Map | /ece map | 12 | First-level risks per approach: gesture conflicts, haptic ambiguity, voice reliability, touch target sizing |
| Critical Expansion | /ece expand | +18 | Second-level cascades: what happens when the first failure triggers a second (e.g., a haptic misfire during obstacle proximity causes a wrong turn) |
| Terminal Tracing | /ece trace | +11 | Worst-case chains: a missed obstacle alert leading to a wall collision leading to panic leading to emergency call failure |
| Validation Risks | /ece validate | +8 | Mitigation side-effects: adding a confirmation dialog to the call button introduces a delay that is dangerous during an emergency |
| Blindspot Audit | /ece blindspot | +6 | Missing perspectives: network latency during calls, battery drain from continuous haptic feedback, multi-floor elevator transitions |

Testing Against the Design

Each of the 55 edge cases was tested against the final UI design using a three-step process:

1. Test Case Generation: Used /ece test-gen to produce Given/When/Then acceptance tests from each butterfly chain. 55 edge cases produced 78 test scenarios (some chains had multiple terminal paths). A hypothetical example of one generated scenario follows this list.
2. Screenshot Review: Used /ece review to evaluate each screen against the synthesized blueprint. The review checked whether acceptance criteria were met and flagged accessibility gaps.
3. Impact Cross-Reference: Used /ece impact to build a compound failure matrix. This identified 7 cases where two independent edge cases, if triggered simultaneously, would produce a third failure not captured in either branch.
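
To give a sense of the output format, here is a hypothetical Given/When/Then scenario of the kind /ece test-gen produces. The specific scenario, node ID, and field names are invented for illustration; they are not taken from the actual evaluation output.

```typescript
// Hypothetical example of a generated test scenario (illustrative only).
interface TestScenario {
  edgeCaseId: string;   // the leaf node this scenario was derived from
  given: string;
  when: string;
  then: string;
}

const assistanceCallOffline: TestScenario = {
  edgeCaseId: "B3",
  given: "The user is mid-route and the long-press assistance gesture is armed",
  when: "The network drops while the assistance call is being placed",
  then: "The failure is announced through haptics and speech, and the call is retried",
};
```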

From Edge Cases to Wireframe

After optimization selected the long-press gesture approach as the safest route, /ece synthesize generated the solution blueprint with requirements and constraints derived directly from the 55 edge cases. From there, /ece wireframe produced an interactive React prototype with a two-panel layout: a left panel showing all toggle conditions, priority badges, and the acceptance criteria checklist, and a right panel with the primary interaction zone where toggling conditions visually demonstrated how the UI responded to each edge case. The wireframe became the bridge between the abstract risk analysis and the concrete design decisions visible in the final screens.
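
As a rough sketch of what that two-panel layout looks like in code, here is a stripped-down React/TypeScript component. The component name, condition labels, and state handling are illustrative assumptions; the real wireframe derives its conditions from solution.md.

```tsx
// Illustrative sketch of the two-panel wireframe layout (not the generated code).
import { useState } from "react";

// In the real wireframe these conditions come from solution.md; these are placeholders.
const CONDITIONS = ["Network offline", "Low battery", "Noisy environment"] as const;
type Condition = (typeof CONDITIONS)[number];

export function WireframeSketch() {
  const [active, setActive] = useState<Set<Condition>>(new Set());

  const toggle = (condition: Condition) => {
    setActive((prev) => {
      const next = new Set(prev);
      if (next.has(condition)) next.delete(condition);
      else next.add(condition);
      return next;
    });
  };

  return (
    <div style={{ display: "flex", gap: 16 }}>
      {/* Left panel: toggle conditions (priority badges and criteria omitted here) */}
      <aside>
        {CONDITIONS.map((condition) => (
          <label key={condition} style={{ display: "block" }}>
            <input
              type="checkbox"
              checked={active.has(condition)}
              onChange={() => toggle(condition)}
            />
            {condition}
          </label>
        ))}
      </aside>
      {/* Right panel: primary interaction zone reflecting the active edge states */}
      <main>
        <p>Active edge states: {[...active].join(", ") || "none"}</p>
      </main>
    </div>
  );
}
```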

55 edge cases sounds like a large number. But consider that standard AI planning for the same problem identified 3. The remaining 52 were discovered only because the evaluation forced systematic expansion, tracing, and bias auditing at every layer.

Self-Evaluation

Does ECE Actually Work?

I used ECE to evaluate ECE. The full evaluation is at .ece/evaluations/ece-vs-standard-planning/. I mapped 4 approaches with 16 edge cases, expanded 5 critical nodes, ran a blindspot audit on my own analysis, and optimized. Here is what the evaluation found.

Where ECE Outperforms Standard AI

| Dimension | Standard AI | ECE |
| --- | --- | --- |
| Edge case coverage | Identifies 1 to 3 inline | Maps 12+ across 4+ approaches |
| Approach comparison | Recommends one (usually the most common) | Enumerates and scores multiple |
| Bias detection | None | Adversarial self-audit with targeted corrections |
| Risk quantification | Qualitative ("this could be an issue") | Weighted scoring (Risk × Probability) |

Where ECE is Overhead

ECE is not a replacement for everyday AI planning. For most day-to-day decisions, standard AI is faster and sufficient. ECE's value scales with the cost of getting the decision wrong.

| Problem Type | Use ECE? | Why |
| --- | --- | --- |
| Simple CRUD feature | No | Happy path is obvious. Edge cases are minor. |
| UI layout decision | Probably not | Multiple valid approaches, but consequences are cosmetic. |
| Authentication architecture | Yes | Security edge cases are non-obvious. Consequences are severe. |
| Accessibility for disabled users | Yes | Safety implications. Missing edge cases cause harm. |
| Payment processing | Yes | Financial edge cases compound. Retry storms, race conditions. |
| Multi-service architecture | Yes | Cascading failures. Single-point analysis misses cross-system risks. |

Minimum Viable Workflow

You do not need all 17 commands for every problem. The minimum captures 80% of the value in 10 to 15 minutes:

1. /ece map (~5 min)
2. /ece expand (~5 min, run 1 to 2 times)
3. /ece optimize (~5 min)

The full workflow (with blindspot, synthesize, review, close) is reserved for high-stakes evaluations where thoroughness justifies the 20 to 40 minute investment.

Blindspot Audit on This Evaluation

I ran the blindspot check on my own self-evaluation. It flagged real issues:

- Creator Bias: This evaluation was conducted by ECE's creator. There is inherent anchoring toward proving ECE works. Mitigated by expanding counter-argument nodes and rating them honestly.
- Single Test Case: The strongest evidence (the blind user navigation test) is a single evaluation. Confidence in node A1 was downgraded from High to Medium until more evaluations are documented.
- User-Preference Contamination: The question "does it actually create any upper hand?" implies an expectation of a positive answer, which could unconsciously weight findings.

The self-evaluation demonstrates what ECE is designed for: forcing an honest, structured analysis that includes checking your own bias, even when evaluating your own tool.

Structure

Repository Layout

```
.ece/
  SYSTEM.md                  # Core identity, rules, command registry (~65 lines)
  commands/                  # 17 command files
    map.md
    expand.md
    trace.md
    recommend.md
    validate.md
    blindspot.md
    optimize.md
    synthesize.md
    test-gen.md
    impact.md
    review.md
    close.md
    resume.md
    status.md
    wireframe.md
    patterns.md
    reset.md
  templates/
    spine.template.md        # Standardized evaluation map format
    node.template.md         # Standardized edge case format
  evaluations/               # Runtime data (one directory per problem)
    <problem-slug>/
      spine.md
      nodes/
      solution.md
      reviews/
  history/                   # Completed evaluation summaries
```

Proof of Concept

/ece wireframe in Action

The wireframe below was generated from a real ECE evaluation on the Indoor Navigator project — specifically the Blind Navigation Assistance Button problem. No states were invented. Every toggle condition, acceptance criterion, and edge state derives directly from the solution.md produced by /ece synthesize.

ECE Wireframe output — Blind Navigation Assistance Button

The full session recording below shows the complete flow — running /ece wireframe, toggling conditions (touch geometry, network status, battery level, onboarding state), and watching the state inspector update in real time.

View on GitHub

Conclusion

The Verdict

1. ECE outperforms standard AI planning for complex, high-stakes problems. It maps 12+ edge cases where standard AI identifies 1 to 3, detects bias drift that standard AI does not self-detect, and replaces gut-feel recommendations with quantified scoring.
2. ECE is overhead for simple problems. If the best approach is obvious and the edge cases are minor, standard AI is faster and sufficient. ECE's value scales with the cost of getting the decision wrong.
3. ECE fills a gap that planning frameworks do not address. Planning tools decide what to build. ECE evaluates which approach to take before building begins. They are complementary, not competing.
4. Bias detection is not optional. It is structural. The blindspot check found real drift in every evaluation tested, including this self-evaluation. Risk ratings shifted. Missing perspectives were identified. Creator bias was detected and flagged.

What Remains Unproven

- Independent validation: no one outside the project has run ECE on their own unfamiliar problem yet.
- Broader evaluation dataset: only 2 fully worked examples are documented (blind user navigation, payment gateway).
- Multi-model consistency: tested on one AI model. Results on other models are unverified.
- Team adoption patterns: ECE has only been used by a solo developer so far.

AI is capable of deep analysis. It just defaults to shallow answers unless you build the structure that demands depth. ECE is that structure. Its upper hand is real, but conditional: it emerges for complex problems and disappears for simple ones.

ECE is open source at github.com/rohitnischal01/ece. 17 commands, MIT license, installs into any AI coding environment with npx ece-evaluator ece-install.
