What to Expect From Me
The way an engineer designs, specifies, and operates complex systems matters as much as the systems themselves. This section exposes that process directly. Instead of describing capabilities in abstract terms, it shows the architecture, constraints, specifications, and control layers used to build real systems. For potential collaborators, employers, or clients, the goal is simple: evaluate the work itself. The materials below demonstrate how problems are structured, how specifications are written, and how systems are shipped. If the approach aligns with your needs, we should talk.
How to Read This Section
Most portfolios show finished artifacts. This page shows the work behind those artifacts. It documents how problems are framed, how constraints are locked, how blueprints and engineering specs are written, how CI/CD is used as an enforcement mechanism, how AI-assisted development is bounded, and where human inspection remains authoritative.
The intent is to remove ambiguity about what someone is actually hiring when they hire me. The output matters, but the process matters more. This page is meant for technical readers, operators, founders, and engineering leadership who want to inspect the system behind the artifact rather than stop at the artifact itself.
Important Notice
The material presented here reflects proprietary engineering processes and system design work. These processes, architectures, methodologies, and planning artifacts are intellectual property. All rights reserved.
This page is provided for evaluation purposes only to demonstrate engineering capability, architectural reasoning, system discipline, and execution quality.
Case 01 — Thesis Chain AI DevKit
This case study examines the engineering process behind the Thesis Chain AI DevKit. The DevKit exists to safely integrate AI-assisted development into production-grade engineering workflows while controlling cost, behavior, nondeterminism, and security risk.
Rather than treating AI as an autonomous authority, the DevKit treats model output as untrusted input. Guardrails, validation layers, budget controls, policy checks, and human inspection surround the model so the engineering system remains predictable, auditable, and reviewable.
Problem Definition
Modern AI models can accelerate engineering work dramatically, but naive integration introduces severe risk: prompt injection, uncontrolled costs, nondeterministic output, accidental data disclosure, weak reviewability, and silent drift in system behavior.
Most AI-assisted development tooling assumes the model can be trusted to generate correct or safe output. In practice this assumption fails often enough to make an unconstrained approach unacceptable in serious engineering environments.
The engineering problem addressed by the Thesis Chain AI DevKit is therefore:
How can AI-assisted development be integrated into real engineering workflows while maintaining deterministic control, cost discipline, bounded authority, and meaningful security guarantees?
Engineering Constraints
Before architecture begins, the system must operate under explicit constraints. These constraints shape every architectural decision and prevent design drift.
- Token Budget Control: AI usage must operate under strict token ceilings. Models are selected intentionally by task class to avoid unnecessary cost.
- Deterministic Behavior: Outputs must be inspectable and reproducible wherever possible. AI responses are never treated as authoritative system state.
- Guardrail-First Architecture: Every model interaction must pass through validation layers including prompt screening, context restriction, redaction, schema validation, and disposition control.
- Model Stratification: Different models are used for different classes of work. Expensive models are reserved for tasks that genuinely require deeper synthesis. Lower-cost models handle mechanical or narrow work.
- Fail-Closed System Behavior: If validation fails or a guardrail triggers, the system rejects the output rather than attempting to recover silently.
- Human Inspection Authority: Human engineers retain final authority over merges, deployments, architecture, and policy changes.
- Security Isolation: Sensitive information and secrets must never enter model context. The system must operate under the assumption that model output could be malicious, confused, or incorrect.
Blueprint Architecture
The blueprint phase exists to lock system intent before implementation begins. Its purpose is not to describe code. Its purpose is to define the operational shape of the system: what the system must do, what it must never do, how risk is bounded, where authority resides, how inputs move, and what acceptance looks like before implementation starts.
For the Thesis Chain AI DevKit, the blueprint establishes a guardrail-first architecture. The model is never placed at the center of the system. Instead, the model is wrapped inside a deterministic control envelope that constrains what context may be passed in, how requests are formed, how outputs are parsed, and what conditions cause the system to reject the response.
At blueprint level, the architecture is divided into ordered layers rather than loose feature ideas. That matters because order determines safety. Cheap and deterministic checks execute first. Expensive and probabilistic work executes later, only after the input has been reduced, normalized, screened, and validated.
- Input Boundary: Raw source content, prompts, instructions, repository diffs, and execution context are treated as separate classes of input with different trust levels.
- Redaction and Sanitization Layer: Secret-bearing content, irrelevant data, and structurally dangerous prompt material are removed or transformed before a provider call is even possible.
- Context Minimization Layer: Only the minimum useful context should move forward. This prevents whole-repo dumping, wasted spend, and low-signal prompts.
- Budget and Routing Layer: The system decides whether a task deserves an AI call at all, and if it does, which model tier should receive it.
- Provider Abstraction Layer: Providers are execution surfaces, not sources of truth. Core engineering logic is not coupled to one vendor.
- Schema and Validation Layer: Output must fit a declared contract. If parsing fails or required structure is absent, the result is rejected.
- Decision Boundary Layer: Even valid model output does not become authority automatically. The system classifies it as blocked, advisory, review-required, or safe to surface.
- Audit and Replay Layer: Every meaningful run should be inspectable after the fact. Useful engineering systems must be reviewable, explainable, and diagnosable under failure.
Canonical Blueprint Markdown
The following appendix is mirrored locally from the AI DevKit source material and displayed here as canonical markdown.
The Thesis Chain AI DevKit — Blueprint
Version: 1.0.0
Status: Canonical Blueprint
Project: the-thesis-chain-ai-devkit
Document Type: System Blueprint
Primary Audience: Engineering leadership, platform engineers, security reviewers, implementation engineers
Authoring Intent: Define the operational architecture, trust boundaries, guardrails, authority model, and implementation shape for a safe AI-assisted engineering system.
1. Purpose
The Thesis Chain AI DevKit exists to integrate AI-assisted development into real engineering workflows without giving model output uncontrolled authority over code, repository state, infrastructure, or policy.
The system is designed around a simple premise:
AI output is useful, but untrusted.
The DevKit therefore does not treat the model as a builder with implicit authority. It treats the model as an external probabilistic subsystem wrapped inside deterministic engineering controls. The value of the system comes from how inputs are reduced, how context is bounded, how outputs are validated, how budget is controlled, how risk is isolated, and where human authority is retained.
This project is not a chatbot wrapper. It is an engineering control framework for structured, auditable, bounded AI-assisted workflows.
2. Problem Statement
Modern model providers can accelerate review, synthesis, linting, threat sketching, and ambiguity detection. However, naive adoption creates a compound engineering risk surface:
- unbounded token spend
- accidental data disclosure
- prompt injection through repository text
- nondeterministic output treated as truth
- silent workflow drift
- provider coupling
- weak auditability
- unclear merge authority
- inappropriate use of write-capable automation
The actual engineering problem is:
How can AI-assisted engineering workflows produce useful structured output while preserving deterministic safety, bounded cost, auditability, and human control?
This blueprint answers that question at architecture level.
3. Design Position
3.1 What AI is allowed to be
AI may act as:
- a reviewer
- a synthesizer
- a contradiction detector
- an ambiguity finder
- a threat-category sketcher
- a structured advisory instrument
3.2 What AI is not allowed to be
AI is not:
- a source of truth
- an autonomous merger
- a deployment authority
- a secrets-bearing execution surface
- a repository-wide reader by default
- a policy mutator
- a privileged system actor
3.3 Core architectural stance
The system is guardrail-first, fail-closed, and authority-constrained.
The model sits inside a layered deterministic envelope. The envelope, not the model, is the system.
4. Non-Negotiable Constraints
Before implementation, the following constraints are locked.
4.1 Bounded authority
AI output may be rendered, scored, cached, audited, and surfaced for review, but it may not directly merge code, deploy infrastructure, rotate secrets, or mutate policy without explicit human approval.
4.2 Diff-limited context
The system must operate on narrowed, task-relevant, allowlisted context. Whole-repo dumping is prohibited by design.
4.3 Redaction before provider access
Redaction and path filtering occur before any provider call is possible.
4.4 Strict schema at boundaries
Model output must be parsed into declared structure. If parsing fails, the system rejects the result.
4.5 Fail-closed behavior
Validation, policy, or budget failure must produce rejection rather than silent degradation.
4.6 Deterministic gates remain authoritative
Deterministic checks keep final authority. AI output is advisory even when structurally valid.
4.7 Provider abstraction
Core system logic may not be tightly coupled to a single model vendor.
4.8 Full run traceability
Meaningful executions must emit auditable artifacts sufficient for replay, diagnosis, and review.
5. System Goals
The DevKit is intended to provide the following outcomes.
- Increase engineering leverage on review-heavy work.
- Reduce ambiguity and contradiction in specs, diffs, and architectural material.
- Bound the safety and cost risks of model usage.
- Produce repeatable structured outputs.
- Preserve explainability and post-run auditability.
- Support both local and GitHub-mediated workflows.
- Remain useful even when provider integrations are stubbed or offline.
6. Out of Scope
The following are explicitly out of scope for this version.
- autonomous code merge
- autonomous deployment
- autonomous policy modification
- secret retrieval from protected systems
- unrestricted repo ingestion
- write-capable agent swarms
- unsupervised multi-step tool execution against production systems
- treating schema-valid output as semantically correct by default
7. Operational Model
The DevKit is organized as a layered pipeline.
7.1 Layer 0 — Input boundary
Inputs enter as typed engineering artifacts:
- repository reference
- pull request reference
- diff summary
- changed files
- prompt template version
- task class
- runtime policy
- optional provider configuration
All inputs are assigned trust levels.
7.2 Layer 1 — Path policy and context eligibility
Files are filtered through allow/deny policy. Sensitive directories and structurally dangerous paths are excluded from model context.
7.3 Layer 2 — Redaction and sanitization
Eligible content is passed through redaction rules to suppress obvious secret and PII patterns and to reduce accidental disclosure.
7.4 Layer 3 — Prompt injection preflight
Repository text, diffs, and instructions are screened for prompt injection patterns. Safety mode accepts false positives over false negatives.
7.5 Layer 4 — Context minimization
Only the minimum useful diff and file content move forward. The system reduces low-signal input before any expensive operation.
7.6 Layer 5 — Budget and routing
The system decides whether the task deserves an AI call at all, and if so, what model class should receive it.
7.7 Layer 6 — Provider execution
Providers are treated as external execution surfaces. Their output is raw material, not authority.
7.8 Layer 7 — Parse and schema validation
Response text must parse to valid structured output. Invalid output is rejected.
7.9 Layer 8 — Decision boundary
A valid report is still classified as advisory. It may be rendered to markdown, attached to a PR, cached, audited, or flagged for manual review.
7.10 Layer 9 — Audit, metrics, replay
The run emits enough metadata to reconstruct what happened without trusting memory or provider logs alone.
8. High-Level Architecture
8.1 Principal subsystems
Policy subsystem
- allow paths
- deny paths
- strict schema enforcement
- prompt injection guard enablement
- budget limits
- model selection defaults
Context control subsystem
- changed-file assembly
- diff summary ingestion
- size reduction
- path gating
- content shaping
Safety subsystem
- redaction
- prompt injection heuristics
- fail-closed validation
Provider abstraction subsystem
- provider interface
- stub provider
- future provider adapters
Schema boundary subsystem
- output contract
- parse failure handling
- structure validation
Audit subsystem
- request event
- response event
- error event
- hashes and token usage
Cache subsystem
- deterministic keying
- TTL-based storage
- duplicate-spend prevention
Agent subsystem
- task-specific templates
- structured report generation
- agent versioning
Runner subsystem
- local runner
- GitHub Actions runner
- GitHub App / webhook architecture
9. Agent Model
Agents in this system are not autonomous personas. They are typed task modules with fixed contracts.
Each agent must define:
- an agent name
- an agent version
- a prompt template
- constraints
- an output schema
- a deterministic validation boundary
- a rendering target
Example task classes supported by the current architecture include:
- specification linting
- PR synthesis
- threat sketching
The architectural rule is that an agent is not defined by a clever prompt. It is defined by a prompt-plus-contract-plus-boundary package.
10. Trust Boundaries
This system has several hard trust boundaries.
10.1 Repository text is untrusted
Pull request content, spec text, comments, and changed files may contain adversarial instructions.
10.2 Model provider is external
Provider calls move data beyond the local boundary. Context must be reduced before crossing that line.
10.3 Model output is untrusted
Even well-formed output may be wrong, incomplete, or subtly misleading.
10.4 Human reviewers remain authoritative
Human approval is the boundary at which advisory output may influence actual engineering decisions.
11. Safety Architecture
11.1 Prompt injection resistance
The system uses conservative preflight heuristics to reject obvious attempts to override role, reveal secrets, or alter instructions.
11.2 Path isolation
The system denies unsafe path classes by default and only sends allowlisted engineering material.
11.3 Secret and PII redaction
Sensitive patterns are removed or masked before request assembly.
11.4 Schema-gated output
Only output that fits the declared report structure is accepted into downstream systems.
11.5 Read-only default integration
Integrations should default to read-only scope with comment-only feedback unless explicitly elevated.
11.6 Human-held merge authority
No report, score, or advisory comment is permitted to stand in for merge authority.
12. Budget and Cost Control Model
The DevKit treats cost as a first-class systems problem.
12.1 Budget primitives
For a run r:
calls(r) = number of provider calls
Tin(r) = total input tokens
Tout(r) = total output tokens
The budget envelope is:
calls(r) <= C_max
Tin(r) <= I_max
Tout(r) <= O_max
The run is rejected when any inequality is violated.
12.2 Cost equation
For provider pricing:
alpha = cost per input token
beta = cost per output token
Then expected run cost is:
Cost(r) = alpha * Tin(r) + beta * Tout(r)
System-level budget discipline requires that expected spend be bounded before scale is allowed.
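The cost equation above can be computed directly from run usage. A minimal sketch follows; the token counts and the alpha/beta rates are illustrative values only, not real provider pricing.

```typescript
// Token usage observed for a single run r.
interface RunUsage {
  inputTokens: number;  // Tin(r)
  outputTokens: number; // Tout(r)
}

// Cost(r) = alpha * Tin(r) + beta * Tout(r)
function runCost(usage: RunUsage, alpha: number, beta: number): number {
  return alpha * usage.inputTokens + beta * usage.outputTokens;
}

// Hypothetical per-token rates, chosen only to make the arithmetic visible.
const cost = runCost({ inputTokens: 4000, outputTokens: 1000 }, 0.000003, 0.000015);
// 4000 * 0.000003 + 1000 * 0.000015 ≈ 0.027
```

Bounding expected spend before scale then reduces to bounding Tin and Tout per run and the number of runs.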
12.3 Caching principle
Repeated calls on equivalent prompt and context should not re-spend budget.
A canonical cache key shape is:
K = H(provider || model || prompt_version || prompt_hash || context_hash || policy_version)
Where H() is a collision-resistant digest.
13. Auditability Model
Every meaningful run should emit structured audit events.
At minimum, the system records:
- request id
- provider
- model
- prompt hash
- context hash
- output hash
- timestamp
- token usage
- error state, if any
This allows operators to answer:
- what was asked
- what input class was sent
- what provider/model handled it
- whether the output was cached
- whether the output validated
- what it cost
- what failed if the run was rejected
Audit exists to support diagnosis, governance, and trust.
14. GitHub Integration Model
The DevKit supports two primary integration modes.
14.1 CI-driven mode
A GitHub Action runs on PR events, assembles eligible context, executes the advisory pipeline, and posts structured review comments.
14.2 App-driven mode
A webhook service verifies GitHub signatures, mints installation tokens, fetches changed files, runs the advisory pipeline, and posts PR comments or check runs.
The blueprint preference is:
- read-only by default
- no content mutation by default
- comment/check-run surfaces preferred over write surfaces
- deterministic verification before any pipeline execution
15. Human Roles
The system explicitly retains human authority in the following roles.
15.1 Architect
Defines the allowed shape of the system, agent classes, boundaries, and non-negotiables.
15.2 Security reviewer
Owns threat posture, path policy, redaction strategy, integration scope, and escalation policy.
15.3 Implementation engineer
Builds adapters, runners, validators, and renderers against the blueprint and spec.
15.4 Reviewer / operator
Interprets advisory output, checks evidence, and decides whether action is warranted.
15.5 Release authority
Retains final authority for merges, deployment, and policy change.
16. Acceptance Criteria
The blueprint is considered implemented correctly when the system can demonstrably do the following:
- accept diff-limited engineering context
- reject disallowed paths before provider access
- redact obvious secrets and PII before request creation
- detect and block obvious prompt injection patterns
- assemble versioned prompt envelopes
- enforce hard token/call budgets
- cache equivalent requests deterministically
- parse and schema-validate response structure
- emit auditable request/response/error events
- surface advisory reports without granting write authority
- support both local and GitHub-oriented execution paths
- fail closed on malformed output or policy violation
17. Failure Philosophy
The DevKit is intentionally conservative.
When uncertain, it should:
- reduce context
- reject unsafe paths
- block suspicious instructions
- refuse malformed output
- mark uncertainty explicitly
- escalate to human review
The preferred failure mode is lost convenience, not silent compromise.
18. Future Evolution
The architecture permits future additions, but only within the same control posture.
Possible later extensions include:
- stronger schema validators
- scored evidence confidence
- richer path-policy classes
- provider multiplexing
- offline replay tooling
- diff chunking for large PRs
- policy version pinning
- richer evaluation harnesses
- more agent classes
These are valid only if they preserve the current authority model: deterministic controls first, advisory AI second.
19. Blueprint Summary
The Thesis Chain AI DevKit is a control architecture for AI-assisted engineering, not an AI-first automation toy.
Its core principles are:
- AI remains untrusted
- deterministic boundaries remain authoritative
- context is minimized before exposure
- cost is bounded
- outputs are schema-gated
- audit is mandatory
- write authority is withheld by default
- humans retain final control
That is the system this blueprint defines.
Engineering Specifications
If the blueprint defines intent, the engineering specification defines execution. This is where high-level architectural ideas are converted into a buildable, inspectable, and testable system. In my process, the engineering spec is not a light outline. It is the document that removes ambiguity from implementation.
The engineering spec for an AI-assisted development system must answer several questions explicitly:
- What modules exist, and what are their exact responsibilities?
- What data enters and leaves each boundary?
- What conditions are blocking conditions versus warning conditions?
- Where does the system fail closed?
- What is human-reviewed, and what is machine-validated?
- How are token budgets measured, enforced, and audited?
- How are outputs replayed, inspected, and compared?
For the Thesis Chain AI DevKit, the engineering spec acts as a discipline document. It translates “AI should help here” into precise, enforceable behavior.
1. Module Boundaries
The spec separates the system into modules with narrow responsibilities: input preparation, sanitization, routing, provider calls, parsing, validation, budget accounting, result classification, and human inspection. If a module cannot be named and bounded, it is not ready to be implemented.
2. Ordered Guardrails
Guardrails are fixed in sequence. They are not optional helpers. They are part of the main execution path.
3. Output Contracts
The spec defines what a valid response looks like. Structured output contracts reduce hidden interpretation costs and unstable downstream behavior.
4. Failure Semantics
The spec identifies when the system must stop. A malformed response, budget breach, unsafe context match, or policy violation should terminate the path and surface a visible failure state.
5. Token and Cost Discipline
Work is divided into classes: mechanical, evaluative, synthesis-heavy, and ambiguous. These classes map to different model tiers and different budget thresholds.
6. Inspection Requirements
The spec defines what must be visible to a human reviewer: prompt class, sanitized input summary, chosen model tier, token consumption, validation results, classification outcome, and final disposition.
7. Non-Negotiables
The strongest specs contain non-negotiables that implementation is not allowed to reinterpret: no hidden globals, no silent fallback behavior, no speculative scope expansion, no unbounded model calls, and no accepting model output as trusted state without validation and review.
Canonical Engineering Spec Markdown
The following appendix is mirrored locally from the AI DevKit source material and displayed here as canonical markdown.
The Thesis Chain AI DevKit — Engineering Specification
Version: 1.0.0
Status: Canonical Engineering Specification
Project: the-thesis-chain-ai-devkit
Document Type: Engineering Specification
Primary Audience: Implementation engineers, reviewers, maintainers, CI/CD operators
Depends On: the-thesis-chain-ai-devkit-blueprint-1-0-0.md
1. Specification Intent
This engineering specification defines the concrete implementation contract for the Thesis Chain AI DevKit.
It exists to translate blueprint-level architectural intent into:
- module boundaries
- runtime data contracts
- algorithmic flow
- validation rules
- budget equations
- cache semantics
- audit event structure
- runner behavior
- GitHub integration behavior
- acceptance tests
This spec is written so an implementation engineer can build or extend the system without guessing.
2. System Summary
The DevKit is a provider-agnostic, schema-gated, guardrail-first framework for AI-assisted engineering workflows.
At runtime, the system:
- receives a task-specific request
- filters context by policy
- redacts content
- screens for prompt injection
- assembles a prompt envelope
- computes deterministic hashes
- checks cache
- enforces budget
- calls a provider adapter
- parses and validates response structure
- records audit events
- returns an advisory report to a runner
The implementation must preserve that order.
3. Repository-Level Module Topology
3.1 Required top-level module groups
src/core/
- types
- policy
- redaction
- injection guards
- schema validation
- LLM client
- audit
- cache
- prompt templates
- shared utilities
src/adapters/
- provider adapter interface
- provider implementations or stubs
src/agents/
- typed agent runners for fixed task classes
src/runners/
- local execution path
- GitHub-oriented execution path
docs/
- architectural and operational documentation
.github/workflows/
- CI demonstration or integration flows
4. Data Contracts
4.1 Severity
Allowed values:
- info
- warn
- high
4.2 Category
Allowed values:
- structure
- invariant
- threat
- diff
- test
4.3 Finding
A finding is a typed advisory unit.
type Finding = {
id: string;
severity: 'info' | 'warn' | 'high';
category: 'structure' | 'invariant' | 'threat' | 'diff' | 'test';
claim: string;
evidence_refs: string[];
suggested_action?: string;
};
4.4 Report
The report is the canonical accepted AI output structure.
type Report = {
agent: string;
version: string;
input_hash: string;
output_hash: string;
findings: Finding[];
notes?: string[];
};
4.5 FileBlob
type FileBlob = {
path: string;
content: string;
};
4.6 AgentContext
type AgentContext = {
repo: { owner: string; name: string };
pr?: { number: number; headSha: string };
diffSummary: string;
changedFiles: FileBlob[];
promptVersion: string;
};
4.7 ModelSpec
type ModelSpec = {
provider: 'stub' | 'openai' | 'azure_openai' | 'anthropic' | 'vertex';
model: string;
temperature: number;
maxOutputTokens: number;
};
4.8 Budget
type Budget = {
maxCalls: number;
maxTotalInputTokens: number;
maxTotalOutputTokens: number;
};
4.9 LLMRequest
type LLMRequest = {
requestId: string;
system: string;
task: string;
constraints: readonly string[];
outputSchema: JSONSchemaLike;
model: ModelSpec;
context: {
diffSummary: string;
files: FileBlob[];
};
sampling?: {
top_p?: number;
seed?: number;
};
};
4.10 LLMResponse
type LLMResponse = {
requestId: string;
provider: LLMProvider;
model: string;
rawText: string;
parsed: Report;
usage: {
inputTokens: number;
outputTokens: number;
};
audit: {
promptHash: string;
contextHash: string;
outputHash: string;
timestampMs: number;
};
};
5. Policy Contract
5.1 Policy structure
The system policy must declare:
- allowPaths
- denyPaths
- budget
- model
- strictSchema
- promptInjectionGuard
Example contract:
type Policy = {
allowPaths: string[];
denyPaths: string[];
budget: Budget;
model: ModelSpec;
strictSchema: true;
promptInjectionGuard: true;
};
5.2 Path evaluation rule
A path is eligible iff:
- it does not match any deny prefix
- it does match at least one allow prefix
Formally, for path p:
eligible(p) = (forall d in D : not startsWith(p, d)) and (exists a in A : startsWith(p, a))
Where:
D = deny path set
A = allow path set
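The eligibility predicate transcribes directly into code. The allow/deny prefixes below are hypothetical policy values for illustration only.

```typescript
// eligible(p) = (forall d in D: !startsWith(p, d)) and (exists a in A: startsWith(p, a))
// Deny prefixes always win; a path must also match at least one allow prefix.
function eligible(path: string, allow: string[], deny: string[]): boolean {
  return !deny.some((d) => path.startsWith(d)) && allow.some((a) => path.startsWith(a));
}

// Hypothetical policy prefixes, for illustration only.
const allow = ["src/", "docs/"];
const deny = ["src/secrets/", ".env"];
```

Note that a path under an allowed prefix is still rejected if any deny prefix matches, which keeps the default posture conservative.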
5.3 Default posture
The default policy must remain conservative and read-only in operational effect.
6. Request Lifecycle
6.1 Required order of execution
The system shall process each request in this exact logical order:
- accept typed request
- apply redaction
- run prompt injection preflight
- build prompt
- hash prompt and context
- check cache
- enforce budget
- record request audit event
- call provider
- parse response
- validate response schema
- increment budget counters
- compute output hash
- record response audit event
- write cache entry
- return structured response
This order is not optional. Rearranging it weakens safety or observability.
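The fail-closed property of this ordering can be sketched with a trivial driver. The step bodies below are placeholders, and the preflight is deliberately rigged to fail so the rejection path is visible; this is not the real implementation.

```typescript
// A minimal fail-closed pipeline driver: steps execute in declared order,
// and any step that throws aborts the run without attempting recovery.
type State = { log: string[] };
type Step = { name: string; run: (s: State) => void };

const steps: Step[] = [
  { name: "accept typed request", run: (s) => s.log.push("accept") },
  { name: "apply redaction", run: (s) => s.log.push("redact") },
  {
    name: "prompt injection preflight",
    run: (s) => {
      s.log.push("preflight");
      throw new Error("injection pattern detected"); // simulated guard trigger
    },
  },
  { name: "call provider", run: (s) => s.log.push("call") }, // never reached
];

function runPipeline(all: Step[]): { ok: boolean; failedAt?: string; log: string[] } {
  const state: State = { log: [] };
  for (const step of all) {
    try {
      step.run(state);
    } catch {
      // Reject the run; later steps (including the provider call) do not execute.
      return { ok: false, failedAt: step.name, log: state.log };
    }
  }
  return { ok: true, log: state.log };
}

const result = runPipeline(steps);
```

Because the guard runs before the provider step, a triggered guard means no provider call, no spend, and no partial output.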
7. Context Reduction Requirements
7.1 Context assembly
Only changed files relevant to the current task may be included.
7.2 Context size discipline
The system must avoid whole-repo context assembly. Input is restricted to:
- diff summary
- selected changed files
- fixed prompt template material
- fixed constraints
7.3 Exclusion rules
Files matching deny policy shall never be passed to a provider.
7.4 Context objective
The context subsystem is optimized for signal density, not completeness.
8. Redaction Requirements
8.1 Redaction timing
Redaction must occur before cache-key generation for provider-bound prompt content and before provider invocation.
8.2 Minimum baseline patterns
The implementation must support rule-based redaction of:
- obvious API-key-like tokens
- email addresses
- later extensible secret patterns
8.3 Redaction function
For text blob x and rule set R = {r_1, r_2, ..., r_n}:
Redact(x, R) = r_n(...r_2(r_1(x)))
Where each r_i is a pattern substitution function.
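The composition Redact(x, R) can be sketched as a fold over an ordered rule list. The two patterns below are a hypothetical subset; a real rule set would be broader.

```typescript
// Each rule r_i is a substitution function; redact applies them in order:
// Redact(x, R) = r_n(...r_2(r_1(x)))
type RedactionRule = (text: string) => string;

const rules: RedactionRule[] = [
  // Hypothetical patterns for illustration only.
  (t) => t.replace(/sk-[A-Za-z0-9]{8,}/g, "[REDACTED_KEY]"),
  (t) => t.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[REDACTED_EMAIL]"),
];

function redact(text: string, ruleSet: RedactionRule[]): string {
  return ruleSet.reduce((acc, rule) => rule(acc), text);
}

const out = redact("contact dev@example.com, key sk-abcdef123456", rules);
```

Rule order matters because later rules see the output of earlier ones, which is why the rule set is an ordered list rather than an unordered collection.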
8.4 Redaction philosophy
The redaction subsystem is deliberately conservative. False positives are acceptable if they reduce accidental disclosure.
9. Prompt Injection Guard Requirements
9.1 Guard timing
Prompt injection screening must run after redaction and before provider invocation.
9.2 Heuristic scope
The system must reject obvious adversarial prompt constructs such as:
- instruction override attempts
- role-spoof labels
- secret-exfiltration requests
- provider-key disclosure language
9.3 Safety mode
The guard should prefer false positive rejection over permissive acceptance.
9.4 Failure behavior
A triggered guard produces immediate request rejection.
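A minimal sketch of the preflight, assuming a small hypothetical pattern list; real heuristics would be larger and tuned further toward false positives.

```typescript
// Conservative preflight: any heuristic hit rejects the request outright.
// These patterns are an illustrative subset, not a complete rule set.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?previous instructions/i,
  /you are now\b/i,
  /reveal (your )?(system prompt|secrets|api key)/i,
];

function preflight(text: string): { allowed: boolean; reason?: string } {
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(text)) {
      // Fail-closed: prefer false-positive rejection over permissive acceptance.
      return { allowed: false, reason: `matched /${pattern.source}/` };
    }
  }
  return { allowed: true };
}
```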
10. Prompt Envelope Construction
10.1 Required sections
The prompt envelope shall be assembled in explicit labeled sections:
- SYSTEM
- TASK
- CONSTRAINTS
- OUTPUT_SCHEMA
- CONTEXT_DIFF_SUMMARY
- CONTEXT_FILES
10.2 Section purpose
This labeling exists to reduce ambiguity, constrain prompt shape, and make prompt assembly auditable.
10.3 Prompt template versioning
Every prompt template must include:
- id
- version
- system
- task
- constraints
- outputSchema
Template version changes are behavioral changes and must be traceable.
11. Hashing and Cache Semantics
11.1 Prompt hash
Let P be the final assembled prompt string. Then:
promptHash = H(P)
11.2 Context hash
For diff summary S and files F = {(p_i, c_i)}:
contextHash = H(S || join_i(p_i || ":" || H(c_i)))
11.3 Cache key
A canonical cache key shall include:
- policy namespace or equivalent
- provider
- model
- prompt hash
- context hash
Example:
cacheKey = "aidev:" || provider || ":" || model || ":" || promptHash || ":" || contextHash
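Assuming H() is SHA-256 (the spec only requires a collision-resistant digest), the hash and key constructions above can be sketched as:

```typescript
import { createHash } from "node:crypto";

// H() as a SHA-256 hex digest; one valid choice of collision-resistant hash.
const h = (s: string): string => createHash("sha256").update(s).digest("hex");

// contextHash = H(S || join_i(p_i || ":" || H(c_i)))
function contextHash(diffSummary: string, files: { path: string; content: string }[]): string {
  return h(diffSummary + files.map((f) => f.path + ":" + h(f.content)).join(""));
}

// cacheKey = "aidev:" || provider || ":" || model || ":" || promptHash || ":" || contextHash
function cacheKey(provider: string, model: string, promptHash: string, ctxHash: string): string {
  return ["aidev", provider, model, promptHash, ctxHash].join(":");
}
```

Because the key is a pure function of provider, model, prompt, and context, two equivalent requests collide on the same cache entry and the second one spends nothing.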
11.4 Cache objective
Caching exists to prevent repeated spend on semantically equivalent work.
11.5 Cache store requirement
The cache interface must support:
- get(key)
- set(key, value, ttlSeconds)
The reference implementation may be in-memory. Production implementations may use external stores.
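An in-memory reference implementation of that interface might look like the following sketch; the injectable clock is an assumption added here so expiry is testable deterministically.

```typescript
// Minimal in-memory TTL cache exposing the required get/set surface.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAtMs: number }>();

  constructor(private now: () => number = Date.now) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAtMs) {
      this.store.delete(key); // lazy expiry on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, ttlSeconds: number): void {
    this.store.set(key, { value, expiresAtMs: this.now() + ttlSeconds * 1000 });
  }
}
```

A production store (Redis or similar) would replace this class behind the same two-method surface.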
12. Budget Enforcement
12.1 Runtime counters
For a process-local runtime:
c = calls made
ti = cumulative input tokens
to = cumulative output tokens
12.2 Enforcement predicates
A request is permitted iff:
c < C_max
ti < I_max
to < O_max
If any predicate fails, the run must reject with an explicit budget error.
12.3 Budget enforcement timing
Budget checks occur before provider invocation.
12.4 Increment semantics
Counters are incremented only after a provider response is received.
12.5 Operational note
Process-local counters are sufficient for local/demo runs. Shared production environments may require durable or distributed budget state.
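A process-local sketch of the counters and predicates: permit() is evaluated before provider invocation and record() only after a response arrives, matching the timing and increment rules stated above.

```typescript
type Budget = {
  maxCalls: number;
  maxTotalInputTokens: number;
  maxTotalOutputTokens: number;
};

// Process-local budget state. permit() implements the enforcement predicates
// (c < C_max, ti < I_max, to < O_max); record() increments counters only
// after a provider response has been received.
class BudgetTracker {
  private calls = 0;
  private inputTokens = 0;
  private outputTokens = 0;

  constructor(private budget: Budget) {}

  permit(): boolean {
    return (
      this.calls < this.budget.maxCalls &&
      this.inputTokens < this.budget.maxTotalInputTokens &&
      this.outputTokens < this.budget.maxTotalOutputTokens
    );
  }

  record(usage: { inputTokens: number; outputTokens: number }): void {
    this.calls += 1;
    this.inputTokens += usage.inputTokens;
    this.outputTokens += usage.outputTokens;
  }
}
```

A distributed deployment would swap the private counters for durable shared state behind the same permit/record surface.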
13. Provider Adapter Contract
13.1 Provider adapter purpose
The provider adapter isolates model-vendor specifics from core pipeline logic.
13.2 Minimum interface
The adapter must expose a call surface equivalent to:
interface ProviderAdapter {
provider: LLMProvider;
call(
req: LLMRequest,
prompt: string,
): Promise<{
provider: LLMProvider;
model: string;
rawText: string;
usage: { inputTokens: number; outputTokens: number };
}>;
}
13.3 Stub provider
A stub provider shall be supported for:
- public skeletons
- offline demos
- deterministic test harnesses
- safe CI demonstrations
13.4 Provider principle
The provider is replaceable. Core safety posture may not depend on proprietary provider behavior.
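A stub provider satisfying the adapter contract in 13.2 might look like the sketch below. The `LLMProvider` and `LLMRequest` shapes are simplified stand-ins for types the spec defines elsewhere, and the canned response and token heuristic are illustrative.

```typescript
type LLMProvider = "stub";
interface LLMRequest {
  model: { name: string };
}

interface ProviderAdapter {
  provider: LLMProvider;
  call(
    req: LLMRequest,
    prompt: string,
  ): Promise<{
    provider: LLMProvider;
    model: string;
    rawText: string;
    usage: { inputTokens: number; outputTokens: number };
  }>;
}

const stubProvider: ProviderAdapter = {
  provider: "stub",
  async call(req, prompt) {
    // Deterministic canned response: safe for public skeletons, offline
    // demos, deterministic test harnesses, and CI demonstrations.
    return {
      provider: "stub",
      model: req.model.name,
      rawText: JSON.stringify({ findings: [], summary: "stub response" }),
      usage: { inputTokens: Math.ceil(prompt.length / 4), outputTokens: 8 },
    };
  },
};
```

Because the adapter boundary is the only vendor-facing surface, swapping this stub for a real provider leaves the core pipeline untouched.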
14. Schema Validation Boundary
14.1 Boundary definition
The schema boundary is the point where raw model text may become acceptable structured input.
14.2 Required behavior
The system must:
- parse raw text as JSON
- validate the resulting object as a Report
- reject malformed or invalid output
14.3 Structural validity vs correctness
Schema validity only means structure is acceptable. It does not certify truth, completeness, or sound reasoning.
14.4 Failure mode
Invalid JSON or invalid report structure must terminate the request as failure.
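The boundary in 14.1–14.4 can be sketched as a single parse-and-validate step. The `Report` shape below is an illustrative placeholder, not the spec's actual schema, and a production implementation would likely use a schema library rather than hand-written checks.

```typescript
// Illustrative Report shape; the real schema is defined by the spec elsewhere.
interface Report {
  findings: Array<{ title: string; severity: string }>;
  summary: string;
}

function parseAndValidate(rawText: string): Report {
  let obj: unknown;
  try {
    // Invalid JSON terminates the request as failure (14.4).
    obj = JSON.parse(rawText);
  } catch {
    throw new Error("schema boundary: response is not valid JSON");
  }
  const r = obj as Partial<Report>;
  const structurallyValid =
    typeof r === "object" &&
    r !== null &&
    Array.isArray(r.findings) &&
    r.findings.every(
      (f) => typeof f?.title === "string" && typeof f?.severity === "string",
    ) &&
    typeof r.summary === "string";
  if (!structurallyValid) {
    // Structural validity only; passing this check does not certify
    // truth, completeness, or sound reasoning (14.3).
    throw new Error("schema boundary: JSON does not match Report structure");
  }
  return r as Report;
}
```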
15. Audit Event Requirements
15.1 Event classes
At minimum, audit must support:
- llm_request
- llm_response
- llm_error
15.2 Minimum request event fields
- kind
- requestId
- timestampMs
- provider
- model
- promptHash
- contextHash
15.3 Minimum response event fields
- all request event fields
- outputHash
- usage
15.4 Minimum error event fields
- all request event fields where available
- error name
- error message
15.5 Structured emission
Audit events must be machine-ingestible, preferably JSON-structured.
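The three event classes and their minimum fields (15.1–15.4) can be sketched as a discriminated union emitted as JSON lines. The type names and the `errorName`/`errorMessage` field spellings are illustrative assumptions; the spec lists fields, not identifiers.

```typescript
interface AuditRequestEvent {
  kind: "llm_request";
  requestId: string;
  timestampMs: number;
  provider: string;
  model: string;
  promptHash: string;
  contextHash: string;
}

// Response events carry all request fields plus outputHash and usage (15.3).
type AuditResponseEvent = Omit<AuditRequestEvent, "kind"> & {
  kind: "llm_response";
  outputHash: string;
  usage: { inputTokens: number; outputTokens: number };
};

// Error events carry the request fields where available plus error details (15.4).
type AuditErrorEvent = Omit<AuditRequestEvent, "kind"> & {
  kind: "llm_error";
  errorName: string;
  errorMessage: string;
};

type AuditEvent = AuditRequestEvent | AuditResponseEvent | AuditErrorEvent;

// One JSON object per line keeps the stream machine-ingestible (15.5).
function emit(event: AuditEvent): string {
  return JSON.stringify(event);
}
```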
16. Agent Implementation Requirements
16.1 Agent contract
Each agent must:
- create an LLMRequest
- bind to a versioned template
- supply a concrete model spec
- pass typed context
- return a Report
16.2 Required current agent classes
- SpecLint
- PRSynthesis
- ThreatSketch
16.3 ThreatSketch special constraint
ThreatSketch must remain conceptual. It may classify risks and mitigations, but may not output exploitation steps.
16.4 Agent determinism rule
Agents may vary in prompt content and task definition, but not in core safety boundary behavior.
17. Runner Requirements
17.1 Local runner
The local runner must support demonstration execution using fixed example context and render advisory markdown.
17.2 GitHub runner
The GitHub runner must model or implement:
- webhook signature verification
- PR metadata extraction
- installation token acquisition or workflow-token use
- changed-file retrieval
- path eligibility filtering
- pipeline execution
- advisory PR comment rendering
17.3 GitHub safety requirement
The GitHub path must default to read-only review surfaces such as comments or checks. It must not imply merge authority.
18. GitHub App / Webhook Model
18.1 Signature verification
Webhook-driven operation requires deterministic verification of the GitHub signature before processing payload content.
18.2 Installation token minting
If operating as a GitHub App, installation tokens must be minted per installation and scoped minimally.
18.3 Changed-file fetching
Only PR files relevant to the advisory pipeline may be fetched.
18.4 Policy application
Fetched files must be filtered by policy prior to downstream use.
18.5 Comment rendering
Rendered comments should state clearly that the result is advisory and schema-gated, not authoritative.
19. Pseudocode
19.1 Core request pipeline
function INVOKE(req, policy, cache, audit, provider):
redactedReq = APPLY_REDACTION(req)
if policy.promptInjectionGuard == true:
ASSERT_NO_PROMPT_INJECTION(MATERIAL_FOR_GUARD(redactedReq))
prompt = BUILD_PROMPT(redactedReq)
promptHash = HASH(prompt)
contextHash = HASH_CONTEXT(redactedReq.context)
cacheKey = BUILD_CACHE_KEY(policy, redactedReq.model, promptHash, contextHash)
if cache exists:
hit = cache.get(cacheKey)
if hit exists:
return hit
ENFORCE_BUDGET(policy.budget)
audit.record(REQUEST_EVENT(...))
try:
raw = provider.call(redactedReq, prompt)
parsed = PARSE_AND_VALIDATE(raw.rawText, redactedReq.outputSchema)
UPDATE_RUNTIME_COUNTERS(raw.usage)
outputHash = HASH(JSON.stringify(parsed))
response = BUILD_RESPONSE(parsed, raw, promptHash, contextHash, outputHash)
audit.record(RESPONSE_EVENT(...))
if cache exists:
cache.set(cacheKey, response, ttlSeconds)
return response
catch err:
audit.record(ERROR_EVENT(...))
raise err
19.2 Path eligibility
function IS_ALLOWED_PATH(path, allowPaths, denyPaths):
for d in denyPaths:
if path startsWith d:
return false
for a in allowPaths:
if path startsWith a:
return true
return false
19.3 Agent runner pattern
function RUN_AGENT(agentTemplate, ctx):
req = {
requestId: BUILD_REQUEST_ID(agentTemplate, ctx),
system: agentTemplate.system,
task: agentTemplate.task,
constraints: agentTemplate.constraints,
outputSchema: agentTemplate.outputSchema,
model: SELECT_MODEL(agentTemplate),
context: {
diffSummary: ctx.diffSummary,
files: ctx.changedFiles
}
}
res = LLM_CLIENT.invoke(req)
return res.parsed
20. Evaluation and Metrics
20.1 Primary evaluation principle
The system must be evaluated by engineering outcomes, not token volume.
20.2 Suggested metrics
- reduction in human review time
- number of ambiguities caught before merge
- contradiction detection rate
- false positive rate
- structured output acceptance rate
- cache hit rate
- provider failure rate
- rejected unsafe-context rate
- budget-overrun frequency
- audit completeness rate
20.3 Quality lens
A system that spends fewer tokens but leaks secrets or produces unactionable noise is not successful.
21. Security Requirements
21.1 Secrets
Secrets must never be intentionally included in provider-bound prompt context.
21.2 PII
PII-bearing material must be excluded or redacted according to policy.
21.3 Write access
Write-capable automation must remain disabled unless explicitly approved and separately reviewed.
21.4 Supply chain
Dependencies used in CI or webhook execution should be minimal, pinned where appropriate, and reviewable.
21.5 Output treatment
Even validated output must remain advisory unless a separate deterministic control layer explicitly promotes a subset of behavior.
22. Failure Modes and Required Handling
22.1 Prompt injection guard triggered
Result: reject request, record error audit event.
22.2 Path not allowed
Result: exclude file or reject run depending on runner policy.
22.3 Redaction alters material significantly
Result: continue if structure remains usable; otherwise surface limited-result state.
22.4 Cache unavailable
Result: continue without cache if safety posture is preserved.
22.5 Budget exceeded
Result: reject before provider invocation.
22.6 Provider failure
Result: record error audit event and surface failure.
22.7 Invalid JSON
Result: reject response.
22.8 Schema mismatch
Result: reject response.
22.9 Audit sink failure
Preferred result: surface operational error; do not silently claim successful audit if audit failed.
23. Test Requirements
23.1 Unit tests
Minimum expected unit coverage should include:
- path policy evaluation
- redaction substitution
- prompt injection heuristics
- prompt assembly
- schema validation success/failure
- budget enforcement
- cache hit/miss behavior
- audit event formatting
23.2 Integration tests
Minimum expected integration coverage should include:
- local runner end-to-end with stub provider
- GitHub runner path filtering
- advisory comment rendering
- invalid response rejection path
23.3 Security-oriented tests
Minimum adversarial test cases should include:
- injected override strings in diffs
- secret-like material in changed files
- denylisted paths in PR file lists
- malformed JSON responses
- structurally valid but empty reports
24. CI/CD Expectations
24.1 CI role
CI is used to verify deterministic correctness around the DevKit itself, not to treat model output as a release authority.
24.2 CI checks
Expected checks include:
- formatting
- linting
- type checking
- unit tests
- integration tests where safe
- workflow syntax validation
24.3 Public skeleton safety
In public or demonstration contexts, provider calls should remain stubbed unless explicitly configured otherwise.
25. Acceptance Criteria
Implementation satisfies this spec when all of the following are true:
- typed requests can be constructed and executed
- policy-based path filtering works as specified
- redaction executes before provider call
- prompt injection screening can reject suspicious content
- prompt envelopes are assembled in labeled sections
- prompt and context hashes are generated deterministically
- cache hits bypass provider calls
- budget enforcement blocks overrun conditions
- provider adapters can be swapped without changing core logic
- invalid JSON responses are rejected
- invalid report structures are rejected
- audit events are emitted for request/response/error paths
- agents return structured reports
- local runner can produce advisory markdown
- GitHub runner can model or execute advisory PR workflow safely
- no code path grants implicit merge or deploy authority to AI output
26. Implementation Notes
26.1 Public skeleton vs production implementation
The current repository may use lightweight validators, in-memory cache, and stub provider surfaces. That is acceptable for the public skeleton. Production-hardening may replace those internals without changing the architectural contract defined here.
26.2 Behavioral invariants that must not drift
The following invariants are mandatory:
- AI output remains advisory
- deterministic validation remains authoritative
- provider access happens only after safety preflight
- schema failure rejects output
- budget is bounded
- path policy is enforced
- audit remains structured
- read-only is the default integration posture
27. Summary
This engineering specification defines an AI-assisted engineering framework that is useful precisely because it is constrained.
The system is not valuable when it is permissive. It is valuable when it is:
- structured
- bounded
- reviewable
- cheap enough to operate
- difficult to misuse
- explicit about authority
That is the implementation contract for the Thesis Chain AI DevKit.
CI/CD Integration
CI/CD is not just a deployment mechanism. In systems like this, CI/CD is part of the control surface. It enforces the difference between “interesting idea” and “repeatable engineering behavior.”
For an AI-assisted workflow, CI/CD must enforce at least four things:
- Deterministic execution paths
Cheap deterministic checks should run first and block unnecessary model calls. - Bounded permissions
CI jobs should default to read-only behavior, especially around repository state and merge authority. - Auditable artifacts
Outputs should be storable, reviewable, and attributable to a run context. - Version-locked automation
Actions, templates, schemas, and policies should be pinned so behavior does not drift silently.
In practice, this means the pipeline treats AI as a bounded advisory subsystem. It can inspect PR diffs, produce structured comments, and surface contradictions or risk, but it does not silently mutate production state.
The important point is architectural: CI/CD is where enforcement lives. If the rules are not enforced in the pipeline, then they are preferences, not controls.
Agentic Development Pipeline
This is the part most people misunderstand. Agentic development does not mean “use the most powerful model on everything.” It means divide work into classes, apply deterministic gates, route tasks to the cheapest sufficient capability, inspect aggressively, and preserve human authority over consequential decisions.
- Loop Control. Agent loops must be bounded. Maximum calls, maximum retries, maximum token budgets, and explicit stop conditions are part of the system contract.
- Task-Class Routing. Mechanical checks, narrow verification, contradiction detection, and low-ambiguity work should go to cheaper model tiers or deterministic tooling first. Higher-cost reasoning should be reserved for synthesis-heavy or ambiguous tasks.
- Inspection Before Escalation. The system should not escalate spend just because a model produced an answer. It should inspect quality, confidence, structure, and policy conformance before deciding whether more expensive reasoning is justified.
- Human-in-the-Loop as Authority. Human review is not an apology for the system. It is the authority boundary. Humans own interpretation, exception handling, merge authority, and architectural direction.
- Token Cost as Design Input. Token usage is not a dashboard vanity metric. It is an input into architectural choices. Model selection, prompt size, context shape, cache strategy, and retry policy all exist to prevent spending from becoming chaotic.
- Auditability Over Cleverness. A boring, inspectable loop is superior to a clever opaque loop. In practice, predictable bounded systems outperform magical-looking systems over time.
This is why I do not treat model choice as a status symbol. I treat it as routing policy. Different work deserves different tools. Better systems come from disciplined orchestration, not maximal model spend.
Human Inspection Roles
Human inspection remains central in any serious AI-assisted engineering system. The goal is not to remove humans from the loop. The goal is to remove low-value repetitive work while preserving human judgment where ambiguity, business context, risk, or architecture matter.
- Quality Control. Humans validate whether the output is actually useful, not merely well-formed.
- Architectural Arbitration. Humans decide when a system behavior is technically possible but strategically wrong.
- Infra and Policy Control. Humans own permissions, deployment boundaries, policy changes, and escalation paths.
- Exception Handling. Humans interpret edge cases, conflict states, and cross-domain ambiguity.
In other words: AI can accelerate analysis, summarization, contradiction discovery, and report generation. It should not silently inherit decision authority just because it is fast.
Security Architecture
Security is not a final checklist item. In AI-assisted systems it must be designed into every upstream layer: input handling, context assembly, provider boundaries, output validation, CI permissions, and operational review.
- Prompt Injection Resistance. PR authors, diffs, and input payloads are untrusted. Context must be screened before the provider call.
- Data Exfiltration Prevention. Sensitive paths, secrets, PII, and irrelevant configuration must be denied or redacted before context assembly.
- Least-Privilege Automation. Default pipeline permissions should be read-only unless explicit write behavior is required and reviewed.
- Authority Separation. AI output may be structured and useful without being authoritative. Deterministic checks and human review remain the source of actual control.
- Supply-Chain Discipline. Dependencies, Actions versions, templates, and schemas should be pinned so automation does not drift into unknown behavior.
- Visible Failure States. Unsafe or malformed behavior should surface as explicit failure. Silent recovery hides risk.
The shortest honest summary is this: safe agent systems are built by distrusting them correctly.
Case 02 — Human Agentic Pipeline
This case study documents the operating model behind a human-led agentic development pipeline. The objective is not to simulate autonomous magic. The objective is to design a system in which AI can accelerate engineering work without dissolving accountability, architectural control, or verification discipline.
In this model, AI is routed into bounded roles inside a controlled workflow. Humans retain authority over judgment, quality control, infrastructure, and final acceptance. The system is designed to produce auditable artifacts, visible checkpoints, deterministic handoff boundaries, and repeatable outputs rather than vague conversational momentum.
Problem Definition
Most “agentic” workflows fail for one of two reasons. Either they are too loose and devolve into expensive improvisation, or they are so tool-driven that no one can explain where authority lives, why a change happened, or whether the output still matches the original specification.
The engineering problem addressed here is therefore:
How do you structure a human-led, AI-assisted development system that can produce meaningful velocity while preserving deterministic phase order, verification gates, explicit authority boundaries, and drift resistance?
The answer is not “more autonomy.” The answer is architecture. Agentic systems only become useful when their behavior is constrained more like a build pipeline and less like a free-form assistant.
Operating Constraints
- Strict Phase Order. Work must progress in a declared sequence. Architecture cannot be skipped, verification cannot be hand-waved, and implementation cannot silently rewrite system intent.
- No Spec Drift. The process is anchored to canonical blueprints and engineering specs. If the output cannot be traced back to those anchors, it is drift.
- No Hidden Authority. Roles are separated. An implementation agent does not gain architectural authority merely by writing code first.
- Artifact-Based Work. Each phase should emit inspectable artifacts rather than conversational summaries. The system should leave behind evidence, not just momentum.
- Assumptions Must Collapse to Zero. If critical assumptions remain, the process is not ready to progress. Guessing is treated as a process failure, not a creative virtue.
- Idempotent Passes. Every pass should be independently reproducible. Partial patches, hand-wavy edits, and unbounded "just improve it" loops are not acceptable operating modes.
- Human Final Authority. Human reviewers own the right to accept, reject, redirect, or halt the system at any stage.
Blueprint Architecture
The blueprint for a human agentic pipeline starts by defining role boundaries and execution order before discussing implementation. In a healthy agentic system, “who may decide what” is as important as “what code gets written.”
The structure I use is phase-driven and role-separated. The architect locks anchors and non-negotiables first. Tooling may only express what the architecture already allows. Implementation is scope-constrained to the approved tree. Verification must halt the system on drift rather than negotiate with it.
- Phase 0 — Spec Anchors. Establish canonical files, anchor quotes, and derived non-negotiables. This is where the system proves it understands the assignment before it starts building.
- Phase 1 — Architecture Plan. Define the exact file tree, module boundaries, dependencies, and compliance mapping. No unanchored structure is permitted.
- Phase 1b — Tooling Checklist. Confirm what CI commands, configs, and repo expectations are required. The tooling agent is not allowed to introduce whimsical changes.
- Phase 1c — Verification Gate. Confirm the drift check is empty, anchor coverage is complete, and assumptions are zero before implementation begins.
- Phase 2 — Core Implementation. Emit only files already justified by the architecture plan. No speculative expansion. No structure creep.
- Phase 3 — Hardening. Fix lint, type, and build failures without reopening architecture. Hardening is for compliance and polish, not redesign.
This structure matters because it prevents the most common failure mode in AI-heavy development: implementation racing ahead of architecture and forcing the system to rationalize drift after the fact.
Canonical Blueprint Markdown
The following appendix is mirrored locally from the orchestration lab blueprint and displayed here as canonical markdown.
ExNulla Blueprint
Human Agentic Orchestration Lab (Standalone Showpiece)
Repository (proposed): exnulla-orchestration-lab
Slug: orchestration-lab
Version: 1.1.0 (supersedes human-agentic-trainer v1.0.0)
Owner org: Thesis-Project (professional)
Primary goal: Portfolio-grade, standalone orchestration lab that can optionally embed as a demo via iframe (static-first).
0. Positioning
This project is a standalone orchestration lab that teaches and demonstrates agentic pipeline mechanics with:
- Human transport (copy/paste between ChatGPT Projects) as the default execution provider.
- Deterministic state machine and artifact ledger as the core product.
- A clean upgrade path to API-based providers without rewriting orchestration logic.
It is intentionally “too serious” to be a toy demo.
1. Objectives
1.1 Core educational objectives
Teach (visibly, not abstractly):
- Role separation and instruction boundaries
- Prompt routing and supervisor logic
- Context drift origins, detection, and recovery
- Critic/revision loops and acceptance criteria closure
- Budget discipline, token economy, and trade-offs
1.2 Core product objectives
Provide a reproducible lab environment:
- Deterministic run capture + replay
- Run artifact inspection (graph + diffs + drift flags)
- Failure-mode injection and recovery demonstration
- Formal role contract enforcement (schema validated outputs)
- Cost and budget dashboards (simulated + estimated)
1.3 Optional objective (Phase 2)
Provider adapters for API orchestration (OpenAI/Anthropic/etc.) that reuse the same run state machine.
2. Constraints and non-goals
2.1 Constraints
- Static-first deployment: default build outputs a static web app.
- Atomic deploy friendly: build artifact can be deployed with symlink flips.
- Iframe-safe: must function correctly when embedded in an iframe sandbox.
- No scraping / no UI automation: human transport remains manual by design.
2.2 Non-goals (v1.1)
- No live ChatGPT UI integration.
- No storing personal secrets or API keys in the browser (Phase 2 moves to server runtime).
- No “magic” agent framework wrapper that hides orchestration mechanics.
3. Target users
- Learners: understand orchestration by running guided pipelines.
- Hiring reviewers: see a polished, deterministic systems artifact with auditability.
- Future-you: use specs + blueprint to build an API agent framework later without drift.
4. High-level architecture
4.1 Components
LOC (Local Orchestration Console)
- Runs locally (dev) and/or as a static app (prod) with persistence in browser storage and export/import.
- Generates role prompts, enforces contracts, logs turns, computes budgets, flags drift, scores rubrics.
Run Ledger + Artifact Store
- Run JSON artifacts are canonical.
- Export is deterministic: same inputs → same run structure (timestamps excluded or normalized).
Inspector UI (Showpiece layer)
- Graph view (turn DAG)
- Drift panels
- Budget/cost panels
- Failure injection controls
- Replay timeline controls
Provider Adapter Layer (Transport abstraction)
- HumanProvider (v1.1): manual paste-in/out
- SimulatedProvider (v1.1): fake latency/cost/reliability without APIs
- API Providers (v2+): optional later
4.2 “Square peg / round hole” mitigation
This repo is designed as standalone. If embedded into exnulla-demos, it is treated as a static build artifact embedded via iframe with a constrained integration contract (Section 13).
5. Deterministic state model
5.1 Canonical run artifact
runs/<RUN_ID>/run.json
Minimum fields:
- schemaVersion (semver-like)
- gitSha (injected at build time)
- runId
- createdAt (optional; normalized for deterministic replay exports)
- scenarioId (the selected training scenario)
- roles[] (role profiles and constraints)
- turns[] (ordered, each with routing metadata and validation results)
- artifacts[] (files/snippets produced by turns)
- budgets (per-turn + cumulative)
- rubric (scoring + thresholds)
- drift (flags + evidence + severity)
- acceptance (pass/fail + reasons)
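The minimum fields above can be expressed as a partial TypeScript shape for `runs/<RUN_ID>/run.json`. The nested shapes (turns, rubric, drift, and so on) are abbreviated illustrations, not the canonical schema.

```typescript
// Partial, illustrative shape of the canonical run artifact.
interface RunArtifact {
  schemaVersion: string; // semver-like
  gitSha: string;        // injected at build time
  runId: string;
  createdAt?: string;    // optional; normalized for deterministic replay exports
  scenarioId: string;
  roles: Array<{ id: string; constraints: string[] }>;
  turns: Array<{ index: number; role: string; valid: boolean }>;
  artifacts: Array<{ id: string; path: string }>;
  budgets: { perTurn: number[]; cumulative: number };
  rubric: { scores: Record<string, number>; thresholds: Record<string, number> };
  drift: Array<{ flag: string; evidence: string; severity: "info" | "warn" | "error" }>;
  acceptance: { pass: boolean; reasons: string[] };
}
```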
5.2 Deterministic replay guarantee
Given:
- Same scenarioId
- Same initial inputs
- Same turn responses (copied)
- Same schemaVersion
Then:
- The run artifact validation and derived metrics must match.
6. Role system
6.1 Default roles
- architect
- developer
- critic
- tester
- (optional) supervisor (internal; LOC-driven orchestration)
6.2 Required ChatGPT Project setup (Human Provider)
Each role is configured as its own ChatGPT Project with persistent instructions.
The LOC provides:
- Copy-paste “Project Instructions” templates per role.
- A “Project Setup Checklist” with validation steps.
6.3 Formal role contract enforcement (new)
Each role response must conform to a strict schema (e.g., JSON or structured markdown blocks).
LOC validates:
- Schema validity
- Required fields present
- Artifact references resolvable
- No forbidden sections (role boundary rules)
If invalid:
- LOC flags a contract violation.
- LOC generates a corrective “format repair” prompt for the same role.
7. Drift detection and recovery
7.1 Drift signals (v1.1)
Rule-based detection, including:
- Missing constraints or acceptance criteria
- Contradictions vs. scenario requirements
- Output schema violations
- Spec deviations (e.g., wrong repo, wrong language, ignored deterministic rules)
- Over-budget warnings and verbose inflation
- “Unresolved questions” not propagated
7.2 Drift scoring
Each signal adds weighted severity:
- info / warn / error
- Cumulative drift score shown in Inspector UI
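A weighted cumulative drift score of the kind described in 7.2 can be sketched as follows. The specific weight values are illustrative assumptions; the blueprint fixes the severity levels but not the weights.

```typescript
type Severity = "info" | "warn" | "error";

interface DriftSignal {
  rule: string;      // e.g. "missing-constraint" (illustrative rule id)
  severity: Severity;
  evidence: string;  // linked evidence shown in the Inspector UI
}

// Illustrative weights: each signal adds weighted severity to the run score.
const WEIGHTS: Record<Severity, number> = { info: 1, warn: 3, error: 10 };

function driftScore(signals: DriftSignal[]): number {
  return signals.reduce((total, s) => total + WEIGHTS[s.severity], 0);
}
```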
7.3 Recovery loops
LOC generates recovery prompts:
- “Re-anchor constraints” prompt for Architect
- “Patch minimal diff” prompt for Developer
- “Re-evaluate rubric” prompt for Critic
- “Regression / edge-case sweep” prompt for Tester
8. Failure mode injection (new showpiece capability)
8.1 Purpose
Turn the lab into a resilience demonstrator:
- show failures
- show detection
- show recovery
- show cost impact
8.2 Injection modes (v1.1)
- Ambiguous spec: remove/blur key constraints
- Conflicting constraints: intentionally contradict requirements
- Truncated context: simulate missing prior turns
- Bad critic: introduce incorrect critique or wrong rubric thresholds
- Budget crunch: set very low budget caps mid-run
8.3 Implementation concept
Injection modifies:
- scenario inputs
- routing prompts
- role templates
- budget parameters
LOC must record injection events in run artifact (injections[]).
9. Budget and economics (expanded)
9.1 Token estimation
- Estimate tokens from characters (baseline) and/or model-specific heuristics.
- Record per-turn estimate and cumulative.
9.2 Cost simulation
For v1.1 (no real API calls):
- user selects “pricing profile” presets (cheap / mid / premium)
- LOC computes simulated cost per turn and total
- show “what this would cost” with model tiers
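The v1.1 simulation described above (character-based token estimation plus pricing presets) can be sketched like this. The 4-characters-per-token ratio and all dollar figures are illustrative placeholders, not real vendor prices.

```typescript
interface PricingProfile {
  name: string;
  usdPerMillionInputTokens: number;
  usdPerMillionOutputTokens: number;
}

// Illustrative cheap / mid / premium presets; values are placeholders.
const PROFILES: Record<string, PricingProfile> = {
  cheap:   { name: "cheap",   usdPerMillionInputTokens: 0.15, usdPerMillionOutputTokens: 0.6 },
  mid:     { name: "mid",     usdPerMillionInputTokens: 3,    usdPerMillionOutputTokens: 15 },
  premium: { name: "premium", usdPerMillionInputTokens: 15,   usdPerMillionOutputTokens: 75 },
};

// Baseline heuristic: roughly 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Simulated "what this would cost" for a single turn under a pricing profile.
function simulatedTurnCost(
  profile: PricingProfile,
  inputText: string,
  outputText: string,
): number {
  const inTok = estimateTokens(inputText);
  const outTok = estimateTokens(outputText);
  return (
    (inTok * profile.usdPerMillionInputTokens +
      outTok * profile.usdPerMillionOutputTokens) / 1_000_000
  );
}
```

Summing per-turn costs yields the cumulative figure that feeds the burn-down chart and budget warnings.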
9.3 Dashboard outputs
- burn-down chart over time
- per-role share of tokens/cost
- budget threshold warnings
- cost of drift (extra turns caused by drift recovery)
10. Visual Inspector UI (new, high impact)
10.1 Views
- Run Timeline
- turn list with role, timestamp, budget, validation, drift severity
- Turn Graph (DAG)
- nodes: turns
- edges: handoffs / dependencies
- highlights: drift, contract violations
- Diff View
- compare two turns (or two runs) for changes in constraints, artifacts, budgets
- Rubric Panel
- category scores and thresholds
- reasons for pass/fail
- Injection Panel
- list and details of injected failures
10.2 UX principles
- No hidden magic. Every derived conclusion links to evidence.
- Export/import first-class.
- Works in iframe (no popups, no cross-origin dependencies).
11. Multi-model simulation layer (optional in v1.1)
11.1 Why
Prepare learners for API orchestration by teaching tradeoffs:
- latency
- cost
- reliability
- verbosity
11.2 How (without APIs)
Simulated Provider:
- assigns “model personality presets” to roles
- applies constraints (e.g., “fast model tends to be terse and miss edge cases”)
- introduces optional random error rates (seeded for determinism)
All simulation parameters must be recorded in the run artifact.
12. Tech stack and repo shape (static-first)
12.1 Proposed stack
- TypeScript (strict)
- Vite (static build)
- React (or Astro + React islands; choose one)
- Zod (schema validation)
- Vitest (tests)
- ESLint + Prettier (enforced)
- Docker for deterministic builds
12.2 Repo layout (proposed)
exnulla-orchestration-lab/
apps/
loc-web/ # static web app
packages/
core/ # state machine, schemas, scoring, drift
scenarios/ # scenario definitions + injection templates
ui/ # inspector components
cli/ # optional CLI runner/export tools (v1.2+)
runs/ # sample runs (optional; or in /examples)
docs/
blueprint/ # this blueprint
engineering-spec/ # detailed spec (separate doc)
role-instructions/ # ChatGPT Project templates per role
.github/workflows/
Dockerfile
package.json
pnpm-workspace.yaml
12.3 Deterministic build requirements
- Inject GIT_SHA at build time (ARG + ENV)
- Include meta/version.json with git SHA and build timestamp (timestamp optional/normalized)
- Lockfile required (pnpm)
- CI must block merges if lint/test fail
13. Deployment and iframe embedding
13.1 Default deployment (standalone)
- Static build served by nginx or any static host
- Atomic deploy by swapping symlinked build directory
13.2 Iframe embedding (optional)
If embedded in exnulla-site or exnulla-demos:
- build outputs to a single folder root with relative assets
- no service-worker assumptions that conflict with host
- storage uses namespaced keys: exnulla.orchestrationLab.<runId>, etc.
- export/import uses file download/upload, not cross-window messaging
13.3 Integration contract (minimal)
- Provide a single embed URL (e.g., /demos/orchestration-lab/index.html)
- Optionally provide a postMessage integration later (v2+); not required for v1.1
14. Milestones
v1.1.0 (Showpiece baseline)
- Core state machine + run artifact schema
- HumanProvider workflow
- Role contract enforcement + repair prompts
- Drift detection v1 (rules)
- Budget + cost dashboards (simulated)
- Inspector UI with DAG + timeline + rubric
- Failure injection panel + recorded injection events
- Export/import runs (JSON) + deterministic replay validation
- Docker + CI hygiene (lint/test/build)
v1.2.x
- Scenario library expansion (3–6 scenarios)
- CLI utilities for run validation and report generation
- Run comparison tool (diff two runs)
v2.x
- API provider adapters (optional)
- Tool execution hooks (optional)
- Multi-tenant “course mode” (optional)
15. Acceptance criteria
A v1.1 release is “done” when:
- A learner can complete a guided run end-to-end using only copy/paste.
- LOC validates role outputs against the schema and produces repair prompts.
- Drift flags trigger reliably on injected failures.
- Inspector clearly explains why drift was flagged (evidence linked).
- Exported run artifact can be imported and replay-validated deterministically.
- Static build deploys cleanly and works in an iframe.
- CI enforces strict TypeScript, linting, formatting, and tests.
- meta/version.json exposes the build SHA.
16. Notes on scope control
This is a showpiece, but it stays manageable by enforcing:
- Deterministic core first
- UI second (inspector)
- Scenario count limited in v1.1
- Simulation kept optional and seeded (no randomness without seed)
17. Deliverables (docs)
This blueprint implies the following docs in-repo:
- docs/blueprint/exnulla-blueprint-orchestration-lab-1-1-0.md (this file)
- docs/engineering-spec/exnulla-engineering-spec-orchestration-lab-1-1-0.md (next step)
- docs/role-instructions/*.md (ChatGPT Project templates)
- docs/runbook/DEPLOY.md (atomic static deploy)
- docs/runbook/IFRAME.md (embedding contract)
18. Repo naming rationale
Recommended: exnulla-orchestration-lab
Signals “serious systems lab” rather than “toy demo,” while staying on-brand.
Alternate options:
- exnulla-agentic-lab
- exnulla-orchestrator-lab
- exnulla-human-to-api-orchestration
Engineering Specifications
The engineering spec for this operating model does not merely describe features. It defines behavioral law for the build process itself. That includes output format, file authority, acceptance gates, CI discipline, and what kinds of changes are explicitly forbidden.
In practical terms, the spec must answer these questions:
- Which files are canonical inputs to the build?
- What exact artifacts must each phase emit?
- What counts as drift?
- What work is allowed in hardening versus architecture?
- How are assumptions surfaced and eliminated?
- How does verification prove coverage rather than imply it?
- How is output constrained so the system remains reproducible?
1. Output Discipline
Full-file emission matters because it prevents hidden partial edits, accidental omissions, and conversational patch ambiguity. The system should produce complete artifacts, not vague change suggestions.
2. Structure Discipline
New files may only exist if they are explicitly defined in the spec or derived in the architecture plan with anchor mapping. Unanchored structure is drift.
3. Verification Discipline
Verification is not a final glance at output quality. It is a formal gate with required proof: drift check empty, anchor coverage present, assumptions empty.
4. CI Discipline
The process assumes lint, typecheck, and build are mandatory. The agentic workflow is not complete because it “looks right.” It is complete when the repo gates are green.
5. Idempotency Discipline
Every pass should be reproducible from scratch. The pipeline should not rely on hidden chat context, implicit globals, or fragile one-off edits that cannot be replayed.
6. No-Hidden-Globals Rule
Environment requirements, allowed inputs, and tool expectations must be explicit. Invisible ambient state is a major source of drift and operational failure.
Canonical Engineering Spec Markdown
The following appendix is mirrored locally from the orchestration lab engineering spec and displayed here as canonical markdown.
ExNulla Engineering Spec
Human Agentic Orchestration Lab (Standalone Showpiece)
Repository: exnulla-orchestration-lab
Slug: orchestration-lab
Spec Version: 1.1.0
Blueprint: exnulla-blueprint-orchestration-lab-1-1-0.md
Owner org: Thesis-Project
Primary mode: Static-first web app (iframe-safe)
Provider mode (v1.1): Human transport + simulated provider (no APIs)
Last Updated (UTC): 2026-02-27T00:00:00Z
0. Scope and determinism contract
0.1 What this spec is
An implementation-grade engineering spec for a standalone orchestration lab that:
- makes orchestration mechanics visible (role separation, routing, drift, budgets),
- captures every run as a deterministic run artifact ledger (`run.json`),
- provides an inspector UI (timeline, DAG, diffs, rubric, injections),
- supports export/import + deterministic replay validation,
- works in an iframe sandbox and deploys as an atomic static artifact.
This spec is written so that it can later be handed back with the single instruction "build it" and executed with minimal drift.
0.2 Hard constraints (MUST)
- Static-first: `pnpm build` outputs a static bundle that can be hosted by nginx or any static host.
- Iframe-safe: no popups, no cross-origin assumptions, no top-level navigation hacks.
- No UI automation/scraping: human transport is manual by design.
- Deterministic core: orchestration/state evaluation must be deterministic given the same inputs + responses.
- Export/import first-class: runs are portable JSON artifacts; UI can import/export.
- No secrets: browser build stores no API keys; v1.1 has no real provider calls.
- Repo hygiene: TypeScript strict, ESLint + Prettier, tests, Docker deterministic build.
0.3 Non-goals (v1.1)
- Live integration with ChatGPT UI.
- Multi-user authentication / cloud persistence.
- Real API providers (OpenAI/Anthropic/etc.) beyond interface stubs.
- ML-based drift classification (rule-based + evidence only).
0.4 Deterministic replay guarantee (MUST)
Given:
- identical `scenarioId`,
- identical scenario inputs,
- identical injection set (including seed),
- identical agent responses pasted into the ledger,
- identical `schemaVersion`,

then: validation results, drift flags, rubric scores, budget totals, and derived digests MUST match.
Allowed non-determinism:
- wall-clock timestamps can exist but MUST be excluded from deterministic checks (or normalized under export).
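The replay guarantee above can be sketched as a validation routine. This is an illustrative sketch, not the shipped implementation: `deriveAll` is a hypothetical stand-in for the core derivation pipeline, and the field names follow the RunArtifact schema in section 5.

```typescript
type ReplayResult = { matches: boolean; mismatchedFields: string[] };

// Recompute all derived ledgers from the artifact's inputs + responses
// and compare against what the artifact stored. Timestamps are the only
// allowed non-determinism, so they are stripped before comparison.
function replayValidate(
  artifact: Record<string, any>,
  deriveAll: (a: Record<string, any>) => Record<string, any>,
): ReplayResult {
  const { createdAt, updatedAt, ...normalized } = artifact;
  const recomputed = deriveAll(normalized);
  const mismatchedFields: string[] = [];
  for (const field of ['derived', 'drift', 'rubric', 'budgets']) {
    if (JSON.stringify(recomputed[field]) !== JSON.stringify(artifact[field])) {
      mismatchedFields.push(field);
    }
  }
  return { matches: mismatchedFields.length === 0, mismatchedFields };
}
```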
1. Product definition
1.1 Core workflows
- Create run
  - user selects scenario, provider mode, seed, budget/cost profile, and optional injections.
- Generate routed prompt
  - LOC produces a prompt for a role and explicit routing instructions.
- Human transport
  - user executes the prompt in the role's ChatGPT Project and pastes the response into the LOC.
- Validate + score
  - LOC validates schema/format, computes budgets/cost, flags drift, updates rubric, derives next step.
- Inspect
  - user inspects timeline, graph, diffs, drift evidence, rubric reasoning, injection events.
- Export / Import
  - export run as JSON (and optional markdown transcript); import later and replay-validate deterministically.
- Compare
  - compare runs (or turns) via diff UI (v1.1: within one run; v1.2: cross-run).
1.2 Target user profiles
- Learner / developer wanting “pre-calc → calc” understanding of orchestration.
- Hiring reviewers assessing systems thinking + determinism discipline.
- Future-you using the ledger/state machine for API orchestration later.
2. Architecture overview
2.1 Packages (MUST)
- `packages/core`: deterministic state machine, schemas, scoring, drift, budgets, providers, export/import, deterministic hashing.
- `packages/scenarios`: scenario definitions, injection templates, seeded simulation knobs, scenario validation.
- `packages/ui`: shared UI components (graph, diff, panels), pure/presentational where possible.
- `apps/loc-web`: Vite + React static web app (run wizard, prompt router, paste console, inspector).
2.2 Runtime boundaries
- All deterministic logic lives in `packages/core` and must be usable:
  - from the web app, and
  - from future CLI tooling (v1.2+).
- The web app is a thin shell around the core.
2.3 Transport / provider abstraction
- `HumanProvider` (v1.1): manual paste; produces routing instructions only.
- `SimulatedProvider` (v1.1): produces deterministic, seeded "simulated outputs" for demonstration/testing.
- `ApiProvider` (v2+): stub interface only in v1.1 (no keys, no calls).
3. Tech stack and repo standards
3.1 Required stack (MUST)
- Node.js LTS (recommend 20.x)
- TypeScript with `strict: true`
- pnpm + lockfile
- Vite + React (single-page app)
- Zod for runtime validation
- Vitest for unit/integration tests
- ESLint + Prettier enforced
- Docker for deterministic builds
3.2 Deterministic build provenance (MUST)
- Build accepts `ARG GIT_SHA` and injects it into the app as `import.meta.env.VITE_GIT_SHA` (Vite) and/or `process.env.GIT_SHA` (tests/build scripts).
- Build outputs `meta/version.json` containing:
  - `gitSha`,
  - `schemaVersion`,
  - `buildId` (optional; may be derived deterministically from gitSha + package versions),
  - `builtAt` (optional; if present, must be excluded from determinism checks).
4. Repository layout
4.1 Canonical layout (MUST)
exnulla-orchestration-lab/
apps/
loc-web/
index.html
vite.config.ts
src/
app/
routes/
state/
components/
main.tsx
public/
meta/
version.json
packages/
core/
src/
schema/
engine/
providers/
scoring/
drift/
budget/
export/
util/
tests/
scenarios/
src/
scenarios/
injections/
pricing/
tests/
ui/
src/
graph/
diff/
panels/
widgets/
docs/
blueprint/
engineering-spec/
role-instructions/
runbooks/
examples/
runs/
scenarios/
.github/
workflows/
Dockerfile
docker-compose.yml (optional)
package.json
pnpm-workspace.yaml
pnpm-lock.yaml
tsconfig.base.json
eslint.config.js
prettier.config.cjs
4.2 Git ignore rules
- Ignore persisted runs by default:
  - `apps/loc-web/.local/` (dev-only)
  - `**/runs/**` except `examples/runs/**`
- Include:
  - at least one sample run artifact in `examples/runs/` for regression tests and UI demo.
5. Data model: canonical run ledger
5.1 Canonical artifact path semantics
The canonical artifact is a single JSON object:
- Web app storage: stored in browser (IndexedDB preferred; localStorage acceptable for v1.1 with size limits)
- Exported artifact: user downloads a file named `orchestration-lab.run.<runId>.json`
When building a “runs folder” later (CLI), the canonical structure will be:
`runs/<runId>/run.json` (not required for static build)
5.2 Schema versioning
- `schemaVersion` is a semver-like string, pinned to the spec version for v1.1: `"1.1.0"`
- Backward compatibility requirements:
  - v1.1 UI must import artifacts with `schemaVersion` `"1.1.0"`.
  - Future versions must provide migration utilities (v1.2+).
5.3 RunArtifact schema (MUST)
5.3.1 Top-level
export type RunArtifact = {
schemaVersion: '1.1.0';
slug: 'orchestration-lab';
gitSha: string; // injected at build; "unknown" allowed
runId: string; // deterministic id format
createdAt?: string; // ISO; optional for determinism checks
updatedAt?: string; // ISO; optional for determinism checks
mode: {
provider: 'human' | 'simulated'; // v1.1
simulation?: SimulationConfig; // if simulated
};
scenario: {
scenarioId: string;
version: string; // scenario version string, e.g. "1.0.0"
inputs: Record<string, unknown>;
};
injections: InjectionEvent[]; // applied injections, deterministic order
roles: RoleProfile[]; // role contracts + instructions metadata
turns: Turn[]; // append-only
derived: DerivedState; // regenerated deterministically
budgets: BudgetLedger; // token estimates, warnings
economics: EconomicsLedger; // simulated cost and profiles
rubric: RubricLedger; // scoring + thresholds + evidence
drift: DriftLedger; // flags + evidence + severity summary
acceptance: {
passed: boolean;
reasons: string[];
checklist: { item: string; status: 'pass' | 'fail' | 'unknown'; evidence?: string[] }[];
};
};
5.3.2 RoleProfile
export type RoleName = 'architect' | 'developer' | 'critic' | 'tester';
export type RoleProfile = {
role: RoleName;
displayName: string;
chatgptProjectName: string; // user-configurable label
instructionTemplateId: string; // e.g. "role-architect-1.1.0"
contract: RoleContract;
};
export type RoleContract = {
responseFormat: 'structured_markdown_v1' | 'json_v1';
requiredHeaders: string[]; // exact heading strings
requiredSections: string[]; // section ids
forbiddenPatterns: string[]; // regex strings
maxCodeBlockChars?: number; // heuristic for role confusion
mustEchoRunTurnHeader: boolean; // require runId/turnId header block
};
5.3.3 Turn
export type Turn = {
turnId: number; // 1..n
role: RoleName;
prompt: {
templateId: string; // prompt template key
text: string;
charCount: number;
tokenEstimate: number;
stateDigestHash: string; // hash of digest included in prompt
};
response: {
text: string;
charCount: number;
tokenEstimate: number;
parsed?: ParsedResponse; // result of parsing per contract
contractValid: boolean;
contractErrors: string[];
};
analysis: {
driftFlags: DriftFlag[];
rubricScore: RubricScore;
notes: string[]; // deterministic, engine-generated notes only
};
timestamps?: { promptedAt: string; respondedAt: string }; // optional
};
5.3.4 DerivedState (regenerated)
export type DerivedState = {
digest: StateDigest; // compact state summary
digestHash: string; // stable hash of digest
openIssues: Issue[];
artifactsIndex: ArtifactRef[];
loopCountByStage: Record<string, number>;
completion: { done: boolean; nextRole: RoleName | null; stage: Stage };
};
5.3.5 Digest / issues / artifacts
export type Stage = 'kickoff' | 'implementation' | 'review' | 'test' | 'revise' | 'finalize';
export type StateDigest = {
scenarioSummary: string; // scenario-provided summary, bounded
constraints: string[]; // scenario constraints, stable order
acceptanceCriteria: string[]; // stable order
deliverables: string[]; // stable order
lastDecisions: string[]; // last 3 decisions (deterministic extraction)
openQuestions: string[]; // extracted from critic/tester
artifactHints: string[]; // from dev outputs / plan sections
};
export type Issue = {
id: string; // stable hash id
severity: 'info' | 'warn' | 'error';
source: 'critic' | 'tester' | 'engine';
message: string;
evidence: string[];
open: boolean;
};
export type ArtifactRef = {
id: string; // stable hash id
kind: 'snippet' | 'filetree' | 'patch' | 'plan' | 'testplan';
title: string;
producedByTurnId: number;
contentHash: string;
excerpt: string; // bounded excerpt for UI
};
5.4 Deterministic hashing (MUST)
- Use a stable hash for digests, issues, artifacts:
sha256(canonicalJsonString(value))
- Canonical JSON stringification:
- stable key ordering,
- no whitespace variability,
- arrays kept in order.
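A minimal sketch of this hashing rule, assuming Node's `node:crypto` (the browser build would use `crypto.subtle` instead). `canonicalJsonString` and `stableHash` here are illustrative, not the canonical `src/util` implementations.

```typescript
import { createHash } from 'node:crypto';

// Canonical JSON: sorted object keys, no whitespace, arrays in order.
function canonicalJsonString(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJsonString).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const keys = Object.keys(value as object).sort(); // stable key ordering
    const body = keys
      .map((k) => `${JSON.stringify(k)}:${canonicalJsonString((value as any)[k])}`)
      .join(',');
    return `{${body}}`;
  }
  return JSON.stringify(value); // primitives: no whitespace variability
}

// sha256(canonicalJsonString(value)), per the rule above.
function stableHash(value: unknown): string {
  return createHash('sha256').update(canonicalJsonString(value)).digest('hex');
}
```

Two objects with the same content but different key insertion order hash identically, which is what makes digest and artifact ids stable across replays.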
6. Scenario system
6.1 Scenario definition format (MUST)
Scenarios are authored as TypeScript objects in packages/scenarios and exported as a registry.
export type Scenario = {
scenarioId: string; // e.g. "hello-orchestration"
version: string; // semver string
title: string;
summary: string; // bounded summary
description: string;
constraints: string[]; // stable order
acceptanceCriteria: string[]; // stable order
deliverables: string[]; // stable order
roleTemplates: {
architect: PromptTemplateId;
developer: PromptTemplateId;
critic: PromptTemplateId;
tester: PromptTemplateId;
};
initialInputsSchema: z.ZodTypeAny; // validates scenario inputs
defaultInputs: Record<string, unknown>;
rubricProfileId: string; // ties to rubric weights
};
6.2 Required scenarios (v1.1)
Ship 3 scenarios minimum (MUST), each designed to show different drift/failure types:
- `hello-orchestration`: simple deterministic task; emphasizes contracts + budgets.
- `drift-trap-spec`: ambiguous requirements; emphasizes clarification propagation and re-anchoring.
- `regression-loop`: forces test failures and revise loops; emphasizes loop caps and cost-of-drift.
6.3 Scenario determinism rules
- Scenario registry ordering must be stable (sort by `scenarioId`).
- Scenario inputs are validated and stored verbatim in the run artifact.
- Any scenario-generated derived values must be stored or recomputable deterministically.
7. Role system and ChatGPT Project setup
7.1 Role instruction templates (MUST)
Ship templates in docs/role-instructions/:
- `architect.md`
- `developer.md`
- `critic.md`
- `tester.md`
Each template MUST contain:
- Mission
- Allowed outputs
- Forbidden actions
- Required response format contract
- Determinism rules (“no hallucinated filenames; state assumptions explicitly”)
- Interaction protocol for missing info (“ask targeted questions; do not proceed with guesses”)
7.2 Contract format: structured_markdown_v1 (default)
All role responses MUST begin with an exact header block:
# Role: <Architect|Developer|Critic|Tester>
# Run: <runId>
# Turn: <turnId>
Then role-specific sections with fixed headings (examples below). LOC must validate these headings (case-sensitive) as the contract baseline.
Architect required headings
- `## Constraints (Do Not Violate)`
- `## Acceptance Criteria (Checklist)`
- `## System Plan`
- `## Open Questions`
- `## Next Handoff`
Developer required headings
- `## Implementation Plan`
- `## Proposed File Tree`
- `## Patch / Diff`
- `## Notes for Critic`
- `## Next Handoff`
Critic required headings
- `## Contract Validation`
- `## Drift Signals`
- `## Rubric Scoring`
- `## Blocking Issues`
- `## Non-Blocking Suggestions`
- `## Next Handoff`
Tester required headings
- `## Test Plan`
- `## Test Results`
- `## Failures / Repro Steps`
- `## Risk Assessment`
- `## Next Handoff`
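The case-sensitive heading check described above can be sketched as follows. `validateHeadings` is an illustrative name, not the canonical core function; the returned list of missing headings is what would feed the repair-prompt flow in section 7.3.

```typescript
// Sketch of the contract baseline check LOC runs on a pasted response.
// A heading "passes" only on an exact, case-sensitive line match.
function validateHeadings(responseText: string, requiredHeaders: string[]): string[] {
  const lines = responseText.split('\n').map((l) => l.trim());
  // Return the headings that never appear as a standalone line.
  return requiredHeaders.filter((h) => !lines.includes(h));
}
```

An empty return value means the heading baseline is satisfied; any entries become contract errors on the turn.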
7.3 Repair prompts (MUST)
If a response fails contract validation:
- the engine must generate a repair prompt for the same role that:
  - explicitly lists missing headings/fields,
  - instructs the role to rewrite in the required format,
  - forbids changing substantive content beyond formatting unless requested.
Repair events must be recorded as:
- a drift flag `DRIFT_CONTRACT_VIOLATION`,
- plus an engine note explaining the repair required.
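A hedged sketch of a repair-prompt generator for this flow. The exact wording below is an assumption; only the required elements (the echoed run/turn header, the list of missing fields, the rewrite instruction, and the no-substantive-changes rule) come from the spec.

```typescript
// Illustrative repair-prompt builder per the rules above. Prompt text
// is a placeholder, not the normative template.
function buildRepairPrompt(
  role: string,
  runId: string,
  turnId: number,
  missing: string[],
): string {
  return [
    `# Role: ${role}`,
    `# Run: ${runId}`,
    `# Turn: ${turnId}`,
    '',
    'Your previous response failed contract validation.',
    `Missing required headings/fields: ${missing.join(', ')}.`,
    'Rewrite the response in the required format.',
    'Do not change substantive content beyond formatting.',
  ].join('\n');
}
```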
8. Orchestration engine (state machine)
8.1 Engine API surface (MUST)
In packages/core/src/engine/ implement:
export type EngineInput = {
run: RunArtifact;
event: EngineEvent;
};
export type EngineEvent =
| { type: 'INIT_RUN'; scenarioId: string; inputs: Record<string, unknown>; config: RunConfig }
| { type: 'PASTE_RESPONSE'; text: string }
| { type: 'APPLY_INJECTION'; injectionId: string; params?: Record<string, unknown> }
| { type: 'SET_BUDGET_CAP'; tokenEstimateCap: number }
| { type: 'SET_PRICING_PROFILE'; profileId: string }
| { type: 'RESET_TO_TURN'; turnId: number }; // optional v1.1, required v1.2
export type EngineOutput = {
run: RunArtifact; // updated artifact
next: {
role: RoleName | null;
stage: Stage;
routingInstruction?: string;
promptText?: string;
};
diagnostics: {
contractErrors?: string[];
driftFlags?: DriftFlag[];
rubricScore?: RubricScore;
};
};
export function stepEngine(input: EngineInput): EngineOutput;
8.2 Deterministic derivation pipeline (MUST)
On each PASTE_RESPONSE:
- Identify expected role/stage from `run.derived.completion`.
- Validate response contract; parse into `ParsedResponse`.
- Compute charCount + tokenEstimate.
- Run drift detection (rule-based) with evidence.
- Run rubric scoring (rule-based) with evidence.
- Update budgets + economics ledgers.
- Derive `DerivedState` from all prior turns deterministically.
- Choose the next role/stage based on transition rules.
8.3 Transition rules (v1.1) (MUST)
- Stage progression:
  `kickoff (architect)` → `implementation (developer)` → `review (critic)` → `test (tester)` → `finalize (architect)`
- Loops:
  - If critic finds blocking issues OR rubric score is below threshold:
    `review (critic)` → `revise (developer)` → `review (critic)`
  - If tester reports failures:
    `test (tester)` → `revise (developer)` → `review (critic)` → `test (tester)` (as needed)
- Loop caps:
  - `maxReviseLoops` default: 5
  - if exceeded:
    - mark acceptance `passed=false`,
    - force `finalize (architect)` with reasons including "loop cap triggered".
8.4 State digest regeneration (MUST)
Digest is regenerated from:
- scenario summary + constraints + acceptance criteria + deliverables,
- latest Architect “System Plan” section (bounded),
- open issues extracted from critic/tester sections (bounded),
- last 3 decisions extracted from “Next Handoff” sections.
Extraction rules must be deterministic and documented (regex-based with stable ordering).
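One way to sketch the deterministic extraction (line-based rather than regex-based, but with the same stable-ordering property). `extractBullets` and `lastDecisions` are hypothetical names, not the spec'd API.

```typescript
// Pull the bullet items from a named "## ..." section; scanning stops at
// the next section heading, so ordering is stable by construction.
function extractBullets(text: string, heading: string): string[] {
  const lines = text.split('\n');
  const start = lines.indexOf(`## ${heading}`);
  if (start === -1) return [];
  const items: string[] = [];
  for (const line of lines.slice(start + 1)) {
    if (line.startsWith('## ')) break; // next section ends the scan
    if (line.startsWith('- ')) items.push(line.slice(2).trim());
  }
  return items;
}

// Digest rule: keep only the last 3 decisions from "Next Handoff".
const lastDecisions = (text: string): string[] =>
  extractBullets(text, 'Next Handoff').slice(-3);
```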
9. Drift detection
9.1 Drift ledger schema
export type DriftLedger = {
flags: DriftFlag[];
maxSeverity: 'none' | 'info' | 'warn' | 'error';
score: number; // weighted sum
};
export type DriftFlag = {
id: string; // stable code
severity: 'info' | 'warn' | 'error';
message: string;
turnId: number;
evidence: string[]; // exact excerpts or rule hits
category: 'contract' | 'role_boundary' | 'constraint' | 'scope' | 'budget' | 'consistency';
};
9.2 Required drift rules (v1.1)
Contract
- Missing required headings / header block
- Invalid run/turn header values (non-matching runId, non-integer turn)
- Unparseable structured sections
Role boundary
- Architect includes large code blocks over `maxCodeBlockChars` → warn
- Developer includes rubric scoring section → warn
- Critic proposes implementing code changes (not critique) → warn
- Tester proposes architecture changes (not test results) → warn
Constraints
- Mentions forbidden actions (scraping, secrets, automation, “I executed code”, etc.)
- Mentions external network calls if constraint forbids.
Scope
- Introduces new deliverables not in scenario deliverables
- Changes language/stack when constraints fix it
Budget
- Excess verbosity: response token estimate exceeds per-turn ceiling (configurable)
- Budget cap exceeded: error
Consistency
- Contradicts prior accepted constraints/decisions (simple text match + hash checks of constraint lists)
9.3 Drift scoring weights (MUST)
Provide a deterministic scoring table in code:
- info = +1
- warn = +5
- error = +20

Plus per-category multipliers:
- contract ×1.0
- constraint ×1.5
- consistency ×1.2
- budget ×1.1
- scope ×1.3
- role_boundary ×1.0
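The weighted-sum rule above reduces to a small deterministic function over the DriftFlag fields from section 9.1. A sketch:

```typescript
type Severity = 'info' | 'warn' | 'error';
type Category = 'contract' | 'role_boundary' | 'constraint' | 'scope' | 'budget' | 'consistency';

// Scoring tables per section 9.3.
const SEVERITY_POINTS: Record<Severity, number> = { info: 1, warn: 5, error: 20 };
const CATEGORY_MULTIPLIER: Record<Category, number> = {
  contract: 1.0,
  constraint: 1.5,
  consistency: 1.2,
  budget: 1.1,
  scope: 1.3,
  role_boundary: 1.0,
};

// Weighted sum: severity points scaled by the flag's category multiplier.
function driftScore(flags: { severity: Severity; category: Category }[]): number {
  return flags.reduce(
    (sum, f) => sum + SEVERITY_POINTS[f.severity] * CATEGORY_MULTIPLIER[f.category],
    0,
  );
}
```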
10. Rubric scoring
10.1 Rubric ledger schema
export type RubricLedger = {
profileId: string;
thresholds: {
overallPassScore: number; // e.g. 80
maxAllowedDriftSeverity: 'warn' | 'error'; // default "warn"
consecutivePassTurns: number; // default 2
};
scores: RubricScore[];
lastTwoPass: boolean;
};
export type RubricScore = {
turnId: number;
role: RoleName;
score: number; // 0..100
breakdown: {
completeness: number; // 0..25
correctnessSignals: number; // 0..25
constraintAdherence: number; // 0..25
clarity: number; // 0..25
};
evidence: string[]; // bounded list
notes: string[];
};
10.2 Deterministic scoring heuristics (MUST)
Each dimension uses deterministic signals:
- Completeness:
- required headings present,
- acceptance criteria referenced (architect + finalize turns),
- deliverables addressed (developer).
- Correctness signals:
- explicit assumptions list present when needed,
- no contradiction flags,
- critic/tester issues include reproduction/evidence.
- Constraint adherence:
- no constraint drift flags,
- no forbidden patterns.
- Clarity:
- headings + bullet lists,
- bounded verbosity,
- actionable steps in “Next Handoff”.
Rubric code MUST output evidence that can be shown in the UI.
11. Budgeting and simulated economics
11.1 Token estimation (MUST)
- `tokenEstimate = ceil(charCount / 4)`
- Track:
- per-prompt and per-response estimates,
- cumulative totals,
- per-role totals.
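The heuristic and the per-role accumulation might look like this sketch; `addUsage` is an illustrative helper, not a spec'd API.

```typescript
// The spec's token heuristic: roughly 4 characters per token, rounded up.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Accumulate usage into a BudgetLedger-shaped object, both the
// cumulative total and the per-role total.
function addUsage(
  ledger: { used: number; usedByRole: Record<string, number> },
  role: string,
  text: string,
): number {
  const t = estimateTokens(text);
  ledger.used += t;
  ledger.usedByRole[role] = (ledger.usedByRole[role] ?? 0) + t;
  return t;
}
```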
11.2 Budget ledger schema
export type BudgetLedger = {
tokenEstimateCap?: number;
used: number;
usedByRole: Record<RoleName, number>;
warnings: { atTurn: number; severity: 'info' | 'warn' | 'error'; message: string }[];
};
11.3 Warning thresholds (MUST)
If cap exists:
- 70% → warn
- 85% → warn
- 100% → error (require explicit “continue anyway” toggle in UI)
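A sketch of the threshold logic, assuming a cap is set. Both the 70% and 85% tiers map to `warn` here; a real implementation would also track which tier has already fired so each warning is emitted once.

```typescript
// Severity the UI should surface for the current budget ratio.
// 100% is an error and requires the explicit "continue anyway" toggle.
function capSeverity(used: number, cap: number): 'none' | 'warn' | 'error' {
  const ratio = used / cap;
  if (ratio >= 1.0) return 'error';
  if (ratio >= 0.7) return 'warn'; // covers both the 70% and 85% tiers
  return 'none';
}
```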
11.4 Cost simulation (MUST)
No real pricing calls. Provide local profile table:
export type PricingProfile = {
profileId: string; // "cheap" | "mid" | "premium"
title: string;
promptPer1kTokensUSD: number;
completionPer1kTokensUSD: number;
};
export type EconomicsLedger = {
pricingProfileId: string;
simulatedCostUSD: number;
costByRoleUSD: Record<RoleName, number>;
costByTurnUSD: Record<number, number>;
costOfDriftUSD: number; // computed as cost of turns after first drift>=warn
};
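The simulated cost math reduces to a per-turn function over the local pricing table; no network calls are involved. A sketch:

```typescript
// Local pricing table entry, matching the PricingProfile shape above.
type Pricing = {
  promptPer1kTokensUSD: number;
  completionPer1kTokensUSD: number;
};

// Cost of one turn given estimated prompt/completion token counts.
function turnCostUSD(p: Pricing, promptTokens: number, completionTokens: number): number {
  return (
    (promptTokens / 1000) * p.promptPer1kTokensUSD +
    (completionTokens / 1000) * p.completionPer1kTokensUSD
  );
}
```

Summing `turnCostUSD` over turns after the first `warn`-or-worse drift flag yields the `costOfDriftUSD` figure.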
12. Failure mode injection
12.1 Injection model (MUST)
Injections are deterministic transformations applied at run creation or mid-run.
export type InjectionEvent = {
injectionId: string; // stable id
appliedAtTurnId: number; // 0 for pre-run
params: Record<string, unknown>;
seed?: number; // if injection uses randomness
description: string;
};
12.2 Required injection types (v1.1)
- `AMBIGUOUS_SPEC`: removes acceptance criteria items or makes one vague.
- `CONFLICTING_CONSTRAINTS`: injects a contradictory constraint pair and forces architect re-anchoring.
- `TRUNCATED_CONTEXT`: engine includes fewer turn summaries in prompt generation.
- `BAD_CRITIC`: simulated critic produces incorrect critique (sim provider only).
- `BUDGET_CRUNCH`: lowers the cap mid-run and forces a recovery strategy.
12.3 Recording and evidence (MUST)
- Every injection must be recorded in `run.injections[]`.
- Drift detection must reference injections where relevant ("this failure was injected").
13. Prompt generation
13.1 Prompt template requirements (MUST)
Prompt templates must be:
- deterministic,
- minimal history,
- always include the current `StateDigest` (bounded),
- explicitly state the role contract format.
13.2 Prompt generation algorithm (MUST)
- Input:
  - scenario definition,
  - current digest,
  - last N turn summaries (default N=2),
  - injections affecting prompts,
  - budget status.
- Output:
  - a single prompt string.

History inclusion MUST be bounded:
- include only:
  - digest,
  - last N summaries (generated deterministically from parsed role sections),
  - open issues list.
13.3 Prompt provenance
Store in each turn:
- `templateId`,
- the included `digestHash` (so later we can prove the prompt was generated from digest X),
- token estimates.
14. Persistence, export, import
14.1 In-browser persistence (v1.1)
Preferred: IndexedDB via a small wrapper (e.g. idb library) to store:
- run list metadata,
- full run artifacts.
Fallback: localStorage for metadata + compressed run JSON (only if small).
Key namespace (MUST):
- `exnulla.orchestrationLab.*`
- include schemaVersion in keys where useful.
14.2 Export format (MUST)
- Export is the canonical `RunArtifact` JSON.
- Additionally export (optional):
  - `transcript.md` (prompt/response pairs),
  - `summary.md` (budgets, rubric, drift, acceptance checklist).
14.3 Import validation (MUST)
Import must:
- validate schemaVersion,
- validate Zod schema,
- recompute derived state and compare to stored derived (deterministic check),
- show any mismatches as “artifact integrity warnings.”
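The import check can be sketched as follows. `recomputeDerived` is a hypothetical stand-in for the core derivation pipeline, and the real implementation would also run the Zod schema validation before the integrity comparison.

```typescript
// Returns artifact integrity warnings per section 14.3. An empty array
// means the import is clean for v1.1.
function integrityWarnings(
  artifact: { schemaVersion: string; derived: unknown },
  recomputeDerived: (a: { schemaVersion: string; derived: unknown }) => unknown,
): string[] {
  const warnings: string[] = [];
  if (artifact.schemaVersion !== '1.1.0') {
    warnings.push(`unsupported schemaVersion: ${artifact.schemaVersion}`);
  }
  // Deterministic check: recomputed derived state must match stored state.
  if (JSON.stringify(recomputeDerived(artifact)) !== JSON.stringify(artifact.derived)) {
    warnings.push('derived state mismatch (artifact integrity warning)');
  }
  return warnings;
}
```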
15. Inspector UI
15.1 Routes (MUST)
- `/` → landing + "New Run" + "Import Run"
- `/runs` → run list
- `/runs/:runId` → run overview (timeline)
- `/runs/:runId/turns/:turnId` → turn detail
- `/runs/:runId/graph` → DAG view
- `/runs/:runId/diff` → diff view (turn-to-turn)
- `/runs/:runId/rubric` → rubric panel
- `/runs/:runId/drift` → drift panel
- `/runs/:runId/injections` → injection panel
- `/meta/version.json` → version endpoint (static)
15.2 Timeline view requirements
- per turn:
- role badge,
- contract status,
- token estimate + cumulative,
- drift severity,
- rubric score,
- links to detail and diff.
15.3 DAG view requirements
- nodes = turns (ordered left-to-right by turnId)
- edges = inferred stage transitions / loops
- node styles:
- contract invalid → highlight
- drift warn/error → highlight
- click node opens turn detail
Implementation:
- use a lightweight graph lib compatible with static builds (e.g. React Flow) OR custom SVG layout.
- determinism requirement:
- graph layout must be stable for a given run (seeded layout if using force algorithms).
15.4 Diff view requirements
Diff options:
- prompt vs prompt (two turns)
- response vs response
- digest vs digest across turns
Implementation:
- use a deterministic diff algorithm (e.g. the `diff` package) and render hunks.
15.5 Paste console requirements
- shows expected role + stage
- shows prompt block (copy button)
- provides paste input area
- validates contract live and shows errors before submission
- submits through `stepEngine({ type: "PASTE_RESPONSE" })`
15.6 Accessibility / iframe constraints
- no reliance on `window.top` control
- all downloads via standard browser download; no popups
- no external fonts required (optional)
16. Simulated provider (optional but REQUIRED for tests)
16.1 Purpose
- Provide deterministic “agent outputs” for:
- unit/integration tests,
- demo mode without ChatGPT UI,
- injecting failure patterns reproducibly.
16.2 SimulationConfig
export type SimulationConfig = {
seed: number; // required
modelPresetByRole: Record<RoleName, 'fast' | 'balanced' | 'thorough'>;
errorRateByRole: Record<RoleName, number>; // 0..1
verbosityByRole: Record<RoleName, number>; // 0..1
};
16.3 Simulation determinism rules
- Use a seeded PRNG (e.g. `seedrandom`) in core.
- Never use `Math.random()` directly.
- All simulated outputs must embed the run/turn header block and required headings.
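A minimal seeded-PRNG stand-in (mulberry32) that satisfies the no-`Math.random()` rule. The spec suggests the `seedrandom` library; this block only illustrates the determinism property the simulation depends on.

```typescript
// mulberry32: small, fast, deterministic 32-bit PRNG. Same seed,
// same sequence, on every run and every platform.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}
```

With this in place, `SimulationConfig.seed` fully determines error injection and verbosity draws, which is what makes the integration tests in section 17 replayable.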
17. Testing plan
17.1 Core unit tests (MUST)
- schema validation (valid + invalid fixtures)
- deterministic hashing + canonical json
- drift rules hit expected evidence
- rubric scoring stable given fixed input
- budget math and warning thresholds
- digest regeneration stable
- transition rules with loop caps
17.2 Integration tests (MUST)
- simulate an entire run with `SimulatedProvider`:
  - with no injections → should pass acceptance,
- with each injection type → should flag drift and/or fail acceptance depending on design.
17.3 UI smoke tests (SHOULD)
- ensure build compiles
- ensure routes render with sample run artifact
18. CI and release hygiene
18.1 GitHub Actions (MUST)
Workflow steps:
- `pnpm install --frozen-lockfile`
- `pnpm lint`
- `pnpm test`
- `pnpm build`
- optional: upload `dist/` as artifact
18.2 Version stamping (MUST)
- `GIT_SHA` injected in CI: `GIT_SHA=${{ github.sha }}`
- `meta/version.json` created during build from env + package version.
19. Docker spec (deterministic build)
19.1 Dockerfile requirements (MUST)
- multi-stage build (build → nginx or dist output)
- uses pnpm with lockfile
- accepts `ARG GIT_SHA`
Example (reference, adjust as needed):
FROM node:20-alpine AS build
WORKDIR /app
ARG GIT_SHA=unknown
ENV VITE_GIT_SHA=$GIT_SHA
COPY package.json pnpm-lock.yaml pnpm-workspace.yaml ./
COPY apps/loc-web/package.json apps/loc-web/package.json
COPY packages/core/package.json packages/core/package.json
COPY packages/scenarios/package.json packages/scenarios/package.json
COPY packages/ui/package.json packages/ui/package.json
RUN corepack enable && corepack prepare pnpm@latest --activate
RUN pnpm install --frozen-lockfile
COPY . .
RUN pnpm build
FROM nginx:alpine AS runtime
COPY --from=build /app/apps/loc-web/dist /usr/share/nginx/html
19.2 Determinism note
Avoid embedding build timestamps unless explicitly excluded from replay checks.
20. Security and safety
20.1 No secrets rule (MUST)
- UI must warn: “Do not paste secrets; this tool stores data locally.”
- Best-effort secret detection (SHOULD):
- regex for common token formats,
- show warning banner; allow user to proceed (do not hard-block in v1.1).
20.2 Content boundaries
- Role templates must forbid:
- claiming to have executed code,
- scraping/automation,
- accessing private systems.
21. Acceptance criteria (v1.1 release gate)
A v1.1.0 release is “done” when all are true:
- New run wizard works end-to-end in Human mode using copy/paste.
- Contract validation triggers and generates repair prompts.
- Drift rules reliably fire on injected failure modes with evidence.
- Inspector explains drift + rubric with clickable evidence.
- Export/import roundtrip works and deterministic replay validation passes.
- Static build runs cleanly and is iframe-safe.
- CI enforces strict TS, lint, tests, build.
- `/meta/version.json` exposes git SHA and schemaVersion.
22. Implementation checklist (file-level)
22.1 packages/core (MUST)
- `src/schema/runArtifact.ts` (types + zod)
- `src/util/canonicalJson.ts` (stable stringify)
- `src/util/hash.ts` (sha256 helpers)
- `src/engine/stepEngine.ts`
- `src/engine/deriveState.ts`
- `src/drift/rules/*.ts`
- `src/scoring/rubric.ts`
- `src/budget/budget.ts`
- `src/providers/humanProvider.ts`
- `src/providers/simulatedProvider.ts`
- `tests/*`
22.2 packages/scenarios (MUST)
- scenario registry + zod input schemas
- injection registry + deterministic transforms
- pricing profiles
22.3 apps/loc-web (MUST)
- run store (IndexedDB wrapper)
- new run wizard
- prompt router + paste console
- inspector routes (timeline, turn detail, graph, diff, rubric, drift, injections)
- export/import UI
22.4 docs (MUST)
- role instruction templates
- runbooks:
  - `DEPLOY.md` (atomic static deploy)
  - `IFRAME.md` (embedding contract and storage namespace)
23. Appendix A — Deterministic runId format
23.1 Format
Use a URL-safe id:
- `orl_<YYYYMMDD>_<hhmmss>_<randBase32>` for human runs (time-based, not determinism-critical), OR
- `orl_<hashPrefix>` for deterministic runs if seed-based.
v1.1 choice (recommended):
- time-based is acceptable because determinism is based on artifact content, not runId.
23.2 Requirement
- runId must be unique within local store.
- export file naming uses runId.
24. Appendix B — UI embed contract (iframe)
24.1 Static hosting assumptions
- all assets served relative to app root
- no service worker required
- no absolute URLs
24.2 Storage namespace
All keys must be prefixed:
exnulla.orchestrationLab.v1.1.0.*
25. Roadmap hooks (v1.2+ / v2+)
25.1 v1.2 (planned)
- CLI validator:
  - `validate-run <file>`
  - `diff-runs <a> <b>`
- cross-run comparison UI
- more scenarios (6+)
25.2 v2 (planned)
- API provider adapters
- optional server runtime for keys (not in browser)
- tool execution hooks (optional)
CI/CD and Verification Model
CI/CD is the external enforcement mechanism for this workflow. It is where subjective process claims become objective pass/fail behavior.
- Lint Required: formatting and static quality discipline are not optional postscript tasks.
- Typecheck Required: the pipeline must prove structural correctness, not just visual plausibility.
- Build Required: a finished pass that does not build is not finished.
- No Tooling Drift: the pipeline should not modify root build systems, lint configs, or monorepo behavior unless that change is explicitly allowed by the spec.
- Verification Before Completion: completion requires a passing verification run, full anchor coverage, assumptions collapsed to zero, and no drift alert.
The point is simple: the repo gates are the truth surface. Any agentic process that bypasses them is theater.
Agentic Development Pipeline
The workflow is deliberately closer to a supervised build engine than a conversational coding assistant. Roles are isolated. Scope is constrained. Output is artifact-based. Verification has veto power.
- Architect Role: defines anchors, non-negotiables, and the architecture plan. Cannot silently implement.
- Tooling Role: confirms CI commands, repo expectations, and allowed configs. Cannot invent new stack decisions.
- Implementation Role: produces code only from the approved architecture tree. Cannot expand scope because a new idea appears attractive mid-pass.
- Verification Role: must halt the pass on drift, missing anchor coverage, or unresolved assumptions.
This role separation is not ceremony. It is what allows the system to scale reasoning without allowing authority to become ambiguous.
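One way to make the Verification role's veto concrete is a pure check that returns halt reasons instead of fixing anything itself. The field names below are illustrative, not the DevKit's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Pass:
    """Minimal model of one pipeline pass (illustrative fields)."""
    anchors: set = field(default_factory=set)       # set by Architect
    implemented: set = field(default_factory=set)   # set by Implementation
    assumptions: list = field(default_factory=list)

def verify(p: Pass) -> list:
    """Verification role: return the reasons a pass must halt (empty = ok).
    It has veto power but no authority to implement fixes itself."""
    reasons = []
    if p.assumptions:
        reasons.append("unresolved assumptions")
    if not p.anchors <= p.implemented:
        reasons.append("missing anchor coverage")
    if not p.implemented <= p.anchors:
        reasons.append("unanchored files (scope drift)")
    return reasons
```

Because verification only reports, authority stays unambiguous: fixing drift is someone else's job, and an empty reason list is the only path forward.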
The hidden advantage is economic as well. By decomposing work into explicit phases, you can route simpler tasks to cheaper model tiers and reserve expensive reasoning for architecture, synthesis, and conflict resolution rather than paying premium rates for every token in the loop.
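A toy cost model shows why phase routing pays off; the tier names, prices, and routing table below are invented placeholders, not real vendor pricing:

```python
# Hypothetical pricing, purely for illustration.
MODEL_TIERS = {
    "cheap": {"usd_per_1k_tokens": 0.0005},
    "premium": {"usd_per_1k_tokens": 0.015},
}

# Route each phase to the cheapest tier that can handle it.
PHASE_ROUTING = {
    "architecture": "premium",   # deep reasoning, conflict resolution
    "synthesis": "premium",
    "tooling_check": "cheap",    # mechanical confirmation work
    "implementation": "cheap",   # constrained by an approved tree
    "formatting": "cheap",
}

def estimated_cost(phase_tokens: dict) -> float:
    """Sum cost across phases given token counts per phase."""
    total = 0.0
    for phase, tokens in phase_tokens.items():
        tier = PHASE_ROUTING[phase]
        total += tokens / 1000 * MODEL_TIERS[tier]["usd_per_1k_tokens"]
    return total
```

Implementation typically consumes far more tokens than architecture, so sending it down-tier is where most of the savings accumulate.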
Human Roles
Human involvement is not a sign that the pipeline is unfinished. It is where system quality actually comes from.
- Inspection
Humans inspect whether the output actually satisfies the real-world intent, not just the literal text of the prompt.
- Quality Control
Humans catch misframed assumptions, strategic mismatches, and low-signal cleverness.
- Infra and Deployment Authority
Humans retain ownership of environment changes, release discipline, secrets handling, and operational boundaries.
- Specification Control
Humans are responsible for tightening the blueprint/spec pair when the process reveals new ambiguity.
In short: the machine accelerates structured work, but the human remains accountable for engineering judgment.
Security and Drift Control
In this operating model, security and drift control are tightly linked. A system that cannot explain why a file exists, why a behavior changed, or where authority came from is both a process problem and a security problem.
- Drift Halt
If assumptions remain, anchors are incomplete, output format is violated, or unanchored files appear, the correct behavior is to stop the pass.
- No Hidden Globals
Environment requirements must be explicit. Hidden state makes both reproducibility and security worse.
- Scope-Constrained Output
The implementation agent should not expand capability beyond what the architecture already justified.
- OS-Neutral, Repo-Relative Behavior
Process portability matters. Repo-relative paths and explicit assumptions reduce accidental environmental coupling.
- Artifact Traceability
Every meaningful output should map back to a phase, a purpose, and a source authority.
The shortest summary is this: a safe agentic pipeline is not one that “does more.” It is one that fails visibly, explains itself, and refuses to outrun its own specification.
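As a closing sketch, artifact traceability plus visible failure can be modeled as an admission check that raises rather than continuing silently; all names below are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    """Every meaningful output carries its provenance (illustrative fields)."""
    path: str
    phase: str       # which pipeline phase produced it
    purpose: str     # why it exists
    authority: str   # which spec source justified it

class DriftHalt(Exception):
    """Raised to stop the pass visibly instead of outrunning the spec."""

def admit(artifact: Artifact, allowed_authorities: set) -> Artifact:
    """Refuse any artifact that cannot explain where its authority came from."""
    if artifact.authority not in allowed_authorities:
        raise DriftHalt(f"unanchored artifact: {artifact.path}")
    return artifact
```

The point of raising instead of logging is that an unexplainable artifact is a halt condition, not a warning to scroll past.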