Engineering Process

What to Expect From Me

The way an engineer designs, specifies, and operates complex systems matters as much as the systems themselves. This section exposes that process directly. Instead of describing capabilities in abstract terms, it shows the architecture, constraints, specifications, and control layers used to build real systems. For potential collaborators, employers, or clients, the goal is simple: evaluate the work itself. The materials below demonstrate how problems are structured, how specifications are written, and how systems are shipped. If the approach aligns with your needs, we should talk.


How to Read This Section

Most portfolios show finished artifacts. This page shows the work behind those artifacts. It documents how problems are framed, how constraints are locked, how blueprints and engineering specs are written, how CI/CD is used as an enforcement mechanism, how AI-assisted development is bounded, and where human inspection remains authoritative.

The intent is to remove ambiguity about what someone is actually hiring when they hire me. The output matters, but the process matters more. This page is meant for technical readers, operators, founders, and engineering leadership who want to inspect the system behind the artifact rather than stop at the artifact itself.

Important Notice

The material presented here reflects proprietary engineering processes and system design work. These processes, architectures, methodologies, and planning artifacts are intellectual property. All rights reserved.

This page is provided for evaluation purposes only to demonstrate engineering capability, architectural reasoning, system discipline, and execution quality.

Case 01 — Thesis Chain AI DevKit

This case study examines the engineering process behind the Thesis Chain AI DevKit. The DevKit exists to safely integrate AI-assisted development into production-grade engineering workflows while controlling cost, behavior, nondeterminism, and security risk.

Rather than treating AI as an autonomous authority, the DevKit treats model output as untrusted input. Guardrails, validation layers, budget controls, policy checks, and human inspection surround the model so the engineering system remains predictable, auditable, and reviewable.

Problem Definition

Modern AI models can accelerate engineering work dramatically, but naive integration introduces severe risk: prompt injection, uncontrolled costs, nondeterministic output, accidental data disclosure, weak reviewability, and silent drift in system behavior.

Most AI-assisted development tooling assumes the model can be trusted to generate correct or safe output. In practice this assumption fails often enough to make an unconstrained approach unacceptable in serious engineering environments.

The engineering problem addressed by the Thesis Chain AI DevKit is therefore:

How can AI-assisted development be integrated into real engineering workflows while maintaining deterministic control, cost discipline, bounded authority, and meaningful security guarantees?

Engineering Constraints

Before architecture begins, the system must operate under explicit constraints. These constraints shape every architectural decision and prevent design drift.

Blueprint Architecture

The blueprint phase exists to lock system intent before implementation begins. Its purpose is not to describe code. Its purpose is to define the operational shape of the system: what the system must do, what it must never do, how risk is bounded, where authority resides, how inputs move, and what acceptance looks like before implementation starts.

For the Thesis Chain AI DevKit, the blueprint establishes a guardrail-first architecture. The model is never placed at the center of the system. Instead, the model is wrapped inside a deterministic control envelope that constrains what context may be passed in, how requests are formed, how outputs are parsed, and what conditions cause the system to reject the response.

At blueprint level, the architecture is divided into ordered layers rather than loose feature ideas. That matters because order determines safety. Cheap and deterministic checks execute first. Expensive and probabilistic work executes later, only after the input has been reduced, normalized, screened, and validated.

Canonical Blueprint Markdown

The following appendix is mirrored locally from the AI DevKit source material and displayed here as canonical markdown.

The Thesis Chain AI DevKit — Blueprint

Version: 1.0.0
Status: Canonical Blueprint
Project: the-thesis-chain-ai-devkit
Document Type: System Blueprint
Primary Audience: Engineering leadership, platform engineers, security reviewers, implementation engineers
Authoring Intent: Define the operational architecture, trust boundaries, guardrails, authority model, and implementation shape for a safe AI-assisted engineering system.


1. Purpose

The Thesis Chain AI DevKit exists to integrate AI-assisted development into real engineering workflows without giving model output uncontrolled authority over code, repository state, infrastructure, or policy.

The system is designed around a simple premise:

AI output is useful, but untrusted.

The DevKit therefore does not treat the model as a builder with implicit authority. It treats the model as an external probabilistic subsystem wrapped inside deterministic engineering controls. The value of the system comes from how inputs are reduced, how context is bounded, how outputs are validated, how budget is controlled, how risk is isolated, and where human authority is retained.

This project is not a chatbot wrapper. It is an engineering control framework for structured, auditable, bounded AI-assisted workflows.


2. Problem Statement

Modern model providers can accelerate review, synthesis, linting, threat sketching, and ambiguity detection. However, naive adoption creates a compound engineering risk surface:

  • unbounded token spend
  • accidental data disclosure
  • prompt injection through repository text
  • nondeterministic output treated as truth
  • silent workflow drift
  • provider coupling
  • weak auditability
  • unclear merge authority
  • inappropriate use of write-capable automation

The actual engineering problem is:

How can AI-assisted engineering workflows produce useful structured output while preserving deterministic safety, bounded cost, auditability, and human control?

This blueprint answers that question at architecture level.


3. Design Position

3.1 What AI is allowed to be

AI may act as:

  • a reviewer
  • a synthesizer
  • a contradiction detector
  • an ambiguity finder
  • a threat-category sketcher
  • a structured advisory instrument

3.2 What AI is not allowed to be

AI is not:

  • a source of truth
  • an autonomous merger
  • a deployment authority
  • a secrets-bearing execution surface
  • a repository-wide reader by default
  • a policy mutator
  • a privileged system actor

3.3 Core architectural stance

The system is guardrail-first, fail-closed, and authority-constrained.

The model sits inside a layered deterministic envelope. The envelope, not the model, is the system.


4. Non-Negotiable Constraints

Before implementation, the following constraints are locked.

4.1 Bounded authority

AI output may be rendered, scored, cached, audited, and surfaced for review, but it may not directly merge code, deploy infrastructure, rotate secrets, or mutate policy without explicit human approval.

4.2 Diff-limited context

The system must operate on narrowed, task-relevant, allowlisted context. Whole-repo dumping is prohibited by design.

4.3 Redaction before provider access

Redaction and path filtering occur before any provider call is possible.

4.4 Strict schema at boundaries

Model output must be parsed into declared structure. If parsing fails, the system rejects the result.

4.5 Fail-closed behavior

Validation, policy, or budget failure must produce rejection rather than silent degradation.

4.6 Deterministic gates remain authoritative

Deterministic checks keep final authority. AI output is advisory even when structurally valid.

4.7 Provider abstraction

Core system logic may not be tightly coupled to a single model vendor.

4.8 Full run traceability

Meaningful executions must emit auditable artifacts sufficient for replay, diagnosis, and review.


5. System Goals

The DevKit is intended to provide the following outcomes.

  1. Increase engineering leverage on review-heavy work.
  2. Reduce ambiguity and contradiction in specs, diffs, and architectural material.
  3. Bound the safety and cost risks of model usage.
  4. Produce repeatable structured outputs.
  5. Preserve explainability and post-run auditability.
  6. Support both local and GitHub-mediated workflows.
  7. Remain useful even when provider integrations are stubbed or offline.

6. Out of Scope

The following are explicitly out of scope for this version.

  • autonomous code merge
  • autonomous deployment
  • autonomous policy modification
  • secret retrieval from protected systems
  • unrestricted repo ingestion
  • write-capable agent swarms
  • unsupervised multi-step tool execution against production systems
  • treating schema-valid output as semantically correct by default

7. Operational Model

The DevKit is organized as a layered pipeline.

7.1 Layer 0 — Input boundary

Inputs enter as typed engineering artifacts:

  • repository reference
  • pull request reference
  • diff summary
  • changed files
  • prompt template version
  • task class
  • runtime policy
  • optional provider configuration

All inputs are assigned trust levels.

7.2 Layer 1 — Path policy and context eligibility

Files are filtered through allow/deny policy. Sensitive directories and structurally dangerous paths are excluded from model context.

7.3 Layer 2 — Redaction and sanitization

Eligible content is passed through redaction rules to suppress obvious secret and PII patterns and to reduce accidental disclosure.

7.4 Layer 3 — Prompt injection preflight

Repository text, diffs, and instructions are screened for prompt injection patterns. Safety mode accepts false positives over false negatives.

7.5 Layer 4 — Context minimization

Only the minimum useful diff and file content move forward. The system reduces low-signal input before any expensive operation.

7.6 Layer 5 — Budget and routing

The system decides whether the task deserves an AI call at all, and if so, what model class should receive it.

7.7 Layer 6 — Provider execution

Providers are treated as external execution surfaces. Their output is raw material, not authority.

7.8 Layer 7 — Parse and schema validation

Response text must parse to valid structured output. Invalid output is rejected.

7.9 Layer 8 — Decision boundary

A valid report is still classified as advisory. It may be rendered to markdown, attached to a PR, cached, audited, or flagged for manual review.

7.10 Layer 9 — Audit, metrics, replay

The run emits enough metadata to reconstruct what happened without trusting memory or provider logs alone.


8. High-Level Architecture

8.1 Principal subsystems

  • Policy subsystem

    • allow paths
    • deny paths
    • strict schema enforcement
    • prompt injection guard enablement
    • budget limits
    • model selection defaults
  • Context control subsystem

    • changed-file assembly
    • diff summary ingestion
    • size reduction
    • path gating
    • content shaping
  • Safety subsystem

    • redaction
    • prompt injection heuristics
    • fail-closed validation
  • Provider abstraction subsystem

    • provider interface
    • stub provider
    • future provider adapters
  • Schema boundary subsystem

    • output contract
    • parse failure handling
    • structure validation
  • Audit subsystem

    • request event
    • response event
    • error event
    • hashes and token usage
  • Cache subsystem

    • deterministic keying
    • TTL-based storage
    • duplicate-spend prevention
  • Agent subsystem

    • task-specific templates
    • structured report generation
    • agent versioning
  • Runner subsystem

    • local runner
    • GitHub Actions runner
    • GitHub App / webhook architecture

9. Agent Model

Agents in this system are not autonomous personas. They are typed task modules with fixed contracts.

Each agent must define:

  • an agent name
  • an agent version
  • a prompt template
  • constraints
  • an output schema
  • a deterministic validation boundary
  • a rendering target

Example task classes supported by the current architecture include:

  • specification linting
  • PR synthesis
  • threat sketching

The architectural rule is that an agent is not defined by a clever prompt. It is defined by a prompt-plus-contract-plus-boundary package.


10. Trust Boundaries

This system has several hard trust boundaries.

10.1 Repository text is untrusted

Pull request content, spec text, comments, and changed files may contain adversarial instructions.

10.2 Model provider is external

Provider calls move data beyond the local boundary. Context must be reduced before crossing that line.

10.3 Model output is untrusted

Even well-formed output may be wrong, incomplete, or subtly misleading.

10.4 Human reviewers remain authoritative

Human approval is the boundary at which advisory output may influence actual engineering decisions.


11. Safety Architecture

11.1 Prompt injection resistance

The system uses conservative preflight heuristics to reject obvious attempts to override role, reveal secrets, or alter instructions.

11.2 Path isolation

The system denies unsafe path classes by default and only sends allowlisted engineering material.

11.3 Secret and PII redaction

Sensitive patterns are removed or masked before request assembly.

11.4 Schema-gated output

Only output that fits the declared report structure is accepted into downstream systems.

11.5 Read-only default integration

Integrations should default to read-only scope with comment-only feedback unless explicitly elevated.

11.6 Human-held merge authority

No report, score, or advisory comment is permitted to stand in for merge authority.


12. Budget and Cost Control Model

The DevKit treats cost as a first-class systems problem.

12.1 Budget primitives

For a run r:

  • calls(r) = number of provider calls
  • Tin(r) = total input tokens
  • Tout(r) = total output tokens

The budget envelope is:

  • calls(r) <= C_max
  • Tin(r) <= I_max
  • Tout(r) <= O_max

The run is rejected when any inequality is violated.

12.2 Cost equation

For provider pricing:

  • alpha = cost per input token
  • beta = cost per output token

Then expected run cost is:

Cost(r) = alpha * Tin(r) + beta * Tout(r)

System-level budget discipline requires that expected spend be bounded before scale is allowed.
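As a worked instance of the cost equation, a hypothetical `runCost` helper (the prices passed in are example values, not real provider rates):

```typescript
// Token usage for a single run r.
type Usage = { inputTokens: number; outputTokens: number };

// Cost(r) = alpha * Tin(r) + beta * Tout(r)
// alpha = cost per input token, beta = cost per output token.
function runCost(usage: Usage, alpha: number, beta: number): number {
  return alpha * usage.inputTokens + beta * usage.outputTokens;
}
```

Bounding `Tin` and `Tout` up front (section 12.1) therefore bounds `Cost(r)` before any call is made.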

12.3 Caching principle

Repeated calls on equivalent prompt and context should not re-spend budget.

A canonical cache key shape is:

K = H(provider || model || prompt_version || prompt_hash || context_hash || policy_version)

Where H() is a collision-resistant digest.
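A sketch of that key construction, assuming SHA-256 as H() and an explicit `|` separator in place of raw concatenation to avoid ambiguous field boundaries (field names are illustrative):

```typescript
import { createHash } from "node:crypto";

// K = H(provider || model || prompt_version || prompt_hash || context_hash || policy_version)
function cacheKey(parts: {
  provider: string;
  model: string;
  promptVersion: string;
  promptHash: string;
  contextHash: string;
  policyVersion: string;
}): string {
  const material = [
    parts.provider,
    parts.model,
    parts.promptVersion,
    parts.promptHash,
    parts.contextHash,
    parts.policyVersion,
  ].join("|"); // explicit separator prevents distinct field tuples colliding
  return createHash("sha256").update(material).digest("hex");
}
```

Any change to provider, model, prompt, context, or policy version produces a new key, so stale cached output is never served across a behavioral change.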


13. Auditability Model

Every meaningful run should emit structured audit events.

At minimum, the system records:

  • request id
  • provider
  • model
  • prompt hash
  • context hash
  • output hash
  • timestamp
  • token usage
  • error state, if any

This allows operators to answer:

  • what was asked
  • what input class was sent
  • what provider/model handled it
  • whether the output was cached
  • whether the output validated
  • what it cost
  • what failed if the run was rejected

Audit exists to support diagnosis, governance, and trust.


14. GitHub Integration Model

The DevKit supports two primary integration modes.

14.1 CI-driven mode

A GitHub Action runs on PR events, assembles eligible context, executes the advisory pipeline, and posts structured review comments.

14.2 App-driven mode

A webhook service verifies GitHub signatures, mints installation tokens, fetches changed files, runs the advisory pipeline, and posts PR comments or check runs.

The blueprint preference is:

  • read-only by default
  • no content mutation by default
  • comment/check-run surfaces preferred over write surfaces
  • deterministic verification before any pipeline execution

15. Human Roles

The system explicitly retains human authority in the following roles.

15.1 Architect

Defines the allowed shape of the system, agent classes, boundaries, and non-negotiables.

15.2 Security reviewer

Owns threat posture, path policy, redaction strategy, integration scope, and escalation policy.

15.3 Implementation engineer

Builds adapters, runners, validators, and renderers against the blueprint and spec.

15.4 Reviewer / operator

Interprets advisory output, checks evidence, and decides whether action is warranted.

15.5 Release authority

Retains final authority for merges, deployment, and policy change.


16. Acceptance Criteria

The blueprint is considered implemented correctly when the system can demonstrably do the following:

  1. accept diff-limited engineering context
  2. reject disallowed paths before provider access
  3. redact obvious secrets and PII before request creation
  4. detect and block obvious prompt injection patterns
  5. assemble versioned prompt envelopes
  6. enforce hard token/call budgets
  7. cache equivalent requests deterministically
  8. parse and schema-validate response structure
  9. emit auditable request/response/error events
  10. surface advisory reports without granting write authority
  11. support both local and GitHub-oriented execution paths
  12. fail closed on malformed output or policy violation

17. Failure Philosophy

The DevKit is intentionally conservative.

When uncertain, it should:

  • reduce context
  • reject unsafe paths
  • block suspicious instructions
  • refuse malformed output
  • mark uncertainty explicitly
  • escalate to human review

The preferred failure mode is lost convenience, not silent compromise.


18. Future Evolution

The architecture permits future additions, but only within the same control posture.

Possible later extensions include:

  • stronger schema validators
  • scored evidence confidence
  • richer path-policy classes
  • provider multiplexing
  • offline replay tooling
  • diff chunking for large PRs
  • policy version pinning
  • richer evaluation harnesses
  • more agent classes

These are valid only if they preserve the current authority model: deterministic controls first, advisory AI second.


19. Blueprint Summary

The Thesis Chain AI DevKit is a control architecture for AI-assisted engineering, not an AI-first automation toy.

Its core principles are:

  • AI remains untrusted
  • deterministic boundaries remain authoritative
  • context is minimized before exposure
  • cost is bounded
  • outputs are schema-gated
  • audit is mandatory
  • write authority is withheld by default
  • humans retain final control

That is the system this blueprint defines.

Engineering Specifications

If the blueprint defines intent, the engineering specification defines execution. This is where high-level architectural ideas are converted into a buildable, inspectable, and testable system. In my process, the engineering spec is not a light outline. It is the document that removes ambiguity from implementation.

The engineering spec for an AI-assisted development system must answer several questions explicitly; the seven areas below address them in turn.

For the Thesis Chain AI DevKit, the engineering spec acts as a discipline document. It translates “AI should help here” into precise, enforceable behavior.

1. Module Boundaries

The spec separates the system into modules with narrow responsibilities: input preparation, sanitization, routing, provider calls, parsing, validation, budget accounting, result classification, and human inspection. If a module cannot be named and bounded, it is not ready to be implemented.

2. Ordered Guardrails

Guardrails are fixed in sequence. They are not optional helpers. They are part of the main execution path.

3. Output Contracts

The spec defines what a valid response looks like. Structured output contracts reduce hidden interpretation costs and unstable downstream behavior.

4. Failure Semantics

The spec identifies when the system must stop. A malformed response, budget breach, unsafe context match, or policy violation should terminate the path and surface a visible failure state.

5. Token and Cost Discipline

Work is divided into classes: mechanical, evaluative, synthesis-heavy, and ambiguous. These classes map to different model tiers and different budget thresholds.

6. Inspection Requirements

The spec defines what must be visible to a human reviewer: prompt class, sanitized input summary, chosen model tier, token consumption, validation results, classification outcome, and final disposition.

7. Non-Negotiables

The strongest specs contain non-negotiables that implementation is not allowed to reinterpret: no hidden globals, no silent fallback behavior, no speculative scope expansion, no unbounded model calls, and no accepting model output as trusted state without validation and review.

Canonical Engineering Spec Markdown

The following appendix is mirrored locally from the AI DevKit source material and displayed here as canonical markdown.

The Thesis Chain AI DevKit — Engineering Specification

Version: 1.0.0
Status: Canonical Engineering Specification
Project: the-thesis-chain-ai-devkit
Document Type: Engineering Specification
Primary Audience: Implementation engineers, reviewers, maintainers, CI/CD operators
Depends On: the-thesis-chain-ai-devkit-blueprint-1-0-0.md


1. Specification Intent

This engineering specification defines the concrete implementation contract for the Thesis Chain AI DevKit.

It exists to translate blueprint-level architectural intent into:

  • module boundaries
  • runtime data contracts
  • algorithmic flow
  • validation rules
  • budget equations
  • cache semantics
  • audit event structure
  • runner behavior
  • GitHub integration behavior
  • acceptance tests

This spec is written so an implementation engineer can build or extend the system without guessing.


2. System Summary

The DevKit is a provider-agnostic, schema-gated, guardrail-first framework for AI-assisted engineering workflows.

At runtime, the system:

  1. receives a task-specific request
  2. filters context by policy
  3. redacts content
  4. screens for prompt injection
  5. assembles a prompt envelope
  6. computes deterministic hashes
  7. checks cache
  8. enforces budget
  9. calls a provider adapter
  10. parses and validates response structure
  11. records audit events
  12. returns an advisory report to a runner

The implementation must preserve that order.


3. Repository-Level Module Topology

3.1 Required top-level module groups

  • src/core/

    • types
    • policy
    • redaction
    • injection guards
    • schema validation
    • LLM client
    • audit
    • cache
    • prompt templates
    • shared utilities
  • src/adapters/

    • provider adapter interface
    • provider implementations or stubs
  • src/agents/

    • typed agent runners for fixed task classes
  • src/runners/

    • local execution path
    • GitHub-oriented execution path
  • docs/

    • architectural and operational documentation
  • .github/workflows/

    • CI demonstration or integration flows

4. Data Contracts

4.1 Severity

Allowed values:

  • info
  • warn
  • high

4.2 Category

Allowed values:

  • structure
  • invariant
  • threat
  • diff
  • test

4.3 Finding

A finding is a typed advisory unit.

type Finding = {
  id: string;
  severity: 'info' | 'warn' | 'high';
  category: 'structure' | 'invariant' | 'threat' | 'diff' | 'test';
  claim: string;
  evidence_refs: string[];
  suggested_action?: string;
};

4.4 Report

The report is the canonical accepted AI output structure.

type Report = {
  agent: string;
  version: string;
  input_hash: string;
  output_hash: string;
  findings: Finding[];
  notes?: string[];
};

4.5 FileBlob

type FileBlob = {
  path: string;
  content: string;
};

4.6 AgentContext

type AgentContext = {
  repo: { owner: string; name: string };
  pr?: { number: number; headSha: string };
  diffSummary: string;
  changedFiles: FileBlob[];
  promptVersion: string;
};

4.7 ModelSpec

type ModelSpec = {
  provider: 'stub' | 'openai' | 'azure_openai' | 'anthropic' | 'vertex';
  model: string;
  temperature: number;
  maxOutputTokens: number;
};

4.8 Budget

type Budget = {
  maxCalls: number;
  maxTotalInputTokens: number;
  maxTotalOutputTokens: number;
};

4.9 LLMRequest

type LLMRequest = {
  requestId: string;
  system: string;
  task: string;
  constraints: readonly string[];
  outputSchema: JSONSchemaLike;
  model: ModelSpec;
  context: {
    diffSummary: string;
    files: FileBlob[];
  };
  sampling?: {
    top_p?: number;
    seed?: number;
  };
};

4.10 LLMResponse

type LLMResponse = {
  requestId: string;
  provider: LLMProvider;
  model: string;
  rawText: string;
  parsed: Report;
  usage: {
    inputTokens: number;
    outputTokens: number;
  };
  audit: {
    promptHash: string;
    contextHash: string;
    outputHash: string;
    timestampMs: number;
  };
};

5. Policy Contract

5.1 Policy structure

The system policy must declare:

  • allowPaths
  • denyPaths
  • budget
  • model
  • strictSchema
  • promptInjectionGuard

Example contract:

type Policy = {
  allowPaths: string[];
  denyPaths: string[];
  budget: Budget;
  model: ModelSpec;
  strictSchema: true;
  promptInjectionGuard: true;
};

5.2 Path evaluation rule

A path is eligible iff:

  1. it does not match any deny prefix
  2. it does match at least one allow prefix

Formally, for path p:

eligible(p) = (forall d in D : not startsWith(p, d)) and (exists a in A : startsWith(p, a))

Where:

  • D = deny path set
  • A = allow path set
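The rule transcribes directly into code; a minimal sketch assuming prefix-based matching:

```typescript
// eligible(p) = (forall d in D : not startsWith(p, d))
//           and (exists a in A : startsWith(p, a))
function eligible(path: string, allow: string[], deny: string[]): boolean {
  // Deny prefixes are checked first: any match disqualifies the path.
  if (deny.some((d) => path.startsWith(d))) return false;
  // Otherwise the path must match at least one allow prefix.
  return allow.some((a) => path.startsWith(a));
}
```

Note the asymmetry: deny wins over allow, so a path under both sets is always excluded.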

5.3 Default posture

The default policy must remain conservative and read-only in operational effect.


6. Request Lifecycle

6.1 Required order of execution

The system shall process each request in this exact logical order:

  1. accept typed request
  2. apply redaction
  3. run prompt injection preflight
  4. build prompt
  5. hash prompt and context
  6. check cache
  7. enforce budget
  8. record request audit event
  9. call provider
  10. parse response
  11. validate response schema
  12. increment budget counters
  13. compute output hash
  14. record response audit event
  15. write cache entry
  16. return structured response

This order is not optional. Rearranging it weakens safety or observability.


7. Context Reduction Requirements

7.1 Context assembly

Only changed files relevant to the current task may be included.

7.2 Context size discipline

The system must avoid whole-repo context assembly. Input is restricted to:

  • diff summary
  • selected changed files
  • fixed prompt template material
  • fixed constraints

7.3 Exclusion rules

Files matching deny policy shall never be passed to a provider.

7.4 Context objective

The context subsystem is optimized for signal density, not completeness.


8. Redaction Requirements

8.1 Redaction timing

Redaction must occur before cache-key generation for provider-bound prompt content and before provider invocation.

8.2 Minimum baseline patterns

The implementation must support rule-based redaction of:

  • obvious API-key-like tokens
  • email addresses
  • later extensible secret patterns

8.3 Redaction function

For text blob x and rule set R = {r_1, r_2, ..., r_n}:

Redact(x, R) = r_n(...r_2(r_1(x)))

Where each r_i is a pattern substitution function.
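A minimal sketch of that composition; the two rules shown are illustrative baseline patterns only, not a complete redaction policy:

```typescript
// Each rule is a pattern substitution function r_i : string -> string.
type RedactionRule = (text: string) => string;

const baselineRules: RedactionRule[] = [
  // Obvious API-key-like tokens (illustrative pattern).
  (t) => t.replace(/sk-[A-Za-z0-9]{16,}/g, "[REDACTED_KEY]"),
  // Email addresses.
  (t) => t.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[REDACTED_EMAIL]"),
];

// Redact(x, R) = r_n(...r_2(r_1(x))) — rules applied left to right.
function redact(text: string, rules: RedactionRule[] = baselineRules): string {
  return rules.reduce((acc, rule) => rule(acc), text);
}
```

Because the rule set is an ordered list, later patterns see the output of earlier ones, which keeps the composition easy to reason about and to extend.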

8.4 Redaction philosophy

The redaction subsystem is deliberately conservative. False positives are acceptable if they reduce accidental disclosure.


9. Prompt Injection Guard Requirements

9.1 Guard timing

Prompt injection screening must run after redaction and before provider invocation.

9.2 Heuristic scope

The system must reject obvious adversarial prompt constructs such as:

  • instruction override attempts
  • role-spoof labels
  • secret-exfiltration requests
  • provider-key disclosure language

9.3 Safety mode

The guard should prefer false positive rejection over permissive acceptance.

9.4 Failure behavior

A triggered guard produces immediate request rejection.
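One plausible shape for such a preflight; the pattern list is an illustrative sketch, and a real deployment would version, maintain, and extend it:

```typescript
// Conservative heuristics: any match rejects the request (fail closed).
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i, // instruction override attempt
  /\bsystem\s*:/i,                                // role-spoof label
  /reveal (your )?(system prompt|secrets?)/i,     // secret-exfiltration request
  /api[-_ ]?key/i,                                // provider-key disclosure language
];

function injectionPreflight(text: string): { ok: boolean; matched?: string } {
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(text)) {
      return { ok: false, matched: pattern.source }; // immediate rejection
    }
  }
  return { ok: true };
}
```

Broad patterns like these will produce false positives on legitimate diffs, which is the stated trade: the guard prefers lost convenience over permissive acceptance.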


10. Prompt Envelope Construction

10.1 Required sections

The prompt envelope shall be assembled in explicit labeled sections:

  • SYSTEM
  • TASK
  • CONSTRAINTS
  • OUTPUT_SCHEMA
  • CONTEXT_DIFF_SUMMARY
  • CONTEXT_FILES

10.2 Section purpose

This labeling exists to reduce ambiguity, constrain prompt shape, and make prompt assembly auditable.
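A minimal sketch of the labeled assembly, assuming a simple newline join format (the exact serialization is not fixed by this spec excerpt):

```typescript
// Assembles the envelope in the fixed section order from 10.1.
function buildPromptEnvelope(input: {
  system: string;
  task: string;
  constraints: readonly string[];
  outputSchema: object;
  diffSummary: string;
  files: { path: string; content: string }[];
}): string {
  return [
    `SYSTEM:\n${input.system}`,
    `TASK:\n${input.task}`,
    `CONSTRAINTS:\n${input.constraints.map((c) => `- ${c}`).join("\n")}`,
    `OUTPUT_SCHEMA:\n${JSON.stringify(input.outputSchema)}`,
    `CONTEXT_DIFF_SUMMARY:\n${input.diffSummary}`,
    `CONTEXT_FILES:\n${input.files
      .map((f) => `--- ${f.path} ---\n${f.content}`)
      .join("\n")}`,
  ].join("\n\n");
}
```

Because the section order is fixed in code rather than left to callers, every assembled prompt is structurally identical and diffable across runs.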

10.3 Prompt template versioning

Every prompt template must include:

  • id
  • version
  • system
  • task
  • constraints
  • outputSchema

Template version changes are behavioral changes and must be traceable.


11. Hashing and Cache Semantics

11.1 Prompt hash

Let P be the final assembled prompt string. Then:

promptHash = H(P)

11.2 Context hash

For diff summary S and files F = {(p_i, c_i)}:

contextHash = H(S || join_i(p_i || ":" || H(c_i)))
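Assuming SHA-256 as H(), the formula transcribes as:

```typescript
import { createHash } from "node:crypto";

// H() — a collision-resistant digest, here SHA-256 in hex.
const H = (s: string): string => createHash("sha256").update(s).digest("hex");

// contextHash = H(S || join_i(p_i || ":" || H(c_i)))
function contextHash(
  diffSummary: string,
  files: { path: string; content: string }[],
): string {
  const fileDigest = files.map((f) => `${f.path}:${H(f.content)}`).join("");
  return H(diffSummary + fileDigest);
}
```

Hashing each file's content before the outer digest means a one-byte content change anywhere in the context changes the final hash, which is what makes the cache key in 11.3 safe.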

11.3 Cache key

A canonical cache key shall include:

  • policy namespace or equivalent
  • provider
  • model
  • prompt hash
  • context hash

Example:

cacheKey = "aidev:" || provider || ":" || model || ":" || promptHash || ":" || contextHash

11.4 Cache objective

Caching exists to prevent repeated spend on semantically equivalent work.

11.5 Cache store requirement

The cache interface must support:

  • get(key)
  • set(key, value, ttlSeconds)

The reference implementation may be in-memory. Production implementations may use external stores.
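A minimal in-memory sketch of that interface, with lazy TTL eviction:

```typescript
// Implements the get(key) / set(key, value, ttlSeconds) contract from 11.5.
class InMemoryCache<T> {
  private store = new Map<string, { value: T; expiresAt: number }>();

  get(key: string): T | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() >= entry.expiresAt) {
      this.store.delete(key); // lazily evict expired entries on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T, ttlSeconds: number): void {
    this.store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}
```

Expired entries simply read as misses, so callers need no special expiry handling; a Redis-backed implementation could satisfy the same two-method contract for shared environments.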


12. Budget Enforcement

12.1 Runtime counters

For a process-local runtime:

  • c = calls made
  • ti = cumulative input tokens
  • to = cumulative output tokens

12.2 Enforcement predicates

A request is permitted iff:

  • c < C_max
  • ti < I_max
  • to < O_max

If any predicate fails, the run must reject with an explicit budget error.

12.3 Budget enforcement timing

Budget checks occur before provider invocation.

12.4 Increment semantics

Counters are incremented only after a provider response is received.

12.5 Operational note

Process-local counters are sufficient for local/demo runs. Shared production environments may require durable or distributed budget state.
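A process-local sketch of such a tracker; the `Budget` shape repeats the 4.8 contract so the example is self-contained:

```typescript
type Budget = {
  maxCalls: number;
  maxTotalInputTokens: number;
  maxTotalOutputTokens: number;
};

class BudgetTracker {
  private calls = 0;        // c
  private inputTokens = 0;  // ti
  private outputTokens = 0; // to

  constructor(private readonly budget: Budget) {}

  // Checked before provider invocation (12.3):
  // permitted iff c < C_max and ti < I_max and to < O_max.
  permitted(): boolean {
    return (
      this.calls < this.budget.maxCalls &&
      this.inputTokens < this.budget.maxTotalInputTokens &&
      this.outputTokens < this.budget.maxTotalOutputTokens
    );
  }

  // Counters are incremented only after a provider response arrives (12.4).
  record(usage: { inputTokens: number; outputTokens: number }): void {
    this.calls += 1;
    this.inputTokens += usage.inputTokens;
    this.outputTokens += usage.outputTokens;
  }
}
```

A caller that finds `permitted()` false must reject with an explicit budget error rather than degrade silently, per 4.5.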


13. Provider Adapter Contract

13.1 Provider adapter purpose

The provider adapter isolates model-vendor specifics from core pipeline logic.

13.2 Minimum interface

The adapter must expose a call surface equivalent to:

interface ProviderAdapter {
  provider: LLMProvider;
  call(
    req: LLMRequest,
    prompt: string,
  ): Promise<{
    provider: LLMProvider;
    model: string;
    rawText: string;
    usage: { inputTokens: number; outputTokens: number };
  }>;
}

13.3 Stub provider

A stub provider shall be supported for:

  • public skeletons
  • offline demos
  • deterministic test harnesses
  • safe CI demonstrations

13.4 Provider principle

The provider is replaceable. Core safety posture may not depend on proprietary provider behavior.
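A minimal stub adapter in that spirit; the call signature is simplified relative to 13.2 (only the fields the stub needs are taken), and all names and the token estimate are illustrative:

```typescript
type LLMProvider = "stub" | "openai" | "azure_openai" | "anthropic" | "vertex";

// Deterministic offline provider for demos, tests, and safe CI runs (13.3).
class StubProvider {
  readonly provider: LLMProvider = "stub";

  async call(req: { requestId: string }, prompt: string) {
    // Returns a fixed, schema-shaped payload: a valid empty Report,
    // so the parse/validate boundary downstream can be exercised.
    const rawText = JSON.stringify({
      agent: "SpecLint",
      version: "1.0.0",
      input_hash: "stub",
      output_hash: "stub",
      findings: [],
    });
    return {
      provider: this.provider,
      model: "stub-model",
      rawText,
      // Rough chars-per-token estimate; real adapters report provider counts.
      usage: {
        inputTokens: Math.ceil(prompt.length / 4),
        outputTokens: Math.ceil(rawText.length / 4),
      },
    };
  }
}
```

Because the stub is deterministic, the entire pipeline around it (policy, redaction, hashing, cache, budget, schema validation, audit) can be tested without spending a single provider token.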


14. Schema Validation Boundary

14.1 Boundary definition

The schema boundary is the point where raw model text may become acceptable structured input.

14.2 Required behavior

The system must:

  1. parse raw text as JSON
  2. validate the resulting object as a Report
  3. reject malformed or invalid output

14.3 Structural validity vs correctness

Schema validity only means structure is acceptable. It does not certify truth, completeness, or sound reasoning.

14.4 Failure mode

Invalid JSON or invalid report structure must terminate the request as failure.
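A fail-closed sketch of that boundary; a production system would use a full JSON Schema validator, and the structural checks here cover only the required `Report` fields:

```typescript
// Local copies of the 4.3/4.4 contracts so the sketch is self-contained.
type Finding = {
  id: string;
  severity: "info" | "warn" | "high";
  category: "structure" | "invariant" | "threat" | "diff" | "test";
  claim: string;
  evidence_refs: string[];
};

type Report = {
  agent: string;
  version: string;
  input_hash: string;
  output_hash: string;
  findings: Finding[];
};

function parseReport(rawText: string): Report {
  let obj: unknown;
  try {
    obj = JSON.parse(rawText); // step 1: parse raw text as JSON
  } catch {
    throw new Error("schema boundary: invalid JSON"); // fail closed
  }
  const r = obj as Partial<Report>;
  const structurallyValid = // step 2: validate as Report
    typeof r?.agent === "string" &&
    typeof r?.version === "string" &&
    typeof r?.input_hash === "string" &&
    typeof r?.output_hash === "string" &&
    Array.isArray(r?.findings);
  if (!structurallyValid) {
    throw new Error("schema boundary: invalid Report structure"); // step 3: reject
  }
  return r as Report;
}
```

Per 14.3, a `Report` that survives this boundary is still only structurally acceptable; it carries no claim of semantic correctness.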


15. Audit Event Requirements

15.1 Event classes

At minimum, audit must support:

  • llm_request
  • llm_response
  • llm_error

15.2 Minimum request event fields

  • kind
  • requestId
  • timestampMs
  • provider
  • model
  • promptHash
  • contextHash

15.3 Minimum response event fields

  • all request event fields
  • outputHash
  • usage

15.4 Minimum error event fields

  • all request event fields where available
  • error name
  • error message

15.5 Structured emission

Audit events must be machine-ingestible, preferably JSON-structured.
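
One way to make request events machine-ingestible is line-delimited JSON. The field names follow 15.2; the `emitAudit` helper and sink callback are illustrative assumptions.

```typescript
// Sketch of a 15.2 request event; response/error events would extend this.
interface AuditRequestEvent {
  kind: "llm_request";
  requestId: string;
  timestampMs: number;
  provider: string;
  model: string;
  promptHash: string;
  contextHash: string;
}

// 15.5: structured emission, one JSON object per line, so any log
// pipeline can ingest events without bespoke parsing.
function emitAudit(event: AuditRequestEvent, sink: (line: string) => void): void {
  sink(JSON.stringify(event));
}
```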


16. Agent Implementation Requirements

16.1 Agent contract

Each agent must:

  • create an LLMRequest
  • bind to a versioned template
  • supply a concrete model spec
  • pass typed context
  • return Report

16.2 Required current agent classes

  • SpecLint
  • PRSynthesis
  • ThreatSketch

16.3 ThreatSketch special constraint

ThreatSketch must remain conceptual. It may classify risks and mitigations, but may not output exploitation steps.

16.4 Agent determinism rule

Agents may vary in prompt content and task definition, but not in core safety boundary behavior.


17. Runner Requirements

17.1 Local runner

The local runner must support demonstration execution using fixed example context and render advisory markdown.

17.2 GitHub runner

The GitHub runner must model or implement:

  • webhook signature verification
  • PR metadata extraction
  • installation token acquisition or workflow-token use
  • changed-file retrieval
  • path eligibility filtering
  • pipeline execution
  • advisory PR comment rendering

17.3 GitHub safety requirement

The GitHub path must default to read-only review surfaces such as comments or checks. It must not imply merge authority.


18. GitHub App / Webhook Model

18.1 Signature verification

Webhook-driven operation requires deterministic verification of the GitHub signature before processing payload content.
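
One deterministic verification sketch, following GitHub's standard `X-Hub-Signature-256` scheme (HMAC-SHA256 over the raw body, compared in constant time). The function name is an assumption; the scheme itself is GitHub's documented one.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// 18.1: verify the webhook signature before processing any payload content.
function verifyGitHubSignature(
  secret: string,
  rawBody: string,
  signatureHeader: string, // value of the X-Hub-Signature-256 header
): boolean {
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so guard first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Verification must run against the raw request body, not a re-serialized parse of it, or signatures will fail on byte-level differences.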

18.2 Installation token minting

If operating as a GitHub App, installation tokens must be minted per installation and scoped minimally.

18.3 Changed-file fetching

Only PR files relevant to the advisory pipeline may be fetched.

18.4 Policy application

Fetched files must be filtered by policy prior to downstream use.

18.5 Comment rendering

Rendered comments should state clearly that the result is advisory and schema-gated, not authoritative.


19. Pseudocode

19.1 Core request pipeline

function INVOKE(req, policy, cache, audit, provider):
    redactedReq = APPLY_REDACTION(req)

    if policy.promptInjectionGuard == true:
        ASSERT_NO_PROMPT_INJECTION(MATERIAL_FOR_GUARD(redactedReq))

    prompt = BUILD_PROMPT(redactedReq)

    promptHash  = HASH(prompt)
    contextHash = HASH_CONTEXT(redactedReq.context)
    cacheKey    = BUILD_CACHE_KEY(policy, redactedReq.model, promptHash, contextHash)

    if cache exists:
        hit = cache.get(cacheKey)
        if hit exists:
            return hit

    ENFORCE_BUDGET(policy.budget)

    audit.record(REQUEST_EVENT(...))

    try:
        raw = provider.call(redactedReq, prompt)
        UPDATE_RUNTIME_COUNTERS(raw.usage)   # per 12.4: usage counts once a response exists, even if validation fails below

        parsed = PARSE_AND_VALIDATE(raw.rawText, redactedReq.outputSchema)

        outputHash = HASH(JSON.stringify(parsed))

        response = BUILD_RESPONSE(parsed, raw, promptHash, contextHash, outputHash)

        audit.record(RESPONSE_EVENT(...))

        if cache exists:
            cache.set(cacheKey, response, ttlSeconds)

        return response

    catch err:
        audit.record(ERROR_EVENT(...))
        raise err

19.2 Path eligibility

function IS_ALLOWED_PATH(path, allowPaths, denyPaths):
    for d in denyPaths:
        if path startsWith d:
            return false

    for a in allowPaths:
        if path startsWith a:
            return true

    return false

19.3 Agent runner pattern

function RUN_AGENT(agentTemplate, ctx):
    req = {
        requestId: BUILD_REQUEST_ID(agentTemplate, ctx),
        system: agentTemplate.system,
        task: agentTemplate.task,
        constraints: agentTemplate.constraints,
        outputSchema: agentTemplate.outputSchema,
        model: SELECT_MODEL(agentTemplate),
        context: {
            diffSummary: ctx.diffSummary,
            files: ctx.changedFiles
        }
    }

    res = LLM_CLIENT.invoke(req)
    return res.parsed

20. Evaluation and Metrics

20.1 Primary evaluation principle

The system must be evaluated by engineering outcomes, not token volume.

20.2 Suggested metrics

  • reduction in human review time
  • number of ambiguities caught before merge
  • contradiction detection rate
  • false positive rate
  • structured output acceptance rate
  • cache hit rate
  • provider failure rate
  • rejected unsafe-context rate
  • budget-overrun frequency
  • audit completeness rate

20.3 Quality lens

A system that spends fewer tokens but leaks secrets or produces unactionable noise is not successful.


21. Security Requirements

21.1 Secrets

Secrets must never be intentionally included in provider-bound prompt context.

21.2 PII

PII-bearing material must be excluded or redacted according to policy.

21.3 Write access

Write-capable automation must remain disabled unless explicitly approved and separately reviewed.

21.4 Supply chain

Dependencies used in CI or webhook execution should be minimal, pinned where appropriate, and reviewable.

21.5 Output treatment

Even validated output must remain advisory unless a separate deterministic control layer explicitly promotes a subset of behavior.


22. Failure Modes and Required Handling

22.1 Prompt injection guard triggered

Result: reject request, record error audit event.

22.2 Path not allowed

Result: exclude file or reject run depending on runner policy.

22.3 Redaction alters material significantly

Result: continue if structure remains usable; otherwise surface limited-result state.

22.4 Cache unavailable

Result: continue without cache if safety posture is preserved.

22.5 Budget exceeded

Result: reject before provider invocation.

22.6 Provider failure

Result: record error audit event and surface failure.

22.7 Invalid JSON

Result: reject response.

22.8 Schema mismatch

Result: reject response.

22.9 Audit sink failure

Preferred result: surface operational error; do not silently claim successful audit if audit failed.


23. Test Requirements

23.1 Unit tests

Minimum expected unit coverage should include:

  • path policy evaluation
  • redaction substitution
  • prompt injection heuristics
  • prompt assembly
  • schema validation success/failure
  • budget enforcement
  • cache hit/miss behavior
  • audit event formatting

23.2 Integration tests

Minimum expected integration coverage should include:

  • local runner end-to-end with stub provider
  • GitHub runner path filtering
  • advisory comment rendering
  • invalid response rejection path

23.3 Security-oriented tests

Minimum adversarial test cases should include:

  • injected override strings in diffs
  • secret-like material in changed files
  • denylisted paths in PR file lists
  • malformed JSON responses
  • structurally valid but empty reports

24. CI/CD Expectations

24.1 CI role

CI is used to verify deterministic correctness around the DevKit itself, not to treat model output as a release authority.

24.2 CI checks

Expected checks include:

  • formatting
  • linting
  • type checking
  • unit tests
  • integration tests where safe
  • workflow syntax validation

24.3 Public skeleton safety

In public or demonstration contexts, provider calls should remain stubbed unless explicitly configured otherwise.


25. Acceptance Criteria

Implementation satisfies this spec when all of the following are true:

  • typed requests can be constructed and executed
  • policy-based path filtering works as specified
  • redaction executes before provider call
  • prompt injection screening can reject suspicious content
  • prompt envelopes are assembled in labeled sections
  • prompt and context hashes are generated deterministically
  • cache hits bypass provider calls
  • budget enforcement blocks overrun conditions
  • provider adapters can be swapped without changing core logic
  • invalid JSON responses are rejected
  • invalid report structures are rejected
  • audit events are emitted for request/response/error paths
  • agents return structured reports
  • local runner can produce advisory markdown
  • GitHub runner can model or execute advisory PR workflow safely
  • no code path grants implicit merge or deploy authority to AI output

26. Implementation Notes

26.1 Public skeleton vs production implementation

The current repository may use lightweight validators, in-memory cache, and stub provider surfaces. That is acceptable for the public skeleton. Production-hardening may replace those internals without changing the architectural contract defined here.

26.2 Behavioral invariants that must not drift

The following invariants are mandatory:

  • AI output remains advisory
  • deterministic validation remains authoritative
  • provider access happens only after safety preflight
  • schema failure rejects output
  • budget is bounded
  • path policy is enforced
  • audit remains structured
  • read-only is the default integration posture

27. Summary

This engineering specification defines an AI-assisted engineering framework that is useful precisely because it is constrained.

The system is not valuable when it is permissive. It is valuable when it is:

  • structured
  • bounded
  • reviewable
  • cheap enough to operate
  • difficult to misuse
  • explicit about authority

That is the implementation contract for the Thesis Chain AI DevKit.

CI/CD Integration

CI/CD is not just a deployment mechanism. In systems like this, CI/CD is part of the control surface. It enforces the difference between “interesting idea” and “repeatable engineering behavior.”

For an AI-assisted workflow, CI/CD must enforce at least four things: deterministic validation remains authoritative, AI output stays advisory, budgets stay bounded, and the integration posture defaults to read-only.

In practice, this means the pipeline treats AI as a bounded advisory subsystem. It can inspect PR diffs, produce structured comments, and surface contradictions or risk, but it does not silently mutate production state.

The important point is architectural: CI/CD is where enforcement lives. If the rules are not enforced in the pipeline, then they are preferences, not controls.

Agentic Development Pipeline

This is the part most people misunderstand. Agentic development does not mean “use the most powerful model on everything.” It means divide work into classes, apply deterministic gates, route tasks to the cheapest sufficient capability, inspect aggressively, and preserve human authority over consequential decisions.

This is why I do not treat model choice as a status symbol. I treat it as routing policy. Different work deserves different tools. Better systems come from disciplined orchestration, not maximal model spend.

Human Inspection Roles

Human inspection remains central in any serious AI-assisted engineering system. The goal is not to remove humans from the loop. The goal is to remove low-value repetitive work while preserving human judgment where ambiguity, business context, risk, or architecture matter.

In other words: AI can accelerate analysis, summarization, contradiction discovery, and report generation. It should not silently inherit decision authority just because it is fast.

Security Architecture

Security is not a final checklist item. In AI-assisted systems it must be designed into every upstream layer: input handling, context assembly, provider boundaries, output validation, CI permissions, and operational review.

The shortest honest summary is this: safe agent systems are built by distrusting them correctly.

Case 02 — Human Agentic Pipeline

This case study documents the operating model behind a human-led agentic development pipeline. The objective is not to simulate autonomous magic. The objective is to design a system in which AI can accelerate engineering work without dissolving accountability, architectural control, or verification discipline.

In this model, AI is routed into bounded roles inside a controlled workflow. Humans retain authority over judgment, quality control, infrastructure, and final acceptance. The system is designed to produce auditable artifacts, visible checkpoints, deterministic handoff boundaries, and repeatable outputs rather than vague conversational momentum.

Problem Definition

Most “agentic” workflows fail for one of two reasons. Either they are too loose and devolve into expensive improvisation, or they are so tool-driven that no one can explain where authority lives, why a change happened, or whether the output still matches the original specification.

The engineering problem addressed here is therefore:

How do you structure a human-led, AI-assisted development system that can produce meaningful velocity while preserving deterministic phase order, verification gates, explicit authority boundaries, and drift resistance?

The answer is not “more autonomy.” The answer is architecture. Agentic systems only become useful when their behavior is constrained more like a build pipeline and less like a free-form assistant.

Operating Constraints

Blueprint Architecture

The blueprint for a human agentic pipeline starts by defining role boundaries and execution order before discussing implementation. In a healthy agentic system, “who may decide what” is as important as “what code gets written.”

The structure I use is phase-driven and role-separated. The architect locks anchors and non-negotiables first. Tooling may only express what the architecture already allows. Implementation is scope-constrained to the approved tree. Verification must halt the system on drift rather than negotiate with it.

This structure matters because it prevents the most common failure mode in AI-heavy development: implementation racing ahead of architecture and forcing the system to rationalize drift after the fact.

Canonical Blueprint Markdown

The following appendix is mirrored locally from the orchestration lab blueprint and displayed here as canonical markdown.

ExNulla Blueprint

Human Agentic Orchestration Lab (Standalone Showpiece)

Repository (proposed): exnulla-orchestration-lab
Slug: orchestration-lab
Version: 1.1.0 (supersedes human-agentic-trainer v1.0.0)
Owner org: Thesis-Project (professional)
Primary goal: Portfolio-grade, standalone orchestration lab that can optionally embed as a demo via iframe (static-first).


0. Positioning

This project is a standalone orchestration lab that teaches and demonstrates agentic pipeline mechanics with:

  • Human transport (copy/paste between ChatGPT Projects) as the default execution provider.
  • Deterministic state machine and artifact ledger as the core product.
  • A clean upgrade path to API-based providers without rewriting orchestration logic.

It is intentionally “too serious” to be a toy demo.


1. Objectives

1.1 Core educational objectives

Teach (visibly, not abstractly):

  • Role separation and instruction boundaries
  • Prompt routing and supervisor logic
  • Context drift origins, detection, and recovery
  • Critic/revision loops and acceptance criteria closure
  • Budget discipline, token economy, and trade-offs

1.2 Core product objectives

Provide a reproducible lab environment:

  • Deterministic run capture + replay
  • Run artifact inspection (graph + diffs + drift flags)
  • Failure-mode injection and recovery demonstration
  • Formal role contract enforcement (schema validated outputs)
  • Cost and budget dashboards (simulated + estimated)

1.3 Optional objective (Phase 2)

Provider adapters for API orchestration (OpenAI/Anthropic/etc.) that reuse the same run state machine.


2. Constraints and non-goals

2.1 Constraints

  • Static-first deployment: default build outputs a static web app.
  • Atomic deploy friendly: build artifact can be deployed with symlink flips.
  • Iframe-safe: must function correctly when embedded in an iframe sandbox.
  • No scraping / no UI automation: human transport remains manual by design.

2.2 Non-goals (v1.1)

  • No live ChatGPT UI integration.
  • No storing personal secrets or API keys in the browser (Phase 2 moves to server runtime).
  • No “magic” agent framework wrapper that hides orchestration mechanics.

3. Target users

  • Learners: understand orchestration by running guided pipelines.
  • Hiring reviewers: see a polished, deterministic systems artifact with auditability.
  • Future-you: use specs + blueprint to build an API agent framework later without drift.

4. High-level architecture

4.1 Components

  1. LOC (Local Orchestration Console)

    • Runs locally (dev) and/or as a static app (prod) with persistence in browser storage and export/import.
    • Generates role prompts, enforces contracts, logs turns, computes budgets, flags drift, scores rubrics.
  2. Run Ledger + Artifact Store

    • Run JSON artifacts are canonical.
    • Export is deterministic: same inputs → same run structure (timestamps excluded or normalized).
  3. Inspector UI (Showpiece layer)

    • Graph view (turn DAG)
    • Drift panels
    • Budget/cost panels
    • Failure injection controls
    • Replay timeline controls
  4. Provider Adapter Layer (Transport abstraction)

    • HumanProvider (v1.1): manual paste-in/out
    • SimulatedProvider (v1.1): fake latency/cost/reliability without APIs
    • API Providers (v2+): optional later

4.2 “Square peg / round hole” mitigation

This repo is designed as standalone. If embedded into exnulla-demos, it is treated as a static build artifact embedded via iframe with a constrained integration contract (Section 13).


5. Deterministic state model

5.1 Canonical run artifact

runs/<RUN_ID>/run.json

Minimum fields:

  • schemaVersion (semver-like)
  • gitSha (injected at build time)
  • runId
  • createdAt (optional; normalized for deterministic replay exports)
  • scenarioId (the selected training scenario)
  • roles[] (role profiles and constraints)
  • turns[] (ordered, each with routing metadata and validation results)
  • artifacts[] (files/snippets produced by turns)
  • budgets (per-turn + cumulative)
  • rubric (scoring + thresholds)
  • drift (flags + evidence + severity)
  • acceptance (pass/fail + reasons)

5.2 Deterministic replay guarantee

Given:

  • Same scenarioId
  • Same initial inputs
  • Same turn responses (copied)
  • Same schemaVersion

Then:

  • The run artifact validation and derived metrics must match.

6. Role system

6.1 Default roles

  • architect
  • developer
  • critic
  • tester
  • (optional) supervisor (internal; LOC-driven orchestration)

6.2 Required ChatGPT Project setup (Human Provider)

Each role is configured as its own ChatGPT Project with persistent instructions.

The LOC provides:

  • Copy-paste “Project Instructions” templates per role.
  • A “Project Setup Checklist” with validation steps.

6.3 Formal role contract enforcement (new)

Each role response must conform to a strict schema (e.g., JSON or structured markdown blocks).

LOC validates:

  • Schema validity
  • Required fields present
  • Artifact references resolvable
  • No forbidden sections (role boundary rules)

If invalid:

  • LOC flags a contract violation.
  • LOC generates a corrective “format repair” prompt for the same role.
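
The role-boundary portion of the 6.3 contract can be sketched as a forbidden-section scan over a role's response. The section headings and role name here are illustrative assumptions, not the lab's actual contract.

```typescript
// Sketch of a 6.3 role-boundary check: a developer-role response must not
// contain sections reserved for other roles. Section names are assumptions.
const FORBIDDEN_FOR_DEVELOPER = ["## Architecture Decision", "## Rubric Verdict"];

// Returns the forbidden sections found, so the LOC can cite them as
// evidence in a contract-violation flag and in the repair prompt.
function findContractViolations(responseMarkdown: string): string[] {
  return FORBIDDEN_FOR_DEVELOPER.filter((s) => responseMarkdown.includes(s));
}
```

A non-empty result would trigger the corrective "format repair" prompt described above.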
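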

7. Drift detection and recovery

7.1 Drift signals (v1.1)

Rule-based detection, including:

  • Missing constraints or acceptance criteria
  • Contradictions vs. scenario requirements
  • Output schema violations
  • Spec deviations (e.g., wrong repo, wrong language, ignored deterministic rules)
  • Over-budget warnings and verbose inflation
  • “Unresolved questions” not propagated

7.2 Drift scoring

Each signal adds weighted severity:

  • info / warn / error
  • Cumulative drift score shown in Inspector UI
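
The weighted-severity model in 7.2 reduces to a small fold. The numeric weights are illustrative assumptions; the blueprint only specifies the three severity classes.

```typescript
// 7.2 sketch: each drift signal carries a severity; the cumulative score
// is a weighted sum surfaced in the Inspector UI. Weights are assumptions.
type Severity = "info" | "warn" | "error";
const WEIGHTS: Record<Severity, number> = { info: 1, warn: 3, error: 10 };

function driftScore(signals: { severity: Severity }[]): number {
  return signals.reduce((sum, s) => sum + WEIGHTS[s.severity], 0);
}
```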

7.3 Recovery loops

LOC generates recovery prompts:

  • “Re-anchor constraints” prompt for Architect
  • “Patch minimal diff” prompt for Developer
  • “Re-evaluate rubric” prompt for Critic
  • “Regression / edge-case sweep” prompt for Tester

8. Failure mode injection (new showpiece capability)

8.1 Purpose

Turn the lab into a resilience demonstrator:

  • show failures
  • show detection
  • show recovery
  • show cost impact

8.2 Injection modes (v1.1)

  • Ambiguous spec: remove/blur key constraints
  • Conflicting constraints: intentionally contradict requirements
  • Truncated context: simulate missing prior turns
  • Bad critic: introduce incorrect critique or wrong rubric thresholds
  • Budget crunch: set very low budget caps mid-run

8.3 Implementation concept

Injection modifies:

  • scenario inputs
  • routing prompts
  • role templates
  • budget parameters

LOC must record injection events in run artifact (injections[]).


9. Budget and economics (expanded)

9.1 Token estimation

  • Estimate tokens from characters (baseline) and/or model-specific heuristics.
  • Record per-turn estimate and cumulative.
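
The character-based baseline can be sketched in a few lines. The 4-characters-per-token ratio is a common heuristic for English text, not a blueprint requirement; model-specific heuristics would replace it.

```typescript
// 9.1 baseline: estimate tokens from character count (assumed ~4 chars/token).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Record per-turn estimate and running cumulative total.
function addTurnEstimate(cumulative: number, turnText: string): { turn: number; cumulative: number } {
  const turn = estimateTokens(turnText);
  return { turn, cumulative: cumulative + turn };
}
```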

9.2 Cost simulation

For v1.1 (no real API calls):

  • user selects “pricing profile” presets (cheap / mid / premium)
  • LOC computes simulated cost per turn and total
  • show “what this would cost” with model tiers

9.3 Dashboard outputs

  • burn-down chart over time
  • per-role share of tokens/cost
  • budget threshold warnings
  • cost of drift (extra turns caused by drift recovery)

10. Visual Inspector UI (new, high impact)

10.1 Views

  1. Run Timeline
    • turn list with role, timestamp, budget, validation, drift severity
  2. Turn Graph (DAG)
    • nodes: turns
    • edges: handoffs / dependencies
    • highlights: drift, contract violations
  3. Diff View
    • compare two turns (or two runs) for changes in constraints, artifacts, budgets
  4. Rubric Panel
    • category scores and thresholds
    • reasons for pass/fail
  5. Injection Panel
    • list and details of injected failures

10.2 UX principles

  • No hidden magic. Every derived conclusion links to evidence.
  • Export/import first-class.
  • Works in iframe (no popups, no cross-origin dependencies).

11. Multi-model simulation layer (optional in v1.1)

11.1 Why

Prepare learners for API orchestration by teaching tradeoffs:

  • latency
  • cost
  • reliability
  • verbosity

11.2 How (without APIs)

Simulated Provider:

  • assigns “model personality presets” to roles
  • applies constraints (e.g., “fast model tends to be terse and miss edge cases”)
  • introduces optional random error rates (seeded for determinism)

All simulation parameters must be recorded in the run artifact.
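
Seeded determinism can be achieved with any small PRNG; mulberry32 is one public-domain choice, used here as an illustrative sketch rather than the lab's mandated generator.

```typescript
// 11.2 sketch: a seeded PRNG so simulated error rates replay identically.
// mulberry32 is a small public-domain generator; any seeded PRNG works,
// as long as the seed is recorded in the run artifact.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

function shouldInjectError(rand: () => number, errorRate: number): boolean {
  return rand() < errorRate;
}
```

Because the sequence is a pure function of the seed, a replayed run with the same seed reproduces the same injected failures.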


12. Tech stack and repo shape (static-first)

12.1 Proposed stack

  • TypeScript (strict)
  • Vite (static build)
  • React (or Astro + React islands; choose one)
  • Zod (schema validation)
  • Vitest (tests)
  • ESLint + Prettier (enforced)
  • Docker for deterministic builds

12.2 Repo layout (proposed)

exnulla-orchestration-lab/
  apps/
    loc-web/                 # static web app
  packages/
    core/                    # state machine, schemas, scoring, drift
    scenarios/               # scenario definitions + injection templates
    ui/                      # inspector components
    cli/                     # optional CLI runner/export tools (v1.2+)
  runs/                      # sample runs (optional; or in /examples)
  docs/
    blueprint/               # this blueprint
    engineering-spec/        # detailed spec (separate doc)
    role-instructions/       # ChatGPT Project templates per role
  .github/workflows/
  Dockerfile
  package.json
  pnpm-workspace.yaml

12.3 Deterministic build requirements

  • Inject GIT_SHA at build time (ARG + ENV)
  • Include meta/version.json with git SHA and build timestamp (timestamp optional/normalized)
  • Lockfile required (pnpm)
  • CI must block merges if lint/test fail

13. Deployment and iframe embedding

13.1 Default deployment (standalone)

  • Static build served by nginx or any static host
  • Atomic deploy by swapping symlinked build directory

13.2 Iframe embedding (optional)

If embedded in exnulla-site or exnulla-demos:

  • build outputs to a single folder root with relative assets
  • no service-worker assumptions that conflict with host
  • storage uses namespaced keys:
    • exnulla.orchestrationLab.<runId> etc.
  • export/import uses file download/upload, not cross-window messaging

13.3 Integration contract (minimal)

  • Provide a single embed URL (e.g., /demos/orchestration-lab/index.html)
  • Provide a postMessage-optional integration later (v2+) but not required

14. Milestones

v1.1.0 (Showpiece baseline)

  • Core state machine + run artifact schema
  • HumanProvider workflow
  • Role contract enforcement + repair prompts
  • Drift detection v1 (rules)
  • Budget + cost dashboards (simulated)
  • Inspector UI with DAG + timeline + rubric
  • Failure injection panel + recorded injection events
  • Export/import runs (JSON) + deterministic replay validation
  • Docker + CI hygiene (lint/test/build)

v1.2.x

  • Scenario library expansion (3–6 scenarios)
  • CLI utilities for run validation and report generation
  • Run comparison tool (diff two runs)

v2.x

  • API provider adapters (optional)
  • Tool execution hooks (optional)
  • Multi-tenant “course mode” (optional)

15. Acceptance criteria

A v1.1 release is “done” when:

  1. A learner can complete a guided run end-to-end using only copy/paste.
  2. LOC validates role outputs against the schema and produces repair prompts.
  3. Drift flags trigger reliably on injected failures.
  4. Inspector clearly explains why drift was flagged (evidence linked).
  5. Exported run artifact can be imported and replay-validated deterministically.
  6. Static build deploys cleanly and works in an iframe.
  7. CI enforces strict TypeScript, linting, formatting, and tests.
  8. meta/version.json exposes build SHA.

16. Notes on scope control

This is a showpiece, but it stays manageable by enforcing:

  • Deterministic core first
  • UI second (inspector)
  • Scenario count limited in v1.1
  • Simulation kept optional and seeded (no randomness without seed)

17. Deliverables (docs)

This blueprint implies the following docs in-repo:

  • docs/blueprint/exnulla-blueprint-orchestration-lab-1-1-0.md (this file)
  • docs/engineering-spec/exnulla-engineering-spec-orchestration-lab-1-1-0.md (next step)
  • docs/role-instructions/*.md (ChatGPT Project templates)
  • docs/runbook/DEPLOY.md (atomic static deploy)
  • docs/runbook/IFRAME.md (embedding contract)

18. Repo naming rationale

Recommended: exnulla-orchestration-lab
Signals “serious systems lab” rather than “toy demo,” while staying on-brand.

Alternate options:

  • exnulla-agentic-lab
  • exnulla-orchestrator-lab
  • exnulla-human-to-api-orchestration

Engineering Specifications

The engineering spec for this operating model does not merely describe features. It defines behavioral law for the build process itself. That includes output format, file authority, acceptance gates, CI discipline, and what kinds of changes are explicitly forbidden.

In practical terms, the spec must pin down the following disciplines:

1. Output Discipline

Full-file emission matters because it prevents hidden partial edits, accidental omissions, and conversational patch ambiguity. The system should produce complete artifacts, not vague change suggestions.

2. Structure Discipline

New files may only exist if they are explicitly defined in the spec or derived in the architecture plan with anchor mapping. Unanchored structure is drift.

3. Verification Discipline

Verification is not a final glance at output quality. It is a formal gate with required proof: drift check empty, anchor coverage present, assumptions empty.

4. CI Discipline

The process assumes lint, typecheck, and build are mandatory. The agentic workflow is not complete because it “looks right.” It is complete when the repo gates are green.

5. Idempotency Discipline

Every pass should be reproducible from scratch. The pipeline should not rely on hidden chat context, implicit globals, or fragile one-off edits that cannot be replayed.

6. No-Hidden-Globals Rule

Environment requirements, allowed inputs, and tool expectations must be explicit. Invisible ambient state is a major source of drift and operational failure.

Canonical Engineering Spec Markdown

The following appendix is mirrored locally from the orchestration lab engineering spec and displayed here as canonical markdown.

ExNulla Engineering Spec

Human Agentic Orchestration Lab (Standalone Showpiece)

Repository: exnulla-orchestration-lab
Slug: orchestration-lab
Spec Version: 1.1.0
Blueprint: exnulla-blueprint-orchestration-lab-1-1-0.md
Owner org: Thesis-Project
Primary mode: Static-first web app (iframe-safe)
Provider mode (v1.1): Human transport + simulated provider (no APIs)
Last Updated (UTC): 2026-02-27T00:00:00Z


0. Scope and determinism contract

0.1 What this spec is

An implementation-grade engineering spec for a standalone orchestration lab that:

  • makes orchestration mechanics visible (role separation, routing, drift, budgets),
  • captures every run as a deterministic run artifact ledger (run.json),
  • provides an inspector UI (timeline, DAG, diffs, rubric, injections),
  • supports export/import + deterministic replay validation,
  • works in an iframe sandbox and deploys as an atomic static artifact.

This spec is written so it can be handed back later with: “build it” and executed with minimal drift.

0.2 Hard constraints (MUST)

  1. Static-first: pnpm build outputs a static bundle that can be hosted by nginx / static host.
  2. Iframe-safe: no popups, no cross-origin assumptions, no top-level navigation hacks.
  3. No UI automation/scraping: human transport is manual by design.
  4. Deterministic core: orchestration/state evaluation must be deterministic given the same inputs + responses.
  5. Export/import first-class: runs are portable JSON artifacts; UI can import/export.
  6. No secrets: browser build stores no API keys; v1.1 has no real provider calls.
  7. Repo hygiene: TypeScript strict, ESLint + Prettier, tests, Docker deterministic build.

0.3 Non-goals (v1.1)

  • Live integration with ChatGPT UI.
  • Multi-user authentication / cloud persistence.
  • Real API providers (OpenAI/Anthropic/etc.) beyond interface stubs.
  • ML-based drift classification (rule-based + evidence only).

0.4 Deterministic replay guarantee (MUST)

Given:

  • identical scenarioId,
  • identical scenario inputs,
  • identical injection set (including seed),
  • identical agent responses pasted into the ledger,
  • identical schemaVersion,

then:

  • validation results, drift flags, rubric scores, budget totals, and derived digests MUST match.

Allowed non-determinism:

  • wall-clock timestamps can exist but MUST be excluded from deterministic checks (or normalized under export).
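
The derived-digest part of this guarantee can be sketched as: normalize volatile fields out, serialize with sorted keys, then hash. The `createdAt` exclusion follows 5.1/0.4; the sort-keys serializer and `runDigest` name are assumptions.

```typescript
import { createHash } from "node:crypto";

// Key-order-independent serialization so logically equal runs hash equal.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return "[" + value.map(stableStringify).join(",") + "]";
  if (value && typeof value === "object") {
    const entries = Object.keys(value as object)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + stableStringify((value as Record<string, unknown>)[k]));
    return "{" + entries.join(",") + "}";
  }
  return JSON.stringify(value);
}

// 0.4 sketch: exclude wall-clock timestamps, then hash the remainder.
function runDigest(run: Record<string, unknown>): string {
  const copy = { ...run };
  delete copy.createdAt; // allowed non-determinism, excluded from checks
  return createHash("sha256").update(stableStringify(copy)).digest("hex");
}
```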

1. Product definition

1.1 Core workflows

  1. Create run
    • user selects scenario, provider mode, seed, budget/cost profile, and optional injections.
  2. Generate routed prompt
    • LOC produces a prompt for a role and explicit routing instructions.
  3. Human transport
    • user executes prompt in the role’s ChatGPT Project and pastes the response into the LOC.
  4. Validate + score
    • LOC validates schema/format, computes budgets/cost, flags drift, updates rubric, derives next step.
  5. Inspect
    • user inspects timeline, graph, diffs, drift evidence, rubric reasoning, injection events.
  6. Export / Import
    • export run as JSON (and optional markdown transcript); import later and replay-validate deterministically.
  7. Compare
    • compare runs (or turns) via diff UI (v1.1: within one run; v1.2: cross-run).

1.2 Target user profiles

  • Learner / developer wanting “pre-calc → calc” understanding of orchestration.
  • Hiring reviewers assessing systems thinking + determinism discipline.
  • Future-you using the ledger/state machine for API orchestration later.

2. Architecture overview

2.1 Packages (MUST)

  • packages/core
    Deterministic state machine, schemas, scoring, drift, budgets, providers, export/import, deterministic hashing.
  • packages/scenarios
    Scenario definitions, injection templates, seeded simulation knobs, scenario validation.
  • packages/ui
    Shared UI components (graph, diff, panels), pure/presentational where possible.
  • apps/loc-web
    Vite + React static web app: run wizard, prompt router, paste console, inspector.

2.2 Runtime boundaries

  • All deterministic logic lives in packages/core and must be usable:
    • from the web app, and
    • from future CLI tooling (v1.2+).
  • The web app is a thin shell around the core.

2.3 Transport / provider abstraction

  • HumanProvider (v1.1): manual paste. Produces routing instructions only.
  • SimulatedProvider (v1.1): produces deterministic “simulated outputs” for demonstration/testing, seeded.
  • ApiProvider (v2+): stub interface only in v1.1 (no keys, no calls).

3. Tech stack and repo standards

3.1 Required stack (MUST)

  • Node.js LTS (recommend 20.x)
  • TypeScript strict: true
  • pnpm + lockfile
  • Vite + React (single-page app)
  • Zod for runtime validation
  • Vitest for unit/integration tests
  • ESLint + Prettier enforced
  • Docker for deterministic builds

3.2 Deterministic build provenance (MUST)

  • Build accepts ARG GIT_SHA and injects it into the app:
    • import.meta.env.VITE_GIT_SHA (Vite) and/or process.env.GIT_SHA (tests/build scripts)
  • Build outputs meta/version.json containing:
    • gitSha,
    • schemaVersion,
    • buildId (optional; may be derived deterministically from gitSha + package versions),
    • builtAt (optional; if present must be excluded from determinism checks).

4. Repository layout

4.1 Canonical layout (MUST)

exnulla-orchestration-lab/
  apps/
    loc-web/
      index.html
      vite.config.ts
      src/
        app/
          routes/
          state/
          components/
        main.tsx
      public/
        meta/
          version.json
  packages/
    core/
      src/
        schema/
        engine/
        providers/
        scoring/
        drift/
        budget/
        export/
        util/
      tests/
    scenarios/
      src/
        scenarios/
        injections/
        pricing/
      tests/
    ui/
      src/
        graph/
        diff/
        panels/
        widgets/
  docs/
    blueprint/
    engineering-spec/
    role-instructions/
    runbooks/
  examples/
    runs/
    scenarios/
  .github/
    workflows/
  Dockerfile
  docker-compose.yml (optional)
  package.json
  pnpm-workspace.yaml
  pnpm-lock.yaml
  tsconfig.base.json
  eslint.config.js
  prettier.config.cjs

4.2 Git ignore rules

  • Ignore persisted runs by default:
    • apps/loc-web/.local/ (dev-only)
    • **/runs/** except examples/runs/**
  • Include:
    • at least one sample run artifact in examples/runs/ for regression tests and UI demo.

5. Data model: canonical run ledger

5.1 Canonical artifact path semantics

The canonical artifact is a single JSON object:

  • Web app storage: stored in browser (IndexedDB preferred; localStorage acceptable for v1.1 with size limits)
  • Exported artifact: user downloads a file named:
    • orchestration-lab.run.<runId>.json

When building a “runs folder” later (CLI), the canonical structure will be:

  • runs/<runId>/run.json (not required for static build)

5.2 Schema versioning

  • schemaVersion is a semver-like string, pinned to spec version for v1.1:
    • "1.1.0"
  • Backward compatibility requirements:
    • v1.1 UI must import artifacts with schemaVersion "1.1.0".
    • Future versions must provide migration utilities (v1.2+).

5.3 RunArtifact schema (MUST)

5.3.1 Top-level

export type RunArtifact = {
  schemaVersion: '1.1.0';
  slug: 'orchestration-lab';
  gitSha: string; // injected at build; "unknown" allowed
  runId: string; // deterministic id format
  createdAt?: string; // ISO; optional for determinism checks
  updatedAt?: string; // ISO; optional for determinism checks

  mode: {
    provider: 'human' | 'simulated'; // v1.1
    simulation?: SimulationConfig; // if simulated
  };

  scenario: {
    scenarioId: string;
    version: string; // scenario version string, e.g. "1.0.0"
    inputs: Record<string, unknown>;
  };

  injections: InjectionEvent[]; // applied injections, deterministic order
  roles: RoleProfile[]; // role contracts + instructions metadata

  turns: Turn[]; // append-only
  derived: DerivedState; // regenerated deterministically

  budgets: BudgetLedger; // token estimates, warnings
  economics: EconomicsLedger; // simulated cost and profiles

  rubric: RubricLedger; // scoring + thresholds + evidence
  drift: DriftLedger; // flags + evidence + severity summary

  acceptance: {
    passed: boolean;
    reasons: string[];
    checklist: { item: string; status: 'pass' | 'fail' | 'unknown'; evidence?: string[] }[];
  };
};

5.3.2 RoleProfile

export type RoleName = 'architect' | 'developer' | 'critic' | 'tester';

export type RoleProfile = {
  role: RoleName;
  displayName: string;
  chatgptProjectName: string; // user-configurable label
  instructionTemplateId: string; // e.g. "role-architect-1.1.0"
  contract: RoleContract;
};

export type RoleContract = {
  responseFormat: 'structured_markdown_v1' | 'json_v1';
  requiredHeaders: string[]; // exact heading strings
  requiredSections: string[]; // section ids
  forbiddenPatterns: string[]; // regex strings
  maxCodeBlockChars?: number; // heuristic for role confusion
  mustEchoRunTurnHeader: boolean; // require runId/turnId header block
};

5.3.3 Turn

export type Turn = {
  turnId: number; // 1..n
  role: RoleName;

  prompt: {
    templateId: string; // prompt template key
    text: string;
    charCount: number;
    tokenEstimate: number;
    stateDigestHash: string; // hash of digest included in prompt
  };

  response: {
    text: string;
    charCount: number;
    tokenEstimate: number;
    parsed?: ParsedResponse; // result of parsing per contract
    contractValid: boolean;
    contractErrors: string[];
  };

  analysis: {
    driftFlags: DriftFlag[];
    rubricScore: RubricScore;
    notes: string[]; // deterministic, engine-generated notes only
  };

  timestamps?: { promptedAt: string; respondedAt: string }; // optional
};

5.3.4 DerivedState (regenerated)

export type DerivedState = {
  digest: StateDigest; // compact state summary
  digestHash: string; // stable hash of digest
  openIssues: Issue[];
  artifactsIndex: ArtifactRef[];
  loopCountByStage: Record<string, number>;
  completion: { done: boolean; nextRole: RoleName | null; stage: Stage };
};

5.3.5 Digest / issues / artifacts

export type Stage = 'kickoff' | 'implementation' | 'review' | 'test' | 'revise' | 'finalize';

export type StateDigest = {
  scenarioSummary: string; // scenario-provided summary, bounded
  constraints: string[]; // scenario constraints, stable order
  acceptanceCriteria: string[]; // stable order
  deliverables: string[]; // stable order
  lastDecisions: string[]; // last 3 decisions (deterministic extraction)
  openQuestions: string[]; // extracted from critic/tester
  artifactHints: string[]; // from dev outputs / plan sections
};

export type Issue = {
  id: string; // stable hash id
  severity: 'info' | 'warn' | 'error';
  source: 'critic' | 'tester' | 'engine';
  message: string;
  evidence: string[];
  open: boolean;
};

export type ArtifactRef = {
  id: string; // stable hash id
  kind: 'snippet' | 'filetree' | 'patch' | 'plan' | 'testplan';
  title: string;
  producedByTurnId: number;
  contentHash: string;
  excerpt: string; // bounded excerpt for UI
};

5.4 Deterministic hashing (MUST)

  • Use a stable hash for digests, issues, artifacts:
    • sha256(canonicalJsonString(value))
  • Canonical JSON stringification:
    • stable key ordering,
    • no whitespace variability,
    • arrays kept in order.
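A minimal sketch of the canonical stringification and stable hash, assuming Node's built-in crypto module; function names are illustrative, not the required API:

```typescript
import { createHash } from 'node:crypto';

// Recursively sort object keys so semantically equal values always
// serialize to the same string (stable ordering, no whitespace).
function canonicalJsonString(value: unknown): string {
  if (Array.isArray(value)) {
    return '[' + value.map(canonicalJsonString).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    return (
      '{' +
      Object.keys(obj)
        .sort()
        .map((k) => JSON.stringify(k) + ':' + canonicalJsonString(obj[k]))
        .join(',') +
      '}'
    );
  }
  return JSON.stringify(value);
}

// sha256(canonicalJsonString(value)), per section 5.4.
function stableHash(value: unknown): string {
  return createHash('sha256').update(canonicalJsonString(value)).digest('hex');
}
```

Because keys are sorted before hashing, two objects that differ only in key insertion order produce identical digests, which is what makes digest comparison usable as a replay check.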

6. Scenario system

6.1 Scenario definition format (MUST)

Scenarios are authored as TypeScript objects in packages/scenarios and exported as a registry.

export type Scenario = {
  scenarioId: string; // e.g. "hello-orchestration"
  version: string; // semver string
  title: string;
  summary: string; // bounded summary
  description: string;

  constraints: string[]; // stable order
  acceptanceCriteria: string[]; // stable order
  deliverables: string[]; // stable order

  roleTemplates: {
    architect: PromptTemplateId;
    developer: PromptTemplateId;
    critic: PromptTemplateId;
    tester: PromptTemplateId;
  };

  initialInputsSchema: z.ZodTypeAny; // validates scenario inputs
  defaultInputs: Record<string, unknown>;

  rubricProfileId: string; // ties to rubric weights
};

6.2 Required scenarios (v1.1)

Ship 3 scenarios minimum (MUST), each designed to show different drift/failure types:

  1. hello-orchestration
    Simple deterministic task, emphasizes contracts + budgets.
  2. drift-trap-spec
    Ambiguous requirements; emphasizes clarification propagation and re-anchoring.
  3. regression-loop
    Forces test failures and revise loops; emphasizes loop caps and cost-of-drift.

6.3 Scenario determinism rules

  • Scenario registry ordering must be stable (sort by scenarioId).
  • Scenario inputs are validated and stored verbatim in run artifact.
  • Any scenario-generated derived values must be stored or recomputable deterministically.

7. Role system and ChatGPT Project setup

7.1 Role instruction templates (MUST)

Ship templates in docs/role-instructions/:

  • architect.md
  • developer.md
  • critic.md
  • tester.md

Each template MUST contain:

  • Mission
  • Allowed outputs
  • Forbidden actions
  • Required response format contract
  • Determinism rules (“no hallucinated filenames; state assumptions explicitly”)
  • Interaction protocol for missing info (“ask targeted questions; do not proceed with guesses”)

7.2 Contract format: structured_markdown_v1 (default)

All role responses MUST begin with an exact header block:

# Role: <Architect|Developer|Critic|Tester>
# Run: <runId>
# Turn: <turnId>

Then role-specific sections with fixed headings (examples below). LOC must validate these headings (case-sensitive) as the contract baseline.

Architect required headings

  • ## Constraints (Do Not Violate)
  • ## Acceptance Criteria (Checklist)
  • ## System Plan
  • ## Open Questions
  • ## Next Handoff

Developer required headings

  • ## Implementation Plan
  • ## Proposed File Tree
  • ## Patch / Diff
  • ## Notes for Critic
  • ## Next Handoff

Critic required headings

  • ## Contract Validation
  • ## Drift Signals
  • ## Rubric Scoring
  • ## Blocking Issues
  • ## Non-Blocking Suggestions
  • ## Next Handoff

Tester required headings

  • ## Test Plan
  • ## Test Results
  • ## Failures / Repro Steps
  • ## Risk Assessment
  • ## Next Handoff
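A sketch of the baseline contract check — the exact header block plus case-sensitive heading presence. The regex and helper names are illustrative; the real validator also enforces role-specific sections and forbidden patterns:

```typescript
// Matches the exact three-line header block required by structured_markdown_v1.
const HEADER_RE =
  /^# Role: (Architect|Developer|Critic|Tester)\n# Run: (\S+)\n# Turn: (\d+)\n/;

function parseHeader(
  text: string,
): { role: string; runId: string; turnId: number } | null {
  const m = HEADER_RE.exec(text);
  if (!m) return null;
  return { role: m[1], runId: m[2], turnId: Number(m[3]) };
}

// Case-sensitive, exact-line heading check: a required heading counts as
// present only if it appears verbatim on its own line.
function missingHeadings(text: string, required: string[]): string[] {
  const lines = text.split('\n');
  return required.filter((h) => !lines.includes(h));
}
```

The list returned by `missingHeadings` feeds directly into contract errors and, on failure, into the repair prompt described in 7.3.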

7.3 Repair prompts (MUST)

If a response fails contract validation:

  • engine must generate a repair prompt for the same role that:
    • explicitly lists missing headings/fields,
    • instructs the role to rewrite in the required format,
    • forbids changing substantive content beyond formatting unless requested.

Repair events must be recorded as:

  • a drift flag DRIFT_CONTRACT_VIOLATION,
  • plus an engine note explaining the repair required.

8. Orchestration engine (state machine)

8.1 Engine API surface (MUST)

In packages/core/src/engine/ implement:

export type EngineInput = {
  run: RunArtifact;
  event: EngineEvent;
};

export type EngineEvent =
  | { type: 'INIT_RUN'; scenarioId: string; inputs: Record<string, unknown>; config: RunConfig }
  | { type: 'PASTE_RESPONSE'; text: string }
  | { type: 'APPLY_INJECTION'; injectionId: string; params?: Record<string, unknown> }
  | { type: 'SET_BUDGET_CAP'; tokenEstimateCap: number }
  | { type: 'SET_PRICING_PROFILE'; profileId: string }
  | { type: 'RESET_TO_TURN'; turnId: number }; // optional v1.1, required v1.2

export type EngineOutput = {
  run: RunArtifact; // updated artifact
  next: {
    role: RoleName | null;
    stage: Stage;
    routingInstruction?: string;
    promptText?: string;
  };
  diagnostics: {
    contractErrors?: string[];
    driftFlags?: DriftFlag[];
    rubricScore?: RubricScore;
  };
};

export function stepEngine(input: EngineInput): EngineOutput;

8.2 Deterministic derivation pipeline (MUST)

On each PASTE_RESPONSE:

  1. Identify expected role/stage from run.derived.completion.
  2. Validate response contract; parse into ParsedResponse.
  3. Compute charCount + tokenEstimate.
  4. Run drift detection (rule-based) with evidence.
  5. Run rubric scoring (rule-based) with evidence.
  6. Update budgets + economics ledgers.
  7. Derive DerivedState from all prior turns deterministically.
  8. Choose next role/stage based on transition rules.

8.3 Transition rules (v1.1) (MUST)

  • Stage progression:
    • kickoff (architect) → implementation (developer) → review (critic) → test (tester) → finalize (architect)
  • Loops:
    • If critic finds blocking issues OR rubric score below threshold:
      • review (critic) → revise (developer) → review (critic)
    • If tester reports failures:
      • test (tester) → revise (developer) → review (critic) → test (tester) (as needed)
  • Loop caps:
    • maxReviseLoops default: 5
    • if exceeded:
      • mark acceptance passed=false,
      • force finalize (architect) with reasons including loop cap triggered.
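The transition rules above can be sketched as a pure function; the input shape and field names are illustrative assumptions, not the spec's engine types:

```typescript
type Stage = 'kickoff' | 'implementation' | 'review' | 'test' | 'revise' | 'finalize';

type TransitionInput = {
  stage: Stage;
  blockingIssues: boolean; // from critic / rubric threshold
  testFailures: boolean;   // from tester
  reviseLoops: number;
  maxReviseLoops: number;  // default 5
};

// Deterministic next-stage selection for the v1.1 happy path and loops.
function nextStage(t: TransitionInput): Stage {
  // Loop cap exceeded: force finalize (acceptance marked failed elsewhere).
  if (t.reviseLoops > t.maxReviseLoops) return 'finalize';
  switch (t.stage) {
    case 'kickoff': return 'implementation';
    case 'implementation': return 'review';
    case 'review': return t.blockingIssues ? 'revise' : 'test';
    case 'revise': return 'review';
    case 'test': return t.testFailures ? 'revise' : 'finalize';
    case 'finalize': return 'finalize';
  }
}
```

Keeping transitions as a pure function of explicit inputs is what makes the replay guarantee in 0.4 checkable: the same turn history must always produce the same routing.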

8.4 State digest regeneration (MUST)

Digest is regenerated from:

  • scenario summary + constraints + acceptance criteria + deliverables,
  • latest Architect “System Plan” section (bounded),
  • open issues extracted from critic/tester sections (bounded),
  • last 3 decisions extracted from “Next Handoff” sections.

Extraction rules must be deterministic and documented (regex-based with stable ordering).


9. Drift detection

9.1 Drift ledger schema

export type DriftLedger = {
  flags: DriftFlag[];
  maxSeverity: 'none' | 'info' | 'warn' | 'error';
  score: number; // weighted sum
};

export type DriftFlag = {
  id: string; // stable code
  severity: 'info' | 'warn' | 'error';
  message: string;
  turnId: number;
  evidence: string[]; // exact excerpts or rule hits
  category: 'contract' | 'role_boundary' | 'constraint' | 'scope' | 'budget' | 'consistency';
};

9.2 Required drift rules (v1.1)

Contract

  • Missing required headings / header block
  • Invalid run/turn header values (non-matching runId, non-integer turn)
  • Unparseable structured sections

Role boundary

  • Architect includes large code blocks over maxCodeBlockChars → warn
  • Developer includes rubric scoring section → warn
  • Critic proposes implementing code changes (not critique) → warn
  • Tester proposes architecture changes (not test results) → warn

Constraints

  • Mentions forbidden actions (scraping, secrets, automation, “I executed code”, etc.)
  • Mentions external network calls if constraint forbids.

Scope

  • Introduces new deliverables not in scenario deliverables
  • Changes language/stack when constraints fix it

Budget

  • Excess verbosity: response token estimate exceeds per-turn ceiling (configurable)
  • Budget cap exceeded: error

Consistency

  • Contradicts prior accepted constraints/decisions (simple text match + hash checks of constraint lists)

9.3 Drift scoring weights (MUST)

Provide a deterministic scoring table in code:

  • info = +1
  • warn = +5
  • error = +20

Plus per-category multipliers:
  • contract ×1.0
  • constraint ×1.5
  • consistency ×1.2
  • budget ×1.1
  • scope ×1.3
  • role_boundary ×1.0
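The scoring table translates directly into code; a minimal sketch of the weighted sum (constant and function names are illustrative):

```typescript
type Severity = 'info' | 'warn' | 'error';
type Category =
  | 'contract' | 'role_boundary' | 'constraint'
  | 'scope' | 'budget' | 'consistency';

const SEVERITY_POINTS: Record<Severity, number> = { info: 1, warn: 5, error: 20 };

const CATEGORY_MULTIPLIER: Record<Category, number> = {
  contract: 1.0,
  constraint: 1.5,
  consistency: 1.2,
  budget: 1.1,
  scope: 1.3,
  role_boundary: 1.0,
};

// Weighted sum across all flags: points(severity) × multiplier(category).
function driftScore(flags: { severity: Severity; category: Category }[]): number {
  return flags.reduce(
    (sum, f) => sum + SEVERITY_POINTS[f.severity] * CATEGORY_MULTIPLIER[f.category],
    0,
  );
}
```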

10. Rubric scoring

10.1 Rubric ledger schema

export type RubricLedger = {
  profileId: string;
  thresholds: {
    overallPassScore: number; // e.g. 80
    maxAllowedDriftSeverity: 'warn' | 'error'; // default "warn"
    consecutivePassTurns: number; // default 2
  };
  scores: RubricScore[];
  lastTwoPass: boolean;
};

export type RubricScore = {
  turnId: number;
  role: RoleName;
  score: number; // 0..100
  breakdown: {
    completeness: number; // 0..25
    correctnessSignals: number; // 0..25
    constraintAdherence: number; // 0..25
    clarity: number; // 0..25
  };
  evidence: string[]; // bounded list
  notes: string[];
};

10.2 Deterministic scoring heuristics (MUST)

Each dimension uses deterministic signals:

  • Completeness:
    • required headings present,
    • acceptance criteria referenced (architect + finalize turns),
    • deliverables addressed (developer).
  • Correctness signals:
    • explicit assumptions list present when needed,
    • no contradiction flags,
    • critic/tester issues include reproduction/evidence.
  • Constraint adherence:
    • no constraint drift flags,
    • no forbidden patterns.
  • Clarity:
    • headings + bullet lists,
    • bounded verbosity,
    • actionable steps in “Next Handoff”.

Rubric code MUST output evidence that can be shown in the UI.


11. Budgeting and simulated economics

11.1 Token estimation (MUST)

  • tokenEstimate = ceil(charCount / 4)
  • Track:
    • per-prompt and per-response estimates,
    • cumulative totals,
    • per-role totals.
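The estimator itself is one line; shown here to make the determinism point explicit — it depends only on character count, never on a tokenizer library or model:

```typescript
// Section 11.1: tokenEstimate = ceil(charCount / 4).
function tokenEstimate(text: string): number {
  return Math.ceil(text.length / 4);
}
```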

11.2 Budget ledger schema

export type BudgetLedger = {
  tokenEstimateCap?: number;
  used: number;
  usedByRole: Record<RoleName, number>;
  warnings: { atTurn: number; severity: 'info' | 'warn' | 'error'; message: string }[];
};

11.3 Warning thresholds (MUST)

If cap exists:

  • 70% → warn
  • 85% → warn
  • 100% → error (require explicit “continue anyway” toggle in UI)
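A sketch of the threshold check, returning only the most severe applicable message; the real ledger would record one warning entry per threshold crossing rather than re-evaluating on every turn (an assumption of this sketch):

```typescript
type BudgetWarning = { severity: 'warn' | 'error'; message: string } | null;

// Maps usage against an optional cap to the 70% / 85% / 100% thresholds.
function checkBudget(used: number, cap?: number): BudgetWarning {
  if (cap === undefined) return null; // no cap configured
  const ratio = used / cap;
  if (ratio >= 1.0) return { severity: 'error', message: 'Budget cap exceeded' };
  if (ratio >= 0.85) return { severity: 'warn', message: '85% of budget used' };
  if (ratio >= 0.7) return { severity: 'warn', message: '70% of budget used' };
  return null;
}
```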

11.4 Cost simulation (MUST)

No real pricing calls. Provide local profile table:

export type PricingProfile = {
  profileId: string; // "cheap" | "mid" | "premium"
  title: string;
  promptPer1kTokensUSD: number;
  completionPer1kTokensUSD: number;
};

export type EconomicsLedger = {
  pricingProfileId: string;
  simulatedCostUSD: number;
  costByRoleUSD: Record<RoleName, number>;
  costByTurnUSD: Record<number, number>;
  costOfDriftUSD: number; // computed as cost of turns after first drift>=warn
};
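The cost-of-drift metric can be sketched as follows. Note an interpretive assumption: "turns after first drift ≥ warn" is read here as inclusive of the turn that drifted, since that turn's cost is already spent on drifting output:

```typescript
type FlagSummary = { turnId: number; severity: 'info' | 'warn' | 'error' };

// Sum of simulated cost for every turn at or after the first turn carrying
// a drift flag of severity warn or worse. Returns 0 if no such flag exists.
function costOfDrift(
  costByTurnUSD: Record<number, number>,
  flags: FlagSummary[],
): number {
  const driftTurns = flags
    .filter((f) => f.severity !== 'info')
    .map((f) => f.turnId);
  if (driftTurns.length === 0) return 0;
  const first = Math.min(...driftTurns);
  return Object.entries(costByTurnUSD)
    .filter(([turnId]) => Number(turnId) >= first)
    .reduce((sum, [, cost]) => sum + cost, 0);
}
```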

12. Failure mode injection

12.1 Injection model (MUST)

Injections are deterministic transformations applied at run creation or mid-run.

export type InjectionEvent = {
  injectionId: string; // stable id
  appliedAtTurnId: number; // 0 for pre-run
  params: Record<string, unknown>;
  seed?: number; // if injection uses randomness
  description: string;
};

12.2 Required injection types (v1.1)

  1. AMBIGUOUS_SPEC
    • removes acceptance criteria items or makes one vague.
  2. CONFLICTING_CONSTRAINTS
    • injects contradictory constraint pair and forces architect re-anchor.
  3. TRUNCATED_CONTEXT
    • engine includes fewer turn summaries in prompt generation.
  4. BAD_CRITIC
    • simulated critic produces incorrect critique (sim provider only).
  5. BUDGET_CRUNCH
    • lowers cap mid-run and forces recovery strategy.

12.3 Recording and evidence (MUST)

  • Every injection must be recorded in run.injections[].
  • Drift detection must reference injections where relevant (“this failure was injected”).

13. Prompt generation

13.1 Prompt template requirements (MUST)

Prompt templates must be:

  • deterministic,
  • minimal history,
  • always include the current StateDigest (bounded),
  • explicitly state the role contract format.

13.2 Prompt generation algorithm (MUST)

  • Input:
    • scenario definition,
    • current digest,
    • last N turns summaries (default N=2),
    • injections affecting prompts,
    • budget status.
  • Output:
    • a single prompt string.

History inclusion MUST be bounded:

  • include only:
    • digest,
    • last N summaries (generated deterministically from parsed role sections),
    • open issues list.

13.3 Prompt provenance

Store in each turn:

  • templateId,
  • included digestHash (so later we can prove prompt was generated from digest X),
  • token estimates.

14. Persistence, export, import

14.1 In-browser persistence (v1.1)

Preferred: IndexedDB via a small wrapper (e.g. idb library) to store:

  • run list metadata,
  • full run artifacts.

Fallback: localStorage for metadata + compressed run JSON (only if small).

Key namespace (MUST):

  • exnulla.orchestrationLab.*
  • include schemaVersion in keys where useful.

14.2 Export format (MUST)

  • Export is the canonical RunArtifact JSON.
  • Additionally export (optional):
    • transcript.md (prompt/response pairs),
    • summary.md (budgets, rubric, drift, acceptance checklist).

14.3 Import validation (MUST)

Import must:

  • validate schemaVersion,
  • validate Zod schema,
  • recompute derived state and compare to stored derived (deterministic check),
  • show any mismatches as “artifact integrity warnings.”
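The integrity check reduces to recompute-and-compare. A minimal sketch, with `deriveState` passed in as a dependency (the real function lives in packages/core; the shapes here are simplified assumptions):

```typescript
type MinimalRun = { turns: unknown[]; derived: { digestHash: string } };

// Recompute derived state from turns and compare digests; any mismatch
// becomes an "artifact integrity warning" surfaced in the import UI.
function replayValidate(
  run: MinimalRun,
  deriveState: (turns: unknown[]) => { digestHash: string },
): string[] {
  const warnings: string[] = [];
  const recomputed = deriveState(run.turns);
  if (recomputed.digestHash !== run.derived.digestHash) {
    warnings.push('artifact integrity warning: derived.digestHash mismatch');
  }
  return warnings;
}
```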

15. Inspector UI

15.1 Routes (MUST)

  • / → landing + “New Run” + “Import Run”
  • /runs → run list
  • /runs/:runId → run overview (timeline)
  • /runs/:runId/turns/:turnId → turn detail
  • /runs/:runId/graph → DAG view
  • /runs/:runId/diff → diff view (turn-to-turn)
  • /runs/:runId/rubric → rubric panel
  • /runs/:runId/drift → drift panel
  • /runs/:runId/injections → injection panel
  • /meta/version.json → version endpoint (static)

15.2 Timeline view requirements

  • per turn:
    • role badge,
    • contract status,
    • token estimate + cumulative,
    • drift severity,
    • rubric score,
    • links to detail and diff.

15.3 DAG view requirements

  • nodes = turns (ordered left-to-right by turnId)
  • edges = inferred stage transitions / loops
  • node styles:
    • contract invalid → highlight
    • drift warn/error → highlight
  • click node opens turn detail

Implementation:

  • use a lightweight graph lib compatible with static builds (e.g. React Flow) OR custom SVG layout.
  • determinism requirement:
    • graph layout must be stable for a given run (seeded layout if using force algorithms).

15.4 Diff view requirements

Diff options:

  • prompt vs prompt (two turns)
  • response vs response
  • digest vs digest across turns

Implementation:

  • use a deterministic diff algorithm (e.g. diff package) and render hunks.

15.5 Paste console requirements

  • shows expected role + stage
  • shows prompt block (copy button)
  • provides paste input area
  • validates contract live and shows errors before submission
  • submits through stepEngine with an event of { type: 'PASTE_RESPONSE', text } (per the EngineInput shape in section 8.1)

15.6 Accessibility / iframe constraints

  • no reliance on window.top control
  • all downloads via standard browser download; no popups
  • no external fonts required (optional)

16. Simulated provider (optional but REQUIRED for tests)

16.1 Purpose

  • Provide deterministic “agent outputs” for:
    • unit/integration tests,
    • demo mode without ChatGPT UI,
    • injecting failure patterns reproducibly.

16.2 SimulationConfig

export type SimulationConfig = {
  seed: number; // required
  modelPresetByRole: Record<RoleName, 'fast' | 'balanced' | 'thorough'>;
  errorRateByRole: Record<RoleName, number>; // 0..1
  verbosityByRole: Record<RoleName, number>; // 0..1
};

16.3 Simulation determinism rules

  • Use a seeded PRNG (e.g. seedrandom) in core.
  • Never use Math.random() directly.
  • All simulated outputs must embed the run/turn header block and required headings.
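To illustrate the "never Math.random()" rule without the seedrandom dependency, here is mulberry32, a tiny seeded PRNG: the same seed always yields the same sequence, which is all the simulated provider needs. Any seeded generator satisfies the requirement:

```typescript
// mulberry32: deterministic PRNG returning values in [0, 1).
// Same seed → same sequence, so simulated runs replay byte-for-byte.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

In practice the generator would be constructed once per run from `SimulationConfig.seed` and threaded through every simulated output, never re-seeded mid-run.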

17. Testing plan

17.1 Core unit tests (MUST)

  • schema validation (valid + invalid fixtures)
  • deterministic hashing + canonical json
  • drift rules hit expected evidence
  • rubric scoring stable given fixed input
  • budget math and warning thresholds
  • digest regeneration stable
  • transition rules with loop caps

17.2 Integration tests (MUST)

  • simulate an entire run with SimulatedProvider:
    • with no injections → should pass acceptance,
    • with each injection type → should flag drift and/or fail acceptance depending on design.

17.3 UI smoke tests (SHOULD)

  • ensure build compiles
  • ensure routes render with sample run artifact

18. CI and release hygiene

18.1 GitHub Actions (MUST)

Workflow steps:

  1. pnpm install --frozen-lockfile
  2. pnpm lint
  3. pnpm test
  4. pnpm build
  5. optional: upload dist/ as artifact

18.2 Version stamping (MUST)

  • GIT_SHA injected in CI:
    • GIT_SHA=${{ github.sha }}
  • meta/version.json created during build from env + package version.

19. Docker spec (deterministic build)

19.1 Dockerfile requirements (MUST)

  • multi-stage build (build → nginx or dist output)
  • uses pnpm with lockfile
  • accepts ARG GIT_SHA

Example (reference, adjust as needed):

FROM node:20-alpine AS build
WORKDIR /app
ARG GIT_SHA=unknown
ENV VITE_GIT_SHA=$GIT_SHA

COPY package.json pnpm-lock.yaml pnpm-workspace.yaml ./
COPY apps/loc-web/package.json apps/loc-web/package.json
COPY packages/core/package.json packages/core/package.json
COPY packages/scenarios/package.json packages/scenarios/package.json
COPY packages/ui/package.json packages/ui/package.json

RUN corepack enable && corepack prepare pnpm@latest --activate
RUN pnpm install --frozen-lockfile

COPY . .
RUN pnpm build

FROM nginx:alpine AS runtime
COPY --from=build /app/apps/loc-web/dist /usr/share/nginx/html

19.2 Determinism note

Avoid embedding build timestamps unless explicitly excluded from replay checks.


20. Security and safety

20.1 No secrets rule (MUST)

  • UI must warn: “Do not paste secrets; this tool stores data locally.”
  • Best-effort secret detection (SHOULD):
    • regex for common token formats,
    • show warning banner; allow user to proceed (do not hard-block in v1.1).

20.2 Content boundaries

  • Role templates must forbid:
    • claiming to have executed code,
    • scraping/automation,
    • accessing private systems.

21. Acceptance criteria (v1.1 release gate)

A v1.1.0 release is “done” when all are true:

  1. New run wizard works end-to-end in Human mode using copy/paste.
  2. Contract validation triggers and generates repair prompts.
  3. Drift rules reliably fire on injected failure modes with evidence.
  4. Inspector explains drift + rubric with clickable evidence.
  5. Export/import roundtrip works and deterministic replay validation passes.
  6. Static build runs cleanly and is iframe-safe.
  7. CI enforces strict TS, lint, tests, build.
  8. /meta/version.json exposes git SHA and schemaVersion.

22. Implementation checklist (file-level)

22.1 packages/core (MUST)

  • src/schema/runArtifact.ts (types + zod)
  • src/util/canonicalJson.ts (stable stringify)
  • src/util/hash.ts (sha256 helpers)
  • src/engine/stepEngine.ts
  • src/engine/deriveState.ts
  • src/drift/rules/*.ts
  • src/scoring/rubric.ts
  • src/budget/budget.ts
  • src/providers/humanProvider.ts
  • src/providers/simulatedProvider.ts
  • tests/*

22.2 packages/scenarios (MUST)

  • scenario registry + zod input schemas
  • injection registry + deterministic transforms
  • pricing profiles

22.3 apps/loc-web (MUST)

  • run store (IndexedDB wrapper)
  • new run wizard
  • prompt router + paste console
  • inspector routes (timeline, turn detail, graph, diff, rubric, drift, injections)
  • export/import UI

22.4 docs (MUST)

  • role instruction templates
  • runbooks:
    • DEPLOY.md (atomic static deploy)
    • IFRAME.md (embedding contract and storage namespace)

23. Appendix A — Deterministic runId format

23.1 Format

Use a URL-safe id:

  • orl_<YYYYMMDD>_<hhmmss>_<randBase32> for human runs (time-based, not determinism-critical), OR
  • orl_<hashPrefix> for deterministic runs if seed-based.

v1.1 choice (recommended):

  • time-based is acceptable because determinism is based on artifact content, not runId.
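A sketch of the time-based format; the random source is injected so the id generator itself stays testable, and the 6-character suffix length is an assumption of this sketch, not a spec requirement:

```typescript
const BASE32 = 'abcdefghijklmnopqrstuvwxyz234567';

// orl_<YYYYMMDD>_<hhmmss>_<randBase32> — time-based, not determinism-critical,
// since replay determinism is anchored to artifact content, not runId.
function newRunId(now: Date, rand: () => number): string {
  const pad = (n: number, w: number) => String(n).padStart(w, '0');
  const date =
    `${now.getUTCFullYear()}${pad(now.getUTCMonth() + 1, 2)}${pad(now.getUTCDate(), 2)}`;
  const time =
    `${pad(now.getUTCHours(), 2)}${pad(now.getUTCMinutes(), 2)}${pad(now.getUTCSeconds(), 2)}`;
  let suffix = '';
  for (let i = 0; i < 6; i++) suffix += BASE32[Math.floor(rand() * 32)];
  return `orl_${date}_${time}_${suffix}`;
}
```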

23.2 Requirement

  • runId must be unique within local store.
  • export file naming uses runId.

24. Appendix B — UI embed contract (iframe)

24.1 Static hosting assumptions

  • all assets served relative to app root
  • no service worker required
  • no absolute URLs

24.2 Storage namespace

All keys must be prefixed:

  • exnulla.orchestrationLab.v1.1.0.*

25. Roadmap hooks (v1.2+ / v2+)

25.1 v1.2 (planned)

  • CLI validator:
    • validate-run <file>
    • diff-runs <a> <b>
  • cross-run comparison UI
  • more scenarios (6+)

25.2 v2 (planned)

  • API provider adapters
  • optional server runtime for keys (not in browser)
  • tool execution hooks (optional)

CI/CD and Verification Model

CI/CD is the external enforcement mechanism for this workflow. It is where subjective process claims become objective pass/fail behavior.

The point is simple: the repo gates are the truth surface. Any agentic process that bypasses them is theater.

Agentic Development Pipeline

The workflow is deliberately closer to a supervised build engine than a conversational coding assistant. Roles are isolated. Scope is constrained. Output is artifact-based. Verification has veto power.

This role separation is not ceremony. It is what allows the system to scale reasoning without allowing authority to become ambiguous.

The hidden advantage is economic as well. By decomposing work into explicit phases, you can route simpler tasks to cheaper model tiers and reserve expensive reasoning for architecture, synthesis, and conflict resolution rather than paying premium rates for every token in the loop.

Human Roles

Human involvement is not a sign that the pipeline is unfinished. It is where system quality actually comes from.

In short: the machine accelerates structured work, but the human remains accountable for engineering judgment.

Security and Drift Control

In this operating model, security and drift control are tightly linked. A system that cannot explain why a file exists, why a behavior changed, or where authority came from is both a process problem and a security problem.

The shortest summary is this: a safe agentic pipeline is not one that “does more.” It is one that fails visibly, explains itself, and refuses to outrun its own specification.