What to Expect From Me
The way an engineer designs, specifies, and operates complex systems matters as much as the systems themselves. This section exposes that process directly. Instead of describing capabilities in abstract terms, it shows the architecture, constraints, specifications, and control layers used to build real systems. For potential collaborators, employers, or clients, the goal is simple: evaluate the work itself. The materials below demonstrate how problems are structured, how specifications are written, and how systems are shipped. If the approach aligns with your needs, we should talk.
How to Read This Section
Most portfolios show finished artifacts. This page shows the work behind those artifacts. It documents how problems are framed, how constraints are locked, how blueprints and engineering specs are written, how CI/CD is used as an enforcement mechanism, how AI-assisted development is bounded, and where human inspection remains authoritative.
The intent is to remove ambiguity about what someone is actually hiring when they hire me. The output matters, but the process matters more. This page is meant for technical readers, operators, founders, and engineering leadership who want to inspect the system behind the artifact rather than stop at the artifact itself.
Important Notice
The material presented here reflects proprietary engineering processes and system design work. These processes, architectures, methodologies, and planning artifacts are intellectual property. All rights reserved.
This page is provided for evaluation purposes only to demonstrate engineering capability, architectural reasoning, system discipline, and execution quality.
Case 01 — Thesis Chain AI DevKit
This case study examines the engineering process behind the Thesis Chain AI DevKit. The DevKit exists to safely integrate AI-assisted development into production-grade engineering workflows while controlling cost, behavior, nondeterminism, and security risk.
Rather than treating AI as an autonomous authority, the DevKit treats model output as untrusted input. Guardrails, validation layers, budget controls, policy checks, and human inspection surround the model so the engineering system remains predictable, auditable, and reviewable.
Problem Definition
Modern AI models can accelerate engineering work dramatically, but naive integration introduces severe risk: prompt injection, uncontrolled costs, nondeterministic output, accidental data disclosure, weak reviewability, and silent drift in system behavior.
Most AI-assisted development tooling assumes the model can be trusted to generate correct or safe output. In practice this assumption fails often enough to make an unconstrained approach unacceptable in serious engineering environments.
The engineering problem addressed by the Thesis Chain AI DevKit is therefore:
How can AI-assisted development be integrated into real engineering workflows while maintaining deterministic control, cost discipline, bounded authority, and meaningful security guarantees?
Engineering Constraints
Before architecture begins, the system must operate under explicit constraints. These constraints shape every architectural decision and prevent design drift.
- Token Budget Control: AI usage must operate under strict token ceilings. Models are selected intentionally by task class to avoid unnecessary cost.
- Deterministic Behavior: Outputs must be inspectable and reproducible wherever possible. AI responses are never treated as authoritative system state.
- Guardrail-First Architecture: Every model interaction must pass through validation layers including prompt screening, context restriction, redaction, schema validation, and disposition control.
- Model Stratification: Different models are used for different classes of work. Expensive models are reserved for tasks that genuinely require deeper synthesis. Lower-cost models handle mechanical or narrow work.
- Fail-Closed System Behavior: If validation fails or a guardrail triggers, the system rejects the output rather than attempting to recover silently.
- Human Inspection Authority: Human engineers retain final authority over merges, deployments, architecture, and policy changes.
- Security Isolation: Sensitive information and secrets must never enter model context. The system must operate under the assumption that model output could be malicious, confused, or incorrect.
Blueprint Architecture
The blueprint phase exists to lock system intent before implementation begins. Its purpose is not to describe code. Its purpose is to define the operational shape of the system: what the system must do, what it must never do, how risk is bounded, where authority resides, how inputs move, and what acceptance looks like before implementation starts.
For the Thesis Chain AI DevKit, the blueprint establishes a guardrail-first architecture. The model is never placed at the center of the system. Instead, the model is wrapped inside a deterministic control envelope that constrains what context may be passed in, how requests are formed, how outputs are parsed, and what conditions cause the system to reject the response.
At blueprint level, the architecture is divided into ordered layers rather than loose feature ideas. That matters because order determines safety. Cheap and deterministic checks execute first. Expensive and probabilistic work executes later, only after the input has been reduced, normalized, screened, and validated.
- Input Boundary: Raw source content, prompts, instructions, repository diffs, and execution context are treated as separate classes of input with different trust levels.
- Redaction and Sanitization Layer: Secret-bearing content, irrelevant data, and structurally dangerous prompt material are removed or transformed before a provider call is even possible.
- Context Minimization Layer: Only the minimum useful context should move forward. This prevents whole-repo dumping, wasted spend, and low-signal prompts.
- Budget and Routing Layer: The system decides whether a task deserves an AI call at all, and if it does, which model tier should receive it.
- Provider Abstraction Layer: Providers are execution surfaces, not sources of truth. Core engineering logic is not coupled to one vendor.
- Schema and Validation Layer: Output must fit a declared contract. If parsing fails or required structure is absent, the result is rejected.
- Decision Boundary Layer: Even valid model output does not become authority automatically. The system classifies it as blocked, advisory, review-required, or safe to surface.
- Audit and Replay Layer: Every meaningful run should be inspectable after the fact. Useful engineering systems must be reviewable, explainable, and diagnosable under failure.
Canonical Blueprint Markdown
The following appendix is mirrored locally from the AI DevKit source material and displayed here as canonical markdown.
The Thesis Chain AI DevKit — Blueprint
Version: 1.0.0
Status: Canonical Blueprint
Project: the-thesis-chain-ai-devkit
Document Type: System Blueprint
Primary Audience: Engineering leadership, platform engineers, security reviewers, implementation engineers
Authoring Intent: Define the operational architecture, trust boundaries, guardrails, authority model, and implementation shape for a safe AI-assisted engineering system.
1. Purpose
The Thesis Chain AI DevKit exists to integrate AI-assisted development into real engineering workflows without giving model output uncontrolled authority over code, repository state, infrastructure, or policy.
The system is designed around a simple premise:
AI output is useful, but untrusted.
The DevKit therefore does not treat the model as a builder with implicit authority. It treats the model as an external probabilistic subsystem wrapped inside deterministic engineering controls. The value of the system comes from how inputs are reduced, how context is bounded, how outputs are validated, how budget is controlled, how risk is isolated, and where human authority is retained.
This project is not a chatbot wrapper. It is an engineering control framework for structured, auditable, bounded AI-assisted workflows.
2. Problem Statement
Modern model providers can accelerate review, synthesis, linting, threat sketching, and ambiguity detection. However, naive adoption creates a compound engineering risk surface:
- unbounded token spend
- accidental data disclosure
- prompt injection through repository text
- nondeterministic output treated as truth
- silent workflow drift
- provider coupling
- weak auditability
- unclear merge authority
- inappropriate use of write-capable automation
The actual engineering problem is:
How can AI-assisted engineering workflows produce useful structured output while preserving deterministic safety, bounded cost, auditability, and human control?
This blueprint answers that question at architecture level.
3. Design Position
3.1 What AI is allowed to be
AI may act as:
- a reviewer
- a synthesizer
- a contradiction detector
- an ambiguity finder
- a threat-category sketcher
- a structured advisory instrument
3.2 What AI is not allowed to be
AI is not:
- a source of truth
- an autonomous merger
- a deployment authority
- a secrets-bearing execution surface
- a repository-wide reader by default
- a policy mutator
- a privileged system actor
3.3 Core architectural stance
The system is guardrail-first, fail-closed, and authority-constrained.
The model sits inside a layered deterministic envelope. The envelope, not the model, is the system.
4. Non-Negotiable Constraints
Before implementation, the following constraints are locked.
4.1 Bounded authority
AI output may be rendered, scored, cached, audited, and surfaced for review, but it may not directly merge code, deploy infrastructure, rotate secrets, or mutate policy without explicit human approval.
4.2 Diff-limited context
The system must operate on narrowed, task-relevant, allowlisted context. Whole-repo dumping is prohibited by design.
4.3 Redaction before provider access
Redaction and path filtering occur before any provider call is possible.
4.4 Strict schema at boundaries
Model output must be parsed into declared structure. If parsing fails, the system rejects the result.
4.5 Fail-closed behavior
Validation, policy, or budget failure must produce rejection rather than silent degradation.
4.6 Deterministic gates remain authoritative
Deterministic checks keep final authority. AI output is advisory even when structurally valid.
4.7 Provider abstraction
Core system logic may not be tightly coupled to a single model vendor.
4.8 Full run traceability
Meaningful executions must emit auditable artifacts sufficient for replay, diagnosis, and review.
5. System Goals
The DevKit is intended to provide the following outcomes.
- Increase engineering leverage on review-heavy work.
- Reduce ambiguity and contradiction in specs, diffs, and architectural material.
- Bound the safety and cost risks of model usage.
- Produce repeatable structured outputs.
- Preserve explainability and post-run auditability.
- Support both local and GitHub-mediated workflows.
- Remain useful even when provider integrations are stubbed or offline.
6. Out of Scope
The following are explicitly out of scope for this version.
- autonomous code merge
- autonomous deployment
- autonomous policy modification
- secret retrieval from protected systems
- unrestricted repo ingestion
- write-capable agent swarms
- unsupervised multi-step tool execution against production systems
- treating schema-valid output as semantically correct by default
7. Operational Model
The DevKit is organized as a layered pipeline.
7.1 Layer 0 — Input boundary
Inputs enter as typed engineering artifacts:
- repository reference
- pull request reference
- diff summary
- changed files
- prompt template version
- task class
- runtime policy
- optional provider configuration
All inputs are assigned trust levels.
7.2 Layer 1 — Path policy and context eligibility
Files are filtered through allow/deny policy. Sensitive directories and structurally dangerous paths are excluded from model context.
7.3 Layer 2 — Redaction and sanitization
Eligible content is passed through redaction rules to suppress obvious secret and PII patterns and to reduce accidental disclosure.
7.4 Layer 3 — Prompt injection preflight
Repository text, diffs, and instructions are screened for prompt injection patterns. Safety mode accepts false positives over false negatives.
7.5 Layer 4 — Context minimization
Only the minimum useful diff and file content move forward. The system reduces low-signal input before any expensive operation.
7.6 Layer 5 — Budget and routing
The system decides whether the task deserves an AI call at all, and if so, what model class should receive it.
7.7 Layer 6 — Provider execution
Providers are treated as external execution surfaces. Their output is raw material, not authority.
7.8 Layer 7 — Parse and schema validation
Response text must parse to valid structured output. Invalid output is rejected.
7.9 Layer 8 — Decision boundary
A valid report is still classified as advisory. It may be rendered to markdown, attached to a PR, cached, audited, or flagged for manual review.
7.10 Layer 9 — Audit, metrics, replay
The run emits enough metadata to reconstruct what happened without trusting memory or provider logs alone.
8. High-Level Architecture
8.1 Principal subsystems
Policy subsystem
- allow paths
- deny paths
- strict schema enforcement
- prompt injection guard enablement
- budget limits
- model selection defaults
Context control subsystem
- changed-file assembly
- diff summary ingestion
- size reduction
- path gating
- content shaping
Safety subsystem
- redaction
- prompt injection heuristics
- fail-closed validation
Provider abstraction subsystem
- provider interface
- stub provider
- future provider adapters
Schema boundary subsystem
- output contract
- parse failure handling
- structure validation
Audit subsystem
- request event
- response event
- error event
- hashes and token usage
Cache subsystem
- deterministic keying
- TTL-based storage
- duplicate-spend prevention
Agent subsystem
- task-specific templates
- structured report generation
- agent versioning
Runner subsystem
- local runner
- GitHub Actions runner
- GitHub App / webhook architecture
9. Agent Model
Agents in this system are not autonomous personas. They are typed task modules with fixed contracts.
Each agent must define:
- an agent name
- an agent version
- a prompt template
- constraints
- an output schema
- a deterministic validation boundary
- a rendering target
Example task classes supported by the current architecture include:
- specification linting
- PR synthesis
- threat sketching
The architectural rule is that an agent is not defined by a clever prompt. It is defined by a prompt-plus-contract-plus-boundary package.
10. Trust Boundaries
This system has several hard trust boundaries.
10.1 Repository text is untrusted
Pull request content, spec text, comments, and changed files may contain adversarial instructions.
10.2 Model provider is external
Provider calls move data beyond the local boundary. Context must be reduced before crossing that line.
10.3 Model output is untrusted
Even well-formed output may be wrong, incomplete, or subtly misleading.
10.4 Human reviewers remain authoritative
Human approval is the boundary at which advisory output may influence actual engineering decisions.
11. Safety Architecture
11.1 Prompt injection resistance
The system uses conservative preflight heuristics to reject obvious attempts to override role, reveal secrets, or alter instructions.
11.2 Path isolation
The system denies unsafe path classes by default and only sends allowlisted engineering material.
11.3 Secret and PII redaction
Sensitive patterns are removed or masked before request assembly.
11.4 Schema-gated output
Only output that fits the declared report structure is accepted into downstream systems.
11.5 Read-only default integration
Integrations should default to read-only scope with comment-only feedback unless explicitly elevated.
11.6 Human-held merge authority
No report, score, or advisory comment is permitted to stand in for merge authority.
12. Budget and Cost Control Model
The DevKit treats cost as a first-class systems problem.
12.1 Budget primitives
For a run r:
calls(r) = number of provider calls
Tin(r) = total input tokens
Tout(r) = total output tokens
The budget envelope is:
calls(r) <= C_max
Tin(r) <= I_max
Tout(r) <= O_max
The run is rejected when any inequality is violated.
12.2 Cost equation
For provider pricing:
alpha = cost per input token
beta = cost per output token
Then expected run cost is:
Cost(r) = alpha * Tin(r) + beta * Tout(r)
System-level budget discipline requires that expected spend be bounded before scale is allowed.
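The cost equation above can be computed directly from run usage. A minimal sketch follows; the token counts and the alpha/beta rates are illustrative values only, not real provider pricing.

```typescript
// Token usage observed for a single run r.
interface RunUsage {
  inputTokens: number;  // Tin(r)
  outputTokens: number; // Tout(r)
}

// Cost(r) = alpha * Tin(r) + beta * Tout(r)
function runCost(usage: RunUsage, alpha: number, beta: number): number {
  return alpha * usage.inputTokens + beta * usage.outputTokens;
}

// Hypothetical per-token rates, chosen only to make the arithmetic visible.
const cost = runCost({ inputTokens: 4000, outputTokens: 1000 }, 0.000003, 0.000015);
// 4000 * 0.000003 + 1000 * 0.000015 ≈ 0.027
```

Bounding expected spend before scale then reduces to bounding Tin and Tout per run and the number of runs.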
12.3 Caching principle
Repeated calls on equivalent prompt and context should not re-spend budget.
A canonical cache key shape is:
K = H(provider || model || prompt_version || prompt_hash || context_hash || policy_version)
Where H() is a collision-resistant digest.
13. Auditability Model
Every meaningful run should emit structured audit events.
At minimum, the system records:
- request id
- provider
- model
- prompt hash
- context hash
- output hash
- timestamp
- token usage
- error state, if any
This allows operators to answer:
- what was asked
- what input class was sent
- what provider/model handled it
- whether the output was cached
- whether the output validated
- what it cost
- what failed if the run was rejected
Audit exists to support diagnosis, governance, and trust.
14. GitHub Integration Model
The DevKit supports two primary integration modes.
14.1 CI-driven mode
A GitHub Action runs on PR events, assembles eligible context, executes the advisory pipeline, and posts structured review comments.
14.2 App-driven mode
A webhook service verifies GitHub signatures, mints installation tokens, fetches changed files, runs the advisory pipeline, and posts PR comments or check runs.
The blueprint preference is:
- read-only by default
- no content mutation by default
- comment/check-run surfaces preferred over write surfaces
- deterministic verification before any pipeline execution
15. Human Roles
The system explicitly retains human authority in the following roles.
15.1 Architect
Defines the allowed shape of the system, agent classes, boundaries, and non-negotiables.
15.2 Security reviewer
Owns threat posture, path policy, redaction strategy, integration scope, and escalation policy.
15.3 Implementation engineer
Builds adapters, runners, validators, and renderers against the blueprint and spec.
15.4 Reviewer / operator
Interprets advisory output, checks evidence, and decides whether action is warranted.
15.5 Release authority
Retains final authority for merges, deployment, and policy change.
16. Acceptance Criteria
The blueprint is considered implemented correctly when the system can demonstrably do the following:
- accept diff-limited engineering context
- reject disallowed paths before provider access
- redact obvious secrets and PII before request creation
- detect and block obvious prompt injection patterns
- assemble versioned prompt envelopes
- enforce hard token/call budgets
- cache equivalent requests deterministically
- parse and schema-validate response structure
- emit auditable request/response/error events
- surface advisory reports without granting write authority
- support both local and GitHub-oriented execution paths
- fail closed on malformed output or policy violation
17. Failure Philosophy
The DevKit is intentionally conservative.
When uncertain, it should:
- reduce context
- reject unsafe paths
- block suspicious instructions
- refuse malformed output
- mark uncertainty explicitly
- escalate to human review
The preferred failure mode is lost convenience, not silent compromise.
18. Future Evolution
The architecture permits future additions, but only within the same control posture.
Possible later extensions include:
- stronger schema validators
- scored evidence confidence
- richer path-policy classes
- provider multiplexing
- offline replay tooling
- diff chunking for large PRs
- policy version pinning
- richer evaluation harnesses
- more agent classes
These are valid only if they preserve the current authority model: deterministic controls first, advisory AI second.
19. Blueprint Summary
The Thesis Chain AI DevKit is a control architecture for AI-assisted engineering, not an AI-first automation toy.
Its core principles are:
- AI remains untrusted
- deterministic boundaries remain authoritative
- context is minimized before exposure
- cost is bounded
- outputs are schema-gated
- audit is mandatory
- write authority is withheld by default
- humans retain final control
That is the system this blueprint defines.
Engineering Specifications
If the blueprint defines intent, the engineering specification defines execution. This is where high-level architectural ideas are converted into a buildable, inspectable, and testable system. In my process, the engineering spec is not a light outline. It is the document that removes ambiguity from implementation.
The engineering spec for an AI-assisted development system must answer several questions explicitly:
- What modules exist, and what are their exact responsibilities?
- What data enters and leaves each boundary?
- What conditions are blocking conditions versus warning conditions?
- Where does the system fail closed?
- What is human-reviewed, and what is machine-validated?
- How are token budgets measured, enforced, and audited?
- How are outputs replayed, inspected, and compared?
For the Thesis Chain AI DevKit, the engineering spec acts as a discipline document. It translates “AI should help here” into precise, enforceable behavior.
1. Module Boundaries
The spec separates the system into modules with narrow responsibilities: input preparation, sanitization, routing, provider calls, parsing, validation, budget accounting, result classification, and human inspection. If a module cannot be named and bounded, it is not ready to be implemented.
2. Ordered Guardrails
Guardrails are fixed in sequence. They are not optional helpers. They are part of the main execution path.
3. Output Contracts
The spec defines what a valid response looks like. Structured output contracts reduce hidden interpretation costs and unstable downstream behavior.
4. Failure Semantics
The spec identifies when the system must stop. A malformed response, budget breach, unsafe context match, or policy violation should terminate the path and surface a visible failure state.
5. Token and Cost Discipline
Work is divided into classes: mechanical, evaluative, synthesis-heavy, and ambiguous. These classes map to different model tiers and different budget thresholds.
6. Inspection Requirements
The spec defines what must be visible to a human reviewer: prompt class, sanitized input summary, chosen model tier, token consumption, validation results, classification outcome, and final disposition.
7. Non-Negotiables
The strongest specs contain non-negotiables that implementation is not allowed to reinterpret: no hidden globals, no silent fallback behavior, no speculative scope expansion, no unbounded model calls, and no accepting model output as trusted state without validation and review.
Canonical Engineering Spec Markdown
The following appendix is mirrored locally from the AI DevKit source material and displayed here as canonical markdown.
The Thesis Chain AI DevKit — Engineering Specification
Version: 1.0.0
Status: Canonical Engineering Specification
Project: the-thesis-chain-ai-devkit
Document Type: Engineering Specification
Primary Audience: Implementation engineers, reviewers, maintainers, CI/CD operators
Depends On: the-thesis-chain-ai-devkit-blueprint-1-0-0.md
1. Specification Intent
This engineering specification defines the concrete implementation contract for the Thesis Chain AI DevKit.
It exists to translate blueprint-level architectural intent into:
- module boundaries
- runtime data contracts
- algorithmic flow
- validation rules
- budget equations
- cache semantics
- audit event structure
- runner behavior
- GitHub integration behavior
- acceptance tests
This spec is written so an implementation engineer can build or extend the system without guessing.
2. System Summary
The DevKit is a provider-agnostic, schema-gated, guardrail-first framework for AI-assisted engineering workflows.
At runtime, the system:
- receives a task-specific request
- filters context by policy
- redacts content
- screens for prompt injection
- assembles a prompt envelope
- computes deterministic hashes
- checks cache
- enforces budget
- calls a provider adapter
- parses and validates response structure
- records audit events
- returns an advisory report to a runner
The implementation must preserve that order.
3. Repository-Level Module Topology
3.1 Required top-level module groups
src/core/
- types
- policy
- redaction
- injection guards
- schema validation
- LLM client
- audit
- cache
- prompt templates
- shared utilities
src/adapters/
- provider adapter interface
- provider implementations or stubs
src/agents/
- typed agent runners for fixed task classes
src/runners/
- local execution path
- GitHub-oriented execution path
docs/
- architectural and operational documentation
.github/workflows/
- CI demonstration or integration flows
4. Data Contracts
4.1 Severity
Allowed values:
- info
- warn
- high
4.2 Category
Allowed values:
- structure
- invariant
- threat
- diff
- test
4.3 Finding
A finding is a typed advisory unit.
type Finding = {
id: string;
severity: 'info' | 'warn' | 'high';
category: 'structure' | 'invariant' | 'threat' | 'diff' | 'test';
claim: string;
evidence_refs: string[];
suggested_action?: string;
};
4.4 Report
The report is the canonical accepted AI output structure.
type Report = {
agent: string;
version: string;
input_hash: string;
output_hash: string;
findings: Finding[];
notes?: string[];
};
4.5 FileBlob
type FileBlob = {
path: string;
content: string;
};
4.6 AgentContext
type AgentContext = {
repo: { owner: string; name: string };
pr?: { number: number; headSha: string };
diffSummary: string;
changedFiles: FileBlob[];
promptVersion: string;
};
4.7 ModelSpec
type ModelSpec = {
provider: 'stub' | 'openai' | 'azure_openai' | 'anthropic' | 'vertex';
model: string;
temperature: number;
maxOutputTokens: number;
};
4.8 Budget
type Budget = {
maxCalls: number;
maxTotalInputTokens: number;
maxTotalOutputTokens: number;
};
4.9 LLMRequest
type LLMRequest = {
requestId: string;
system: string;
task: string;
constraints: readonly string[];
outputSchema: JSONSchemaLike;
model: ModelSpec;
context: {
diffSummary: string;
files: FileBlob[];
};
sampling?: {
top_p?: number;
seed?: number;
};
};
4.10 LLMResponse
type LLMResponse = {
requestId: string;
provider: LLMProvider;
model: string;
rawText: string;
parsed: Report;
usage: {
inputTokens: number;
outputTokens: number;
};
audit: {
promptHash: string;
contextHash: string;
outputHash: string;
timestampMs: number;
};
};
5. Policy Contract
5.1 Policy structure
The system policy must declare:
- allowPaths
- denyPaths
- budget
- model
- strictSchema
- promptInjectionGuard
Example contract:
type Policy = {
allowPaths: string[];
denyPaths: string[];
budget: Budget;
model: ModelSpec;
strictSchema: true;
promptInjectionGuard: true;
};
5.2 Path evaluation rule
A path is eligible iff:
- it does not match any deny prefix
- it does match at least one allow prefix
Formally, for path p:
eligible(p) = (forall d in D : not startsWith(p, d)) and (exists a in A : startsWith(p, a))
Where:
D = deny path set
A = allow path set
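The eligibility predicate transcribes directly into code. The allow/deny prefixes below are hypothetical policy values for illustration only.

```typescript
// eligible(p) = (forall d in D: !startsWith(p, d)) and (exists a in A: startsWith(p, a))
// Deny prefixes always win; a path must also match at least one allow prefix.
function eligible(path: string, allow: string[], deny: string[]): boolean {
  return !deny.some((d) => path.startsWith(d)) && allow.some((a) => path.startsWith(a));
}

// Hypothetical policy prefixes, for illustration only.
const allow = ["src/", "docs/"];
const deny = ["src/secrets/", ".env"];
```

Note that a path under an allowed prefix is still rejected if any deny prefix matches, which keeps the default posture conservative.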
5.3 Default posture
The default policy must remain conservative and read-only in operational effect.
6. Request Lifecycle
6.1 Required order of execution
The system shall process each request in this exact logical order:
- accept typed request
- apply redaction
- run prompt injection preflight
- build prompt
- hash prompt and context
- check cache
- enforce budget
- record request audit event
- call provider
- parse response
- validate response schema
- increment budget counters
- compute output hash
- record response audit event
- write cache entry
- return structured response
This order is not optional. Rearranging it weakens safety or observability.
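The fail-closed property of this ordering can be sketched with a trivial driver. The step bodies below are placeholders, and the preflight is deliberately rigged to fail so the rejection path is visible; this is not the real implementation.

```typescript
// A minimal fail-closed pipeline driver: steps execute in declared order,
// and any step that throws aborts the run without attempting recovery.
type State = { log: string[] };
type Step = { name: string; run: (s: State) => void };

const steps: Step[] = [
  { name: "accept typed request", run: (s) => s.log.push("accept") },
  { name: "apply redaction", run: (s) => s.log.push("redact") },
  {
    name: "prompt injection preflight",
    run: (s) => {
      s.log.push("preflight");
      throw new Error("injection pattern detected"); // simulated guard trigger
    },
  },
  { name: "call provider", run: (s) => s.log.push("call") }, // never reached
];

function runPipeline(all: Step[]): { ok: boolean; failedAt?: string; log: string[] } {
  const state: State = { log: [] };
  for (const step of all) {
    try {
      step.run(state);
    } catch {
      // Reject the run; later steps (including the provider call) do not execute.
      return { ok: false, failedAt: step.name, log: state.log };
    }
  }
  return { ok: true, log: state.log };
}

const result = runPipeline(steps);
```

Because the guard runs before the provider step, a triggered guard means no provider call, no spend, and no partial output.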
7. Context Reduction Requirements
7.1 Context assembly
Only changed files relevant to the current task may be included.
7.2 Context size discipline
The system must avoid whole-repo context assembly. Input is restricted to:
- diff summary
- selected changed files
- fixed prompt template material
- fixed constraints
7.3 Exclusion rules
Files matching deny policy shall never be passed to a provider.
7.4 Context objective
The context subsystem is optimized for signal density, not completeness.
8. Redaction Requirements
8.1 Redaction timing
Redaction must occur before cache-key generation for provider-bound prompt content and before provider invocation.
8.2 Minimum baseline patterns
The implementation must support rule-based redaction of:
- obvious API-key-like tokens
- email addresses
- later extensible secret patterns
8.3 Redaction function
For text blob x and rule set R = {r_1, r_2, ..., r_n}:
Redact(x, R) = r_n(...r_2(r_1(x)))
Where each r_i is a pattern substitution function.
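The composition Redact(x, R) can be sketched as a fold over an ordered rule list. The two patterns below are a hypothetical subset; a real rule set would be broader.

```typescript
// Each rule r_i is a substitution function; redact applies them in order:
// Redact(x, R) = r_n(...r_2(r_1(x)))
type RedactionRule = (text: string) => string;

const rules: RedactionRule[] = [
  // Hypothetical patterns for illustration only.
  (t) => t.replace(/sk-[A-Za-z0-9]{8,}/g, "[REDACTED_KEY]"),
  (t) => t.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[REDACTED_EMAIL]"),
];

function redact(text: string, ruleSet: RedactionRule[]): string {
  return ruleSet.reduce((acc, rule) => rule(acc), text);
}

const out = redact("contact dev@example.com, key sk-abcdef123456", rules);
```

Rule order matters because later rules see the output of earlier ones, which is why the rule set is an ordered list rather than an unordered collection.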
8.4 Redaction philosophy
The redaction subsystem is deliberately conservative. False positives are acceptable if they reduce accidental disclosure.
9. Prompt Injection Guard Requirements
9.1 Guard timing
Prompt injection screening must run after redaction and before provider invocation.
9.2 Heuristic scope
The system must reject obvious adversarial prompt constructs such as:
- instruction override attempts
- role-spoof labels
- secret-exfiltration requests
- provider-key disclosure language
9.3 Safety mode
The guard should prefer false positive rejection over permissive acceptance.
9.4 Failure behavior
A triggered guard produces immediate request rejection.
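A minimal sketch of the preflight, assuming a small hypothetical pattern list; real heuristics would be larger and tuned further toward false positives.

```typescript
// Conservative preflight: any heuristic hit rejects the request outright.
// These patterns are an illustrative subset, not a complete rule set.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?previous instructions/i,
  /you are now\b/i,
  /reveal (your )?(system prompt|secrets|api key)/i,
];

function preflight(text: string): { allowed: boolean; reason?: string } {
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(text)) {
      // Fail-closed: prefer false-positive rejection over permissive acceptance.
      return { allowed: false, reason: `matched /${pattern.source}/` };
    }
  }
  return { allowed: true };
}
```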
10. Prompt Envelope Construction
10.1 Required sections
The prompt envelope shall be assembled in explicit labeled sections:
- SYSTEM
- TASK
- CONSTRAINTS
- OUTPUT_SCHEMA
- CONTEXT_DIFF_SUMMARY
- CONTEXT_FILES
10.2 Section purpose
This labeling exists to reduce ambiguity, constrain prompt shape, and make prompt assembly auditable.
10.3 Prompt template versioning
Every prompt template must include:
- id
- version
- system
- task
- constraints
- outputSchema
Template version changes are behavioral changes and must be traceable.
11. Hashing and Cache Semantics
11.1 Prompt hash
Let P be the final assembled prompt string. Then:
promptHash = H(P)
11.2 Context hash
For diff summary S and files F = {(p_i, c_i)}:
contextHash = H(S || join_i(p_i || ":" || H(c_i)))
11.3 Cache key
A canonical cache key shall include:
- policy namespace or equivalent
- provider
- model
- prompt hash
- context hash
Example:
cacheKey = "aidev:" || provider || ":" || model || ":" || promptHash || ":" || contextHash
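Assuming H() is SHA-256 (the spec only requires a collision-resistant digest), the hash and key constructions above can be sketched as:

```typescript
import { createHash } from "node:crypto";

// H() as a SHA-256 hex digest; one valid choice of collision-resistant hash.
const h = (s: string): string => createHash("sha256").update(s).digest("hex");

// contextHash = H(S || join_i(p_i || ":" || H(c_i)))
function contextHash(diffSummary: string, files: { path: string; content: string }[]): string {
  return h(diffSummary + files.map((f) => f.path + ":" + h(f.content)).join(""));
}

// cacheKey = "aidev:" || provider || ":" || model || ":" || promptHash || ":" || contextHash
function cacheKey(provider: string, model: string, promptHash: string, ctxHash: string): string {
  return ["aidev", provider, model, promptHash, ctxHash].join(":");
}
```

Because the key is a pure function of provider, model, prompt, and context, two equivalent requests collide on the same cache entry and the second one spends nothing.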
11.4 Cache objective
Caching exists to prevent repeated spend on semantically equivalent work.
11.5 Cache store requirement
The cache interface must support:
- get(key)
- set(key, value, ttlSeconds)
The reference implementation may be in-memory. Production implementations may use external stores.
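An in-memory reference implementation of that interface might look like the following sketch; the injectable clock is an assumption added here so expiry is testable deterministically.

```typescript
// Minimal in-memory TTL cache exposing the required get/set surface.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAtMs: number }>();

  constructor(private now: () => number = Date.now) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAtMs) {
      this.store.delete(key); // lazy expiry on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, ttlSeconds: number): void {
    this.store.set(key, { value, expiresAtMs: this.now() + ttlSeconds * 1000 });
  }
}
```

A production store (Redis or similar) would replace this class behind the same two-method surface.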
12. Budget Enforcement
12.1 Runtime counters
For a process-local runtime:
c = calls made
ti = cumulative input tokens
to = cumulative output tokens
12.2 Enforcement predicates
A request is permitted iff:
c < C_max
ti < I_max
to < O_max
If any predicate fails, the run must reject with an explicit budget error.
12.3 Budget enforcement timing
Budget checks occur before provider invocation.
12.4 Increment semantics
Counters are incremented only after a provider response is received.
12.5 Operational note
Process-local counters are sufficient for local/demo runs. Shared production environments may require durable or distributed budget state.
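A process-local sketch of the counters and predicates: permit() is evaluated before provider invocation and record() only after a response arrives, matching the timing and increment rules stated above.

```typescript
type Budget = {
  maxCalls: number;
  maxTotalInputTokens: number;
  maxTotalOutputTokens: number;
};

// Process-local budget state. permit() implements the enforcement predicates
// (c < C_max, ti < I_max, to < O_max); record() increments counters only
// after a provider response has been received.
class BudgetTracker {
  private calls = 0;
  private inputTokens = 0;
  private outputTokens = 0;

  constructor(private budget: Budget) {}

  permit(): boolean {
    return (
      this.calls < this.budget.maxCalls &&
      this.inputTokens < this.budget.maxTotalInputTokens &&
      this.outputTokens < this.budget.maxTotalOutputTokens
    );
  }

  record(usage: { inputTokens: number; outputTokens: number }): void {
    this.calls += 1;
    this.inputTokens += usage.inputTokens;
    this.outputTokens += usage.outputTokens;
  }
}
```

A distributed deployment would swap the private counters for durable shared state behind the same permit/record surface.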
13. Provider Adapter Contract
13.1 Provider adapter purpose
The provider adapter isolates model-vendor specifics from core pipeline logic.
13.2 Minimum interface
The adapter must expose a call surface equivalent to:
interface ProviderAdapter {
provider: LLMProvider;
call(
req: LLMRequest,
prompt: string,
): Promise<{
provider: LLMProvider;
model: string;
rawText: string;
usage: { inputTokens: number; outputTokens: number };
}>;
}
13.3 Stub provider
A stub provider shall be supported for:
- public skeletons
- offline demos
- deterministic test harnesses
- safe CI demonstrations
13.4 Provider principle
The provider is replaceable. Core safety posture may not depend on proprietary provider behavior.
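A stub provider satisfying the adapter contract in 13.2 might look like the sketch below. The `LLMProvider` and `LLMRequest` shapes are simplified stand-ins for types the spec defines elsewhere, and the canned response and token heuristic are illustrative.

```typescript
type LLMProvider = "stub";
interface LLMRequest {
  model: { name: string };
}

interface ProviderAdapter {
  provider: LLMProvider;
  call(
    req: LLMRequest,
    prompt: string,
  ): Promise<{
    provider: LLMProvider;
    model: string;
    rawText: string;
    usage: { inputTokens: number; outputTokens: number };
  }>;
}

const stubProvider: ProviderAdapter = {
  provider: "stub",
  async call(req, prompt) {
    // Deterministic canned response: safe for public skeletons, offline
    // demos, deterministic test harnesses, and CI demonstrations.
    return {
      provider: "stub",
      model: req.model.name,
      rawText: JSON.stringify({ findings: [], summary: "stub response" }),
      usage: { inputTokens: Math.ceil(prompt.length / 4), outputTokens: 8 },
    };
  },
};
```

Because the adapter boundary is the only vendor-facing surface, swapping this stub for a real provider leaves the core pipeline untouched.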
14. Schema Validation Boundary
14.1 Boundary definition
The schema boundary is the point where raw model text may become acceptable structured input.
14.2 Required behavior
The system must:
- parse raw text as JSON
- validate the resulting object as a Report
- reject malformed or invalid output
14.3 Structural validity vs correctness
Schema validity only means structure is acceptable. It does not certify truth, completeness, or sound reasoning.
14.4 Failure mode
Invalid JSON or invalid report structure must terminate the request as failure.
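The boundary in 14.1–14.4 can be sketched as a single parse-and-validate step. The `Report` shape below is an illustrative placeholder, not the spec's actual schema, and a production implementation would likely use a schema library rather than hand-written checks.

```typescript
// Illustrative Report shape; the real schema is defined by the spec elsewhere.
interface Report {
  findings: Array<{ title: string; severity: string }>;
  summary: string;
}

function parseAndValidate(rawText: string): Report {
  let obj: unknown;
  try {
    // Invalid JSON terminates the request as failure (14.4).
    obj = JSON.parse(rawText);
  } catch {
    throw new Error("schema boundary: response is not valid JSON");
  }
  const r = obj as Partial<Report>;
  const structurallyValid =
    typeof r === "object" &&
    r !== null &&
    Array.isArray(r.findings) &&
    r.findings.every(
      (f) => typeof f?.title === "string" && typeof f?.severity === "string",
    ) &&
    typeof r.summary === "string";
  if (!structurallyValid) {
    // Structural validity only; passing this check does not certify
    // truth, completeness, or sound reasoning (14.3).
    throw new Error("schema boundary: JSON does not match Report structure");
  }
  return r as Report;
}
```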
15. Audit Event Requirements
15.1 Event classes
At minimum, audit must support:
- llm_request
- llm_response
- llm_error
15.2 Minimum request event fields
- kind
- requestId
- timestampMs
- provider
- model
- promptHash
- contextHash
15.3 Minimum response event fields
- all request event fields
- outputHash
- usage
15.4 Minimum error event fields
- all request event fields where available
- error name
- error message
15.5 Structured emission
Audit events must be machine-ingestible, preferably JSON-structured.
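The three event classes and their minimum fields (15.1–15.4) can be sketched as a discriminated union emitted as JSON lines. The type names and the `errorName`/`errorMessage` field spellings are illustrative assumptions; the spec lists fields, not identifiers.

```typescript
interface AuditRequestEvent {
  kind: "llm_request";
  requestId: string;
  timestampMs: number;
  provider: string;
  model: string;
  promptHash: string;
  contextHash: string;
}

// Response events carry all request fields plus outputHash and usage (15.3).
type AuditResponseEvent = Omit<AuditRequestEvent, "kind"> & {
  kind: "llm_response";
  outputHash: string;
  usage: { inputTokens: number; outputTokens: number };
};

// Error events carry the request fields where available plus error details (15.4).
type AuditErrorEvent = Omit<AuditRequestEvent, "kind"> & {
  kind: "llm_error";
  errorName: string;
  errorMessage: string;
};

type AuditEvent = AuditRequestEvent | AuditResponseEvent | AuditErrorEvent;

// One JSON object per line keeps the stream machine-ingestible (15.5).
function emit(event: AuditEvent): string {
  return JSON.stringify(event);
}
```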
16. Agent Implementation Requirements
16.1 Agent contract
Each agent must:
- create an LLMRequest
- bind to a versioned template
- supply a concrete model spec
- pass typed context
- return a Report
16.2 Required current agent classes
- SpecLint
- PRSynthesis
- ThreatSketch
16.3 ThreatSketch special constraint
ThreatSketch must remain conceptual. It may classify risks and mitigations, but may not output exploitation steps.
16.4 Agent determinism rule
Agents may vary in prompt content and task definition, but not in core safety boundary behavior.
17. Runner Requirements
17.1 Local runner
The local runner must support demonstration execution using fixed example context and render advisory markdown.
17.2 GitHub runner
The GitHub runner must model or implement:
- webhook signature verification
- PR metadata extraction
- installation token acquisition or workflow-token use
- changed-file retrieval
- path eligibility filtering
- pipeline execution
- advisory PR comment rendering
17.3 GitHub safety requirement
The GitHub path must default to read-only review surfaces such as comments or checks. It must not imply merge authority.
18. GitHub App / Webhook Model
18.1 Signature verification
Webhook-driven operation requires deterministic verification of the GitHub signature before processing payload content.
18.2 Installation token minting
If operating as a GitHub App, installation tokens must be minted per installation and scoped minimally.
18.3 Changed-file fetching
Only PR files relevant to the advisory pipeline may be fetched.
18.4 Policy application
Fetched files must be filtered by policy prior to downstream use.
18.5 Comment rendering
Rendered comments should state clearly that the result is advisory and schema-gated, not authoritative.
19. Pseudocode
19.1 Core request pipeline
function INVOKE(req, policy, cache, audit, provider):
redactedReq = APPLY_REDACTION(req)
if policy.promptInjectionGuard == true:
ASSERT_NO_PROMPT_INJECTION(MATERIAL_FOR_GUARD(redactedReq))
prompt = BUILD_PROMPT(redactedReq)
promptHash = HASH(prompt)
contextHash = HASH_CONTEXT(redactedReq.context)
cacheKey = BUILD_CACHE_KEY(policy, redactedReq.model, promptHash, contextHash)
if cache exists:
hit = cache.get(cacheKey)
if hit exists:
return hit
ENFORCE_BUDGET(policy.budget)
audit.record(REQUEST_EVENT(...))
try:
raw = provider.call(redactedReq, prompt)
parsed = PARSE_AND_VALIDATE(raw.rawText, redactedReq.outputSchema)
UPDATE_RUNTIME_COUNTERS(raw.usage)
outputHash = HASH(JSON.stringify(parsed))
response = BUILD_RESPONSE(parsed, raw, promptHash, contextHash, outputHash)
audit.record(RESPONSE_EVENT(...))
if cache exists:
cache.set(cacheKey, response, ttlSeconds)
return response
catch err:
audit.record(ERROR_EVENT(...))
raise err
19.2 Path eligibility
function IS_ALLOWED_PATH(path, allowPaths, denyPaths):
for d in denyPaths:
if path startsWith d:
return false
for a in allowPaths:
if path startsWith a:
return true
return false
19.3 Agent runner pattern
function RUN_AGENT(agentTemplate, ctx):
req = {
requestId: BUILD_REQUEST_ID(agentTemplate, ctx),
system: agentTemplate.system,
task: agentTemplate.task,
constraints: agentTemplate.constraints,
outputSchema: agentTemplate.outputSchema,
model: SELECT_MODEL(agentTemplate),
context: {
diffSummary: ctx.diffSummary,
files: ctx.changedFiles
}
}
res = LLM_CLIENT.invoke(req)
return res.parsed
20. Evaluation and Metrics
20.1 Primary evaluation principle
The system must be evaluated by engineering outcomes, not token volume.
20.2 Suggested metrics
- reduction in human review time
- number of ambiguities caught before merge
- contradiction detection rate
- false positive rate
- structured output acceptance rate
- cache hit rate
- provider failure rate
- rejected unsafe-context rate
- budget-overrun frequency
- audit completeness rate
20.3 Quality lens
A system that spends fewer tokens but leaks secrets or produces unactionable noise is not successful.
21. Security Requirements
21.1 Secrets
Secrets must never be intentionally included in provider-bound prompt context.
21.2 PII
PII-bearing material must be excluded or redacted according to policy.
21.3 Write access
Write-capable automation must remain disabled unless explicitly approved and separately reviewed.
21.4 Supply chain
Dependencies used in CI or webhook execution should be minimal, pinned where appropriate, and reviewable.
21.5 Output treatment
Even validated output must remain advisory unless a separate deterministic control layer explicitly promotes a subset of behavior.
22. Failure Modes and Required Handling
22.1 Prompt injection guard triggered
Result: reject request, record error audit event.
22.2 Path not allowed
Result: exclude file or reject run depending on runner policy.
22.3 Redaction alters material significantly
Result: continue if structure remains usable; otherwise surface limited-result state.
22.4 Cache unavailable
Result: continue without cache if safety posture is preserved.
22.5 Budget exceeded
Result: reject before provider invocation.
22.6 Provider failure
Result: record error audit event and surface failure.
22.7 Invalid JSON
Result: reject response.
22.8 Schema mismatch
Result: reject response.
22.9 Audit sink failure
Preferred result: surface operational error; do not silently claim successful audit if audit failed.
23. Test Requirements
23.1 Unit tests
Minimum expected unit coverage should include:
- path policy evaluation
- redaction substitution
- prompt injection heuristics
- prompt assembly
- schema validation success/failure
- budget enforcement
- cache hit/miss behavior
- audit event formatting
23.2 Integration tests
Minimum expected integration coverage should include:
- local runner end-to-end with stub provider
- GitHub runner path filtering
- advisory comment rendering
- invalid response rejection path
23.3 Security-oriented tests
Minimum adversarial test cases should include:
- injected override strings in diffs
- secret-like material in changed files
- denylisted paths in PR file lists
- malformed JSON responses
- structurally valid but empty reports
24. CI/CD Expectations
24.1 CI role
CI is used to verify deterministic correctness around the DevKit itself, not to treat model output as a release authority.
24.2 CI checks
Expected checks include:
- formatting
- linting
- type checking
- unit tests
- integration tests where safe
- workflow syntax validation
24.3 Public skeleton safety
In public or demonstration contexts, provider calls should remain stubbed unless explicitly configured otherwise.
25. Acceptance Criteria
Implementation satisfies this spec when all of the following are true:
- typed requests can be constructed and executed
- policy-based path filtering works as specified
- redaction executes before provider call
- prompt injection screening can reject suspicious content
- prompt envelopes are assembled in labeled sections
- prompt and context hashes are generated deterministically
- cache hits bypass provider calls
- budget enforcement blocks overrun conditions
- provider adapters can be swapped without changing core logic
- invalid JSON responses are rejected
- invalid report structures are rejected
- audit events are emitted for request/response/error paths
- agents return structured reports
- local runner can produce advisory markdown
- GitHub runner can model or execute advisory PR workflow safely
- no code path grants implicit merge or deploy authority to AI output
26. Implementation Notes
26.1 Public skeleton vs production implementation
The current repository may use lightweight validators, in-memory cache, and stub provider surfaces. That is acceptable for the public skeleton. Production-hardening may replace those internals without changing the architectural contract defined here.
26.2 Behavioral invariants that must not drift
The following invariants are mandatory:
- AI output remains advisory
- deterministic validation remains authoritative
- provider access happens only after safety preflight
- schema failure rejects output
- budget is bounded
- path policy is enforced
- audit remains structured
- read-only is the default integration posture
27. Summary
This engineering specification defines an AI-assisted engineering framework that is useful precisely because it is constrained.
The system is not valuable when it is permissive. It is valuable when it is:
- structured
- bounded
- reviewable
- cheap enough to operate
- difficult to misuse
- explicit about authority
That is the implementation contract for the Thesis Chain AI DevKit.
CI/CD Integration
CI/CD is not just a deployment mechanism. In systems like this, CI/CD is part of the control surface. It enforces the difference between “interesting idea” and “repeatable engineering behavior.”
For an AI-assisted workflow, CI/CD must enforce at least four things:
- Deterministic execution paths
Cheap deterministic checks should run first and block unnecessary model calls. - Bounded permissions
CI jobs should default to read-only behavior, especially around repository state and merge authority. - Auditable artifacts
Outputs should be storable, reviewable, and attributable to a run context. - Version-locked automation
Actions, templates, schemas, and policies should be pinned so behavior does not drift silently.
In practice, this means the pipeline treats AI as a bounded advisory subsystem. It can inspect PR diffs, produce structured comments, and surface contradictions or risk, but it does not silently mutate production state.
The important point is architectural: CI/CD is where enforcement lives. If the rules are not enforced in the pipeline, then they are preferences, not controls.
Agentic Development Pipeline
This is the part most people misunderstand. Agentic development does not mean “use the most powerful model on everything.” It means divide work into classes, apply deterministic gates, route tasks to the cheapest sufficient capability, inspect aggressively, and preserve human authority over consequential decisions.
- Loop Control. Agent loops must be bounded. Maximum calls, maximum retries, maximum token budgets, and explicit stop conditions are part of the system contract.
- Task-Class Routing. Mechanical checks, narrow verification, contradiction detection, and low-ambiguity work should go to cheaper model tiers or deterministic tooling first. Higher-cost reasoning should be reserved for synthesis-heavy or ambiguous tasks.
- Inspection Before Escalation. The system should not escalate spend just because a model produced an answer. It should inspect quality, confidence, structure, and policy conformance before deciding whether more expensive reasoning is justified.
- Human-in-the-Loop as Authority. Human review is not an apology for the system. It is the authority boundary. Humans own interpretation, exception handling, merge authority, and architectural direction.
- Token Cost as Design Input. Token usage is not a dashboard vanity metric. It is an input into architectural choices. Model selection, prompt size, context shape, cache strategy, and retry policy all exist to prevent spending from becoming chaotic.
- Auditability Over Cleverness. A boring, inspectable loop is superior to a clever opaque loop. In practice, predictable bounded systems outperform magical-looking systems over time.
This is why I do not treat model choice as a status symbol. I treat it as routing policy. Different work deserves different tools. Better systems come from disciplined orchestration, not maximal model spend.
Human Inspection Roles
Human inspection remains central in any serious AI-assisted engineering system. The goal is not to remove humans from the loop. The goal is to remove low-value repetitive work while preserving human judgment where ambiguity, business context, risk, or architecture matter.
- Quality Control. Humans validate whether the output is actually useful, not merely well-formed.
- Architectural Arbitration. Humans decide when a system behavior is technically possible but strategically wrong.
- Infra and Policy Control. Humans own permissions, deployment boundaries, policy changes, and escalation paths.
- Exception Handling. Humans interpret edge cases, conflict states, and cross-domain ambiguity.
In other words: AI can accelerate analysis, summarization, contradiction discovery, and report generation. It should not silently inherit decision authority just because it is fast.
Security Architecture
Security is not a final checklist item. In AI-assisted systems it must be designed into every upstream layer: input handling, context assembly, provider boundaries, output validation, CI permissions, and operational review.
- Prompt Injection Resistance. PR authors, diffs, and input payloads are untrusted. Context must be screened before the provider call.
- Data Exfiltration Prevention. Sensitive paths, secrets, PII, and irrelevant configuration must be denied or redacted before context assembly.
- Least-Privilege Automation. Default pipeline permissions should be read-only unless explicit write behavior is required and reviewed.
- Authority Separation. AI output may be structured and useful without being authoritative. Deterministic checks and human review remain the source of actual control.
- Supply-Chain Discipline. Dependencies, Actions versions, templates, and schemas should be pinned so automation does not drift into unknown behavior.
- Visible Failure States. Unsafe or malformed behavior should surface as explicit failure. Silent recovery hides risk.
The shortest honest summary is this: safe agent systems are built by distrusting them correctly.
Case 02 — Human Agentic Pipeline
This case study documents the operating model behind a human-led agentic development pipeline. The objective is not to simulate autonomous magic. The objective is to design a system in which AI can accelerate engineering work without dissolving accountability, architectural control, or verification discipline.
In this model, AI is routed into bounded roles inside a controlled workflow. Humans retain authority over judgment, quality control, infrastructure, and final acceptance. The system is designed to produce auditable artifacts, visible checkpoints, deterministic handoff boundaries, and repeatable outputs rather than vague conversational momentum.
Problem Definition
Most “agentic” workflows fail for one of two reasons. Either they are too loose and devolve into expensive improvisation, or they are so tool-driven that no one can explain where authority lives, why a change happened, or whether the output still matches the original specification.
The engineering problem addressed here is therefore:
How do you structure a human-led, AI-assisted development system that can produce meaningful velocity while preserving deterministic phase order, verification gates, explicit authority boundaries, and drift resistance?
The answer is not “more autonomy.” The answer is architecture. Agentic systems only become useful when their behavior is constrained more like a build pipeline and less like a free-form assistant.
Operating Constraints
- Strict Phase Order. Work must progress in a declared sequence. Architecture cannot be skipped, verification cannot be hand-waved, and implementation cannot silently rewrite system intent.
- No Spec Drift. The process is anchored to canonical blueprints and engineering specs. If the output cannot be traced back to those anchors, it is drift.
- No Hidden Authority. Roles are separated. An implementation agent does not gain architectural authority merely by writing code first.
- Artifact-Based Work. Each phase should emit inspectable artifacts rather than conversational summaries. The system should leave behind evidence, not just momentum.
- Assumptions Must Collapse to Zero. If critical assumptions remain, the process is not ready to progress. Guessing is treated as a process failure, not a creative virtue.
- Idempotent Passes. Every pass should be independently reproducible. Partial patches, hand-wavy edits, and unbounded "just improve it" loops are not acceptable operating modes.
- Human Final Authority. Human reviewers own the right to accept, reject, redirect, or halt the system at any stage.
Blueprint Architecture
The blueprint for a human agentic pipeline starts by defining role boundaries and execution order before discussing implementation. In a healthy agentic system, “who may decide what” is as important as “what code gets written.”
The structure I use is phase-driven and role-separated. The architect locks anchors and non-negotiables first. Tooling may only express what the architecture already allows. Implementation is scope-constrained to the approved tree. Verification must halt the system on drift rather than negotiate with it.
- Phase 0 — Spec Anchors. Establish canonical files, anchor quotes, and derived non-negotiables. This is where the system proves it understands the assignment before it starts building.
- Phase 1 — Architecture Plan. Define the exact file tree, module boundaries, dependencies, and compliance mapping. No unanchored structure is permitted.
- Phase 1b — Tooling Checklist. Confirm what CI commands, configs, and repo expectations are required. The tooling agent is not allowed to introduce whimsical changes.
- Phase 1c — Verification Gate. Confirm the drift check is empty, anchor coverage is complete, and assumptions are zero before implementation begins.
- Phase 2 — Core Implementation. Emit only files already justified by the architecture plan. No speculative expansion. No structure creep.
- Phase 3 — Hardening. Fix lint, type, and build failures without reopening architecture. Hardening is for compliance and polish, not redesign.
This structure matters because it prevents the most common failure mode in AI-heavy development: implementation racing ahead of architecture and forcing the system to rationalize drift after the fact.
Canonical Blueprint Markdown
The following appendix is mirrored locally from the orchestration lab blueprint and displayed here as canonical markdown.
ExNulla Blueprint
Human Agentic Orchestration Lab (Standalone Showpiece)
Repository (proposed): exnulla-orchestration-lab
Slug: orchestration-lab
Version: 1.1.0 (supersedes human-agentic-trainer v1.0.0)
Owner org: Thesis-Project (professional)
Primary goal: Portfolio-grade, standalone orchestration lab that can optionally embed as a demo via iframe (static-first).
0. Positioning
This project is a standalone orchestration lab that teaches and demonstrates agentic pipeline mechanics with:
- Human transport (copy/paste between ChatGPT Projects) as the default execution provider.
- Deterministic state machine and artifact ledger as the core product.
- A clean upgrade path to API-based providers without rewriting orchestration logic.
It is intentionally “too serious” to be a toy demo.
1. Objectives
1.1 Core educational objectives
Teach (visibly, not abstractly):
- Role separation and instruction boundaries
- Prompt routing and supervisor logic
- Context drift origins, detection, and recovery
- Critic/revision loops and acceptance criteria closure
- Budget discipline, token economy, and trade-offs
1.2 Core product objectives
Provide a reproducible lab environment:
- Deterministic run capture + replay
- Run artifact inspection (graph + diffs + drift flags)
- Failure-mode injection and recovery demonstration
- Formal role contract enforcement (schema validated outputs)
- Cost and budget dashboards (simulated + estimated)
1.3 Optional objective (Phase 2)
Provider adapters for API orchestration (OpenAI/Anthropic/etc.) that reuse the same run state machine.
2. Constraints and non-goals
2.1 Constraints
- Static-first deployment: default build outputs a static web app.
- Atomic deploy friendly: build artifact can be deployed with symlink flips.
- Iframe-safe: must function correctly when embedded in an iframe sandbox.
- No scraping / no UI automation: human transport remains manual by design.
2.2 Non-goals (v1.1)
- No live ChatGPT UI integration.
- No storing personal secrets or API keys in the browser (Phase 2 moves to server runtime).
- No “magic” agent framework wrapper that hides orchestration mechanics.
3. Target users
- Learners: understand orchestration by running guided pipelines.
- Hiring reviewers: see a polished, deterministic systems artifact with auditability.
- Future-you: use specs + blueprint to build an API agent framework later without drift.
4. High-level architecture
4.1 Components
LOC (Local Orchestration Console)
- Runs locally (dev) and/or as a static app (prod) with persistence in browser storage and export/import.
- Generates role prompts, enforces contracts, logs turns, computes budgets, flags drift, scores rubrics.
Run Ledger + Artifact Store
- Run JSON artifacts are canonical.
- Export is deterministic: same inputs → same run structure (timestamps excluded or normalized).
Inspector UI (Showpiece layer)
- Graph view (turn DAG)
- Drift panels
- Budget/cost panels
- Failure injection controls
- Replay timeline controls
Provider Adapter Layer (Transport abstraction)
- HumanProvider (v1.1): manual paste-in/out
- SimulatedProvider (v1.1): fake latency/cost/reliability without APIs
- API Providers (v2+): optional later
4.2 “Square peg / round hole” mitigation
This repo is designed as standalone. If embedded into exnulla-demos, it is treated as a static build artifact embedded via iframe with a constrained integration contract (Section 13).
5. Deterministic state model
5.1 Canonical run artifact
runs/<RUN_ID>/run.json
Minimum fields:
- schemaVersion (semver-like)
- gitSha (injected at build time)
- runId
- createdAt (optional; normalized for deterministic replay exports)
- scenarioId (the selected training scenario)
- roles[] (role profiles and constraints)
- turns[] (ordered, each with routing metadata and validation results)
- artifacts[] (files/snippets produced by turns)
- budgets (per-turn + cumulative)
- rubric (scoring + thresholds)
- drift (flags + evidence + severity)
- acceptance (pass/fail + reasons)
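The minimum fields above can be expressed as a partial TypeScript shape for `runs/<RUN_ID>/run.json`. The nested shapes (turns, rubric, drift, and so on) are abbreviated illustrations, not the canonical schema.

```typescript
// Partial, illustrative shape of the canonical run artifact.
interface RunArtifact {
  schemaVersion: string; // semver-like
  gitSha: string;        // injected at build time
  runId: string;
  createdAt?: string;    // optional; normalized for deterministic replay exports
  scenarioId: string;
  roles: Array<{ id: string; constraints: string[] }>;
  turns: Array<{ index: number; role: string; valid: boolean }>;
  artifacts: Array<{ id: string; path: string }>;
  budgets: { perTurn: number[]; cumulative: number };
  rubric: { scores: Record<string, number>; thresholds: Record<string, number> };
  drift: Array<{ flag: string; evidence: string; severity: "info" | "warn" | "error" }>;
  acceptance: { pass: boolean; reasons: string[] };
}
```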
5.2 Deterministic replay guarantee
Given:
- Same scenarioId
- Same initial inputs
- Same turn responses (copied)
- Same schemaVersion
Then:
- The run artifact validation and derived metrics must match.
6. Role system
6.1 Default roles
- architect
- developer
- critic
- tester
- (optional) supervisor (internal; LOC-driven orchestration)
6.2 Required ChatGPT Project setup (Human Provider)
Each role is configured as its own ChatGPT Project with persistent instructions.
The LOC provides:
- Copy-paste “Project Instructions” templates per role.
- A “Project Setup Checklist” with validation steps.
6.3 Formal role contract enforcement (new)
Each role response must conform to a strict schema (e.g., JSON or structured markdown blocks).
LOC validates:
- Schema validity
- Required fields present
- Artifact references resolvable
- No forbidden sections (role boundary rules)
If invalid:
- LOC flags a contract violation.
- LOC generates a corrective “format repair” prompt for the same role.
7. Drift detection and recovery
7.1 Drift signals (v1.1)
Rule-based detection, including:
- Missing constraints or acceptance criteria
- Contradictions vs. scenario requirements
- Output schema violations
- Spec deviations (e.g., wrong repo, wrong language, ignored deterministic rules)
- Over-budget warnings and verbose inflation
- “Unresolved questions” not propagated
7.2 Drift scoring
Each signal adds weighted severity:
- info / warn / error
- Cumulative drift score shown in Inspector UI
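A weighted cumulative drift score of the kind described in 7.2 can be sketched as follows. The specific weight values are illustrative assumptions; the blueprint fixes the severity levels but not the weights.

```typescript
type Severity = "info" | "warn" | "error";

interface DriftSignal {
  rule: string;      // e.g. "missing-constraint" (illustrative rule id)
  severity: Severity;
  evidence: string;  // linked evidence shown in the Inspector UI
}

// Illustrative weights: each signal adds weighted severity to the run score.
const WEIGHTS: Record<Severity, number> = { info: 1, warn: 3, error: 10 };

function driftScore(signals: DriftSignal[]): number {
  return signals.reduce((total, s) => total + WEIGHTS[s.severity], 0);
}
```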
7.3 Recovery loops
LOC generates recovery prompts:
- “Re-anchor constraints” prompt for Architect
- “Patch minimal diff” prompt for Developer
- “Re-evaluate rubric” prompt for Critic
- “Regression / edge-case sweep” prompt for Tester
8. Failure mode injection (new showpiece capability)
8.1 Purpose
Turn the lab into a resilience demonstrator:
- show failures
- show detection
- show recovery
- show cost impact
8.2 Injection modes (v1.1)
- Ambiguous spec: remove/blur key constraints
- Conflicting constraints: intentionally contradict requirements
- Truncated context: simulate missing prior turns
- Bad critic: introduce incorrect critique or wrong rubric thresholds
- Budget crunch: set very low budget caps mid-run
8.3 Implementation concept
Injection modifies:
- scenario inputs
- routing prompts
- role templates
- budget parameters
LOC must record injection events in run artifact (injections[]).
9. Budget and economics (expanded)
9.1 Token estimation
- Estimate tokens from characters (baseline) and/or model-specific heuristics.
- Record per-turn estimate and cumulative.
9.2 Cost simulation
For v1.1 (no real API calls):
- user selects “pricing profile” presets (cheap / mid / premium)
- LOC computes simulated cost per turn and total
- show “what this would cost” with model tiers
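The v1.1 simulation described above (character-based token estimation plus pricing presets) can be sketched like this. The 4-characters-per-token ratio and all dollar figures are illustrative placeholders, not real vendor prices.

```typescript
interface PricingProfile {
  name: string;
  usdPerMillionInputTokens: number;
  usdPerMillionOutputTokens: number;
}

// Illustrative cheap / mid / premium presets; values are placeholders.
const PROFILES: Record<string, PricingProfile> = {
  cheap:   { name: "cheap",   usdPerMillionInputTokens: 0.15, usdPerMillionOutputTokens: 0.6 },
  mid:     { name: "mid",     usdPerMillionInputTokens: 3,    usdPerMillionOutputTokens: 15 },
  premium: { name: "premium", usdPerMillionInputTokens: 15,   usdPerMillionOutputTokens: 75 },
};

// Baseline heuristic: roughly 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Simulated "what this would cost" for a single turn under a pricing profile.
function simulatedTurnCost(
  profile: PricingProfile,
  inputText: string,
  outputText: string,
): number {
  const inTok = estimateTokens(inputText);
  const outTok = estimateTokens(outputText);
  return (
    (inTok * profile.usdPerMillionInputTokens +
      outTok * profile.usdPerMillionOutputTokens) / 1_000_000
  );
}
```

Summing per-turn costs yields the cumulative figure that feeds the burn-down chart and budget warnings.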
9.3 Dashboard outputs
- burn-down chart over time
- per-role share of tokens/cost
- budget threshold warnings
- cost of drift (extra turns caused by drift recovery)
10. Visual Inspector UI (new, high impact)
10.1 Views
- Run Timeline
- turn list with role, timestamp, budget, validation, drift severity
- Turn Graph (DAG)
- nodes: turns
- edges: handoffs / dependencies
- highlights: drift, contract violations
- Diff View
- compare two turns (or two runs) for changes in constraints, artifacts, budgets
- Rubric Panel
- category scores and thresholds
- reasons for pass/fail
- Injection Panel
- list and details of injected failures
10.2 UX principles
- No hidden magic. Every derived conclusion links to evidence.
- Export/import first-class.
- Works in iframe (no popups, no cross-origin dependencies).
11. Multi-model simulation layer (optional in v1.1)
11.1 Why
Prepare learners for API orchestration by teaching tradeoffs:
- latency
- cost
- reliability
- verbosity
11.2 How (without APIs)
Simulated Provider:
- assigns “model personality presets” to roles
- applies constraints (e.g., “fast model tends to be terse and miss edge cases”)
- introduces optional random error rates (seeded for determinism)
All simulation parameters must be recorded in the run artifact.
12. Tech stack and repo shape (static-first)
12.1 Proposed stack
- TypeScript (strict)
- Vite (static build)
- React (or Astro + React islands; choose one)
- Zod (schema validation)
- Vitest (tests)
- ESLint + Prettier (enforced)
- Docker for deterministic builds
12.2 Repo layout (proposed)
exnulla-orchestration-lab/
apps/
loc-web/ # static web app
packages/
core/ # state machine, schemas, scoring, drift
scenarios/ # scenario definitions + injection templates
ui/ # inspector components
cli/ # optional CLI runner/export tools (v1.2+)
runs/ # sample runs (optional; or in /examples)
docs/
blueprint/ # this blueprint
engineering-spec/ # detailed spec (separate doc)
role-instructions/ # ChatGPT Project templates per role
.github/workflows/
Dockerfile
package.json
pnpm-workspace.yaml
12.3 Deterministic build requirements
- Inject GIT_SHA at build time (ARG + ENV)
- Include meta/version.json with git SHA and build timestamp (timestamp optional/normalized)
- Lockfile required (pnpm)
- CI must block merges if lint/test fail
13. Deployment and iframe embedding
13.1 Default deployment (standalone)
- Static build served by nginx or any static host
- Atomic deploy by swapping symlinked build directory
13.2 Iframe embedding (optional)
If embedded in exnulla-site or exnulla-demos:
- build outputs to a single folder root with relative assets
- no service-worker assumptions that conflict with host
- storage uses namespaced keys: exnulla.orchestrationLab.<runId>, etc.
- export/import uses file download/upload, not cross-window messaging
13.3 Integration contract (minimal)
- Provide a single embed URL (e.g., /demos/orchestration-lab/index.html)
- Optionally provide a postMessage integration later (v2+); not required for v1.1
14. Milestones
v1.1.0 (Showpiece baseline)
- Core state machine + run artifact schema
- HumanProvider workflow
- Role contract enforcement + repair prompts
- Drift detection v1 (rules)
- Budget + cost dashboards (simulated)
- Inspector UI with DAG + timeline + rubric
- Failure injection panel + recorded injection events
- Export/import runs (JSON) + deterministic replay validation
- Docker + CI hygiene (lint/test/build)
v1.2.x
- Scenario library expansion (3–6 scenarios)
- CLI utilities for run validation and report generation
- Run comparison tool (diff two runs)
v2.x
- API provider adapters (optional)
- Tool execution hooks (optional)
- Multi-tenant “course mode” (optional)
15. Acceptance criteria
A v1.1 release is “done” when:
- A learner can complete a guided run end-to-end using only copy/paste.
- LOC validates role outputs against the schema and produces repair prompts.
- Drift flags trigger reliably on injected failures.
- Inspector clearly explains why drift was flagged (evidence linked).
- Exported run artifact can be imported and replay-validated deterministically.
- Static build deploys cleanly and works in an iframe.
- CI enforces strict TypeScript, linting, formatting, and tests.
- meta/version.json exposes the build SHA.
16. Notes on scope control
This is a showpiece, but it stays manageable by enforcing:
- Deterministic core first
- UI second (inspector)
- Scenario count limited in v1.1
- Simulation kept optional and seeded (no randomness without seed)
17. Deliverables (docs)
This blueprint implies the following docs in-repo:
- docs/blueprint/exnulla-blueprint-orchestration-lab-1-1-0.md (this file)
- docs/engineering-spec/exnulla-engineering-spec-orchestration-lab-1-1-0.md (next step)
- docs/role-instructions/*.md (ChatGPT Project templates)
- docs/runbook/DEPLOY.md (atomic static deploy)
- docs/runbook/IFRAME.md (embedding contract)
18. Repo naming rationale
Recommended: exnulla-orchestration-lab
Signals “serious systems lab” rather than “toy demo,” while staying on-brand.
Alternate options:
- exnulla-agentic-lab
- exnulla-orchestrator-lab
- exnulla-human-to-api-orchestration
Engineering Specifications
The engineering spec for this operating model does not merely describe features. It defines behavioral law for the build process itself. That includes output format, file authority, acceptance gates, CI discipline, and what kinds of changes are explicitly forbidden.
In practical terms, the spec must answer these questions:
- Which files are canonical inputs to the build?
- What exact artifacts must each phase emit?
- What counts as drift?
- What work is allowed in hardening versus architecture?
- How are assumptions surfaced and eliminated?
- How does verification prove coverage rather than imply it?
- How is output constrained so the system remains reproducible?
1. Output Discipline
Full-file emission matters because it prevents hidden partial edits, accidental omissions, and conversational patch ambiguity. The system should produce complete artifacts, not vague change suggestions.
2. Structure Discipline
New files may only exist if they are explicitly defined in the spec or derived in the architecture plan with anchor mapping. Unanchored structure is drift.
3. Verification Discipline
Verification is not a final glance at output quality. It is a formal gate with required proof: drift check empty, anchor coverage present, assumptions empty.
4. CI Discipline
The process assumes lint, typecheck, and build are mandatory. The agentic workflow is not complete because it “looks right.” It is complete when the repo gates are green.
5. Idempotency Discipline
Every pass should be reproducible from scratch. The pipeline should not rely on hidden chat context, implicit globals, or fragile one-off edits that cannot be replayed.
6. No-Hidden-Globals Rule
Environment requirements, allowed inputs, and tool expectations must be explicit. Invisible ambient state is a major source of drift and operational failure.
Canonical Engineering Spec Markdown
The following appendix is mirrored locally from the orchestration lab engineering spec and displayed here as canonical markdown.
ExNulla Engineering Spec
Human Agentic Orchestration Lab (Standalone Showpiece)
Repository: exnulla-orchestration-lab
Slug: orchestration-lab
Spec Version: 1.1.0
Blueprint: exnulla-blueprint-orchestration-lab-1-1-0.md
Owner org: Thesis-Project
Primary mode: Static-first web app (iframe-safe)
Provider mode (v1.1): Human transport + simulated provider (no APIs)
Last Updated (UTC): 2026-02-27T00:00:00Z
0. Scope and determinism contract
0.1 What this spec is
An implementation-grade engineering spec for a standalone orchestration lab that:
- makes orchestration mechanics visible (role separation, routing, drift, budgets),
- captures every run as a deterministic run artifact ledger (`run.json`),
- provides an inspector UI (timeline, DAG, diffs, rubric, injections),
- supports export/import + deterministic replay validation,
- works in an iframe sandbox and deploys as an atomic static artifact.
This spec is written so that it can later be handed back with the single instruction "build it" and executed with minimal drift.
0.2 Hard constraints (MUST)
- Static-first: `pnpm build` outputs a static bundle that can be hosted by nginx or any static host.
- Iframe-safe: no popups, no cross-origin assumptions, no top-level navigation hacks.
- No UI automation/scraping: human transport is manual by design.
- Deterministic core: orchestration/state evaluation must be deterministic given the same inputs + responses.
- Export/import first-class: runs are portable JSON artifacts; UI can import/export.
- No secrets: browser build stores no API keys; v1.1 has no real provider calls.
- Repo hygiene: TypeScript strict, ESLint + Prettier, tests, Docker deterministic build.
0.3 Non-goals (v1.1)
- Live integration with ChatGPT UI.
- Multi-user authentication / cloud persistence.
- Real API providers (OpenAI/Anthropic/etc.) beyond interface stubs.
- ML-based drift classification (rule-based + evidence only).
0.4 Deterministic replay guarantee (MUST)
Given:
- identical `scenarioId`,
- identical scenario inputs,
- identical injection set (including seed),
- identical agent responses pasted into the ledger,
- identical `schemaVersion`,

then: validation results, drift flags, rubric scores, budget totals, and derived digests MUST match.
Allowed non-determinism:
- wall-clock timestamps can exist but MUST be excluded from deterministic checks (or normalized under export).
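The replay guarantee above can be sketched as a validation routine. This is an illustrative sketch, not the shipped implementation: `deriveAll` is a hypothetical stand-in for the core derivation pipeline, and the field names follow the RunArtifact schema in section 5.

```typescript
type ReplayResult = { matches: boolean; mismatchedFields: string[] };

// Recompute all derived ledgers from the artifact's inputs + responses
// and compare against what the artifact stored. Timestamps are the only
// allowed non-determinism, so they are stripped before comparison.
function replayValidate(
  artifact: Record<string, any>,
  deriveAll: (a: Record<string, any>) => Record<string, any>,
): ReplayResult {
  const { createdAt, updatedAt, ...normalized } = artifact;
  const recomputed = deriveAll(normalized);
  const mismatchedFields: string[] = [];
  for (const field of ['derived', 'drift', 'rubric', 'budgets']) {
    if (JSON.stringify(recomputed[field]) !== JSON.stringify(artifact[field])) {
      mismatchedFields.push(field);
    }
  }
  return { matches: mismatchedFields.length === 0, mismatchedFields };
}
```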
1. Product definition
1.1 Core workflows
- Create run
  - user selects scenario, provider mode, seed, budget/cost profile, and optional injections.
- Generate routed prompt
  - LOC produces a prompt for a role and explicit routing instructions.
- Human transport
  - user executes the prompt in the role's ChatGPT Project and pastes the response into the LOC.
- Validate + score
  - LOC validates schema/format, computes budgets/cost, flags drift, updates rubric, derives next step.
- Inspect
  - user inspects timeline, graph, diffs, drift evidence, rubric reasoning, injection events.
- Export / Import
  - export run as JSON (and optional markdown transcript); import later and replay-validate deterministically.
- Compare
  - compare runs (or turns) via diff UI (v1.1: within one run; v1.2: cross-run).
1.2 Target user profiles
- Learner / developer wanting “pre-calc → calc” understanding of orchestration.
- Hiring reviewers assessing systems thinking + determinism discipline.
- Future-you using the ledger/state machine for API orchestration later.
2. Architecture overview
2.1 Packages (MUST)
- `packages/core`: deterministic state machine, schemas, scoring, drift, budgets, providers, export/import, deterministic hashing.
- `packages/scenarios`: scenario definitions, injection templates, seeded simulation knobs, scenario validation.
- `packages/ui`: shared UI components (graph, diff, panels), pure/presentational where possible.
- `apps/loc-web`: Vite + React static web app (run wizard, prompt router, paste console, inspector).
2.2 Runtime boundaries
- All deterministic logic lives in `packages/core` and must be usable:
  - from the web app, and
  - from future CLI tooling (v1.2+).
- The web app is a thin shell around the core.
2.3 Transport / provider abstraction
- `HumanProvider` (v1.1): manual paste; produces routing instructions only.
- `SimulatedProvider` (v1.1): produces deterministic, seeded "simulated outputs" for demonstration/testing.
- `ApiProvider` (v2+): stub interface only in v1.1 (no keys, no calls).
3. Tech stack and repo standards
3.1 Required stack (MUST)
- Node.js LTS (recommend 20.x)
- TypeScript with `strict: true`
- pnpm + lockfile
- Vite + React (single-page app)
- Zod for runtime validation
- Vitest for unit/integration tests
- ESLint + Prettier enforced
- Docker for deterministic builds
3.2 Deterministic build provenance (MUST)
- Build accepts `ARG GIT_SHA` and injects it into the app as `import.meta.env.VITE_GIT_SHA` (Vite) and/or `process.env.GIT_SHA` (tests/build scripts).
- Build outputs `meta/version.json` containing:
  - `gitSha`,
  - `schemaVersion`,
  - `buildId` (optional; may be derived deterministically from gitSha + package versions),
  - `builtAt` (optional; if present, must be excluded from determinism checks).
4. Repository layout
4.1 Canonical layout (MUST)
exnulla-orchestration-lab/
apps/
loc-web/
index.html
vite.config.ts
src/
app/
routes/
state/
components/
main.tsx
public/
meta/
version.json
packages/
core/
src/
schema/
engine/
providers/
scoring/
drift/
budget/
export/
util/
tests/
scenarios/
src/
scenarios/
injections/
pricing/
tests/
ui/
src/
graph/
diff/
panels/
widgets/
docs/
blueprint/
engineering-spec/
role-instructions/
runbooks/
examples/
runs/
scenarios/
.github/
workflows/
Dockerfile
docker-compose.yml (optional)
package.json
pnpm-workspace.yaml
pnpm-lock.yaml
tsconfig.base.json
eslint.config.js
prettier.config.cjs
4.2 Git ignore rules
- Ignore persisted runs by default:
  - `apps/loc-web/.local/` (dev-only)
  - `**/runs/**` except `examples/runs/**`
- Include:
  - at least one sample run artifact in `examples/runs/` for regression tests and UI demo.
5. Data model: canonical run ledger
5.1 Canonical artifact path semantics
The canonical artifact is a single JSON object:
- Web app storage: stored in browser (IndexedDB preferred; localStorage acceptable for v1.1 with size limits)
- Exported artifact: user downloads a file named `orchestration-lab.run.<runId>.json`
When building a “runs folder” later (CLI), the canonical structure will be:
`runs/<runId>/run.json` (not required for static build)
5.2 Schema versioning
- `schemaVersion` is a semver-like string, pinned to the spec version for v1.1: `"1.1.0"`
- Backward compatibility requirements:
  - v1.1 UI must import artifacts with `schemaVersion` `"1.1.0"`.
  - Future versions must provide migration utilities (v1.2+).
5.3 RunArtifact schema (MUST)
5.3.1 Top-level
export type RunArtifact = {
schemaVersion: '1.1.0';
slug: 'orchestration-lab';
gitSha: string; // injected at build; "unknown" allowed
runId: string; // deterministic id format
createdAt?: string; // ISO; optional for determinism checks
updatedAt?: string; // ISO; optional for determinism checks
mode: {
provider: 'human' | 'simulated'; // v1.1
simulation?: SimulationConfig; // if simulated
};
scenario: {
scenarioId: string;
version: string; // scenario version string, e.g. "1.0.0"
inputs: Record<string, unknown>;
};
injections: InjectionEvent[]; // applied injections, deterministic order
roles: RoleProfile[]; // role contracts + instructions metadata
turns: Turn[]; // append-only
derived: DerivedState; // regenerated deterministically
budgets: BudgetLedger; // token estimates, warnings
economics: EconomicsLedger; // simulated cost and profiles
rubric: RubricLedger; // scoring + thresholds + evidence
drift: DriftLedger; // flags + evidence + severity summary
acceptance: {
passed: boolean;
reasons: string[];
checklist: { item: string; status: 'pass' | 'fail' | 'unknown'; evidence?: string[] }[];
};
};
5.3.2 RoleProfile
export type RoleName = 'architect' | 'developer' | 'critic' | 'tester';
export type RoleProfile = {
role: RoleName;
displayName: string;
chatgptProjectName: string; // user-configurable label
instructionTemplateId: string; // e.g. "role-architect-1.1.0"
contract: RoleContract;
};
export type RoleContract = {
responseFormat: 'structured_markdown_v1' | 'json_v1';
requiredHeaders: string[]; // exact heading strings
requiredSections: string[]; // section ids
forbiddenPatterns: string[]; // regex strings
maxCodeBlockChars?: number; // heuristic for role confusion
mustEchoRunTurnHeader: boolean; // require runId/turnId header block
};
5.3.3 Turn
export type Turn = {
turnId: number; // 1..n
role: RoleName;
prompt: {
templateId: string; // prompt template key
text: string;
charCount: number;
tokenEstimate: number;
stateDigestHash: string; // hash of digest included in prompt
};
response: {
text: string;
charCount: number;
tokenEstimate: number;
parsed?: ParsedResponse; // result of parsing per contract
contractValid: boolean;
contractErrors: string[];
};
analysis: {
driftFlags: DriftFlag[];
rubricScore: RubricScore;
notes: string[]; // deterministic, engine-generated notes only
};
timestamps?: { promptedAt: string; respondedAt: string }; // optional
};
5.3.4 DerivedState (regenerated)
export type DerivedState = {
digest: StateDigest; // compact state summary
digestHash: string; // stable hash of digest
openIssues: Issue[];
artifactsIndex: ArtifactRef[];
loopCountByStage: Record<string, number>;
completion: { done: boolean; nextRole: RoleName | null; stage: Stage };
};
5.3.5 Digest / issues / artifacts
export type Stage = 'kickoff' | 'implementation' | 'review' | 'test' | 'revise' | 'finalize';
export type StateDigest = {
scenarioSummary: string; // scenario-provided summary, bounded
constraints: string[]; // scenario constraints, stable order
acceptanceCriteria: string[]; // stable order
deliverables: string[]; // stable order
lastDecisions: string[]; // last 3 decisions (deterministic extraction)
openQuestions: string[]; // extracted from critic/tester
artifactHints: string[]; // from dev outputs / plan sections
};
export type Issue = {
id: string; // stable hash id
severity: 'info' | 'warn' | 'error';
source: 'critic' | 'tester' | 'engine';
message: string;
evidence: string[];
open: boolean;
};
export type ArtifactRef = {
id: string; // stable hash id
kind: 'snippet' | 'filetree' | 'patch' | 'plan' | 'testplan';
title: string;
producedByTurnId: number;
contentHash: string;
excerpt: string; // bounded excerpt for UI
};
5.4 Deterministic hashing (MUST)
- Use a stable hash for digests, issues, artifacts:
sha256(canonicalJsonString(value))
- Canonical JSON stringification:
- stable key ordering,
- no whitespace variability,
- arrays kept in order.
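A minimal sketch of this hashing rule, assuming Node's `node:crypto` (the browser build would use `crypto.subtle` instead). `canonicalJsonString` and `stableHash` here are illustrative, not the canonical `src/util` implementations.

```typescript
import { createHash } from 'node:crypto';

// Canonical JSON: sorted object keys, no whitespace, arrays in order.
function canonicalJsonString(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJsonString).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const keys = Object.keys(value as object).sort(); // stable key ordering
    const body = keys
      .map((k) => `${JSON.stringify(k)}:${canonicalJsonString((value as any)[k])}`)
      .join(',');
    return `{${body}}`;
  }
  return JSON.stringify(value); // primitives: no whitespace variability
}

// sha256(canonicalJsonString(value)), per the rule above.
function stableHash(value: unknown): string {
  return createHash('sha256').update(canonicalJsonString(value)).digest('hex');
}
```

Two objects with the same content but different key insertion order hash identically, which is what makes digest and artifact ids stable across replays.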
6. Scenario system
6.1 Scenario definition format (MUST)
Scenarios are authored as TypeScript objects in packages/scenarios and exported as a registry.
export type Scenario = {
scenarioId: string; // e.g. "hello-orchestration"
version: string; // semver string
title: string;
summary: string; // bounded summary
description: string;
constraints: string[]; // stable order
acceptanceCriteria: string[]; // stable order
deliverables: string[]; // stable order
roleTemplates: {
architect: PromptTemplateId;
developer: PromptTemplateId;
critic: PromptTemplateId;
tester: PromptTemplateId;
};
initialInputsSchema: z.ZodTypeAny; // validates scenario inputs
defaultInputs: Record<string, unknown>;
rubricProfileId: string; // ties to rubric weights
};
6.2 Required scenarios (v1.1)
Ship 3 scenarios minimum (MUST), each designed to show different drift/failure types:
- `hello-orchestration`: simple deterministic task; emphasizes contracts + budgets.
- `drift-trap-spec`: ambiguous requirements; emphasizes clarification propagation and re-anchoring.
- `regression-loop`: forces test failures and revise loops; emphasizes loop caps and cost-of-drift.
6.3 Scenario determinism rules
- Scenario registry ordering must be stable (sort by `scenarioId`).
- Scenario inputs are validated and stored verbatim in the run artifact.
- Any scenario-generated derived values must be stored or recomputable deterministically.
7. Role system and ChatGPT Project setup
7.1 Role instruction templates (MUST)
Ship templates in docs/role-instructions/:
- `architect.md`
- `developer.md`
- `critic.md`
- `tester.md`
Each template MUST contain:
- Mission
- Allowed outputs
- Forbidden actions
- Required response format contract
- Determinism rules (“no hallucinated filenames; state assumptions explicitly”)
- Interaction protocol for missing info (“ask targeted questions; do not proceed with guesses”)
7.2 Contract format: structured_markdown_v1 (default)
All role responses MUST begin with an exact header block:
# Role: <Architect|Developer|Critic|Tester>
# Run: <runId>
# Turn: <turnId>
Then role-specific sections with fixed headings (examples below). LOC must validate these headings (case-sensitive) as the contract baseline.
Architect required headings
- `## Constraints (Do Not Violate)`
- `## Acceptance Criteria (Checklist)`
- `## System Plan`
- `## Open Questions`
- `## Next Handoff`
Developer required headings
- `## Implementation Plan`
- `## Proposed File Tree`
- `## Patch / Diff`
- `## Notes for Critic`
- `## Next Handoff`
Critic required headings
- `## Contract Validation`
- `## Drift Signals`
- `## Rubric Scoring`
- `## Blocking Issues`
- `## Non-Blocking Suggestions`
- `## Next Handoff`
Tester required headings
- `## Test Plan`
- `## Test Results`
- `## Failures / Repro Steps`
- `## Risk Assessment`
- `## Next Handoff`
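The case-sensitive heading check described above can be sketched as follows. `validateHeadings` is an illustrative name, not the canonical core function; the returned list of missing headings is what would feed the repair-prompt flow in section 7.3.

```typescript
// Sketch of the contract baseline check LOC runs on a pasted response.
// A heading "passes" only on an exact, case-sensitive line match.
function validateHeadings(responseText: string, requiredHeaders: string[]): string[] {
  const lines = responseText.split('\n').map((l) => l.trim());
  // Return the headings that never appear as a standalone line.
  return requiredHeaders.filter((h) => !lines.includes(h));
}
```

An empty return value means the heading baseline is satisfied; any entries become contract errors on the turn.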
7.3 Repair prompts (MUST)
If a response fails contract validation:
- the engine must generate a repair prompt for the same role that:
  - explicitly lists missing headings/fields,
  - instructs the role to rewrite in the required format,
  - forbids changing substantive content beyond formatting unless requested.
Repair events must be recorded as:
- a drift flag `DRIFT_CONTRACT_VIOLATION`,
- plus an engine note explaining the repair required.
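A hedged sketch of a repair-prompt generator for this flow. The exact wording below is an assumption; only the required elements (the echoed run/turn header, the list of missing fields, the rewrite instruction, and the no-substantive-changes rule) come from the spec.

```typescript
// Illustrative repair-prompt builder per the rules above. Prompt text
// is a placeholder, not the normative template.
function buildRepairPrompt(
  role: string,
  runId: string,
  turnId: number,
  missing: string[],
): string {
  return [
    `# Role: ${role}`,
    `# Run: ${runId}`,
    `# Turn: ${turnId}`,
    '',
    'Your previous response failed contract validation.',
    `Missing required headings/fields: ${missing.join(', ')}.`,
    'Rewrite the response in the required format.',
    'Do not change substantive content beyond formatting.',
  ].join('\n');
}
```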
8. Orchestration engine (state machine)
8.1 Engine API surface (MUST)
In packages/core/src/engine/ implement:
export type EngineInput = {
run: RunArtifact;
event: EngineEvent;
};
export type EngineEvent =
| { type: 'INIT_RUN'; scenarioId: string; inputs: Record<string, unknown>; config: RunConfig }
| { type: 'PASTE_RESPONSE'; text: string }
| { type: 'APPLY_INJECTION'; injectionId: string; params?: Record<string, unknown> }
| { type: 'SET_BUDGET_CAP'; tokenEstimateCap: number }
| { type: 'SET_PRICING_PROFILE'; profileId: string }
| { type: 'RESET_TO_TURN'; turnId: number }; // optional v1.1, required v1.2
export type EngineOutput = {
run: RunArtifact; // updated artifact
next: {
role: RoleName | null;
stage: Stage;
routingInstruction?: string;
promptText?: string;
};
diagnostics: {
contractErrors?: string[];
driftFlags?: DriftFlag[];
rubricScore?: RubricScore;
};
};
export function stepEngine(input: EngineInput): EngineOutput;
8.2 Deterministic derivation pipeline (MUST)
On each PASTE_RESPONSE:
- Identify expected role/stage from `run.derived.completion`.
- Validate response contract; parse into `ParsedResponse`.
- Compute charCount + tokenEstimate.
- Run drift detection (rule-based) with evidence.
- Run rubric scoring (rule-based) with evidence.
- Update budgets + economics ledgers.
- Derive `DerivedState` from all prior turns deterministically.
- Choose the next role/stage based on transition rules.
8.3 Transition rules (v1.1) (MUST)
- Stage progression:
  `kickoff (architect)` → `implementation (developer)` → `review (critic)` → `test (tester)` → `finalize (architect)`
- Loops:
  - If critic finds blocking issues OR rubric score is below threshold:
    `review (critic)` → `revise (developer)` → `review (critic)`
  - If tester reports failures:
    `test (tester)` → `revise (developer)` → `review (critic)` → `test (tester)` (as needed)
- Loop caps:
  - `maxReviseLoops` default: 5
  - if exceeded:
    - mark acceptance `passed=false`,
    - force `finalize (architect)` with reasons including "loop cap triggered".
8.4 State digest regeneration (MUST)
Digest is regenerated from:
- scenario summary + constraints + acceptance criteria + deliverables,
- latest Architect “System Plan” section (bounded),
- open issues extracted from critic/tester sections (bounded),
- last 3 decisions extracted from “Next Handoff” sections.
Extraction rules must be deterministic and documented (regex-based with stable ordering).
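One way to sketch the deterministic extraction (line-based rather than regex-based, but with the same stable-ordering property). `extractBullets` and `lastDecisions` are hypothetical names, not the spec'd API.

```typescript
// Pull the bullet items from a named "## ..." section; scanning stops at
// the next section heading, so ordering is stable by construction.
function extractBullets(text: string, heading: string): string[] {
  const lines = text.split('\n');
  const start = lines.indexOf(`## ${heading}`);
  if (start === -1) return [];
  const items: string[] = [];
  for (const line of lines.slice(start + 1)) {
    if (line.startsWith('## ')) break; // next section ends the scan
    if (line.startsWith('- ')) items.push(line.slice(2).trim());
  }
  return items;
}

// Digest rule: keep only the last 3 decisions from "Next Handoff".
const lastDecisions = (text: string): string[] =>
  extractBullets(text, 'Next Handoff').slice(-3);
```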
9. Drift detection
9.1 Drift ledger schema
export type DriftLedger = {
flags: DriftFlag[];
maxSeverity: 'none' | 'info' | 'warn' | 'error';
score: number; // weighted sum
};
export type DriftFlag = {
id: string; // stable code
severity: 'info' | 'warn' | 'error';
message: string;
turnId: number;
evidence: string[]; // exact excerpts or rule hits
category: 'contract' | 'role_boundary' | 'constraint' | 'scope' | 'budget' | 'consistency';
};
9.2 Required drift rules (v1.1)
Contract
- Missing required headings / header block
- Invalid run/turn header values (non-matching runId, non-integer turn)
- Unparseable structured sections
Role boundary
- Architect includes large code blocks over `maxCodeBlockChars` → warn
- Developer includes rubric scoring section → warn
- Critic proposes implementing code changes (not critique) → warn
- Tester proposes architecture changes (not test results) → warn
Constraints
- Mentions forbidden actions (scraping, secrets, automation, “I executed code”, etc.)
- Mentions external network calls if constraint forbids.
Scope
- Introduces new deliverables not in scenario deliverables
- Changes language/stack when constraints fix it
Budget
- Excess verbosity: response token estimate exceeds per-turn ceiling (configurable)
- Budget cap exceeded: error
Consistency
- Contradicts prior accepted constraints/decisions (simple text match + hash checks of constraint lists)
9.3 Drift scoring weights (MUST)
Provide a deterministic scoring table in code:
- info = +1
- warn = +5
- error = +20

Plus per-category multipliers:
- contract ×1.0
- constraint ×1.5
- consistency ×1.2
- budget ×1.1
- scope ×1.3
- role_boundary ×1.0
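The weighted-sum rule above reduces to a small deterministic function over the DriftFlag fields from section 9.1. A sketch:

```typescript
type Severity = 'info' | 'warn' | 'error';
type Category = 'contract' | 'role_boundary' | 'constraint' | 'scope' | 'budget' | 'consistency';

// Scoring tables per section 9.3.
const SEVERITY_POINTS: Record<Severity, number> = { info: 1, warn: 5, error: 20 };
const CATEGORY_MULTIPLIER: Record<Category, number> = {
  contract: 1.0,
  constraint: 1.5,
  consistency: 1.2,
  budget: 1.1,
  scope: 1.3,
  role_boundary: 1.0,
};

// Weighted sum: severity points scaled by the flag's category multiplier.
function driftScore(flags: { severity: Severity; category: Category }[]): number {
  return flags.reduce(
    (sum, f) => sum + SEVERITY_POINTS[f.severity] * CATEGORY_MULTIPLIER[f.category],
    0,
  );
}
```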
10. Rubric scoring
10.1 Rubric ledger schema
export type RubricLedger = {
profileId: string;
thresholds: {
overallPassScore: number; // e.g. 80
maxAllowedDriftSeverity: 'warn' | 'error'; // default "warn"
consecutivePassTurns: number; // default 2
};
scores: RubricScore[];
lastTwoPass: boolean;
};
export type RubricScore = {
turnId: number;
role: RoleName;
score: number; // 0..100
breakdown: {
completeness: number; // 0..25
correctnessSignals: number; // 0..25
constraintAdherence: number; // 0..25
clarity: number; // 0..25
};
evidence: string[]; // bounded list
notes: string[];
};
10.2 Deterministic scoring heuristics (MUST)
Each dimension uses deterministic signals:
- Completeness:
- required headings present,
- acceptance criteria referenced (architect + finalize turns),
- deliverables addressed (developer).
- Correctness signals:
- explicit assumptions list present when needed,
- no contradiction flags,
- critic/tester issues include reproduction/evidence.
- Constraint adherence:
- no constraint drift flags,
- no forbidden patterns.
- Clarity:
- headings + bullet lists,
- bounded verbosity,
- actionable steps in “Next Handoff”.
Rubric code MUST output evidence that can be shown in the UI.
11. Budgeting and simulated economics
11.1 Token estimation (MUST)
- `tokenEstimate = ceil(charCount / 4)`
- Track:
- per-prompt and per-response estimates,
- cumulative totals,
- per-role totals.
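The heuristic and the per-role accumulation might look like this sketch; `addUsage` is an illustrative helper, not a spec'd API.

```typescript
// The spec's token heuristic: roughly 4 characters per token, rounded up.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Accumulate usage into a BudgetLedger-shaped object, both the
// cumulative total and the per-role total.
function addUsage(
  ledger: { used: number; usedByRole: Record<string, number> },
  role: string,
  text: string,
): number {
  const t = estimateTokens(text);
  ledger.used += t;
  ledger.usedByRole[role] = (ledger.usedByRole[role] ?? 0) + t;
  return t;
}
```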
11.2 Budget ledger schema
export type BudgetLedger = {
tokenEstimateCap?: number;
used: number;
usedByRole: Record<RoleName, number>;
warnings: { atTurn: number; severity: 'info' | 'warn' | 'error'; message: string }[];
};
11.3 Warning thresholds (MUST)
If cap exists:
- 70% → warn
- 85% → warn
- 100% → error (require explicit “continue anyway” toggle in UI)
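A sketch of the threshold logic, assuming a cap is set. Both the 70% and 85% tiers map to `warn` here; a real implementation would also track which tier has already fired so each warning is emitted once.

```typescript
// Severity the UI should surface for the current budget ratio.
// 100% is an error and requires the explicit "continue anyway" toggle.
function capSeverity(used: number, cap: number): 'none' | 'warn' | 'error' {
  const ratio = used / cap;
  if (ratio >= 1.0) return 'error';
  if (ratio >= 0.7) return 'warn'; // covers both the 70% and 85% tiers
  return 'none';
}
```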
11.4 Cost simulation (MUST)
No real pricing calls. Provide local profile table:
export type PricingProfile = {
profileId: string; // "cheap" | "mid" | "premium"
title: string;
promptPer1kTokensUSD: number;
completionPer1kTokensUSD: number;
};
export type EconomicsLedger = {
pricingProfileId: string;
simulatedCostUSD: number;
costByRoleUSD: Record<RoleName, number>;
costByTurnUSD: Record<number, number>;
costOfDriftUSD: number; // computed as cost of turns after first drift>=warn
};
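The simulated cost math reduces to a per-turn function over the local pricing table; no network calls are involved. A sketch:

```typescript
// Local pricing table entry, matching the PricingProfile shape above.
type Pricing = {
  promptPer1kTokensUSD: number;
  completionPer1kTokensUSD: number;
};

// Cost of one turn given estimated prompt/completion token counts.
function turnCostUSD(p: Pricing, promptTokens: number, completionTokens: number): number {
  return (
    (promptTokens / 1000) * p.promptPer1kTokensUSD +
    (completionTokens / 1000) * p.completionPer1kTokensUSD
  );
}
```

Summing `turnCostUSD` over turns after the first `warn`-or-worse drift flag yields the `costOfDriftUSD` figure.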
12. Failure mode injection
12.1 Injection model (MUST)
Injections are deterministic transformations applied at run creation or mid-run.
export type InjectionEvent = {
injectionId: string; // stable id
appliedAtTurnId: number; // 0 for pre-run
params: Record<string, unknown>;
seed?: number; // if injection uses randomness
description: string;
};
12.2 Required injection types (v1.1)
- `AMBIGUOUS_SPEC`: removes acceptance criteria items or makes one vague.
- `CONFLICTING_CONSTRAINTS`: injects a contradictory constraint pair and forces architect re-anchoring.
- `TRUNCATED_CONTEXT`: engine includes fewer turn summaries in prompt generation.
- `BAD_CRITIC`: simulated critic produces incorrect critique (sim provider only).
- `BUDGET_CRUNCH`: lowers the cap mid-run and forces a recovery strategy.
12.3 Recording and evidence (MUST)
- Every injection must be recorded in `run.injections[]`.
- Drift detection must reference injections where relevant ("this failure was injected").
13. Prompt generation
13.1 Prompt template requirements (MUST)
Prompt templates must be:
- deterministic,
- minimal history,
- always include the current `StateDigest` (bounded),
- explicitly state the role contract format.
13.2 Prompt generation algorithm (MUST)
- Input:
  - scenario definition,
  - current digest,
  - last N turn summaries (default N=2),
  - injections affecting prompts,
  - budget status.
- Output:
  - a single prompt string.

History inclusion MUST be bounded:
- include only:
  - digest,
  - last N summaries (generated deterministically from parsed role sections),
  - open issues list.
13.3 Prompt provenance
Store in each turn:
- `templateId`,
- the included `digestHash` (so later we can prove the prompt was generated from digest X),
- token estimates.
14. Persistence, export, import
14.1 In-browser persistence (v1.1)
Preferred: IndexedDB via a small wrapper (e.g. idb library) to store:
- run list metadata,
- full run artifacts.
Fallback: localStorage for metadata + compressed run JSON (only if small).
Key namespace (MUST):
- `exnulla.orchestrationLab.*`
- include schemaVersion in keys where useful.
14.2 Export format (MUST)
- Export is the canonical `RunArtifact` JSON.
- Additionally export (optional):
  - `transcript.md` (prompt/response pairs),
  - `summary.md` (budgets, rubric, drift, acceptance checklist).
14.3 Import validation (MUST)
Import must:
- validate schemaVersion,
- validate Zod schema,
- recompute derived state and compare to stored derived (deterministic check),
- show any mismatches as “artifact integrity warnings.”
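The import check can be sketched as follows. `recomputeDerived` is a hypothetical stand-in for the core derivation pipeline, and the real implementation would also run the Zod schema validation before the integrity comparison.

```typescript
// Returns artifact integrity warnings per section 14.3. An empty array
// means the import is clean for v1.1.
function integrityWarnings(
  artifact: { schemaVersion: string; derived: unknown },
  recomputeDerived: (a: { schemaVersion: string; derived: unknown }) => unknown,
): string[] {
  const warnings: string[] = [];
  if (artifact.schemaVersion !== '1.1.0') {
    warnings.push(`unsupported schemaVersion: ${artifact.schemaVersion}`);
  }
  // Deterministic check: recomputed derived state must match stored state.
  if (JSON.stringify(recomputeDerived(artifact)) !== JSON.stringify(artifact.derived)) {
    warnings.push('derived state mismatch (artifact integrity warning)');
  }
  return warnings;
}
```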
15. Inspector UI
15.1 Routes (MUST)
- `/` → landing + "New Run" + "Import Run"
- `/runs` → run list
- `/runs/:runId` → run overview (timeline)
- `/runs/:runId/turns/:turnId` → turn detail
- `/runs/:runId/graph` → DAG view
- `/runs/:runId/diff` → diff view (turn-to-turn)
- `/runs/:runId/rubric` → rubric panel
- `/runs/:runId/drift` → drift panel
- `/runs/:runId/injections` → injection panel
- `/meta/version.json` → version endpoint (static)
15.2 Timeline view requirements
- per turn:
- role badge,
- contract status,
- token estimate + cumulative,
- drift severity,
- rubric score,
- links to detail and diff.
15.3 DAG view requirements
- nodes = turns (ordered left-to-right by turnId)
- edges = inferred stage transitions / loops
- node styles:
- contract invalid → highlight
- drift warn/error → highlight
- click node opens turn detail
Implementation:
- use a lightweight graph lib compatible with static builds (e.g. React Flow) OR custom SVG layout.
- determinism requirement:
- graph layout must be stable for a given run (seeded layout if using force algorithms).
15.4 Diff view requirements
Diff options:
- prompt vs prompt (two turns)
- response vs response
- digest vs digest across turns
Implementation:
- use a deterministic diff algorithm (e.g. the `diff` package) and render hunks.
15.5 Paste console requirements
- shows expected role + stage
- shows prompt block (copy button)
- provides paste input area
- validates contract live and shows errors before submission
- submits through `stepEngine({ type: "PASTE_RESPONSE" })`
15.6 Accessibility / iframe constraints
- no reliance on `window.top` control
- all downloads via standard browser download; no popups
- no external fonts required (optional)
16. Simulated provider (optional but REQUIRED for tests)
16.1 Purpose
- Provide deterministic “agent outputs” for:
- unit/integration tests,
- demo mode without ChatGPT UI,
- injecting failure patterns reproducibly.
16.2 SimulationConfig
export type SimulationConfig = {
seed: number; // required
modelPresetByRole: Record<RoleName, 'fast' | 'balanced' | 'thorough'>;
errorRateByRole: Record<RoleName, number>; // 0..1
verbosityByRole: Record<RoleName, number>; // 0..1
};
16.3 Simulation determinism rules
- Use a seeded PRNG (e.g. `seedrandom`) in core.
- Never use `Math.random()` directly.
- All simulated outputs must embed the run/turn header block and required headings.
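A minimal seeded-PRNG stand-in (mulberry32) that satisfies the no-`Math.random()` rule. The spec suggests the `seedrandom` library; this block only illustrates the determinism property the simulation depends on.

```typescript
// mulberry32: small, fast, deterministic 32-bit PRNG. Same seed,
// same sequence, on every run and every platform.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}
```

With this in place, `SimulationConfig.seed` fully determines error injection and verbosity draws, which is what makes the integration tests in section 17 replayable.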
17. Testing plan
17.1 Core unit tests (MUST)
- schema validation (valid + invalid fixtures)
- deterministic hashing + canonical json
- drift rules hit expected evidence
- rubric scoring stable given fixed input
- budget math and warning thresholds
- digest regeneration stable
- transition rules with loop caps
17.2 Integration tests (MUST)
- simulate an entire run with `SimulatedProvider`:
  - with no injections → should pass acceptance,
- with each injection type → should flag drift and/or fail acceptance depending on design.
17.3 UI smoke tests (SHOULD)
- ensure build compiles
- ensure routes render with sample run artifact
18. CI and release hygiene
18.1 GitHub Actions (MUST)
Workflow steps:
- `pnpm install --frozen-lockfile`
- `pnpm lint`
- `pnpm test`
- `pnpm build`
- optional: upload `dist/` as artifact
18.2 Version stamping (MUST)
- `GIT_SHA` injected in CI: `GIT_SHA=${{ github.sha }}`
- `meta/version.json` created during build from env + package version.
19. Docker spec (deterministic build)
19.1 Dockerfile requirements (MUST)
- multi-stage build (build → nginx or dist output)
- uses pnpm with lockfile
- accepts `ARG GIT_SHA`
Example (reference, adjust as needed):
FROM node:20-alpine AS build
WORKDIR /app
ARG GIT_SHA=unknown
ENV VITE_GIT_SHA=$GIT_SHA
COPY package.json pnpm-lock.yaml pnpm-workspace.yaml ./
COPY apps/loc-web/package.json apps/loc-web/package.json
COPY packages/core/package.json packages/core/package.json
COPY packages/scenarios/package.json packages/scenarios/package.json
COPY packages/ui/package.json packages/ui/package.json
RUN corepack enable && corepack prepare pnpm@latest --activate
RUN pnpm install --frozen-lockfile
COPY . .
RUN pnpm build
FROM nginx:alpine AS runtime
COPY --from=build /app/apps/loc-web/dist /usr/share/nginx/html
19.2 Determinism note
Avoid embedding build timestamps unless explicitly excluded from replay checks.
20. Security and safety
20.1 No secrets rule (MUST)
- UI must warn: “Do not paste secrets; this tool stores data locally.”
- Best-effort secret detection (SHOULD):
- regex for common token formats,
- show warning banner; allow user to proceed (do not hard-block in v1.1).
20.2 Content boundaries
- Role templates must forbid:
- claiming to have executed code,
- scraping/automation,
- accessing private systems.
21. Acceptance criteria (v1.1 release gate)
A v1.1.0 release is “done” when all are true:
- New run wizard works end-to-end in Human mode using copy/paste.
- Contract validation triggers and generates repair prompts.
- Drift rules reliably fire on injected failure modes with evidence.
- Inspector explains drift + rubric with clickable evidence.
- Export/import roundtrip works and deterministic replay validation passes.
- Static build runs cleanly and is iframe-safe.
- CI enforces strict TS, lint, tests, build.
- `/meta/version.json` exposes git SHA and schemaVersion.
22. Implementation checklist (file-level)
22.1 packages/core (MUST)
- `src/schema/runArtifact.ts` (types + zod)
- `src/util/canonicalJson.ts` (stable stringify)
- `src/util/hash.ts` (sha256 helpers)
- `src/engine/stepEngine.ts`
- `src/engine/deriveState.ts`
- `src/drift/rules/*.ts`
- `src/scoring/rubric.ts`
- `src/budget/budget.ts`
- `src/providers/humanProvider.ts`
- `src/providers/simulatedProvider.ts`
- `tests/*`
22.2 packages/scenarios (MUST)
- scenario registry + zod input schemas
- injection registry + deterministic transforms
- pricing profiles
22.3 apps/loc-web (MUST)
- run store (IndexedDB wrapper)
- new run wizard
- prompt router + paste console
- inspector routes (timeline, turn detail, graph, diff, rubric, drift, injections)
- export/import UI
22.4 docs (MUST)
- role instruction templates
- runbooks:
  - `DEPLOY.md` (atomic static deploy)
  - `IFRAME.md` (embedding contract and storage namespace)
23. Appendix A — Deterministic runId format
23.1 Format
Use a URL-safe id:
- `orl_<YYYYMMDD>_<hhmmss>_<randBase32>` for human runs (time-based, not determinism-critical), OR
- `orl_<hashPrefix>` for deterministic runs if seed-based.
v1.1 choice (recommended):
- time-based is acceptable because determinism is based on artifact content, not runId.
23.2 Requirement
- runId must be unique within local store.
- export file naming uses runId.
24. Appendix B — UI embed contract (iframe)
24.1 Static hosting assumptions
- all assets served relative to app root
- no service worker required
- no absolute URLs
24.2 Storage namespace
All keys must be prefixed:
exnulla.orchestrationLab.v1.1.0.*
25. Roadmap hooks (v1.2+ / v2+)
25.1 v1.2 (planned)
- CLI validator:
  - `validate-run <file>`
  - `diff-runs <a> <b>`
- cross-run comparison UI
- more scenarios (6+)
25.2 v2 (planned)
- API provider adapters
- optional server runtime for keys (not in browser)
- tool execution hooks (optional)
CI/CD and Verification Model
CI/CD is the external enforcement mechanism for this workflow. It is where subjective process claims become objective pass/fail behavior.
- Lint Required: formatting and static quality discipline are not optional postscript tasks.
- Typecheck Required: the pipeline must prove structural correctness, not just visual plausibility.
- Build Required: a finished pass that does not build is not finished.
- No Tooling Drift: the pipeline should not modify root build systems, lint configs, or monorepo behavior unless that change is explicitly allowed by the spec.
- Verification Before Completion: completion requires a passing verification run, full anchor coverage, assumptions collapsed to zero, and no drift alert.
The point is simple: the repo gates are the truth surface. Any agentic process that bypasses them is theater.
Agentic Development Pipeline
The workflow is deliberately closer to a supervised build engine than a conversational coding assistant. Roles are isolated. Scope is constrained. Output is artifact-based. Verification has veto power.
- Architect Role: defines anchors, non-negotiables, and the architecture plan. Cannot silently implement.
- Tooling Role: confirms CI commands, repo expectations, and allowed configs. Cannot invent new stack decisions.
- Implementation Role: produces code only from the approved architecture tree. Cannot expand scope because a new idea appears attractive mid-pass.
- Verification Role: must halt the pass on drift, missing anchor coverage, or unresolved assumptions.
This role separation is not ceremony. It is what allows the system to scale reasoning without allowing authority to become ambiguous.
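One way to make the Verification role's veto concrete is a pure check that returns halt reasons instead of fixing anything itself. The field names below are illustrative, not the DevKit's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Pass:
    """Minimal model of one pipeline pass (illustrative fields)."""
    anchors: set = field(default_factory=set)       # set by Architect
    implemented: set = field(default_factory=set)   # set by Implementation
    assumptions: list = field(default_factory=list)

def verify(p: Pass) -> list:
    """Verification role: return the reasons a pass must halt (empty = ok).
    It has veto power but no authority to implement fixes itself."""
    reasons = []
    if p.assumptions:
        reasons.append("unresolved assumptions")
    if not p.anchors <= p.implemented:
        reasons.append("missing anchor coverage")
    if not p.implemented <= p.anchors:
        reasons.append("unanchored files (scope drift)")
    return reasons
```

Because verification only reports, authority stays unambiguous: fixing drift is someone else's job, and an empty reason list is the only path forward.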
The hidden advantage is economic as well. By decomposing work into explicit phases, you can route simpler tasks to cheaper model tiers and reserve expensive reasoning for architecture, synthesis, and conflict resolution rather than paying premium rates for every token in the loop.
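A toy cost model shows why phase routing pays off; the tier names, prices, and routing table below are invented placeholders, not real vendor pricing:

```python
# Hypothetical pricing, purely for illustration.
MODEL_TIERS = {
    "cheap": {"usd_per_1k_tokens": 0.0005},
    "premium": {"usd_per_1k_tokens": 0.015},
}

# Route each phase to the cheapest tier that can handle it.
PHASE_ROUTING = {
    "architecture": "premium",   # deep reasoning, conflict resolution
    "synthesis": "premium",
    "tooling_check": "cheap",    # mechanical confirmation work
    "implementation": "cheap",   # constrained by an approved tree
    "formatting": "cheap",
}

def estimated_cost(phase_tokens: dict) -> float:
    """Sum cost across phases given token counts per phase."""
    total = 0.0
    for phase, tokens in phase_tokens.items():
        tier = PHASE_ROUTING[phase]
        total += tokens / 1000 * MODEL_TIERS[tier]["usd_per_1k_tokens"]
    return total
```

Implementation typically consumes far more tokens than architecture, so sending it down-tier is where most of the savings accumulate.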
Human Roles
Human involvement is not a sign that the pipeline is unfinished. It is where system quality actually comes from.
- Inspection
Humans inspect whether the output actually satisfies the real-world intent, not just the literal text of the prompt.
- Quality Control
Humans catch misframed assumptions, strategic mismatches, and low-signal cleverness.
- Infra and Deployment Authority
Humans retain ownership of environment changes, release discipline, secrets handling, and operational boundaries.
- Specification Control
Humans are responsible for tightening the blueprint/spec pair when the process reveals new ambiguity.
In short: the machine accelerates structured work, but the human remains accountable for engineering judgment.
Security and Drift Control
In this operating model, security and drift control are tightly linked. A system that cannot explain why a file exists, why a behavior changed, or where authority came from is both a process problem and a security problem.
- Drift Halt
If assumptions remain, anchors are incomplete, output format is violated, or unanchored files appear, the correct behavior is to stop the pass.
- No Hidden Globals
Environment requirements must be explicit. Hidden state makes both reproducibility and security worse.
- Scope-Constrained Output
The implementation agent should not expand capability beyond what the architecture already justified.
- OS-Neutral, Repo-Relative Behavior
Process portability matters. Repo-relative paths and explicit assumptions reduce accidental environmental coupling.
- Artifact Traceability
Every meaningful output should map back to a phase, a purpose, and a source authority.
The shortest summary is this: a safe agentic pipeline is not one that “does more.” It is one that fails visibly, explains itself, and refuses to outrun its own specification.
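As a closing sketch, artifact traceability plus visible failure can be modeled as an admission check that raises rather than continuing silently; all names below are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    """Every meaningful output carries its provenance (illustrative fields)."""
    path: str
    phase: str       # which pipeline phase produced it
    purpose: str     # why it exists
    authority: str   # which spec source justified it

class DriftHalt(Exception):
    """Raised to stop the pass visibly instead of outrunning the spec."""

def admit(artifact: Artifact, allowed_authorities: set) -> Artifact:
    """Refuse any artifact that cannot explain where its authority came from."""
    if artifact.authority not in allowed_authorities:
        raise DriftHalt(f"unanchored artifact: {artifact.path}")
    return artifact
```

The point of raising instead of logging is that an unexplainable artifact is a halt condition, not a warning to scroll past.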