Projects

Four shipped agent systems - built solo, measured, and reproduced.

Each project is a real system with real numbers. I design the problem, ship the implementation, instrument the runs, and write up what worked and what didn't. Click any tab to see the full problem, motivation, and contribution walkthrough.

01 OC1 · Agent Safety | 02 OC2 · On-Chain Debate | 03 MVF · Stablecoin (ICBC 2026) | 04 NOC · SNARK Oracle | 05 Web3 Agent Platform
01 · Agent Platform

OC1: Agent Safety Control System

A multi-provider agent stack with a formally verified safety circuit.

Status: Shipped · production-grade
Stack: Python · FastAPI · TLA+ · Solidity
Scale: 18,400+ lines · 7 contracts · 14 test suites
JD fit: Agent Harness · eval · trace · HITL

Problem

Frontier model calls fail in three ways the demos never show: provider outage, prompt injection, and silent policy drift. A chatbot demo doesn't need to handle any of them. A production agent system has to fail safe under pressure, not just fail loud.

Motivation

I wanted a chassis I could put real money and real users behind. That meant a kill-circuit I could prove safe, not one I hoped was safe, and a stack where every external call is observable and reversible.

What I built

Multi-provider LLM orchestration across OpenAI, Anthropic, and local Ollama, with auto-failover, content-hashed disk caching, and per-call cost, latency, and token telemetry. A TLA+ safety state machine gates every action. Prompt-injection detection runs on the input path. An EVM execution layer handles on-chain side effects. A RAG policy oracle grounds the agent in our own documents.
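The failover and caching described above can be sketched in a few lines. This is a minimal illustration, not the OC1 code: `FailoverClient`, the `Provider` callable signature, and the in-memory dictionary standing in for the disk cache are all assumptions, and only latency is recorded here (cost and token counts would be added the same way).

```python
import hashlib
import json
import time
from typing import Callable

# Hypothetical provider signature: takes a prompt, returns text, may raise.
Provider = Callable[[str], str]


class FailoverClient:
    """Tries providers in order and caches replies under a hash of the request."""

    def __init__(self, providers: dict[str, Provider]):
        self.providers = providers        # insertion order is the failover order
        self.cache: dict[str, str] = {}   # in-memory stand-in for a disk cache
        self.telemetry: list[dict] = []   # one record per provider attempt

    @staticmethod
    def cache_key(prompt: str, params: dict) -> str:
        # Canonical JSON so that equal requests always hash to the same key.
        blob = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def complete(self, prompt: str, **params) -> str:
        key = self.cache_key(prompt, params)
        if key in self.cache:
            return self.cache[key]        # cache hit: no provider is called
        errors = []
        for name, call in self.providers.items():
            start = time.perf_counter()
            try:
                reply = call(prompt)
            except Exception as exc:      # outage, timeout, rate limit, ...
                errors.append(f"{name}: {exc}")
                self.telemetry.append({"provider": name, "ok": False,
                                       "latency_s": time.perf_counter() - start})
                continue                  # fall through to the next provider
            self.telemetry.append({"provider": name, "ok": True,
                                   "latency_s": time.perf_counter() - start})
            self.cache[key] = reply
            return reply
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Keying the cache on a hash of the canonicalized request means a repeated call is served without touching any provider, which is also what makes replays deterministic.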

My contribution

End to end. Architecture, TLA+ spec, agent code, Solidity contracts, eval framework, and the one-command pipeline that runs the whole battery. Solo: 18,400+ lines of Python, 7 Solidity contracts, 14 test suites, 5,866,037 TLA+ states model-checked with zero violations.
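To make "model-checked with zero violations" concrete, here is a toy version of the idea in Python: enumerate every reachable state of a small kill-circuit and check a safety invariant in each one. The states, transitions, and invariant below are hypothetical and far smaller than the real spec; the actual checking was done on the TLA+ spec, not with this code.

```python
from collections import deque

# A state is (circuit, action). Hypothetical toy model, not the OC1 spec.
State = tuple[str, str]


def next_states(state: State) -> list[State]:
    """All states reachable from `state` in one step."""
    circuit, action = state
    out: list[State] = []
    if action == "idle":
        out.append((circuit, "requested"))       # agent requests an action
    if action == "requested":
        if circuit == "closed":
            out.append((circuit, "executing"))   # gate approves
        else:
            out.append((circuit, "idle"))        # gate rejects while tripped
    if action == "executing":
        out.append((circuit, "idle"))            # action finishes
    if circuit == "closed":
        # Tripping the circuit aborts any in-flight action in the same step.
        out.append(("open", "idle" if action == "executing" else action))
    if circuit == "open" and action == "idle":
        out.append(("closed", "idle"))           # human reset
    return out


def safe(state: State) -> bool:
    """Invariant: nothing executes while the circuit is tripped."""
    return state != ("open", "executing")


def explore(initial: State) -> tuple[set[State], list[State]]:
    """Breadth-first search over reachable states, collecting violations."""
    seen = {initial}
    queue = deque([initial])
    violations = []
    while queue:
        state = queue.popleft()
        if not safe(state):
            violations.append(state)
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, violations


reachable, violations = explore(("closed", "idle"))
print(len(reachable), "states,", len(violations), "violations")
```

The value of exhaustive exploration is that the invariant is checked in every reachable state, not only in the ones a test author thought to write down.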

3.9s · P95 end-to-end latency
<5ms · Safety gating overhead
5.86M · TLA+ states, 0 violations
100/100 · RL red-team attacks survived

Safety vs. speed: the kill-circuit overhead

Source: OC1 load test, 1,000 mixed-action requests
[Bar chart: P95 latency, axis 0 to 10s. With safety circuit: 3.9s · Without (baseline): 2.95s · Overhead: +0.95s, acceptable]

What it means

The safety circuit costs under a second per call and is provably correct, so I pay that cost in latency, not in incidents. The alternative, hoping the LLM behaves, doesn't show up on a P95 chart but it shows up in a postmortem.


Why this matters for a JD

OC1 is the closest thing I have to a textbook Agent Harness: tool calling, eval, trace, observability, human-in-the-loop checkpoints, and a sandbox. The TLA+ kill-circuit is the part most teams don't bother with, and the part that turns a demo into a system you can put in front of users.

Multi-provider failover · TLA+ model checking · Prompt-injection F1 0.765 · Q-learning red team · EVM execution · RAG policy oracle · Content-hashed cache · Per-call telemetry
02 · Web3 Chatbot · DeFi

OC2: On-Chain Multi-Agent Debate

Three LLM roles argue, then the verdict is enforced on-chain.

Status: Shipped · 76 Foundry tests passing
Stack: Solidity 0.8.23 · Foundry · ECDSA · EVM
Architecture: Proposer · Challenger · Judge
JD fit: Multi-agent · tool calling · eval · trace

Problem

DeFi actions are usually gated by a single signer or a small multisig, which means a single bad call or a single compromised key can drain a protocol. Adding a second human doesn't help if the second human just rubber-stamps the first.

Motivation

The idea that makes peer review stronger than a single review (an independent challenger forces the proposer to defend the proposal) can also make an autonomous agent safer than a single LLM call. The catch is that you need cryptographic receipts, not vibes, or it's theater.

What I built

A 3-role debate pattern. The Proposer drafts a DeFi action, the Challenger attacks it, and the Judge commits a verdict. All three sign the transcript. The full debate is hashed and posted on-chain, and a Foundry-tested Solidity stack (DebateRegistry, JudgeCommitment, SlashingPool, EmergencyStop) gates execution. If anything looks wrong, EmergencyStop halts the action and a human takes over.
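The receipt logic above (hash the transcript, require all three signatures, then respect the Judge's verdict) can be sketched off-chain in Python. Two loud assumptions: HMAC-SHA256 stands in for ECDSA and SHA-256 stands in for whatever hash the contracts use, purely so the example runs on the standard library; the function names and transcript layout are hypothetical.

```python
import hashlib
import hmac
import json

ROLES = ("proposer", "challenger", "judge")


def transcript_hash(transcript: list[dict]) -> str:
    """Hash a canonical JSON encoding so every role signs the same bytes."""
    blob = json.dumps(transcript, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def sign(key: bytes, digest: str) -> str:
    # Stand-in for an ECDSA signature; a real system would use key pairs.
    return hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()


def may_execute(transcript: list[dict],
                signatures: dict[str, str],
                keys: dict[str, bytes]) -> bool:
    """Allow execution only if all roles signed and the judge approved."""
    if not transcript:
        return False
    digest = transcript_hash(transcript)
    for role in ROLES:
        expected = sign(keys[role], digest)
        if not hmac.compare_digest(expected, signatures.get(role, "")):
            return False                  # missing or invalid signature
    final = transcript[-1]
    return final.get("role") == "judge" and final.get("verdict") == "approve"
```

Because every signature covers the hash of the whole transcript, editing any turn after the fact invalidates all three signatures at once.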

My contribution

End to end again. The pattern, the prompt engineering, the agent code, all four Solidity contracts, the ECDSA verification, the transcript hashing, the 76 Foundry tests, and the gas benchmarks. The interesting part was making the on-chain cost 8.7% lower than the Optimism fault-proof baseline while keeping a cryptographically verifiable audit trail.

90.2% · Judge verdict accuracy (CI [87.6, 92.8])
95% · Expert agreement on real on-chain actions
16.8s · End-to-end debate latency
-8.7% · Gas vs. Optimism fault-proof baseline

Proposer / Challenger / Judge flow with checkpoints

OC2 design, 4 Solidity contracts in production
[Flow diagram: PROPOSER (drafts action) → action → CHALLENGER (attacks the plan) → critique → JUDGE (commits verdict) → approve: EXECUTE · flag: HUMAN REVIEW / EMERGENCY STOP. On-chain: ECDSA + transcript hash.]
Foundry test suite · 76 tests: unit (28) · adversarial (24) · gas benchmarks (12) · fuzz (12). All four contracts verified; ECDSA + transcript hash enforced on-chain.

Why this matters for a JD

OC2 is the multi-agent harness pattern in its production form. Proposer / Challenger / Judge maps cleanly onto the three roles every agent platform needs (planner, critic, gatekeeper), and the on-chain receipts mean the eval is external to the LLM, not just an LLM grading itself.

Foundry tests · ECDSA verification · Transcript hashing · Slashing pool · Emergency stop · Multi-agent eval · Gas-optimized
03 · Stablecoin · IEEE ICBC 2026

MVF-Composer: Stablecoin Reserve Controller

A 12-agent rescue system for a stablecoin's peg, peer-reviewed and presented at IEEE ICBC 2026.

Status: Accepted & presented at IEEE ICBC 2026
Stack: Stress Harness · trust-weighted MVF · 3 LLM providers
Scale: 12 agents · 1,200 sims · 12,500 lines
JD fit: Web3 risk · multi-agent eval · adversarial robustness

Problem

Stablecoin reserve controllers calibrate on calm-period returns. Under stress, the calm-period covariance is off by a factor of 7.17 (the “2020 Omission”), so the allocation that looks optimal is precisely the most fragile one. The result is the kind of peg collapse seen in March 2020 and March 2023.

Approach

12 LLM agents across 4 archetypes (trader, liquidity provider, arbitrageur, attacker) act inside a Stress Harness that injects shocks at t=30. A trust score T(a) down-weights coordinated or peg-destabilizing behavior, and a constrained mean-variance optimizer rebalances reserves against the stress-augmented covariance.
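A small worked example shows why the covariance choice matters. The sketch below blends a calm and a stressed covariance matrix and solves a two-asset, long-only minimum-variance allocation in closed form. Everything here is an assumption for illustration: the blend weight `lam`, the linear blend itself, the sample matrices, and the use of minimum variance as a simplified stand-in for the full constrained mean-variance optimizer.

```python
Matrix = list[list[float]]


def blend_covariance(calm: Matrix, stress: Matrix, lam: float) -> Matrix:
    """Linear blend: (1 - lam) * calm + lam * stress, element by element."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return [[(1.0 - lam) * c + lam * s for c, s in zip(row_c, row_s)]
            for row_c, row_s in zip(calm, stress)]


def min_variance_weights(cov: Matrix) -> tuple[float, float]:
    """Two-asset minimum-variance weights, clamped to the long-only range."""
    (v1, c12), (_, v2) = cov
    denom = v1 + v2 - 2.0 * c12
    if denom <= 0.0:
        raise ValueError("degenerate covariance")
    w1 = (v2 - c12) / denom          # unconstrained closed-form solution
    w1 = min(1.0, max(0.0, w1))      # long-only constraint
    return w1, 1.0 - w1


# Hypothetical numbers: asset 1 looks safe in calm markets but not under stress.
calm = [[0.01, 0.0], [0.0, 0.04]]
stress = [[0.09, 0.0], [0.0, 0.04]]

w_calm = min_variance_weights(calm)
w_blend = min_variance_weights(blend_covariance(calm, stress, lam=0.5))
print("calm-only weights:", w_calm)
print("stress-blended weights:", w_blend)
```

With these made-up inputs the calm-only allocation puts most of the reserve into the asset that deteriorates under stress, and the blended covariance moves weight away from it.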

What I built

12 concurrent LLM agents (5 traders, 3 LPs, 2 arbitrageurs, 2 attackers) across OpenAI, Anthropic, and DeepSeek, with Pydantic-validated outputs feeding a trust-weighted risk-state and a constrained mean-variance optimizer. 1,200 seeded Black-Thursday simulations on commodity hardware (~47-99s/epoch).
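The validated-output and trust-weighting steps can be pictured as follows. This is a sketch under assumptions: a frozen dataclass with manual checks stands in for the Pydantic models, the `risk` field and the weighted-average formula are hypothetical, and the trust scores are supplied by hand rather than computed.

```python
from dataclasses import dataclass

ARCHETYPES = {"trader", "liquidity provider", "arbitrageur", "attacker"}


@dataclass(frozen=True)
class AgentSignal:
    """One agent's validated output (dataclass stand-in for a Pydantic model)."""
    agent_id: str
    archetype: str
    risk: float    # hypothetical risk reading in [0, 1]
    trust: float   # trust score T(a) in [0, 1]

    def __post_init__(self):
        if self.archetype not in ARCHETYPES:
            raise ValueError(f"unknown archetype: {self.archetype}")
        for name in ("risk", "trust"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def trust_weighted_risk(signals: list[AgentSignal]) -> float:
    """Weighted mean of risk readings, with T(a) as the weight."""
    total_trust = sum(s.trust for s in signals)
    if total_trust == 0.0:
        raise ValueError("no trusted signals to aggregate")
    return sum(s.trust * s.risk for s in signals) / total_trust


signals = [
    AgentSignal("t1", "trader", risk=0.2, trust=0.9),
    AgentSignal("x1", "attacker", risk=1.0, trust=0.1),
]
print(round(trust_weighted_risk(signals), 2))  # 0.28, versus a plain mean of 0.6
```

Rejecting malformed outputs at the boundary keeps a single misbehaving model from corrupting the aggregate, and the trust weight limits how far a low-trust agent can pull it.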

My contribution

Single-author paper, single-engineer system: Stress Harness, trust-weighted aggregation, the stress-augmented covariance blend, and the 1,200 reproducible simulations. ~12,500 lines of typed Python across 46 modules. Each run records seed, commit, and timestamp.

57% · Peak peg deviation cut
3.1× · Faster crisis recovery
12 · Concurrent agents
1,200 · Reproducible simulations

Black Thursday replay: MVF-Composer vs. industry baseline (SAS)

1,200 sims, median peg-deviation trace, seed-controlled
[Line chart: post-shock peg deviation (0% to 8%) vs. time step. Shock at t=30; 1% recovery line; MVF-Composer peak 3.2%. Series: MVF-Composer, SAS (industry baseline)]

Reading the chart

Both lines sit near peg until a shock at t=30, when SAS (the industry baseline) jumps to a 7.4% deviation and slowly recovers; MVF-Composer peaks at 3.2% and crosses the 1% recovery line ~3.1× faster (14 vs 44 time steps). Median over 1,200 seeded runs.
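The two headline figures follow directly from the numbers quoted above. The short check below only redoes that arithmetic; the variable names are mine.

```python
# Values quoted in the chart description above.
sas_peak, mvf_peak = 7.4, 3.2      # peak peg deviation, percent
sas_steps, mvf_steps = 44, 14      # time steps to cross the 1% recovery line

peak_cut = (sas_peak - mvf_peak) / sas_peak   # relative reduction in peak deviation
speedup = sas_steps / mvf_steps               # ratio of recovery times

print(f"peak deviation cut: {peak_cut:.0%}")  # 57%
print(f"recovery speedup: {speedup:.1f}x")    # 3.1x
```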


Why this matters

MVF-Composer packages the multi-agent + trust-weighted aggregation + mean-variance optimization pattern in a setting with real money and real regulators watching. The IEEE peer review is the part that holds the numbers up under independent scrutiny.

Stress Harness · Trust-weighted aggregation · Mean-variance optimizer · Multi-provider LLM · Pydantic v2 contracts · Seed-controlled · IEEE ICBC 2026
04 · Oracle · Cryptography

NOC: Cryptographically Verifiable Oracle

On-chain SNARK verification at a fraction of incumbent cost.

Status: Shipped · 76 tests + 256-run fuzz
Stack: Solidity 0.8.23 · Groth16 · BN254
Byzantine: 100% detection up to 37.5% adversary
JD fit: Web3 infrastructure · security

Problem

Existing oracle designs either trust a small committee (cheap, fragile) or run a heavy consensus (expensive, slow). The committee design has been exploited more than once, and the consensus design prices out the use cases that need an update every block.

Motivation

Cryptographic proofs let you skip the trust assumption entirely. The verifier runs on-chain, the proof is cheap, and the only way to lie is to break the curve. The open question was whether you could ship a production-ready system on top of that idea at L2 cost.

What I built

A Solidity 0.8.23 oracle with on-chain Groth16 / BN254 verification, plus a staking, slashing, and reputation-weighted consensus layer to handle the cases where proofs aren't available. 76 Foundry tests pass, including a 256-run fuzz suite and gas benchmarks. Byzantine detection is 100% up to a 37.5% adversary fraction.
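The fallback consensus layer can be illustrated with a reputation-weighted median that also flags outlying reporters. This is a toy, off-chain sketch in Python, not the Solidity logic: the function names, the 1% tolerance, and the eight-reporter example (three adversaries, i.e. 37.5%) are assumptions, and it does not reproduce the measured detection results.

```python
def weighted_median(reports: dict[str, float],
                    reputation: dict[str, float]) -> float:
    """Value at which cumulative reputation first reaches half the total."""
    ordered = sorted(reports.items(), key=lambda item: item[1])
    half = sum(reputation[name] for name in reports) / 2.0
    running = 0.0
    for name, value in ordered:
        running += reputation[name]
        if running >= half:
            return value
    raise ValueError("no reports")


def flag_outliers(reports: dict[str, float],
                  reputation: dict[str, float],
                  tolerance: float = 0.01) -> set[str]:
    """Reporters whose value deviates from the median by more than `tolerance`."""
    median = weighted_median(reports, reputation)
    return {name for name, value in reports.items()
            if abs(value - median) > tolerance * abs(median)}


# Hypothetical round: 5 honest reporters near 100, 3 adversaries reporting 200.
reports = {"h1": 100.0, "h2": 100.1, "h3": 99.9, "h4": 100.05, "h5": 99.95,
           "a1": 200.0, "a2": 200.0, "a3": 200.0}
reputation = {name: 1.0 for name in reports}

print(weighted_median(reports, reputation))
print(sorted(flag_outliers(reports, reputation)))
```

As long as honest reporters hold more than half of the total reputation, the median stays inside the honest cluster, so the adversarial reports are the ones that get flagged for slashing.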

My contribution

End to end. The Solidity contracts, the Groth16 verifier integration, the staking and slashing logic, the reputation model, the test harness, and the gas benchmarks. NOC is about 21× cheaper per update than incumbent committee oracles on L2, which is what makes it usable in production instead of remaining just a paper.

$0.04 · Cost per update (L2)
21× · Cheaper vs. committee baseline
100% · Byzantine detection (up to 37.5%)
76 · Tests · 256 fuzz runs

Cost per oracle update (L2): NOC vs. committee baseline

L2 gas snapshot, 30 gwei, ETH $3,200
[Bar chart: cost per update in USD, axis $0 to $1.00. NOC (Groth16): $0.04 · Committee baseline: $0.84 · 21× cheaper]

What it means

At L2 cost, a single NOC update is around four cents. The same update from a committee oracle runs around 84 cents. That gap is the difference between an oracle you can call every block and an oracle you only call when you absolutely have to.


Why this matters for a JD

NOC is the Web3 infrastructure piece in my portfolio. It shows I can ship cryptographic primitives to production and that I understand the cost / trust tradeoff well enough to make the call on which to use where. Most agent platform teams have at least one oracle question, and the answer is rarely "just use Chainlink."

Groth16 on-chain · BN254 pairing · Staking & slashing · Reputation model · Fuzz tested · Gas optimized
In progress · demo on request

Web3 Agent Platform

A production-style agent platform built against the Agent Platform / Web3 Chatbot JD. Currently in final integration. Trace / replay, human-in-the-loop checkpoints, tool registry, and eval framework are wired up and demonstrable end to end on a private deploy. Happy to share a live walkthrough on request.

Built so far

  • Tool registry and policy-gated tool calling (per-tool permission scopes)
  • Trace / replay viewer with full LLM call and tool I/O capture
  • Human-in-the-loop checkpoints at action boundaries (approve / edit / reject)
  • Eval framework with seeded regression suite for prompt and tool changes
  • FastAPI on Cloud Run, Pydantic v2 contracts, asyncio pipeline
  • LangGraph orchestrator with proposer / challenger / judge role support
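The first three items in the list above (permission scopes, trace capture, and approval checkpoints) fit together roughly as follows. This is a minimal sketch, not the platform's code: `Tool`, `ToolRegistry`, the scope strings, and the `approve` callback are hypothetical names, and the checkpoint models only approve / reject, leaving out the edit path.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tool:
    name: str
    func: Callable[..., object]
    scope: str                    # permission the caller must hold
    needs_approval: bool = False  # True puts a human checkpoint before the call


class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}
        self.trace: list[dict] = []   # every attempt is recorded for replay

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, granted_scopes: set[str],
             approve: Optional[Callable[[str, dict], bool]] = None,
             **kwargs) -> object:
        tool = self._tools[name]
        if tool.scope not in granted_scopes:
            self.trace.append({"tool": name, "args": kwargs, "outcome": "denied"})
            raise PermissionError(f"missing scope {tool.scope!r} for {name}")
        if tool.needs_approval and not (approve and approve(name, kwargs)):
            self.trace.append({"tool": name, "args": kwargs, "outcome": "rejected"})
            raise PermissionError(f"{name} was not approved by a reviewer")
        result = tool.func(**kwargs)
        self.trace.append({"tool": name, "args": kwargs,
                           "outcome": "ok", "result": result})
        return result


registry = ToolRegistry()
registry.register(Tool("get_balance", lambda account: 42, scope="chain:read"))
registry.register(Tool("transfer", lambda to, amount: f"sent {amount} to {to}",
                       scope="chain:write", needs_approval=True))
```

Recording denied and rejected attempts alongside successful calls keeps the trace useful for replay, since the interesting failures are often the calls that never ran.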

Designed for the JD

  • Capability layering, module boundaries, and a permission model as first-class concerns
  • Agent execution: planning, tool calling, context, result validation, retry, state recovery
  • Harness-level capabilities: tool calls, workflow engine, sandbox, eval, trace, MCP-friendly I/O
  • SDD-friendly: every change is spec, test, then code, with diff-scoped review

Why it is in progress, not shipped

The chassis is up. The last 20% is the part where I run it against real workloads with real users and find the failure modes I haven't thought of yet. I'd rather call it in progress than call it done and hope.

LangGraph · FastAPI · Cloud Run · Pydantic v2 · Trace / replay · Human-in-the-loop · Eval framework · SDD workflow

Want the live demo or the code?

Code is shared on request under NDA. The Web3 Agent Platform walkthrough is also on request, with a short Loom if a live session doesn't fit the schedule. I read every message.