Projects · Shengwei You

01 · Agent Platform

OC1: Agent Safety Control System

A multi-provider agent stack with a formally verified safety circuit.

Status: Shipped · production-grade

Stack: Python · FastAPI · TLA+ · Solidity

Scale: 18,400+ lines · 7 contracts · 14 test suites

JD fit: Agent Harness · eval · trace · HITL

Problem

Frontier model calls fail in three ways the demos never show: provider outages, prompt injection, and silent policy drift. A chatbot demo doesn't need to handle any of them. A production agent system has to fail safe under pressure, not just fail loud.

Motivation

I wanted a chassis I could put real money and real users behind. That meant a kill-circuit I could prove safe, not one I hoped was safe, and a stack where every external call is observable and reversible.

What I built

Multi-provider LLM orchestration across OpenAI, Anthropic, and local Ollama, with auto-failover, content-hashed disk caching, and per-call cost, latency, and token telemetry. A TLA+ safety state machine gates every action. Prompt-injection detection runs on the input path. An EVM execution layer handles on-chain side effects. A RAG policy oracle grounds the agent in our own documents.
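A minimal sketch of the failover-plus-cache idea, not the OC1 code itself. It assumes each provider is a plain callable that takes a prompt and returns text; the names `call_with_failover` and `cache_key`, and the JSON record layout, are hypothetical and chosen for illustration only.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable

# Hypothetical provider signature: prompt in, completion text out.
Provider = Callable[[str], str]


def cache_key(provider_names: list[str], prompt: str) -> str:
    """Content hash of the request, so identical requests share one cache file."""
    payload = json.dumps({"providers": provider_names, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def call_with_failover(
    prompt: str, providers: list[tuple[str, Provider]], cache_dir: Path
) -> dict:
    """Try providers in order; cache the first success on disk with basic telemetry."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (cache_key([name for name, _ in providers], prompt) + ".json")
    if path.exists():
        cached = json.loads(path.read_text())
        cached["cache_hit"] = True
        return cached

    errors: list[str] = []
    for name, call in providers:
        start = time.perf_counter()
        try:
            text = call(prompt)
        except Exception as exc:  # any provider failure triggers failover
            errors.append(f"{name}: {exc}")
            continue
        record = {
            "provider": name,
            "text": text,
            "latency_s": time.perf_counter() - start,
            "cache_hit": False,
            "errors": errors,
        }
        path.write_text(json.dumps(record))
        return record
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

A real stack would also record token counts and cost per call; those depend on each provider's response format and are left out here.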

My contribution

End to end. Architecture, TLA+ spec, agent code, Solidity contracts, eval framework, and the one-command pipeline that runs the whole battery. Solo: 18,400+ lines of Python, 7 Solidity contracts, 14 test suites, 5,866,037 TLA+ states model-checked with zero violations.

3.9s · P95 end-to-end latency
<5ms · Safety gating overhead
5.86M · TLA+ states, 0 violations
100/100 · RL red-team attacks survived

Safety vs. speed: the kill-circuit overhead

Source: OC1 load test, 1,000 mixed-action requests

What it means

The safety circuit adds under 5 ms of gating overhead per call and its state machine is model-checked with zero violations, so I pay that cost in latency, not in incidents. The alternative, hoping the LLM behaves, doesn't show up on a P95 chart, but it does show up in a postmortem.

Legend: With safety · Baseline
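To make the kill-circuit idea concrete, here is a small Python sketch of a safety gate as a state machine. It is an illustration under assumptions, not the TLA+ spec from OC1: the three states, the action names, and the transition rules are all hypothetical. The property it illustrates is the kind a model checker can verify exhaustively: once halted, no action is permitted until a human resets the gate.

```python
from enum import Enum


class SafetyState(Enum):
    NORMAL = "normal"
    RESTRICTED = "restricted"
    HALTED = "halted"


# Hypothetical policy: which actions each state allows.
ALLOWED = {
    SafetyState.NORMAL: {"read", "write", "transfer"},
    SafetyState.RESTRICTED: {"read"},
    SafetyState.HALTED: set(),
}


class SafetyGate:
    """Every agent action must pass permits() before it executes."""

    def __init__(self) -> None:
        self.state = SafetyState.NORMAL

    def report_anomaly(self, severe: bool) -> None:
        # HALTED is absorbing: anomalies can never move the gate out of it.
        if self.state is SafetyState.HALTED:
            return
        self.state = SafetyState.HALTED if severe else SafetyState.RESTRICTED

    def human_reset(self) -> None:
        # The only transition out of HALTED, taken by a human operator.
        self.state = SafetyState.NORMAL

    def permits(self, action: str) -> bool:
        return action in ALLOWED[self.state]
```

The check itself is a set lookup, which is why a gate of this shape can sit on every call at negligible cost.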

Why this matters for a JD

OC1 is the closest thing I have to a textbook Agent Harness: tool calling, eval, trace, observability, human-in-the-loop checkpoints, and a sandbox. The TLA+ kill-circuit is the part most teams don't bother with, and the part that turns a demo into a system you can put in front of users.

Multi-provider failover · TLA+ model checking · Prompt-injection F1 0.765 · Q-learning red team · EVM execution · RAG policy oracle · Content-hashed cache · Per-call telemetry

02 · Web3 Chatbot · DeFi

OC2: On-Chain Multi-Agent Debate

Three LLM roles argue, then the verdict is enforced on-chain.

Status: Shipped · 76 Foundry tests passing

Stack: Solidity 0.8.23 · Foundry · ECDSA · EVM

Architecture: Proposer · Challenger · Judge

JD fit: Multi-agent · tool calling · eval · trace

Problem

DeFi actions are usually gated by a single signer or a small multisig, which means a single bad call or a single compromised key can drain a protocol. Adding a second human doesn't help if the second human just rubber-stamps the first.

Motivation

Peer review is stronger than a single review because an independent challenger forces the proposer to defend the proposal. The same idea can make an autonomous agent safer than a single LLM call. The catch is that you need cryptographic receipts, not vibes, or it's theater.

What I built

A 3-role debate pattern. The Proposer drafts a DeFi action, the Challenger attacks it, and the Judge commits a verdict. All three sign the transcript. The full debate is hashed and posted on-chain, and a Foundry-tested Solidity stack (DebateRegistry, JudgeCommitment, SlashingPool, EmergencyStop) gates execution. If anything looks wrong, EmergencyStop halts the action and a human takes over.
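The hash-then-gate step can be sketched off-chain in a few lines. This is an assumption-laden illustration, not the OC2 contracts: SHA-256 stands in for the hash (an EVM contract would typically use keccak256), real ECDSA signature verification is omitted, and `signed_digests` simply maps each role to the digest it attests to. The function names are hypothetical.

```python
import hashlib
import json

ROLES = ("proposer", "challenger", "judge")


def transcript_digest(turns: list[dict]) -> str:
    """Hash a canonical JSON encoding so key order cannot change the digest."""
    canonical = json.dumps(turns, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ready_to_execute(
    digest: str, signed_digests: dict[str, str], emergency_stop: bool
) -> bool:
    """Execute only if all three roles attest to the same transcript and no stop is set."""
    if emergency_stop:
        return False
    return all(signed_digests.get(role) == digest for role in ROLES)
```

Canonical encoding matters here: if two parties serialize the same debate differently, their digests diverge and the gate refuses to execute, which is the safe failure direction.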

My contribution

End to end again. The pattern, the prompt engineering, the agent code, all four Solidity contracts, the ECDSA verification, the transcript hashing, the 76 Foundry tests, and the gas benchmarks. The interesting part was making the on-chain cost 8.7% lower than the Optimism fault-proof baseline while keeping a cryptographically verifiable audit trail.

90.2% · Judge verdict accuracy (CI [87.6, 92.8])
95% · Expert agreement on real on-chain actions
16.8s · End-to-end debate latency
-8.7% · Gas vs. Optimism fault-proof baseline

Proposer / Challenger / Judge flow with checkpoints

OC2 design, 4 Solidity contracts in production

Why this matters for a JD

OC2 is the multi-agent harness pattern in its production form. Proposer / Challenger / Judge maps cleanly onto the three roles every agent platform needs (planner, critic, gatekeeper), and the on-chain receipts mean the eval is external to the LLM, not just an LLM grading itself.

Foundry tests · ECDSA verification · Transcript hashing · Slashing pool · Emergency stop · Multi-agent eval · Gas-optimized

03 · Stablecoin · IEEE ICBC 2026

MVF-Composer: Stablecoin Reserve Controller

A 12-agent rescue system for a stablecoin's peg, peer-reviewed and presented at IEEE ICBC 2026.

Status: Accepted & presented at IEEE ICBC 2026

Stack: Stress Harness · trust-weighted MVF · 3 LLM providers

Scale: 12 agents · 1,200 sims · 12,500 lines

JD fit: Web3 risk · multi-agent eval · adversarial robustness

Problem

Stablecoin reserve controllers calibrate their covariance estimates on calm-period returns. Under stress, those estimates are off by a factor of 7.17 (the “2020 Omission”), so the nominally optimal allocation is precisely the most fragile one. The result is the kind of peg collapse seen in March 2020 and March 2023.

Approach

12 LLM agents across 4 archetypes (trader, liquidity provider, arbitrageur, attacker) act inside a Stress Harness that injects shocks at t=30. A trust score T(a) down-weights coordinated or peg-destabilizing behavior, and a constrained mean-variance optimizer rebalances reserves against the stress-augmented covariance.
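The two aggregation steps can be sketched in plain Python. This is one plausible reading, not the formulas from the paper: it assumes the trust score T(a) is used as a normalized weight on each agent's signal, and that the stress-augmented covariance is a convex blend of a calm-period and a stress-period matrix with a hypothetical mixing parameter `lam`.

```python
def trust_weighted_signal(signals: dict[str, float], trust: dict[str, float]) -> float:
    """Average agent signals, weighting each agent a by its trust score T(a)."""
    total = sum(trust[agent] for agent in signals)
    if total <= 0:
        raise ValueError("total trust must be positive")
    return sum(trust[agent] * value for agent, value in signals.items()) / total


def blend_covariance(
    calm: list[list[float]], stress: list[list[float]], lam: float
) -> list[list[float]]:
    """Convex blend (1 - lam) * calm + lam * stress, element by element."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return [
        [(1.0 - lam) * c + lam * s for c, s in zip(calm_row, stress_row)]
        for calm_row, stress_row in zip(calm, stress)
    ]
```

With these weights, a low-trust attacker pushing a strongly negative signal moves the aggregate far less than a high-trust trader does, which is the down-weighting effect the trust score is meant to provide.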

What I built

12 concurrent LLM agents (5 traders, 3 LPs, 2 arbitrageurs, 2 attackers) across OpenAI, Anthropic, and DeepSeek, with Pydantic-validated outputs feeding a trust-weighted risk-state and a constrained mean-variance optimizer. 1,200 seeded Black-Thursday simulations on commodity hardware (~47-99s/epoch).

My contribution

Single-author paper, single-engineer system: Stress Harness, trust-weighted aggregation, the stress-augmented covariance blend, and the 1,200 reproducible simulations. ~12,500 lines of typed Python across 46 modules. Each run records seed, commit, and timestamp.

57% · Peak peg deviation cut
3.1× · Faster crisis recovery
12 · Concurrent agents
1,200 · Reproducible simulations

Black Thursday replay: MVF-Composer vs. industry baseline (SAS)

1,200 sims, median peg-deviation trace, seed-controlled

Reading the chart

Both lines sit near peg until a shock at t=30, when SAS (the industry baseline) jumps to a 7.4% deviation and slowly recovers; MVF-Composer peaks at 3.2% and crosses the 1% recovery line ~3.1× faster (14 vs 44 time steps). Median over 1,200 seeded runs.
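The two headline figures follow directly from the chart values quoted above. A quick check, using only those numbers:

```python
# Values as quoted in the chart description.
sas_peak, mvf_peak = 7.4, 3.2  # peak peg deviation, in percent
sas_recovery_steps, mvf_recovery_steps = 44, 14  # steps to cross the 1% line

peak_cut = (sas_peak - mvf_peak) / sas_peak
speedup = sas_recovery_steps / mvf_recovery_steps

print(f"peak deviation cut: {peak_cut:.1%}")  # 56.8%, reported as 57%
print(f"recovery speedup: {speedup:.1f}x")    # 3.1x
```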

Legend: MVF-Composer (12 agents) · SAS industry baseline

Why this matters

MVF-Composer packages the multi-agent + trust-weighted aggregation + mean-variance optimization pattern in a setting with real money and real regulators watching. The IEEE peer review is the part that holds the numbers up under independent scrutiny.

Stress Harness · Trust-weighted aggregation · Mean-variance optimizer · Multi-provider LLM · Pydantic v2 contracts · Seed-controlled · IEEE ICBC 2026

04 · Oracle · Cryptography

NOC: Cryptographically Verifiable Oracle

On-chain SNARK verification at a fraction of incumbent cost.

Status: Shipped · 76 tests + 256-run fuzz

Stack: Solidity 0.8.23 · Groth16 · BN254

Byzantine: 100% detection up to 37.5% adversary

JD fit: Web3 infrastructure · security

Problem

Existing oracle designs either trust a small committee (cheap, fragile) or run a heavy consensus (expensive, slow). The committee design has been exploited more than once, and the consensus design prices out the use cases that need an update every block.

Motivation

Cryptographic proofs let you skip the trust assumption entirely. The verifier runs on-chain, the proof is cheap, and the only way to lie is to break the curve. The open question was whether you could ship a production-ready system on top of that idea at L2 cost.

What I built

A Solidity 0.8.23 oracle with on-chain Groth16 / BN254 verification, plus a staking, slashing, and reputation-weighted consensus layer to handle the cases where proofs aren't available. 76 Foundry tests pass, including a 256-run fuzz suite and gas benchmarks. Byzantine behavior is detected at a 100% rate for adversary fractions up to 37.5%.
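One common way to build a reputation-weighted fallback is a weighted median with outlier flagging. The sketch below is an off-chain Python illustration under that assumption; it is not the NOC contract logic, and the function names and the relative `tolerance` parameter are hypothetical. It shows why a minority of colluding reporters cannot move the consensus value while honest reporters hold most of the weight.

```python
def weighted_median(reports: dict[str, float], reputation: dict[str, float]) -> float:
    """Value at which cumulative reputation first reaches half of the total."""
    ordered = sorted(reports.items(), key=lambda item: item[1])
    half = sum(reputation[node] for node in reports) / 2
    running = 0.0
    for node, value in ordered:
        running += reputation[node]
        if running >= half:
            return value
    raise ValueError("no reports")


def flag_outliers(
    reports: dict[str, float], reputation: dict[str, float], tolerance: float
) -> set[str]:
    """Nodes whose report deviates from consensus by more than a relative tolerance."""
    consensus = weighted_median(reports, reputation)
    return {
        node
        for node, value in reports.items()
        if abs(value - consensus) > tolerance * abs(consensus)
    }
```

For example, with eight equally weighted nodes where three (37.5%) report 150 and five report values near 100, the weighted median stays near 100 and a 1% tolerance flags exactly the three deviating nodes. Flagged nodes would then be candidates for slashing.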

My contribution

End to end. The Solidity contracts, the Groth16 verifier integration, the staking and slashing logic, the reputation model, the test harness, and the gas benchmarks. It comes out about 21× cheaper per update than incumbent committee oracles on L2, which is what makes it usable in production rather than just on paper.

$0.04 · Cost per update (L2)
21× · Cheaper vs. committee baseline
100% · Byzantine detection (up to 37.5%)
76 tests · 256 fuzz runs

Cost per oracle update (L2): NOC vs. committee baseline

L2 gas snapshot, 30 gwei, ETH $3,200

What it means

At L2 cost, a single NOC update is around four cents. The same update from a committee oracle runs around 84 cents. That gap is the difference between an oracle you can call every block and an oracle you only call when you absolutely have to.

Legend: NOC · Committee

Why this matters for a JD

NOC is the Web3 infrastructure piece in my portfolio. It shows I can ship cryptographic primitives to production and that I understand the cost / trust tradeoff well enough to make the call on which to use where. Most agent platform teams have at least one oracle question, and the answer is rarely "just use Chainlink."

Groth16 on-chain · BN254 pairing · Staking & slashing · Reputation model · Fuzz tested · Gas optimized

In progress · demo on request

Web3 Agent Platform

A production-style agent platform built against the Agent Platform / Web3 Chatbot JD. Currently in final integration. Trace / replay, human-in-the-loop checkpoints, tool registry, and eval framework are wired up and demonstrable end to end on a private deploy. Happy to share a live walkthrough on request.

Built so far

Tool registry and policy-gated tool calling (per-tool permission scopes)
Trace / replay viewer with full LLM call and tool I/O capture
Human-in-the-loop checkpoints at action boundaries (approve / edit / reject)
Eval framework with seeded regression suite for prompt and tool changes
FastAPI on Cloud Run, Pydantic v2 contracts, asyncio pipeline
LangGraph orchestrator with proposer / challenger / judge role support
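The first three items in the list above (policy-gated tool calling, trace capture, and human-in-the-loop checkpoints) can be sketched together in a few dozen lines. This is an illustrative outline, not the platform's code: the `Tool` and `ToolRegistry` classes, the scope strings, and the `approve` callback are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    required_scope: str          # permission scope the caller must hold
    func: Callable[..., Any]
    needs_approval: bool = False  # route through a human checkpoint first


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}
        self.trace: list[dict] = []  # every call attempt is recorded for replay

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(
        self,
        name: str,
        granted_scopes: set[str],
        approve: Callable[[str, dict], bool],
        /,
        **kwargs: Any,
    ) -> Any:
        tool = self._tools[name]
        if tool.required_scope not in granted_scopes:
            self.trace.append({"tool": name, "outcome": "denied"})
            raise PermissionError(f"missing scope {tool.required_scope!r}")
        if tool.needs_approval and not approve(name, kwargs):
            self.trace.append({"tool": name, "outcome": "rejected"})
            raise PermissionError(f"human reviewer rejected {name!r}")
        result = tool.func(**kwargs)
        self.trace.append({"tool": name, "args": kwargs, "result": result, "outcome": "ok"})
        return result
```

Keeping the permission check, the approval checkpoint, and the trace write in one code path means a tool cannot run without leaving a record, whatever the outcome.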

Designed for the JD

Capability layering, module boundaries, and a permission model as first-class concerns
Agent execution: planning, tool calling, context, result validation, retry, state recovery
Harness-level capabilities: tool calls, workflow engine, sandbox, eval, trace, MCP-friendly I/O
SDD-friendly: every change is spec, test, then code, with diff-scoped review

Why it is in progress, not shipped

The chassis is up. The last 20% is the part where I run it against real workloads with real users and find the failure modes I haven't thought of yet. I'd rather call it in progress than call it done and hope.

LangGraph · FastAPI · Cloud Run · Pydantic v2 · Trace / replay · Human-in-the-loop · Eval framework · SDD workflow

Four shipped agent systems, built solo, measured, and reproduced.

Want the live demo or the code?