Each project is a real system with real numbers. I design the problem, ship the implementation, instrument the runs, and write up what worked and what didn't. Click any tab to see the full problem, motivation, and contribution walkthrough.
A multi-provider agent stack with a formally verified safety circuit.
Frontier model calls fail in three ways the demos never show: provider outage, prompt injection, and silent policy drift. A chatbot demo doesn't need to handle any of them. A production agent system has to fail safe under pressure, not just fail loud.
I wanted a chassis I could put real money and real users behind. That meant a kill-circuit I could prove safe, not one I hoped was safe, and a stack where every external call is observable and reversible.
Multi-provider LLM orchestration across OpenAI, Anthropic, and local Ollama, with auto-failover, content-hashed disk caching, and per-call cost, latency, and token telemetry. A TLA+ safety state machine gates every action. Prompt-injection detection runs on the input path. An EVM execution layer handles on-chain side effects. A RAG policy oracle grounds the agent in our own documents.
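A minimal sketch of the failover-plus-cache idea, not the production code. It assumes each provider is a plain callable that takes a prompt and raises on failure; the names `Provider`, `cache_key`, and `call_with_failover` are hypothetical, and only latency is recorded here (cost and token counts would come from the provider responses).

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Callable

# Hypothetical provider signature: takes a prompt, returns text, raises on failure.
Provider = Callable[[str], str]


def cache_key(name: str, prompt: str) -> str:
    """Content hash of the request, so identical calls map to the same cache file."""
    payload = json.dumps({"provider": name, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def call_with_failover(prompt: str, providers: list[tuple[str, Provider]], cache_dir: Path) -> dict:
    """Try providers in order; return a cached record if one exists, else call and cache."""
    errors = []
    for name, provider in providers:
        path = cache_dir / f"{cache_key(name, prompt)}.json"
        if path.exists():
            record = json.loads(path.read_text())
            record["cached"] = True
            return record
        start = time.perf_counter()
        try:
            text = provider(prompt)
        except Exception as exc:  # any provider error triggers failover to the next one
            errors.append(f"{name}: {exc}")
            continue
        record = {
            "provider": name,
            "text": text,
            "latency_s": time.perf_counter() - start,
            "cached": False,
        }
        path.write_text(json.dumps(record))
        return record
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Keying the cache on a hash of the request content means a repeated call never reaches the network, which also makes eval runs cheaper to replay.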
End to end. Architecture, TLA+ spec, agent code, Solidity contracts, eval framework, and the one-command pipeline that runs the whole battery. Solo: 18,400+ lines of Python, 7 Solidity contracts, 14 test suites, 5,866,037 TLA+ states model-checked with zero violations.
The safety circuit costs under a second per call and is provably correct, so I pay that cost in latency, not in incidents. The alternative, hoping the LLM behaves, doesn't show up on a P95 chart but it shows up in a postmortem.
OC1 is the closest thing I have to a textbook Agent Harness: tool calling, eval, trace, observability, human-in-the-loop checkpoints, and a sandbox. The TLA+ kill-circuit is the part most teams don't bother with, and the part that turns a demo into a system you can put in front of users.
Three LLM roles argue, then the verdict is enforced on-chain.
DeFi actions are usually gated by a single signer or a small multisig, which means a single bad call or a single compromised key can drain a protocol. Adding a second human doesn't help if the second human just rubber-stamps the first.
The same idea that makes peer review stronger than a single review (an independent challenger forces the proposer to defend the proposal) can make an autonomous agent safer than a single LLM call. The catch is that you need cryptographic receipts, not vibes, or it's theater.
A 3-role debate pattern. The Proposer drafts a DeFi action, the Challenger attacks it, and the Judge commits a verdict. All three sign the transcript. The full debate is hashed and posted on-chain, and a Foundry-tested Solidity stack (DebateRegistry, JudgeCommitment, SlashingPool, EmergencyStop) gates execution. If anything looks wrong, EmergencyStop halts the action and a human takes over.
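The transcript-hashing step can be sketched as below. This is an illustration under assumptions, not the shipped code: the turn format and the `transcript_digest` name are hypothetical, and SHA-256 stands in for the hash because Ethereum's keccak256 is not in the Python standard library (`hashlib.sha3_256` is a different function). Signing is left out.

```python
import hashlib
import json

ROLES = ("proposer", "challenger", "judge")


def transcript_digest(turns: list[dict]) -> str:
    """Hash a debate transcript deterministically.

    Each turn is {"role": ..., "content": ...}. Canonical JSON (sorted keys,
    fixed separators) keeps the digest independent of dict key ordering.
    """
    for turn in turns:
        if turn["role"] not in ROLES:
            raise ValueError(f"unknown role: {turn['role']}")
    canonical = json.dumps(turns, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any change to any turn changes the digest, so a verdict committed against one digest cannot be silently reattached to an edited debate.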
End to end again. The pattern, the prompt engineering, the agent code, all four Solidity contracts, the ECDSA verification, the transcript hashing, the 76 Foundry tests, and the gas benchmarks. The interesting part was making the on-chain cost 8.7% lower than the Optimism fault-proof baseline while keeping a cryptographically verifiable audit trail.
OC2 is the multi-agent harness pattern in its production form. Proposer / Challenger / Judge maps cleanly onto the three roles every agent platform needs (planner, critic, gatekeeper), and the on-chain receipts mean the eval is external to the LLM, not just an LLM grading itself.
A 12-agent rescue system for a stablecoin's peg, peer-reviewed and presented at IEEE ICBC 2026.
Stablecoin reserve controllers calibrate on calm-period returns. Under stress, the calm-period covariance is off by a factor of 7.17 (the “2020 Omission”), so the allocation that looks optimal is precisely the most fragile one. The result is the kind of peg collapse seen in March 2020 and March 2023.
12 LLM agents across 4 archetypes (trader, liquidity provider, arbitrageur, attacker) act inside a Stress Harness that injects shocks at t=30. A trust score T(a) down-weights coordinated or peg-destabilizing behavior, and a constrained mean-variance optimizer rebalances reserves against the stress-augmented covariance.
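To make the covariance step concrete, here is a two-asset sketch. It assumes the stress-augmented covariance is a convex blend of a calm-period and a stress-period matrix with a hypothetical weight `lam`, and it uses a long-only minimum-variance solution as a simplified stand-in for the constrained mean-variance optimizer. All numbers in the usage note are made up.

```python
def blend_covariance(calm, stress, lam):
    """Convex blend of two 2x2 covariance matrices: (1 - lam) * calm + lam * stress."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return [
        [(1 - lam) * c + lam * s for c, s in zip(calm_row, stress_row)]
        for calm_row, stress_row in zip(calm, stress)
    ]


def min_variance_weights(cov):
    """Long-only minimum-variance weights for two assets (weights sum to 1)."""
    s11, s12, s22 = cov[0][0], cov[0][1], cov[1][1]
    denom = s11 + s22 - 2 * s12
    if denom <= 0:
        raise ValueError("degenerate covariance")
    # Unconstrained optimum, clipped to [0, 1]; exact for the two-asset case
    # because portfolio variance is a convex quadratic in the first weight.
    w1 = min(1.0, max(0.0, (s22 - s12) / denom))
    return [w1, 1.0 - w1]
```

With a calm matrix `[[0.01, 0.0], [0.0, 0.04]]` the first asset gets a weight of 0.8. If its variance rises to 0.09 under stress and `lam = 0.5`, the blended matrix moves that weight down to about 0.44, which is the kind of shift a calm-only calibration never sees.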
12 concurrent LLM agents (5 traders, 3 LPs, 2 arbitrageurs, 2 attackers) across OpenAI, Anthropic, and DeepSeek, with Pydantic-validated outputs feeding a trust-weighted risk-state and a constrained mean-variance optimizer. 1,200 seeded Black-Thursday simulations on commodity hardware (~47-99s/epoch).
Single-author paper, single-engineer system: Stress Harness, trust-weighted aggregation, the stress-augmented covariance blend, and the 1,200 reproducible simulations. ~12,500 lines of typed Python across 46 modules. Each run records seed, commit, and timestamp.
Both lines sit near peg until a shock at t=30, when SAS (the industry baseline) jumps to a 7.4% deviation and slowly recovers; MVF-Composer peaks at 3.2% and crosses the 1% recovery line ~3.1× faster (14 vs 44 time steps). Median over 1,200 seeded runs.
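One way to compute a recovery time like the one quoted above is sketched here. The definition is an assumption: steps counted from the shock until the deviation first returns to the threshold or below after its post-shock peak. The function name and the series format (a list of absolute peg deviations, one per time step) are hypothetical.

```python
def recovery_steps(deviation, shock_t, threshold=0.01):
    """Steps from the shock until deviation first falls to <= threshold after its peak.

    Returns None if the series never recovers within the recorded window.
    """
    post = deviation[shock_t:]
    if not post:
        raise ValueError("shock_t is beyond the end of the series")
    # Index of the (first) post-shock peak, relative to the shock.
    peak = max(range(len(post)), key=post.__getitem__)
    for offset in range(peak, len(post)):
        if post[offset] <= threshold:
            return offset
    return None
```

Taking the median of this value over seeded runs for each controller gives a ratio comparable to the 14 vs 44 steps reported above (44 / 14 ≈ 3.1).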
MVF-Composer packages the multi-agent + trust-weighted aggregation + mean-variance optimization pattern in a setting with real money and real regulators watching. The IEEE peer review is the part that holds the numbers up under independent scrutiny.
On-chain SNARK verification at a fraction of incumbent cost.
Existing oracle designs either trust a small committee (cheap, fragile) or run a heavy consensus (expensive, slow). The committee design was exploited more than once, and the consensus design prices out the use cases that need an update every block.
Cryptographic proofs let you skip the trust assumption entirely. The verifier runs on-chain, the proof is cheap, and the only way to lie is to break the curve. The open question was whether you could ship a production-ready system on top of that idea at L2 cost.
A Solidity 0.8.23 oracle with on-chain Groth16 / BN254 verification, plus a staking, slashing, and reputation-weighted consensus layer to handle the cases where proofs aren't available. 76 Foundry tests pass, including a 256-run fuzz suite and gas benchmarks. A 100% Byzantine detection rate up to a 37.5% adversary fraction.
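As an off-chain illustration of reputation-weighted consensus, the sketch below takes a weighted median of reported values and flags reports that stray too far from it. The weighted median, the 5% tolerance, and both function names are assumptions chosen for the example; the contracts' actual aggregation rule is not reproduced here.

```python
def weighted_median(values, weights):
    """Value at which the cumulative weight first reaches half of the total."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    acc = 0.0
    for value, weight in sorted(zip(values, weights)):
        acc += weight
        if acc >= total / 2:
            return value


def flag_outliers(values, weights, tolerance=0.05):
    """Indices of reports deviating from the weighted median by more than `tolerance` (relative)."""
    median = weighted_median(values, weights)
    return [i for i, v in enumerate(values) if abs(v - median) > tolerance * abs(median)]
```

For example, with eight equally weighted reporters where three of them (37.5%) report 150 against an honest value near 100, the median stays at 100 and the three high reports are flagged. This is a toy case and does not reproduce the detection-rate result above.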
End to end. The Solidity contracts, the Groth16 verifier integration, the staking and slashing logic, the reputation model, the test harness, and the gas benchmarks. About 21× cheaper per update than incumbent committee oracles on L2, which is the part that makes it usable in production instead of just a paper.
At L2 cost, a single NOC update is around four cents. The same update from a committee oracle runs around 84 cents. That gap is the difference between an oracle you can call every block and an oracle you only call when you absolutely have to.
NOC is the Web3 infrastructure piece in my portfolio. It shows I can ship cryptographic primitives to production and that I understand the cost / trust tradeoff well enough to make the call on which to use where. Most agent platform teams have at least one oracle question, and the answer is rarely "just use Chainlink."
A production-style agent platform built against the Agent Platform / Web3 Chatbot job description. Currently in final integration. Trace / replay, human-in-the-loop checkpoints, tool registry, and eval framework are wired up and demonstrable end to end on a private deploy. Happy to share a live walkthrough on request.
The chassis is up. The last 20% is the part where I run it against real workloads with real users and find the failure modes I haven't thought of yet. I'd rather call it in progress than call it done and hope.
Code is shared on request under NDA. The Web3 Agent Platform walkthrough is also on request, with a short Loom if a live session doesn't fit the schedule. I read every message.