Benchmark Pipeline
Shows the end-to-end workflow: question authoring, live data capture, agent execution, scoring, and diagnostics.
CryptoBench is the first crypto-native benchmark authored by DeFi analysts, on-chain investigators, and derivatives traders. It ships 50 new questions every month, forcing agents to retrieve, reason, and predict under real market pressure.
We introduce CryptoBench, a live benchmark that stress-tests LLM agents in time-sensitive, adversarial crypto workflows. Existing agent benchmarks overlook the need to synthesize on-chain intelligence, market data, DEX flows, and MEV alerts. CryptoBench delivers 50 domain-authentic questions per month, categorized into Simple/Complex Retrieval and Simple/Complex Prediction, mirroring professional analyst workloads.
Evaluating ten state-of-the-art LLMs (with and without the SmolAgent framework) reveals a pronounced retrieval–prediction imbalance: models that excel at factual lookup frequently collapse on predictive reasoning. Agentic orchestration can reshuffle leaderboard positions, showing that standalone model capability does not translate directly into field performance.
Every prompt is drafted by crypto-native researchers—DeFi risk teams, on-chain sleuths, and quantitative traders—to ensure ecological validity.
Template variables (wallets, tickers, time windows) are swapped quarterly, preventing memorization and keeping pace with new primitives like restaking and intent layers.
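The quarterly variable swap described above can be sketched as deterministic sampling from refreshed value pools. This is an illustrative outline only: the template text, pool contents, and seeding scheme are assumptions, not CryptoBench's actual implementation.

```python
import random
from string import Template

# Hypothetical question template; ${...} slots are the rotated variables.
QUESTION_TEMPLATE = Template(
    "What was the net ${ticker} flow for wallet ${wallet} over the last ${window}?"
)

# Pools would be refreshed each quarter from live data (values here are made up).
VARIABLE_POOLS = {
    "wallet": ["0xAb5801a7D398351b8bE11C439e05C5b3259aeC9B", "0x3f5CE5FBFe3E9af3971dD833D26bA9b5C936f0bE"],
    "ticker": ["ETH", "ARB", "SOL"],
    "window": ["24 hours", "7 days", "30 days"],
}

def instantiate(template: Template, seed: int) -> str:
    """Deterministically sample one value per variable for a given quarter's seed."""
    rng = random.Random(seed)
    values = {name: rng.choice(pool) for name, pool in VARIABLE_POOLS.items()}
    return template.substitute(values)

# A seed derived from year and quarter keeps instantiation reproducible
# within a release while guaranteeing fresh surface forms across releases.
print(instantiate(QUESTION_TEMPLATE, seed=20251))
```

Because the seed fixes the sampled values, every evaluated model sees the identical question within a quarter, while the next quarter's seed invalidates memorized answers.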
Tasks require navigating on-chain explorers, CeFi + DeFi dashboards, derivatives terminals, DEX routing tools, and AI-generated signal feeds.
Agents must pick authoritative sources, adapt to UX quirks, and reconcile conflicting feeds before answering—mirroring real analyst workflows.
Illustrates the Simple/Complex × Retrieval/Prediction matrix that guides task authoring and score reporting each month.
Wallet archetyping, exploit tracing, whale/KOL flow, phishing and “rat trading” detection.
Liquidations, funding rates, open interest, and macro volatility feeds stitched together in real time.
Protocol dashboards for TVL/TVS, oracle drift analysis, and risk surface comparisons.
Order-flow toxicity, AMM routing, basis spreads, and delta-hedge stress testing.
Intent mempools, bundle profitability, and model-generated trade alerts from crypto-native terminals.
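The Simple/Complex × Retrieval/Prediction matrix, crossed with the domains listed above, could be encoded as a small task schema. The class and field names below are hypothetical, chosen only to make the taxonomy concrete; they are not CryptoBench's actual data model.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical encoding of the 2x2 task matrix used for authoring and reporting.
class Skill(Enum):
    RETRIEVAL = "retrieval"
    PREDICTION = "prediction"

class Difficulty(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"

@dataclass(frozen=True)
class Task:
    question: str
    skill: Skill
    difficulty: Difficulty
    domain: str  # e.g. "on-chain forensics", "derivatives", "MEV"

task = Task(
    question="Trace the exploit path of the flagged wallet over the past week.",
    skill=Skill.RETRIEVAL,
    difficulty=Difficulty.COMPLEX,
    domain="on-chain forensics",
)
print(task.skill.value, task.difficulty.value)  # → retrieval complex
```

Tagging every task with both axes is what lets monthly scorecards be sliced into the four Simple/Complex × Retrieval/Prediction cells.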
Scorecards for direct LLM calls and SmolAgent runs, plus the most diagnostic breakdowns.
This overview contrasts direct LLM evaluation with the SmolAgent setup, showing how agentic scaffolding reshuffles rankings.
Simple retrieval is saturated, but complex prediction exposes the largest gaps, confirming the need for multi-hop reasoning.
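The retrieval–prediction gap behind these breakdowns can be computed from per-task pass/fail results. The sketch below uses fabricated example results purely to show the aggregation; the category labels and gap definition are assumptions about how such a scorecard might be tallied.

```python
from collections import defaultdict

# Illustrative per-task outcomes for one model: (category, passed). Made-up data.
results = [
    ("simple_retrieval", True), ("simple_retrieval", True),
    ("complex_retrieval", True), ("complex_retrieval", False),
    ("simple_prediction", True), ("simple_prediction", False),
    ("complex_prediction", False), ("complex_prediction", False),
]

def category_accuracy(results):
    """Return accuracy per category from (category, passed) pairs."""
    tally = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        tally[category][0] += int(passed)
        tally[category][1] += 1
    return {cat: passed / total for cat, (passed, total) in tally.items()}

acc = category_accuracy(results)
retrieval = (acc["simple_retrieval"] + acc["complex_retrieval"]) / 2
prediction = (acc["simple_prediction"] + acc["complex_prediction"]) / 2
print(f"retrieval={retrieval:.2f} prediction={prediction:.2f} gap={retrieval - prediction:.2f}")
# → retrieval=0.75 prediction=0.25 gap=0.50
```

In this toy data the model aces simple retrieval but fails every complex prediction, the same saturation-versus-gap pattern the leaderboard highlights.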
Segments tasks by whether retail participants or institutional desks care more about the question, highlighting coverage across investor profiles.