Benchmark Pipeline
Shows the end-to-end workflow: question authoring, live data capture, agent execution, scoring, and diagnostics.
CryptoBench is the first crypto-native benchmark authored by DeFi analysts, on-chain investigators, and derivatives traders. It ships 50 new questions every month, forcing agents to retrieve, reason, and predict under real market pressure.
We introduce CryptoBench, a live benchmark that stress-tests LLM agents in time-sensitive, adversarial crypto workflows. Existing agent benchmarks overlook the need to synthesize on-chain intelligence, market data, DEX flows, and MEV alerts. CryptoBench delivers 50 domain-authentic questions per month, categorized into Simple/Complex Retrieval and Simple/Complex Prediction, mirroring professional analyst workloads.
Evaluating ten state-of-the-art LLMs (with and without the SmolAgent framework) reveals a pronounced retrieval–prediction imbalance: models that excel at factual lookup frequently collapse on predictive reasoning. Agentic orchestration can reshuffle leaderboard positions, showing that standalone model capability does not translate directly into field performance.
Every prompt is drafted by crypto-native researchers—DeFi risk teams, on-chain sleuths, and quantitative traders—to ensure ecological validity.
Template variables (wallets, tickers, time windows) are swapped quarterly, preventing memorization and keeping pace with new primitives like restaking and intent layers.
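The quarterly variable swap described above can be sketched as deterministic sampling from refreshed value pools. This is an illustrative outline only: the template text, pool contents, and seeding scheme are assumptions, not CryptoBench's actual implementation.

```python
import random
from string import Template

# Hypothetical question template; ${...} slots are the rotated variables.
QUESTION_TEMPLATE = Template(
    "What was the net ${ticker} flow for wallet ${wallet} over the last ${window}?"
)

# Pools would be refreshed each quarter from live data (values here are made up).
VARIABLE_POOLS = {
    "wallet": ["0xAb5801a7D398351b8bE11C439e05C5b3259aeC9B", "0x3f5CE5FBFe3E9af3971dD833D26bA9b5C936f0bE"],
    "ticker": ["ETH", "ARB", "SOL"],
    "window": ["24 hours", "7 days", "30 days"],
}

def instantiate(template: Template, seed: int) -> str:
    """Deterministically sample one value per variable for a given quarter's seed."""
    rng = random.Random(seed)
    values = {name: rng.choice(pool) for name, pool in VARIABLE_POOLS.items()}
    return template.substitute(values)

# A seed derived from year and quarter keeps instantiation reproducible
# within a release while guaranteeing fresh surface forms across releases.
print(instantiate(QUESTION_TEMPLATE, seed=20251))
```

Because the seed fixes the sampled values, every evaluated model sees the identical question within a quarter, while the next quarter's seed invalidates memorized answers.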
Tasks require navigating on-chain explorers, CeFi + DeFi dashboards, derivatives terminals, DEX routing tools, and AI-generated signal feeds.
Agents must pick authoritative sources, adapt to UX quirks, and reconcile conflicting feeds before answering—mirroring real analyst workflows.
Illustrates the Simple/Complex × Retrieval/Prediction matrix that guides task authoring and score reporting each month.
Wallet archetyping, exploit tracing, whale/KOL flow, phishing and “rat trading” detection.
Liquidations, funding rates, open interest, and macro volatility feeds stitched together in real time.
Protocol dashboards for TVL/TVS, oracle drift analysis, and risk surface comparisons.
Order-flow toxicity, AMM routing, basis spreads, and delta-hedge stress testing.
Intent mempools, bundle profitability, and model-generated trade alerts from crypto-native terminals.
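The Simple/Complex × Retrieval/Prediction matrix, crossed with the domains listed above, could be encoded as a small task schema. The class and field names below are hypothetical, chosen only to make the taxonomy concrete; they are not CryptoBench's actual data model.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical encoding of the 2x2 task matrix used for authoring and reporting.
class Skill(Enum):
    RETRIEVAL = "retrieval"
    PREDICTION = "prediction"

class Difficulty(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"

@dataclass(frozen=True)
class Task:
    question: str
    skill: Skill
    difficulty: Difficulty
    domain: str  # e.g. "on-chain forensics", "derivatives", "MEV"

task = Task(
    question="Trace the exploit path of the flagged wallet over the past week.",
    skill=Skill.RETRIEVAL,
    difficulty=Difficulty.COMPLEX,
    domain="on-chain forensics",
)
print(task.skill.value, task.difficulty.value)  # → retrieval complex
```

Tagging every task with both axes is what lets monthly scorecards be sliced into the four Simple/Complex × Retrieval/Prediction cells.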
Scorecards for direct LLM calls and SmolAgent runs, plus the most diagnostic breakdowns.
This overview contrasts direct LLM evaluation with the SmolAgent setup, showing how agentic scaffolding reshuffles rankings.
Simple retrieval is saturated, but complex prediction exposes the largest gaps, confirming the need for multi-hop reasoning.
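The retrieval–prediction gap behind these breakdowns can be computed from per-task pass/fail results. The sketch below uses fabricated example results purely to show the aggregation; the category labels and gap definition are assumptions about how such a scorecard might be tallied.

```python
from collections import defaultdict

# Illustrative per-task outcomes for one model: (category, passed). Made-up data.
results = [
    ("simple_retrieval", True), ("simple_retrieval", True),
    ("complex_retrieval", True), ("complex_retrieval", False),
    ("simple_prediction", True), ("simple_prediction", False),
    ("complex_prediction", False), ("complex_prediction", False),
]

def category_accuracy(results):
    """Return accuracy per category from (category, passed) pairs."""
    tally = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        tally[category][0] += int(passed)
        tally[category][1] += 1
    return {cat: passed / total for cat, (passed, total) in tally.items()}

acc = category_accuracy(results)
retrieval = (acc["simple_retrieval"] + acc["complex_retrieval"]) / 2
prediction = (acc["simple_prediction"] + acc["complex_prediction"]) / 2
print(f"retrieval={retrieval:.2f} prediction={prediction:.2f} gap={retrieval - prediction:.2f}")
# → retrieval=0.75 prediction=0.25 gap=0.50
```

In this toy data the model aces simple retrieval but fails every complex prediction, the same saturation-versus-gap pattern the leaderboard highlights.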
Segments tasks by whether retail participants or institutional desks care more about the question, highlighting coverage across investor profiles.