How we test what we test

Three tests every strategy must pass before we deploy a dollar.

We rejected ~195 of 200 strategies tested. Here's the filter that did the rejecting, the metrics we report, and why HODL is our anchor benchmark — but never the only one.

The 3-Test Stack

Each test catches a different failure mode. Skipping any one means accepting a higher rate of deploying broken strategies. Each filter rejects roughly 50-80% of what passed the previous one; at the rates below, about 0.5 × 0.3 × 0.2 ≈ 3% of ideas survive all three, a combined rejection of ~97%.

Test 1: Continuous Full-Period Backtest

Catches: obviously broken strategies

Run the strategy over the longest available data window (we typically use 6-8 years of daily Bitcoin data from Binance/Bybit). Apply realistic 0.10% per-trade fees and 0.05% slippage. Compare total return to HODL baseline.

Pass criterion: positive return after fees, ideally beats HODL
Rejection rate: ~50% of ideas (most popular indicators fail this)
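
A minimal sketch of this test in Python, assuming daily closes and a precomputed long/flat signal (the function name and cost defaults are illustrative, not our production harness):

```python
import numpy as np

def full_period_backtest(closes, positions, fee_rate=0.0010, slippage=0.0005):
    """Total return of a long/flat strategy vs HODL, after per-trade costs.

    closes:    daily close prices
    positions: 0 (cash) or 1 (long), decided on the previous bar's data
    """
    closes = np.asarray(closes, dtype=float)
    positions = np.asarray(positions, dtype=float)
    daily_ret = closes[1:] / closes[:-1] - 1.0     # bar-to-bar returns
    strat_ret = positions[:-1] * daily_ret         # earn the return of yesterday's position
    cost = np.abs(np.diff(positions)) * (fee_rate + slippage)  # pay costs on every flip
    strat_total = np.prod(1.0 + strat_ret - cost) - 1.0
    hodl_total = closes[-1] / closes[0] - 1.0
    return strat_total, hodl_total
```

The pass check is then just `strat_total > 0`, ideally with `strat_total > hodl_total`.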

Test 2: 3-Window Walk-Forward

Catches: regime-overfit strategies

Split the historical data into 3 non-overlapping equal windows. Run the strategy independently in each. Does it beat HODL in EACH window separately?

A strategy that beats HODL by +400% over 8 years might have produced ALL of that edge in one bull-market regime and lost in two others. The full-period number averages this away. Walk-forward exposes it.

Pass criterion: beats HODL in 2 of 3 windows minimum (3/3 ideal)
Rejection rate: ~70% of strategies that passed Test 1
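
Under the same assumptions, the 3-window check is a thin loop over `full_period_backtest` from above (the equal split by bar count is a simplification here; any non-overlapping split works):

```python
import numpy as np

def walk_forward_score(closes, positions, n_windows=3):
    """Count the non-overlapping windows in which the strategy beats HODL."""
    edges = np.linspace(0, len(closes), n_windows + 1, dtype=int)
    wins = 0
    for a, b in zip(edges[:-1], edges[1:]):
        strat, hodl = full_period_backtest(closes[a:b], positions[a:b])
        if strat > hodl:
            wins += 1
    return wins  # pass: wins >= 2, ideal: wins == 3
```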

Test 3: Parameter Robustness Sweep (the one almost nobody runs)

Catches: noise-overfit strategies

For your chosen parameter (lookback period, threshold, etc.), test ±2 to ±5 nearby values. Does the response form a plateau or an isolated spike?

A real edge produces similar results across nearby parameter choices (plateau). A lottery-winning fluke produces a single magic number with mediocre neighbors (spike). Walk-forward CAN'T catch this — only neighborhood testing does.

Pass criterion: at least 3 nearby parameter values produce similar results
Rejection rate: ~80% of strategies that passed Tests 1+2
→ See the Sharpshooter post-mortem: a strategy that passed Tests 1 and 2 but failed Test 3.
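
And a sketch of the sweep itself, assuming a hypothetical `run_strategy(closes, param)` helper that returns the strategy's full-period return for one parameter value (the "keeps at least half the center's return" tolerance is an illustrative choice, not a universal constant):

```python
def plateau_check(closes, run_strategy, center, offsets=range(-5, 6), keep=0.5):
    """Does the chosen parameter sit on a plateau or on an isolated spike?"""
    center_ret = run_strategy(closes, center)   # assumes center_ret > 0;
                                                # a negative center fails Test 1 anyway
    neighbor_rets = [run_strategy(closes, center + d) for d in offsets if d != 0]
    similar = sum(r >= keep * center_ret for r in neighbor_rets)
    return similar >= 3  # pass: at least 3 neighbors hold up
```
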
Combined rejection: ~97% (the ~195 of 200 above). If a strategy passes all three, it earns a paper-trading slot for 3+ months of live observation before any real-capital decision. Most don't pass.

Why HODL Is Our Anchor Benchmark

Every strategy is tested against buy-and-hold of the same asset over the same window. Here's why — and what HODL comparison doesn't tell you.

Why HODL works as an anchor
  • Opportunity cost. Every dollar in your strategy is a dollar not in HODL. If HODL would have made more, your strategy is paying you a negative wage.
  • Zero-skill baseline. Anyone can HODL. Beating HODL requires demonstrable skill or edge.
  • Tax + fee accumulation. Over multi-year windows, fees and tax friction compound and can wipe out short-term alpha. HODL incurs almost none of this friction.
  • Statistical robustness. Walk-forward testing across multiple regimes demands long data windows, and over windows that long, same-asset buy-and-hold is the natural comparison.
Where HODL comparison falls short
  • Time-horizon mismatch. A high-frequency bot trading 50x/year operates on a different natural timescale than 8-year HODL. Total-return comparison hides operational differences.
  • Risk profile. HODL = constant exposure. A dynamic strategy = variable exposure. Same return, very different risk profile.
  • Diversification value. A strategy with lower return but uncorrelated returns can be valuable in a portfolio context.
  • Investor reality. Some investors must trade (tax-loss harvesting, regulatory). For them, HODL isn't available.
→ Full essay on the limits of HODL benchmarking

What We Report (Not Just Total Return)

HODL anchors the comparison. But the full picture needs 5+ additional metrics.

Metric | What it tells you | Most relevant for
Total Return vs HODL | Did the strategy create wealth vs the zero-skill alternative? | all strategies
Max Drawdown | Worst peak-to-trough loss; psychological / margin-call relevance. | all strategies
Calmar Ratio | Total return ÷ abs(Max Drawdown); higher = better risk-adjusted. | all strategies
Sharpe Ratio | Excess return per unit of volatility; the standard risk-adjusted metric. | all strategies
Walk-Forward Score | Number of independent sub-periods where the strategy beats the benchmark. | regime robustness
Rolling N-Month Beat-Rate | % of rolling 6/12-month windows where the strategy beats HODL. | HF strategies, investor experience
Win Rate | % of trades that close positive after fees. | HF strategies
Avg Win / Avg Loss | Asymmetry profile; a strategy can have a low win rate but high asymmetry. | HF strategies
% Time in Market | Fraction of time capital is deployed vs sitting in cash. | cycle-filter strategies
Parameter Plateau Width | Number of nearby parameter values that also pass walk-forward. | overfit detection
Why so many metrics? No single number tells you whether a strategy is good. A bot can have a great Total Return but a bad Sharpe (high volatility), a great Sharpe but a low Win Rate (rare big wins), or a strong Walk-Forward Score but a narrow Parameter Plateau (overfit). The combination matters. We report all of these on every Bot Card so you can judge by the dimension that matters to you.
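
For concreteness, here is one way to compute the risk rows of that table from a daily equity curve; a sketch that follows the table's own definitions (Calmar as total return over max drawdown, Sharpe assuming a 0% risk-free rate and daily bars):

```python
import numpy as np

def core_metrics(equity, periods_per_year=365):
    """Total return, max drawdown, Calmar and Sharpe from a daily equity curve."""
    equity = np.asarray(equity, dtype=float)
    rets = equity[1:] / equity[:-1] - 1.0
    total_return = equity[-1] / equity[0] - 1.0
    running_peak = np.maximum.accumulate(equity)
    max_drawdown = np.max(1.0 - equity / running_peak)  # worst peak-to-trough loss
    return {
        "total_return": total_return,
        "max_drawdown": max_drawdown,
        "calmar": total_return / max_drawdown if max_drawdown > 0 else float("inf"),
        "sharpe": np.mean(rets) / np.std(rets) * np.sqrt(periods_per_year),
    }
```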

What We Don't Do

Cherry-pick time windows

We test on full available history, not just bull markets. The losing windows are reported.

Hide failed strategies

Every retired strategy is documented on /post-mortems with date, reason, and lesson.

Ignore fees and slippage

All backtests include realistic 0.10%-0.20% per-trade costs. A fee-free backtest is a lie.

Optimize on the test data

Parameters are chosen in-sample and evaluated out-of-sample. We never fit and test on the same window.
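
A minimal sketch of what that means in practice, reusing the hypothetical `run_strategy` helper from above: the parameter is chosen on one slice of history, and the number we report comes from another.

```python
def fit_then_test(closes, run_strategy, candidates, split=0.7):
    """Pick the best parameter on the first 70% of data; evaluate it on the rest."""
    cut = int(len(closes) * split)
    best = max(candidates, key=lambda p: run_strategy(closes[:cut], p))  # fit in-sample
    return best, run_strategy(closes[cut:], best)  # the reported return is out-of-sample
```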

Show only Sharpe or only Total Return

Single-metric reports hide failure modes. We show 5+ metrics per bot.

Promise alpha

Backtests don't guarantee future returns. Every bot card carries a tier label and a clear caveat.

See the methodology applied

Every bot on /bots passes the 3-test stack and reports the multi-metric panel. Every retired strategy on /post-mortems failed at least one of the tests — explained.