A calibrated AI analyst is only as good as the methodology you can audit. Here's ours, in full.
Most AI benchmarks measure whether a model can do a task. We measure something harder: when a model says "I am 70% confident," is it actually right 70% of the time? This distinction matters in prediction markets because the product of our work is a number, and the only way to check it is to collect thousands of them over time.
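That calibration check can be made concrete. Below is a minimal sketch (not our production scorer): group predictions into confidence bins and compare each bin's average stated confidence with its empirical hit rate. The function name and the sample data are illustrative.

```python
from collections import defaultdict

def calibration_table(predictions, num_bins=10):
    """Group (confidence, outcome) pairs into confidence bins and compare
    each bin's average stated confidence with its empirical hit rate."""
    bins = defaultdict(list)
    for confidence, outcome in predictions:
        # Map confidence in [0, 1] to a bin index; clamp 1.0 into the top bin.
        idx = min(int(confidence * num_bins), num_bins - 1)
        bins[idx].append((confidence, outcome))
    table = []
    for idx in sorted(bins):
        entries = bins[idx]
        stated = sum(c for c, _ in entries) / len(entries)
        hit_rate = sum(o for _, o in entries) / len(entries)
        table.append((stated, hit_rate, len(entries)))
    return table

# A model that says "70%" should resolve true roughly 70% of the time.
preds = [(0.7, 1), (0.7, 1), (0.7, 0), (0.7, 1), (0.9, 1), (0.9, 1)]
for stated, hit_rate, n in calibration_table(preds):
    print(f"stated {stated:.2f} -> observed {hit_rate:.2f} over {n} calls")
```

With only a handful of calls per bin the comparison is noise; this is why the check only becomes meaningful across thousands of resolved predictions.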
Our backtest infrastructure enforces sterility at the tool level. When the agent runs a search at simulation time T, our search layer filters results to documents available before T, and runs a second classifier pass to catch backdated content.
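The two-pass filter can be sketched as follows. This is a simplified stand-in, not our actual search layer: `Doc`, `sterile_results`, and the naive keyword check inside `looks_backdated` are all illustrative, and the real second pass is a classifier rather than a string match.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Doc:
    url: str
    published_at: datetime  # timestamp claimed by the source
    text: str

def looks_backdated(doc: Doc, cutoff: datetime) -> bool:
    """Stand-in for the second classifier pass: flag documents whose body
    references events after the cutoff despite an earlier claimed timestamp.
    Here, a naive check for mentions of a post-cutoff year."""
    return str(cutoff.year + 1) in doc.text

def sterile_results(docs, cutoff: datetime):
    # Pass 1: drop anything with a claimed publish time at or after the cutoff.
    before_cutoff = [d for d in docs if d.published_at < cutoff]
    # Pass 2: drop documents flagged as backdated.
    return [d for d in before_cutoff if not looks_backdated(d, cutoff)]
```

The second pass exists because claimed timestamps are untrustworthy: a page edited after the fact can carry an old publish date while leaking post-cutoff information into the backtest.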
Every market is analyzed by multiple model families, and agreement across them is required before we publish a directional signal. Because the families don't share a training distribution, their errors are less correlated, so a mistake one model makes is less likely to be echoed by the others.
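The agreement gate reduces to a small function. A minimal sketch, with hypothetical names: each model family produces a probability, and a direction is published only if every family leans the same way relative to the market price.

```python
def consensus_direction(model_probs, market_price):
    """Return 'YES', 'NO', or None given a dict of per-model-family
    probabilities. A signal is published only when every family
    disagrees with the market price in the same direction."""
    directions = set()
    for p in model_probs.values():
        if p == market_price:
            return None  # a family with no edge blocks publication
        directions.add("YES" if p > market_price else "NO")
    # Unanimous direction publishes; any split suppresses the signal.
    return directions.pop() if len(directions) == 1 else None
```

Usage: `consensus_direction({"family_a": 0.80, "family_b": 0.75}, market_price=0.60)` yields `"YES"`, while any split across families yields `None` and nothing is published.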
Every signal is timestamped, stored in an append-only ledger, and scored on accuracy, Brier score, and calibration curves. All three are visible in the public track record.
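Two of those three metrics fit in a few lines; calibration curves are the bucketing exercise shown earlier. These are the standard definitions, not our exact scoring code:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and binary
    outcomes. 0.0 is perfect; always saying 50% scores 0.25."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def accuracy(forecasts, outcomes):
    """Fraction of forecasts on the correct side of 50%."""
    hits = sum((f > 0.5) == bool(o) for f, o in zip(forecasts, outcomes))
    return hits / len(forecasts)
```

The reason for reporting all three: accuracy alone rewards hedging toward the favorite, Brier score rewards sharp and correct probabilities, and calibration curves catch systematic over- or under-confidence that the other two can hide.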
Everything we write is public. Signals can't be edited after publishing; corrections are appended. We don't delete losing calls. We don't claim numbers we can't reproduce from a sterile backtest.
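The append-only, correction-by-appending policy has a simple shape in code. This is an illustrative sketch, not our ledger implementation: records are never mutated, a correction is a new record pointing at the original, and hash-chaining makes any retroactive edit detectable.

```python
import hashlib
import json
import time

class AppendOnlyLedger:
    """Minimal sketch of an append-only signal ledger. Records are never
    mutated; a correction is a new record referencing the original."""

    def __init__(self):
        self._records = []

    def append(self, payload, corrects=None):
        record = {
            "index": len(self._records),
            "timestamp": time.time(),
            "corrects": corrects,  # index of the record being corrected, if any
            "payload": payload,
            # Chain each record to its predecessor so edits are detectable.
            "prev_hash": self._records[-1]["hash"] if self._records else None,
        }
        body = {k: v for k, v in record.items() if k != "hash"}
        record["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True, default=str).encode()
        ).hexdigest()
        self._records.append(record)
        return record["index"]
```

Because each record's hash covers its predecessor's hash, silently editing or deleting a losing call would break the chain for every record after it.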