IARPA crowd prediction moving to commercial spinoff and lessons on making better predictions

You can pre-register for beta testing of a commercial crowd prediction spinoff of the IARPA Good Judgment Project.

The Good Judgment Project outperformed all other research teams in geopolitical forecasting. It was a four-year research study organized as part of a government-sponsored forecasting tournament in which thousands of people around the world predicted global events. Their collective forecasts were surprisingly accurate.

The Aggregative Contingent Estimation (ACE) Program was sponsored by IARPA (the U.S. Intelligence Advanced Research Projects Activity). The ACE Program aimed “to dramatically enhance the accuracy, precision, and timeliness of forecasts for a broad range of event types.”

Science-based forecasting
Good Judgment™ helps organizations quantify risk. We get rid of vague verbiage and put ourselves on the line with specific probability estimates of the scenarios you’re most worried about, with near real-time forecast updates. We continuously test and refine our forecasting techniques. And we keep score, providing clients with a track record of our accuracy that other forecasting services are unable – or unwilling – to reveal.

The following information is from a blog post at the Good Judgment Project by Pavel Atanasov and Angela Minster.

1. Prediction skill over time.

Prior research had demonstrated that expert forecasters are often no more accurate than the proverbial dart-throwing chimp. Does this mean that accurate forecasting is simply a matter of luck, with no skill involved? Not at all.

Here’s how we tested for the persistence of forecasting skill. Within a sample of 600 Good Judgment Project forecasters, we identified the 100 most and least accurate forecasters, based on standardized Brier scores, over the first 25 questions in the forecasting tournament. Then, we tracked their scores over the next 175 questions.

As we can see in the figure below, the top 100 forecasters (based on initial performance) were consistently more accurate than the bottom 100, with the top guns beating the rear-guard on 169 out of 174 questions.
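To make the scoring concrete, here is a minimal sketch (not the project's actual code) of how a Brier score for a binary question can be computed and then standardized within each question, so that forecasters can be ranked across questions of differing difficulty. The function names and the data layout are mine, chosen for illustration.

```python
import numpy as np

def brier_score(forecast_probs, outcome):
    """Brier score for one binary question: mean squared error between
    the forecast probability of 'yes' and the 0/1 outcome.
    Lower is better; always saying 50% scores 0.25."""
    return float(np.mean((np.asarray(forecast_probs, dtype=float) - outcome) ** 2))

def standardized_scores(scores_by_question):
    """Standardize each forecaster's Brier score within a question
    (z-score against everyone who answered that question), then average
    across questions so hard and easy questions weigh equally.

    scores_by_question: dict of question id -> {forecaster id: Brier score}
    Returns: dict of forecaster id -> mean standardized score (lower = better).
    """
    per_forecaster = {}
    for qid, scores in scores_by_question.items():
        vals = np.array(list(scores.values()), dtype=float)
        mu, sigma = vals.mean(), vals.std() or 1.0   # guard against zero spread
        for fid, s in scores.items():
            per_forecaster.setdefault(fid, []).append((s - mu) / sigma)
    return {fid: float(np.mean(zs)) for fid, zs in per_forecaster.items()}
```

Ranking forecasters on such standardized scores over an initial block of questions, then checking their scores on later questions, is the kind of persistence test described above.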

Upshot: Forecasting accuracy is a matter of skill, not just luck.

2. Training and teaming

How much do situational factors affect individual forecasting performance? We focused on two such factors: training and teaming.

Training was delivered in a one-hour online module and focused on reasoning tips for forecasting, such as using base rates and mathematical models and updating one's beliefs. Teaming allowed forecasters to collaborate as members of 12-15 person teams with online tools for allocating effort and sharing information and rationales with one another. As the figure shows, training and teaming significantly reduced forecasting error in the tournament, and these results replicated across all four seasons.
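As a rough illustration of the kind of reasoning the training covered (this example is mine, not taken from the module, and all the numbers are made up), a forecaster can start from a base rate and revise it with Bayes' rule as evidence arrives:

```python
def bayes_update(prior, p_evidence_given_yes, p_evidence_given_no):
    """Revise a probability estimate with one piece of evidence using Bayes' rule."""
    numerator = prior * p_evidence_given_yes
    return numerator / (numerator + (1 - prior) * p_evidence_given_no)

# Start from the base rate: suppose roughly 20% of comparable past situations
# ended with the event occurring (illustrative figure only).
estimate = 0.20

# A new report arrives that is three times as likely if the event is coming
# (the likelihoods are invented for the example).
estimate = bayes_update(estimate, p_evidence_given_yes=0.6, p_evidence_given_no=0.2)
print(round(estimate, 2))  # 0.43 -- a sizeable but not total revision of the base rate
```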

Upshot: Forecasting training and teaming improve forecasting performance.

3. Superforecasters.

The first two lessons concerned the persistence of individual skill and the importance of environmental factors. We wondered what would happen if we introduced highly skilled forecasters to an enriched environment. To test whether such “tracking” would further improve performance, we promoted the top 2% most accurate forecasters to “superforecaster” status and placed them in teams. The resulting superteams had an elite-egalitarian structure: they were composed of top past performers, all of whom had equal rights and responsibilities within their teams.

The performance of superteams was extremely strong. We documented this using a simple version of discontinuity analysis: we compared superforecasters (the top 2%) with those who just missed the cut (the top 3-5%). In Year 1, when the selection took place, both groups performed much better than average. In both Years 2 and 3, superteams increased their lead over the comparison group. Rather than regressing toward the mean, superteams increased their level of engagement and produced highly accurate forecasts.

Upshot: Tracking top performers and placing them in flat, non-hierarchical teams improves motivation, engagement, and performance.

4. Belief updating.

What is the best behavioral predictor of prediction skill, apart from accuracy? One potential indicator is belief updating. The average duration of Good Judgment Project forecasting questions was over three months; forecasters were able to update their predictions whenever they wished. The pattern of belief updating was strongly and robustly related to forecasting accuracy.

We measured the frequency of belief updates and their magnitude. For example, we could compare a forecaster who places 1.5 predictions per question and whose average update is 20 percentage points with another who makes 2.3 predictions and updates by 11 percentage points, on average. Forecasters who updated their beliefs more often, and in smaller increments, tended to be more accurate than those who made fewer or larger updates. Frequency and magnitude independently predicted accuracy. We verified the robustness of these relationships in and out of sample.
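Here is a minimal sketch of how these two behavioral measures could be computed from one forecaster's chronological prediction history (the data layout and function name are hypothetical, not the project's):

```python
from collections import defaultdict

def updating_profile(predictions):
    """Compute updating frequency and magnitude for one forecaster.

    predictions: list of (question_id, probability) tuples in chronological order,
                 with probabilities on a 0-1 scale.
    Returns (forecasts_per_question, mean_update_size_in_percentage_points).
    """
    by_question = defaultdict(list)
    for qid, prob in predictions:
        by_question[qid].append(prob)

    n_questions = len(by_question)
    n_forecasts = sum(len(v) for v in by_question.values())

    # Magnitude: absolute change between consecutive forecasts on the same question.
    updates = [abs(b - a)
               for probs in by_question.values()
               for a, b in zip(probs, probs[1:])]
    mean_update = 100 * sum(updates) / len(updates) if updates else 0.0

    return n_forecasts / n_questions, mean_update

# Example: 1.5 forecasts per question, updating by 20 percentage points on average.
history = [("q1", 0.50), ("q1", 0.70), ("q2", 0.30)]
print(updating_profile(history))  # (1.5, 20.0)
```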

Upshot: Frequent, small belief updates are the marks of an accurate forecaster.

5. Measuring forecasting skill over time

So what are the best predictors of individual forecasting skill? It depends on how much performance data is available.

With minimal accuracy data, we can best predict future performance with behavioral/effort measures (belief updating), situational variables (in our case, training and teaming) and dispositional measures, such as fluid intelligence, cognitive reflection, numeracy and actively open-minded thinking.

The more data available on past accuracy of participants, the less noisy this measure becomes. As soon as we had 10 resolved questions on which to judge individual accuracy, past accuracy became the best single predictor of future performance. Once we had 50 resolved questions, past accuracy was more predictive of future performance than a model combining dispositional, situational and behavioral measures.
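A rough sketch of that kind of comparison follows. The data and variable names below are entirely hypothetical placeholders, not the project's dataset; the point is only to show how one could fit one simple least-squares model on past accuracy alone and another on dispositional/behavioral measures, and compare how well each explains later performance.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary-least-squares fit of y on X (with an intercept)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - residuals.var() / y.var()

# Hypothetical per-forecaster table: rows are forecasters (synthetic data).
rng = np.random.default_rng(0)
n = 400
skill = rng.normal(size=n)                                 # latent skill (unobserved)
past_accuracy = skill + rng.normal(scale=0.8, size=n)      # standardized score, early questions
features = np.column_stack([                               # dispositional/behavioral proxies
    0.4 * skill + rng.normal(size=n),                      # e.g. numeracy
    0.3 * skill + rng.normal(size=n),                       # e.g. update frequency
])
future_accuracy = skill + rng.normal(scale=0.8, size=n)    # standardized score, later questions

print("past accuracy alone:", round(r_squared(past_accuracy.reshape(-1, 1), future_accuracy), 2))
print("other predictors   :", round(r_squared(features, future_accuracy), 2))
```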

Upshot: Dispositional, behavioral and situational factors, as well as past performance, are highly predictive of individual accuracy.

6. Prediction polls and markets

The results discussed above are all derived from prediction polls (surveys), a method for crowdsourcing probability judgments. Individual estimates from prediction polls can be aggregated to produce wisdom-of-crowds forecasts. Prediction markets also produce crowd assessments, by aggregating the price signals of market participants. Which method produces more accurate forecasts, prediction polls or prediction markets?

We compared a continuous double auction market with individual and team-based prediction polls over the course of one year in the tournament. Forecasters were randomly assigned to conditions and produced more than 50,000 market orders and 100,000 probability predictions. Accuracy scores across 114 questions are shown below. Prediction markets outperformed simple, unweighted forecasts from polls. However, the accuracy of poll aggregates increased when we introduced three adjustments to the aggregation algorithms, each derived out of sample: temporal decay placed higher relative weights on more recent forecasts; past-performance weights increased the relative influence of individuals who updated their forecasts more frequently and had a better track record of accuracy; and a recalibration function pushed aggregated estimates toward the extremes of the probability scale. These adjustments helped team-based prediction polls significantly outperform prediction markets.
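The sketch below shows what those three adjustments could look like when aggregating poll forecasts on a single binary question. The half-life, the skill weights, and the extremization exponent are illustrative choices of mine; the project's actual algorithms and parameters differ.

```python
import numpy as np

def aggregate(probs, days_old, skill_weights, decay_halflife=7.0, extremize_a=2.0):
    """Aggregate individual probability forecasts for one binary question.

    probs:         individual probabilities (0-1)
    days_old:      how many days ago each forecast was made
    skill_weights: relative weight per forecaster (e.g. from past accuracy
                   and updating frequency); higher = more influence
    """
    probs = np.clip(np.asarray(probs, dtype=float), 1e-4, 1 - 1e-4)

    # 1. Temporal decay: newer forecasts count more.
    time_w = 0.5 ** (np.asarray(days_old, dtype=float) / decay_halflife)

    # 2. Past-performance weighting: multiply in the per-forecaster weights.
    w = time_w * np.asarray(skill_weights, dtype=float)
    p = np.average(probs, weights=w)

    # 3. Recalibration ("extremization"): push the aggregate toward 0 or 1,
    #    since simple averaging tends to pull crowd estimates toward 0.5.
    return p**extremize_a / (p**extremize_a + (1 - p)**extremize_a)

# Three forecasters: 70% (today, strong track record), 60% (3 days old), 55% (10 days old).
print(round(aggregate([0.70, 0.60, 0.55], [0, 3, 10], [2.0, 1.0, 1.0]), 2))
```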

Upshot: Probability prediction polls can produce more accurate crowd estimates than prediction markets.