This article explains how the Experiments report uses the confidence score, how Mixpanel calculates the confidence score, and how to interpret the confidence score. Send this article to your data scientist to learn about Mixpanel’s confidence score methodology.
Confidence scores come from the hypothesis testing framework in the field of statistics. In hypothesis testing, you first choose a null hypothesis. In Mixpanel, the null hypothesis is that two groups of users behave the same on average for a given metric. The groups of users might be variant and control groups in an A/B test, or they might just be two different cohorts of users. The alternative hypothesis is that the two groups of users behave differently for the metric.
When Mixpanel compares a metric for two cohorts of users, we calculate the probability that we would observe a metric difference equal to or greater than the difference between the two cohorts. That probability is called a p-value. Generally speaking, the smaller the p-value, the more likely it is that the null hypothesis is false, and the alternative hypothesis is true.
The confidence score is 1-p-value, expressed as a percentage. So the higher the confidence score, the more likely it is that the alternative hypothesis is true (meaning that the two cohorts really do behave differently for the metric in question). We follow the traditional threshold of 95% for the confidence score, so we highlight results above 95% confidence in green for positive differences and in red for negative differences.
Confidence Score Calculation
For event counts, we assume under the null hypothesis that each user cohort has a total event count that follows a Poisson distribution, where the parameter θ = cohort size * λ, and where λ is the same for both cohorts. For conversion rates, we assume under the null hypothesis that each user is a Bernoulli trial with the same parameter p. For both event counts and conversion rates, Mixpanel calculates the z-score, and the confidence score in the standard way. See this article for more information about the formulas Mixpanel uses for z-score calculations, and Poisson and binomial distributions.
Interpreting a Confidence Score
Generally speaking, higher confidence results mean that it is more likely that two cohorts of users differ significantly on your chosen metric. You can use the confidence score as a metric to quickly interpret large numbers of results. The higher the number of metrics you are analyzing, the higher percentage of those results that may be false positives.
If you are using our color-coded thresholds of 95%, there is a 5% chance that any individual result is a false positive. So if you are looking at 20 metrics at once, it is more likely that a larger number of those metrics could be false positives. If you want more precision in decision making, we recommend that you calculate your sample size prior to running an A/B test, and then only use the results you see in the Experimentation Report once you achieve that sample size. Higher confidence results are less likely to be false positives.