[DS Principles] How to tell if your data is uniformly distributed or not
Build better optimizers, draw better conclusions, and move forward with more confidence
Data Science Principles to Master
Knowing the distribution of your data helps you understand how to draw conclusions from it
If you’re building an optimizer, it’s important to know how your sampler explores the search space.
There are countless statistical tests you can use
Background and Context
A uniform distribution is one in which all outcomes are equally likely. For instance, if we consider a continuous uniform distribution over the interval [0, 1], any interval of equal length within this range should have the same probability. Similarly, for a discrete uniform distribution, each discrete value has an equal chance of occurring.
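As a quick illustration, here is a minimal sketch (using numpy, which the later example also relies on) of drawing from both kinds of uniform distribution and checking that the empirical frequencies come out roughly equal:
import numpy as np
rng = np.random.default_rng(0)
# Continuous uniform on [0, 1]: each of 10 equal-width bins should receive ~10% of the draws
continuous = rng.uniform(0, 1, 10_000)
counts, _ = np.histogram(continuous, bins=10, range=(0, 1))
print(counts / counts.sum())  # each entry should be close to 0.1
# Discrete uniform on {0, 1, ..., 5}: each value should appear about 1/6 of the time
discrete = rng.integers(0, 6, 10_000)
print(np.bincount(discrete) / discrete.size)  # each entry should be close to 1/6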
When we collect a sample of data, we might want to test if this sample is consistent with coming from a uniform distribution. This is where uniformity tests come in.
Common Tests for Uniformity
Kolmogorov-Smirnov Test (KS Test): This is a non-parametric test that compares the empirical distribution function (EDF) of the sample to the cumulative distribution function (CDF) of the hypothesized distribution, in this case, the uniform distribution. The test statistic is the maximum absolute difference between the EDF and CDF. A significant result indicates that the sample does not come from a uniform distribution.
Anderson-Darling Test: Similar to the KS Test, this test also compares the EDF of the sample to the CDF of the uniform distribution. However, it gives more weight to differences in the tails. It can be more powerful than the KS test in detecting deviations from uniformity, especially at the tails.
Chi-squared Test: This test is typically used for discrete distributions, but it can be applied to continuous data by binning. The idea is to divide the range of the data into k intervals and then compare the observed frequency in each interval to the expected frequency under uniformity. If the sample comes from a uniform distribution, the observed frequencies should be close to the expected frequencies (see the sketch after this list).
Cramér-von Mises Criterion: This is another test that compares the EDF of the sample to the CDF of the uniform distribution. The test statistic is based on the squared difference between the two, integrated over the range; it is also covered in the sketch below.
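Here is a minimal sketch of the last two tests, assuming the data lives on [0, 1]. It uses scipy.stats.chisquare and scipy.stats.cramervonmises (the latter requires SciPy 1.6 or newer); note that, as far as I know, scipy.stats.anderson does not support the uniform distribution directly, so the Anderson-Darling statistic would have to be computed by hand.
import numpy as np
from scipy.stats import chisquare, cramervonmises
rng = np.random.default_rng(0)
sample = rng.uniform(0, 1, 1_000)  # toy sample; substitute your own data
# Chi-squared: bin the data into k equal-width intervals and compare the
# observed counts with the equal counts expected under uniformity
k = 10
observed, _ = np.histogram(sample, bins=k, range=(0, 1))
expected = np.full(k, sample.size / k)
chi2_stat, chi2_p = chisquare(observed, expected)
# Cramér-von Mises: compares the empirical CDF to the uniform CDF,
# integrating the squared difference over the whole range
cvm_result = cramervonmises(sample, 'uniform')
print(f"chi-squared: statistic={chi2_stat:.3f}, p-value={chi2_p:.3f}")
print(f"Cramér-von Mises: statistic={cvm_result.statistic:.3f}, p-value={cvm_result.pvalue:.3f}")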
Example: Kolmogorov-Smirnov Test
Let's demonstrate how to run the Kolmogorov-Smirnov test for uniformity in Python. I'll generate a deliberately contaminated sample (half uniform, half normal) and then test it.
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Generate a sample: half from uniform(0, 1), half from a normal centered at 0.5
sample = np.concatenate([rng.uniform(0, 1, 500), rng.normal(0.5, 0.1, 500)])

# Perform the KS test against the standard uniform distribution on [0, 1]
statistic, p_value = kstest(sample, 'uniform')
print(f"KS statistic: {statistic:.4f}, p-value: {p_value:.4g}")
Here, a low p-value (typically below 0.05) indicates that the sample is unlikely to have come from a uniform distribution. Because half of this sample was drawn from a narrow normal distribution, we expect the test to reject uniformity.
Go forth and sample
You can use this approach to build unit tests or data quality checks that verify your sampling routines are handling data correctly.
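For instance, here is a minimal pytest-style sketch, assuming a hypothetical my_sampler function that is supposed to produce uniform draws on [0, 1]; the sample size and p-value threshold are arbitrary choices you would tune to your own tolerance for false alarms.
import numpy as np
from scipy.stats import kstest

def my_sampler(n):
    """Hypothetical stand-in for the routine under test."""
    return np.random.default_rng(123).uniform(0, 1, n)

def test_sampler_is_uniform():
    sample = my_sampler(5_000)
    # All values must fall in the expected range
    assert sample.min() >= 0.0 and sample.max() <= 1.0
    # KS test against uniform(0, 1); a tiny p-value suggests the sampler has drifted
    _, p_value = kstest(sample, 'uniform')
    assert p_value > 0.01, f"sampler output does not look uniform (p={p_value:.4g})"
Note that the fixed seed inside my_sampler is what keeps this test deterministic; with a fresh seed on every run, the assertion would fail roughly 1% of the time purely by chance.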