You Know Nothing About Your Standard Deviation

6 min read · Sep 17, 2024

Companies run a great deal of data analysis every day to better understand performance, customer behavior, and market trends.

Many people are interested in seeing customer segments, p-values, forecasts, z-scores, etc. in their executive summaries. However, most of us skip the fundamentals of statistics. In this article, we will cover the standard deviation and the normal distribution. The goal is to show how you can manipulate your data to optimize strategies.

Starting from Standard Deviation and Normal Distribution

Standard deviation is a statistical measurement that looks at how far individual points in a dataset are dispersed from the mean of that set. It is calculated as the square root of the variance.

Low Standard Deviation: Data points are close to the mean, indicating low variability.
High Standard Deviation: Data points are spread out over a wider range, indicating high variability.
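
As a minimal sketch of the definition above, the snippet below computes the variance and the standard deviation by hand and then with NumPy. The sample values are made up for illustration. Keep in mind that `np.std` defaults to the population formula (`ddof=0`), while the pandas `.std()` method defaults to the sample formula (`ddof=1`), so the two can differ slightly on the same data.

```python
import numpy as np

# Hypothetical sample values, made up for illustration only
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()

# Population variance: the average squared distance from the mean
variance = ((values - mean) ** 2).mean()

# Standard deviation: the square root of the variance
std_population = np.sqrt(variance)

# Sample standard deviation divides by n - 1 instead of n
std_sample = values.std(ddof=1)

print(mean, variance, std_population, std_sample)
```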

Important Note: I would like to remind you that a low SD does not always mean good data. Imagine a low SD in disease diagnosis: test results that barely vary between patients may fail to separate healthy from sick people and lead to false results. In other words, whether a low or high SD is desirable depends on the context.

A normal distribution is a symmetric, bell-shaped curve where most of the observations cluster around the central peak, the mean. The spread of the data is determined by the standard deviation.

**Example of a Normally Distributed Histogram**

The Empirical Rule (68–95–99.7 rule) states that for a normal distribution:

About 68% of the data falls within ±1σ of the mean.
About 95% falls within ±2σ.
About 99.7% falls within ±3σ.
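
The percentages above can be checked with a few lines of code. This is a small sketch that uses the standard relation between the normal distribution and the error function; `share_within` is a helper name chosen here for illustration.

```python
import math

def share_within(k: float) -> float:
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k}σ: {share_within(k):.2%}")
```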

Why Does σ Matter?

Performance: By calculating the mean and standard deviation for KPIs like conversion rates, we can identify whether a campaign is performing within the expected range. If a campaign’s performance falls outside ±3σ of the mean, it may indicate underlying issues or outliers, such as technical problems or misaligned targeting.
Anomalies: Opinions differ on where to draw the line, and there is no single right answer. Values falling beyond the ±3σ range are statistically rare, and we can consider them outliers. However, who doesn’t want faster solutions, even with riskier decisions? In such circumstances, we can define our own σ-based limit for the dataset. Why not? Below, you will find more details about it.
Decision-Making: Beyond those evaluations, σ provides a framework for confidence in data. For example, an A/B test that shows a difference in conversion rates beyond ±2σ or ±3σ (measured in standard errors of that difference) could suggest a genuine improvement rather than random chance.
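
To make the A/B test point concrete, here is a hedged sketch of a two-proportion z-score. It assumes the ±2σ and ±3σ thresholds are read in units of the standard error of the difference, which is one common way to apply the idea. The function name `ab_z_score` and the visitor and conversion counts are hypothetical.

```python
import math

def ab_z_score(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-score of the conversion-rate difference (B minus A), using a pooled standard error."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the assumption that A and B perform the same
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (rate_b - rate_a) / std_error

# Hypothetical counts, made up for illustration
z = ab_z_score(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}")
```

With these made-up numbers the difference lands beyond 2 standard errors but short of 3, so the verdict depends on which limit you chose in advance.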

Examples of σ with Real Life Datasets

So far, we have mostly discussed the theoretical aspects of standard deviation. From this point onward, some ideas will start to form in your mind. If you are familiar with datasets, you might already be asking: what if the data is not normally distributed? Calm down, we are getting to that point slowly. Both datasets below are real data that I analyzed in case studies.

Imagine you’re analyzing users who subscribed to your fitness app, and you’ve already collected user variables like weight and height. Below is the calculation process, step by step.

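import matplotlib.pyplot as plt

# df_filtered is the cleaned DataFrame of app users, loaded beforehand
mean_height = df_filtered['height'].mean()
stdev_height = df_filtered['height'].std()  # pandas uses the sample formula (ddof=1)

plt.figure(figsize=(10, 6))
plt.hist(df_filtered['height'], bins=30, color='blue', edgecolor='black', alpha=0.7)

# Mark the mean and the ±3σ limits on the histogram
plt.axvline(mean_height, color='red', linestyle='dashed', linewidth=2, label='Mean')
plt.axvline(mean_height + 3 * stdev_height, color='green', linestyle='dashed', linewidth=2, label='+3 Stdev')
plt.axvline(mean_height - 3 * stdev_height, color='orange', linestyle='dashed', linewidth=2, label='-3 Stdev')

plt.legend()  # without this call the labels above are never shown
plt.show()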

**Height Distribution of Fitness App Dataset**

The above histogram visualizes the heights of users who subscribed to the app. We can easily say that this dataset is approximately normally distributed, so there is nothing surprising here. A classic distribution, just like our teachers showed us at university.

BUT REALITY KICKED OUR ASS IN WORK LIFE.

Long story short, we are never ever playing with normally distributed datasets after university. Here comes the skewness, especially the right-skewed histograms. It does not matter in which sector you are working; it is an indisputable fact that you are handling a dataset with skewness every day.

Theoretical explanations mean nothing when it comes to skewed data. Skewness happens because of boundaries in your data, outliers, population heterogeneity, sampling bias, etc. There are several real-life examples of skewed data, like income distribution, human age at death, or exam scores.

Imagine that you are analyzing the performance of your e-commerce brand’s products by category, and you see something like the chart below. A huge number of data points sit just near zero. It shows there are lots of product categories that were added to cart fewer than 10 times, and our aim is to make the chart readable for analysts.

So, What Now?

Even if your data is skewed, you can still bring it closer to a normal distribution with several different methods. You can find the methods and the related Python code below. While transforming the data, keep your original column, and note that each line below applies a different method.

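import numpy as np

# Natural log of (1 + x), which stays defined when a category has 0 add-to-cart events
df_filtered['ln_category_added'] = np.log1p(df_filtered['category_added'])

# Square root: a milder correction than the log
df_filtered['sqrt_category_added'] = np.sqrt(df_filtered['category_added'])

# Cube root: sits between the square root and the log in strength
df_filtered['cbrt_category_added'] = np.cbrt(df_filtered['category_added'])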

Each transformation has its own histogram and its own skewness value. I highly recommend checking that the method suits your dataset (for example, the logarithm and the square root need non-negative values) and checking the skewness value after the correction. A skewness close to 0 shows that the new histogram has moved closer to a normal distribution. Also, adding the corrected value as a new column lets you still show the original values on the histogram by using the code below.

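# plt.hist returns the counts, bin edges, and bar patches used below (bins=30 is an assumed bin count)
n, bins, patches = plt.hist(df_filtered['ln_category_added'], bins=30, edgecolor='black')

# Write each bar's count above it
for i in range(len(patches)):
    plt.text(bins[i] + (bins[i+1] - bins[i]) / 2,   # x: center of the bin
             n[i],                                  # y: top of the bar
             int(n[i]),                             # label: count in the bin
             ha='center', va='bottom', fontsize=10)

# Label the x-axis with the original value, followed by the log value in parentheses
xticks_positions = np.linspace(df_filtered['ln_category_added'].min(), df_filtered['ln_category_added'].max(), num=12)
xticks_labels = [f"{int(round(np.expm1(pos)))}({round(pos, 2)})" for pos in xticks_positions]
plt.xticks(xticks_positions, xticks_labels, rotation=45)
plt.show()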
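
The skewness check recommended above can be sketched as follows. Since the real dataset is not available here, the example generates hypothetical right-skewed counts that stand in for `df_filtered['category_added']`. The `demo` DataFrame and its parameters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed counts standing in for the real add-to-cart column
rng = np.random.default_rng(42)
demo = pd.DataFrame(
    {"category_added": rng.lognormal(mean=2.0, sigma=1.0, size=1000).astype(int)}
)

transforms = {
    "original": demo["category_added"],
    "log1p": np.log1p(demo["category_added"]),
    "sqrt": np.sqrt(demo["category_added"]),
    "cbrt": np.cbrt(demo["category_added"]),
}

# A skewness near 0 means the distribution is roughly symmetric
skewness = pd.Series({name: col.skew() for name, col in transforms.items()})
print(skewness.round(2))
```

Comparing the values side by side shows which transformation pulls the skewness closest to 0 for your data. The winner may differ from dataset to dataset.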

It is obvious that our histogram is still not normally distributed, but that does not mean we cannot continue analyzing with that visual. Remember, we had a highly skewed chart at first; then we reached this. From here on, it is up to our own knowledge. You should ask the right questions to drive value-added work. Here are some example strategies below.

Strategies

1. Height Distribution

Since the largest portion of the sample data lies near the mean, forecasting or targeting could focus on those individuals.
People are not always reliable when signing up for a platform. Because of that, you could consider handling or investigating the outliers below 130 cm and above 200 cm. They might represent measurement errors or data entry issues.
The dataset could be segmented based on height ranges like short, average and tall categories. This could be useful for analysis related to fitness behaviors, as height might influence different fitness activities or performance.
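
The outlier check and the segmentation described above could look like the sketch below. The heights are made up for illustration. Treating users more than 1 standard deviation from the mean as "short" or "tall" is an assumption, since the right cutoffs depend on your product.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm, made up for illustration
users = pd.DataFrame({"height": [120, 158, 165, 170, 172, 175, 181, 190, 215]})

# Flag implausible entries for review instead of silently dropping them
users["suspect"] = ~users["height"].between(130, 200)

# Compute the mean and SD on the plausible rows only
valid = users.loc[~users["suspect"], "height"]
mean, sd = valid.mean(), valid.std()

# Assumed cutoffs: more than 1 SD below or above the mean counts as short or tall
users["segment"] = pd.cut(
    users["height"],
    bins=[-np.inf, mean - sd, mean + sd, np.inf],
    labels=["short", "average", "tall"],
)
print(users)
```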

2. Add to Cart Distribution

Product categories with the highest number of add to cart events (outliers) should be examined for potential trends. Are they on sale, highly reviewed, or promoted differently?
Consider optimizing the category listing pages, promotions, or visibility for products with lower than average add to cart events.
Consider segmenting the products into categories such as high performers (above +1 standard deviation), average performers (within 1 standard deviation of the mean), and low performers (below −1 standard deviation). This segmentation can help to find the pain points in the data and improve the marketing strategies.
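
A hedged sketch of that segmentation is shown below. It assumes the z-scores are computed on the log-transformed counts, so that a few huge categories do not dominate the mean and standard deviation, and it assumes low performers sit more than 1 standard deviation below the mean so the three groups do not overlap. The category names and counts are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical add-to-cart counts per category, made up for illustration
df = pd.DataFrame({
    "category": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "category_added": [0, 3, 5, 8, 12, 20, 45, 900],
})

# z-score on the log1p scale: distance from the mean in standard deviations
log_added = np.log1p(df["category_added"])
z = (log_added - log_added.mean()) / log_added.std()

# Beyond +1 SD is high, beyond -1 SD is low, everything else is average
df["performance"] = np.select(
    [z > 1, z < -1],
    ["high", "low"],
    default="average",
)
print(df)
```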

Conclusion

Understanding and applying standard deviation gives digital marketers one more tool to optimize campaign performance, identify anomalies, and make confident decisions based on data. Do not give up on your data if it’s skewed. You can directly apply this statistical concept whether you are monitoring campaign metrics, conducting A/B tests, or trying to understand customer behavior.

Follow Me on LinkedIn: https://www.linkedin.com/in/deniz-mogulkoc/

Written by Deniz Mogulkoc
