Mean vs Median vs Mode: Understanding Central Tendency in Data Analysis

In the world of data analysis, understanding your data begins with identifying its central tendency – the value that best represents a dataset. Whether you’re analyzing student marks, business revenue, or training a machine learning model, three fundamental statistical measures play a crucial role:

Mean (Average)
Median (Middle Value)
Mode (Most Frequent Value)

These measures are not just theoretical concepts but are widely used in real-world scenarios such as:

Evaluating business performance
Analyzing academic results
Understanding salary distributions
Preparing datasets for machine learning models

However, each measure tells a different story about the data. Choosing the wrong one can lead to misleading conclusions – especially when dealing with skewed data or outliers.
In this article, we will break down:

The theory behind mean, median, and mode
Their mathematical formulation
Python implementations (both basic and using libraries)
When and where to use each in real-world applications

Mean

The mean, commonly known as the average, is the sum of all values in a dataset divided by the total number of values.
It is the most widely used measure of central tendency because it is simple to compute and takes every value in the dataset into account.

Formula

Mean:

$ \bar{x} = \frac{\sum x}{n} $

Where:

$ \sum x $ = Sum of all data values

$ n $ = Number of values

Consider an example

$$X = [10, 20, 30, 40, 50]$$

Step-by-step calculation:

$$\text{Sum} = 10 + 20 + 30 + 40 + 50 = 150$$

$$\text{Number of values } (n) = 5$$

$$\text{Mean } (\bar{x}) = \frac{150}{5} = 30$$

Important Insight (Outlier Effect)

The mean is highly sensitive to outliers.
Example: $$[10, 20, 30, 40, 1000]$$

$$\text{Mean} = \frac{1100}{5} = 220$$

Notice how one extreme value (1000) drastically changes the result.
This is why the mean is not always reliable for skewed data.

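Python Implementation

def mean(data):
    # Sum of all values divided by the number of values
    if not data:
        raise ValueError("mean requires at least one value")
    return sum(data) / len(data)

data = [10, 20, 30, 40, 50]
print("Mean:", mean(data))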
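
For the library-based route, one option is Python's built-in `statistics` module. The snippet below is a minimal sketch that assumes Python 3.8 or later (needed for `fmean`) and reuses the same example dataset.

```python
import statistics

data = [10, 20, 30, 40, 50]

# statistics.mean accepts ints, floats, Fraction and Decimal values
lib_mean = statistics.mean(data)

# statistics.fmean converts everything to float first and always returns a float
lib_fmean = statistics.fmean(data)

print("statistics.mean:", lib_mean)
print("statistics.fmean:", lib_fmean)
```

Both functions raise `statistics.StatisticsError` on an empty dataset, so there is no need to guard against division by zero yourself.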

Median (Middle Value)

The median is the middle value of a dataset when the values are arranged in ascending or descending order.

Unlike the mean, the median focuses on the center position rather than the actual values, making it a more robust measure when dealing with uneven or skewed data.

Formula

$$\text{Median} = \begin{cases} x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\ \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even} \end{cases}$$

Where

$ x $ = Ordered dataset, with $ x_k $ denoting its k-th value (counting from 1)

$ n $ = Number of values

Odd vs Even Cases

Odd Number of Values

Dataset: $$[10, 20, 30, 40, 50]$$

$$\text{Middle value} = 30$$

$$\text{Median} = 30$$

Even Number of Values

Dataset: $$[10, 20, 30, 40]$$

$$\text{Middle values} = (20, 30)$$

$$\text{Median} = \frac{20 + 30}{2} = 25$$

Important Insight (Outlier Resistance)

The median is largely unaffected by extreme values (outliers), because it depends on the position of the values rather than their magnitude.

Example:

$$[10, 20, 30, 40, 1000]$$

$$\text{Median} = 30$$

Even though 1000 is very large, the median remains stable.
This makes it ideal for skewed distributions like salary data.
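
The contrast between the two measures can be checked directly. The sketch below uses the standard-library `statistics` module on the article's two example datasets; the variable names `clean` and `skewed` are only illustrative.

```python
import statistics

clean = [10, 20, 30, 40, 50]
skewed = [10, 20, 30, 40, 1000]  # same data, with the last value replaced by an outlier

# Print both measures side by side for each dataset
for name, values in (("clean", clean), ("skewed", skewed)):
    print(name, "mean:", statistics.mean(values), "median:", statistics.median(values))
```

The mean jumps from 30 to 220 once the outlier is introduced, while the median stays at 30 for both datasets.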

Median for Grouped Data

When data is presented in the form of class intervals (grouped data), we cannot directly find the middle value. Instead, we use a formula to estimate the median.

$$\text{Median} = L_1 + \frac{\frac{N}{2} - cf}{f} \times i$$

where:

L1: Lower limit of the median class.
N: Total frequency (sum of all frequencies).
cf: Cumulative frequency of the class preceding the median class.
f: Frequency of the median class itself.
i: Class width or interval size (e.g., if the class is 20-30, i = 10).

Consider the example below:

Class Interval (x)	Frequency (f)
0 - 10	5
10 - 20	10
20 - 30	12
30 - 40	8
40 - 50	17
50 - 60	8

Let’s find the median for Grouped Data:

Step 1: Find the cumulative frequency (c.f) by adding frequencies in each interval.

Class Interval (x)	Frequency (f)	Cumulative Frequency (c.f)
0 - 10	5	5
10 - 20	10	5 + 10 = 15
20 - 30	12	15 + 12 = 27 (c.f)
(L1)30 - 40	8 (f)	27 + 8 = 35
40 - 50	17	35 + 17 = 52
50 - 60	8	52 + 8 = 60
	N = 60

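Step 2: Find the median class by locating the class that contains the (N/2)th item

$$\text{Median}(M) = \text{Size of } \left[\frac{N}{2}\right]^{th} \text{ item}$$

$$= \text{Size of } \left[\frac{60}{2}\right]^{th} \text{ item} = \text{Size of } 30^{th} \text{ item}$$

The cumulative frequency first reaches 30 in the class 30 – 40 (it rises from 27 to 35 there). Hence, the median lies in the range of (30 – 40).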

L1 = 30, f = 8, i = 10, c.f = 27, N/2 = 30

Applying the Median formula we get

$$\text{Median} = 30 + \left( \frac{30 - 27}{8} \right) \times 10$$

$$= 30 + \frac{30}{8} = 30 + 3.75 = \mathbf{33.75}$$
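
The grouped-data calculation can also be sketched in Python. The function name `grouped_median` and the `(lower, upper, frequency)` tuple layout are assumptions made for illustration. The frequencies are the ones from the cumulative-frequency table, and the classes are assumed to be sorted and contiguous.

```python
def grouped_median(classes):
    """Estimate the median from (lower, upper, frequency) class intervals."""
    total = sum(freq for _, _, freq in classes)  # N
    if total <= 0:
        raise ValueError("classes must contain at least one observation")

    half = total / 2   # N / 2
    cumulative = 0     # cf: cumulative frequency of the classes before the current one

    for lower, upper, freq in classes:
        # The median class is the first one whose cumulative frequency reaches N / 2
        if cumulative + freq >= half:
            width = upper - lower  # i
            return lower + (half - cumulative) / freq * width
        cumulative += freq


classes = [
    (0, 10, 5),
    (10, 20, 10),
    (20, 30, 12),
    (30, 40, 8),
    (40, 50, 17),
    (50, 60, 8),
]
print("Grouped median:", grouped_median(classes))
```

With these frequencies the function lands on the 30 – 40 class and returns 33.75, matching the manual calculation.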

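Python Implementation

def median(data):
    if not data:
        raise ValueError("median requires at least one value")
    data = sorted(data)  # sorted copy; the caller's list is left untouched
    n = len(data)
    mid = n // 2

    if n % 2 == 0:
        # Even count: average the two middle values
        return (data[mid - 1] + data[mid]) / 2
    else:
        # Odd count: return the single middle value
        return data[mid]

data = [10, 20, 30, 40, 50]
print("Median:", median(data))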
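
A library-based alternative is the standard-library `statistics` module. The sketch below shows `median` together with `median_low` and `median_high`, which return an actual data point rather than an average when the count is even. The `even` list is deliberately left unsorted.

```python
import statistics

odd = [10, 20, 30, 40, 50]
even = [40, 10, 30, 20]  # unsorted on purpose; the functions sort internally

print("median (odd):", statistics.median(odd))
print("median (even):", statistics.median(even))      # average of 20 and 30
print("median_low:", statistics.median_low(even))     # lower of the two middle values
print("median_high:", statistics.median_high(even))   # higher of the two middle values
```

`median_low` and `median_high` can be handy when the result must be a value that actually occurs in the data, for example with discrete counts.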

When to Use the Median?

Use the median when:

Data contains outliers
Distribution is skewed
You want the value that splits the dataset into two equal halves

Mode (Most Frequent Value)

The mode is the value that appears most frequently in a dataset.
Unlike mean and median, the mode is particularly useful for identifying common or popular values, especially in categorical or discrete data.

Types of Mode

Depending on the dataset, mode can be:

Unimodal -> One most frequent value
Bimodal -> Two values with the same highest frequency
Multimodal -> More than two values sharing the highest frequency

Examples

1. Unimodal Dataset

$$[1, 2, 2, 3, 4]$$

$$\text{Mode} = 2$$

2. Bimodal Dataset

$$[1, 2, 2, 3, 3, 4]$$

$$\text{Mode} = 2, 3$$

The mode is not affected by outliers
It works well with categorical data (e.g., most popular product, most chosen option)
There can be multiple modes, whereas the mean and median each yield a single value
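
To tie the mode types to code, here is a minimal sketch that counts values with `collections.Counter` and labels the result. The helper name `describe_modes` and the drink names are hypothetical, chosen only to show that the same logic works for categorical data.

```python
from collections import Counter


def describe_modes(data):
    """Return (modes, label), where label is unimodal, bimodal or multimodal."""
    counts = Counter(data)
    if not counts:
        raise ValueError("data must contain at least one value")

    top = max(counts.values())
    # Every value tied for the highest count is a mode
    modes = [value for value, count in counts.items() if count == top]

    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label


print(describe_modes([1, 2, 2, 3, 4]))
print(describe_modes([1, 2, 2, 3, 3, 4]))
print(describe_modes(["tea", "coffee", "tea", "juice"]))
```

One caveat of this simple labelling: if every value occurs exactly once, all of them tie for the highest count, so the label says little about the data.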

Python Implementation

def mode(data):
    # Count how many times each value occurs
    freq = {}
    for num in data:
        freq[num] = freq.get(num, 0) + 1

    # Keep every value tied for the highest count, so bimodal data returns both modes
    max_freq = max(freq.values())
    modes = [key for key, val in freq.items() if val == max_freq]

    return modes

data = [1, 2, 2, 3, 3, 4]
print("Mode:", mode(data))
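
The standard library offers the same functionality through the `statistics` module. This is a short sketch that assumes Python 3.8 or later, where `multimode` is available and `mode` no longer raises an error on ties.

```python
import statistics

data = [1, 2, 2, 3, 3, 4]

# multimode returns every value tied for the highest count, in first-seen order
all_modes = statistics.multimode(data)

# mode returns a single value; on a tie it picks the first one encountered
single_mode = statistics.mode(data)

print("multimode:", all_modes)
print("mode:", single_mode)
```

For bimodal or multimodal data, `multimode` is usually the safer choice, since `mode` silently reports only one of the tied values.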