Mean vs Median vs Mode: Understanding Central Tendency in Data Analysis

In the world of data analysis, understanding your data begins with identifying its central tendency – the value that best represents a dataset. Whether you’re analyzing student marks, business revenue, or training a machine learning model, three fundamental statistical measures play a crucial role:

  • Mean (Average)
  • Median (Middle Value)
  • Mode (Most Frequent Value)

These measures are not just theoretical concepts but are widely used in real-world scenarios such as:

  • Evaluating business performance
  • Analyzing academic results
  • Understanding salary distributions
  • Preparing datasets for machine learning models

However, each measure tells a different story about the data. Choosing the wrong one can lead to misleading conclusions – especially when dealing with skewed data or outliers.
In this article, we will break down:

  • The theory behind mean, median, and mode
  • Their mathematical definitions
  • Python implementations (both basic and using libraries)
  • When and where to use each in real-world applications

 

Mean

The mean, commonly known as the average, is the sum of all values in a dataset divided by the total number of values.
It is the most widely used measure of central tendency because it gives a single value representing the entire dataset.

Formula

Mean:

 \( \bar{x} = \frac{\sum x}{n} \)

Where:

\( \sum x \) = Sum of all data values 

n = number of values

Consider an example

$$X = [10, 20, 30, 40, 50]$$

Step-by-step calculation:

$$\text{Sum} = 10 + 20 + 30 + 40 + 50 = 150$$
$$\text{Number of values (n)} = 5$$
$$\text{Mean} (\bar{x}) = \frac{150}{5} = 30$$
 

Important Insight (Outlier Effect)

The mean is highly sensitive to outliers.
Example:

$$[10, 20, 30, 40, 1000]$$

$$\text{Mean} = \frac{1100}{5} = 220$$

Notice how a single extreme value (1000) pulls the mean from 30 up to 220, a value larger than four of the five data points.
This is why the mean is not always reliable for skewed data.
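As a quick worked check on where the jump comes from, using the two example datasets: replacing 50 with 1000 raises the sum by 950, and that increase is spread over n = 5 values.

$$\bar{x}_{\text{new}} = \bar{x}_{\text{old}} + \frac{1000 - 50}{5} = 30 + 190 = 220$$

In general, changing one value by \( \Delta \) shifts the mean by \( \Delta / n \), so the pull of a single outlier shrinks as the dataset grows.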
 

Python Implementation

 
				
def mean(data):
    # Sum of all values divided by the number of values
    return sum(data) / len(data)

data = [10, 20, 30, 40, 50]
print("Mean:", mean(data))
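The outline above also mentions library-based implementations. As a minimal sketch, assuming Python 3.8 or newer, the standard-library `statistics` module can do the same calculation. The variable names `lib_mean` and `fast_mean` are illustrative only.

```python
import statistics

data = [10, 20, 30, 40, 50]

# statistics.mean raises StatisticsError on an empty dataset,
# where the hand-written version would raise ZeroDivisionError.
lib_mean = statistics.mean(data)

# statistics.fmean (Python 3.8+) converts the values to float
# and always returns a float.
fast_mean = statistics.fmean(data)

print("Mean (statistics.mean):", lib_mean)
print("Mean (statistics.fmean):", fast_mean)
```

Third-party libraries such as NumPy offer a comparable `mean` function for array data; the standard library is used here only to keep the sketch dependency-free.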

				
			

Median (Middle Value)

The median is the middle value of a dataset when the values are arranged in ascending or descending order.

Unlike the mean, the median depends on the position of values in the sorted order rather than on their magnitudes, making it a more robust measure when dealing with uneven or skewed data.

Formula

$$\text{Median} = \begin{cases} x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\ \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even} \end{cases}$$
 
Where:

n = Number of values

x = Ordered dataset (\( x_k \) is the k-th value in sorted order, counting from 1)

 

Odd vs Even Cases

Odd Number of Values

Dataset: $$[10, 20, 30, 40, 50]$$

$$\text{Middle value} = 30$$

$$\text{Median} = 30$$

 

Even Number of Values

Dataset: $$[10, 20, 30, 40]$$

$$\text{Middle values} = (20, 30)$$

$$\text{Median} = \frac{20 + 30}{2} = 25$$
 

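Important Insight (Outlier Resistant)

The median is largely unaffected by extreme values (outliers), because changing the largest or smallest value does not change which value sits in the middle.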

Example:

$$[10, 20, 30, 40, 1000]$$

$$\text{Median} = 30$$

Even though 1000 is very large, the median remains stable.
This makes it ideal for skewed distributions like salary data.
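To make the contrast concrete, here is a small sketch that computes both measures for the two example datasets. It uses the standard-library `statistics` module, and `summarize` is a hypothetical helper introduced only for this illustration.

```python
from statistics import mean, median

typical = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 1000]


def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)


for label, values in [("typical", typical), ("with outlier", with_outlier)]:
    m, med = summarize(values)
    print(f"{label}: mean={m}, median={med}")
```

The mean moves from 30 to 220 when the outlier is introduced, while the median stays at 30 in both cases.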

 

Median for Grouped Data

When data is presented in the form of class intervals (grouped data), we cannot directly find the middle value. Instead, we use a formula to estimate the median.

$$\text{Median} = L_1 + \frac{\frac{N}{2} - cf}{f} \times i$$

where:

  1. L1: Lower limit of the median class.
  2. N: Total frequency (sum of all frequencies).
  3. cf: Cumulative frequency of the class preceding the median class.
  4. f: Frequency of the median class itself.
  5. i: Class width or interval size (e.g., if the class is 20-30, i = 10).

Consider the example below:

| Class Interval (x) | Frequency (f) |
|---|---|
| 0 - 10 | 5 |
| 10 - 20 | 10 |
| 20 - 30 | 12 |
| 30 - 40 | 8 |
| 40 - 50 | 17 |
| 50 - 60 | 8 |

Let’s find the median for Grouped Data:

Step 1: Find the cumulative frequency (cf) by keeping a running total of the frequencies.

| Class Interval (x) | Frequency (f) | Cumulative Frequency (cf) |
|---|---|---|
| 0 - 10 | 5 | 5 |
| 10 - 20 | 10 | 5 + 10 = 15 |
| 20 - 30 | 12 | 15 + 12 = 27 (cf) |
| (L1) 30 - 40 | 8 (f) | 27 + 8 = 35 |
| 40 - 50 | 17 | 35 + 17 = 52 |
| 50 - 60 | 8 | 52 + 8 = 60 |

N = 60

Step 2: Find the median class, the class that contains the \( \frac{N}{2} \)-th item:

$$\text{Median}(M) = \text{Size of } \left[\frac{N}{2}\right]^{th} \text{item}$$
$$= \text{Size of } \left[\frac{60}{2}\right]^{th} \text{item} = \text{Size of } 30^{th} \text{item}$$

The cumulative frequency first reaches 30 in the class 30 - 40 (it is 27 before this class and 35 after it). Hence, the median lies in the range 30 - 40.

L1 = 30, f = 8, i = 10, cf = 27

Applying the median formula, we get:

$$\text{Median} = 30 + \left( \frac{30 - 27}{8} \right) \times 10$$

$$= 30 + 3.75 = \mathbf{33.75}$$
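The grouped-median steps can also be sketched in Python. This is a minimal illustration under stated assumptions: the classes are given as sorted, contiguous `(lower_limit, upper_limit, frequency)` tuples, and the function name `grouped_median` is hypothetical rather than a library call. The frequencies are the ones used in the cumulative-frequency table in Step 1.

```python
def grouped_median(classes):
    """Estimate the median from (lower_limit, upper_limit, frequency) tuples.

    Assumes the classes are sorted, contiguous, and non-overlapping.
    """
    total = sum(freq for _, _, freq in classes)  # N
    if total <= 0:
        raise ValueError("total frequency must be positive")

    half = total / 2  # N / 2
    cumulative = 0    # cf: cumulative frequency before the current class

    for lower, upper, freq in classes:
        if cumulative + freq >= half:
            width = upper - lower  # i
            # Median = L1 + ((N/2 - cf) / f) * i
            return lower + (half - cumulative) / freq * width
        cumulative += freq

    # Not reachable when total > 0; kept as a defensive guard.
    raise ValueError("median class not found")


classes = [
    (0, 10, 5),
    (10, 20, 10),
    (20, 30, 12),
    (30, 40, 8),
    (40, 50, 17),
    (50, 60, 8),
]
print("Grouped median:", grouped_median(classes))  # 33.75
```

The loop stops at the first class whose cumulative frequency reaches N/2, which is the same rule used to pick the median class by hand.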
 

Python Implementation

				
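def median(data):
    # Sort a copy so the caller's list is left unchanged
    data = sorted(data)
    n = len(data)
    mid = n // 2

    if n % 2 == 0:
        # Even count: average the two middle values
        return (data[mid - 1] + data[mid]) / 2
    else:
        # Odd count: return the single middle value
        return data[mid]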

data = [10, 20, 30, 40, 50]
print("Median:", median(data))
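For comparison, a library-based sketch using the standard-library `statistics` module is shown below. The `even` list is deliberately left unsorted to illustrate that `statistics.median` sorts its input internally; the variable names are illustrative.

```python
import statistics

odd = [10, 20, 30, 40, 50]
even = [40, 10, 30, 20]  # unsorted on purpose

# Odd count: the single middle value of the sorted data
median_odd = statistics.median(odd)

# Even count: the average of the two middle values (20 and 30)
median_even = statistics.median(even)

print("Median (odd):", median_odd)
print("Median (even):", median_even)
```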
				
			

When to Use Median?

Use median when:

  • Data contains outliers
  • Distribution is skewed
  • You want a value that splits the data into two equal halves

 

Mode (Most Frequent Value)

The mode is the value that appears most frequently in a dataset.
Unlike mean and median, the mode is particularly useful for identifying common or popular values, especially in categorical or discrete data.

 

Types of Mode

Depending on the dataset, mode can be:

  • Unimodal: one value has the highest frequency
  • Bimodal: two values share the highest frequency
  • Multimodal: more than two values share the highest frequency

Examples

  1. Unimodal Dataset

$$[1, 2, 2, 3, 4]$$

$$\text{Mode} = 2$$

  2. Bimodal Dataset

$$[1, 2, 2, 3, 3, 4]$$

$$\text{Mode} = 2, 3$$
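The classification above can be sketched in a few lines of Python. The helper name `modality` is hypothetical. Note one assumption: when every value occurs exactly once, this sketch reports all of them as modes, whereas some texts would say such a dataset has no mode.

```python
from collections import Counter


def modality(data):
    """Return (label, modes) describing how many values share the top frequency."""
    counts = Counter(data)
    top = max(counts.values())
    # Values whose count equals the highest frequency, in first-seen order
    modes = [value for value, count in counts.items() if count == top]

    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return label, modes


print(modality([1, 2, 2, 3, 4]))
print(modality([1, 2, 2, 3, 3, 4]))
```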

Key properties of the mode:

  • Mode is not affected by outliers
  • Works well with categorical data (e.g., most popular product, most chosen option)
  • A dataset can have more than one mode, whereas it has exactly one mean and one median

Python Implementation

				
def mode(data):
    # Count how many times each value appears
    freq = {}

    for num in data:
        freq[num] = freq.get(num, 0) + 1

    # Collect every value that reaches the highest frequency
    max_freq = max(freq.values())
    modes = [key for key, val in freq.items() if val == max_freq]

    return modes

data = [1, 2, 2, 3, 3, 4]
print("Mode:", mode(data))
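A library-based sketch, assuming Python 3.8 or newer, uses `statistics.multimode` to return every mode and `statistics.mode` to return the first one encountered. The `colors` list is made-up sample data, included only to show that the mode also works on categorical values.

```python
import statistics

numbers = [1, 2, 2, 3, 3, 4]
colors = ["red", "blue", "red", "green"]  # hypothetical categorical data

# multimode returns all values tied for the highest frequency
all_modes = statistics.multimode(numbers)

# mode returns a single value; with ties, the first one encountered (Python 3.8+)
first_mode = statistics.mode(numbers)

# Works on non-numeric data as well
popular_color = statistics.mode(colors)

print("All modes:", all_modes)
print("First mode:", first_mode)
print("Most common color:", popular_color)
```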