5. Outliers In Pandas

Introduction To Outliers

In statistics, an outlier is a data point that differs significantly from other observations.

Put more simply, values that lie far outside most of the other values are outliers.

Let us take an example.

Let's say we have the Math scores of the students in a class, something like this:

In [2]:

students = ['Abhishek','Raj','Deepak','Neha','Ramesh','Aaradhya','Mahesh','Sam']

scores = [25,29,3,32,85,33,27,28]

To spot the outliers, we can plot the data as a scatter chart.

In [5]:

import pandas as pd
import matplotlib.pyplot as plt

students = ['Abhishek','Raj','Deepak','Neha','Ramesh','Ajay','Mahesh','Sam']

scores = [25,29,3,32,85,33,27,28]

plt.scatter(students,scores)

plt.show()

As per our definition of outliers, the students who are outliers here are Deepak with 3 marks and Ramesh with 85 marks.

Why Are Outliers Important, and Why Should They Not Be Ignored?

Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it’s best to remove them from your data.

So let us deal with outliers:

Now that we understand what outliers are, there are a few concepts to cover in detail before we dive deeper:

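Mean, Median, Mode and Range:

As a quick refresher: the mean is the average of the values, the median is the middle value of the sorted data, the mode is the most frequent value, and the range is the difference between the largest and smallest values.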
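These four measures can be computed directly with pandas. The snippet below is a minimal sketch; the sample values are hypothetical and chosen only so that one value repeats and the mode is well defined.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the four measures.
values = pd.Series([2, 3, 3, 5, 7, 10])

mean = values.mean()                        # sum of the values divided by their count
median = values.median()                    # middle value of the sorted data
mode = values.mode().iloc[0]                # most frequent value (first one if there are ties)
value_range = values.max() - values.min()   # largest value minus smallest value

print(mean, median, mode, value_range)
```

Note that mode() returns a Series because several values can tie for most frequent, which is why the first entry is picked here.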

Normal Distribution

Early statisticians noticed the same shape coming up over and over again in different distributions, so they named it the normal distribution.

Let us assume you have data on the heights of people. A few people are very short, a few are very tall, and most fall in the average height category. This is why statisticians found the same shape over and over again: many distributions look like this.

Normal distributions have the following features:

Symmetric bell shape
Mean and Median are equal. Both located at the center of the distribution.
About 68 percent of the data falls within 1 standard deviation of the mean
About 95 percent of the data falls within 2 standard deviations of the mean
About 99.7 percent of the data falls within 3 standard deviations of the mean
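The three percentages can be checked empirically. The sketch below draws a large random sample from a normal distribution with NumPy and measures the share of points within 1, 2 and 3 standard deviations of the mean; the sample size and seed are arbitrary choices, and the printed shares will only be approximately 68, 95 and 99.7 percent.

```python
import numpy as np

# Draw a large sample from a normal distribution (seed fixed for repeatability).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean = sample.mean()
std = sample.std()

# Fraction of points that lie within k standard deviations of the mean.
within = {k: np.mean(np.abs(sample - mean) <= k * std) for k in (1, 2, 3)}

for k, fraction in within.items():
    print(f"within {k} standard deviation(s): {fraction:.3f}")
```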

Standard Deviation

Deviation simply means how far a value is from the normal (the average).

A standard deviation is a measure of how dispersed the data is in relation to the mean. Low standard deviation means data are clustered around the mean, and high standard deviation indicates data are more spread out.

Imagine you have measured the heights of some dogs in millimeters, and the data looks something like this:

The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.

Find out the Mean, the Variance, and the Standard Deviation.

Your first step is to find the Mean:

In [1]:

Mean=(600 + 470 + 170 + 430 + 300)/5
print(Mean)

394.0

So the mean (average) height is 394 mm. Let's plot this on the chart:

Now we calculate each dog's difference from the Mean:
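A short sketch of this step in plain Python, using the five heights listed earlier; the variable names are only illustrative.

```python
heights = [600, 470, 170, 430, 300]
mean = sum(heights) / len(heights)

# Difference of each height from the mean (positive = taller than average).
differences = [h - mean for h in heights]
print(differences)
```

These are the same five differences that are squared in the variance calculation.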

To calculate the Variance, take each difference, square it, and then average the result:

variance =( 206**2 + 76**2 + (-224)**2 + 36**2 + (-94)**2 ) / 5
print(variance)

21704.0

So the Variance is 21,704

And the Standard Deviation is just the square root of Variance, so:

In [8]:

import math
standard_deviation = math.sqrt(variance)
print(round(standard_deviation))

147
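As a cross-check, the same figures can be obtained from NumPy and pandas instead of by hand. This is a sketch, not part of the original walkthrough. One detail to keep in mind: NumPy divides by N by default (ddof=0), as the manual calculation does, while pandas divides by N - 1 by default (ddof=1) and therefore gives a slightly larger value unless ddof=0 is passed.

```python
import numpy as np
import pandas as pd

heights = np.array([600, 470, 170, 430, 300])

# NumPy uses ddof=0 by default, matching the manual calculation.
print(heights.mean(), heights.var(), round(heights.std()))

# pandas uses ddof=1 by default, so pass ddof=0 to get the same number.
print(round(pd.Series(heights).std(ddof=0)))

# The pandas default (sample standard deviation) is larger.
sample_std = pd.Series(heights).std()
print(sample_std)
```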

The useful thing about the Standard Deviation is that it is expressed in the same units as the data. Now we can show which heights are within one Standard Deviation (147mm) of the Mean:

So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
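The idea can be sketched in a few lines of Python: anything more than one standard deviation above the mean is labelled extra large, anything more than one standard deviation below is labelled extra small, and the rest is normal. The label() helper is a hypothetical name used only for this illustration.

```python
import math

heights = [600, 470, 170, 430, 300]
mean = sum(heights) / len(heights)
std = math.sqrt(sum((h - mean) ** 2 for h in heights) / len(heights))

# Label each height relative to one standard deviation around the mean.
def label(height):
    if height > mean + std:
        return "extra large"
    if height < mean - std:
        return "extra small"
    return "normal"

labels = {h: label(h) for h in heights}
print(labels)
```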

Rottweilers are tall dogs. And Dachshunds are a bit short, right?

Quartile in Pandas:

When we talk about quartiles, we are dividing the data set into 4 quarters.

Each quarter contains 25% of the total number of data points.
The first quartile, or Q1: the value such that 25% of the data points are less than it and 75% are greater than it.
The second quartile, or Q2 (the median): the value such that 50% of the data points are less than it and 50% are greater than it.
The third quartile, or Q3: the value such that 75% of the data points are less than it and 25% are greater than it.
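In pandas, quartiles can be read off with Series.quantile(). The sketch below reuses the Math scores from the start of this chapter. With a small data set the quartiles are interpolated between neighbouring values (pandas uses linear interpolation by default), so they need not be actual data points.

```python
import pandas as pd

# Math scores from the example at the start of this chapter.
scores = pd.Series([25, 29, 3, 32, 85, 33, 27, 28])

# quantile() takes fractions between 0 and 1; the default interpolation is linear.
q1, q2, q3 = scores.quantile([0.25, 0.50, 0.75])
print(q1, q2, q3)
```

Q2 is the same number that scores.median() returns.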

Interquartile Range :

The term Interquartile Range (IQR) refers to the difference between Q3 and Q1 (IQR = Q3 – Q1).

Box and Whisker Plot:

A box plot, also called a box-and-whisker plot, is a graphical method built from the quartiles and the interquartile range. It helps define the upper and lower limits beyond which any data point is considered an outlier. The purpose of the diagram is to identify outliers so that they can be discarded from the data series before further analysis, so that the conclusions of the study are not influenced by extreme or abnormal values.
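A box plot of the Math scores from the start of this chapter can be sketched with matplotlib as follows. The call to cbook.boxplot_stats is used here only to print the numbers the plot is built from. The whis=1.5 argument, which is also the default, places the whiskers at most 1.5 * IQR beyond the quartiles.

```python
import matplotlib.pyplot as plt
from matplotlib import cbook

scores = [25, 29, 3, 32, 85, 33, 27, 28]

# boxplot_stats returns the numbers a box plot is drawn from.
stats = cbook.boxplot_stats(scores, whis=1.5)[0]
print("Q1:", stats["q1"], "median:", stats["med"], "Q3:", stats["q3"])
print("points drawn as outliers:", stats["fliers"])

# Draw the box plot itself; the outliers appear as separate points.
plt.boxplot(scores)
plt.show()
```

The two points drawn beyond the whiskers are the scores of Deepak and Ramesh.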

Let's Find Outliers:

We have already worked with the Walmart data set many times. Let's try to find outliers in Weekly_Sales.

import pandas as pd

wlmrt = pd.read_csv('walmart.csv')

wlmrt

        Store  Type  Department  Weekly_Sales  Is_Holiday  Temperature  Fuel_Price  Unemployment  Date1       Date2
                                 57258.43      False       62.27        2.719       7.808         01-01-2010  16-06-2022
                                 16333.14      False       80.91        2.669       7.787         07-03-2010  07-04-2010
                                 19403.54      False       46.63        2.561       8.106         26-02-2012  27-02-2012
                                 22517.56      False       49.27        2.708       7.838         12-03-2011  12-03-2012
                                 17596.96      False       66.32        2.808       7.808         16-04-2010
...
282446                           467.30        False       65.32        4.038       8.684         21-09-2012
282447                           508.37        False       64.88        3.997       8.684         28-09-2012
282448                           770.86        True        37.00        3.640       8.424         02-10-2012
282449                           727.49        False       78.65        3.722       8.684         08-10-2012
282450                           893.60        False       61.24        3.889       8.567         05-11-2012

Let's find the max and min of the Weekly_Sales column:

In [3]:

mx = wlmrt['Weekly_Sales'].max()
mn = wlmrt['Weekly_Sales'].min()
print(mx)
print(mn)

693099.36
-4988.94

We will use this later in finding outliers.

Lower Extreme and Upper Extreme:

Outliers are data points that are more extreme than Q1 - 1.5 * IQR or Q3 + 1.5 * IQR, that is, points below the lower extreme or above the upper extreme.

Lower Extreme : Q1 - 1.5 * IQR

Upper Extreme : Q3 + 1.5 * IQR
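The two formulas can be wrapped in a small helper. This is a minimal sketch: iqr_bounds is a hypothetical name, not a pandas function, and it is tried here on the Math scores from the start of this chapter rather than on the Walmart data.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower extreme, upper extreme) for a pandas Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Math scores from the start of this chapter.
scores = pd.Series([25, 29, 3, 32, 85, 33, 27, 28])
lower, upper = iqr_bounds(scores)
print(lower, upper)

# Values outside the two extremes are flagged as outliers.
outliers = scores[(scores < lower) | (scores > upper)].tolist()
print(outliers)
```

The flagged values are 3 and 85, the same two students picked out by eye in the scatter chart.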

Let's Find Q1 and Q3 :

To find the lower and upper extremes, we must first find Q1 and Q3. We will use NumPy's percentile function to calculate the quartiles.

In [4]:

import numpy as np

q1,q3 = np.percentile(wlmrt['Weekly_Sales'],[25,75])

print(q1,q3)

2079.33 20245.745000000003

Let's Find Interquartile Range (IQR):

In [5]:

iqr = q3-q1
iqr

Let's Find Lower Extreme and Upper Extreme:

In [6]:

lx = q1 - 1.5 * iqr
ux = q3 + 1.5 * iqr
print('Lower Extreme : ',lx)
print('Upper Extreme : ',ux)

Lower Extreme :  -25170.292500000003
Upper Extreme :  47495.36750000001

Outliers exist beyond the lower extreme only if the lower extreme is greater than the minimum value of the dataset.

In [7]:

outliers_present = lx>mn
outliers_present

Out[7]:

False

Hence there are no outliers beyond the lower extreme.

Similarly, outliers exist beyond the upper extreme only if the upper extreme is smaller than the maximum value of the dataset.

In [8]:

outliers_present = ux<mx
outliers_present

Out[8]:

True

Hence there are outliers beyond the upper extreme.
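Comparing the extremes with the min and max only tells us whether outliers exist, not how many there are. A count on each side can be obtained with boolean masks. The sketch below uses a small made-up DataFrame as a stand-in for the Walmart data so that it runs on its own.

```python
import pandas as pd

# Small stand-in for the Walmart data; these values are made up for illustration.
df = pd.DataFrame(
    {"Weekly_Sales": [120.0, 150.0, 160.0, 170.0, 180.0, 200.0, 210.0, 5000.0]}
)

q1, q3 = df["Weekly_Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
lx = q1 - 1.5 * iqr
ux = q3 + 1.5 * iqr

# Boolean masks mark the rows beyond each extreme; summing a mask counts them.
below = df["Weekly_Sales"] < lx
above = df["Weekly_Sales"] > ux
print("below lower extreme:", below.sum())
print("above upper extreme:", above.sum())
```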

No Outliers:

So we must remove the outliers, and doing that is pretty simple!

In [9]:

# Keep rows within the extremes; values exactly on an extreme are not outliers.
no_outlier = wlmrt.loc[(wlmrt['Weekly_Sales'] >= lx) & (wlmrt['Weekly_Sales'] <= ux)]
no_outlier

        Store  Type  Department  Weekly_Sales  Is_Holiday  Temperature  Fuel_Price  Unemployment  Date1       Date2
                                 16333.14      False       80.91        2.669       7.787         07-03-2010  07-04-2010
                                 19403.54      False       46.63        2.561       8.106         26-02-2012  27-02-2012
                                 22517.56      False       49.27        2.708       7.838         12-03-2011  12-03-2012
                                 17596.96      False       66.32        2.808       7.808         16-04-2010
                                 16555.11      False       67.41        2.780       7.808         30-04-2010
...
282446                           467.30        False       65.32        4.038       8.684         21-09-2012
282447                           508.37        False       64.88        3.997       8.684         28-09-2012
282448                           770.86        True        37.00        3.640       8.424         02-10-2012
282449                           727.49        False       78.65        3.722       8.684         08-10-2012
282450                           893.60        False       61.24        3.889       8.567         05-11-2012
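After filtering, it is worth checking how many rows were dropped and that the remaining values really lie within the extremes. The sketch below does this on a small made-up DataFrame standing in for the Walmart data. It uses Series.between(), which keeps both end points by default, as one possible way to write the same filter.

```python
import pandas as pd

# Made-up stand-in for the Walmart data, used only to show the check.
df = pd.DataFrame(
    {"Weekly_Sales": [120.0, 150.0, 160.0, 170.0, 180.0, 200.0, 210.0, 5000.0]}
)

q1, q3 = df["Weekly_Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
lx, ux = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# between() keeps rows with lx <= value <= ux (both ends inclusive by default).
no_outlier = df[df["Weekly_Sales"].between(lx, ux)]

removed = len(df) - len(no_outlier)
new_max = no_outlier["Weekly_Sales"].max()
print("rows removed:", removed)
print("new max:", new_max)
```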

