5. Outliers in Pandas
In statistics, an outlier is a data point that differs significantly from other observations.
To put it more simply, values that lie far outside most of the other values are outliers.
Let us take an example.
Let's say we have the Math scores of the students in a class, something like this:
To spot the outliers, we can plot the scores as a line chart.
By our definition of outliers, the students who stand out here are Deepak with 3 marks and Ramesh with 85 marks.
Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it’s best to remove them from your data.
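Now that we understand what outliers are, there are a few concepts to know in detail before we dive deeper:
You probably already know the mean, median, and mode, but for reference, go through the image below. In short, the mean is the arithmetic average, the median is the middle value of the sorted data, and the mode is the most frequent value.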
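As a quick refresher, the three measures can be computed with Python's standard `statistics` module. This is only a minimal sketch, and the sample values are hypothetical, chosen purely for illustration.

```python
import statistics

# Hypothetical sample, chosen only to illustrate the three measures.
values = [2, 3, 3, 5, 7]

mean_value = statistics.mean(values)      # arithmetic average
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequent value

print(mean_value, median_value, mode_value)
```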
Early statisticians noticed the same shape coming up over and over again in different distributions, so they named it the normal distribution.
Let us assume you have data on the heights of people. A few people are very short, a few are very tall, and most fall in the average height range. Statisticians kept finding this same shape because many real-world distributions look like this.
A normal distribution has the following properties:
Symmetric bell shape.
Mean and median are equal. Both are located at the center of the distribution.
About 68 percent of the data falls within 1 standard deviation of the mean.
About 95 percent of the data falls within 2 standard deviations of the mean.
About 99.7 percent of the data falls within 3 standard deviations of the mean.
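The three percentages above can be checked numerically. The sketch below uses the standard relationship between the normal distribution and the error function; the helper name `share_within` is made up for this example.

```python
import math

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

The printed shares round to the 68, 95, and 99.7 percent figures listed above.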
Deviation just means how far a value is from the normal (the mean).
A standard deviation is a measure of how dispersed the data is in relation to the mean. Low standard deviation means data are clustered around the mean, and high standard deviation indicates data are more spread out.
Imagine you have measured the heights of some dogs in millimeters, and they look something like this:
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Your first step is to find the Mean:
So the mean (average) height is 394 mm. Let's plot this on the chart:
Now we calculate each dog's difference from the Mean:
To calculate the Variance, take each difference, square it, and then average the result:
So the Variance is 21,704 (in mm²).
The Standard Deviation is just the square root of the Variance: √21,704 ≈ 147 mm.
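The whole calculation for the five dog heights can be reproduced in a few lines of plain Python. This is a sketch of the steps described above (mean, differences, variance, standard deviation); the variable names are illustrative.

```python
import math

# Dog heights at the shoulders, in millimeters (from the example above).
heights_mm = [600, 470, 170, 430, 300]

# Step 1: the mean.
mean_mm = sum(heights_mm) / len(heights_mm)

# Step 2: each dog's difference from the mean.
differences = [h - mean_mm for h in heights_mm]

# Step 3: variance = average of the squared differences (population variance).
variance = sum(d ** 2 for d in differences) / len(heights_mm)

# Step 4: standard deviation = square root of the variance.
std_dev_mm = math.sqrt(variance)

print(mean_mm, differences, variance, round(std_dev_mm, 2))
```

Note that this divides by the number of values (population variance), as the text does. pandas' `Series.std()` divides by n - 1 by default, so it would return a larger value for the same data unless called with `ddof=0`.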
And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one Standard Deviation (147mm) of the Mean:
So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
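To make "within one Standard Deviation" concrete, the sketch below splits the five heights into those inside and outside the band from mean - SD to mean + SD. It assumes the population standard deviation (`statistics.pstdev`), matching the calculation above.

```python
import statistics

# Dog heights at the shoulders, in millimeters (from the example above).
heights_mm = [600, 470, 170, 430, 300]

mean_mm = statistics.mean(heights_mm)       # 394
std_dev_mm = statistics.pstdev(heights_mm)  # population standard deviation, about 147

# The "normal" band: one standard deviation either side of the mean.
low, high = mean_mm - std_dev_mm, mean_mm + std_dev_mm

within = [h for h in heights_mm if low <= h <= high]
outside = [h for h in heights_mm if h < low or h > high]

print("within one SD:", within)
print("outside one SD:", outside)
```

With these numbers, the 600 mm and 170 mm dogs fall outside the band, which matches the "extra large or extra small" idea.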
Rottweilers are tall dogs. And Dachshunds are a bit short, right?
When we talk about quartiles, we are dividing the sorted data set into 4 quarters.
Each quarter holds 25% of the total number of data points.
The first quartile or Q1: the value such that 25% of the data points are less than this value and 75% are greater than this value.
The second quartile or Q2 (the median): the value such that 50% of the data points are less than this value and 50% are greater than this value.
The third quartile or Q3: the value such that 75% of the data points are less than this value and 25% are greater than this value.
The term Interquartile Range (IQR) refers to the difference between Q3 and Q1 (IQR = Q3 – Q1).
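As a small illustration, the quartiles and the IQR can be computed with `numpy.percentile`. The sketch reuses the five dog heights from the earlier example, only because they are already at hand.

```python
import numpy as np

# Dog heights in millimeters, reused from the standard deviation example.
heights_mm = np.array([600, 470, 170, 430, 300])

# np.percentile sorts the data internally and interpolates linearly by default.
q1, q2, q3 = np.percentile(heights_mm, [25, 50, 75])
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```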
A box plot, also called a box-and-whisker plot, is a graphical method built from the quartiles and the interquartile range. It helps define the upper and lower limits beyond which any data point is considered an outlier. A key purpose of this diagram is to identify outliers so they can be reviewed, and discarded where appropriate, before making further observations, so that the conclusions of the study are not skewed by extreme or abnormal values.
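A box plot can be drawn with matplotlib, which by default places the whiskers at 1.5 * IQR and draws anything beyond them as separate points. The sketch below is only an illustration: the scores are hypothetical, and only the values 3 and 85 come from the Math scores example earlier in this section.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical Math scores: only 3 and 85 come from the text; the rest are made up.
scores = [3, 38, 40, 42, 43, 45, 46, 47, 85]

fig, ax = plt.subplots()
result = ax.boxplot(scores)  # whiskers default to 1.5 * IQR
ax.set_ylabel("Math score")

# Points drawn beyond the whiskers ("fliers") are the flagged outliers.
outliers = sorted(result["fliers"][0].get_ydata().tolist())
print(outliers)

plt.close(fig)
```

In a notebook you would normally drop the `matplotlib.use("Agg")` line and let the figure render inline.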
We have already worked with the Walmart data set many times. Let's try to find outliers in Weekly_Sales.
0        1  A   1  57258.43  False  62.27  2.719  7.808  01-01-2010  16-06-2022
1        1  A   1  16333.14  False  80.91  2.669  7.787  07-03-2010  07-04-2010
2        1  A   1  19403.54  False  46.63  2.561  8.106  26-02-2012  27-02-2012
3        1  A   1  22517.56  False  49.27  2.708  7.838  12-03-2011  12-03-2012
4        1  A   1  17596.96  False  66.32  2.808  7.808  16-04-2010  16-04-2010
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
282446  45  B  98    467.30  False  65.32  4.038  8.684  21-09-2012  21-09-2012
282447  45  B  98    508.37  False  64.88  3.997  8.684  28-09-2012  28-09-2012
282448  45  B  98    770.86   True  37.00  3.640  8.424  02-10-2012  02-10-2012
282449  45  B  98    727.49  False  78.65  3.722  8.684  08-10-2012  08-10-2012
282450  45  B  98    893.60  False  61.24  3.889  8.567  05-11-2012  05-11-2012
We will use this later in finding outliers.
Outliers are data points that lie below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
Lower extreme: Q1 - 1.5 * IQR
Upper extreme: Q3 + 1.5 * IQR
To find the lower and upper extremes, we must first find Q1 and Q3. We will use NumPy's percentile function to calculate the quartiles.
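A minimal sketch of this step is shown here. It does not load the real data set: it uses only the ten values visible in the rows above, on the assumption that the fifth column is Weekly_Sales, so the resulting numbers will differ from those for the full column.

```python
import numpy as np
import pandas as pd

# Ten values copied from the rows shown above (assumed to be Weekly_Sales).
# The real analysis would use the full column instead of this small sample.
weekly_sales = pd.Series(
    [57258.43, 16333.14, 19403.54, 22517.56, 17596.96,
     467.30, 508.37, 770.86, 727.49, 893.60],
    name="Weekly_Sales",
)

# Quartiles via NumPy's percentile function.
q1 = np.percentile(weekly_sales, 25)
q3 = np.percentile(weekly_sales, 75)
iqr = q3 - q1

# Extremes (fences) as defined above.
lower_extreme = q1 - 1.5 * iqr
upper_extreme = q3 + 1.5 * iqr

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("lower extreme:", lower_extreme, "upper extreme:", upper_extreme)
```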
Outliers exist below the lower extreme only if the lower extreme is greater than the minimum value of the dataset.
Hence there are no outliers below the lower extreme.
Outliers exist above the upper extreme only if the upper extreme is smaller than the maximum value of the dataset.
Hence there are outliers present above the upper extreme.
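The two checks can be sketched as simple comparisons against the minimum and maximum. This example again uses only the ten visible values (assumed to be Weekly_Sales), and it swaps in pandas' `Series.quantile` for `np.percentile`; both interpolate linearly by default. The helper name `iqr_extremes` is hypothetical.

```python
import pandas as pd

def iqr_extremes(values):
    """Return (lower extreme, upper extreme) using the 1.5 * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Sample copied from the rows shown earlier (assumed Weekly_Sales values).
weekly_sales = pd.Series(
    [57258.43, 16333.14, 19403.54, 22517.56, 17596.96,
     467.30, 508.37, 770.86, 727.49, 893.60]
)

lower_extreme, upper_extreme = iqr_extremes(weekly_sales)

# Outliers exist on a side only if the extreme lies inside the data's range.
has_low_outliers = lower_extreme > weekly_sales.min()
has_high_outliers = upper_extreme < weekly_sales.max()

print("outliers below lower extreme:", has_low_outliers)
print("outliers above upper extreme:", has_high_outliers)
```

On this small sample the result agrees with the conclusion in the text: nothing falls below the lower extreme, while at least one value lies above the upper extreme.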
So we remove the outliers, and doing that is pretty simple!
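One common way to do this in pandas is a boolean mask that keeps only the rows inside the two extremes. The sketch below is not necessarily the exact code used in the original notebook; it builds a one-column DataFrame from the ten visible values (assumed to be Weekly_Sales), and the name `df_no_outliers` is illustrative.

```python
import pandas as pd

# Sample DataFrame built from the rows shown earlier (assumed Weekly_Sales values).
df = pd.DataFrame({
    "Weekly_Sales": [57258.43, 16333.14, 19403.54, 22517.56, 17596.96,
                     467.30, 508.37, 770.86, 727.49, 893.60]
})

q1 = df["Weekly_Sales"].quantile(0.25)
q3 = df["Weekly_Sales"].quantile(0.75)
iqr = q3 - q1
lower_extreme = q1 - 1.5 * iqr
upper_extreme = q3 + 1.5 * iqr

# Keep only rows whose Weekly_Sales lies within the extremes (inclusive).
mask = df["Weekly_Sales"].between(lower_extreme, upper_extreme)
df_no_outliers = df[mask]

print(df_no_outliers)
```

The filtered frame keeps its original index labels, so the removed row simply leaves a gap in the index.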
1        1  A   1  16333.14  False  80.91  2.669  7.787  07-03-2010  07-04-2010
2        1  A   1  19403.54  False  46.63  2.561  8.106  26-02-2012  27-02-2012
3        1  A   1  22517.56  False  49.27  2.708  7.838  12-03-2011  12-03-2012
4        1  A   1  17596.96  False  66.32  2.808  7.808  16-04-2010  16-04-2010
5        1  A   1  16555.11  False  67.41  2.780  7.808  30-04-2010  30-04-2010
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
282446  45  B  98    467.30  False  65.32  4.038  8.684  21-09-2012  21-09-2012
282447  45  B  98    508.37  False  64.88  3.997  8.684  28-09-2012  28-09-2012
282448  45  B  98    770.86   True  37.00  3.640  8.424  02-10-2012  02-10-2012
282449  45  B  98    727.49  False  78.65  3.722  8.684  08-10-2012  08-10-2012
282450  45  B  98    893.60  False  61.24  3.889  8.567  05-11-2012  05-11-2012