6. Aggregating Data
Mean
The mean is the average of a set of numbers, calculated by dividing the sum of the numbers by their count.
Median
The median is the middle value in a sorted list of numbers.
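For a quick, made-up illustration (not the Walmart data), the two statistics can differ a lot when there is an outlier:

```python
import pandas as pd

# Hypothetical numbers chosen to show the difference between mean and median.
numbers = pd.Series([3, 5, 7, 9, 100])

print(numbers.mean())    # (3 + 5 + 7 + 9 + 100) / 5 = 24.8
print(numbers.median())  # middle value of the sorted list = 7
```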
Data
In [30]:
|   | Store | Type | Department | Weekly_Sales | Is_Holiday | Temperature | Fuel_Price | Unemployment | Date |
|---|-------|------|------------|--------------|------------|-------------|------------|--------------|------|
| 0 | 1 | A | 1 | 57258.43 | False | 62.27 | 2.719 | 7.808 | 2010-04-02 |
| 1 | 1 | A | 1 | 16333.14 | False | 80.91 | 2.669 | 7.787 | 2010-07-02 |
| 2 | 1 | A | 1 | 19403.54 | False | 46.63 | 2.561 | 8.106 | 2010-02-26 |
| 3 | 1 | A | 1 | 22517.56 | False | 49.27 | 2.708 | 7.838 | 2010-12-03 |
| 4 | 1 | A | 1 | 17596.96 | False | 66.32 | 2.808 | 7.808 | 2010-04-16 |
So, this is our data about Walmart stores; in this module we are going to work with the Walmart dataset.
Let's find the mean and median of the Weekly_Sales column.
In [31]:
Out[31]:
In [32]:
Out[32]:
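The code for these cells is not shown above; here is a minimal sketch of what they compute. The file name walmart.csv is a placeholder, not taken from the notebook, and the later snippets in this section assume this walmart DataFrame.

```python
import pandas as pd

# Load the Walmart data (file name is a placeholder) and parse the Date column.
walmart = pd.read_csv("walmart.csv", parse_dates=["Date"])

# Mean and median of the Weekly_Sales column.
print(walmart["Weekly_Sales"].mean())
print(walmart["Weekly_Sales"].median())
```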
Summarizing dates
The maximum and minimum allow us to see what time range the data covers.
In [33]:
Out[33]:
In [34]:
Out[34]:
Using .max() and .min(), we can see that our dataset covers January 2010 through December 2012.
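A sketch of the two cells above, assuming the walmart DataFrame and the Date column from the table at the top:

```python
# Earliest and latest dates covered by the data.
print(walmart["Date"].min())
print(walmart["Date"].max())
```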
Efficient Summaries
Agg Method:
The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super efficient.
Notice that our temperature column is in Fahrenheit, and we want it in Celsius. We will write a custom function that converts temperatures from Fahrenheit to Celsius and apply it to a DataFrame column through the .agg() method.
Here we apply a custom function with the help of the .agg() method. You can pass a list of functions to apply more than one function at once, or select a list of columns if you want to apply the function to more than one column.
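A sketch of this, assuming the temperature column is named Temperature (the exact column name is not shown in the notebook):

```python
# Custom function: convert Fahrenheit to Celsius.
def fahrenheit_to_celsius(temp_f):
    return (temp_f - 32) * 5 / 9

# Apply the custom function to a single column with .agg().
print(walmart["Temperature"].agg(fahrenheit_to_celsius).head())

# Pass a list of functions, or select a list of columns, to aggregate more at once.
print(walmart[["Weekly_Sales", "Unemployment"]].agg(["min", "max"]))
```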
Cumulative Statistics
Cumulative statistics can also be helpful in tracking summary statistics over time. We'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow us to identify what the total sales were so far as well as what the highest weekly sales were so far.
Cumulative sum of weekly_sales
In [36]:
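The cell's code is not shown; a sketch of the idea, with made-up names for the new columns:

```python
# Sort by date so the cumulative statistics accumulate in time order.
# (The original cell may first filter to a single department; that step is not shown.)
dept_sales = walmart.sort_values("Date")

# Running total and running maximum of weekly sales (new column names are assumptions).
dept_sales["cum_weekly_sales"] = dept_sales["Weekly_Sales"].cumsum()
dept_sales["cum_max_sales"] = dept_sales["Weekly_Sales"].cummax()

print(dept_sales[["Date", "Weekly_Sales", "cum_weekly_sales", "cum_max_sales"]].head())
```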
Dropping duplicates
Removing duplicates is essential for accurate counts, because we don't want to count the same thing multiple times. In this exercise, we'll create some new DataFrames using unique values from walmart.
In [37]:
|       | Store | Type | Department | Weekly_Sales | Is_Holiday | Temperature | Fuel_Price | Unemployment | Date |
|-------|-------|------|------------|--------------|------------|-------------|------------|--------------|------|
| 0     | 1 | A | 1 | 57258.43 | False | 62.27 | 2.719 | 7.808 | 2010-04-02 |
| 6946  | 2 | A | 1 | 27023.35 | False | 69.24 | 2.603 | 8.163 | 2010-10-01 |
| 13804 | 3 | B | 1 | 4238.56  | False | 82.20 | 2.669 | 7.346 | 2010-07-02 |
| 19897 | 4 | A | 1 | 36486.28 | False | 63.96 | 2.619 | 7.127 | 2010-10-01 |
| 26713 | 5 | B | 1 | 8588.14  | False | 71.10 | 2.603 | 6.768 | 2010-10-01 |
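A sketch of how DataFrames of unique values like the one above might be built; the subset column names are inferred from the output, where each store appears once:

```python
# One row per unique store/type combination.
store_types = walmart.drop_duplicates(subset=["Store", "Type"])

# One row per unique store/department combination.
store_depts = walmart.drop_duplicates(subset=["Store", "Department"])

print(store_types.head())
```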
Counting categorical variables
Counting is a great way to get an overview of your data and to spot things you might otherwise miss. We'll count the number of stores of each type and the number of each department number using the DataFrames we created in Dropping Duplicates.
Counting the number of stores of each Type
In [38]:
Out[38]:
Proportion of stores of each Type
Normalize: the normalize argument turns the counts into proportions of the total.
In [39]:
Out[39]:
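A sketch of both cells, assuming the store_types DataFrame from the drop-duplicates step above:

```python
# Number of stores of each type.
print(store_types["Type"].value_counts())

# Proportion of stores of each type.
print(store_types["Type"].value_counts(normalize=True))
```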
Counting the number of each department number and sorting values
In [40]:
Out[40]:
Proportion of each department number, with sorted values
In [41]:
Out[41]:
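And similarly for departments, assuming a store_depts DataFrame of unique store/department rows:

```python
# Count of each department number, sorted from most to least common.
print(store_depts["Department"].value_counts(sort=True))

# Proportion of each department number, sorted.
print(store_depts["Department"].value_counts(sort=True, normalize=True))
```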
.groupby()
Pandas .groupby() groups the data by categories and applies a function to each group. It also helps aggregate data efficiently.
Group by Type to calculate total Weekly_Sales
In [42]:
Out[42]:
Here, we group by Type and sum the Weekly_Sales column.
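A sketch of the cell, assuming the walmart DataFrame:

```python
# Total weekly sales for each store type.
sales_by_type = walmart.groupby("Type")["Weekly_Sales"].sum()
print(sales_by_type)
```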
Group by Type and Is_Holiday to calculate total Weekly_Sales
In [43]:
Out[43]:
In this case we group by two columns and summarize one column. If you want to compute summary statistics on more than one column, pass a list of those columns, as shown below.
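A sketch of grouping by two columns, plus the multi-column variant just mentioned:

```python
# Total weekly sales for each combination of store type and holiday flag.
print(walmart.groupby(["Type", "Is_Holiday"])["Weekly_Sales"].sum())

# To summarize more than one column, select a list of columns.
print(walmart.groupby("Type")[["Weekly_Sales", "Unemployment"]].sum())
```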
Multiple grouped summaries
Earlier we saw that the .agg() method is useful for computing multiple statistics on multiple variables. It also works with .groupby(), i.e. on grouped data.
For each store Type, aggregating Weekly_Sales to get min, max, mean, and median
In [44]:
| Type | amin | amax | mean | median |
|------|------|------|------|--------|
| A | -4988.94 | 381072.11 | 20079.169258 | 10096.75 |
| B | -1750.00 | 693099.36 | 12263.647825 | 6197.43 |
| C | -379.00 | 112152.35 | 9484.482310 | 1137.34 |
Notice that the minimum Weekly_Sales is negative because some stores had more returns than sales.
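A sketch of the aggregation above; the amin/amax column labels in these outputs suggest numpy functions were passed to .agg(), though string names ("min", "max", ...) work the same way:

```python
import numpy as np

# Min, max, mean, and median weekly sales per store type.
sales_stats = walmart.groupby("Type")["Weekly_Sales"].agg([np.min, np.max, np.mean, np.median])
print(sales_stats)
```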
For each store Type, aggregating Unemployment and Fuel_Price to get min, max, mean, and median
In [45]:
| Type | Unemployment amin | Unemployment amax | Unemployment mean | Unemployment median | Fuel_Price amin | Fuel_Price amax | Fuel_Price mean | Fuel_Price median |
|------|-------------------|-------------------|-------------------|---------------------|-----------------|-----------------|-----------------|-------------------|
| A | 3.879 | 14.313 | 7.798863 | 7.818 | 2.472 | 4.468 | 3.343295 | 3.416 |
| B | 4.125 | 14.313 | 7.936579 | 7.874 | 2.514 | 4.468 | 3.381915 | 3.494 |
| C | 5.217 | 14.313 | 8.948222 | 8.300 | 2.514 | 4.468 | 3.363569 | 3.417 |
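The same pattern works with two value columns, as in the table above:

```python
import numpy as np

# Min, max, mean, and median of unemployment and fuel price per store type.
print(
    walmart.groupby("Type")[["Unemployment", "Fuel_Price"]]
    .agg([np.min, np.max, np.mean, np.median])
)
```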
Pivot tables
Pivot tables are another way of calculating grouped summary statistics. They originate in spreadsheets; if you have ever used a spreadsheet, chances are you have used a pivot table. Let's see how to create pivot tables in pandas.
Pivoting on one variable
We'll use .pivot_table() to replicate the calculations we performed with .groupby().
Pivoting for mean Weekly_Sales for each store Type
In [46]:
| Type | Weekly_Sales |
|------|--------------|
| A | 20079.169258 |
| B | 12263.647825 |
| C | 9484.482310 |
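A sketch of the pivot; mean is the default aggregation function of .pivot_table():

```python
# Mean weekly sales for each store type (aggfunc defaults to "mean").
print(walmart.pivot_table(values="Weekly_Sales", index="Type"))
```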
Pivoting for mean and median Weekly_Sales for each store Type
In [47]:
| Type | mean Weekly_Sales | median Weekly_Sales |
|------|-------------------|---------------------|
| A | 20079.169258 | 10096.75 |
| B | 12263.647825 | 6197.43 |
| C | 9484.482310 | 1137.34 |
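For several statistics at once, pass a list to aggfunc:

```python
# Mean and median weekly sales per store type.
print(walmart.pivot_table(values="Weekly_Sales", index="Type", aggfunc=["mean", "median"]))
```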
Pivoting for mean Weekly_Sales by store Type and Is_Holiday
| Type | Is_Holiday = False | Is_Holiday = True |
|------|--------------------|-------------------|
| A | 20009.541574 | 21001.295283 |
| B | 12166.429612 | 13549.646294 |
| C | 9464.078381 | 9757.554889 |
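A sketch of pivoting on two variables, using Is_Holiday as the column variable:

```python
# Mean weekly sales by store type (rows) and holiday flag (columns).
print(walmart.pivot_table(values="Weekly_Sales", index="Type", columns="Is_Holiday"))
```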
Fill in missing values and sum values with pivot tables
The .pivot_table() method has several useful arguments, including fill_value and margins.
fill_value: replaces missing values with a real value. margins: a shortcut for when we pivot by two variables but also want to summarize each of those variables separately; it adds the row and column totals of the pivot table contents.
Mean of Weekly_Sales by Department and Type, filling missing values with 0
In [49]:
| Department | Type A | Type B | Type C |
|------------|--------|--------|--------|
| 1 | 23280.657103 | 17821.729825 | 8955.170884 |
| 2 | 51935.206120 | 43359.816952 | 14398.908941 |
| 3 | 14044.489715 | 12880.306941 | 803.546661 |
| 4 | 33128.245264 | 21457.823086 | 13506.047051 |
| 5 | 26589.677249 | 21109.894917 | 766.519121 |
| ... | ... | ... | ... |
| 95 | 97473.593714 | 41428.486751 | 50990.968371 |
| 96 | 19771.434235 | 4874.171608 | 15720.388253 |
| 97 | 21937.066379 | 3726.370891 | 13468.309357 |
| 98 | 10971.473930 | 317.638051 | 5406.640087 |
| 99 | 486.289516 | 21.008889 | 3.332143 |
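A sketch of the pivot above, using fill_value:

```python
# Mean weekly sales by department and store type; missing combinations become 0.
print(
    walmart.pivot_table(
        values="Weekly_Sales", index="Department", columns="Type", fill_value=0
    )
)
```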
Mean Weekly_Sales by Department and Type, filling missing values with 0 and adding totals for all rows and columns
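And with margins=True added, a sketch of this final cell (its output is not shown above):

```python
# margins=True adds an "All" row and column that summarize across all rows and columns.
print(
    walmart.pivot_table(
        values="Weekly_Sales", index="Department", columns="Type",
        fill_value=0, margins=True
    )
)
```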