PDA Assignments
  • Python For Data Analytics
    • 1.Python
      • 1.Python Documents
        • 1.Data Types
        • 2.Variables In Python
        • 3.Operators In Python
        • 4.User Input In Python
        • 5.TypeCasting In Python
        • 6.Strings In Python
        • 7.Conditional Statements In Python
        • 8.Branching using Conditional Statements and Loops in Python
        • 9.Lists In Python
        • 10.Sets In Python
        • 11.Tuples In Python
        • 12.Dictionary In Python
        • 13.Functions In Python
        • 14.File Handling In Python
        • 15.Numerical Computing with Python and Numpy
      • 2.Python Assignments
        • Data Type & Variables
        • Operators Assignment
        • User Input & Type Casting
        • Functions- Basic Assignments
        • String Assignments
          • String CheatSheet
        • Conditional Statements Assignments
        • Loops Assignments
        • List Assignments
          • List Cheatsheet
        • Set Assignments
          • Sets Cheatsheet
        • Dictionary Assignments
          • Dictionary Cheatsheet
        • Function Assignments
        • Functions used in Python
      • 3.Python Projects
        • Employee Management System
        • Hamming distance
        • Webscraping With Python
          • Introduction To Web Scraping
          • Importing Necessary Libraries
          • Basic Introduction To HTML
          • Introduction To BeautifulSoup
          • Flipkart Web Scraping
            • Scraping Step By Step
        • Retail Sales Analysis
        • Guess the Word Game
        • Data Collection Through APIs
        • To-Do List Manager
        • Atm-functionalities(nested if)
        • Distribution of Cards(List & Nested for)
        • Guess the Number Game
      • 4.Python + SQL Projects
        • Bookstore Management System
    • 2.Data Analytics
      • 1.Pandas
        • 1.Pandas Documents
          • 1.Introduction To Pandas
          • Reading and Loading Different Data
          • 2.Indexing and Slicing In Pandas
          • 3.Joining In Pandas
          • 4.Missing Values In Pandas
          • 5.Outliers In Pandas
          • 6.Aggregating Data
          • 7.DateTime In Pandas
          • 8.Validation In Pandas
          • 9.Fetching Data From SQL
          • 10. Automation In Pandas
          • 11.Matplotlib - Data Visualization
          • 12. Seaborn - Data Visualization
          • 13. Required Files
        • 3.Pandas Projects
          • Retail Sales Analysis
            • Retail Sales Step By Step
          • IMDB - Dataset Analysis - Basic
        • 2. Pandas Assignments
          • 1. Reading and Loading the Data
          • 2. Data frame Functions and Properties
          • 3. Series - Basic Operations
          • 4. Filtering in Pandas
          • 5. Advance Filtering
          • 6. Aggregate Functions & Groupby
          • 7. Pivot Tables
          • 8. Datetime
          • 9. String Functions
Powered by GitBook
On this page
  • Mean
  • Median
  • Data
  • Summarizing dates
  • Efficient Summaries
  • Cumulative Statistics
  • Dropping duplicates
  • Counting categorical variables
  • .groupby()
  • Multiple grouped summaries
  • Pivot tables
  • Fill in missing values and sum values with pivot tables
  1. Python For Data Analytics
  2. 2.Data Analytics
  3. 1.Pandas
  4. 1.Pandas Documents

6.Aggregating Data

Mean

Mean is the average of the given numbers and is calculated by dividing the sum of given numbers by the total number of numbers.

Median

The Median is the Middle value in the list of numbers.

Data

In [30]:

import pandas as pd
import numpy as np
walmart = pd.read_csv('walmart.csv')
walmart['Date']  = pd.to_datetime(walmart['Date'])
walmart.head()
Store
Type
Department
Weekly_Sales
Is_Holiday
Temperature
Fuel_Price
Unemployment
Date

0

1

A

1

57258.43

False

62.27

2.719

7.808

2010-04-02

1

1

A

1

16333.14

False

80.91

2.669

7.787

2010-07-02

2

1

A

1

19403.54

False

46.63

2.561

8.106

2010-02-26

3

1

A

1

22517.56

False

49.27

2.708

7.838

2010-12-03

4

1

A

1

17596.96

False

66.32

2.808

7.808

2010-04-16

So, this is our Data about Walmart store. In this module we are going to use Walmart DataSet.

Let's find out the the mean and median for Weekly_Sales column.

In [31]:

walmart['Weekly_Sales'].mean()

Out[31]:

15983.4296920532

In [32]:

walmart['Weekly_Sales'].median()

Out[32]:

7616.55

Summarizing dates

Maximum and Minimum allow us to see what time range your data covers.

In [33]:

walmart['Date'].max()

Out[33]:

Timestamp('2012-10-26 00:00:00')

In [34]:

walmart['Date'].min()

Out[34]:

Timestamp('2010-02-05 00:00:00')

By using .max() and .min() we get to know that our DataSet have data from January of 2010 to December of 2012.

Efficient Summaries

Agg Method:

The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super efficient.

Have you noticed that our temperature column is in Fahrenheit and I want to convert it into Celsius. Now, we will make my custom function that will convert temperature from Fahrenheit to Celsius and we will apply it through .agg() method to a DataFrame column.

def f_to_c(column):
    return (column-32)*5/9

walmart['Temperature'].agg(f_to_c)
0         16.816667
1         27.172222
2          8.127778
3          9.594444
4         19.066667
            ...    
282446    18.511111
282447    18.266667
282448     2.777778
282449    25.916667
282450    16.244444
Name: Temperature, Length: 282451, dtype: float64

Here, we are applying a custom function with the help of .agg() method, you can pass the list of functions to apply more than one function or you can pass the list of columns if you want to apply the function on more than one column.

Cumulative Statistics

Cumulative statistics can also be helpful in tracking summary statistics over time. We'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow us to identify what the total sales were so far as well as what the highest weekly sales were so far.

Cumulative sum of weekly_sales

In [36]:

# cum_walmart['Cum_Weekly_Sales'] = cum_walmart['Weekly_Sales'].cumsum()

Dropping duplicates

Removing duplicates is very essential to get accurate counts, because we don't want to count the same thing multiple times. In this exercise, we'll create some new DataFrames using unique values from walmart.

In [37]:

walmart_store_types = walmart.drop_duplicates(subset=['Store','Type'])
walmart_store_types.head()
Store
Type
Department
Weekly_Sales
Is_Holiday
Temperature
Fuel_Price
Unemployment
Date

0

1

A

1

57258.43

False

62.27

2.719

7.808

2010-04-02

6946

2

A

1

27023.35

False

69.24

2.603

8.163

2010-10-01

13804

3

B

1

4238.56

False

82.20

2.669

7.346

2010-07-02

19897

4

A

1

36486.28

False

63.96

2.619

7.127

2010-10-01

26713

5

B

1

8588.14

False

71.10

2.603

6.768

2010-10-01

Counting categorical variables

Counting is a great way to get an overview of your data and to spot unknown information that you might not notice. We'll count the number of each type of store and the number of each department number using the DataFrames we created in Dropping Duplicates.

Counting the number of stores of each Type

In [38]:

walmart_store_counts = walmart_store_types['Type'].value_counts()
walmart_store_counts

Out[38]:

A    22
B    17
C     6
Name: Type, dtype: int64

Proportion of stores of each Type

Normalize: normalize argument is used to turn the counts into proportion of the total.

In [39]:

walmart_store_props = walmart_store_types['Type'].value_counts(normalize=True)
walmart_store_props

Out[39]:

A    0.488889
B    0.377778
C    0.133333
Name: Type, dtype: float64

Counting the number of each department number and sorting values

In [40]:

walmart_dept_counts_sorted = walmart['Department'].value_counts(sort=True)
walmart_dept_counts_sorted

Out[40]:

90    4394
2     4379
79    4360
91    4355
14    4352
      ... 
78     147
65      96
77      93
39       9
43       8
Name: Department, Length: 81, dtype: int64

Proportion of departments of each number and sorting values

In [41]:

walmart_dept_props_sorted = walmart['Department'].value_counts(sort=True, normalize=True)
walmart_dept_props_sorted

Out[41]:

90    0.015557
2     0.015504
79    0.015436
91    0.015419
14    0.015408
        ...   
78    0.000520
65    0.000340
77    0.000329
39    0.000032
43    0.000028
Name: Department, Length: 81, dtype: float64

.groupby()

Pandas groupby is used for grouping the data according to the categories and apply a function to the categories. It also helps to aggregate data efficiently.

Group by Type to calculate total Weekly_Sales

In [42]:

walmart_sales_by_type = walmart.groupby("Type")["Weekly_Sales"].sum()
walmart_sales_by_type

Out[42]:

Type
A    2.902625e+09
B    1.341802e+09
C    2.701086e+08
Name: Weekly_Sales, dtype: float64

Here, we group by the Type and perform sum on the column Weekly_Sales.

Group by Type and Is_Holiday to calculate total Weekly_Sales

In [43]:

walmart_sales_by_type_is_holiday = walmart.groupby(["Type",'Is_Holiday'])["Weekly_Sales"].sum()
walmart_sales_by_type_is_holiday

Out[43]:

Type  Is_Holiday
A     False         2.689482e+09
      True          2.131421e+08
B     False         1.237606e+09
      True          1.041968e+08
C     False         2.507886e+08
      True          1.931996e+07
Name: Weekly_Sales, dtype: float64

In this case we are grouping by two columns and summarizing one column if you want to perfrom any other summary statistics on more than one column then you have to pass the list of those column.

Multiple grouped summaries

Earlier we saw that the .agg() method is useful to compute multiple statistics on multiple variables. It also works with .groupby or grouped data.

For each store Type, aggregating Weekly_Sales to gett min, max, mean, and median

In [44]:

walmart_sales_stats = walmart.groupby('Type')['Weekly_Sales'].agg([np.min,np.max,np.mean,np.median])
walmart_sales_stats
amin
amax
mean
median

Type

A

-4988.94

381072.11

20079.169258

10096.75

B

-1750.00

693099.36

12263.647825

6197.43

C

-379.00

112152.35

9484.482310

1137.34

Notice that the minimum Weekly_Sales is negative because some stores had more returns than sales.

For each store Type, aggregating Unemployment and Fuel_Price to get min, max, mean, and median

In [45]:

walmart_unemp_fuel_stats = walmart.groupby('Type')[['Unemployment','Fuel_Price']].agg([np.min,np.max,np.mean,np.median])
walmart_unemp_fuel_stats
Unemployment
Fuel_Price

amin

amax

mean

median

amin

amax

mean

median

Type

A

3.879

14.313

7.798863

7.818

2.472

4.468

3.343295

3.416

B

4.125

14.313

7.936579

7.874

2.514

4.468

3.381915

3.494

C

5.217

14.313

8.948222

8.300

2.514

4.468

3.363569

3.417

Pivot tables

Pivot tables are another way of calculating grouped summary statistics. Basically pivot tables are used in spreadsheets, if you have ever used spreadsheets then chances are you have used pivot tables. Let's see how to create pivot tables in pandas.

Pivoting on one variable

We'll perform calculations using .pivot_table() to replicate the calculations we performed in the .groupby().

Pivoting for mean Weekly_Sales for each store Type

In [46]:

mean_sales_by_type = walmart.pivot_table(values="Weekly_Sales", index="Type")
mean_sales_by_type
Weekly_Sales

Type

A

20079.169258

B

12263.647825

C

9484.482310

Pivoting for mean and median Weekly_Sales for each store Type

In [47]:

mean_med_sales_by_type = walmart.pivot_table(values="Weekly_Sales", index="Type", aggfunc=[np.mean,np.median])
mean_med_sales_by_type
mean
median

Weekly_Sales

Weekly_Sales

Type

A

20079.169258

10096.75

B

12263.647825

6197.43

C

9484.482310

1137.34

Pivoting for mean Seekly_Sales by store Type and Holiday

mean_sales_by_type_holiday = walmart.pivot_table(values="Weekly_Sales", index="Type", columns='Is_Holiday')
mean_sales_by_type_holiday
Is_Holiday
False
True

Type

A

20009.541574

21001.295283

B

12166.429612

13549.646294

C

9464.078381

9757.554889

Fill in missing values and sum values with pivot tables

.pivot_table() method has several useful arguments, including fill_value and margins.

fill_value: replaces missing values with real value. margins: is a shortcut for when we pivoted by two variables, but also wanted to pivot by each of those variables separately, it gives the row and column totals of the pivot table contents.

mean of Weekly_Sales by Department and Type, fill missing values with 0

In [49]:

walmart.pivot_table(values='Weekly_Sales', index='Department', columns='Type', fill_value=0)
Type
A
B
C

Department

1

23280.657103

17821.729825

8955.170884

2

51935.206120

43359.816952

14398.908941

3

14044.489715

12880.306941

803.546661

4

33128.245264

21457.823086

13506.047051

5

26589.677249

21109.894917

766.519121

...

...

...

...

95

97473.593714

41428.486751

50990.968371

96

19771.434235

4874.171608

15720.388253

97

21937.066379

3726.370891

13468.309357

98

10971.473930

317.638051

5406.640087

99

486.289516

21.008889

3.332143

mean Weekly_Sales by Department and Type, fill missing values with 0s, sum all rows and cols

Previous5.Outliers In PandasNext7.DateTime In Pandas

Last updated 2 years ago