PDA Assignments
  • Python For Data Analytics
    • 1.Python
      • 1.Python Documents
        • 1.Data Types
        • 2.Variables In Python
        • 3.Operators In Python
        • 4.User Input In Python
        • 5.TypeCasting In Python
        • 6.Strings In Python
        • 7.Conditional Statements In Python
        • 8.Branching using Conditional Statements and Loops in Python
        • 9.Lists In Python
        • 10.Sets In Python
        • 11.Tuples In Python
        • 12.Dictionary In Python
        • 13.Functions In Python
        • 14.File Handling In Python
        • 15.Numerical Computing with Python and Numpy
      • 2.Python Assignments
        • Data Type & Variables
        • Operators Assignment
        • User Input & Type Casting
        • Functions- Basic Assignments
        • String Assignments
          • String CheatSheet
        • Conditional Statements Assignments
        • Loops Assignments
        • List Assignments
          • List Cheatsheet
        • Set Assignments
          • Sets Cheatsheet
        • Dictionary Assignments
          • Dictionary Cheatsheet
        • Function Assignments
        • Functions used in Python
      • 3.Python Projects
        • Employee Management System
        • Hamming distance
        • Webscraping With Python
          • Introduction To Web Scraping
          • Importing Necessary Libraries
          • Basic Introduction To HTML
          • Introduction To BeautifulSoup
          • Flipkart Web Scraping
            • Scraping Step By Step
        • Retail Sales Analysis
        • Guess the Word Game
        • Data Collection Through APIs
        • To-Do List Manager
        • Atm-functionalities(nested if)
        • Distribution of Cards(List & Nested for)
        • Guess the Number Game
      • 4.Python + SQL Projects
        • Bookstore Management System
    • 2.Data Analytics
      • 1.Pandas
        • 1.Pandas Documents
          • 1.Introduction To Pandas
          • Reading and Loading Different Data
          • 2.Indexing and Slicing In Pandas
          • 3.Joining In Pandas
          • 4.Missing Values In Pandas
          • 5.Outliers In Pandas
          • 6.Aggregating Data
          • 7.DateTime In Pandas
          • 8.Validation In Pandas
          • 9.Fetching Data From SQL
          • 10. Automation In Pandas
          • 11.Matplotlib - Data Visualization
          • 12. Seaborn - Data Visualization
          • 13. Required Files
        • 3.Pandas Projects
          • Retail Sales Analysis
            • Retail Sales Step By Step
          • IMDB - Dataset Analysis - Basic
        • 2. Pandas Assignments
          • 1. Reading and Loading the Data
          • 2. Data frame Functions and Properties
          • 3. Series - Basic Operations
          • 4. Filtering in Pandas
          • 5. Advance Filtering
          • 6. Aggregate Functions & Groupby
          • 7. Pivot Tables
          • 8. Datetime
          • 9. String Functions
Powered by GitBook
On this page
  • Project:
  • Download Dataset:
  • Import Necessary Libraries
  • Assign Path Variable to Path of Data
  • Merging 12 months of sales data in a single file:
  • List all files from the directory
  • Concatenate all files :
  • Convert To CSV
  • Read From Updated DataFrame
  • Augment Data With Additional Columns
  • Clean up the data :
  • Convert Order Date in Datetime
  • Add Month Column:
  • What was the best month for sales? How much was earned that month?
  • Best Month For Sales : December
  • Total Sale Value In December : 9226886.68
  • Plot the Data:
  • What city sold the most product?
  • City Sold Most Product : San Francisco
  • Plot the Data:
  • What time should we display advertisements to maximize the likelihood of customer’s buying product?
  • Plot the Data :
  • What Products are most often sold together ?
  • What product sold the most? Why do you think it sold the most?
  1. Python For Data Analytics
  2. 2.Data Analytics
  3. 1.Pandas
  4. 3.Pandas Projects

Retail Sales Analysis

Previous3.Pandas ProjectsNextRetail Sales Step By Step

Last updated 1 year ago

Project:

In this project, we are going to analyze 12 months of data of sales. We will learn how to clean, manage and analyze the dataset to find some meaningful information.

We start by cleaning our data. Tasks during this section include:

Once we have cleaned up our data a bit, we move the data exploration section. In this section we explore 5 high level business questions related to our data:

Download Dataset:

Once you have downloaded the dataset , extract and save it somewhere.

Import Necessary Libraries

import os
import pandas as pd

Assign Path Variable to Path of Data

base_path = r'C:\Users\abhis\Downloads\SalesData (1)'

Merging 12 months of sales data in a single file:

To Calculate and analyze Company Sales Data, We must merge all 12 CSV files into one.

Let us go through one-month data first , to know things better.

one_month = pd.read_csv(fr"{base_path}\Sales_April_2019.csv")
one_month.head(3)
Order ID
Product
Quantity Ordered
Price Each
Order Date
Purchase Address

176558

USB-C Charging Cable

2

11.95

04/19/19 08:46

917 1st St, Dallas, TX 75001

176559

Bose SoundSport Headphones

1

99.99

04/07/19 22:30

682 Chestnut St, Boston, MA 02215

176560

Google Phone

1

600

04/12/19 14:38

669 Spruce St, Los Angeles, CA 90001

This is how our one month data looks like. Can we get all files from the directory ?

List all files from the directory

To get all files from a specific directory , we can use listdir() method.

files = os.listdir(path)
files

Output :

['Sales_April_2019.csv',
 'Sales_August_2019.csv',
 'Sales_December_2019.csv',
 'Sales_February_2019.csv',
 'Sales_January_2019.csv',
 'Sales_July_2019.csv',
 'Sales_June_2019.csv',
 'Sales_March_2019.csv',
 'Sales_May_2019.csv',
 'Sales_November_2019.csv',
 'Sales_October_2019.csv',
 'Sales_September_2019.csv']

Now we need to concat all these file in a single file, for that we need to create an empty DataFrame and concat all sales file.

Concatenate all files :

all_month_data = pd.DataFrame()  #Empty DataFrame

for file in files:    #Iteration In Sales Files
    monthly_data = pd.read_csv(fr"{path}\{file}")    #Read File
    all_month_data = pd.concat((all_month_data,monthly_data))  #Concatenate file
    
all_month_data

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

176558

USB-C Charging Cable

2

11.95

04/19/19 08:46

917 1st St, Dallas, TX 75001

176559

Bose SoundSport Headphones

1

99.99

04/07/19 22:30

682 Chestnut St, Boston, MA 02215

176560

Google Phone

1

600

04/12/19 14:38

669 Spruce St, Los Angeles, CA 90001

176560

Wired Headphones

1

11.99

04/12/19 14:38

669 Spruce St, Los Angeles, CA 90001

176561

Wired Headphones

1

11.99

04/30/19 09:27

333 8th St, Los Angeles, CA 90001

...

...

...

...

...

...

259353

AAA Batteries (4-pack)

3

2.99

09/17/19 20:56

840 Highland St, Los Angeles, CA 90001

259354

iPhone

1

700

09/01/19 16:00

216 Dogwood St, San Francisco, CA 94016

259355

iPhone

1

700

09/23/19 07:39

220 12th St, San Francisco, CA 94016

259356

34in Ultrawide Monitor

1

379.99

09/19/19 17:30

511 Forest St, San Francisco, CA 94016

259357

USB-C Charging Cable

1

11.95

09/30/19 00:18

250 Meadow St, San Francisco, CA 94016

Now that we have a DataFrame with all sales Files , let us convert it into a CSV File.

Convert To CSV

all_month_data.to_csv(fr'{path}\{"Sales.csv"}',index=False)

Let us read out Data From Updated DataFrame.

Read From Updated DataFrame

all_data = pd.read_csv(fr'{path}\{"Sales.csv"}')
all_data.head()
Order ID
Product
Quantity Ordered
Price Each
Order Date
Purchase Address

176558

USB-C Charging Cable

2

11.95

04/19/19 08:46

917 1st St, Dallas, TX 75001

176559

Bose SoundSport Headphones

1

99.99

04-07-2019 22:30

682 Chestnut St, Boston, MA 02215

176560

Google Phone

1

600

04-12-2019 14:38

669 Spruce St, Los Angeles, CA 90001

176560

Wired Headphones

1

11.99

04-12-2019 14:38

669 Spruce St, Los Angeles, CA 90001

176561

Wired Headphones

1

11.99

04/30/19 09:27

333 8th St, Los Angeles, CA 90001

Augment Data With Additional Columns

Clean up the data :

The first step in this is figuring out what we need to clean. I have found in practice, that you find things you need to clean as you perform operations and get errors. Based on the error, you decide how you should go about cleaning the data.

Display All rows of NAN:

all_data[all_data.isna().any(axis=1)]

Convert Order Date in Datetime

To convert Order Date in Datetime, first thing that needs to be done is to bring all date in a single format.

all_data['Order Date'] = all_data['Order Date'].str.replace('/','-')
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])

Output :

File "C:\Users\abhis\.virtualenvs\python\lib\site-packages\dateutil\parser\_parser.py", line 643, in parse
    raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: Order Date

Reason Behind Error :

---> There is some unknown string named as 'Order Date' in our Dataset.

Let us see where this 'Order Date' is by using conditional formatting.

temp = all_data.loc[all_data['Order Date'] == 'Order Date']
temp

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

517

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

1146

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

1152

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

2869

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

2884

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

...

...

...

...

...

...

...

2234519

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

2234906

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

2235918

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

2235987

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

2236093

Order ID

Product

Quantity Ordered

Price Each

Order Date

Purchase Address

Let us remove all these 4260 rows from our Dataframe.

all_data = all_data.loc[all_data['Order Date']!='Order Date']
all_data.head()

Now let us try to convert Order Date again.

all_data['Order Date'] = all_data['Order Date'].str.replace('/','-')
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])

Add Month Column:

To add month column , all we need to do is to use date time methods.

all_data['Month'] = all_data['Order Date'].dt.month_name()
all_data['Month']

Output:

0              April
1              April
2              April
3              April
4              April
             ...    
4285010    September
4285011    September
4285012    September
4285013    September
4285014    September
Name: Month, Length: 4276850, dtype: object

What was the best month for sales? How much was earned that month?

In Our DataFrame , There are Two columns Price Each and Quantity Ordered. But we do not have Total Sale Value (TSV), We should also add a calculated column Total Sale Value. Let's do it first.

Now Price Each , Quantity Ordered is object data type , so we need to convert it.

Convert Price Each and Quantity Ordered in Float and Calculate TSV.

#Conversion To Float
all_data['Quantity Ordered'] = all_data['Quantity Ordered'].astype('float')
all_data['Price Each'] = all_data['Price Each'].astype('float')

#Calculated Column
all_data['TSV'] = all_data['Quantity Ordered'] * all_data['Price Each']
all_data.head()

Now we can easily use groupby method to find maximum sales value.

all_data.groupby('month')['TSV'].sum()
month
1     3644513.46
2     4404044.84
3     5614200.76
4     6781340.48
5     6305213.50
6     5155604.52
7     5295551.52
8     4488935.76
9     4195120.26
10    7473453.76
11    6399206.40
12    9226886.68
Name: TSV, dtype: float64

Best Month For Sales : December

Total Sale Value In December : 9226886.68

Plot the Data:

months = range(1,13)
plt.bar(months,results)
plt.xticks(months)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()

December is a Festival Month , the biggest thing is December is Christmas, So people spend money and maybe that is the reason for december is best month for sales.

What city sold the most product?

There is no City Column in our Dataset, But you can see Purchase Address ,where City is mentioned in middle, we can use Purchase address and create a City Column.

all_data['City'] = all_data['Purchase Address'].str.split(',',expand=True)[1]
all_data

Now we can groupby City and Calculate Total Quantity Ordered.

all_data.groupby('City')['Quantity Ordered'].sum().sort_values()

Output :

City
 Austin            22306.0
 Portland          28106.0
 Seattle           33106.0
 Atlanta           33204.0
 Dallas            33460.0
 Boston            45056.0
 New York City     55864.0
 Los Angeles       66578.0
 San Francisco    100478.0
Name: Quantity Ordered, dtype: float64

City Sold Most Product : San Francisco

Plot the Data:

import matplotlib.pyplot as plt

keys = [city for city, df in all_data.groupby(['City'])]

plt.bar(keys,all_data.groupby(['City']).sum()['TSV'])
plt.ylabel('Sales in USD ($)')
plt.xlabel('Month number')
plt.xticks(keys, rotation='vertical', size=8)
plt.show()

What time should we display advertisements to maximize the likelihood of customer’s buying product?

Because we need to work on time, let's create Hour and Minute Column.

all_data['Hour'] = all_data['Order Date'].dt.hour
all_data['Minute'] = all_data['Order Date'].dt.minute
all_data.head()

Let us see what time most people bought from this Company .

all_data.groupby('Hour')['TSV'].sum().sort_values(ascending=False)
Hour
19    4825877.08
12    4633642.68
11    4601220.48
20    4563432.48
18    4438696.60
13    4310779.60
17    4258723.22
14    4167345.46
21    4084001.72
10    3888573.54
15    3883099.20
16    3809202.62
9     3278061.16
22    3215098.42
8     2384697.94
23    2358608.88
7     1489708.24
0     1427442.54
1      921733.76
6      896226.00
2      469702.88
5      461359.64
4      325322.02
3      291515.78
Name: TSV, dtype: float64

Plot the Data :

keys = [pair for pair, df in all_data.groupby(['Hour'])]

plt.plot(keys, all_data.groupby(['Hour']).count())
plt.xticks(keys)
plt.grid()
plt.show()

What Products are most often sold together ?

This is going to be a tricky business. Before diving into the problem, let us understand the problem.

What Products are most often sold together gives a very important insight that helps company lure customers into spending more and generate more profit.

You must have seen Amazon's Customer Who Bought this item also bought this Item. We are going to do something like this.

Let's try to solve this problem Step by step :

Duplicate Order ID :

Same Order ID for different products suggests customer bought different products Together in Our Data. Let us look for Duplicate OrderId Products:

# All Duplicate Order ID 
df = all_data[all_data['Order ID'].duplicated(keep=False)]
df

Group Different Products With Same Order ID :

Let's create a new column that contains different Products with the same Order ID. We can do this with the help of transform method.

df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x : ','.join(x))
df.head(10)
from itertools import combinations
from collections import Counter

count = Counter()
row_list = []
for row in df['Grouped']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list, 3)))

count.most_common(10)

Output :

[(('Google Phone', 'USB-C Charging Cable', 'Wired Headphones'), 86),
 (('iPhone', 'Lightning Charging Cable', 'Wired Headphones'), 62),
 (('iPhone', 'Lightning Charging Cable', 'Apple Airpods Headphones'), 47),
 (('Google Phone', 'USB-C Charging Cable', 'Bose SoundSport Headphones'), 35),
 (('Vareebadd Phone', 'USB-C Charging Cable', 'Wired Headphones'), 33),
 (('iPhone', 'Apple Airpods Headphones', 'Wired Headphones'), 27),
 (('Google Phone', 'Bose SoundSport Headphones', 'Wired Headphones'), 24),
 (('Vareebadd Phone', 'USB-C Charging Cable', 'Bose SoundSport Headphones'),
  16),
 (('USB-C Charging Cable', 'Bose SoundSport Headphones', 'Wired Headphones'),
  5),
 (('Vareebadd Phone', 'Bose SoundSport Headphones', 'Wired Headphones'), 5)]

What product sold the most? Why do you think it sold the most?

This is pretty simple , we will group all product and calculate quantity ordered.

all_data.groupby('Product')['Quantity Ordered'].sum().sort_values().plot(kind='bar')
Counting "unique pairs" of numbers into a python dictionary?Stack Overflow
Logo
2MB
SalesData (3).zip
archive