2. DataFrame and Series Operations
In this document, we will learn about the core pandas data types, DataFrame and Series, and work through common column operations.
What is DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns), used extensively in data analysis and manipulation. It is similar to a spreadsheet or SQL table and is part of the pandas library in Python.
Example
import pandas as pd
# Creating a simple DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
In this example, we create a DataFrame df from a dictionary data containing the columns 'Name', 'Age', and 'City', each with three entries. The DataFrame provides an intuitive way to manipulate and analyze such tabular data.
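The terms "heterogeneous", "labeled axes", and "size-mutable" can be checked directly on the same data. The following is a minimal sketch; the 'Senior' column is a hypothetical addition used only for illustration.

```python
import pandas as pd

# Same data as the example above
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
}
df = pd.DataFrame(data)

# Heterogeneous: each column keeps its own data type
print(df.dtypes)

# Labeled axes: row labels and column labels
print(df.index.tolist())    # [0, 1, 2]
print(df.columns.tolist())  # ['Name', 'Age', 'City']

# Size-mutable: a column can be added after creation
# ('Senior' is a hypothetical column, for illustration only)
df['Senior'] = df['Age'] > 30
print(df.shape)             # (3, 4)
```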
What is Series
A Pandas Series is a one-dimensional array-like object that can hold a sequence of values of any data type. It is analogous to a single column in a DataFrame. One of its key features is the ability to assign an index to each item, allowing for more flexible and powerful data manipulation. A Series can be created using various input types like lists, NumPy arrays, or dictionaries.
Example:
import pandas as pd
# Creating a Series
data = [5, 10, 15, 20]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)
Output:
a 5
b 10
c 15
d 20
dtype: int64
In this example, a Series named series is created from a list data with an explicit index ['a', 'b', 'c', 'd']. Each element in the list corresponds to an index label, enabling easy access and operations like slicing and filtering.
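As a rough illustration of the other input types mentioned above, and of label-based access, slicing, and filtering, the sketch below builds the same values from a dictionary and from a NumPy array. The variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 5, 'b': 10, 'c': 15, 'd': 20})

# From a NumPy array, with an explicit index
from_array = pd.Series(np.array([5, 10, 15, 20]), index=['a', 'b', 'c', 'd'])

# Access a single value by its label
print(from_dict['b'])                             # 10

# Label-based slicing with .loc includes both endpoints
print(from_dict.loc['b':'c'].tolist())            # [10, 15]

# Boolean filtering keeps only the matching labels
print(from_dict[from_dict > 10].index.tolist())   # ['c', 'd']
```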
To load the retail_sales_dataset.csv file in Python, you can use the pandas library. Here's a quick guide:
import pandas as pd
# Load the dataset
file_path = 'retail_sales_dataset.csv'
df = pd.read_csv(file_path)
# Display the first few rows
print(df.head())
Date        Customer ID  Gender  Age  Product Category  Quantity  Price per Unit
11/24/2023  CUST001      Male    34   Beauty            3         50
2/27/2023   CUST002      Female  26   Clothing          2         500
1/13/2023   CUST003      Male    50   Electronics       1         30
5/21/2023   CUST004      Male    37   Clothing          1         500
5/6/2023    CUST005      Male    30   Beauty            2         50
4/25/2023   CUST006      Female  45   Beauty            1         30
3/13/2023   CUST007      Male    46   Clothing          2         25
Understanding DataFrame Properties
When working with a DataFrame in Pandas, it's beneficial to understand its core properties, which include Shape, Columns, Index, and Values:
Shape: Returns the number of rows and columns in the DataFrame, indicating its dimensions.
Columns: Outputs the labels of the columns as an Index object containing all column names.
Index: Provides an Index object representing row labels, which can be default (RangeIndex) or a custom index.
Values: Displays the underlying data of the DataFrame as a 2-dimensional NumPy array.
Here's how you can output these properties:
import pandas as pd
# Load the dataset
file_path = 'retail_sales_dataset.csv'
df = pd.read_csv(file_path)
# Display the properties
print("Shape of the dataframe:", df.shape)
print("Column labels:", df.columns)
print("Index labels:", df.index)
print("Data values:\n", df.values)
Output:
Shape of the dataframe: (7, 7)
Column labels: Index(['Date', 'Customer ID', 'Gender', 'Age', 'Product Category', 'Quantity', 'Price per Unit'], dtype='object')
Index labels: RangeIndex(start=0, stop=7, step=1)
Data values:
[['11/24/2023' 'CUST001' 'Male' 34 'Beauty' 3 50]
['2/27/2023' 'CUST002' 'Female' 26 'Clothing' 2 500]
['1/13/2023' 'CUST003' 'Male' 50 'Electronics' 1 30]
['5/21/2023' 'CUST004' 'Male' 37 'Clothing' 1 500]
['5/6/2023' 'CUST005' 'Male' 30 'Beauty' 2 50]
['4/25/2023' 'CUST006' 'Female' 45 'Beauty' 1 30]
['3/13/2023' 'CUST007' 'Male' 46 'Clothing' 2 25]]
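The same properties can be explored without the CSV file. The sketch below rebuilds the first three rows of the sample data in memory (an assumption made so that the example is self-contained) and shows how each property is typically used.

```python
import pandas as pd

# First three rows of the sample data shown above, built in memory
df = pd.DataFrame({
    'Date': ['11/24/2023', '2/27/2023', '1/13/2023'],
    'Customer ID': ['CUST001', 'CUST002', 'CUST003'],
    'Gender': ['Male', 'Female', 'Male'],
    'Age': [34, 26, 50],
    'Product Category': ['Beauty', 'Clothing', 'Electronics'],
    'Quantity': [3, 2, 1],
    'Price per Unit': [50, 500, 30],
})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = df.shape
print(n_rows, n_cols)        # 3 7

# columns and index are Index objects; tolist() gives plain Python lists
print(df.columns.tolist())
print(df.index.tolist())     # [0, 1, 2]

# to_numpy() gives the data as a 2-D NumPy array, like .values
arr = df.to_numpy()
print(arr.shape)             # (3, 7)
print(arr.dtype)             # object, because the columns mix strings and integers
```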
How To Access a Single Column
To access a single column in a DataFrame, you can use the column label inside square brackets, like this:
# Access the 'Customer ID' column
customer_ids = df['Customer ID']
# Display the 'Customer ID' column
print("Customer IDs:\n", customer_ids)
Output:
Customer IDs:
0 CUST001
1 CUST002
2 CUST003
3 CUST004
4 CUST005
5 CUST006
6 CUST007
Name: Customer ID, dtype: object
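Selecting one column with a single label returns a Series, which is why the output above ends with a Name and dtype line. A minimal sketch on a small in-memory DataFrame (built from the sample values, as an assumption for self-containment):

```python
import pandas as pd

df = pd.DataFrame({
    'Customer ID': ['CUST001', 'CUST002', 'CUST003'],
    'Age': [34, 26, 50],
})

ages = df['Age']

# The result is a Series that carries the column label as its name
print(type(ages).__name__)   # Series
print(ages.name)             # Age

# It shares the row labels of the DataFrame
print(ages.index.tolist())   # [0, 1, 2]

# Series methods are available directly
print(ages.max())            # 50
```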
How To Access Multiple Columns
To access multiple columns in a DataFrame, pass a list of column labels inside the square brackets, which results in double square brackets. For example:
# Access the 'Customer ID' and 'Gender' columns
customer_data = df[['Customer ID', 'Gender']]
# Display the selected columns
print("Customer Data:\n", customer_data)
This will extract and display the specified columns from the DataFrame.
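The difference between single and double brackets is the type of the result. The sketch below, on a small in-memory DataFrame built from the sample values, compares the two; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer ID': ['CUST001', 'CUST002', 'CUST003'],
    'Gender': ['Male', 'Female', 'Male'],
    'Age': [34, 26, 50],
})

# A list of labels returns a DataFrame with the columns in the listed order
subset = df[['Gender', 'Customer ID']]
print(subset.columns.tolist())   # ['Gender', 'Customer ID']

# A one-element list still returns a DataFrame (two-dimensional)
one_col_frame = df[['Gender']]
print(one_col_frame.shape)       # (3, 1)

# A bare label returns a Series (one-dimensional)
one_col_series = df['Gender']
print(one_col_series.shape)      # (3,)
```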
How to Create a new column with single value in it
To create a new column in a DataFrame where each entry is set to the value "India," you can utilize the following example code:
import pandas as pd
# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')
# Create a new 'Country' column with 'India' as the value for every row
df['Country'] = 'India'
# Display the updated DataFrame
print("Updated DataFrame with Country column:\n", df)
In this code snippet, a new column called 'Country' is added to the DataFrame. Every entry in this column is populated with the string "India," effectively setting "India" as the default country for all rows in the DataFrame.
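Assigning a scalar works because pandas repeats the value for every row. The sketch below shows this on a small in-memory DataFrame and, as an alternative, uses assign, which returns a new DataFrame instead of modifying the existing one. The 'Currency' column and its value are hypothetical, for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer ID': ['CUST001', 'CUST002', 'CUST003'],
    'Quantity': [3, 2, 1],
})

# The scalar is repeated for every row
df['Country'] = 'India'
print(df['Country'].tolist())    # ['India', 'India', 'India']

# assign() leaves df unchanged and returns a copy with the extra column
# ('Currency' is a hypothetical column, for illustration only)
with_currency = df.assign(Currency='INR')
print('Currency' in df.columns)             # False
print('Currency' in with_currency.columns)  # True
```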
How to create a new Column With Serial Numbers
To create a new column in a DataFrame that holds a serial number for each row of the retail_sales_dataset.csv file, starting at 1, you can use the following code example:
import numpy as np
import pandas as pd
# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')
# Create a new 'Serial Number' column: one value per row, starting at 1
df['Serial Number'] = np.arange(1, len(df) + 1)
# Display the updated DataFrame
print("Updated DataFrame with Serial Numbers:\n", df)
This code snippet reads the dataset, adds a new column called 'Serial Number' with values running from 1 to the number of rows (1 to 1000 for a 1000-row file), and then displays the updated DataFrame. The array must contain exactly as many values as the DataFrame has rows; otherwise pandas raises a ValueError.
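A fixed-length array only fits a DataFrame with exactly that many rows. The sketch below, on a hypothetical three-row DataFrame, shows what happens when the lengths differ and how sizing the range from len(df) avoids the problem.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Customer ID': ['CUST001', 'CUST002', 'CUST003'],
    'Quantity': [3, 2, 1],
})

# A 1000-element array does not fit a 3-row DataFrame
try:
    df['Serial Number'] = np.arange(1, 1001)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)

# Sizing the range from len(df) works for any number of rows
df['Serial Number'] = np.arange(1, len(df) + 1)
print(df['Serial Number'].tolist())   # [1, 2, 3]
```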
How to create a new Column With Random Numbers
To create a new column named 'Discount' filled with random integers from 1 to 10, one per row, you can use the numpy library to generate the values. Below is a code snippet demonstrating how to achieve this with a DataFrame loaded from the retail_sales_dataset.csv file:
import numpy as np
import pandas as pd
# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')
# Create a new 'Discount' column with random integers from 1 to 10
# (the upper bound of randint is exclusive, so pass 11 to include 10)
df['Discount'] = np.random.randint(1, 11, size=len(df))
# Display the updated DataFrame
print("Updated DataFrame with Discount:\n", df)
This code reads the dataset, generates random numbers using numpy, and adds them as a new column named 'Discount' to the DataFrame.
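Random columns differ on every run unless the generator is seeded. As one possible approach, the sketch below uses NumPy's Generator interface with a fixed seed (the seed value 42 is arbitrary) on a small in-memory DataFrame. As with randint, the upper bound of integers is exclusive, so 11 is passed to allow the value 10.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer ID': ['CUST001', 'CUST002', 'CUST003']})

# Seeded generator: the same seed yields the same numbers on every run
rng = np.random.default_rng(seed=42)

# One random integer from 1 to 10 per row (upper bound 11 is exclusive)
df['Discount'] = rng.integers(1, 11, size=len(df))
print(df)

# A second generator with the same seed reproduces the values
again = np.random.default_rng(seed=42).integers(1, 11, size=len(df))
print((df['Discount'].to_numpy() == again).all())   # True
```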
Mathematical Operations On a Single Column
To perform a mathematical operation on a single column in a DataFrame, you can use mathematical operators directly on the column. Here's an example where we divide the 'Price per Unit' column by 2 in a DataFrame loaded from the retail_sales_dataset.csv file:
import pandas as pd
# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')
# Divide every value in the column by 2 and store the result back
df['Price per Unit'] = df['Price per Unit'] / 2
# Display the updated DataFrame
print("Updated DataFrame:\n", df)
This code snippet reads the dataset, divides each value in the 'Price per Unit' column by 2, and then displays the updated DataFrame.
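An operator applied to a column acts on every element and returns a new Series; the DataFrame only changes when the result is assigned back. A minimal sketch, using the 'Price per Unit' label from the column labels printed earlier and three values from the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Price per Unit': [50, 500, 30]})

# Division is applied element by element and produces floats
half = df['Price per Unit'] / 2
print(half.tolist())                   # [25.0, 250.0, 15.0]

# Other operators work the same way
plus_ten = df['Price per Unit'] + 10
print(plus_ten.tolist())               # [60, 510, 40]

# The original column is unchanged until the result is assigned back
print(df['Price per Unit'].tolist())   # [50, 500, 30]
```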
Mathematical Operations On Multiple Columns
To perform mathematical operations on multiple columns in a DataFrame and create a new column, you can directly reference the columns and use mathematical operators. Here's how to create a new column Total_Amount by multiplying the 'Price per Unit' and 'Quantity' columns in a DataFrame loaded from the retail_sales_dataset.csv file:
import pandas as pd
# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')
# Create a new column by multiplying 'Price per Unit' and 'Quantity'
df['Total_Amount'] = df['Price per Unit'] * df['Quantity']
# Display the updated DataFrame
print("Updated DataFrame with Total_Amount column:\n", df)
This code snippet loads the dataset, calculates the total amount for each row by multiplying the 'Price per Unit' and 'Quantity' columns, and appends the result as a new column Total_Amount in the DataFrame.
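Operations between two columns are aligned row by row. The sketch below checks this on the first three rows of the sample data, using the 'Price per Unit' label from the column labels printed earlier, and shows the equivalent mul method.

```python
import pandas as pd

df = pd.DataFrame({
    'Quantity': [3, 2, 1],
    'Price per Unit': [50, 500, 30],
})

# Row-by-row product of the two columns
df['Total_Amount'] = df['Price per Unit'] * df['Quantity']
print(df['Total_Amount'].tolist())   # [150, 1000, 30]

# The mul() method is equivalent to the * operator
via_method = df['Price per Unit'].mul(df['Quantity'])
print(via_method.equals(df['Total_Amount']))   # True

# The new column behaves like any other, for example in aggregations
print(df['Total_Amount'].sum())      # 1180
```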
How To Delete a Column
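To delete a single column from a DataFrame, you can use the drop method and specify the column to remove. The column must exist in the DataFrame; otherwise drop raises a KeyError. Here's how you can delete a single column, such as the Discount column added in an earlier example, from the DataFrame:
import pandas as pd
# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')
# Add a 'Discount' column so there is something to delete
# (placeholder value; the CSV file itself has no such column)
df['Discount'] = 5
# Delete the single column 'Discount'
df = df.drop(columns='Discount')
# Display the updated DataFrame
print("Updated DataFrame without Discount column:\n", df)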
To delete multiple columns, you can provide a list of column names to the drop method. Here's how to delete both the Discount and Profit columns from the DataFrame:
import pandas as pd
# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')
# Add 'Discount' and 'Profit' columns so there is something to delete
# (placeholder values; the CSV file itself has no such columns)
df['Discount'] = 5
df['Profit'] = 0
# Delete multiple columns 'Discount' and 'Profit'
df = df.drop(columns=['Discount', 'Profit'])
# Display the updated DataFrame
print("Updated DataFrame without Discount and Profit columns:\n", df)
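Two details of drop are easy to miss: it returns a new DataFrame rather than changing the original, and it raises a KeyError when a listed column does not exist. The sketch below illustrates both on a small in-memory DataFrame; the Discount values are hypothetical, and errors='ignore' is shown as one way to skip missing labels.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer ID': ['CUST001', 'CUST002', 'CUST003'],
    'Quantity': [3, 2, 1],
    'Discount': [5, 2, 9],   # hypothetical values, for illustration only
})

# drop() returns a new DataFrame; df itself keeps the column
trimmed = df.drop(columns='Discount')
print('Discount' in df.columns)        # True
print('Discount' in trimmed.columns)   # False

# Dropping a column that does not exist raises a KeyError
try:
    df.drop(columns=['Discount', 'Profit'])
    missing_raised = False
except KeyError as err:
    missing_raised = True
    print("KeyError:", err)

# errors='ignore' drops the labels that exist and skips the rest
safe = df.drop(columns=['Discount', 'Profit'], errors='ignore')
print(safe.columns.tolist())           # ['Customer ID', 'Quantity']
```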