2. DataFrame and Series Operations

This section introduces the two core pandas data structures, DataFrame and Series, and then walks through common column operations.

What Is a DataFrame

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns) used extensively in data analysis and manipulation tasks. It is similar to a spreadsheet or SQL table and is part of the Pandas library in Python.

Example

import pandas as pd

# Creating a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

In this example, we create a DataFrame df using a dictionary data containing columns 'Name', 'Age', and 'City', each with three entries. The DataFrame provides an intuitive way to manipulate and analyze such tabular data.
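
The terms "heterogeneous" and "size-mutable" from the definition above can be made concrete with a short sketch. It rebuilds the same DataFrame, inspects the per-column types, and then adds a column. The 'Senior' column and its threshold of 30 are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Heterogeneous: each column carries its own dtype
print(df.dtypes)

# Size-mutable: a column can be added after the DataFrame exists.
# 'Senior' is a hypothetical column used only for this illustration.
df['Senior'] = df['Age'] >= 30

print(df.shape)
```

The exact dtype reported for the text columns may differ between pandas versions, so the sketch prints it rather than assuming a value.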

What Is a Series

A Pandas Series is a one-dimensional array-like object that can hold a sequence of values of any data type. It is analogous to a single column in a DataFrame. One of its key features is the ability to assign an index to each item, allowing for more flexible and powerful data manipulation. A Series can be created using various input types like lists, NumPy arrays, or dictionaries.

Example:

import pandas as pd

# Creating a Series
data = [5, 10, 15, 20]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])

print(series)

Output:

a     5
b    10
c    15
d    20
dtype: int64

In this example, a Series named series is created from a list data with an explicit index ['a', 'b', 'c', 'd']. Each element in the list corresponds to an index label, enabling easy access and operations like slicing and filtering.
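
The access, slicing, and filtering operations mentioned above might look like the following minimal sketch. It reuses the same values and index labels; the dictionary-based Series at the end uses hypothetical keys 'x' and 'y'.

```python
import pandas as pd

series = pd.Series([5, 10, 15, 20], index=['a', 'b', 'c', 'd'])

# Access a single value by its index label
value_b = series.loc['b']

# Label-based slicing includes both endpoints ('b' and 'c')
middle = series.loc['b':'c']

# Boolean filtering keeps only the values that satisfy the condition
large = series[series > 10]

# A Series can also be built from a dictionary; the keys become the index
# ('x' and 'y' are hypothetical labels)
from_dict = pd.Series({'x': 1, 'y': 2})

print(value_b)
print(middle)
print(large)
print(from_dict)
```

Note that label-based slices with .loc include the end label, unlike positional slices in plain Python lists.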

To load the retail_sales_dataset.csv file using Python, you can use the pandas library. Here's a quick guide:

import pandas as pd

# Load the dataset
file_path = 'retail_sales_dataset.csv'
df = pd.read_csv(file_path)

# Display the first few rows
print(df.head())

The dataset has the following columns and rows:

Date        Customer ID  Gender  Age  Product Category  Quantity  Price per Unit
11/24/2023  CUST001      Male    34   Beauty            3         50
2/27/2023   CUST002      Female  26   Clothing          2         500
1/13/2023   CUST003      Male    50   Electronics       1         30
5/21/2023   CUST004      Male    37   Clothing          1         500
5/6/2023    CUST005      Male    30   Beauty            2         50
4/25/2023   CUST006      Female  45   Beauty            1         30
3/13/2023   CUST007      Male    46   Clothing          2         25

Understanding DataFrame Properties

When working with a DataFrame in Pandas, it's beneficial to understand its core properties, which include Shape, Columns, Index, and Values:

  • Shape: Returns the number of rows and columns in the DataFrame, indicating its dimensions.

  • Columns: Outputs the labels of the columns as an Index object containing all column names.

  • Index: Provides an Index object representing row labels, which can be default (RangeIndex) or a custom index.

  • Values: Returns the underlying data of the DataFrame as a 2-dimensional NumPy array. When the columns have mixed types, as in this dataset, the array has dtype object.

Here's how you can output these properties:

import pandas as pd

# Load the dataset
file_path = 'retail_sales_dataset.csv'
df = pd.read_csv(file_path)

# Display the properties
print("Shape of the dataframe:", df.shape)
print("Column labels:", df.columns)
print("Index labels:", df.index)
print("Data values:\n", df.values)

Output:

Shape of the dataframe: (7, 7)
Column labels: Index(['Date', 'Customer ID', 'Gender', 'Age', 'Product Category', 'Quantity', 'Price per Unit'], dtype='object')
Index labels: RangeIndex(start=0, stop=7, step=1)
Data values:
 [['11/24/2023' 'CUST001' 'Male' 34 'Beauty' 3 50]
  ['2/27/2023' 'CUST002' 'Female' 26 'Clothing' 2 500]
  ['1/13/2023' 'CUST003' 'Male' 50 'Electronics' 1 30]
  ['5/21/2023' 'CUST004' 'Male' 37 'Clothing' 1 500]
  ['5/6/2023' 'CUST005' 'Male' 30 'Beauty' 2 50]
  ['4/25/2023' 'CUST006' 'Female' 45 'Beauty' 1 30]
  ['3/13/2023' 'CUST007' 'Male' 46 'Clothing' 2 25]]
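
The properties can also be explored without the CSV file. The sketch below builds a small DataFrame inline from two rows of the sample data and shows how the array type depends on the column types. It uses to_numpy(), which returns the same kind of array as the values attribute.

```python
import pandas as pd

# Two rows copied from the sample data, built inline so the
# snippet runs without the CSV file
df = pd.DataFrame({
    'Customer ID': ['CUST001', 'CUST002'],
    'Age': [34, 26],
    'Quantity': [3, 2],
})

# shape is a (rows, columns) tuple and can be unpacked
rows, cols = df.shape
print(rows, cols)

# dtypes lists the type of each column
print(df.dtypes)

# Mixed text and number columns fall back to a generic object array
mixed = df.to_numpy()
print(mixed.dtype)

# Selecting only the numeric columns yields an integer array
numeric = df[['Age', 'Quantity']].to_numpy()
print(numeric.dtype)
```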

How To Access a Single Column

To access a single column in a DataFrame, you can use the column label inside square brackets, like this:

# Access the 'Customer ID' column
customer_ids = df['Customer ID']

# Display the 'Customer ID' column
print("Customer IDs:\n", customer_ids)

Output:

Customer IDs:
 0    CUST001
1    CUST002
2    CUST003
3    CUST004
4    CUST005
5    CUST006
6    CUST007
Name: Customer ID, dtype: object

How To Access Multiple Columns

To access multiple columns in a DataFrame, pass a list of column labels inside the square brackets, which results in double square brackets. For example:

# Access the 'Customer ID' and 'Gender' columns
customer_data = df[['Customer ID', 'Gender']]

# Display the selected columns
print("Customer Data:\n", customer_data)

This will extract and display the specified columns from the DataFrame.
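
The difference between single and double square brackets is easy to miss, so here is a minimal sketch that checks the type each form returns. The DataFrame is built inline from a few rows of the sample data so that the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer ID': ['CUST001', 'CUST002', 'CUST003'],
    'Gender': ['Male', 'Female', 'Male'],
    'Age': [34, 26, 50],
})

# Single brackets with one label return a Series
one_column = df['Customer ID']

# A list with one label returns a DataFrame with a single column
one_column_frame = df[['Customer ID']]

# A list with several labels returns a DataFrame with those columns
two_columns = df[['Customer ID', 'Gender']]

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
```

Requesting a label that does not exist in the DataFrame raises a KeyError, so the labels must match the column names exactly, including spaces.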

How To Create a New Column With a Single Value

To create a new column in a DataFrame where every entry is set to the value "India", you can use the following example code:

import pandas as pd

# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')

# Create a new 'Country' column with 'India' as the value for every row
df['Country'] = 'India'

# Display the updated DataFrame
print("Updated DataFrame with Country column:\n", df)

In this code snippet, a new column called 'Country' is added to the DataFrame. Assigning a single value to a column broadcasts it to every row, so each entry in the column is the string "India".
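
As a sketch of an alternative, the assign method adds a column and returns a new DataFrame instead of modifying the existing one. The 'Currency' column and its value 'INR' are hypothetical and serve only as an illustration.

```python
import pandas as pd

df = pd.DataFrame({'Customer ID': ['CUST001', 'CUST002', 'CUST003']})

# Direct assignment modifies df in place; the scalar is repeated for every row
df['Country'] = 'India'

# assign returns a new DataFrame and leaves df unchanged.
# 'Currency' and 'INR' are hypothetical values for illustration.
with_currency = df.assign(Currency='INR')

print(df)
print(with_currency)
```

Direct assignment is the shorter form; assign can be convenient when the original DataFrame should stay untouched.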

How To Create a New Column With Serial Numbers

To add a serial number to each row of the DataFrame loaded from the retail_sales_dataset.csv file, generate a sequence whose length matches the number of rows. Hardcoding the length, for example with np.arange(1, 1001), only works when the file has exactly 1000 rows and raises a ValueError otherwise:

import numpy as np
import pandas as pd

# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')

# Create a new 'Serial Number' column running from 1 to the number of rows
df['Serial Number'] = np.arange(1, len(df) + 1)

# Display the updated DataFrame
print("Updated DataFrame with Serial Numbers:\n", df)

This code snippet reads the dataset, adds a new column called 'Serial Number' with values from 1 to the number of rows, and then displays the updated DataFrame.
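
The length of the assigned sequence has to match the number of rows. The following sketch, using a small inline DataFrame, sizes the sequence from len(df) and shows the error raised when the lengths differ. The 'Bad' column name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Customer ID': ['CUST001', 'CUST002', 'CUST003', 'CUST004']})

# Size the sequence from the DataFrame itself instead of hardcoding it
df['Serial Number'] = range(1, len(df) + 1)
print(df)

# Assigning a sequence of the wrong length raises a ValueError.
# 'Bad' is a hypothetical column name used only to trigger the error.
try:
    df['Bad'] = [1, 2]
except ValueError as exc:
    print("Length mismatch:", exc)
```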

How To Create a New Column With Random Numbers

To create a new column named 'Discount' filled with random integers from 1 to 10, one per row, you can use the numpy library to generate the values. The upper bound of np.random.randint is exclusive, so it must be 11 for the value 10 to be included. Below is a code snippet demonstrating this with a DataFrame loaded from the retail_sales_dataset.csv file:

import numpy as np
import pandas as pd

# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')

# Create a new 'Discount' column with random integers from 1 to 10 (inclusive)
df['Discount'] = np.random.randint(1, 11, size=len(df))

# Display the updated DataFrame
print("Updated DataFrame with Discount:\n", df)

This code reads the dataset, generates one random integer per row using numpy, and adds them as a new column named 'Discount' to the DataFrame.
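
Random values change on every run, which can make results hard to reproduce. As a sketch of one option, NumPy's Generator interface accepts a seed and has an endpoint flag that makes the upper bound inclusive. The seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Customer ID': ['CUST001', 'CUST002', 'CUST003', 'CUST004', 'CUST005'],
})

# A seeded generator produces the same sequence on every run
# (the seed 42 is arbitrary)
rng = np.random.default_rng(seed=42)

# endpoint=True makes the upper bound inclusive, so values range from 1 to 10
df['Discount'] = rng.integers(1, 10, size=len(df), endpoint=True)

print(df)
```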

Mathematical Operations On a Single Column

To perform a mathematical operation on a single column in a DataFrame, you can use mathematical operators directly on the column. The operation is applied to every row. Here's an example that divides the 'Price per Unit' column by 2 in a DataFrame loaded from the retail_sales_dataset.csv file:

import pandas as pd

# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')

# Divide every value in the 'Price per Unit' column by 2
df['Price per Unit'] = df['Price per Unit'] / 2

# Display the updated DataFrame
print("Updated DataFrame:\n", df)

This code snippet reads the dataset, divides each value in the 'Price per Unit' column by 2, and then displays the updated DataFrame.
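
To keep the original values, the result can be stored in a new column instead of overwriting the existing one. The sketch below uses three prices from the sample data; the column names 'Half Price' and 'Half Price (int)' are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Product Category': ['Beauty', 'Clothing', 'Electronics'],
    'Price per Unit': [50, 500, 30],
})

# True division always produces floating-point values.
# 'Half Price' is a hypothetical column name.
df['Half Price'] = df['Price per Unit'] / 2

# Floor division keeps integer columns as integers.
# 'Half Price (int)' is a hypothetical column name.
df['Half Price (int)'] = df['Price per Unit'] // 2

print(df)
```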

Mathematical Operations On Multiple Columns

To perform mathematical operations on multiple columns in a DataFrame and create a new column, you can reference the columns directly and combine them with mathematical operators. The operation is applied row by row. Here's how to create a new column Total_Amount by multiplying the 'Price per Unit' and 'Quantity' columns in a DataFrame loaded from the retail_sales_dataset.csv file:

import pandas as pd

# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')

# Create a new column by multiplying 'Price per Unit' and 'Quantity'
df['Total_Amount'] = df['Price per Unit'] * df['Quantity']

# Display the updated DataFrame
print("Updated DataFrame with Total_Amount column:\n", df)

This code snippet loads the dataset, calculates the total amount for each row by multiplying the 'Price per Unit' and 'Quantity' columns, and appends the result as a new column Total_Amount in the DataFrame.
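
A worked example with the first three rows of the sample data shows the row-by-row arithmetic. The DataFrame is built inline so that the snippet runs without the CSV file.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer ID': ['CUST001', 'CUST002', 'CUST003'],
    'Quantity': [3, 2, 1],
    'Price per Unit': [50, 500, 30],
})

# Rows are matched by index: 3 * 50, 2 * 500, 1 * 30
df['Total_Amount'] = df['Price per Unit'] * df['Quantity']

print(df)

# The new column behaves like any other, for example for a grand total
print(df['Total_Amount'].sum())
```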

How To Delete a Column

To delete a single column from a DataFrame, you can use the drop method and specify the column to remove. The raw retail_sales_dataset.csv file has no Discount column, so the example below first adds one with a placeholder value and then deletes it:

import pandas as pd

# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')

# Add a 'Discount' column so there is something to delete (placeholder value)
df['Discount'] = 5

# Delete a single column 'Discount'
df = df.drop(columns='Discount')

# Display the updated DataFrame
print("Updated DataFrame without Discount column:\n", df)

To delete multiple columns, you can provide a list of column names to the drop method. Here's how to delete both the Discount and Profit columns from the DataFrame. Neither column exists in the raw CSV file, so the example adds both with placeholder values first:

import pandas as pd

# Load the dataset
df = pd.read_csv('retail_sales_dataset.csv')

# Add the two columns so there is something to delete (placeholder values)
df['Discount'] = 5
df['Profit'] = 0

# Delete multiple columns 'Discount' and 'Profit'
df = df.drop(columns=['Discount', 'Profit'])

# Display the updated DataFrame
print("Updated DataFrame without Discount and Profit columns:\n", df)
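
Two details of drop are worth a short sketch: it returns a new DataFrame rather than changing the original, and it raises a KeyError for a missing column unless errors='ignore' is passed. The inline data and the 'Missing' label below are hypothetical.

```python
import pandas as pd

# Hypothetical inline data, including the columns to be dropped
df = pd.DataFrame({
    'Customer ID': ['CUST001', 'CUST002'],
    'Discount': [5, 3],
    'Profit': [10, 20],
})

# drop returns a new DataFrame; df itself still has the 'Discount' column
trimmed = df.drop(columns=['Discount'])

# errors='ignore' skips labels that do not exist ('Missing' is hypothetical)
safe = df.drop(columns=['Discount', 'Missing'], errors='ignore')

print(list(df.columns))
print(list(trimmed.columns))
print(list(safe.columns))
```

Because drop does not modify the DataFrame in place, the result has to be assigned back to a variable for the deletion to persist.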
