2. DataFrame and Series Operations
In this document, we will look at the two core pandas data types, DataFrame and Series, and work through common column operations.
What is a DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns) used extensively in data analysis and manipulation tasks. It is similar to a spreadsheet or SQL table and is part of the Pandas library in Python.
Example
import pandas as pd

# Creating a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
In this example, we create a DataFrame df using a dictionary data containing columns 'Name', 'Age', and 'City', each with three entries. The DataFrame provides an intuitive way to manipulate and analyze such tabular data.
What is a Series
A Pandas Series is a one-dimensional array-like object that can hold a sequence of values of any data type. It is analogous to a single column in a DataFrame. One of its key features is the ability to assign an index to each item, allowing for more flexible and powerful data manipulation. A Series can be created using various input types like lists, NumPy arrays, or dictionaries.
Example:
A Series named series is created from a list named data with an explicit index ['a', 'b', 'c', 'd'], that is, series = pd.Series(data, index=['a', 'b', 'c', 'd']). Each element in the list is paired with one index label, which enables label-based access as well as operations like slicing and filtering.
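The Series described above can be sketched as follows. The values in data are assumed for illustration, since only the index labels are given in the text.

```python
import pandas as pd

# Illustrative values (assumed); only the index labels come from the text.
data = [10, 20, 30, 40]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])

print(series)

# Label-based access, slicing, and filtering
print(series['b'])           # single value looked up by label
print(series['b':'d'])       # label slices include both endpoints
print(series[series > 20])   # keep only values greater than 20
```

Note that slicing by label, unlike slicing by position, includes the end label.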
To load the retail_sales_dataset.csv file in Python, use the read_csv function from the pandas library. A sample of the rows in the dataset:
11/24/2023 | CUST001 | Male   | 34 | Beauty      | 3 | 50
2/27/2023  | CUST002 | Female | 26 | Clothing    | 2 | 500
1/13/2023  | CUST003 | Male   | 50 | Electronics | 1 | 30
5/21/2023  | CUST004 | Male   | 37 | Clothing    | 1 | 500
5/6/2023   | CUST005 | Male   | 30 | Beauty      | 2 | 50
4/25/2023  | CUST006 | Female | 45 | Beauty      | 1 | 30
3/13/2023  | CUST007 | Male   | 46 | Clothing    | 2 | 25
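A minimal sketch of the loading step. So that it runs on its own, an in-memory buffer stands in for the CSV file, and the header names are assumptions made for illustration (only 'Price' and 'Quantity' are named elsewhere in this document).

```python
import io
import pandas as pd

# Stand-in for the file on disk. The header row is assumed, not taken
# from the real dataset; the data rows are copied from the sample above.
csv_text = """Date,Customer ID,Gender,Age,Product Category,Quantity,Price
11/24/2023,CUST001,Male,34,Beauty,3,50
2/27/2023,CUST002,Female,26,Clothing,2,500
1/13/2023,CUST003,Male,50,Electronics,1,30
"""

# read_csv accepts a file path or a file-like object; with the real
# dataset, pass the path to the CSV file instead of the buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # shows up to the first five rows
```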
Understanding DataFrame Properties
When working with a DataFrame in Pandas, it helps to understand its core properties: shape, columns, index, and values.
shape: Returns a tuple (number of rows, number of columns) giving the DataFrame's dimensions.
columns: Returns the column labels as an Index object.
index: Returns an Index object representing the row labels, which is a RangeIndex by default or a custom index if one was set.
values: Returns the underlying data of the DataFrame as a two-dimensional NumPy array.
Each of these is an attribute rather than a method, so it is read without parentheses, for example df.shape.
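The four properties can be inspected as in the sketch below. The small DataFrame is a stand-in built from a few sample rows, and its column names are assumed for illustration.

```python
import pandas as pd

# Small stand-in DataFrame (column names assumed for illustration)
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Age': [34, 26, 50],
    'Price': [50, 500, 30],
})

print(df.shape)    # tuple of (rows, columns)
print(df.columns)  # Index object holding the column labels
print(df.index)    # RangeIndex, because no custom index was set
print(df.values)   # two-dimensional NumPy array of the data
```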
How To Access a Single Column
To access a single column in a DataFrame, put the column label inside square brackets, for example df['Age']. The result is a Series.
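A short sketch of single-column access, assuming a DataFrame with an 'Age' column (the column names and values here are illustrative):

```python
import pandas as pd

# Stand-in DataFrame (column names assumed for illustration)
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Age': [34, 26, 50],
})

ages = df['Age']     # single brackets with one label return a Series
print(type(ages))
print(ages)
```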
How To Access Multiple Columns
To access multiple columns in a DataFrame, pass a list of column labels inside the indexing brackets, which produces the double square bracket form, for example df[['Gender', 'Age']].
This returns a new DataFrame containing only the specified columns, in the order listed.
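As a sketch of multi-column selection, assuming illustrative column names:

```python
import pandas as pd

# Stand-in DataFrame (column names assumed for illustration)
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Age': [34, 26, 50],
    'Price': [50, 500, 30],
})

# The outer brackets index the DataFrame; the inner brackets are a list.
subset = df[['Gender', 'Age']]
print(type(subset))
print(subset)
```

Unlike single-column access, the result here is a DataFrame, even if the list contains only one label.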
How To Create a New Column With a Single Value
To create a new column in a DataFrame where each entry is set to the value "India", assign that value to a new column label, for example df['Country'] = 'India'.
In this code snippet, a new column called 'Country' is added to the DataFrame. Every entry in this column is populated with the string "India," effectively setting "India" as the default country for all rows in the DataFrame.
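A minimal sketch of the assignment, using a small stand-in DataFrame whose existing column is assumed for illustration:

```python
import pandas as pd

# Stand-in DataFrame (existing column assumed for illustration)
df = pd.DataFrame({'Age': [34, 26, 50]})

# Assigning a scalar to a new label broadcasts it to every row.
df['Country'] = 'India'
print(df)
```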
How To Create a New Column With Serial Numbers
To add a serial number to each row of the DataFrame loaded from retail_sales_dataset.csv, assign a sequence that runs from 1 to the number of rows (1 to 1000 for this dataset) to a new column label, 'Serial Number'. The sequence must have the same length as the DataFrame.
This code snippet reads the dataset, adds a new column called 'Serial Number' with values ranging from 1 to 1000, and then displays the updated DataFrame.
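One way to sketch this is with numpy's arange, sized from the DataFrame itself rather than hard-coding 1000. The three-row DataFrame below is a stand-in for the loaded dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded dataset (values taken from the sample rows)
df = pd.DataFrame({'Price': [50, 500, 30]})

# arange excludes its stop value, so add 1 to end at len(df).
df['Serial Number'] = np.arange(1, len(df) + 1)
print(df)
```

Deriving the upper bound from len(df) keeps the column length in step with the data if the number of rows changes.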
How To Create a New Column With Random Numbers
To create a new column named 'Discount' filled with random numbers from 1 to 10, one per row (1000 for the DataFrame loaded from retail_sales_dataset.csv), generate an array of values with the numpy library and assign it to the new column label. The array must have the same length as the DataFrame.
This code reads the dataset, generates random numbers using numpy, and adds them as a new column named 'Discount' to the DataFrame.
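A sketch of the idea, assuming the discounts are meant to be whole numbers (the text does not say whether integers or floats are intended). It uses numpy's Generator API with a fixed seed so the result is repeatable; the seed value and the stand-in DataFrame are illustrative choices.

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded dataset (values taken from the sample rows)
df = pd.DataFrame({'Price': [50, 500, 30]})

# Seeded generator so repeated runs give the same numbers (seed is arbitrary).
rng = np.random.default_rng(seed=0)

# integers() excludes the upper bound by default, so 11 is needed to allow 10.
df['Discount'] = rng.integers(1, 11, size=len(df))
print(df)
```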
Mathematical Operations On a Single Column
To apply a mathematical operation to a single column in a DataFrame, use an arithmetic operator directly on the column; the operation is applied to every value in it. For example, df['Price'] = df['Price'] / 2 halves the 'Price' column of the DataFrame loaded from the retail_sales_dataset.csv file and stores the result back in that column.
This code snippet reads the dataset, divides each value in the 'Price' column by 2, and then displays the updated DataFrame.
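The division can be sketched as below, with a stand-in DataFrame in place of the loaded file:

```python
import pandas as pd

# Stand-in for the loaded dataset (values taken from the sample rows)
df = pd.DataFrame({'Price': [50, 500, 30]})

# The operator is applied element-wise; assigning back overwrites the column.
df['Price'] = df['Price'] / 2
print(df)
```

True division converts the integer column to floats; assigning to a different label instead would keep the original prices alongside the halved ones.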
Mathematical Operations On Multiple Columns
To combine several columns in a calculation and store the result in a new column, reference the columns directly and apply arithmetic operators to them; values are combined row by row. For example, df['Total_Amount'] = df['Price'] * df['Quantity'] creates a new column Total_Amount from the Price and Quantity columns of the DataFrame loaded from the retail_sales_dataset.csv file.
This code snippet loads the dataset, calculates the total amount for each row by multiplying the Price and Quantity columns, and appends the result as a new column Total_Amount in the DataFrame.
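A minimal sketch of the multiplication, using the Quantity and Price values from the first three sample rows as a stand-in for the loaded file:

```python
import pandas as pd

# Stand-in for the loaded dataset (values taken from the sample rows)
df = pd.DataFrame({
    'Quantity': [3, 2, 1],
    'Price': [50, 500, 30],
})

# Rows are matched by index label, then multiplied element-wise.
df['Total_Amount'] = df['Price'] * df['Quantity']
print(df)
```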
How To Delete a Column
To delete a single column from a DataFrame, call the drop method and name the column to remove, for example df.drop(columns=['Discount']) on the DataFrame loaded from the retail_sales_dataset.csv file. By default, drop returns a new DataFrame and leaves the original unchanged, so assign the result back to keep the change.
To delete multiple columns, pass a list of column names to the drop method, for example df.drop(columns=['Discount', 'Profit']) to remove both the Discount and Profit columns.
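Both cases can be sketched as follows. The stand-in DataFrame and its Discount and Profit values are hypothetical, made up only so the snippet runs on its own.

```python
import pandas as pd

# Stand-in DataFrame; Discount and Profit values are hypothetical.
df = pd.DataFrame({
    'Price': [50, 500, 30],
    'Discount': [5, 2, 9],
    'Profit': [10, 120, 4],
})

# drop returns a new DataFrame; df itself is left untouched.
without_discount = df.drop(columns=['Discount'])
without_both = df.drop(columns=['Discount', 'Profit'])

print(without_discount.columns.tolist())
print(without_both.columns.tolist())
```

Writing df = df.drop(columns=[...]) replaces the original with the trimmed copy.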