3. DataFrame Basic Functions
In this note, we cover five basic DataFrame functions: head(), tail(), info(), describe(), and drop().
head() function
The head() function in pandas is used to view the first few rows of a DataFrame. By default, head() returns the first 5 rows, but you can specify the number of rows you wish to see by passing an integer to the function. Here is an example:
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [24, 27, 22, 32, 29]}
df = pd.DataFrame(data)
# Display the first 3 rows of the DataFrame
print(df.head(3))
This function is useful for quickly inspecting a large dataset without printing the entire DataFrame. Note that head() only limits what is displayed; the DataFrame itself is already loaded in memory.
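As a small supplementary sketch (not part of the original example), the snippet below shows two other ways to call head(): with no argument, and with a negative integer, which returns every row except the last n. The variable names first_default and all_but_last_two are chosen here purely for illustration.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
})

# No argument: up to the first 5 rows (this frame has exactly 5)
first_default = df.head()

# Negative n: every row except the last 2
all_but_last_two = df.head(-2)

print(len(first_default))                 # 5
print(all_but_last_two['Name'].tolist())  # ['Alice', 'Bob', 'Charlie']
```

Asking for more rows than exist, for example df.head(10), simply returns the whole DataFrame rather than raising an error.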
tail() function
The tail() function in pandas is used to view the last few rows of a DataFrame. By default, tail() returns the last 5 rows, but you can specify a different number by passing an integer argument. Here is an example:
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [24, 27, 22, 32, 29]}
df = pd.DataFrame(data)
# Display the last 3 rows of the DataFrame
print(df.tail(3))
This function is useful for quickly examining the end of a dataset, for example to check the most recently appended records, without printing the entire DataFrame.
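The following sketch, added here for illustration, highlights two details of tail(): the returned rows keep their original index labels, and a negative argument returns every row except the first n. The names last_two and all_but_first_two are illustrative only.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
})

# The last 2 rows keep their original index labels
last_two = df.tail(2)
print(last_two.index.tolist())             # [3, 4]

# Negative n: every row except the first 2
all_but_first_two = df.tail(-2)
print(all_but_first_two['Name'].tolist())  # ['Charlie', 'David', 'Eva']
```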
info() function
The info() function in pandas prints a concise summary of a DataFrame. The summary includes the index type and range, the column names with their data types, the number of non-null values in each column, and the approximate memory usage. This is useful for understanding the overall structure and data types of a DataFrame without printing the entire dataset.
Example
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [24, 27, 22, 32, 29]}
df = pd.DataFrame(data)
# Display information about the DataFrame
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Age     5 non-null      int64
dtypes: int64(1), object(1)
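memory usage: 208.0+ bytes
This output helps in getting a quick overview of the DataFrame, especially when dealing with large datasets. The + after the memory figure indicates a lower bound: for object columns, pandas does not count the memory of the underlying Python objects unless you request a deep measurement. The exact number can also vary with the pandas version and platform.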
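Because info() prints its summary rather than returning it, it can be handy to capture the text. The sketch below is one way to do that, assuming you want the summary as a string: it passes an in-memory buffer through the buf parameter and also sets memory_usage='deep' so that the contents of string columns are included in the memory estimate. The names buffer and summary_text are illustrative.

```python
import io

import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
})

# Write the summary into an in-memory text buffer instead of stdout
buffer = io.StringIO()
df.info(buf=buffer, memory_usage='deep')

# The captured summary is now an ordinary string
summary_text = buffer.getvalue()
print(summary_text)
```

The captured string can then be written to a log file or included in a report.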
describe() function
The describe() function in pandas provides a quick statistical summary of a DataFrame, which is valuable for understanding the distribution and spread of data. By default, describe() computes summary statistics for numerical columns only, but it can also summarize non-numeric columns, such as object (string) data, when you pass the include parameter, for example include='all'.
Here is an example of using the describe() function on the five-row df created in the earlier examples:
# Get a summary of statistics for the DataFrame
summary = df.describe()
print(summary)
Example Output:
             Age
count   5.000000
mean   26.800000
std     3.962323
min    22.000000
25%    24.000000
50%    27.000000
75%    29.000000
max    32.000000
Explanation of the Output:
count: The number of non-null entries.
mean: The average of the values.
std: The sample standard deviation (computed with n - 1 in the denominator), which measures the spread of the data.
min: The minimum value in the column.
25% (1st Quartile): The value below which 25% of the data fall.
50% (Median): The middle value of the dataset.
75% (3rd Quartile): The value below which 75% of the data fall.
max: The maximum value in the column.
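As a quick cross-check of the statistics listed above, the sketch below recomputes them for the Age values [24, 27, 22, 32, 29] using individual Series methods. It assumes the default settings that describe() relies on, namely a sample standard deviation (ddof=1) and linearly interpolated quantiles.

```python
import pandas as pd

# The Age column from the sample DataFrame
ages = pd.Series([24, 27, 22, 32, 29], name='Age')

print(ages.count())                               # 5
print(ages.mean())                                # 26.8
print(round(ages.std(), 6))                       # 3.962323 (ddof=1)
print(ages.min(), ages.max())                     # 22 32
print(ages.quantile([0.25, 0.5, 0.75]).tolist())  # [24.0, 27.0, 29.0]
```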
This statistical overview is helpful for identifying trends, spotting anomalies, and performing data cleaning tasks.
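describe() can also summarize non-numeric data. The sketch below, which reuses the same sample data, shows two options: calling describe() on a single string column, and passing include='all' to cover every column at once. For non-numeric columns the summary reports count, unique, top, and freq rather than mean and quantiles. The names name_summary and full_summary are illustrative.

```python
import pandas as pd

# Same sample data as in the earlier examples
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
})

# Summary of a single string column: count, unique, top, freq
name_summary = df['Name'].describe()
print(name_summary)

# Summary of all columns; statistics that do not apply to a column show NaN
full_summary = df.describe(include='all')
print(full_summary)
```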
drop() function
The drop() function in pandas is used to remove specified labels from rows or columns. This function can be applied to both DataFrames and Series and allows for alterations to be performed either in place or by returning a new object.
Usage
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Parameters
labels: The index or column labels to drop.
axis: The axis to drop labels from: 0 or 'index' for rows, 1 or 'columns' for columns. Default is 0.
index: Row labels to drop. Passing index=labels is equivalent to labels, axis=0.
columns: Column labels to drop. Passing columns=labels is equivalent to labels, axis=1.
level: The level from which to drop labels in a multi-index.
inplace: If True, modifies the DataFrame/Series in place.
errors: Specifies handling for errors: 'raise' (default) to throw errors, 'ignore' to suppress them.
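To make the parameters above more concrete, here is a minimal sketch using the index, columns, and errors keywords. The 'Salary' label is deliberately made up to show what errors='ignore' does with a label that does not exist; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['F', 'M', 'M'],
})

# columns= is equivalent to labels=..., axis=1
no_age = df.drop(columns=['Age'])

# index= is equivalent to labels=..., axis=0
only_bob = df.drop(index=[0, 2])

# errors='ignore' skips missing labels instead of raising KeyError
# ('Salary' is a hypothetical column that is not in df)
unchanged = df.drop(columns=['Salary'], errors='ignore')

print(no_age.columns.tolist())   # ['Name', 'Gender']
print(only_bob.index.tolist())   # [1]
print(unchanged.equals(df))      # True
```

With the default errors='raise', the same call with 'Salary' would raise a KeyError.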
Detailed Output
New DataFrame/Series: If inplace=False, a new DataFrame or Series is returned with the specified labels removed.
Modified In Place: If inplace=True, the original DataFrame or Series is modified directly, and no new object is returned.
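The difference between the two modes can be seen by inspecting what each call returns. This is a small sketch with illustrative variable names; the None return value for inplace=True reflects pandas 1.x and 2.x behavior.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Default (inplace=False): a new DataFrame is returned, df is untouched
result_copy = df.drop(columns='Age')
print(result_copy.columns.tolist())  # ['Name']
print(df.columns.tolist())           # ['Name', 'Age']

# inplace=True: df itself is modified
result_inplace = df.drop(columns='Age', inplace=True)
print(result_inplace)                # None in pandas 1.x and 2.x
print(df.columns.tolist())           # ['Name']
```

A common mistake is to write df = df.drop(..., inplace=True), which replaces df with the return value of the in-place call instead of the modified DataFrame.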
Example
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['F', 'M', 'M']
})
# Dropping a column
df_dropped = df.drop('Age', axis=1)
# Dropping a row
df_row_dropped = df.drop(1, axis=0)
# Dropping a column in place
df.drop('Gender', axis=1, inplace=True)
In this example, df_dropped has the 'Age' column removed, df_row_dropped has the row at index 1 removed, and df itself has the 'Gender' column removed directly with inplace=True.