3. DataFrame Basic Functions

This note covers five basic DataFrame functions: head(), tail(), info(), describe(), and drop().

head() function

The head() function in pandas is used to view the first few rows of a DataFrame. By default, head() returns the first 5 rows, but you can specify the number of rows you wish to see by passing an integer to the function. Here is an example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [24, 27, 22, 32, 29]}
df = pd.DataFrame(data)

# Display the first 3 rows of the DataFrame
print(df.head(3))

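This function is useful for quickly inspecting the start of a large dataset without printing the entire DataFrame. Note that head() only limits the rows that are returned; the DataFrame itself is already loaded in memory.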
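As a small sketch of how the argument to head() behaves, the snippet below uses a made-up eight-row DataFrame (the name `numbers` and its columns are assumptions for illustration only). It shows the default of five rows, a request for more rows than exist, and a negative argument, which returns every row except the last n.

```python
import pandas as pd

# Hypothetical 8-row DataFrame used only for this illustration
numbers = pd.DataFrame({'n': range(8), 'square': [i * i for i in range(8)]})

default_rows = numbers.head()          # no argument: first 5 rows
too_many = numbers.head(20)            # more rows than exist: all 8 rows come back
all_but_last_two = numbers.head(-2)    # negative n: every row except the last 2

print(len(default_rows), len(too_many), len(all_but_last_two))  # prints: 5 8 6
```

Asking for more rows than the DataFrame holds does not raise an error, so head(n) is safe to call on data of unknown length.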

tail() function

The tail() function in pandas is used to view the last few rows of a DataFrame. By default, tail() returns the last 5 rows, but you can specify a different number by passing an integer argument. Here is an example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [24, 27, 22, 32, 29]}
df = pd.DataFrame(data)

# Display the last 3 rows of the DataFrame
print(df.tail(3))

This function is useful for quickly examining the end of a dataset without printing the entire DataFrame. As with head(), the full DataFrame is already in memory; tail() only selects which rows are returned.
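Two details of tail() are easy to miss: the returned rows keep their original index labels, and a negative argument returns everything except the first n rows. A minimal sketch, using a hypothetical `letters` DataFrame that is not part of the examples above:

```python
import pandas as pd

# Hypothetical 7-row DataFrame used only for this illustration
letters = pd.DataFrame({'letter': list('abcdefg')})

last_two = letters.tail(2)             # last 2 rows
skip_first_three = letters.tail(-3)    # negative n: every row except the first 3

print(last_two.index.tolist())               # original labels are kept: [5, 6]
print(skip_first_three['letter'].tolist())   # ['d', 'e', 'f', 'g']
```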

info() function

The info() function in pandas prints a concise summary of a DataFrame. It includes the index type and range, the column names, the non-null count and dtype of each column, and the memory usage. This is useful for understanding the overall structure and data types of a DataFrame without printing the entire dataset.

Example

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [24, 27, 22, 32, 29]}
df = pd.DataFrame(data)

# Display information about the DataFrame
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes

This output helps in getting a quick overview of the DataFrame, especially when dealing with large datasets.
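The non-null counts become more informative once a DataFrame contains missing values. The sketch below uses a hypothetical `people` DataFrame with gaps in both columns. Because info() prints its report and returns None, the text is captured through the buf parameter; memory_usage='deep' asks pandas to measure the actual size of the stored strings rather than estimate it.

```python
import io
import pandas as pd

# Hypothetical DataFrame with one missing value in each column
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [24.0, None, 22.0, 32.0],
})

# info() writes to stdout by default; pass a buffer to capture the text instead
buffer = io.StringIO()
people.info(buf=buffer, memory_usage='deep')
report = buffer.getvalue()
print(report)

# The same non-null counts are available as a Series via count()
non_null = people.count()
print(non_null)
```

Comparing the non-null count of each column with the total number of entries is a quick way to see where data is missing.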

describe() function

The describe() function in pandas provides a quick statistical summary of a DataFrame, which is valuable for understanding the distribution and spread of data. By default, describe() computes summary statistics for the numerical columns only; non-numeric columns can be summarized by passing the include parameter, for example include='object' or include='all'.

Here is an example of using the describe() function:

# Get summary statistics for the DataFrame created in the info() example
summary = df.describe()
print(summary)

Example Output:

             Age
count   5.000000
mean   26.800000
std     3.962323
min    22.000000
25%    24.000000
50%    27.000000
75%    29.000000
max    32.000000

Explanation of the Output:

  • count: The number of non-null entries.

  • mean: The average of the values.

  • std: The sample standard deviation (computed with n - 1 in the denominator), which measures the spread of the data.

  • min: The minimum value in the column.

  • 25% (1st Quartile): The value below which 25% of the data fall.

  • 50% (Median): The middle value of the dataset.

  • 75% (3rd Quartile): The value below which 75% of the data fall.

  • max: The maximum value in the column.
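Each row of the describe() output corresponds to an ordinary Series method, which makes the definitions above easy to check. The following sketch recomputes every statistic for the Age values used in this note; the `ages` and `checks` names are introduced here for illustration.

```python
import pandas as pd

# The Age values from the sample DataFrame used in this note
ages = pd.Series([24, 27, 22, 32, 29], name='Age')
summary = ages.describe()

# Recompute each statistic with the matching Series method
checks = {
    'count': ages.count(),
    'mean': ages.mean(),
    'std': ages.std(),            # sample standard deviation (ddof=1)
    'min': ages.min(),
    '25%': ages.quantile(0.25),
    '50%': ages.median(),
    '75%': ages.quantile(0.75),
    'max': ages.max(),
}

for label, value in checks.items():
    print(f'{label:>5}: describe()={summary[label]:.6f}  direct={value:.6f}')
```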

This statistical overview is helpful for identifying trends, spotting anomalies, and performing data cleaning tasks.
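For non-numeric columns, describe() reports a different set of statistics: count, unique, top (the most frequent value), and freq (how often it occurs). A minimal sketch follows, assuming a hypothetical `staff` DataFrame; the Name column is given an explicit object dtype in case the installed pandas version infers a dedicated string dtype instead.

```python
import pandas as pd

# Hypothetical DataFrame with a repeated name; explicit object dtype is an
# assumption made so that include='object' selects the column
staff = pd.DataFrame({
    'Name': pd.Series(['Alice', 'Bob', 'Alice', 'Eva'], dtype='object'),
    'Age': [24, 27, 22, 29],
})

# Summarize only the object (text) columns
text_summary = staff.describe(include='object')
print(text_summary)

# Summarize every column; statistics that do not apply to a column are NaN
full_summary = staff.describe(include='all')
print(full_summary)
```

With include='all', numeric rows such as mean are NaN for the Name column, and text rows such as top are NaN for the Age column.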

drop() function

The drop() function in pandas removes the specified labels from rows or columns. It is available on both DataFrames and Series. By default it returns a new object and leaves the original unchanged; with inplace=True it modifies the original instead.

Usage

DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Parameters

  • labels: The index or column labels to drop.

  • axis: The axis to drop labels from: 0 or 'index' for rows (the default), 1 or 'columns' for columns.

  • index: Row labels to drop. Using index=labels is equivalent to labels, axis=0.

  • columns: Column labels to drop. Using columns=labels is equivalent to labels, axis=1.

  • level: The level from which to drop labels in a multi-index.

  • inplace: If True, modifies the DataFrame/Series in place and returns None. Default is False.

  • errors: 'raise' (default) raises a KeyError if any label is not found; 'ignore' suppresses the error and drops only the labels that exist.
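The index, columns, and errors parameters described above can be sketched as follows. The DataFrame mirrors the one used elsewhere in this section; the 'Salary' label is deliberately a column that does not exist.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['F', 'M', 'M'],
})

# columns= and index= avoid having to remember which axis number is which
no_age = df.drop(columns=['Age'])
only_bob = df.drop(index=[0, 2])

# Rows and columns can be dropped in a single call
trimmed = df.drop(index=[1], columns=['Gender'])

# errors='ignore' skips labels that do not exist
unchanged = df.drop(columns=['Salary'], errors='ignore')

# With the default errors='raise', a missing label raises KeyError
try:
    df.drop(columns=['Salary'])
except KeyError as exc:
    print('KeyError:', exc)
```

Using columns= or index= tends to read more clearly than labels with axis, because the call states directly what is being removed.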

Return Value

  • New DataFrame/Series: If inplace=False (the default), a new DataFrame or Series is returned with the specified labels removed; the original is left unchanged.

  • Modified In Place: If inplace=True, the original DataFrame or Series is modified directly and the method returns None.
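The difference between the two return behaviours is a common source of bugs, so a short sketch may help. The variable names here are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Default (inplace=False): a new DataFrame comes back, df is untouched
new_df = df.drop(columns='Age')
columns_after_copy = list(df.columns)      # still ['Name', 'Age']

# inplace=True: df itself changes and the call returns None
result = df.drop(columns='Age', inplace=True)
columns_after_inplace = list(df.columns)   # now ['Name']

print(columns_after_copy, columns_after_inplace, result)
```

Because the in-place call returns None, writing `df = df.drop(..., inplace=True)` replaces the DataFrame with None. Use either assignment or inplace=True, not both.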

Example

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['F', 'M', 'M']
})

# Dropping a column
df_dropped = df.drop('Age', axis=1)

# Dropping a row
df_row_dropped = df.drop(1, axis=0)

# Dropping a column in place
df.drop('Gender', axis=1, inplace=True)

In this example, df_dropped has the 'Age' column removed, df_row_dropped has the row at index 1 removed, and df itself has the 'Gender' column removed directly with inplace=True.
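The level parameter from the list above only matters when the index is a MultiIndex. A minimal sketch, assuming a hypothetical `scores` DataFrame with index levels named 'group' and 'item':

```python
import pandas as pd

# Hypothetical two-level index: (group, item)
index = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1)], names=['group', 'item'])
scores = pd.DataFrame({'score': [10, 20, 30]}, index=index)

# Drop every row whose 'group' level equals 'a'
without_group_a = scores.drop(index='a', level='group')

# Drop every row whose 'item' level equals 1
without_item_1 = scores.drop(index=1, level='item')

print(without_group_a)
print(without_item_1)
```

The level can be given by name, as here, or by position (0 for the outermost level).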
