
4.Missing Values In Pandas

How To Handle Missing Values In Pandas?

Missing values occur when no information is provided for a field; they are often referred to as NA (or NaN). These values can interfere with processing the data.

Imagine a survey in which a few people did not share their income: those blank entries are missing values. We need to handle them with the help of pandas so they do not become an obstacle in the analysis.

Let us start with our dataset, stored in a file named miss.csv.

In [1]:

import pandas as pd

df = pd.read_csv('miss.csv')
df
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
1    NaN     rahul        raj       pune    mh   NaN
2    3.0       NaN      kumar        NaN    up   NaN
3    4.0  dipanshu      gupta     mumbai    mh  34.0
4    5.0       NaN        NaN   gurugram    hr  29.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
6    7.0     priya      verma        NaN   NaN   NaN
7    NaN       NaN        NaN        NaN    jk   NaN
8    9.0      ravi     ranjan      kochi    kr  43.0
9    NaN       NaN        NaN        NaN   NaN   NaN
10  11.0      atul      verma  rishikesh    uk  29.0

As you can see, there are many missing values (NaN). Before dealing with these null values, let's count them all in one go.

Count Null and Non-Null Values:

In [2]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      8 non-null      float64
 1   fname   7 non-null      object 
 2   lname   8 non-null      object 
 3   city    7 non-null      object 
 4   state   9 non-null      object 
 5   age     6 non-null      float64
dtypes: float64(2), object(4)
memory usage: 656.0+ bytes

isnull

The isnull() function returns True for null values and False for non-null values.

In [3]:

df.isnull()
       id  fname  lname   city  state    age
0   False  False  False  False  False  False
1    True  False  False  False  False   True
2   False   True  False   True  False   True
3   False  False  False  False  False  False
4   False   True   True  False  False  False
5   False  False  False  False  False  False
6   False  False  False   True   True   True
7    True   True   True   True  False   True
8   False  False  False  False  False  False
9    True   True   True   True   True   True
10  False  False  False  False  False  False

You can also chain the sum() function to count the null values in each column.

In [4]:

df.isnull().sum()

Out[4]:

id       3
fname    4
lname    3
city     4
state    2
age      5
dtype: int64
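
If you want a single overall figure or per-column percentages rather than raw counts, the same idea extends naturally (a small sketch, not part of the original notebook):

df.isnull().sum().sum()      # total number of missing cells in the whole DataFrame
df.isnull().mean() * 100     # share of missing values per column, as a percentage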

notnull

The notnull() function returns False for null values and True for non-null values.

In [5]:

df.notnull()
       id  fname  lname   city  state    age
0    True   True   True   True   True   True
1   False   True   True   True   True  False
2    True  False   True  False   True  False
3    True   True   True   True   True   True
4    True  False  False   True   True   True
5    True   True   True   True   True   True
6    True   True   True  False  False  False
7   False  False  False  False   True  False
8    True   True   True   True   True   True
9   False  False  False  False  False  False
10   True   True   True   True   True   True

You can also chain the sum() function to count the non-null values in each column.

In [6]:

df.notnull().sum()

id       8
fname    7
lname    8
city     7
state    9
age      6
dtype: int64

Delete All Rows with Any Missing Value:

dropna:

The dropna() function deletes rows that contain missing values.

how: this parameter controls which rows get deleted. It accepts two values.

how='any': the default. It drops every row that contains at least one null value.

how='all': it drops only the rows in which every value is null.

In [7]:

df.dropna()
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
3    4.0  dipanshu      gupta     mumbai    mh  34.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
8    9.0      ravi     ranjan      kochi    kr  43.0
10  11.0      atul      verma  rishikesh    uk  29.0

OR

df.dropna(how='any')
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
3    4.0  dipanshu      gupta     mumbai    mh  34.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
8    9.0      ravi     ranjan      kochi    kr  43.0
10  11.0      atul      verma  rishikesh    uk  29.0

df.dropna(how='all')
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
1    NaN     rahul        raj       pune    mh   NaN
2    3.0       NaN      kumar        NaN    up   NaN
3    4.0  dipanshu      gupta     mumbai    mh  34.0
4    5.0       NaN        NaN   gurugram    hr  29.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
6    7.0     priya      verma        NaN   NaN   NaN
7    NaN       NaN        NaN        NaN    jk   NaN
8    9.0      ravi     ranjan      kochi    kr  43.0
10  11.0      atul      verma  rishikesh    uk  29.0
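
Note that dropna() returns a new DataFrame and leaves df unchanged, which is why the later examples still see all the original rows. To keep the cleaned result you would assign it back (a minimal sketch):

df_clean = df.dropna(how='any')       # keep the cleaned copy under a new name
# or modify df itself:
# df.dropna(how='any', inplace=True)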

thresh:

thresh (short for threshold) takes an integer: a row is kept only if it has at least that many non-null values. With thresh=4, any row with fewer than 4 non-null values is dropped.

In [10]:

df.dropna(how='any',thresh=4)
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
1    NaN     rahul        raj       pune    mh   NaN
3    4.0  dipanshu      gupta     mumbai    mh  34.0
4    5.0       NaN        NaN   gurugram    hr  29.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
8    9.0      ravi     ranjan      kochi    kr  43.0
10  11.0      atul      verma  rishikesh    uk  29.0

subset

subset lists the columns in which to look for missing values; values in the other columns are ignored. In this example, with how='any', a row is removed if either city or state is NA.

In [11]:

df.dropna(how='any',subset=['city','state'])
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
1    NaN     rahul        raj       pune    mh   NaN
3    4.0  dipanshu      gupta     mumbai    mh  34.0
4    5.0       NaN        NaN   gurugram    hr  29.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
8    9.0      ravi     ranjan      kochi    kr  43.0
10  11.0      atul      verma  rishikesh    uk  29.0

df.dropna(how='all',subset=['city','state'])
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
1    NaN     rahul        raj       pune    mh   NaN
2    3.0       NaN      kumar        NaN    up   NaN
3    4.0  dipanshu      gupta     mumbai    mh  34.0
4    5.0       NaN        NaN   gurugram    hr  29.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
7    NaN       NaN        NaN        NaN    jk   NaN
8    9.0      ravi     ranjan      kochi    kr  43.0
10  11.0      atul      verma  rishikesh    uk  29.0

Say you want to keep only the rows that have at least two non-null values among city, state and age.

df.dropna(how='all',subset=['city','state','age'],thresh=2)
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
1    NaN     rahul        raj       pune    mh   NaN
3    4.0  dipanshu      gupta     mumbai    mh  34.0
4    5.0       NaN        NaN   gurugram    hr  29.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
8    9.0      ravi     ranjan      kochi    kr  43.0
10  11.0      atul      verma  rishikesh    uk  29.0

What if the missing entries are not NaN values at all, but placeholders such as '-' and blank spaces?

In [14]:

ef = pd.read_csv('miss2.csv')
ef
      id     fname      lname       city state  age
0    1.0     nihal    jaiswal      noida    up    -
1    NaN     rahul        raj       pune    mh  NaN
2    3.0       NaN      kumar        NaN    up  NaN
3    4.0  dipanshu      gupta     mumbai    mh    -
4    5.0       NaN        NaN   gurugram    hr   29
5    6.0         -  agnihotri   amritsar    pb   40
6    7.0     priya      verma        NaN   NaN
7    NaN       NaN        NaN        NaN    jk  NaN
8    9.0      ravi     ranjan                kr   43
9    NaN       NaN        NaN        NaN   NaN  NaN
10  11.0      atul      verma  rishikesh    uk   29

We can use NumPy's NaN to replace these useless placeholder values ('-' and the blank space cells) with proper missing values.

In [15]:

import pandas as pd
import numpy as np

ef = pd.read_csv('miss2.csv')
ef = ef.replace('-', np.nan).replace(' ', np.nan)
ef
      id     fname      lname       city state  age
0    1.0     nihal    jaiswal      noida    up  NaN
1    NaN     rahul        raj       pune    mh  NaN
2    3.0       NaN      kumar        NaN    up  NaN
3    4.0  dipanshu      gupta     mumbai    mh  NaN
4    5.0       NaN        NaN   gurugram    hr   29
5    6.0       NaN  agnihotri   amritsar    pb   40
6    7.0     priya      verma        NaN   NaN  NaN
7    NaN       NaN        NaN        NaN    jk  NaN
8    9.0      ravi     ranjan        NaN    kr   43
9    NaN       NaN        NaN        NaN   NaN  NaN
10  11.0      atul      verma  rishikesh    uk   29

ef.dropna()
      id fname  lname       city state age
10  11.0  atul  verma  rishikesh    uk  29
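
As an alternative to replacing placeholders after loading, read_csv can treat them as missing while reading the file, via its na_values parameter (a hedged sketch; depending on how the blanks are stored you may need extra placeholders or skipinitialspace=True):

ef = pd.read_csv('miss2.csv', na_values=['-', ' '])   # '-' and single spaces become NaN at load time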

fillna():

The fillna() method replaces all NaN values with something else. For example:

In [17]:

df.fillna(0)
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
1    0.0     rahul        raj       pune    mh   0.0
2    3.0         0      kumar          0    up   0.0
3    4.0  dipanshu      gupta     mumbai    mh  34.0
4    5.0         0          0   gurugram    hr  29.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
6    7.0     priya      verma          0     0   0.0
7    0.0         0          0          0    jk   0.0
8    9.0      ravi     ranjan      kochi    kr  43.0
9    0.0         0          0          0     0   0.0
10  11.0      atul      verma  rishikesh    uk  29.0

But 0 is not a sensible value for columns like city or state, so you can pass a dictionary to fill each column according to its data type.

df.fillna(
    {
        'id':0,
        'age':0,
        'fname':"no name",
        'lname':"no name",
        'city':"no city",
        'state':"no state"
    }
)
      id     fname      lname       city     state   age
0    1.0     nihal    jaiswal      noida        up  26.0
1    0.0     rahul        raj       pune        mh   0.0
2    3.0   no name      kumar    no city        up   0.0
3    4.0  dipanshu      gupta     mumbai        mh  34.0
4    5.0   no name    no name   gurugram        hr  29.0
5    6.0      alok  agnihotri   amritsar        pb  40.0
6    7.0     priya      verma    no city  no state   0.0
7    0.0   no name    no name    no city        jk   0.0
8    9.0      ravi     ranjan      kochi        kr  43.0
9    0.0   no name    no name    no city  no state   0.0
10  11.0      atul      verma  rishikesh        uk  29.0
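
A common variant (not shown in the original output) is to fill a numeric column with a statistic such as its mean or median instead of a fixed constant:

df.fillna({'age': df['age'].mean(), 'id': 0})   # age gets the column mean, id gets 0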

ffill and bfill

You can use ffill (forward fill) and bfill (backward fill) to carry the previous or next valid value into the NaN cells.

In [25]:

df.fillna(method='ffill')
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
1    1.0     rahul        raj       pune    mh  26.0
2    3.0     rahul      kumar       pune    up  26.0
3    4.0  dipanshu      gupta     mumbai    mh  34.0
4    5.0  dipanshu      gupta   gurugram    hr  29.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
6    7.0     priya      verma   amritsar    pb  40.0
7    7.0     priya      verma   amritsar    jk  40.0
8    9.0      ravi     ranjan      kochi    kr  43.0
9    9.0      ravi     ranjan      kochi    kr  43.0
10  11.0      atul      verma  rishikesh    uk  29.0

df.fillna(method='bfill')
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
1    3.0     rahul        raj       pune    mh  34.0
2    3.0  dipanshu      kumar     mumbai    up  34.0
3    4.0  dipanshu      gupta     mumbai    mh  34.0
4    5.0      alok  agnihotri   gurugram    hr  29.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
6    7.0     priya      verma      kochi    jk  43.0
7    9.0      ravi     ranjan      kochi    jk  43.0
8    9.0      ravi     ranjan      kochi    kr  43.0
9   11.0      atul      verma  rishikesh    uk  29.0
10  11.0      atul      verma  rishikesh    uk  29.0

You can also fill values horizontally (across columns) with the help of the axis parameter.

In [31]:

df.fillna(method='ffill',axis='columns')
      id     fname      lname       city  state    age
0    1.0     nihal    jaiswal      noida     up   26.0
1    NaN     rahul        raj       pune     mh     mh
2    3.0       3.0      kumar      kumar     up     up
3    4.0  dipanshu      gupta     mumbai     mh   34.0
4    5.0       5.0        5.0   gurugram     hr   29.0
5    6.0      alok  agnihotri   amritsar     pb   40.0
6    7.0     priya      verma      verma  verma  verma
7    NaN       NaN        NaN        NaN     jk     jk
8    9.0      ravi     ranjan      kochi     kr   43.0
9    NaN       NaN        NaN        NaN    NaN    NaN
10  11.0      atul      verma  rishikesh     uk   29.0

You can also limit how many consecutive NaN values are filled by using the limit parameter.

In [32]:

df.fillna(method='ffill',limit=1)
      id     fname      lname       city state   age
0    1.0     nihal    jaiswal      noida    up  26.0
1    1.0     rahul        raj       pune    mh  26.0
2    3.0     rahul      kumar       pune    up   NaN
3    4.0  dipanshu      gupta     mumbai    mh  34.0
4    5.0  dipanshu      gupta   gurugram    hr  29.0
5    6.0      alok  agnihotri   amritsar    pb  40.0
6    7.0     priya      verma   amritsar    pb  40.0
7    7.0     priya      verma        NaN    jk   NaN
8    9.0      ravi     ranjan      kochi    kr  43.0
9    9.0      ravi     ranjan      kochi    kr  43.0
10  11.0      atul      verma  rishikesh    uk  29.0
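
A version-dependent note: in recent pandas releases the method= argument of fillna() is deprecated, and the dedicated methods are the equivalent spelling (a small sketch, depending on your pandas version):

df.ffill()           # same as df.fillna(method='ffill')
df.bfill(limit=1)    # same as df.fillna(method='bfill', limit=1)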

Interpolation - Linear Interpolation

Interpolation is a technique used to estimate unknown data points between two known data points. It is mostly used to impute missing values in a DataFrame or Series while preprocessing data. As the output below shows, only the numeric columns (id and age) are interpolated; the text columns keep their NaN values.

In [35]:

df.interpolate()
      id     fname      lname       city state        age
0    1.0     nihal    jaiswal      noida    up  26.000000
1    2.0     rahul        raj       pune    mh  28.666667
2    3.0       NaN      kumar        NaN    up  31.333333
3    4.0  dipanshu      gupta     mumbai    mh  34.000000
4    5.0       NaN        NaN   gurugram    hr  29.000000
5    6.0      alok  agnihotri   amritsar    pb  40.000000
6    7.0     priya      verma        NaN   NaN  41.000000
7    8.0       NaN        NaN        NaN    jk  42.000000
8    9.0      ravi     ranjan      kochi    kr  43.000000
9   10.0       NaN        NaN        NaN   NaN  36.000000
10  11.0      atul      verma  rishikesh    uk  29.000000
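
To see where the interpolated ages come from: linear interpolation spaces the missing values evenly between the nearest known values. For example, age is known at index 0 (26.0) and index 3 (34.0), so the two missing ages in between are filled as follows (a worked check of the output above):

step = (34.0 - 26.0) / 3     # distance per row = 2.666...
age_1 = 26.0 + 1 * step      # 28.666667, the value shown at index 1
age_2 = 26.0 + 2 * step      # 31.333333, the value shown at index 2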
