8. Validation In Pandas
You must have created a password at some point and had a pop-up appear telling you that your password does not match the site's criteria. That is a validation. Validations are helpful in many cases: sometimes we need to apply validation rules to our data and eliminate the records that do not meet them. In a nutshell, in this chapter we are going to cover applying validations to our datasets.

import pandas as pd
data = pd.read_csv('mydata.csv')
data
   S.No    FS0 Number    First Name Last Name  Store Id Circle    Code
0     1  733884567$88   SATHISHRAAJ        SM    -13047     RI  -13047
1     2    8896763523          G.K.     Gupta     12087     UP   34567
2     7    7338848678         nihal        SM    -13047     KE  -13047
3     8    8890763523   rahul88jain     Gupta     12087     JK   34567
4     9    8890763853  yogesh.@jain     Gupta     12087     PL   34567
5     3    6278032123     Ravi.raj1    Kumar$     11098     BH   23456
6     4   6278.098123         Rlavi     Kumar     11098     MP   23456
7     5    8890763599         rahul     Gupta     12087     IP   34567
8     5    8890763523        rahulk     Gupta     12087     GH   34567
There are lots of things wrong with this data, and we need to fix them. Let's solve each problem step by step.
Removing Duplicate Data:
As you can see, there are a few duplicate rows that we need to remove. There is no validation involved here, but let's solve it first anyway.
We can use drop_duplicates() here to remove the duplicate rows.
In [2]:
data = data.drop_duplicates(subset='S.No')
data
   S.No    FS0 Number    First Name Last Name  Store Id Circle    Code
0     1  733884567$88   SATHISHRAAJ        SM    -13047     RI  -13047
1     2    8896763523          G.K.     Gupta     12087     UP   34567
2     7    7338848678         nihal        SM    -13047     KE  -13047
3     8    8890763523   rahul88jain     Gupta     12087     JK   34567
4     9    8890763853  yogesh.@jain     Gupta     12087     PL   34567
5     3    6278032123     Ravi.raj1    Kumar$     11098     BH   23456
6     4   6278.098123         Rlavi     Kumar     11098     MP   23456
7     5    8890763599         rahul     Gupta     12087     IP   34567
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 0 to 7
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   S.No        8 non-null      int64
 1   FS0 Number  8 non-null      object
 2   First Name  8 non-null      object
 3   Last Name   8 non-null      object
 4   Store Id    8 non-null      int64
 5   Circle      8 non-null      object
 6   Code        8 non-null      int64
dtypes: int64(3), object(4)
memory usage: 512.0+ bytes
As you can see, most of the columns are objects, so we can apply the string methods that we have learned in Python. Pandas offers numerous string methods on Series (through the .str accessor) that we can use to perform validation.
You can refer to this page while applying validation: Click Here
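For a quick feel of the .str accessor, here are a few of the methods used later in this chapter (a minimal sketch, assuming the data DataFrame loaded above):
# a few Series string methods exposed through the .str accessor
data['First Name'].str.lower()     # lowercase every name
data['First Name'].str.strip()     # remove leading and trailing spaces
data['First Name'].str.isalpha()   # True where the value is purely alphabetic
data['FS0 Number'].str.isdigit()   # True where the value is purely numeric
data['FS0 Number'].str.len()       # number of characters in each value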
Validation In Phone Number:
Validation 1: It should be numeric
Validation 2: It should have 10 digits
Validation 1: It should be numeric
To check whether a phone number contains only digits, we will use the Series.str.isdigit() method and store its result in a new column.
In [4]:
data['only digit'] = data['FS0 Number'].str.isdigit()
data[['FS0 Number','only digit']]
C:\Users\abhis\AppData\Local\Temp/ipykernel_7664/1497754001.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['only digit'] = data['FS0 Number'].str.isdigit()
     FS0 Number  only digit
0  733884567$88       False
1    8896763523        True
2    7338848678        True
3    8890763523        True
4    8890763853        True
5    6278032123        True
6   6278.098123       False
7    8890763599        True
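A quick note on the SettingWithCopyWarning shown above: pandas is warning that data may be a copy of a slice of the original DataFrame, so the column assignment might not land where we expect. One common way to avoid it (a sketch, not the only possible fix) is to take an explicit copy after dropping the duplicates:
data = data.drop_duplicates(subset='S.No').copy()   # explicit copy, so later column assignments are safe
data['only digit'] = data['FS0 Number'].str.isdigit()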
Validation 2: It should have 10 digits
To check whether a phone number is exactly 10 characters long, we will use the Series.str.len() method and store its result in a new column.
In [5]:
data['ten digit'] = data['FS0 Number'].str.len()
data[['FS0 Number','ten digit']]
C:\Users\abhis\AppData\Local\Temp/ipykernel_7664/3592959975.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['ten digit'] = data['FS0 Number'].str.len()
     FS0 Number  ten digit
0  733884567$88         12
1    8896763523         10
2    7338848678         10
3    8890763523         10
4    8890763853         10
5    6278032123         10
6   6278.098123         11
7    8890763599         10
Now that we have enough information about FS0 Number, we can combine these conditions and keep only the rows that satisfy both. Something like this:
In [6]:
data[['FS0 Number','only digit','ten digit']]
     FS0 Number  only digit  ten digit
0  733884567$88       False         12
1    8896763523        True         10
2    7338848678        True         10
3    8890763523        True         10
4    8890763853        True         10
5    6278032123        True         10
6   6278.098123       False         11
7    8890763599        True         10
data = data.loc[(data['only digit']==True) & (data['ten digit']==10)]
data
   S.No  FS0 Number    First Name Last Name  Store Id Circle    Code  only digit  ten digit
1     2  8896763523          G.K.     Gupta     12087     UP   34567        True         10
2     7  7338848678         nihal        SM    -13047     KE  -13047        True         10
3     8  8890763523   rahul88jain     Gupta     12087     JK   34567        True         10
4     9  8890763853  yogesh.@jain     Gupta     12087     PL   34567        True         10
5     3  6278032123     Ravi.raj1    Kumar$     11098     BH   23456        True         10
7     5  8890763599         rahul     Gupta     12087     IP   34567        True         10
Now we can delete the temporary columns, as we no longer need them.
In [8]:
data = data.drop(columns = ['only digit','ten digit'])
data
   S.No  FS0 Number    First Name Last Name  Store Id Circle    Code
1     2  8896763523          G.K.     Gupta     12087     UP   34567
2     7  7338848678         nihal        SM    -13047     KE  -13047
3     8  8890763523   rahul88jain     Gupta     12087     JK   34567
4     9  8890763853  yogesh.@jain     Gupta     12087     PL   34567
5     3  6278032123     Ravi.raj1    Kumar$     11098     BH   23456
7     5  8890763599         rahul     Gupta     12087     IP   34567
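As a side note, both phone-number checks could also be expressed as a single regular-expression test instead of the two helper columns used above. A sketch using Series.str.fullmatch() (available in pandas 1.1 and later):
# keep only values that are exactly ten digits and nothing else
valid_phone = data['FS0 Number'].str.fullmatch(r'\d{10}')
data = data.loc[valid_phone]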
A Useful Example:
Suppose we want to keep all the phone numbers and simply strip out special characters or any other non-numeric values; for that we can use the replace method of Series.
In [9]:
import pandas as pd
df = pd.read_csv('mydata.csv')
df['FS0 Number'] = df['FS0 Number'].str.replace('[^0-9]','')
df
C:\Users\abhis\AppData\Local\Temp/ipykernel_7664/3965905300.py:4: FutureWarning: The default value of regex will change from True to False in a future version.
df['FS0 Number'] = df['FS0 Number'].str.replace('[^0-9]','')
   S.No   FS0 Number    First Name Last Name  Store Id Circle    Code
0     1  73388456788   SATHISHRAAJ        SM    -13047     RI  -13047
1     2   8896763523          G.K.     Gupta     12087     UP   34567
2     7   7338848678         nihal        SM    -13047     KE  -13047
3     8   8890763523   rahul88jain     Gupta     12087     JK   34567
4     9   8890763853  yogesh.@jain     Gupta     12087     PL   34567
5     3   6278032123     Ravi.raj1    Kumar$     11098     BH   23456
6     4   6278098123         Rlavi     Kumar     11098     MP   23456
7     5   8890763599         rahul     Gupta     12087     IP   34567
8     5   8890763523        rahulk     Gupta     12087     GH   34567
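The FutureWarning above appears because the default value of the regex argument of str.replace() is changing; passing it explicitly keeps the behaviour stable and silences the warning (a minor tweak to the same call):
df['FS0 Number'] = df['FS0 Number'].str.replace('[^0-9]', '', regex=True)   # treat the pattern as a regex explicitly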
Validation In First Name:
Validation 1: It should contain only alphabets (dots are allowed)
We also need to remove extra spaces.
The first thing we need to do is remove useless spaces from the names, if there are any. By useless spaces we mean leading and trailing spaces. We can do that with the strip() function.
In [10]:
data['First Name'] = data['First Name'].str.strip()
data
   S.No  FS0 Number    First Name Last Name  Store Id Circle    Code
1     2  8896763523          G.K.     Gupta     12087     UP   34567
2     7  7338848678         nihal        SM    -13047     KE  -13047
3     8  8890763523   rahul88jain     Gupta     12087     JK   34567
4     9  8890763853  yogesh.@jain     Gupta     12087     PL   34567
5     3  6278032123     Ravi.raj1    Kumar$     11098     BH   23456
7     5  8890763599         rahul     Gupta     12087     IP   34567
The second thing is to check whether the names are alphabetic, so we need to create a column that tells us whether each name contains only alphabets.
In [11]:
data['check_alpha'] = data['First Name'].str.isalpha()
data
   S.No  FS0 Number    First Name Last Name  Store Id Circle    Code  check_alpha
1     2  8896763523          G.K.     Gupta     12087     UP   34567        False
2     7  7338848678         nihal        SM    -13047     KE  -13047         True
3     8  8890763523   rahul88jain     Gupta     12087     JK   34567        False
4     9  8890763853  yogesh.@jain     Gupta     12087     PL   34567        False
5     3  6278032123     Ravi.raj1    Kumar$     11098     BH   23456        False
7     5  8890763599         rahul     Gupta     12087     IP   34567         True
Now, as you can see, the first name G.K. should pass, but the isalpha() method will not allow it because of the dots, so we need to find a workaround.
Why don't we remove the dots? If we remove them and the name then passes isalpha(), it will not be dropped.
While replacing, do remember to make a copy of the column and make all changes in the copy, so that we do not change the original data.
In [12]:
data['temp_name'] = data['First Name']
In [13]:
data['temp_name'] = data['First Name'].str.replace('.','')
data
C:\Users\abhis\AppData\Local\Temp/ipykernel_7664/4143232602.py:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
data['temp_name'] = data['First Name'].str.replace('.','')
   S.No  FS0 Number    First Name Last Name  Store Id Circle    Code  check_alpha    temp_name
1     2  8896763523          G.K.     Gupta     12087     UP   34567        False           GK
2     7  7338848678         nihal        SM    -13047     KE  -13047         True        nihal
3     8  8890763523   rahul88jain     Gupta     12087     JK   34567        False  rahul88jain
4     9  8890763853  yogesh.@jain     Gupta     12087     PL   34567        False  yogesh@jain
5     3  6278032123     Ravi.raj1    Kumar$     11098     BH   23456        False     Raviraj1
7     5  8890763599         rahul     Gupta     12087     IP   34567         True        rahul
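The FutureWarning here is worth heeding: in newer pandas versions a single-character pattern such as '.' is treated as a regular expression, where a dot matches any character and would wipe out the whole name. Passing regex=False makes the literal-dot intent explicit (a safer version of the same call):
data['temp_name'] = data['First Name'].str.replace('.', '', regex=False)   # remove literal dots only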
data['check_alpha'] = data['temp_name'].str.isalpha()
data
   S.No  FS0 Number    First Name Last Name  Store Id Circle    Code  check_alpha    temp_name
1     2  8896763523          G.K.     Gupta     12087     UP   34567         True           GK
2     7  7338848678         nihal        SM    -13047     KE  -13047         True        nihal
3     8  8890763523   rahul88jain     Gupta     12087     JK   34567        False  rahul88jain
4     9  8890763853  yogesh.@jain     Gupta     12087     PL   34567        False  yogesh@jain
5     3  6278032123     Ravi.raj1    Kumar$     11098     BH   23456        False     Raviraj1
7     5  8890763599         rahul     Gupta     12087     IP   34567         True        rahul
Now we need to remove the rows whose names do not meet the validation criteria.
In [15]:
data = data.loc[data['check_alpha']==True]
data
   S.No  FS0 Number First Name Last Name  Store Id Circle    Code  check_alpha temp_name
1     2  8896763523       G.K.     Gupta     12087     UP   34567         True        GK
2     7  7338848678      nihal        SM    -13047     KE  -13047         True     nihal
7     5  8890763599      rahul     Gupta     12087     IP   34567         True     rahul
We no longer need check_alpha and temp_name, so let's remove them.
In [16]:
data = data.drop(columns=['check_alpha','temp_name'])
data
   S.No  FS0 Number First Name Last Name  Store Id Circle    Code
1     2  8896763523       G.K.     Gupta     12087     UP   34567
2     7  7338848678      nihal        SM    -13047     KE  -13047
7     5  8890763599      rahul     Gupta     12087     IP   34567
We can also use the title() method to capitalize the first letter of each name.
In [17]:
data['First Name'] = data['First Name'].str.title()
data
   S.No  FS0 Number First Name Last Name  Store Id Circle    Code
1     2  8896763523       G.K.     Gupta     12087     UP   34567
2     7  7338848678      Nihal        SM    -13047     KE  -13047
7     5  8890763599      Rahul     Gupta     12087     IP   34567
Validation In Circle:
Our company already has a list of valid circles; if a row's circle is not present in our list, we should remove it.
In [18]:
c = ['UP','HP','RE','KP','KE','HH','TR']
In [19]:
data = data.loc[data['Circle'].isin(c)]
data
   S.No  FS0 Number First Name Last Name  Store Id Circle    Code
1     2  8896763523       G.K.     Gupta     12087     UP   34567
2     7  7338848678      Nihal        SM    -13047     KE  -13047
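If we want to inspect the rows being rejected before actually dropping them, we can negate the isin() mask with ~. This would be run before the filtering step above, for example:
rejected = data.loc[~data['Circle'].isin(c)]   # rows whose circle is not in the approved list
rejected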
Let us say we have city-wise sales data like the one below:
In [30]:
import pandas as pd
ef = pd.read_csv('city.csv')
print(ef)
state city total_sales
0 delhi Gharoli 34533.200
1 delhi delhi-central 37373.373
2 delhi Gokalpur 8322.430
3 delhi delh 976858.430
4 delhi delh 48483.230
5 delhi delhi-110034 5757.670
6 delhi Jaitpur 89786.120
7 delhi new delh 574.340
8 delhi del 575.120
9 delhi Mukandpur 56744.120
10 andhra pradesh Adoni 300.560
11 andhra pradesh Amaravati 400.780
12 andhra pradesh Anantapur 800.980
13 andhra pradesh Chandragiri 120.450
14 andhra pradesh Chittoor 345.230
15 andhra pradesh Dowlaiswaram 1234.670
16 andhra pradesh Eluru 75464.230
17 andhra pradesh Guntur 232.400
18 andhra pradesh Kadapa 343.100
19 andhra pradesh Kakinada 3432.000
Can you spot any problem in the data?
You can see that several of the city entries under delhi are not right; there are many irregularities.
In [28]:
sf = ef.loc[ef['state']=='delhi']
sf
    state               city  total_sales
10  delhi            Gharoli    34533.200
11  delhi      delhi-central    37373.373
12  delhi           Gokalpur     8322.430
13  delhi               delh   976858.430
14  delhi               delh    48483.230
15  delhi       delhi-110034     5757.670
16  delhi            Jaitpur    89786.120
17  delhi           new delh      574.340
18  delhi                del      575.120
19  delhi          Mukandpur    56744.120
20  delhi             Mundka     4854.230
21  delhi            Mitraon      574.320
22  delhi            Nilothi     8766.340
23  delhi        Nangloi Jat     5868.120
24  delhi            Nithari     5855.450
25  delhi          Neb Sarai     5858.450
26  delhi   Nangli Sakrawati      575.450
27  delhi        Pooth Kalan   797976.230
28  delhi        Pooth Khurd       44.340
29  delhi         Pul Pehlad      123.000
30  delhi  Pehlad Pur Bangar      455.120
31  delhi            Qadipur      484.230
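One possible way to flag these irregular entries, reusing the isin() approach from the Circle validation above, is to check the city column against a reference list of valid localities. The list below is only a small illustrative sample, not an official one:
# hypothetical reference list of valid Delhi localities (illustrative only)
valid_cities = ['Gharoli', 'Gokalpur', 'Jaitpur', 'Mukandpur', 'Mundka', 'Mitraon',
                'Nilothi', 'Nangloi Jat', 'Nithari', 'Neb Sarai', 'Nangli Sakrawati',
                'Pooth Kalan', 'Pooth Khurd', 'Pul Pehlad', 'Pehlad Pur Bangar', 'Qadipur']
bad_rows = sf.loc[~sf['city'].isin(valid_cities)]   # entries such as 'delh', 'del', 'delhi-110034'
print(bad_rows)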