8.Validation In Pandas
Last updated
Last updated
You must have once created a password and suddenly a pop up happens. Telling you , your password doesn not match our criterias. These are validations. Validations are really helpful in many cases. Sometimes it may happen, that we need to apply validation scripts on our data , eliminate data not matching these criterias , in a nutshell , We are going to cover applying Validation on our Datasets.
0
1
733884567$88
SATHISHRAAJ
SM
-13047
RI
-13047
1
2
8896763523
G.K.
Gupta
12087
UP
34567
2
7
7338848678
nihal
SM
-13047
KE
-13047
3
8
8890763523
rahul88jain
Gupta
12087
JK
34567
4
9
8890763853
yogesh.@jain
Gupta
12087
PL
34567
5
3
6278032123
Ravi.raj1
Kumar$
11098
BH
23456
6
4
6278.098123
Rlavi
Kumar
11098
MP
23456
7
5
8890763599
rahul
Gupta
12087
IP
34567
8
5
8890763523
rahulk
Gupta
12087
GH
34567
So there are lots of things that is wrong with data and we need to fix it. Let's solve each problem step by step.
Removing Duplicate Data:
As you can see there are a few duplicate data that we need to remove. Now there is no validation here, but let's first solve it anyway.
We can use drop duplicates here to remove duplicate data.
In [2]:
0
1
733884567$88
SATHISHRAAJ
SM
-13047
RI
-13047
1
2
8896763523
G.K.
Gupta
12087
UP
34567
2
7
7338848678
nihal
SM
-13047
KE
-13047
3
8
8890763523
rahul88jain
Gupta
12087
JK
34567
4
9
8890763853
yogesh.@jain
Gupta
12087
PL
34567
5
3
6278032123
Ravi.raj1
Kumar$
11098
BH
23456
6
4
6278.098123
Rlavi
Kumar
11098
MP
23456
7
5
8890763599
rahul
Gupta
12087
IP
34567
As you can see most of them are objects, so we can apply string methods that we have learned in python. Pandas offer numerous string methods in series that we can use to perform Validation.
Validation1 : It should be numeric Validation2 : It should have 10 digits
Validation1 : It should be numeric
So, to check if our phone number has only digits in it , we will use series.str.isdigit() method and store its result in a new column.
In [4]:
0
733884567$88
False
1
8896763523
True
2
7338848678
True
3
8890763523
True
4
8890763853
True
5
6278032123
True
6
6278.098123
False
7
8890763599
True
Validation2 : It should have 10 digits
So, to check if our phone number has only 10 digits in it , we will use series.str.len() method and store its result in a new column.
In [5]:
0
733884567$88
12
1
8896763523
10
2
7338848678
10
3
8890763523
10
4
8890763853
10
5
6278032123
10
6
6278.098123
11
7
8890763599
10
Now that we have enough information about FS0 Number, we can perform multiple conditions and filter right data. Something like this:
In [6]:
0
733884567$88
False
12
1
8896763523
True
10
2
7338848678
True
10
3
8890763523
True
10
4
8890763853
True
10
5
6278032123
True
10
6
6278.098123
False
11
7
8890763599
True
10
1
2
8896763523
G.K.
Gupta
12087
UP
34567
True
10
2
7
7338848678
nihal
SM
-13047
KE
-13047
True
10
3
8
8890763523
rahul88jain
Gupta
12087
JK
34567
True
10
4
9
8890763853
yogesh.@jain
Gupta
12087
PL
34567
True
10
5
3
6278032123
Ravi.raj1
Kumar$
11098
BH
23456
True
10
7
5
8890763599
rahul
Gupta
12087
IP
34567
True
10
Now we can delete temporary columns , as now we do not need them.
In [8]:
1
2
8896763523
G.K.
Gupta
12087
UP
34567
2
7
7338848678
nihal
SM
-13047
KE
-13047
3
8
8890763523
rahul88jain
Gupta
12087
JK
34567
4
9
8890763853
yogesh.@jain
Gupta
12087
PL
34567
5
3
6278032123
Ravi.raj1
Kumar$
11098
BH
23456
7
5
8890763599
rahul
Gupta
12087
IP
34567
Suppose we want to keep all the Phone Numbers and eliminate special characters or any non-numeric value , we can use replace method of series.
In [9]:
0
1
73388456788
SATHISHRAAJ
SM
-13047
RI
-13047
1
2
8896763523
G.K.
Gupta
12087
UP
34567
2
7
7338848678
nihal
SM
-13047
KE
-13047
3
8
8890763523
rahul88jain
Gupta
12087
JK
34567
4
9
8890763853
yogesh.@jain
Gupta
12087
PL
34567
5
3
6278032123
Ravi.raj1
Kumar$
11098
BH
23456
6
4
6278098123
Rlavi
Kumar
11098
MP
23456
7
5
8890763599
rahul
Gupta
12087
IP
34567
8
5
8890763523
rahulk
Gupta
12087
GH
3456
Validation1 : It should have only alphabets and dots are allowed
We need to remove spaces
first thing that we need to do is to remove useless spaces if there are any in names. By useless space we mean, left and right spaces of names. We can do that by using strip() function.
In [10]:
1
2
8896763523
G.K.
Gupta
12087
UP
34567
2
7
7338848678
nihal
SM
-13047
KE
-13047
3
8
8890763523
rahul88jain
Gupta
12087
JK
34567
4
9
8890763853
yogesh.@jain
Gupta
12087
PL
34567
5
3
6278032123
Ravi.raj1
Kumar$
11098
BH
23456
7
5
8890763599
rahul
Gupta
12087
IP
34567
Second thing is to check if they are alphabets, so we need to create a column to check if they are alphabets or not.
In [11]:
1
2
8896763523
G.K.
Gupta
12087
UP
34567
False
2
7
7338848678
nihal
SM
-13047
KE
-13047
True
3
8
8890763523
rahul88jain
Gupta
12087
JK
34567
False
4
9
8890763853
yogesh.@jain
Gupta
12087
PL
34567
False
5
3
6278032123
Ravi.raj1
Kumar$
11098
BH
23456
False
7
5
8890763599
rahul
Gupta
12087
IP
34567
True
Now As you can see, First name (G.K.) should be True , but isalpha method won't allow it, so we need to find a workaround .
Why don't we replace (.) and if we replace it and our name falls true in isalpha , it will not be removed.
Now, while replacing , do remember to make a copy of the column and make all changes in it , so that we don't change original data.
In [12]:
In [13]:
1
2
8896763523
G.K.
Gupta
12087
UP
34567
False
GK
2
7
7338848678
nihal
SM
-13047
KE
-13047
True
nihal
3
8
8890763523
rahul88jain
Gupta
12087
JK
34567
False
rahul88jain
4
9
8890763853
yogesh.@jain
Gupta
12087
PL
34567
False
yogesh@jain
5
3
6278032123
Ravi.raj1
Kumar$
11098
BH
23456
False
Raviraj1
7
5
8890763599
rahul
Gupta
12087
IP
34567
True
rahul
1
2
8896763523
G.K.
Gupta
12087
UP
34567
True
GK
2
7
7338848678
nihal
SM
-13047
KE
-13047
True
nihal
3
8
8890763523
rahul88jain
Gupta
12087
JK
34567
False
rahul88jain
4
9
8890763853
yogesh.@jain
Gupta
12087
PL
34567
False
yogesh@jain
5
3
6278032123
Ravi.raj1
Kumar$
11098
BH
23456
False
Raviraj1
7
5
8890763599
rahul
Gupta
12087
IP
34567
True
rahul
Now we need to remove data where names also do not fall in validation criteria.
In [15]:
1
2
8896763523
G.K.
Gupta
12087
UP
34567
True
GK
2
7
7338848678
nihal
SM
-13047
KE
-13047
True
nihal
7
5
8890763599
rahul
Gupta
12087
IP
34567
True
rahul
Now we do not need check_alpha and temp_name , so let's remove them
In [16]:
1
2
8896763523
G.K.
Gupta
12087
UP
34567
2
7
7338848678
nihal
SM
-13047
KE
-13047
7
5
8890763599
rahul
Gupta
12087
IP
34567
We can also use title method to capitalize first letter of name.
In [17]:
1
2
8896763523
G.K.
Gupta
12087
UP
34567
2
7
7338848678
Nihal
SM
-13047
KE
-13047
7
5
8890763599
Rahul
Gupta
12087
IP
34567
Our company has already a list of circle , if any circle that is not present in our list we should remove it.
In [18]:
In [19]:
1
2
8896763523
G.K.
Gupta
12087
UP
34567
2
7
7338848678
Nihal
SM
-13047
KE
-13047
In [30]:
Can you spot any problem in data ?
You can see cities mentioned in delhi is not right. There are many irregularities.
In [28]:
10
delhi
Gharoli
34533.200
11
delhi
delhi-central
37373.373
12
delhi
Gokalpur
8322.430
13
delhi
delh
976858.430
14
delhi
delh
48483.230
15
delhi
delhi-110034
5757.670
16
delhi
Jaitpur
89786.120
17
delhi
new delh
574.340
18
delhi
del
575.120
19
delhi
Mukandpur
56744.120
20
delhi
Mundka
4854.230
21
delhi
Mitraon
574.320
22
delhi
Nilothi
8766.340
23
delhi
Nangloi Jat
5868.120
24
delhi
Nithari
5855.450
25
delhi
Neb Sarai
5858.450
26
delhi
Nangli Sakrawati
575.450
27
delhi
Pooth Kalan
797976.230
28
delhi
Pooth Khurd
44.340
29
delhi
Pul Pehlad
123.000
30
delhi
Pehlad Pur Bangar
455.120
31
delhi
Qadipur
484.230
You can refer this page while applying validation :