Suppose Big Bazaar wants to move into the e-commerce industry and wants to collect data on multiple products from Flipkart. Let us start by collecting laptop data from Flipkart.
Data to collect :
Step 1 : Get the HTML document of the web page.
Introduction to requests:
The requests library lets you send HTTP requests extremely easily. An HTTP (Hypertext Transfer Protocol) request is sent to a server to access information.
When a search engine or website visitor makes a request to a web server, a three-digit HTTP response status code is returned. This code indicates the outcome of the request. A response code of 200 means "OK, here is the content you were asking for." It means the request succeeded.
If the status_code is 200, the request has succeeded; 200 is the most common status code. Once the request has succeeded, we can get the HTML document of the web page and start scraping.
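As a small illustration of the status codes described above, the sketch below uses Python's standard http.HTTPStatus enum to look up the meaning of a code and to check whether it falls in the 2xx success range. The helper name is_success is hypothetical and is not part of requests.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True for any 2xx status code (hypothetical helper)."""
    return 200 <= status_code < 300


for code in (200, 404, 500):
    status = HTTPStatus(code)
    outcome = "success" if is_success(code) else "failed"
    # Print the numeric code, its standard phrase, and our own verdict
    print(code, status.phrase, "->", outcome)
```

requests also offers response.raise_for_status(), which raises an exception for 4xx and 5xx codes, if you prefer an error over a silent skip.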
Let the Scraping Begin :
1. Go to flipkart.com, search for laptop, and copy the web page URL.
The first thing we need to do is import all the necessary libraries and then send an HTTP request to the website. If the request succeeds, we can access the HTML text of that web page for scraping.
# Importing necessary Libraries
from bs4 import BeautifulSoup
import requests
# url of the webpage
url = "https://www.flipkart.com/search?q=laptop&sid=6bo%2Cb5g&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_2_2_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_2_2_na_na_na&as-pos=2&as-type=RECENT&suggestionId=laptop%7CLaptops&requestId=483bcb46-7458-4c48-869c-773583f9e774&as-searchtext=laptop"
# send an HTTP request to the website and get the response
response = requests.get(url)
# if the response code is 200, get the content of the web page using the content property
if response.status_code == 200:
    html_text = response.content
    print(html_text)
Output :
<!doctype html><html lang="en"><head>
<link href="https://rukminim1.flixcart.com" rel="preconnect"/>
<link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.905c37.css"/>
<link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.104e9a.css"/>
<meta http-equiv="Content.........
.................</html>
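The copied URL is long because it carries many query parameters besides the search term. As a rough sketch, the standard urllib.parse module can split a URL into its parts so you can see which parameter holds the search term. The URL below is a shortened copy of the one used above, kept short only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened form of the search URL used in this article (illustration only)
url = "https://www.flipkart.com/search?q=laptop&sid=6bo%2Cb5g&as=on&as-show=on"

parts = urlsplit(url)
# parse_qs decodes percent-escapes and returns a dict of name -> list of values
params = parse_qs(parts.query)

print(parts.netloc)
print(parts.path)
for name, values in params.items():
    print(name, "=", values[0])
```

Here the q parameter carries the search term, and %2C in sid is decoded to a comma.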
To scrape laptop details, we first need to scrape the first instance of a laptop, that is, the first post. Once we can scrape the first post, we can use it as an anchor and scrape the information for all the laptops.
As we learned previously, the same kind of data lies in the same class. So let us find the class for our laptop post.
As you can see, all of the contents of our laptop post are in an <a> tag with the class name _1fQZEK, so let's scrape this.
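Scraping Laptop Details using the <a> tag :
# parse the HTML text fetched above (the "lxml" parser must be installed)
soup = BeautifulSoup(html_text, "lxml")
laptop = soup.find('a', "_1fQZEK")
print(laptop)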
Now that we have scraped our laptop post, let's scrape the other details.
Scraping Laptop Model :
The tag name is <div> and the class name for the model is _4rR01T.
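A minimal sketch of this step, assuming the markup looks like the simplified, hand-written HTML below (the real Flipkart page is far larger, and its class names may differ from the ones used in this article). It uses Python's built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics one laptop post (an assumption, not real page data)
sample_html = """
<a class="_1fQZEK" href="/sample-laptop/p/itm123">
  <div class="_4rR01T">acer Aspire 3 Dual Core</div>
</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")
laptop = soup.find("a", "_1fQZEK")

# The model name is the text of the <div> with class _4rR01T
model = laptop.find("div", "_4rR01T").get_text()
# The brand is taken as the first word of the model name, title-cased
brand = model.split(" ")[0].title()

print(model)
print(brand)
```

Passing a string as the second argument to find is Beautiful Soup's shorthand for matching on the CSS class.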
Now that we have scraped everything, let us combine it all and see the result.
from bs4 import BeautifulSoup
import requests
url = "https://www.flipkart.com/search?q=laptop&sid=6bo%2Cb5g&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_2_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_2_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=laptop%7CLaptops&requestId=483bcb46-7458-4c48-869c-773583f9e774&as-backfill=on"
response = requests.get(url)
if response.status_code == 200:
    html_text = response.content
    soup = BeautifulSoup(html_text, "lxml")
    # first laptop post on the page
    laptop = soup.find('a', "_1fQZEK")
    model = laptop.find('div', '_4rR01T').get_text()
    # the brand is the first word of the model name
    brand = model.split(' ')[0].title()
    # strip the currency symbol and thousands separators before converting
    price = int(laptop.find('div', '_30jeq3').get_text().replace('₹', '').replace(',', ''))
    link = "https://www.flipkart.com" + laptop['href']
    image = laptop.find('img')['src']
    # collect every <li> of the specification list
    specification_list = []
    lists = laptop.find('ul')
    specifications = lists.find_all('li')
    for specification in specifications:
        specification_list.append(specification.get_text())
    rating = float(laptop.find('div', '_3LWZlK').get_text())
    print(f'''Brand : {brand}
Price : {price}
Link : {link}
Image Link : {image}
Rating : {rating}
Specifications : {specification_list}''')
Output :
Brand : Acer
Price : 21890
Link : https://www.flipkart.com/acer-aspire-3-dual-core-3020e-4-gb-256-gb-ssd-windows-11-home-a314-22-laptop/p/itm07e718a865608?pid=COMGBNUMF8CDJZFH&lid=LSTCOMGBNUMF8CDJZFHVNFSHE&marketplace=FLIPKART&q=laptop&store=6bo%2Fb5g&srno=s_1_1&otracker=AS_QueryStore_OrganicAutoSuggest_1_2_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_2_na_na_na&fm=organic&iid=en_Gbo3777pGH7PAcF2H26oAIXyFRd0RPuu3u9xDpwameJZkT0HUiVYAfbkaioxeA9TFXuZ9FCqywavLb%2B8qN9sag%3D%3D&ppt=None&ppn=None&ssid=7yr62k7ir40000001664969271008&qH=312f91285e048e09
Image Link : https://rukminim1.flixcart.com/image/312/312/xif0q/computer/h/u/b/-original-imagg2vnsrgrfkmz.jpeg?q=70
Rating : 4.0
Specifications :
['AMD Dual Core Processor',
'4 GB DDR4 RAM',
'64 bit Windows 11 Operating System',
'256 GB SSD',
'35.56 cm (14 Inch) Display',
'Acer Care Center, Quick Access, Acer Product Registration',
'1 Year International Travelers Warranty (ITW)']
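The price and link lines in the script do the most string handling, so here they are sketched on their own. parse_price and build_link are hypothetical helper names, and the relative href is a made-up input in the same format as the values on the page.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.flipkart.com"


def parse_price(text: str) -> int:
    """Turn a price string such as '₹21,890' into the integer 21890."""
    return int(text.replace("₹", "").replace(",", "").strip())


def build_link(href: str) -> str:
    """Join a relative href from the page with the site's base URL."""
    return urljoin(BASE_URL, href)


print(parse_price("₹21,890"))
print(build_link("/sample-laptop/p/itm123"))
```

Using urljoin instead of plain string concatenation is a design choice: it also copes with an href that is already an absolute URL.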
Section 2
We will scrape the details of all the laptops.
Create a dictionary to store all the details.
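One possible way to set up such a dictionary, sketched under the assumption that each detail is stored as a list with one entry per laptop. The key names and the add_laptop helper are hypothetical; the values in the usage lines are taken from the sample output shown earlier.

```python
# One list per field; index i in every list describes the same laptop
laptop_details = {
    "Brand": [],
    "Price": [],
    "Rating": [],
    "Specifications": [],
}


def add_laptop(details, brand, price, rating, specifications):
    """Append one laptop's values to every list in the dictionary."""
    details["Brand"].append(brand)
    details["Price"].append(price)
    details["Rating"].append(rating)
    details["Specifications"].append(specifications)


add_laptop(laptop_details, "Acer", 21890, 4.0,
           ["AMD Dual Core Processor", "4 GB DDR4 RAM"])
print(laptop_details)
```

Keeping the lists the same length means the dictionary can later be turned into a table without reshaping.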
Scrape details of all the laptops on the current web page :
We have scraped the details of a single laptop. To scrape the details of all the laptops, all we need to do is find every occurrence of the <a> tag with the class name _1fQZEK.
Doing this is pretty simple. All you need to do is use find_all, like this :
laptops = soup.find_all('a', '_1fQZEK')
for laptop in laptops:
    # use the script written above for scraping a single laptop's data
    pass
Like this :
from bs4 import BeautifulSoup
import requests
url = "https://www.flipkart.com/search?q=laptop&sid=6bo%2Cb5g&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_2_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_2_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=laptop%7CLaptops&requestId=483bcb46-7458-4c48-869c-773583f9e774&as-backfill=on"
response = requests.get(url)
if response.status_code == 200:
    html_text = response.content
    soup = BeautifulSoup(html_text, "lxml")
    laptops = soup.find_all('a', "_1fQZEK")
    for laptop in laptops:
        model = laptop.find('div', '_4rR01T').get_text()
        brand = model.split(' ')[0].title()
        price = int(laptop.find('div', '_30jeq3').get_text().replace('₹', '').replace(',', ''))
        link = "https://www.flipkart.com" + laptop['href']
        image = laptop.find('img')['src']
        specification_list = []
        lists = laptop.find('ul')
        specifications = lists.find_all('li')
        for specification in specifications:
            specification_list.append(specification.get_text())
        rating = float(laptop.find('div', '_3LWZlK').get_text())
        print(f'''Brand : {brand}
Price : {price}
Link : {link}
Image Link : {image}
Rating : {rating}
Specifications : {specification_list}''')
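One thing to watch for in the loop above: find returns None when a tag is missing, so a laptop without a rating would raise an AttributeError on .get_text(). Below is a minimal sketch of one way to guard against that, using a hand-written two-post HTML sample (an assumption, not real Flipkart data) and a hypothetical helper text_or_none. It uses the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-written sample: the second post has no rating <div>
sample_html = """
<a class="_1fQZEK" href="/laptop-one/p/itm1">
  <div class="_4rR01T">acer Aspire 3</div>
  <div class="_3LWZlK">4.0</div>
  <div class="_30jeq3">₹21,890</div>
</a>
<a class="_1fQZEK" href="/laptop-two/p/itm2">
  <div class="_4rR01T">hp 15s</div>
  <div class="_30jeq3">₹35,990</div>
</a>
"""


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    element = parent.find(tag, class_name)
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(sample_html, "html.parser")
results = []
for laptop in soup.find_all("a", "_1fQZEK"):
    rating_text = text_or_none(laptop, "div", "_3LWZlK")
    price_text = text_or_none(laptop, "div", "_30jeq3")
    results.append({
        "Model": text_or_none(laptop, "div", "_4rR01T"),
        # Convert only when the text was found; otherwise keep None
        "Price": int(price_text.replace("₹", "").replace(",", "")) if price_text else None,
        "Rating": float(rating_text) if rating_text else None,
    })

for row in results:
    print(row)
```

Storing None for a missing field keeps every laptop in the results instead of stopping the whole loop at the first incomplete post.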