1. Web Scraping with Python
As we know, Python has many applications, and there are different libraries for different purposes. In this demonstration, we will be using the following libraries:
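- Selenium: Selenium is a web testing library. It is used to automate browser activities. The code in this post does not need a browser, so it downloads the page with the Requests library instead.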
- BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that make it easy to extract the data.
- Pandas: Pandas is a library used for data manipulation and analysis. It is used to extract the data and store it in the desired format.
- Find the URL that you want to scrape :
For this example, we are going to scrape the IMDb website to extract the title, weekend gross, total gross, and number of weeks in release of the movies on its box office chart. The URL for this page is https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht
2. Inspecting the Page :
The data is usually nested in tags, so we inspect the page to see which tag contains the data we want to scrape. To inspect the page, right-click on the element and click on “Inspect”.
3. Find the data you want to extract :
Let’s extract the title, weekend gross, overall gross, and number of weeks for each movie. Each of these values sits in a “td” or “span” tag inside a table row (“tr”).
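To make the tag structure concrete, here is a minimal sketch that parses one hand-written table row and looks up its cells by tag and class. The HTML snippet is an assumption modeled on the class names used in the code later in this post (titleColumn, ratingColumn, secondaryInfo, weeksColumn); the live IMDb markup may differ, so check it in the inspector first.

```python
from bs4 import BeautifulSoup

# Hypothetical row, written by hand to mimic the structure seen in the inspector.
sample_html = """
<table><tbody>
<tr>
  <td class="titleColumn"><a href="#">Example Movie</a></td>
  <td class="ratingColumn">$10.0M</td>
  <td class="ratingColumn"><span class="secondaryInfo">$25.0M</span></td>
  <td class="weeksColumn">2</td>
</tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
row = sample_soup.find("tr")

# find() returns the first matching tag; get_text(strip=True) drops surrounding whitespace.
sample_title = row.find("td", {"class": "titleColumn"}).get_text(strip=True)
sample_weekend = row.find("td", {"class": "ratingColumn"}).get_text(strip=True)
sample_gross = row.find("span", {"class": "secondaryInfo"}).get_text(strip=True)
sample_weeks = row.find("td", {"class": "weeksColumn"}).get_text(strip=True)

print(sample_title, sample_weekend, sample_gross, sample_weeks)
```

Note that two cells share the class ratingColumn here. find() picks the first one, which is why the gross is looked up through its inner “span” instead.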
4. Write the code :
I am using Google Colab for this code.
Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
Creating empty lists
TitleName=[]
Gross=[]
Weekend=[]
Week=[]
5. Run the code and extract the data
Open the URL and download the page content
url = "https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht"
r = requests.get(url).content
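A bare requests.get call gives no sign when the download fails, and some sites reject requests that do not send a browser-like User-Agent. The following is a hedged sketch of a slightly more defensive download. The helper name fetch_html, the header value, and the 10-second timeout are all assumptions made for illustration, not part of the original code.

```python
import requests

# Assumed header value; any descriptive User-Agent string could be used here.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}

def fetch_html(page_url, get=requests.get, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = get(page_url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content

# Usage (performs a real network request):
# r = fetch_html("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
```

The get parameter exists only so the function can be tried out with a stand-in instead of a real network call.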
Using the find and find_all methods in BeautifulSoup, we extract the data and store it in variables.
soup = BeautifulSoup(r, "html.parser")
rows = soup.find("tbody", {"class":""}).find_all("tr")
for i in rows:
    # Look up each cell of the current table row by tag and class
    title = i.find("td",{"class":"titleColumn"})
    gross = i.find("span",{"class":"secondaryInfo"})
    weekend = i.find("td",{"class":"ratingColumn"})
    week = i.find("td",{"class":"weeksColumn"})
    # Using append, we store the details in the lists that we created before
    TitleName.append(title.text.strip())
    Gross.append(gross.text.strip())
    Weekend.append(weekend.text.strip())
    Week.append(week.text.strip())
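The loop above assumes every row contains all four cells. If find() does not match anything it returns None, and reading .text from None raises an AttributeError. One possible way to guard against that is sketched below. The helper cell_text and the sample markup are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup

def cell_text(row, tag, class_name):
    """Return the stripped text of the first matching tag, or "" if it is absent."""
    cell = row.find(tag, {"class": class_name})
    return cell.get_text(strip=True) if cell is not None else ""

# Hypothetical markup: the second row has no weeksColumn cell.
demo_html = """
<table><tbody>
<tr><td class="titleColumn">Movie A</td><td class="weeksColumn">3</td></tr>
<tr><td class="titleColumn">Movie B</td></tr>
</tbody></table>
"""

demo_rows = BeautifulSoup(demo_html, "html.parser").find("tbody").find_all("tr")
demo_titles = [cell_text(r, "td", "titleColumn") for r in demo_rows]
demo_weeks = [cell_text(r, "td", "weeksColumn") for r in demo_rows]

print(demo_titles, demo_weeks)
```

Returning an empty string for a missing cell keeps all lists the same length, which matters when they are later combined into one table.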
6. Store the data in the required format
Store the data in comma-separated values (CSV) format
df=pd.DataFrame({'Movie Title':TitleName, 'Weekend':Weekend, 'Gross':Gross, 'Week':Week})
df.to_csv('DS-PR1-18IT012.csv', index=False, encoding='utf-8')
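As a quick check of this step, the sketch below builds a small DataFrame from made-up values, writes it as CSV to an in-memory buffer instead of a file, and reads it back. The movie names and figures are placeholders, not scraped results. Note that pd.DataFrame raises a ValueError if the lists passed to it have different lengths.

```python
import io
import pandas as pd

# Made-up values standing in for the scraped lists.
demo_df = pd.DataFrame({
    "Movie Title": ["Movie A", "Movie B"],
    "Weekend": ["$10.0M", "$4.2M"],
    "Gross": ["$25.0M", "$4.2M"],
    "Week": ["2", "1"],
})

buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)  # same call as above, but writing to memory
buffer.seek(0)

loaded = pd.read_csv(buffer)  # read the CSV text back to confirm its layout
print(loaded.head())
```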