Web Scraping with Python

Katha Vachhani
2 min read · Jul 23, 2021

--

As we know, Python has a wide range of applications, and there are different libraries for different purposes. In the demonstration that follows, we will be using the following libraries:

  • Selenium: Selenium is a web testing library. It is used to automate browser activities.
  • BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that make it easy to extract data.
  • Pandas: Pandas is a library used for data manipulation and analysis. It is used to extract the data and store it in the desired format.
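
These libraries can be installed with pip if they are not already available in your environment (Google Colab, which is used later in this post, ships with most of them preinstalled). A minimal sketch, assuming a standard Python setup; requests is included because the code below uses it to download the page:

pip install requests beautifulsoup4 pandas selenium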
1. Find the URL that you want to scrape:

For this example, we are going to scrape the IMDb Box Office chart to extract each movie’s title, weekend gross, total gross, and number of weeks on the chart. The URL for this page is https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht

2. Inspect the page:

The data is usually nested in HTML tags, so we inspect the page to see which tags the data we want to scrape is nested under. To inspect the page, just right-click on the element and click on “Inspect”.

3. Find the data you want to extract:

Let’s extract each movie’s title, weekend gross, total gross, and number of weeks on the chart. Each row of the chart is a “tr” element, and these values are nested in its “td” and “span” cells.
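
Based on the class names used in the code later in this post, each row of the chart roughly looks like the sketch below (the actual IMDb markup may differ slightly or change over time):

<tr>
  <td class="titleColumn">Movie Title</td>
  <td class="ratingColumn">$xx.xM</td>                                     <!-- weekend gross -->
  <td class="ratingColumn"><span class="secondaryInfo">$xxxM</span></td>   <!-- total gross -->
  <td class="weeksColumn">3</td>                                           <!-- weeks on chart -->
</tr>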

4. Write the code:

I am using Google Colab for this code.

Import libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

Creating empty lists to store the scraped values

TitleName=[]
Gross=[]
Weekend=[]
Week=[]

5. Run the code and extract the data

Open the URL and download the page content:

url = "https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht"
r = requests.get(url).content
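
If the request fails (for example, with a 4xx or 5xx response), the later parsing steps will quietly return nothing. An optional check, using only the standard requests API:

response = requests.get(url)
response.raise_for_status()  # raises an exception if the page did not load
r = response.content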

Using the find and find_all methods in BeautifulSoup, we extract the data from each table row and store it in variables.

soup = BeautifulSoup(r, "html.parser")
rows = soup.find("tbody", {"class": ""}).find_all("tr")
for i in rows:
    title = i.find("td", {"class": "titleColumn"})      # movie title
    gross = i.find("span", {"class": "secondaryInfo"})  # total gross
    weekend = i.find("td", {"class": "ratingColumn"})   # weekend gross
    week = i.find("td", {"class": "weeksColumn"})       # weeks on chart

Using append, we store the details in the lists we created earlier. Note that these lines are still inside the for loop above:

    TitleName.append(title.text)
    Gross.append(gross.text)
    Weekend.append(weekend.text)
    Week.append(week.text)
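
After the loop finishes, a quick optional sanity check confirms that all four lists have the same length and contain sensible values:

print(len(TitleName), len(Weekend), len(Gross), len(Week))
print(TitleName[:3])  # first three titles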

6. Store the data in the required format

Store the data in comma-separated values (CSV) format:

df=pd.DataFrame({'Movie Title':TitleName, 'Weekend':Weekend, 'Gross':Gross, 'Week':Week})
df.to_csv('DS-PR1-18IT012.csv', index=False, encoding='utf-8')
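
To confirm the file was written correctly, the CSV can be read back with pandas (the filename is the same one used above):

print(pd.read_csv('DS-PR1-18IT012.csv').head())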

GitHub Link
