Web Scraping with BeautifulSoup & Requests
Extract data from websites using BeautifulSoup, Requests, and Selenium.
The Scraping Toolkit
- Requests: For making HTTP calls and getting raw HTML.
- BeautifulSoup: For parsing and navigating the HTML tree.
- Selenium: For scraping dynamic, JavaScript-rendered sites.
Basic Scraping Flow
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
resp = requests.get(url)
resp.raise_for_status()  # fail fast on 4xx/5xx responses
soup = BeautifulSoup(resp.text, 'html.parser')
title = soup.find('h1').text
print(title)
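One detail the basic flow glosses over: find() returns None when no element matches, so calling .text directly can raise an AttributeError on pages without an h1. A minimal sketch of the defensive pattern, using an inline (hypothetical) HTML string instead of a live request:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched page.
html = "<html><body><h1>Breaking News</h1><p>Story text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Guard against find() returning None before reading the text.
heading = soup.find("h1")
title = heading.get_text(strip=True) if heading else "(no title)"

# An element that does not exist on this page:
missing = soup.find("h2")
subtitle = missing.get_text(strip=True) if missing else "(no subtitle)"

print(title, "|", subtitle)
```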
Navigating the Tree
Find elements by tag, class, or ID.
# Find all links
links = soup.find_all('a')
# Find by class
articles = soup.find_all('div', class_='article-content')
# CSS Selectors
header_links = soup.select('nav ul li a')
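Once you have the link tags, you usually want their href attributes rather than the tags themselves. Tags support dict-style attribute access, and .get() avoids a KeyError when an attribute is missing; urljoin from the standard library resolves relative paths against the site's base URL. A sketch using hypothetical nav markup:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical nav markup; a real page would come from requests.get().
html = """
<nav><ul>
  <li><a href="/home">Home</a></li>
  <li><a href="/about">About</a></li>
  <li><a href="https://other.example.org/blog">Blog</a></li>
</ul></nav>
"""
soup = BeautifulSoup(html, "html.parser")

base = "https://example.com"
# .get('href', '') returns '' instead of raising when href is absent;
# urljoin turns relative paths into absolute URLs and leaves absolute ones alone.
urls = [urljoin(base, a.get("href", "")) for a in soup.select("nav ul li a")]
print(urls)
```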
Respecting Robots.txt
Always check a site's /robots.txt file to see which paths it allows crawlers to access. Be a good bot: add delays between requests and set a descriptive User-Agent.
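Python's standard library can parse robots.txt rules for you via urllib.robotparser. A sketch using a hypothetical robots.txt body fed in directly (in practice you would fetch https://example.com/robots.txt first); the bot name and URLs are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents.
robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

agent = "MyScraperBot/1.0"  # a descriptive User-Agent you would also send as a header
allowed = rp.can_fetch(agent, "https://example.com/articles")
blocked = rp.can_fetch(agent, "https://example.com/admin/users")
delay = rp.crawl_delay(agent)  # seconds to sleep between requests, if specified

print(allowed, blocked, delay)
```

A polite request loop would then pass {"User-Agent": agent} as the headers argument to requests.get() and call time.sleep(delay) between fetches.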
Practice (30 minutes)
- Install dependencies: pip install requests beautifulsoup4.
- Scrape a news website and print the titles of the top 5 articles.
- Extract all image URLs from a landing page.
- Save your scraped data to a JSON file.
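For the last task, the standard-library json module is all you need. A minimal sketch with hypothetical records standing in for your scraped results:

```python
import json

# Hypothetical scraped records; yours would come from soup.find_all().
articles = [
    {"title": "First headline", "url": "https://example.com/a/1"},
    {"title": "Second headline", "url": "https://example.com/a/2"},
]

# ensure_ascii=False keeps any non-ASCII headline text readable in the file.
with open("articles.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, indent=2, ensure_ascii=False)

# Round-trip check: the file parses back to the same structure.
with open("articles.json", encoding="utf-8") as f:
    loaded = json.load(f)
print(len(loaded))
```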