
What is Scrapy?
Scrapy is an open-source web crawling framework written in Python, designed to extract structured data from websites efficiently. It is widely used for web scraping, data mining, automated testing, and various other data extraction tasks. Scrapy is built on top of the Twisted networking engine, making it highly efficient and scalable due to its asynchronous architecture.
Why Use Scrapy?
- Speed and Efficiency: Scrapy’s asynchronous nature allows it to fetch multiple pages concurrently.
- Modular and Extensible: Built-in middlewares, pipelines, and extensions provide a flexible framework.
- Supports Multiple Data Formats: Extracted data can be stored in JSON, CSV, XML, and databases.
- Built-in Handling for Common Scraping Challenges: Handles cookies, sessions, authentication, and proxies.
Installing Scrapy
Scrapy can be installed using pip:
pip install scrapy
To verify that Scrapy is installed successfully, run:
scrapy version
Setting Up a Scrapy Project
To create a new Scrapy project, use the following command:
scrapy startproject myproject
This creates the following directory structure:
myproject/
    scrapy.cfg            # Project configuration file
    myproject/            # Main project directory
        __init__.py
        items.py          # Defines data structure for scraped items
        middlewares.py    # Custom middlewares for processing requests/responses
        pipelines.py      # Processes scraped data before storing it
        settings.py       # Project settings (e.g., user agents, proxies, delays)
        spiders/          # Stores the spider scripts
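For illustration, items.py might declare fields like the following; the ArticleItem class and its fields are placeholder assumptions, not something the startproject template generates:

import scrapy

# Hypothetical item definition describing the fields a spider could collect
class ArticleItem(scrapy.Item):
    title = scrapy.Field()  # e.g. the page title
    url = scrapy.Field()    # e.g. the source URL of the scraped page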
Creating and Running a Scrapy Spider
A Spider is a class in Scrapy that defines how a website should be scraped.
To generate a spider:
scrapy genspider myspider example.com
This creates myspider.py inside the spiders/ directory.
Example Spider
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        title = response.xpath('//title/text()').get()
        yield {"title": title}
Running the Spider
scrapy crawl myspider
Extracting Data with Scrapy
Scrapy provides powerful tools for extracting data:
Using XPath
response.xpath('//h1/text()').get()
Using CSS Selectors
response.css('h1::text').get()
To extract multiple elements:
response.xpath('//p/text()').getall()
response.css('p::text').getall()
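For example, a parse method can combine these selectors in a loop and yield one item per matched element; the div.quote structure below is an assumed page layout used only for illustration:

def parse(self, response):
    # Iterate over each (assumed) quote block and pull out individual fields
    for quote in response.css('div.quote'):
        yield {
            "text": quote.css('span.text::text').get(),
            "author": quote.xpath('.//small/text()').get(),
        }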
Saving Scraped Data
Scrapy allows exporting data in multiple formats:
scrapy crawl myspider -o output.json
scrapy crawl myspider -o output.csv
Supported formats: JSON, CSV, XML, JSONL.
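The same exports can also be configured in settings.py instead of on the command line; the sketch below uses Scrapy's FEEDS setting (available in recent Scrapy versions), and the output file names are just examples:

FEEDS = {
    "output.json": {"format": "json", "overwrite": True},
    "output.csv": {"format": "csv"},
}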
Scrapy Middleware and Advanced Configurations
Scrapy provides middlewares for handling advanced tasks like rotating user agents, using proxies, and handling captchas.
Changing User-Agent
Modify settings.py:
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
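To rotate user agents rather than send a single fixed string, one common approach is a small custom downloader middleware; the class name and the USER_AGENT_LIST setting below are assumptions for illustration, and the middleware would still need to be registered in DOWNLOADER_MIDDLEWARES:

import random

# Hypothetical middleware (e.g. in middlewares.py) that picks a random
# user agent for every outgoing request
class RotateUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting listing several UA strings
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)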
Using Proxies
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}

# PROXY_LIST is a custom setting; it needs a custom middleware to take effect (see the sketch below)
PROXY_LIST = [
    "http://proxy1.com",
    "http://proxy2.com",
]
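Note that PROXY_LIST by itself is not a built-in Scrapy setting: the bundled HttpProxyMiddleware picks up proxies from environment variables or from request.meta['proxy']. Below is a minimal sketch of a custom middleware (names assumed) that feeds PROXY_LIST into request.meta:

import random

# Hypothetical middleware that assigns a random proxy from PROXY_LIST to each request
class RandomProxyMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        middleware.proxies = crawler.settings.getlist("PROXY_LIST")
        return middleware

    def process_request(self, request, spider):
        if self.proxies:
            # The built-in HttpProxyMiddleware honours the 'proxy' key in request.meta
            request.meta["proxy"] = random.choice(self.proxies)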
Handling JavaScript-Rendered Pages with Selenium
Some websites rely heavily on JavaScript. In such cases, Scrapy alone may not be sufficient, and Selenium can be used.
Install Selenium:
pip install selenium
Example: Scrapy with Selenium
from selenium import webdriver
from scrapy.selector import Selector
import scrapy

class SeleniumSpider(scrapy.Spider):
    name = "selenium_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Re-render the page with Selenium so JavaScript-generated content is present
        driver = webdriver.Chrome()
        driver.get(response.url)
        page_source = driver.page_source
        driver.quit()

        # Wrap the rendered HTML in a Selector to reuse Scrapy's XPath/CSS API
        selector = Selector(text=page_source)
        title = selector.xpath('//title/text()').get()
        yield {"title": title}
This technique allows scraping JavaScript-rendered content.
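In practice you will usually want the browser to run headless (without opening a window); here is a minimal sketch using Selenium's Chrome options:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)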
Optimizing Scrapy Performance
Enable AutoThrottle (Adjusts request rate dynamically):
Modify settings.py:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
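If you want to tune the throttling further, AutoThrottle also exposes a target concurrency and a debug mode (to my understanding these are standard Scrapy settings, shown here with example values):

AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average number of parallel requests per remote site
AUTOTHROTTLE_DEBUG = True               # log throttling stats for every response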
Use Concurrent Requests (increases speed while staying respectful to websites):
CONCURRENT_REQUESTS = 16
Set Request Delay (helps avoid getting banned):
DOWNLOAD_DELAY = 2
Storing Data in Databases
Scrapy supports storing data in databases such as MySQL, PostgreSQL, and MongoDB.
Example: Storing Data in MongoDB
Install pymongo:
pip install pymongo
Modify pipelines.py:
import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        # Connect to a local MongoDB instance when the spider starts
        self.client = pymongo.MongoClient("mongodb://localhost:27017/")
        self.db = self.client["scrapy_data"]

    def close_spider(self, spider):
        # Release the connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        self.db["scraped_items"].insert_one(dict(item))
        return item
Enable the pipeline in settings.py:
ITEM_PIPELINES = {
    # 300 is the pipeline's priority; lower numbers run earlier
    'myproject.pipelines.MongoPipeline': 300,
}
Conclusion
Scrapy is an incredibly powerful, scalable, and flexible framework for web scraping.
By mastering Scrapy’s advanced features like middlewares, proxies, and Selenium integration, you can build robust crawlers that efficiently extract and store data.
Stay tuned for more advanced Scrapy tutorials and real-world applications!