How do I use Scrapy to scrape the links from every page? (Existing text-extraction code attached)
How to Modify Your Scrapy Spider to Extract All Web Links
Alright, let's tweak your existing Scrapy spider to capture every link from the pages you've listed. Here's a straightforward breakdown with the updated code:
Key Basics to Keep in Mind
To pull links, we'll target the <a> HTML tags (the ones that create hyperlinks) and extract their href attributes. We also need to convert relative URLs (like /about) to full absolute URLs (like https://www.domo.com/about) so they're usable outside the context of the original page.
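To see what that relative-to-absolute conversion does in isolation, here is a minimal sketch using the standard library's urljoin, which resolves an href against a base URL in the same way. The helper name to_absolute and the sample hrefs are made up for illustration; they are not part of the spider.

```python
from urllib.parse import urljoin


def to_absolute(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical hrefs of the kinds you typically find on a page
page = "https://www.domo.com/"
for href in ["/about", "pricing", "https://example.com/x", "#top"]:
    print(href, "->", to_absolute(page, href))
```

Links that are already absolute pass through unchanged, so it is safe to apply the conversion to every href.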
Updated Full Code
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "dialpad"

        # Optional: Restrict crawling to your target sites
        # (prevents wandering to unrelated pages)
        allowed_domains = [
            "help.dialpad.com",
            "domo.com",
            "zenreach.com",
            "trendkite.com",
            "peloton.com",
            "ting.com",
            "cedar.com",
            "tophat.com",
            "bambora.com",
            "hoteltonight.com",
        ]

        def start_requests(self):
            urls = [
                'https://help.dialpad.com/hc/en-us/categories/201278063-User-Support',
                'https://www.domo.com/',
                'https://www.zenreach.com/',
                'https://www.trendkite.com/',
                'https://peloton.com/',
                'https://ting.com/',
                'https://www.cedar.com/',
                'https://tophat.com/',
                'https://www.bambora.com/en/ca/',
                'https://www.hoteltonight.com/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Grab every href attribute from <a> tags on the page
            raw_links = response.css('a::attr(href)').getall()

            # Convert relative paths to full absolute URLs
            for link in raw_links:
                absolute_link = response.urljoin(link)

                # Output the link as a structured item
                # (Scrapy will save this in your chosen format)
                yield {'extracted_link': absolute_link}

                # Optional: Uncomment below to recursively crawl each extracted link
                # yield scrapy.Request(url=absolute_link, callback=self.parse)
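One thing to be aware of: hrefs such as mailto: or javascript: links already carry a scheme, so joining them against the page URL leaves them unchanged and they end up in the output alongside real web links. If you only want web pages, a small filter along these lines could run before yielding. This is a sketch, not part of the code above; is_http_link and clean_links are hypothetical helper names.

```python
from urllib.parse import urlsplit


def is_http_link(url: str) -> bool:
    """Return True only for http:// and https:// URLs."""
    return urlsplit(url).scheme in ("http", "https")


def clean_links(links):
    """Keep web links only, dropping repeats while preserving order."""
    seen = set()
    result = []
    for link in links:
        if is_http_link(link) and link not in seen:
            seen.add(link)
            result.append(link)
    return result


# Hypothetical list of already-absolute links from one page
sample = [
    "https://ting.com/shop",
    "mailto:team@example.com",
    "javascript:void(0)",
    "https://ting.com/shop",
]
print(clean_links(sample))
```

Inside parse, you would build the list of absolute links first, pass it through clean_links, and yield one item per remaining link.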
What Changed & Why
- Added allowed_domains: This optional setting tells Scrapy to drop requests to any domain that is not on the list. It only matters once you follow links (the recursive option described in the last item), where it keeps the spider from wandering off your target sites.
- Revised parse method:
  - response.css('a::attr(href)').getall(): Uses a CSS selector to fetch every href value from <a> tags. getall() returns a list of all matches.
  - response.urljoin(link): Resolves relative URLs against the URL of the current page. For example, if the page is https://www.peloton.com/ and a link is /shop, this turns it into https://www.peloton.com/shop.
  - Yielding items: The spider outputs each link as a dictionary, which you can export to JSON, CSV, or other formats with a single command (like scrapy crawl dialpad -o links.json).
- Recursive crawling option: If you want to follow every extracted link and crawl those pages too, uncomment the yield scrapy.Request(...) line. Scrapy's default duplicate filter skips requests for URLs it has already seen during the run, so the same page won't be crawled twice.
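Note that the duplicate filter applies to requests, not to the items you yield, so the same link can show up in the export many times if several pages point to it. A minimal post-processing sketch, assuming the export is a JSON array of objects with the extracted_link key used above (the sample data and the unique_links helper are made up for illustration):

```python
import json


def unique_links(records):
    """Return the distinct extracted_link values, sorted alphabetically."""
    return sorted({record["extracted_link"] for record in records})


# Stand-in for the contents of links.json; with a real export you would
# use json.load() on the opened file instead of json.loads() on a string.
sample = """
[
  {"extracted_link": "https://ting.com/shop"},
  {"extracted_link": "https://ting.com/"},
  {"extracted_link": "https://ting.com/shop"}
]
"""
records = json.loads(sample)
print(unique_links(records))
```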
Quick Pro Tips
- Always check a site's robots.txt file (e.g., https://www.domo.com/robots.txt) to make sure you're allowed to crawl it.
- Add a DOWNLOAD_DELAY in your Scrapy settings to avoid overwhelming target servers with too many requests at once.
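Both tips can be expressed as settings. The sketch below is a plain dictionary that you could assign to the spider's custom_settings class attribute, or copy the two entries into your project's settings.py. The one-second delay is an arbitrary illustrative value, not a recommendation for any particular site, and the name POLITE_SETTINGS is made up.

```python
# Hypothetical settings dictionary; assign it to custom_settings on the
# spider class, or move the two entries into settings.py.
POLITE_SETTINGS = {
    # Have Scrapy fetch each site's robots.txt and skip disallowed URLs
    "ROBOTSTXT_OBEY": True,
    # Wait (in seconds) between requests to the same site; value is illustrative
    "DOWNLOAD_DELAY": 1.0,
}

print(POLITE_SETTINGS)
```

With ROBOTSTXT_OBEY enabled, Scrapy does the robots.txt check for you on every request, though reading the file yourself is still a good habit.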
The question originally comes from Stack Exchange, asked by Pranav Barot.




