How do I use Scrapy to scrape all the links on every page? (Existing text-extraction code included)

Alright, let's tweak your existing Scrapy spider to capture every link from the pages you've listed. Here's a straightforward breakdown with the updated code:

Key Basics to Keep in Mind

To pull links, we'll target the <a> HTML tags (the ones that create hyperlinks) and extract their href attributes. We also need to convert relative URLs (like /about) to full absolute URLs (like https://www.domo.com/about) so they're usable outside the context of the original page.
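To make the relative-to-absolute step concrete, here is a minimal sketch using the standard library's urljoin, which behaves much like response.urljoin when the page URL is the base. The sample href values are made up for illustration.

```python
from urllib.parse import urljoin

# Page the links were found on (taken from the spider's start URLs).
base_url = "https://www.domo.com/"

# A root-relative href is resolved against the site root.
about_url = urljoin(base_url, "/about")

# A path-relative href is resolved against the current directory.
post_url = urljoin("https://www.domo.com/blog/", "post-1")

# An href that is already absolute is returned unchanged.
external_url = urljoin(base_url, "https://www.peloton.com/shop")

print(about_url)
print(post_url)
print(external_url)
```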

Updated Full Code

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "dialpad"
    # Optional: Restrict crawling to your target sites (prevents wandering to unrelated pages)
    allowed_domains = [
        "help.dialpad.com",
        "domo.com",
        "zenreach.com",
        "trendkite.com",
        "peloton.com",
        "ting.com",
        "cedar.com",
        "tophat.com",
        "bambora.com",
        "hoteltonight.com"
    ]

    def start_requests(self):
        urls = [
            'https://help.dialpad.com/hc/en-us/categories/201278063-User-Support',
            'https://www.domo.com/',
            'https://www.zenreach.com/',
            'https://www.trendkite.com/',
            'https://peloton.com/',
            'https://ting.com/',
            'https://www.cedar.com/',
            'https://tophat.com/',
            'https://www.bambora.com/en/ca/',
            'https://www.hoteltonight.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Grab every href attribute from <a> tags on the page
        raw_links = response.css('a::attr(href)').getall()
        
        # Convert relative paths to full absolute URLs
        for link in raw_links:
            absolute_link = response.urljoin(link)
            # Output the link as a structured item (Scrapy will save this in your chosen format)
            yield {
                'extracted_link': absolute_link
            }
            
            # Optional: Uncomment below to recursively crawl each extracted link
            # yield scrapy.Request(url=absolute_link, callback=self.parse)

What Changed & Why

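  1. Added allowed_domains: This optional setting tells Scrapy to drop requests to any domain that is not on the list (subdomains of listed domains are still allowed), which matters once you start following extracted links. It does not enforce a site's crawling rules; robots.txt handling is controlled separately by the ROBOTSTXT_OBEY setting.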
  2. Revised parse Method:
    • response.css('a::attr(href)').getall(): Uses a CSS selector to fetch every href value from <a> tags. getall() returns a list of all matches.
    • response.urljoin(link): Resolves relative URLs against the page's own URL. For example, if the page is https://peloton.com/ and a link is /shop, this turns it into https://peloton.com/shop. Links that are already absolute are returned unchanged.
    • Yielding Items: The spider outputs each link as a dictionary, which you can export to JSON, CSV, or other formats with a simple command (like scrapy crawl dialpad -o links.json).
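  3. Recursive Crawling Option: If you want to follow every extracted link and crawl those pages too, uncomment the yield scrapy.Request(...) line. Scrapy's default duplicate filter skips requests it has already seen, so the same URL is not fetched twice in one run. The yielded items are not deduplicated, however, so a link that appears on several pages shows up in the output once per page.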
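The href values on a real page usually include entries that are not crawlable pages, such as mailto: or javascript: links, as well as duplicates that differ only by a #fragment. Below is a minimal sketch of a filter you could run before yielding items or following links. The helper name clean_links and the sample hrefs are hypothetical and not part of Scrapy; only the standard library is used.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def clean_links(base_url, hrefs):
    """Resolve hrefs against base_url, keep only http(s) URLs,
    strip #fragments, and drop duplicates while preserving order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        # Resolve relative paths against the page URL.
        absolute = urljoin(base_url, href.strip())
        # Remove the fragment so /shop and /shop#top count as one URL.
        absolute, _fragment = urldefrag(absolute)
        # Skip mailto:, javascript:, tel: and similar schemes.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute in seen:
            continue
        seen.add(absolute)
        cleaned.append(absolute)
    return cleaned


# Hypothetical href values, as response.css('a::attr(href)').getall() might return them.
sample_hrefs = [
    "/shop",
    "/shop#top",
    "mailto:help@example.com",
    "javascript:void(0)",
    "https://www.peloton.com/bikes",
]
links = clean_links("https://www.peloton.com/", sample_hrefs)
print(links)
```

Inside parse, this could be called as clean_links(response.url, response.css('a::attr(href)').getall()), looping over the result instead of raw_links. Note that it only removes duplicates within a single page.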

Quick Pro Tips

  • Always check a site's robots.txt file (e.g., https://www.domo.com/robots.txt) to make sure you're allowed to crawl it.
  • Add a DOWNLOAD_DELAY in your Scrapy settings to avoid overwhelming target servers with too many requests at once.
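Both tips can be expressed as Scrapy settings. The sketch below collects them in a dictionary; the setting names are standard Scrapy settings, but the values and the name POLITE_SETTINGS are assumptions chosen for illustration and should be tuned for the sites you crawl.

```python
# Hypothetical values; adjust them to the sites you are crawling.
POLITE_SETTINGS = {
    # Have Scrapy fetch robots.txt and skip disallowed URLs.
    "ROBOTSTXT_OBEY": True,
    # Wait about one second between requests to the same site.
    "DOWNLOAD_DELAY": 1.0,
    # Let Scrapy adapt the delay to how quickly the server responds.
    "AUTOTHROTTLE_ENABLED": True,
    # Limit parallel requests per domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
}

print(POLITE_SETTINGS)
```

To apply these to a single spider, assign the dictionary to the class attribute custom_settings (for example, custom_settings = POLITE_SETTINGS inside QuotesSpider), or copy the keys into the project's settings.py.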

The question originally comes from Stack Exchange; it was asked by Pranav Barot.
