Scrapy Spider Development: How to Remove the /shop/#ticket Suffix from URLs

阿华AIGC实验室

2026-5-28

Fixing the URL Suffix Problem in a Scrapy Spider

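Hey, good to see you up and running with Scrapy on Festicket! Some of the URLs you collect end in /shop/#ticket, and those pages can't be scraped properly. A small cleanup step on each URL before the request is generated removes the extra suffix.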

The Approach

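After building the absolute URL, check whether it contains the target suffix. If it does, drop the suffix and keep the valid part in front of it. Python's str.split() method is all this takes.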
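
As a minimal sketch of that check outside Scrapy, the helper below applies the same split() logic to two sample URLs. The function name strip_suffix and the example-festival paths are hypothetical, made up for illustration; only the /shop/#ticket suffix comes from the question.

```python
def strip_suffix(url, suffix="/shop/#ticket"):
    """Return url cut off at the first occurrence of suffix, if present."""
    if suffix in url:
        # split() returns the pieces around the suffix; index 0 is the part before it
        return url.split(suffix)[0]
    return url


# Hypothetical sample URLs, not taken from the site
with_suffix = "https://www.festicket.com/festivals/example-festival/shop/#ticket"
without_suffix = "https://www.festicket.com/festivals/example-festival/"

print(strip_suffix(with_suffix))
print(strip_suffix(without_suffix))
```

URLs that do not contain the suffix pass through unchanged, so the helper is safe to call on every link.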

The Complete Modified Code

import scrapy

class AuthorsSpider(scrapy.Spider):
    name = "festicket"
    start_urls = ['https://www.festicket.com/festivals/']
    npages = 20
    # Add the paginated list URLs up front; f-strings keep the code concise
    for i in range(2, npages + 2):
        start_urls.append(f"https://www.festicket.com/festivals/?page={i}")

    # Parse a list page and follow each link to its detail page
    def parse(self, response):
        urls = response.xpath(
            "//h3[@class='festival-title heading-3ry notranslate']//@href").extract()
        for url in urls:
            url = response.urljoin(url)
            # Key step: strip the /shop/#ticket suffix
            if '/shop/#ticket' in url:
                url = url.split('/shop/#ticket')[0]
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            'title': response.xpath("//h1[@class='sc-jzJRlG gbLQoU']/text()").extract_first(),
            'festival_url': response.xpath("//meta[@property='og:url']/@content").extract_first(),
            'location': response.xpath("//ul[contains(@class,'styles__StyledList')][1]/li[contains(@class,'styles__DotSeparatorSpan-h0jg7b')][1]/descendant::text()").extract_first(),
            'address': response.xpath("//div[@class='sc-gzVnrw bpJeJY'][2]/section[@class='sc-gZMcBi gDrvBk']/div/p[@class='sc-chPdSV hifsJb']/descendant::text()").extract_first(),
            'date': response.xpath("//ul[contains(@class,'styles__StyledList')][1]/li[contains(@class,'styles__DotSeparatorSpan-h0jg7b')][2]/descendant::text()").extract_first(),
            'genre1': response.xpath("//ul[contains(@class,'styles__StyledList')][2]/li[contains(@class,'styles__DotSeparatorSpan-h0jg7b')][1]/descendant::text()").extract_first(),
            'genre2': response.xpath("//ul[contains(@class,'styles__StyledList')][2]/li[contains(@class,'styles__DotSeparatorSpan-h0jg7b')][2]/descendant::text()").extract_first(),
            'genre3': response.xpath("//ul[contains(@class,'styles__StyledList')][2]/li[contains(@class,'styles__DotSeparatorSpan-h0jg7b')][3]/descendant::text()").extract_first(),
            'subtitle2': response.xpath(
                "//span[@class='styles__StyledHtmlWrapper-l0qhyk-0 cUaVYv sc-jAaTju jlDUtI']/p/descendant::text()").extract_first(),
            'subtitle1': response.xpath("//h2[@class='sc-cSHVUG gCeeYI']/descendant::text()").extract_first(),
            'para1': response.xpath("//span[@class='styles__StyledHtmlWrapper-s1eywhsl-0 cJBjEA sc-jAaTju jlDUtI']/p[1]/descendant::text()").extract_first(),
            'para2': response.xpath("//span[@class='styles__StyledHtmlWrapper-s1eywhsl-0 cJBjEA sc-jAaTju jlDUtI']/p[2]/descendant::text()").extract_first(),
            'para3': response.xpath("//span[@class='styles__StyledHtmlWrapper-s1eywhsl-0 cJBjEA sc-jAaTju jlDUtI']/p[3]/descendant::text()").extract_first(),
            'flyer': response.xpath("//img[contains(@class,'styles__Artwork')]/@src").extract_first(),
            'banner_image_1': response.xpath("//div[@class='styles__PhotoWrapper-s1brd5dy-2 cpnBtx'][1]/div[@class='styles__PhotoInnerWrapper-s1brd5dy-3 gVsbNY']/img[@class='styles__PhotoImage-s1brd5dy-4 cqQHmb']/@src").extract_first(),
            'banner_image_2': response.xpath("//div[@class='styles__PhotoWrapper-s1brd5dy-2 cpnBtx'][2]/div[@class='styles__PhotoInnerWrapper-s1brd5dy-3 gVsbNY']/img[@class='styles__PhotoImage-s1brd5dy-4 cqQHmb']/@src").extract_first(),
            'banner_image_3': response.xpath("//div[@class='styles__PhotoWrapper-s1brd5dy-2 cpnBtx'][3]/div[@class='styles__PhotoInnerWrapper-s1brd5dy-3 gVsbNY']/img[@class='styles__PhotoImage-s1brd5dy-4 cqQHmb']/@src").extract_first(),
        }
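
The page URLs built in the class body above can be checked in isolation. The sketch below reproduces the same range with a list comprehension style; build_start_urls, BASE_URL and NPAGES are hypothetical names introduced here, not part of the spider.

```python
BASE_URL = "https://www.festicket.com/festivals/"
NPAGES = 20


def build_start_urls(base_url=BASE_URL, npages=NPAGES):
    """Return the first list page plus the ?page=N URLs for N = 2 .. npages + 1."""
    urls = [base_url]
    # Same range as the spider: range(2, npages + 2)
    urls.extend(f"{base_url}?page={i}" for i in range(2, npages + 2))
    return urls


start_urls = build_start_urls()
print(len(start_urls))
print(start_urls[1])
print(start_urls[-1])
```

Note that with npages = 20 the range runs through page 21, so the spider requests 21 list pages in total.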

Key Changes Explained

I added a few lines of core handling code to the parse method:

# Key step: strip the /shop/#ticket suffix
if '/shop/#ticket' in url:
    url = url.split('/shop/#ticket')[0]

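This code checks every joined URL. If the URL contains /shop/#ticket, split() cuts it at that point and discards the suffix along with anything after it, keeping only the valid part in front, so the request reaches the right detail page. I also changed your original string concatenation to f-strings, which makes the code shorter and easier to read.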
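
If the plain string match ever feels too literal, one possible variation is to parse the URL with the standard library and drop the fragment and a trailing shop/ path segment separately. This is only a sketch of an alternative, not what the spider above does; normalize_detail_url and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_detail_url(url):
    """Drop the #fragment and a trailing 'shop/' path segment, if present."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/shop/"):
        # Remove only 'shop/', keeping the slash that precedes it
        path = path[: -len("shop/")]
    # Rebuild the URL with an empty fragment
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


# Hypothetical sample URL, not taken from the site
print(normalize_detail_url(
    "https://www.festicket.com/festivals/example-festival/shop/#ticket"))
```

Unlike the split() version, this keeps the trailing slash on the base URL and also handles a URL whose fragment is something other than #ticket.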

The question comes from Stack Exchange and was asked by Matt Helden.
