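Scrapy image pipeline: how do I rename images to the scraped title?
How can I use the title field scraped by Scrapy to rename the saved images?
Happy to help with this! To get Scrapy to save each image under its scraped title, a few key changes to the existing code are needed. The core of it is a custom image pipeline that handles the renaming. The steps are as follows:
1. Complete the Item definition (items.py)
First, add a title field to the Item class so the title can be passed along to the pipeline: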
    import scrapy


    class ImagetofilesystemcheckItem(scrapy.Item):
        image_urls = scrapy.Field()
        images = scrapy.Field()
        title = scrapy.Field()  # new field that stores the title
        url = scrapy.Field()    # keep the url field if you need it
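2. Adjust the spider (spider1.py)
The original code returned a plain dict. Yield an Item instance instead, so the title field reaches the pipeline together with image_urls. Also call strip() on the title to remove surrounding whitespace, which would otherwise end up in the file name: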
    import scrapy

    from imageToFileSystemCheck.items import ImagetofilesystemcheckItem


    class TestSpider(scrapy.Spider):
        name = 'imagecheck'

        def start_requests(self):
            searchterms = ['keyword1', 'keyword2']
            for term in searchterms:
                yield scrapy.Request(
                    'http://www.example.com/s?=%s' % term,
                    callback=self.parse,
                    meta={'item': term},
                )

        def parse(self, response):
            term = response.meta.get('item')
            # Follow only the first two result links; slicing avoids an
            # IndexError when the page has fewer than two matches.
            links = response.css("div.tt a.chek::attr(href)").extract()[:2]
            for url in links:
                self.logger.debug('Following %s', url)
                yield scrapy.Request(url=url, callback=self.parse_info, meta={'item': term})

        def parse_info(self, response):
            # default='' keeps a missing title from becoming the string "None";
            # strip() removes surrounding whitespace.
            title = response.xpath('//*[@id="Title"]/text()').extract_first(default='').strip()
            img_url = response.xpath("//img[@id='images']/@src").extract_first()

            # Build the Item and fill in its fields.
            item = ImagetofilesystemcheckItem()
            item['url'] = response.url
            item['title'] = title
            # Leave image_urls empty when the page has no matching <img>.
            item['image_urls'] = [img_url] if img_url else []
            yield item
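3. Write a custom image pipeline (pipelines.py)
This is the most important step. Subclass Scrapy's built-in ImagesPipeline and override the logic that decides where each image is saved, using the title as the file name: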
    from scrapy.exceptions import DropItem
    from scrapy.http import Request
    from scrapy.pipelines.images import ImagesPipeline


    class CustomImagePipeline(ImagesPipeline):

        def get_media_requests(self, item, info):
            # Pass the title along in the request meta so file_path can use it.
            for image_url in item['image_urls']:
                yield Request(image_url, meta={'title': item['title']})

        def file_path(self, request, response=None, info=None, *, item=None):
            # Remove characters that are illegal in file names (/, \, : and so on)
            # so that saving does not fail.
            title = request.meta['title']
            safe_title = "".join(c for c in title if c not in r'\/:*?"<>|')
            # Fall back to a default name if nothing is left of the title.
            safe_title = safe_title or "default_image"
            # Final file name: <safe_title>.jpg
            return f"{safe_title}.jpg"

        def item_completed(self, results, item, info):
            # Drop the Item if none of its images downloaded successfully.
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no valid images")
            item['images'] = image_paths
            return item
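The character filtering inside file_path can also be pulled out into a small standalone helper, which makes it easy to check without running a crawl. The following is only a minimal sketch: the name safe_filename, the default_image fallback, and the max_length cap of 100 characters are assumptions made for illustration and are not part of Scrapy or of the original code.

```python
import re

# Characters that are not allowed in Windows file names ("/" is also
# the path separator on POSIX systems).
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, default="default_image", max_length=100):
    """Turn a scraped title into a file-system-friendly base name.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Treat None like an empty title, then drop the illegal characters.
    cleaned = _ILLEGAL_CHARS.sub("", title or "")
    # Collapse runs of whitespace (including newlines) into single spaces.
    cleaned = " ".join(cleaned.split())
    # Cap the length and fall back to the default when nothing is left.
    return cleaned[:max_length].strip() or default


print(safe_filename('  My: "Great" Photo?  '))  # My Great Photo
print(safe_filename(None))                      # default_image
```

Inside file_path, the return value could then be built as f"{safe_filename(request.meta['title'])}.jpg". Note that two Items with the same title would still map to the same file name, so the later download overwrites the earlier one.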
4. Update the settings (settings.py)
Replace the default image pipeline with the custom one:
    BOT_NAME = 'imageToFileSystemCheck'

    SPIDER_MODULES = ['imageToFileSystemCheck.spiders']
    NEWSPIDER_MODULE = 'imageToFileSystemCheck.spiders'

    # Use the custom pipeline in place of the default one.
    ITEM_PIPELINES = {'imageToFileSystemCheck.pipelines.CustomImagePipeline': 1}
    IMAGES_STORE = '/home/imageToFileSystemCheck/images/'

    ROBOTSTXT_OBEY = True
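Additional notes:
- Make sure title is not empty: add a check in the spider, and if the title is empty either skip that Item or give it a default name (for example default_image.jpg).
- Permissions: make sure the IMAGES_STORE path exists and is writable, otherwise Scrapy will report an error.
- Special characters: the code above already does basic filtering of illegal characters; adjust the filter rules to suit your needs.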
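For the permissions point, one option is to check the storage directory up front, before starting a crawl. This is a rough sketch using only the standard library. The function name ensure_writable_dir is hypothetical, and the demonstration runs against a temporary directory instead of the real IMAGES_STORE path.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create `path` if needed and confirm the current user can write to it.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # exist_ok=True makes this a no-op when the directory already exists.
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"No write permission for {path}")
    return path


# Demonstration against a temporary directory rather than IMAGES_STORE.
with tempfile.TemporaryDirectory() as tmp:
    store = ensure_writable_dir(os.path.join(tmp, "images"))
    print(os.path.isdir(store))
```

In a real project the same function could be called with the IMAGES_STORE value, for example from a small script that is run before scrapy crawl imagecheck.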
The question comes from Stack Exchange and was asked by Sagar Singh Verma.




