Scrapy CSV export format problems: request for help and learning resources

Hey there, let's work through your Scrapy CSV export issues together. I see two main problems to fix, and I'll walk you through each one with code adjustments.

Problem 1: Only the Part field has data, while Quantity and Price are empty

Root Causes

  1. Incorrect loop structure: Your code targets <tbody#lnkPart> inside the product table, but in most cases, a table only has one <tbody> container, and each product is represented by a <tr> (table row) inside that <tbody>. Your current loop isn't actually iterating over individual product rows.
  2. Misaligned CSS selectors: The selectors for Quantity and Price might not match the actual page structure (e.g., the span.desktop class might not exist, or the text is directly inside the <td> instead of a child span).
  3. Using extract() instead of get(): extract() returns a list of every match (an empty list when nothing matches), so a field can end up blank or hold several values joined together in the CSV. get() returns a single string, or the default value you pass when nothing matches.

Fixed Spider Code

import scrapy


class DigiSpider(scrapy.Spider):  # class name is a placeholder; keep your own
    name = 'digi'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/integrated-circuits-ics/memory/774?FV=-1%7C428%2C-8%7C774%2C7%7C1&quantity=0&ColumnSort=0&page=1&k=cy621&pageSize=500&pkeyword=cy621']

    def parse(self, response):
        # Target all product rows directly in the table's tbody
        product_rows = response.css('table#productTable.productTable tbody tr')
        for row in product_rows:
            # Use get() with a default so every field is always a string
            yield {
                'Part': row.css('td.tr-mfgPartNumber span::text').get(default='').strip(),
                # Read the td text directly in case span.desktop doesn't exist
                'Quantity': row.css('td.tr-minQty.ptable-param::text').get(default='').strip(),
                'Price': row.css('td.tr-unitPrice.ptable-param span::text').get(default='').strip(),
            }

Note: If the Quantity selector still doesn't work, try inspecting the page with browser dev tools to confirm the exact CSS path for the minimum quantity text.

Problem 2: Column headers repeat on every row

Root Causes

  1. Mismatched field names in config: Your FEED_EXPORT_FIELDS was commented out, and the field names you had listed (lowercase parts, quantity, price) didn't match the capitalized keys in your yield statement (Part, Quantity, Price). The export field names must match the item keys exactly, otherwise Scrapy can't map the values to a stable set of columns.
  2. Custom settings escape issue: The escaped double quotes in your spider's custom_settings could cause unexpected behavior. It is simpler to set the user agent in the main config file.
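The effect of mismatched field names can be illustrated with Python's standard csv module. This is only an analogy for what a CSV exporter does with a fixed field list, not Scrapy's own exporter code, and the sample items are made up.

```python
import csv
import io

# Made-up items using the same capitalized keys as the spider's yield.
items = [
    {"Part": "CY621-A", "Quantity": "1", "Price": "2.50"},
    {"Part": "CY621-B", "Quantity": "", "Price": ""},
]


def to_csv(rows, fieldnames):
    """Write rows with a fixed header; unknown keys are ignored, missing ones left blank."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        restval="",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()  # the header is written exactly once
    writer.writerows(rows)
    return buf.getvalue()


# Lowercase names that match no item key: every data cell comes out empty.
mismatched = to_csv(items, ["parts", "quantity", "price"])

# Names that match the item keys exactly: the values land in their columns.
matched = to_csv(items, ["Part", "Quantity", "Price"])

print(mismatched)
print(matched)
```

With the matching names the output has a single header row followed by one line per item, which is the shape you want from the export.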

Fixed Config File

BOT_NAME = 'website1'
SPIDER_MODULES = ['website1.spiders']
NEWSPIDER_MODULE = 'website1.spiders'

# Export as CSV Feed (match field names exactly to your yield keys)
FEED_EXPORT_FIELDS = ['Part', 'Quantity', 'Price']
FEED_FORMAT = 'csv'
FEED_URI = 'parts.csv'

# Crawl responsibly with a valid user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
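
On Scrapy 2.1 and later, FEED_FORMAT and FEED_URI are deprecated in favor of a single FEEDS setting. Assuming you are on such a version, a rough equivalent of the export lines above might look like this (the file name and field list are carried over from the config):

```python
# settings.py fragment, assuming Scrapy 2.1 or newer
FEEDS = {
    "parts.csv": {
        "format": "csv",
        # Must match the keys of the dicts yielded by the spider
        "fields": ["Part", "Quantity", "Price"],
    },
}
```

If you stay on an older version, the FEED_FORMAT and FEED_URI lines continue to apply.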

Additional Tips for Smooth Scraping

  • Debug selectors with Scrapy Shell: Run scrapy shell "your-target-url" to test CSS selectors interactively and confirm that they actually match elements on the page.
  • Ensure consistent fields: Always include all three fields in every yield statement (even with empty values), so Scrapy knows to maintain a single header row.
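  • Learning Resources:
    • The Feed exports chapter of the official Scrapy documentation: covers the CSV export settings and best practices in detail
    • The Scrapy selectors guide: covers the CSS and XPath syntax for locating page elements precisely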

The question in this post comes from Stack Exchange; asked by Jtenc.