Scrapy CSV export format problems: help request and learning resources
Hey there, let's work through your Scrapy CSV export issues together. I see two main problems to fix, and I'll walk you through each one with code adjustments.
Problem 1: Only the Part field has data, while Quantity and Price are empty
Root Causes
- Incorrect loop structure: Your code targets `tbody#lnkPart` inside the product table, but in most cases a table has only one `<tbody>` container, and each product is represented by a `<tr>` (table row) inside that `<tbody>`. Your current loop isn't actually iterating over individual product rows.
- Misaligned CSS selectors: The selectors for `Quantity` and `Price` might not match the actual page structure (e.g., the `span.desktop` class might not exist, or the text is directly inside the `<td>` instead of a child span).
- Using `extract()` instead of `get()`: `extract()` returns a list of all matches (an empty list when nothing matches), which can show up as a blank cell in the CSV. `get()` returns the first match as a single string, or a default value when nothing matches.
Fixed Spider Code
    import scrapy


    class DigiSpider(scrapy.Spider):  # class name is an assumption; keep your own
        name = 'digi'
        allowed_domains = ['digikey.com']
        start_urls = ['https://www.digikey.com/products/en/integrated-circuits-ics/memory/774?FV=-1%7C428%2C-8%7C774%2C7%7C1&quantity=0&ColumnSort=0&page=1&k=cy621&pageSize=500&pkeyword=cy621']

        def parse(self, response):
            # Target all product rows directly in the table's tbody
            product_rows = response.css('table#productTable.productTable tbody tr')
            for row in product_rows:
                # Use get() with a default so every item has the same fields
                yield {
                    'Part': row.css('td.tr-mfgPartNumber span::text').get(default=''),
                    # Targets the td text directly, in case span.desktop doesn't exist
                    'Quantity': row.css('td.tr-minQty.ptable-param::text').get(default='').strip(),
                    'Price': row.css('td.tr-unitPrice.ptable-param span::text').get(default=''),
                }
Note: If the Quantity selector still doesn't work, try inspecting the page with browser dev tools to confirm the exact CSS path for the minimum quantity text.
Problem 2: Column headers repeat on every row
Root Causes
- Mismatched field names in config: Your `FEED_EXPORT_FIELDS` was commented out, and the field names you had listed (lowercase `parts`, `quantity`, `price`) didn't match the capitalized keys in your `yield` statement (`Part`, `Quantity`, `Price`). Scrapy can't lock in a consistent header when field names are inconsistent.
- Custom settings escape issue: The escaped double quotes in your spider's `custom_settings` could cause unexpected behavior. It is better to set the user agent in the project settings file.
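The effect of mismatched field names can be sketched with Python's standard `csv` module. This is not Scrapy's exporter, only an illustration under the assumption that a listed field missing from an item is written as an empty cell. The helper name `to_csv` and the sample values are hypothetical.

```python
import csv
import io


def to_csv(fieldnames, items):
    """Write items to CSV text with one header row and a fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        restval='',             # keys missing from an item become empty cells
        extrasaction='ignore',  # keys not listed in fieldnames are dropped
        lineterminator='\n',
    )
    writer.writeheader()
    writer.writerows(items)
    return buf.getvalue()


# Sample item with the same capitalized keys the spider yields.
item = {'Part': 'CY62128', 'Quantity': '1', 'Price': '2.50'}

# Lowercase names match none of the item's keys, so the data row is empty.
mismatched = to_csv(['parts', 'quantity', 'price'], [item])

# Names that match the yielded keys exactly carry the values through.
matched = to_csv(['Part', 'Quantity', 'Price'], [item])
```

With the lowercase names the header is written but every cell under it is blank; with matching names the row is filled in.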
Fixed Config File
    BOT_NAME = 'website1'

    SPIDER_MODULES = ['website1.spiders']
    NEWSPIDER_MODULE = 'website1.spiders'

    # Export as CSV Feed (match field names exactly to your yield keys)
    FEED_EXPORT_FIELDS = ['Part', 'Quantity', 'Price']
    FEED_FORMAT = 'csv'
    FEED_URI = 'parts.csv'

    # Crawl responsibly with a valid user agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
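If you are on Scrapy 2.1 or later, `FEED_FORMAT` and `FEED_URI` are deprecated in favor of the single `FEEDS` setting. A possible equivalent, assuming the same output file and field order as in the config above (the `overwrite` option needs Scrapy 2.4 or later):

```python
# settings.py (sketch): one FEEDS entry replaces FEED_FORMAT and FEED_URI
FEEDS = {
    'parts.csv': {
        'format': 'csv',
        # Column order and header names; must match the yielded keys exactly
        'fields': ['Part', 'Quantity', 'Price'],
        # Start a fresh file on each run instead of appending to the old one
        'overwrite': True,
    },
}
```

As far as I know, a local output file is appended to by default, so repeated runs against the same `parts.csv` are another way extra header rows can end up in the file. It may be worth deleting the old file or enabling `overwrite` while you debug.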
Additional Tips for Smooth Scraping
- Debug selectors with Scrapy Shell: Run `scrapy shell "your-target-url"` to test CSS selectors in real time. This helps you validate whether your selectors actually match elements.
- Ensure consistent fields: Always include all three fields in every `yield` statement (even with empty values), so Scrapy knows to maintain a single header row.
- Learning Resources:
  - The Feed exports chapter of the official Scrapy documentation: covers the CSV export settings and best practices in detail
  - The Scrapy selectors guide: teaches the CSS and XPath syntax for pinpointing page elements
The question in this post comes from Stack Exchange; it was asked by Jtenc.




