Python scraper writes only one product to the file: diagnosing and fixing the loop
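My Python scraper reads and writes the name and price of only one product instead of all of them. len(containers) returns 20, so the tag selection looks correct, but when I extract the content with p.text and strong.text and write it to the file, I get only a single row. I suspect the loop or the selectors are wrong. Here is the script: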
    from urllib.request import urlopen as Ureq
    from bs4 import BeautifulSoup as soup

    my_url = 'https://laptopparts.ca/collections/types?q=Accessories'
    Uclient = Ureq(my_url)
    page_html = Uclient.read()
    Uclient.close()

    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("div",{"class":"grid__item large--one-quarter medium-down--one-half"})

    filename = "products.csv"
    f = open(filename, "w")
    headers = "Title, Price\n"
    f.write(headers)

    for container in containers:
        title_container = container.findAll("p")
        product_name = title_container[0].text
        price_container = container.findAll("strong")
        price = price_container[0].text
        print("product_name " + product_name)
        print("price " + price)
        f.write(product_name.replace(",","|") + "," + price + "\n")
        f.close()
Root cause
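The problem is easy to spot: f.close() sits inside the for loop. The file is closed as soon as the first product has been processed, so the remaining 19 rows never reach it. On the second iteration, f.write() is called on a closed file and raises ValueError (I/O operation on closed file), which stops the script. len(containers) is indeed 20, so the selector is fine; the loop simply cannot write anything after its first pass, and you end up with a single row.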
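The failure mode can be reproduced without any scraping. The sketch below is only an illustration: the temporary file and the sample rows are made up, and the try/except is there purely to record what happens on each iteration instead of letting the script stop.

```python
import os
import tempfile

# Made-up sample rows standing in for the scraped products
rows = ["Mouse,19.99", "Keyboard,49.99", "Webcam,39.99"]

# Temporary file standing in for products.csv
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

outcomes = []
f = open(path, "w")
for row in rows:
    try:
        f.write(row + "\n")
        outcomes.append("written")
    except ValueError as exc:
        # Every write after the first lands here: the file is already closed
        outcomes.append(f"failed: {exc}")
    f.close()  # the bug being reproduced: closing inside the loop

# Read the file back to see what was actually saved
with open(path) as check:
    saved = check.read().splitlines()
os.remove(path)

print(outcomes)
print(saved)
```

Only the first row ends up in the file. Without the try/except, the script would stop with a traceback at the second write, which matches the symptom of a single row in products.csv.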
Fixed code (with improvements)
Besides fixing the core problem, I also tidied up a few details to make the code more robust:
    from urllib.request import urlopen as Ureq
    from bs4 import BeautifulSoup as soup

    my_url = 'https://laptopparts.ca/collections/types?q=Accessories'
    Uclient = Ureq(my_url)
    page_html = Uclient.read()
    Uclient.close()

    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("div",{"class":"grid__item large--one-quarter medium-down--one-half"})

    filename = "products.csv"

    # The with statement manages the file's lifetime; no manual close() is needed
    with open(filename, "w") as f:
        headers = "Title, Price\n"
        f.write(headers)

        for container in containers:
            # Extract the product name and strip surrounding whitespace
            title_container = container.findAll("p")
            product_name = title_container[0].text.strip().replace(",", "|") if title_container else "No Title"

            # Extract and clean the price, falling back when no <strong> tag is found
            price_container = container.findAll("strong")
            price = price_container[0].text.strip() if price_container else "No Price"

            print(f"product_name: {product_name}")
            print(f"price: {price}")
            f.write(f"{product_name},{price}\n")
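As a possible alternative to replacing commas with "|", Python's standard csv module quotes any field that contains a comma. This is a swapped-in technique, not part of the fix above. The sketch is minimal: the rows are made up, and an in-memory io.StringIO buffer stands in for the real products.csv file.

```python
import csv
import io

# Made-up rows; the first title contains a comma on purpose
rows = [("USB-C Hub, 7-in-1", "$34.99"), ("Laptop Sleeve", "$19.99")]

# In-memory stand-in for open("products.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Title", "Price"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original title, comma included
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed)
```

With a real file, the csv documentation recommends opening it with newline="" so the writer controls the line endings itself.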
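Key improvements
- File closing moved out of the loop: the with statement closes the file automatically when its block ends, which rules out closing it inside the loop.
- Empty-result checks: these prevent an IndexError when a product has no p or strong tag.
- Text cleanup: strip() removes leading and trailing newlines and spaces from the extracted text, so the CSV file is tidier.
- Formatted strings: f-strings make the code easier to read.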
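The first three points can be tried in isolation, without a network request. In this sketch the helper first_text is hypothetical, the sample lists are made-up stand-ins for the .text values that findAll("p") and findAll("strong") would yield, and a temporary file replaces products.csv.

```python
import os
import tempfile


def first_text(texts, default):
    """Return the first string with surrounding whitespace removed,
    or the default when the list is empty (no matching tag was found)."""
    return texts[0].strip() if texts else default


# Made-up stand-ins for the text of the <p> and <strong> tags of three products
samples = [
    (["\n  Laptop Stand \n"], ["  $29.99\n"]),
    ([], ["$9.99"]),        # product without a <p> tag
    (["Cooling Pad"], []),  # product without a <strong> tag
]

fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

with open(path, "w") as f:
    for titles, prices in samples:
        name = first_text(titles, "No Title").replace(",", "|")
        price = first_text(prices, "No Price")
        f.write(f"{name},{price}\n")

# The with block has ended, so the file is already closed here
file_closed = f.closed

with open(path) as check:
    lines = check.read().splitlines()
os.remove(path)

print(file_closed)
print(lines)
```

Every sample produces a row, including the two with a missing tag, and the file is closed exactly once, after the loop has finished.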
With these changes, the scraper should write all 20 products to the file.
The question originally comes from Stack Exchange and was asked by Abdullah Virk.




