Building a Scraper with BeautifulSoup4: How Do You Extract the href Attribute from a Product Title Class (prod_name)?

阿华AIGC实验室

2026-4-30

Fixing product link extraction in a BS4 scraper

Hey, as someone new to scraping, the problem you've hit is a very common BS4 pitfall. Let's go through it and fix it step by step:

First, the core problems in your code

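Broken chained call: soup.find_all(...) returns a ResultSet (a list-like collection of tags), so you can't chain find_all("a") directly onto it. You have to loop over the elements in the collection one by one.
Wrong selection strategy: on these shops the product link is usually the <a> tag that carries the prod_name class itself, so there's no need to collect a pile of tags first and then dig for <a> inside them. Targeting the <a> tag with that class directly is more efficient.
Misuse of get('href'): this method only works on a single Tag object, not on a collection of tags, so get hold of an individual tag first and then call it.
Small typo: in your webshop_dict, the prod_name for pcshop is written as prdocutname; it should be productname. With the typo, no matching elements are found at all.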
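The first and third points can be reproduced without touching the network. The snippet below is a minimal sketch: the HTML string and its hrefs are made up for illustration and do not come from any of the shops, and html.parser is used only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a search results page.
SAMPLE_HTML = """
<div class="item"><a class="product-title" href="/laptop-1">Laptop 1</a></div>
<div class="item"><a class="product-title" href="https://example.com/laptop-2">Laptop 2</a></div>
<div class="item"><a class="product-title">No link here</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a ResultSet (a list of Tag objects), not a single Tag.
tags = soup.find_all("a", class_="product-title")

# Calling a Tag method on the whole ResultSet raises AttributeError.
try:
    tags.find_all("a")
    chained_call_failed = False
except AttributeError:
    chained_call_failed = True

# Iterate instead and call get('href') on each individual Tag.
# get() returns None when the attribute is missing, so skip those tags.
links = [tag.get("href") for tag in tags if tag.get("href")]

print(chained_call_failed)
print(links)
```

The third <a> in the sample has no href, which is why the result is filtered before being collected.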

The corrected, complete code

from urllib.parse import quote_plus, urljoin

import requests
from bs4 import BeautifulSoup

item = input("Insert the name of the item you are searching for: ")
webshop_dict = [
    {'url': 'https://h2-shop.com/filterSearch?advs=true&cid=0&mid=0&vid=0&q=', 'prod_name': 'product-title'},
    {'url': 'https://www.instar-informatika.hr/search.asp?upit=', 'prod_name': 'name'},
    {'url': 'https://www.links.hr/hr/search?q=', 'prod_name': 'product-title'},
    {'url': 'https://www.mall.hr/trazenje?s=', 'prod_name': 'product-box-category__title'},
    {'url': 'https://mi.hr/filterSearch?q=', 'prod_name': 'product-title'},
    {'url': 'https://www.pcshop.hr/search.asp?upit=', 'prod_name': 'productname'}  # typo fixed
]

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}

# Store each site's results in a dict so they are easy to process later
shop_links = {}

for shop in webshop_dict:
    shop_domain = shop['url'].split('/')[2]
    print(f"\n=== Fetching product links from {shop_domain} ===")
    try:
        # URL-encode the search term so spaces and characters like & are safe
        page = requests.get(url=shop['url'] + quote_plus(item), headers=headers, timeout=10)
        page.raise_for_status()  # raise on HTTP errors such as 404 or 503
        soup = BeautifulSoup(page.content, 'lxml')

        # Target the <a> tags that carry the product-title class directly
        product_a_tags = soup.find_all('a', class_=shop['prod_name'])

        shop_links[shop_domain] = []
        for tag in product_a_tags:
            href = tag.get('href')
            if href:
                # Resolve relative links against the URL that was actually fetched
                href = urljoin(page.url, href)
                shop_links[shop_domain].append(href)
                print(href)

    except requests.RequestException as e:
        print(f"Error while accessing {shop_domain}: {e}")

# Finally, print the collected results
print("\n=== Product links from all sites ===")
print(shop_links)
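To see how relative hrefs turn into full URLs, here is a minimal sketch using urljoin from the standard library's urllib.parse. The page URL reuses the links.hr search address from the list above with a sample query, and the four hrefs are invented examples, not values taken from the real site. The helper name absolutize is hypothetical.

```python
from urllib.parse import urljoin

# Search page URL with a sample query; the hrefs below are made-up examples.
PAGE_URL = "https://www.links.hr/hr/search?q=laptop"


def absolutize(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


examples = [
    "/hr/laptop-1",                  # root-relative path
    "laptop-2",                      # path relative to the current directory
    "//cdn.example.com/laptop-3",    # protocol-relative URL
    "https://example.com/laptop-4",  # already absolute, returned unchanged
]

resolved = [absolutize(PAGE_URL, href) for href in examples]
for href, full in zip(examples, resolved):
    print(f"{href!r} -> {full}")
```

Simply prefixing the domain would turn "laptop-2" into "https://www.links.hrlaptop-2" and would mangle the protocol-relative case as well, which is why resolving against the page URL is the safer choice.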