
Help request: AttributeError (NoneType has no attribute 'a') when batch-scraping IMDb IDs of Bollywood movies with Python

Fixing the NoneType AttributeError when batch-scraping IMDb IDs of Bollywood movies

Hey there, let's fix that annoying AttributeError you're hitting when scraping multiple pages of Bollywood movies!

Error cause analysis

The error AttributeError: 'NoneType' object has no attribute 'a' occurs because some movie entries on IMDb don't contain the rating-cancel span element your code tries to access. For those entries, store.find('span','rating-cancel') returns None, and accessing .a on None raises the error. On a single page you can get lucky and find the element in every entry, but once you scale to multiple pages you will almost certainly hit an entry without it.
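The failure and the guard can be reproduced without touching the network. The sketch below is only an illustration: SAMPLE_HTML is made-up, simplified markup, extract_imdb_id is a hypothetical helper name, and the fallback to the title link in h3 assumes that link carries the same tt-style ID.

```python
import re
from bs4 import BeautifulSoup

# Made-up, simplified markup: the second entry has no rating-cancel span.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3><a href="/title/tt0000001/">First Film</a></h3>
  <span class="rating-cancel"><a href="/title/tt0000001/vote"></a></span>
</div>
<div class="lister-item mode-advanced">
  <h3><a href="/title/tt0000002/">Second Film</a></h3>
</div>
"""

TITLE_ID = re.compile(r"/title/(tt\d+)")


def extract_imdb_id(store):
    """Return the tt-style ID for one entry, or 'N/A' if no link carries it."""
    span = store.find("span", "rating-cancel")
    # find() returns None when nothing matches, so guard before touching .a
    link = span.a if span is not None else None
    if link is None and store.h3 is not None:
        # Assumption: the title link in <h3> holds the same ID
        link = store.h3.a
    href = link.get("href", "") if link is not None else ""
    match = TITLE_ID.search(href)
    return match.group(1) if match else "N/A"


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
entries = soup.find_all("div", class_="lister-item mode-advanced")

# The unguarded access fails on the entry without the span
try:
    entries[1].find("span", "rating-cancel").a
except AttributeError as exc:
    print("unguarded access failed:", exc)

ids = [extract_imdb_id(entry) for entry in entries]
print(ids)
```

With a fallback like this, an entry that lacks the span can still yield an ID instead of N/A, as long as the title link assumption holds for the page you are scraping.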

Fix and optimized code

We'll fix the IMDb ID extraction logic first, then add proper multi-page scraping functionality and improve overall code robustness:

# Import the required libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

# Declare empty lists to store the scraped data
movie_name = []
year = []
time = []
rating = []
votes = []
description = []
director_s = []
starList = []
imdb_id = []

# Send a browser-like User-Agent header to reduce the chance of IMDb blocking the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

# Scrape several pages in one run (the first 3 here; raise the range upper bound for more)
for page_num in range(1, 4):
    start = (page_num - 1) * 50 + 1  # The listing shows 50 entries per page, so start goes 1, 51, 101, ...
    url = f"https://www.imdb.com/search/title/?title_type=feature&primary_language=hi&sort=num_votes,desc&start={start}&ref_=adv_nxt"
    
    # Wait a random 2-5 seconds before each request so a high request rate does not get the IP blocked
    sleep(randint(2, 5))
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    movie_data = soup.find_all('div', attrs={'class': 'lister-item mode-advanced'})
    
    for store in movie_data:
        # Movie title
        name = store.h3.a.text
        movie_name.append(name)
        
        # Release year (falls back to N/A when missing)
        year_elem = store.h3.find('span', class_="lister-item-year text-muted unbold")
        year_of_release = year_elem.text if year_elem else "N/A"
        year.append(year_of_release)
        
        # Runtime (falls back to N/A when missing)
        runtime = store.p.find("span", class_='runtime').text if store.p.find("span", class_='runtime') else "N/A"
        time.append(runtime)
        
        # IMDb rating (falls back to N/A when missing)
        rate = store.find('div', class_="inline-block ratings-imdb-rating").text.replace('\n', '') if store.find('div', class_="inline-block ratings-imdb-rating") else "N/A"
        rating.append(rate)
        
        # Vote count (falls back to N/A when missing)
        value = store.find_all('span', attrs={'name': "nv"})
        vote = value[0].text if value else "N/A"
        votes.append(vote)
        
        # Movie description (falls back to N/A when missing)
        describe = store.find_all('p', class_='text-muted')
        description_ = describe[1].text.replace('\n', '') if len(describe) > 1 else 'N/A'
        description.append(description_)
        
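        # Director (falls back to N/A when missing)
        director = 'N/A'
        ps = store.find_all('p')
        for p in ps:
            link = p.find('a')
            # Guard against a paragraph that mentions a director but contains no link
            if 'Director' in p.text and link is not None:
                director = link.text
                break  # Stop at the first director found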
        director_s.append(director)
        
        # IMDb ID (the core fix: check for None before reading .a and its href)
        rating_cancel_elem = store.find('span', 'rating-cancel')
        id_link = rating_cancel_elem.a if rating_cancel_elem else None
        imdbID = id_link['href'].split('/')[2] if id_link else 'N/A'
        imdb_id.append(imdbID)
        
        # Cast (falls back to N/A when missing)
        star_elem = store.find("p", attrs={"class":""})
        star = star_elem.text.replace("Stars:", "").replace("\n", "").replace("Director:", "").strip() if star_elem else 'N/A'
        starList.append(star)

# Convert the collected data to a DataFrame for further processing
movie_df = pd.DataFrame({
    'Movie Name': movie_name,
    'Year': year,
    'Runtime': time,
    'IMDb Rating': rating,
    'Votes': votes,
    'Description': description,
    'Director': director_s,
    'IMDb ID': imdb_id,
    'Stars': starList
})

print(movie_df.head())

Additional optimization notes

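  • Anti-scraping mitigation: a browser-like User-Agent header, combined with a random delay between requests, lowers the risk of IMDb blocking your IP.
  • Null handling across fields: the optional fields (runtime, rating, votes, description, director, IMDb ID, stars) are checked for missing values and fall back to N/A instead of an empty string, which keeps the data regular.
  • Pagination: the start parameter is computed from the page number according to IMDb's paging rule, so any number of pages can be scraped in one run.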
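The pagination rule can be checked on its own before any request is sent. This is a small sketch under the same assumption the answer makes (50 entries per page, start values 1, 51, 101); page_start_offsets and build_page_urls are hypothetical helper names, and BASE_URL is the search URL from the code above with the start value left as a placeholder.

```python
# Search URL from the answer, with the start value left as a placeholder
BASE_URL = (
    "https://www.imdb.com/search/title/"
    "?title_type=feature&primary_language=hi"
    "&sort=num_votes,desc&start={start}&ref_=adv_nxt"
)


def page_start_offsets(num_pages, page_size=50):
    """Return the 1-based start value of each page: 1, 51, 101, ..."""
    return [(page - 1) * page_size + 1 for page in range(1, num_pages + 1)]


def build_page_urls(num_pages, page_size=50):
    """Return one search URL per page, in page order."""
    return [
        BASE_URL.format(start=start)
        for start in page_start_offsets(num_pages, page_size)
    ]


offsets = page_start_offsets(3)
print(offsets)  # [1, 51, 101]

for url in build_page_urls(3):
    print(url)
```

Building the URL list up front makes it easy to inspect exactly which pages will be requested, and to change the page count in one place.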

The question comes from Stack Exchange; it was asked by Mustafa Anandwala.
