Help: AttributeError ('NoneType' object has no attribute 'a') when batch-scraping Bollywood movie IMDb IDs with Python
Hey there, let's fix that annoying AttributeError you're hitting when scraping multiple pages of Bollywood movies!
Why the error happens
The error AttributeError: 'NoneType' object has no attribute 'a' occurs because some movie entries on IMDb don't have the rating-cancel span element your code is trying to access. When you scrape a single page, you might get lucky and all entries have this element—but once you scale to multiple pages, you'll inevitably hit entries where store.find('span','rating-cancel') returns None, and trying to call .a on None throws the error.
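To see the failure mode and the guard in isolation, here's a minimal sketch. The HTML snippet is made up for illustration: the first entry has the rating-cancel span, the second (an unrated title) does not, mirroring what happens on later result pages.

```python
from bs4 import BeautifulSoup

# Hypothetical, pared-down HTML: one rated entry, one unrated entry
html = """
<div class="item"><span class="rating-cancel"><a href="/title/tt0000001/">X</a></span></div>
<div class="item"></div>
"""
soup = BeautifulSoup(html, 'html.parser')

ids = []
for store in soup.find_all('div', class_='item'):
    elem = store.find('span', 'rating-cancel')
    # Check for a missing element (and a missing <a>) before
    # dereferencing, instead of calling .a on None
    ids.append(elem.a['href'].split('/')[2] if elem and elem.a else 'N/A')

print(ids)  # ['tt0000001', 'N/A']
```

Without the `if elem and elem.a` guard, the second entry raises exactly the AttributeError you're seeing.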
The fix, plus optimized code
We'll fix the IMDb ID extraction logic first, then add proper multi-page scraping functionality and improve overall code robustness:
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

# Empty lists to collect the scraped fields
movie_name = []
year = []
time = []
rating = []
votes = []
description = []
director_s = []
starList = []
imdb_id = []

# Request headers: mimic a browser to avoid IMDb's anti-scraping block
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

# Scrape multiple pages (first 3 as an example; adjust the range as needed)
for page_num in range(1, 4):
    # IMDb shows 50 results per page, so start goes 1, 51, 101, ...
    start = (page_num - 1) * 50 + 1
    url = f"https://www.imdb.com/search/title/?title_type=feature&primary_language=hi&sort=num_votes,desc&start={start}&ref_=adv_nxt"

    # Random 2-5 second delay to avoid getting your IP banned
    sleep(randint(2, 5))
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    movie_data = soup.findAll('div', attrs={'class': 'lister-item mode-advanced'})

    for store in movie_data:
        # Movie title
        name = store.h3.a.text
        movie_name.append(name)

        # Release year
        year_of_release = store.h3.find('span', class_="lister-item-year text-muted unbold").text
        year.append(year_of_release)

        # Runtime (with None check)
        runtime_elem = store.p.find("span", class_='runtime')
        time.append(runtime_elem.text if runtime_elem else "N/A")

        # IMDb rating (with None check)
        rate_elem = store.find('div', class_="inline-block ratings-imdb-rating")
        rating.append(rate_elem.text.replace('\n', '') if rate_elem else "N/A")

        # Vote count (with None check)
        value = store.find_all('span', attrs={'name': "nv"})
        votes.append(value[0].text if value else "N/A")

        # Description (with None check)
        describe = store.find_all('p', class_='text-muted')
        description.append(describe[1].text.replace('\n', '') if len(describe) > 1 else 'N/A')

        # Director (with None check)
        director = 'N/A'
        for p in store.find_all('p'):
            if 'Director' in p.text:
                director = p.find('a').text
                break  # stop after the first director
        director_s.append(director)

        # IMDb ID (the core fix: check for a missing element before .a)
        rating_cancel_elem = store.find('span', 'rating-cancel')
        imdbID = rating_cancel_elem.a['href'].split('/')[2] if rating_cancel_elem and rating_cancel_elem.a else 'N/A'
        imdb_id.append(imdbID)

        # Stars (with None check)
        star_elem = store.find("p", attrs={"class": ""})
        star = star_elem.text.replace("Stars:", "").replace("\n", "").replace("Director:", "").strip() if star_elem else 'N/A'
        starList.append(star)

# Collect everything into a DataFrame for downstream processing
movie_df = pd.DataFrame({
    'Movie Name': movie_name,
    'Year': year,
    'Runtime': time,
    'IMDb Rating': rating,
    'Votes': votes,
    'Description': description,
    'Director': director_s,
    'IMDb ID': imdb_id,
    'Stars': starList
})
print(movie_df.head())
```
Additional improvements
- Anti-scraping: the `User-Agent` header mimics a real browser, and combined with random delays it lowers the risk of IMDb banning your IP.
- Global null handling: every field now checks for a missing element and falls back to `N/A` instead of an empty string, keeping the data consistent.
- Pagination: the `start` parameter is computed automatically from IMDb's pagination rule, so you can scrape any number of pages in bulk.
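The pagination arithmetic can be sketched on its own (the page range here is arbitrary):

```python
# IMDb's advanced search shows 50 results per page, so the start
# parameter for page N is (N - 1) * 50 + 1
def start_for_page(page_num: int) -> int:
    return (page_num - 1) * 50 + 1

starts = [start_for_page(p) for p in range(1, 4)]
print(starts)  # [1, 51, 101]
```

Plugging each value into the URL's `start={...}` slot walks through the result pages in order.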
This question comes from Stack Exchange; original asker: Mustafa Anandwala.