Help: AttributeError ('NoneType' object has no attribute 'a') when batch-scraping Bollywood movie IMDb IDs with Python
Hey there, let's fix that annoying AttributeError you're hitting when scraping multiple pages of Bollywood movies!
Why the error happens
The error AttributeError: 'NoneType' object has no attribute 'a' occurs because some movie entries on IMDb don't have the rating-cancel span element your code is trying to access. When you scrape a single page, you might get lucky and all entries have this element—but once you scale to multiple pages, you'll inevitably hit entries where store.find('span','rating-cancel') returns None, and trying to call .a on None throws the error.
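To see the failure mode and the guard in isolation, here's a minimal sketch. The HTML snippet is made up for illustration: the first entry has the rating-cancel span, the second (an unrated title) does not, mirroring what happens on later result pages.

```python
from bs4 import BeautifulSoup

# Hypothetical, pared-down HTML: one rated entry, one unrated entry
html = """
<div class="item"><span class="rating-cancel"><a href="/title/tt0000001/">X</a></span></div>
<div class="item"></div>
"""
soup = BeautifulSoup(html, 'html.parser')

ids = []
for store in soup.find_all('div', class_='item'):
    elem = store.find('span', 'rating-cancel')
    # Check for a missing element (and a missing <a>) before
    # dereferencing, instead of calling .a on None
    ids.append(elem.a['href'].split('/')[2] if elem and elem.a else 'N/A')

print(ids)  # ['tt0000001', 'N/A']
```

Without the `if elem and elem.a` guard, the second entry raises exactly the AttributeError you're seeing.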
The fix, plus optimized code
We'll fix the IMDb ID extraction logic first, then add proper multi-page scraping functionality and improve overall code robustness:
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

# Empty lists to collect the scraped fields
movie_name = []
year = []
time = []
rating = []
votes = []
description = []
director_s = []
starList = []
imdb_id = []

# Request headers: mimic a browser to avoid IMDb's anti-scraping block
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

# Scrape multiple pages (first 3 as an example; adjust the range as needed)
for page_num in range(1, 4):
    # IMDb shows 50 results per page, so start goes 1, 51, 101, ...
    start = (page_num - 1) * 50 + 1
    url = f"https://www.imdb.com/search/title/?title_type=feature&primary_language=hi&sort=num_votes,desc&start={start}&ref_=adv_nxt"

    # Random 2-5 second delay to avoid getting your IP banned
    sleep(randint(2, 5))
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    movie_data = soup.findAll('div', attrs={'class': 'lister-item mode-advanced'})

    for store in movie_data:
        # Movie title
        name = store.h3.a.text
        movie_name.append(name)

        # Release year
        year_of_release = store.h3.find('span', class_="lister-item-year text-muted unbold").text
        year.append(year_of_release)

        # Runtime (with None check)
        runtime_elem = store.p.find("span", class_='runtime')
        time.append(runtime_elem.text if runtime_elem else "N/A")

        # IMDb rating (with None check)
        rate_elem = store.find('div', class_="inline-block ratings-imdb-rating")
        rating.append(rate_elem.text.replace('\n', '') if rate_elem else "N/A")

        # Vote count (with None check)
        value = store.find_all('span', attrs={'name': "nv"})
        votes.append(value[0].text if value else "N/A")

        # Description (with None check)
        describe = store.find_all('p', class_='text-muted')
        description.append(describe[1].text.replace('\n', '') if len(describe) > 1 else 'N/A')

        # Director (with None check)
        director = 'N/A'
        for p in store.find_all('p'):
            if 'Director' in p.text:
                director = p.find('a').text
                break  # stop after the first director
        director_s.append(director)

        # IMDb ID (the core fix: check for a missing element before .a)
        rating_cancel_elem = store.find('span', 'rating-cancel')
        imdbID = rating_cancel_elem.a['href'].split('/')[2] if rating_cancel_elem and rating_cancel_elem.a else 'N/A'
        imdb_id.append(imdbID)

        # Stars (with None check)
        star_elem = store.find("p", attrs={"class": ""})
        star = star_elem.text.replace("Stars:", "").replace("\n", "").replace("Director:", "").strip() if star_elem else 'N/A'
        starList.append(star)

# Collect everything into a DataFrame for downstream processing
movie_df = pd.DataFrame({
    'Movie Name': movie_name,
    'Year': year,
    'Runtime': time,
    'IMDb Rating': rating,
    'Votes': votes,
    'Description': description,
    'Director': director_s,
    'IMDb ID': imdb_id,
    'Stars': starList
})
print(movie_df.head())
```
Additional improvements
- Anti-scraping: the `User-Agent` header mimics a real browser, and combined with random delays it lowers the risk of IMDb banning your IP.
- Global null handling: every field now checks for a missing element and falls back to `N/A` instead of an empty string, keeping the data consistent.
- Pagination: the `start` parameter is computed automatically from IMDb's pagination rule, so you can scrape any number of pages in bulk.
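The pagination arithmetic can be sketched on its own (the page range here is arbitrary):

```python
# IMDb's advanced search shows 50 results per page, so the start
# parameter for page N is (N - 1) * 50 + 1
def start_for_page(page_num: int) -> int:
    return (page_num - 1) * 50 + 1

starts = [start_for_page(p) for p in range(1, 4)]
print(starts)  # [1, 51, 101]
```

Plugging each value into the URL's `start={...}` slot walks through the result pages in order.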
This question comes from Stack Exchange; original asker: Mustafa Anandwala.