Help with an AttributeError when using BeautifulSoup to scrape full job descriptions from multiple Indeed pages
Fixing the AttributeError & Empty Job Descriptions in Your Indeed Scraper
Hey Alina, let's break down what's causing your issues and fix them step by step.
Why You're Seeing the AttributeError
The error pops up because job_posts.find(name="a", class_="jcs-JobTitle") returns None for some job postings, and reading any attribute or calling any method on None raises AttributeError. The missing match can happen for a few common reasons:
- Dynamic page updates: Indeed frequently tweaks its HTML class names, so your jcs-JobTitle selector might not work for all listings (like sponsored posts).
- Anti-scraping blocks: If you're sending requests too quickly, Indeed might return incomplete or altered HTML, so your selector can't find the element.
- Inconsistent listing layouts: Some job cards have a different structure that doesn't use that specific anchor tag.
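Whatever the cause, the failure mode is the same: the lookup returns None and the next attribute access blows up. Here is a minimal sketch of the guard pattern. The helper name safe_text is hypothetical, and SimpleNamespace only stands in for a parsed tag so the snippet runs without fetching a page:

```python
from types import SimpleNamespace


def safe_text(elem, default="n/a"):
    """Return the stripped text of an element, or `default` if nothing was found."""
    # find() and select_one() return None when no element matches,
    # so None.text is what raises the AttributeError.
    if elem is None:
        return default
    return elem.text.strip()


# Stand-ins for the result of job_posts.find(...): one hit and one miss.
found = SimpleNamespace(text="  Data Analyst \n")
missing = None

print(safe_text(found))                         # Data Analyst
print(safe_text(missing, "Unknown Job Title"))  # Unknown Job Title
```

With a real soup you would pass the result of find() or select_one() straight in, for example safe_text(job_posts.find("span", class_="companyName"), "Unknown Company").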
Why the Try-Except Led to Empty Descriptions
When you caught the error and set job_link = "", you still made a request to https://uk.indeed.com (the homepage), not a job detail page. That page doesn't have the jobsearch-jobDescriptionText div, so your code returned empty descriptions. You should only fetch the detail page if you actually find a valid job link.
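To make that concrete, here is a small sketch of building the detail URL only when there is something to build it from. The helper name build_detail_url and the sample href are assumptions for illustration, not values taken from Indeed:

```python
from urllib.parse import urljoin

BASE_URL = "https://uk.indeed.com"


def build_detail_url(href):
    """Return an absolute detail-page URL, or None when there is no usable link."""
    # An empty string or None means the job card had no link we could read.
    # Returning None lets the caller skip the request instead of hitting the homepage.
    if not href:
        return None
    return urljoin(BASE_URL, href)


print(build_detail_url(""))                   # None
print(build_detail_url("/rc/clk?jk=abc123"))  # https://uk.indeed.com/rc/clk?jk=abc123
```

The caller then checks for None before calling requests.get, so a missing link never turns into a request for the homepage.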
Fixed & Optimized Code
Here's a revised version of your code with robust error handling, anti-scrape safeguards, and more reliable selectors:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Reusable headers (critical for avoiding anti-scraping blocks)
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}


def extract(page):
    url = f"https://uk.indeed.com/jobs?q=data+analyst+%C2%A330%2C000&l=London%2C+Greater+London&jt=fulltime&start={page}"
    try:
        r = requests.get(url, headers=HEADERS)
        r.raise_for_status()  # Catch non-200 status codes early
        soup = BeautifulSoup(r.text, "html.parser")
        return soup
    except requests.exceptions.RequestException as e:
        print(f"Failed to fetch page {page}: {str(e)}")
        return None


def transform(soup):
    if not soup:
        return

    job_postings = soup.find_all(name="div", class_="slider_item")
    for job_posts in job_postings:
        # Safely extract basic job info with None checks
        job_title_elem = job_posts.select_one("a span[title]")
        job_title = job_title_elem.text.strip() if job_title_elem else "Unknown Job Title"

        company_name_elem = job_posts.find(name="span", class_="companyName")
        company_name = company_name_elem.text.strip() if company_name_elem else "Unknown Company"

        salary = "n/a"
        salary_snippet = job_posts.find(name="div", class_="salary-snippet")
        if salary_snippet:
            salary_span = salary_snippet.find("span")
            if salary_span:
                salary = salary_span.getText().strip()

        summary_text_elem = job_posts.find(name="div", class_="job-snippet")
        summary_text = summary_text_elem.text.replace("\n", "").strip() if summary_text_elem else "No summary available"

        # Use a more stable selector for job links (targets unique data-jk attribute)
        job_a = job_posts.select_one("a[data-jk]")
        full_description = []
        if job_a:
            job_link = job_a.get("href")
            absolute_link = 'https://uk.indeed.com' + job_link

            # Add delay to avoid triggering anti-scraping measures
            time.sleep(1.5)

            try:
                job_desc_r = requests.get(absolute_link, headers=HEADERS)
                job_desc_r.raise_for_status()
                job_desc_soup = BeautifulSoup(job_desc_r.text, "html.parser")
                desc_div = job_desc_soup.find(name="div", class_="jobsearch-jobDescriptionText")

                if desc_div:
                    # Extract clean list items, skip empty entries
                    full_description = [
                        item.text.strip()
                        for item in desc_div.find_all("li")
                        if item.text.strip()
                    ]
                    # Fallback if no list items exist
                    if not full_description:
                        full_description = [desc_div.get_text(strip=True)]
                else:
                    full_description = ["No detailed job description found"]
            except requests.exceptions.RequestException as e:
                print(f"Failed to fetch job details: {str(e)}")
                full_description = ["Error loading job description"]
        else:
            full_description = ["No valid job link found"]

        # Append the cleaned job entry to the list
        job_list.append({
            'Job Title': job_title,
            'Company': company_name,
            'Salary': salary,
            'Summary': summary_text,
            'Full Descriptions': full_description
        })


job_list = []

# Loop across pages (adjust range as needed)
for page_num in range(0, 40, 10):
    print(f"Scraping page {page_num//10 + 1}...")
    extract_output = extract(page_num)
    transform(extract_output)

# Convert to DataFrame and save (optional)
df = pd.DataFrame(job_list)
print(df.head())
# df.to_csv("indeed_data_analyst_jobs.csv", index=False)
Key Improvements
- Reusable Headers: Ensures consistent browser identification across all requests, reducing anti-scrape flags.
- Status Code Checking: Uses raise_for_status() to catch failed requests early instead of processing invalid HTML.
- Stable Selectors: Targets a[data-jk] for job links (this attribute is far more reliable than class names for Indeed listings).
- Conditional Detail Requests: Only fetches the job detail page if a valid link is found, eliminating empty descriptions from bad URLs.
- Request Delays: Adds a 1.5-second delay between detail page requests to avoid overwhelming Indeed's servers.
- Fallback Content: Provides meaningful messages when data is missing instead of empty strings/lists.
- Whitespace Cleaning: Strips unnecessary whitespace from text fields for cleaner, more usable data.
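On the delay point, a fixed pause works, but you could also add a little randomness so requests aren't perfectly evenly spaced. This is an optional variation, not what the code above does, and the names polite_delay and sleep_politely as well as the 0.5-second jitter are assumptions:

```python
import random
import time


def polite_delay(base=1.5, jitter=0.5, rng=random):
    """Return a wait time in seconds: `base` plus a random extra of up to `jitter`."""
    return base + rng.uniform(0, jitter)


def sleep_politely(base=1.5, jitter=0.5):
    """Pause for a slightly randomized interval before the next request."""
    time.sleep(polite_delay(base, jitter))


# The computed delay always falls between base and base + jitter.
delay = polite_delay()
print(f"Would wait {delay:.2f} seconds")
```

You would call sleep_politely() in place of time.sleep(1.5) just before each detail-page request.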
Additional Tips
- Rotate User-Agents: If you still get blocked, try rotating between different user-agent strings to mimic different browsers.
- Use Proxies: For large-scale scraping, consider using proxies to avoid IP bans.
- Inspect Page Source: Periodically check Indeed's HTML structure to update selectors if they change.
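For the user-agent tip, a rough sketch could look like the following. The helper name random_headers is hypothetical and the strings in the pool are only illustrative examples of the format, so swap in whatever current browser strings you want to use:

```python
import random

# Illustrative pool of user-agent strings (assumed values, adjust as needed)
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
]


def random_headers():
    """Build a headers dict with a user-agent picked at random from the pool."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
print(headers["User-Agent"])
```

You would then pass headers=random_headers() to each requests.get call instead of the fixed HEADERS dict.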
The question comes from Stack Exchange and was asked by AlinaAZ.




