Help needed: AttributeError when scraping full job descriptions from multiple Indeed pages with BeautifulSoup

阿华AIGC实验室

2026-4-28

Fixing the AttributeError & Empty Job Descriptions in Your Indeed Scraper

Hey Alina, let's break down what's causing your issues and fix them step by step.

Why You're Seeing the `AttributeError`

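The error appears because job_posts.find(name="a", class_="jcs-JobTitle") returns None for some job postings, and the attribute access chained onto that call (reading the href, for instance) then fails on None. A missing match can happen for a few common reasons: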

Dynamic page updates: Indeed frequently tweaks its HTML class names, so your jcs-JobTitle selector might not match every listing (sponsored posts, for example).
Anti-scraping blocks: If you're sending requests too quickly, Indeed might return incomplete or altered HTML, so your selector can't find the element.
Inconsistent listing layouts: Some job cards have a different structure that doesn't use that specific anchor tag.
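
To make the failure concrete, here is a minimal sketch on a made-up HTML fragment. The markup and the safe_href helper are illustrative assumptions, not Indeed's real page structure or anything from your code. One card has the jcs-JobTitle anchor and one does not, so find() returns None for the second card and any chained attribute access raises AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card lacks the jcs-JobTitle anchor.
HTML = """
<div class="slider_item"><a class="jcs-JobTitle" href="/rc/clk?jk=abc123">Data Analyst</a></div>
<div class="slider_item"><span>Sponsored card with a different layout</span></div>
"""

def safe_href(card):
    """Return the title anchor's href, or None when the anchor is missing."""
    anchor = card.find(name="a", class_="jcs-JobTitle")
    return anchor.get("href") if anchor is not None else None

soup = BeautifulSoup(HTML, "html.parser")
cards = soup.find_all(name="div", class_="slider_item")

# Guarded access works for both cards.
hrefs = [safe_href(card) for card in cards]

# Unguarded access on the second card reproduces the error.
try:
    cards[1].find(name="a", class_="jcs-JobTitle").get("href")
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(hrefs, error_name)
```

Checking for None before touching the element is usually clearer than wrapping the whole lookup in try/except, because it keeps the "no link" case explicit.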

Why the Try-Except Led to Empty Descriptions

When you caught the error and set job_link = "", the URL you built from it was just https://uk.indeed.com (the homepage), and you still requested it. The homepage has no jobsearch-jobDescriptionText div, so your code returned empty descriptions. Only fetch the detail page when you actually have a valid job link.
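
As a rough illustration of that point, the sketch below shows what the empty-string fallback turns into and one way to refuse to build a URL when there is no href. BASE_URL, build_detail_url, and the job key are hypothetical names chosen for this example; urljoin comes from the standard library.

```python
from urllib.parse import urljoin

BASE_URL = "https://uk.indeed.com"

def build_detail_url(href):
    """Return an absolute detail URL, or None if there is no usable href."""
    if not href:
        return None
    return urljoin(BASE_URL, href)

# What the empty-string fallback produced: just the homepage.
homepage = BASE_URL + ""

# A site-relative href (made-up job key) becomes a full detail URL.
detail = build_detail_url("/rc/clk?jk=abc123")

# No href means no URL, so the caller can skip the request entirely.
missing = build_detail_url("")

print(homepage, detail, missing)
```

Returning None instead of a placeholder string lets the calling code decide to skip the detail request rather than silently fetching the wrong page.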

Fixed & Optimized Code

Here's a revised version of your code with robust error handling, anti-scrape safeguards, and more reliable selectors:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Reusable headers (a browser-like User-Agent makes a block less likely, but does not prevent one)
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}

def extract(page):
    url = f"https://uk.indeed.com/jobs?q=data+analyst+%C2%A330%2C000&l=London%2C+Greater+London&jt=fulltime&start={page}"
    try:
        r = requests.get(url, headers=HEADERS, timeout=10)  # timeout keeps a stalled connection from hanging the run
        r.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
        soup = BeautifulSoup(r.text, "html.parser")
        return soup
    except requests.exceptions.RequestException as e:
        print(f"Failed to fetch page {page}: {e}")
        return None

def transform(soup):
    if not soup:
        return
    
    job_postings = soup.find_all(name="div", class_="slider_item")
    for job_posts in job_postings:
        # Safely extract basic job info with None checks
        job_title_elem = job_posts.select_one("a span[title]")
        job_title = job_title_elem.text.strip() if job_title_elem else "Unknown Job Title"
        
        company_name_elem = job_posts.find(name="span", class_="companyName")
        company_name = company_name_elem.text.strip() if company_name_elem else "Unknown Company"
        
        salary = "n/a"
        salary_snippet = job_posts.find(name="div", class_="salary-snippet")
        if salary_snippet:
            salary_span = salary_snippet.find("span")
            if salary_span:
                salary = salary_span.getText().strip()
        
        summary_text_elem = job_posts.find(name="div", class_="job-snippet")
        summary_text = summary_text_elem.text.replace("\n", "").strip() if summary_text_elem else "No summary available"
        
        # Use a more stable selector for job links (targets unique data-jk attribute)
        job_a = job_posts.select_one("a[data-jk]")
        full_description = []
        
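        job_link = job_a.get("href") if job_a else None
        if job_link:
            # Hrefs are assumed site-relative; an already-absolute one is left unchanged
            absolute_link = job_link if job_link.startswith("http") else "https://uk.indeed.com" + job_link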
            
            # Add delay to avoid triggering anti-scraping measures
            time.sleep(1.5)
            
            try:
                job_desc_r = requests.get(absolute_link, headers=HEADERS, timeout=10)
                job_desc_r.raise_for_status()
                job_desc_soup = BeautifulSoup(job_desc_r.text, "html.parser")
                
                desc_div = job_desc_soup.find(name="div", class_="jobsearch-jobDescriptionText")
                if desc_div:
                    # Extract clean list items, skip empty entries
                    full_description = [
                        item.text.strip() 
                        for item in desc_div.find_all("li") 
                        if item.text.strip()
                    ]
                    # Fallback if no list items exist
                    if not full_description:
                        full_description = [desc_div.get_text(strip=True)]
                else:
                    full_description = ["No detailed job description found"]
                    
            except requests.exceptions.RequestException as e:
                print(f"Failed to fetch job details: {e}")
                full_description = ["Error loading job description"]
        else:
            full_description = ["No valid job link found"]
        
        # Append the cleaned job entry to the list
        job_list.append({
            'Job Title': job_title,
            'Company': company_name,
            'Salary': salary,
            'Summary': summary_text,
            'Full Descriptions': full_description
        })

job_list = []
# Loop across pages (adjust range as needed)
for page_num in range(0, 40, 10):
    print(f"Scraping page {page_num//10 + 1}...")
    extract_output = extract(page_num)
    transform(extract_output)

# Convert to DataFrame and save (optional)
df = pd.DataFrame(job_list)
print(df.head())
# df.to_csv("indeed_data_analyst_jobs.csv", index=False)

Key Improvements

Reusable Headers: Ensures consistent browser identification across all requests, reducing anti-scrape flags.
Status Code Checking: Uses raise_for_status() to catch failed requests early instead of processing invalid HTML.
Stable Selectors: Targets a[data-jk] for job links (a data attribute tends to survive markup changes better than styling class names, though it can change too).
Conditional Detail Requests: Only fetches the job detail page if a valid link is found, eliminating empty descriptions from bad URLs.
Request Delays: Adds a 1.5-second delay between detail page requests to avoid overwhelming Indeed's servers.
Fallback Content: Provides meaningful messages when data is missing instead of empty strings/lists.
Whitespace Cleaning: Strips unnecessary whitespace from text fields for cleaner, more usable data.
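
If the fixed 1.5-second pause turns out not to be enough, one option is to lengthen the wait after each failed attempt and add a little randomness. This is a swapped-in technique (exponential backoff with jitter), not something the code above does; the function name and the default values are assumptions picked for illustration.

```python
import random

def next_delay(attempt, base=1.5, cap=30.0, jitter=0.5):
    """Seconds to wait before retry number `attempt` (0-based).

    The wait doubles with each attempt, is capped at `cap`, and gets
    up to `jitter` extra seconds so requests are not evenly spaced.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter)

# With jitter switched off the schedule is deterministic.
schedule = [next_delay(n, jitter=0) for n in range(4)]
print(schedule)
```

You would pass the result to time.sleep() before retrying a failed request, and reset the attempt counter after a success.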

Additional Tips

Rotate User-Agents: If you still get blocked, try rotating between different user-agent strings to mimic different browsers.
Use Proxies: For large-scale scraping, consider using proxies to avoid IP bans.
Inspect Page Source: Periodically check Indeed's HTML structure to update selectors if they change.
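
A minimal sketch of the User-Agent rotation tip, assuming you maintain your own list of strings. USER_AGENTS and pick_headers are hypothetical names, and the strings are only examples that you should replace with current browser values.

```python
import random

# Illustrative User-Agent strings; swap in current, real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0",
]

def pick_headers(rng=random):
    """Build a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}

headers = pick_headers()
print(headers)

# Usage with requests (not executed here). PROXIES would be a dict such as
# {"http": "...", "https": "..."} pointing at a proxy you control.
# requests.get(url, headers=pick_headers(), proxies=PROXIES, timeout=10)
```

Calling pick_headers() once per request, rather than once per run, is what actually varies the identification between requests.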

The question in this post comes from Stack Exchange; the asker is AlinaAZ.