Help with an AttributeError when scraping full job descriptions across multiple Indeed pages with BeautifulSoup

Fixing the AttributeError & Empty Job Descriptions in Your Indeed Scraper

Hey Alina, let's break down what's causing your issues and fix them step by step.

Why You're Seeing the AttributeError

The error appears because job_posts.find(name="a", class_="jcs-JobTitle") returns None for some job postings, and any attribute or method call on that None result then raises AttributeError. A missing match can happen for a few common reasons:

  • Changing markup: Indeed changes its HTML class names fairly often, so your jcs-JobTitle selector might not match every listing (sponsored posts, for example).
  • Anti-scraping blocks: If you're sending requests too quickly, Indeed might return incomplete or altered HTML, so your selector can't find the element.
  • Inconsistent listing layouts: Some job cards have a different structure that doesn't use that specific anchor tag.
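To make the failure mode concrete, here is a minimal offline sketch. The HTML snippet is invented for illustration (it is not real Indeed markup), and extract_links is a hypothetical helper name. It shows find() returning None for a card without the expected anchor, and a guard that avoids the AttributeError.

```python
from bs4 import BeautifulSoup

# Two made-up job cards: only the first has the anchor the selector expects.
HTML = """
<div class="slider_item"><a class="jcs-JobTitle" href="/rc/clk?jk=abc123">Data Analyst</a></div>
<div class="slider_item"><span>Sponsored card with no title anchor</span></div>
"""

def extract_links(html):
    """Return one href (or None) per job card in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for card in soup.find_all("div", class_="slider_item"):
        anchor = card.find("a", class_="jcs-JobTitle")
        # find() returns None when nothing matches; calling anchor.get("href")
        # on None is what would raise the AttributeError, so check first.
        links.append(anchor.get("href") if anchor is not None else None)
    return links

links = extract_links(HTML)
print(links)
```

The second card yields None instead of crashing, which lets the caller decide what to do with listings that have no link.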

Why the Try-Except Led to Empty Descriptions

When you caught the error and set job_link = "", the code still went on to build a URL and request it. Joining the base URL with an empty string gives just https://uk.indeed.com (the homepage), not a job detail page. That page doesn't have the jobsearch-jobDescriptionText div, so your code returned empty descriptions. You should only fetch the detail page if you actually find a valid job link.
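A small sketch of that pitfall, using only the standard library. build_detail_url is a hypothetical helper and the sample href is invented. It returns None when there is no link, so the caller can skip the detail request rather than fetch the homepage.

```python
from urllib.parse import urljoin

BASE_URL = "https://uk.indeed.com"

def build_detail_url(href):
    """Return an absolute detail-page URL, or None when there is no link."""
    if not href:
        # An empty string or None means no job link was found.
        return None
    return urljoin(BASE_URL, href)

# Plain concatenation with an empty link silently yields the homepage URL.
homepage = BASE_URL + ""
print(homepage)
print(build_detail_url("/rc/clk?jk=abc123"))
print(build_detail_url(""))
```

Returning None rather than a fallback URL keeps the "no link" case explicit, so it cannot be mistaken for a real detail page.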

Fixed & Optimized Code

Here's a revised version of your code with robust error handling, anti-scrape safeguards, and more reliable selectors:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Reusable headers (critical for avoiding anti-scraping blocks)
HEADERS = { 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}

def extract(page):
    url = f"https://uk.indeed.com/jobs?q=data+analyst+%C2%A330%2C000&l=London%2C+Greater+London&jt=fulltime&start={page}"
    try:
        r = requests.get(url, headers=HEADERS, timeout=10)  # timeout so a stalled connection can't hang the run
        r.raise_for_status()  # Catch non-200 status codes early
        soup = BeautifulSoup(r.text, "html.parser")
        return soup
    except requests.exceptions.RequestException as e:
        print(f"Failed to fetch page {page}: {str(e)}")
        return None

def transform(soup):
    if not soup:
        return
    
    job_postings = soup.find_all(name="div", class_="slider_item")
    for job_posts in job_postings:
        # Safely extract basic job info with None checks
        job_title_elem = job_posts.select_one("a span[title]")
        job_title = job_title_elem.text.strip() if job_title_elem else "Unknown Job Title"
        
        company_name_elem = job_posts.find(name="span", class_="companyName")
        company_name = company_name_elem.text.strip() if company_name_elem else "Unknown Company"
        
        salary = "n/a"
        salary_snippet = job_posts.find(name="div", class_="salary-snippet")
        if salary_snippet:
            salary_span = salary_snippet.find("span")
            if salary_span:
                salary = salary_span.getText().strip()
        
        summary_text_elem = job_posts.find(name="div", class_="job-snippet")
        summary_text = summary_text_elem.text.replace("\n", "").strip() if summary_text_elem else "No summary available"
        
        # Use a more stable selector for job links (targets unique data-jk attribute)
        job_a = job_posts.select_one("a[data-jk]")
        full_description = []
        
        # Only continue when the anchor exists and actually carries an href
        if job_a and job_a.get("href"):
            job_link = job_a.get("href")
            absolute_link = 'https://uk.indeed.com' + job_link
            
            # Add delay to avoid triggering anti-scraping measures
            time.sleep(1.5)
            
            try:
                job_desc_r = requests.get(absolute_link, headers=HEADERS, timeout=10)
                job_desc_r.raise_for_status()
                job_desc_soup = BeautifulSoup(job_desc_r.text, "html.parser")
                
                desc_div = job_desc_soup.find(name="div", class_="jobsearch-jobDescriptionText")
                if desc_div:
                    # Extract clean list items, skip empty entries
                    full_description = [
                        item.text.strip() 
                        for item in desc_div.find_all("li") 
                        if item.text.strip()
                    ]
                    # Fallback if no list items exist
                    if not full_description:
                        full_description = [desc_div.get_text(strip=True)]
                else:
                    full_description = ["No detailed job description found"]
                    
            except requests.exceptions.RequestException as e:
                print(f"Failed to fetch job details: {str(e)}")
                full_description = ["Error loading job description"]
        else:
            full_description = ["No valid job link found"]
        
        # Append the cleaned job entry to the list
        job_list.append({
            'Job Title': job_title,
            'Company': company_name,
            'Salary': salary,
            'Summary': summary_text,
            'Full Descriptions': full_description
        })

job_list = []
# Loop across pages (adjust range as needed)
for page_num in range(0, 40, 10):
    print(f"Scraping page {page_num//10 + 1}...")
    extract_output = extract(page_num)
    transform(extract_output)

# Convert to DataFrame and save (optional)
df = pd.DataFrame(job_list)
print(df.head())
# df.to_csv("indeed_data_analyst_jobs.csv", index=False)

Key Improvements

  1. Reusable Headers: Sends the same browser-like User-Agent with every request, which makes the scraper less likely to be flagged as a bot.
  2. Status Code Checking: Uses raise_for_status() to catch failed requests early instead of processing invalid HTML.
  3. Stable Selectors: Targets a[data-jk] for job links (this attribute tends to change less often than Indeed's class names, though it can still change).
  4. Conditional Detail Requests: Only fetches the job detail page if a valid link is found, eliminating empty descriptions from bad URLs.
  5. Request Delays: Adds a 1.5-second delay between detail page requests to avoid overwhelming Indeed's servers.
  6. Fallback Content: Provides meaningful messages when data is missing instead of empty strings/lists.
  7. Whitespace Cleaning: Strips unnecessary whitespace from text fields for cleaner, more usable data.
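As an optional variation on the fixed 1.5-second delay in item 5, the pause can be randomized slightly so requests are not perfectly evenly spaced. This is a sketch, not part of the code above: jittered_delay and polite_sleep are hypothetical helper names, and the 0.5-second jitter is an arbitrary assumption.

```python
import random
import time

def jittered_delay(base=1.5, jitter=0.5):
    """Pick a delay between base and base + jitter seconds."""
    return base + random.uniform(0, jitter)

def polite_sleep(base=1.5, jitter=0.5):
    """Sleep for a jittered delay and return how long was slept."""
    delay = jittered_delay(base, jitter)
    time.sleep(delay)
    return delay

# Show a sample delay without actually sleeping.
print(f"Next pause would be {jittered_delay():.2f}s")
```

In the scraper, polite_sleep() could stand in for the time.sleep(1.5) call before each detail-page request.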

Additional Tips

  • Rotate User-Agents: If you still get blocked, try rotating between different user-agent strings to mimic different browsers.
  • Use Proxies: For large-scale scraping, consider using proxies to avoid IP bans.
  • Inspect Page Source: Periodically check Indeed's HTML structure to update selectors if they change.
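The user-agent rotation tip can be sketched as follows. The strings in USER_AGENTS are illustrative examples only (the first is the one used in the code above), and next_headers is a hypothetical helper. Whether rotation helps against Indeed's blocking in practice is not something this sketch can show.

```python
import itertools

# Illustrative user-agent strings; swap in whichever browsers you want to mimic.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) "
    "Gecko/20100101 Firefox/100.0",
]

# cycle() repeats the list endlessly, so every call gets the next entry.
_ua_cycle = itertools.cycle(USER_AGENTS)

def next_headers():
    """Return a headers dict carrying the next user agent in the rotation."""
    return {"User-Agent": next(_ua_cycle)}
```

Each request would then pass headers=next_headers() instead of the shared HEADERS constant.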

This question originally comes from Stack Exchange; it was asked by AlinaAZ.
