
How can a web crawler traverse paginated results and parse detail pages to get extra information?

Let's update your Craigslist car crawler with the two features you asked about: pagination traversal and detail page parsing. Here's how to implement both, step by step:

1. Adding Pagination Traversal

Craigslist uses the s query parameter as a result offset for pagination. Each page typically shows 120 results, so s=0 is page 1, s=120 is page 2, s=240 is page 3, and so on. We'll loop through these pages until we hit a page with no more listings.

Key changes:

  • Wrap your scraping logic in a loop that increments the s parameter
  • Add a check to break the loop when no results are found
  • Handle potential missing elements to avoid crashes
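
As a minimal sketch of the offset arithmetic described above, the helper below builds the URL for a given zero-based page index. The name page_url is hypothetical (it does not appear in the original code), and the 120-results-per-page default is the same assumption made in the text; it only uses the standard library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url: str, page_index: int, results_per_page: int = 120) -> str:
    """Return base_url with the s offset set for a zero-based page index."""
    parts = urlsplit(base_url)
    # Work with a list of pairs so repeated keys such as nearbyArea survive,
    # and drop any existing s value so the offset is never duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "s"
    ]
    query.append(("s", str(page_index * results_per_page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Shortened example URL; the real search URL carries many more parameters.
example = "https://orlando.craigslist.org/search/cto?auto_title_status=1&nearbyArea=125&nearbyArea=20"
third_page = page_url(example, 2)
```

Compared with appending &s=... to a string, this keeps working if the base URL already contains an s parameter.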

2. Parsing Detail Pages for Extra Data

For each listing's URL, we'll fetch the detail page and extract:

  • The full vehicle description (from the postingbody section)
  • Vehicle condition and other attributes (from the attrgroup divs)

We'll add these fields to your dataset to make it more complete.
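
To make the attribute step concrete, here is a small sketch of how the "key: value" text taken from the attribute spans could be turned into a dictionary. It works on plain strings, so it assumes the text has already been pulled out of the page; the function name parse_attributes and the sample values are illustrative, not taken from a real listing.

```python
def parse_attributes(attr_texts):
    """Turn strings like 'condition: excellent' into a dict of attributes."""
    attributes = {}
    for text in attr_texts:
        # Split on the first colon only, so values containing ':' stay intact.
        key, sep, value = text.partition(":")
        if not sep:
            # Entries without a colon (e.g. a bare vehicle name) are skipped.
            continue
        attributes[key.strip().lower()] = value.strip()
    return attributes


# Illustrative input; a real page may use different labels.
sample = ["2015 honda civic", "condition: excellent", "odometer: 42000", "title status: clean"]
attrs = parse_attributes(sample)
condition = attrs.get("condition", "No condition listed")
```

Lower-casing the keys means the later lookup for the condition does not depend on how the site capitalizes its labels.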

Full Updated Code

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL without pagination parameter
BASE_URL = 'https://orlando.craigslist.org/search/cto?auto_title_status=1&max_auto_miles=50000&nearbyArea=125&nearbyArea=20&nearbyArea=219&nearbyArea=237&nearbyArea=238&nearbyArea=330&nearbyArea=331&nearbyArea=332&nearbyArea=333&nearbyArea=37&nearbyArea=376&nearbyArea=557&nearbyArea=638&nearbyArea=639&nearbyArea=80&searchNearby=2'
craigs_list = []
page_number = 0
results_per_page = 120  # Craigslist's typical result count per page

while True:
    # Add pagination parameter to the URL
    paginated_url = f"{BASE_URL}&s={page_number * results_per_page}"
    # Use a timeout so a stalled connection cannot hang the crawler
    page = requests.get(paginated_url, timeout=10)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Check if there are any results on the page
    results = soup.find(class_='rows')
    if not results:
        print("No more pages to scrape. Exiting loop.")
        break
    
    car_elems = results.find_all('li', class_='result-row')
    if not car_elems:
        break
    
    # Parse each listing tile
    for car_elem in car_elems:
        # Extract basic info from the list page
        price_elem = car_elem.find('span', class_='result-price')
        # Fallback to title link if gallery link doesn't exist
        url_elem = car_elem.find('a', {"class": "result-image gallery"}) or car_elem.find('a', class_='result-title hdrlnk')
        title_elem = car_elem.find('a', class_='result-title hdrlnk')
        date_elem = car_elem.find('time', class_='result-date')
        
        # Skip if critical elements are missing
        if not all([url_elem, title_elem]):
            continue
        
        detail_url = url_elem['href']
        # Fetch and parse the detail page
        detail_page = requests.get(detail_url, timeout=10)
        detail_soup = BeautifulSoup(detail_page.content, 'html.parser')
        
        # Extract description (clean up extra QR code text)
        description_elem = detail_soup.find('section', id='postingbody')
        description = description_elem.get_text(strip=True).replace('QR Code Link to This Post', '') if description_elem else "No description available"
        
        # Extract vehicle condition and attributes
        condition = "No condition listed"
        attributes = {}
        attr_groups = detail_soup.find_all('div', class_='attrgroup')
        for group in attr_groups:
            attrs = group.find_all('span', class_='attr')
            for attr in attrs:
                key_value = attr.get_text(strip=True).split(':', 1)
                if len(key_value) == 2:
                    key, value = key_value
                    attributes[key.strip()] = value.strip()
                    if key.strip().lower() == 'condition':
                        condition = value.strip()
        
        # Build the item dictionary with all data
        craigs_item = {
            'title': title_elem.get_text(strip=True),
            'price': price_elem.get_text(strip=True) if price_elem else "No price listed",
            'links': detail_url,
            'date_posted': date_elem['datetime'] if date_elem else "No date available",
            'description': description,
            'condition': condition,
            'vehicle_attributes': attributes
        }
        craigs_list.append(craigs_item)
    
    # Move to the next page
    page_number += 1
    print(f"Scraped page {page_number}")

# Convert to DataFrame and print
craigsList_df = pd.DataFrame(craigs_list)
print(craigsList_df)

Key Notes:

  • Pagination Handling: The loop increments the s parameter by 120 each time, and breaks when no results are found.
  • Robust Element Checks: We added checks for missing elements (like price or date) to avoid crashes, and fall back to default text when elements aren't found.
  • Detail Page Parsing: We fetch each detail URL, extract the full description (cleaning out the "QR Code Link" text), and scrape all vehicle attributes (like mileage, fuel type, condition) into a dictionary for easy access.
  • Flexible URL Extraction: Some listings might not have the result-image gallery class, so we fall back to using the title link for the detail URL.
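
The description clean-up mentioned in the notes can also be isolated into a small function, which makes it easier to test without any network access. This is only a sketch: clean_description is a hypothetical helper, and it assumes the raw text of the posting body has already been extracted (or is None when the section is missing).

```python
import re

QR_MARKER = "QR Code Link to This Post"


def clean_description(raw_text, default="No description available"):
    """Remove the QR code label and collapse whitespace in a posting body."""
    if raw_text is None:
        return default
    # Replace the marker with a space so neighbouring words are not glued together.
    text = raw_text.replace(QR_MARKER, " ")
    # Collapse runs of spaces and newlines into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Fall back to the default when nothing but the marker was present.
    return text or default


cleaned = clean_description("\n\nQR Code Link to This Post\n\n\nClean title,  one owner.\n")
```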

The question in this post comes from Stack Exchange; it was asked by ks989.
