How can a web crawler traverse paginated results and parse detail pages for extra information?
Alright, let's get your Craigslist car crawler updated with those two key features—pagination traversal and detail page parsing. Here's how to implement both step by step:
1. Adding Pagination Traversal
Craigslist uses the `s` query parameter to handle pagination (each page typically shows 120 results, so `s=0` is page 1, `s=120` is page 2, `s=240` is page 3, etc.). We'll loop through these pages until we hit a page with no more listings.
Key changes:
- Wrap your scraping logic in a loop that increments the `s` parameter
- Add a check to break the loop when no results are found
- Handle potential missing elements to avoid crashes
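To make the offset arithmetic concrete, here is a minimal sketch of building the paginated URL for a given page. The helper name `page_url` and the shortened search URL are illustrative assumptions, and the page size of 120 is taken from the description above rather than from any Craigslist documentation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

RESULTS_PER_PAGE = 120  # assumed page size, as described above


def page_url(base_url: str, page_index: int, per_page: int = RESULTS_PER_PAGE) -> str:
    """Return base_url with the `s` offset set for a zero-based page index."""
    parts = urlsplit(base_url)
    # Keep existing query parameters (including repeated ones) and drop any old `s`
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "s"]
    query.append(("s", str(page_index * per_page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical, shortened search URL used only for illustration
for i in range(3):
    print(page_url("https://orlando.craigslist.org/search/cto?auto_title_status=1", i))
```

Rebuilding the query string instead of appending `&s=...` avoids ending up with two `s` parameters if the base URL already carries one.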
2. Parsing Detail Pages for Extra Data
For each listing's URL, we'll fetch the detail page and extract:
- The full vehicle description (from the `postingbody` section)
- Vehicle condition and other attributes (from the `attrgroup` divs)
We'll add these fields to your dataset to make it more complete.
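The attribute step boils down to splitting `key: value` strings. The sketch below isolates that logic so it can be checked without fetching a page. The function name `parse_attributes` and the sample strings are made up for illustration, and it assumes each attribute's text has already been pulled out of the page (for example with `get_text(strip=True)`).

```python
def parse_attributes(attr_texts):
    """Split 'key: value' strings into a dict; strings without a colon are skipped."""
    attributes = {}
    for text in attr_texts:
        key, sep, value = text.partition(":")
        if not sep:
            continue  # e.g. a bare vehicle title with no key/value structure
        attributes[key.strip().lower()] = value.strip()
    return attributes


# Hypothetical attribute strings, not taken from a real listing
sample = ["2015 honda civic", "condition: excellent", "odometer: 42000", "title status: clean"]
attrs = parse_attributes(sample)
print(attrs.get("condition", "No condition listed"))
```

Lowercasing the keys here is a design choice of this sketch: it lets you look up `condition` without worrying about how the page capitalizes it.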
Full Updated Code
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL without pagination parameter
BASE_URL = 'https://orlando.craigslist.org/search/cto?auto_title_status=1&max_auto_miles=50000&nearbyArea=125&nearbyArea=20&nearbyArea=219&nearbyArea=237&nearbyArea=238&nearbyArea=330&nearbyArea=331&nearbyArea=332&nearbyArea=333&nearbyArea=37&nearbyArea=376&nearbyArea=557&nearbyArea=638&nearbyArea=639&nearbyArea=80&searchNearby=2'

craigs_list = []
page_number = 0
results_per_page = 120  # Craigslist's typical result count per page

while True:
    # Add pagination parameter to the URL
    paginated_url = f"{BASE_URL}&s={page_number * results_per_page}"
    page = requests.get(paginated_url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Check if there are any results on the page
    results = soup.find(class_='rows')
    if not results:
        print("No more pages to scrape. Exiting loop.")
        break

    car_elems = results.find_all('li', class_='result-row')
    if not car_elems:
        break

    # Parse each listing tile
    for car_elem in car_elems:
        # Extract basic info from the list page
        price_elem = car_elem.find('span', class_='result-price')
        # Fallback to title link if gallery link doesn't exist
        url_elem = (car_elem.find('a', {"class": "result-image gallery"})
                    or car_elem.find('a', class_='result-title hdrlnk'))
        title_elem = car_elem.find('a', class_='result-title hdrlnk')
        date_elem = car_elem.find('time', class_='result-date')

        # Skip if critical elements are missing
        if not all([url_elem, title_elem]):
            continue

        detail_url = url_elem['href']

        # Fetch and parse the detail page
        detail_page = requests.get(detail_url)
        detail_soup = BeautifulSoup(detail_page.content, 'html.parser')

        # Extract description (clean up extra QR code text)
        description_elem = detail_soup.find('section', id='postingbody')
        description = (
            description_elem.get_text(strip=True).replace('QR Code Link to This Post', '')
            if description_elem else "No description available"
        )

        # Extract vehicle condition and attributes
        condition = "No condition listed"
        attributes = {}
        attr_groups = detail_soup.find_all('div', class_='attrgroup')
        for group in attr_groups:
            attrs = group.find_all('span', class_='attr')
            for attr in attrs:
                key_value = attr.get_text(strip=True).split(':', 1)
                if len(key_value) == 2:
                    key, value = key_value
                    attributes[key.strip()] = value.strip()
                    # Strip before comparing so stray whitespace doesn't hide the match
                    if key.strip().lower() == 'condition':
                        condition = value.strip()

        # Build the item dictionary with all data
        craigs_item = {
            'title': title_elem.get_text(strip=True),
            'price': price_elem.get_text(strip=True) if price_elem else "No price listed",
            'links': detail_url,
            'date_posted': date_elem['datetime'] if date_elem else "No date available",
            'description': description,
            'condition': condition,
            'vehicle_attributes': attributes
        }
        craigs_list.append(craigs_item)

    # Move to the next page
    page_number += 1
    print(f"Scraped page {page_number}")

# Convert to DataFrame and print
craigsList_df = pd.DataFrame(craigs_list)
print(craigsList_df)
```
Key Notes:
- Pagination Handling: The loop increments the `s` parameter by 120 each time and breaks when no results are found.
- Robust Element Checks: We added checks for missing elements (like price or date) to avoid crashes, and fall back to default text when elements aren't found.
- Detail Page Parsing: We fetch each detail URL, extract the full description (cleaning out the "QR Code Link" text), and scrape all vehicle attributes (like mileage, fuel type, condition) into a dictionary for easy access.
- Flexible URL Extraction: Some listings might not have the `result-image gallery` class, so we fall back to using the title link for the detail URL.
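Because the attributes are stored as a nested dictionary, they land in a single DataFrame column. If you would rather have one column per attribute, one option is to flatten each item first. This is an optional sketch, not part of the code above: the helper name `flatten_item`, the `attr_` prefix, and the sample item are all assumptions made for illustration.

```python
def flatten_item(item, nested_key="vehicle_attributes", prefix="attr_"):
    """Return a copy of item with the nested attribute dict lifted to top-level keys."""
    flat = {k: v for k, v in item.items() if k != nested_key}
    for key, value in item.get(nested_key, {}).items():
        # Prefix the attribute names so they cannot clash with fields like 'title'
        flat[prefix + key.replace(" ", "_")] = value
    return flat


# Hypothetical item shaped like the dictionaries built by the scraper
example_item = {
    "title": "2015 Honda Civic",
    "price": "$9,500",
    "vehicle_attributes": {"condition": "excellent", "title status": "clean"},
}
print(flatten_item(example_item))
```

You could then build the table with `pd.DataFrame([flatten_item(i) for i in craigs_list])`; listings that lack a given attribute simply get a missing value in that column.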
The question originates from Stack Exchange; asked by ks989.




