How can I efficiently loop over already-scraped Asian Development Bank project links in R and scrape the details of each project?
Hey there! Let's figure out how to efficiently scrape those ADB project details from your list of links. RSelenium works, but it's slow for large batches—so let's switch to faster methods that fit the project page structure.
First: Check if Project Pages Are Static (Most Likely!)
Looking at the sample link you shared, the core project details are static HTML, which means we can use rvest (way faster than Selenium) to pull data without launching a browser. Here's a step-by-step solution:
Step 1: Load Required Packages
    library(rvest)
    library(dplyr)
    library(purrr)
Step 2: Build a Function to Scrape Single Project Details
This function will handle one project URL at a time, extract key details, and handle errors gracefully (like broken links):
    scrape_project <- function(url) {
      # Add a 1-second delay to avoid overwhelming ADB's servers
      # (critical to avoid being blocked!)
      Sys.sleep(1)

      tryCatch({
        # Fetch the page
        page <- read_html(url)

        # Extract details. Adjust these selectors based on what you need
        # (use browser dev tools to find them).
        project_title <- page %>% html_node("h1.page-title") %>% html_text(trim = TRUE)
        project_id    <- page %>% html_node(".project-id") %>% html_text(trim = TRUE)
        country       <- page %>% html_node(".project-country") %>% html_text(trim = TRUE)
        total_cost    <- page %>% html_node(".project-cost") %>% html_text(trim = TRUE)
        description   <- page %>% html_node(".project-description") %>% html_text(trim = TRUE)

        # Return as a tidy tibble
        tibble(
          url = url,
          title = project_title,
          project_id = project_id,
          country = country,
          total_cost = total_cost,
          description = description
        )
      }, error = function(e) {
        # Log errors and return typed NA values for failed pages, so every
        # row has character columns and binds cleanly with the others
        message(paste("Oops, failed to scrape", url, ":", e$message))
        tibble(
          url = url,
          title = NA_character_,
          project_id = NA_character_,
          country = NA_character_,
          total_cost = NA_character_,
          description = NA_character_
        )
      })
    }
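Before pointing the function at live pages, it can help to check how the selectors behave on a small inline HTML string. The sketch below is only an illustration: the markup and class names are made up to mirror the selectors above and are not taken from the real ADB pages. It also shows that a selector with no match yields NA instead of an error.

```r
library(rvest)

# Hypothetical markup; the class names are assumptions that mirror the
# selectors used in scrape_project(), not the real ADB page structure.
sample_html <- '<html><body>
  <h1 class="page-title"> Sample Road Project </h1>
  <span class="project-id">12345-001</span>
</body></html>'

page <- read_html(sample_html)

# A matching selector returns the trimmed text
title <- page %>% html_node("h1.page-title") %>% html_text(trim = TRUE)
project_id <- page %>% html_node(".project-id") %>% html_text(trim = TRUE)

# A selector with no match returns NA rather than raising an error
missing_cost <- page %>% html_node(".project-cost") %>% html_text(trim = TRUE)
```

So an NA in the final table usually means the selector did not match on that page, while a failed request shows up as a logged message from the error handler.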
Step 3: Scrape All Links in Batch
Use purrr::map_df() to apply the function to every URL in your full dataframe:
    # Scrape all project details
    project_details <- full$pp_url %>%
      map_df(scrape_project)

    # Merge with your original dataframe to have all data in one place
    full_project_data <- full %>%
      left_join(project_details, by = c("pp_url" = "url"))
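To see what the batch step and the join produce without touching the network, here is a small offline sketch. The `full` data frame, the example.org URLs, the `sector` column, and the `fake_scrape()` stub are all hypothetical stand-ins. The stub fails on purpose for one URL so you can see that a failed page still keeps its row after the join.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the real `full` data frame
full <- tibble(
  pp_url = c("https://example.org/p/1", "https://example.org/p/2"),
  sector = c("Transport", "Energy")
)

# Stub scraper (an assumption, not the real one): fails for URLs ending in "2"
fake_scrape <- function(url) {
  tryCatch({
    if (grepl("2$", url)) stop("simulated network failure")
    tibble(url = url, title = paste("Project", basename(url)))
  }, error = function(e) {
    tibble(url = url, title = NA_character_)
  })
}

# One tibble per URL, stacked into a single data frame
details <- map(full$pp_url, fake_scrape) %>% bind_rows()

# The join keeps every original row; failed pages carry NA in the new columns
combined <- left_join(full, details, by = c("pp_url" = "url"))
```

If `full$pp_url` contains duplicate links, scraping `unique(full$pp_url)` instead avoids repeated requests and duplicated rows after the join.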
If Some Pages Are Dynamic (Rare for ADB)
If a few pages load content dynamically (like hidden tabs that need clicking), use a headless browser tool like chromote—it's lighter and faster than RSelenium:
    library(chromote)

    # Start a headless Chrome session (no visible browser window)
    chrome_session <- ChromoteSession$new()

    scrape_dynamic_project <- function(url) {
      Sys.sleep(1)

      tryCatch({
        # Navigate to the page and wait for it to load
        chrome_session$Page$navigate(url)
        chrome_session$Page$loadEventFired(wait_ = TRUE)

        # Get the full page source (including dynamically loaded content)
        page_source <- chrome_session$Runtime$evaluate(
          "document.documentElement.outerHTML"
        )$result$value
        page <- read_html(page_source)

        # Extract details same as before
        project_title <- page %>% html_node("h1.page-title") %>% html_text(trim = TRUE)
        # ... add other fields you need

        tibble(url = url, title = project_title)
      }, error = function(e) {
        message(paste("Failed to scrape", url, ":", e$message))
        tibble(url = url, title = NA_character_)
      })
    }

    # Run the scrape
    dynamic_details <- full$pp_url %>%
      map_df(scrape_dynamic_project)

    # Close the Chrome session when done
    chrome_session$close()
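Since only a few pages are likely to need the browser, one option is to run the fast static scrape first and send only the rows that came back empty through the dynamic scraper. Here is a rough sketch of that idea. `static_results` and `stub_dynamic()` are hypothetical placeholders for the real scrape output and for `scrape_dynamic_project()`, so the example runs without Chrome.

```r
library(dplyr)
library(purrr)

# Hypothetical output of the static pass: page "b" came back without a title
static_results <- tibble(
  url   = c("a", "b", "c"),
  title = c("Title A", NA_character_, "Title C")
)

# Stub standing in for scrape_dynamic_project() (an assumption for this demo)
stub_dynamic <- function(url) {
  tibble(url = url, title = paste("dynamic", url))
}

# Retry only the URLs whose static scrape produced no title
needs_retry <- static_results %>% filter(is.na(title)) %>% pull(url)
retried <- map(needs_retry, stub_dynamic) %>% bind_rows()

# Keep the good static rows and append the retried ones
final <- bind_rows(
  static_results %>% filter(!is.na(title)),
  retried
)
```

This keeps the slow headless session limited to the handful of pages that need it.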
Pro Tips for Respectful Scraping
- Always add delays: Sys.sleep(1) is a minimum. ADB might block your IP if you hit their server too fast.
- Identify yourself: Use the polite package to let the server know who you are (reduces block risk). bow() introduces you to the host once, nod() points the session at a specific page, and scrape() fetches it:

      library(polite)
      session <- bow("https://www.adb.org/projects", user_agent = "Your Name/Your Project (your.email@example.com)")
      page <- scrape(nod(session, path = "your-project-url"))

- Test selectors: Use your browser's dev tools (right-click > Inspect) to find the exact CSS selectors for the data you want.
- Extract tables: For project tables (like budget breakdowns), use html_table():

      project_tables <- page %>% html_table(header = TRUE)
      budget_table <- project_tables[[1]]  # Adjust index based on which table you need
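To illustrate the table tip offline, here is a minimal sketch that runs html_table() on an inline HTML string. The table contents and column names are invented for the demo and do not come from an ADB page.

```r
library(rvest)

# Made-up budget table; the column names and figures are illustrative only
table_html <- '<html><body><table>
  <tr><th>Component</th><th>Amount</th></tr>
  <tr><td>Loan</td><td>100</td></tr>
  <tr><td>Grant</td><td>25</td></tr>
</table></body></html>'

page <- read_html(table_html)

# html_table() on a whole page returns a list with one data frame per <table>
project_tables <- page %>% html_table(header = TRUE)
budget_table <- project_tables[[1]]

# Numeric-looking columns are converted by default, so they can be summed
total_amount <- sum(budget_table$Amount)
```

On a real page, check `length(project_tables)` first, since the index of the table you want depends on how many tables the page contains.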
The question comes from Stack Exchange; asked by truehm2.




