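How can I use R to efficiently loop over already-scraped Asian Development Bank project links and scrape the details of each project?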

Hey there! Let's figure out how to efficiently scrape those ADB project details from your list of links. RSelenium works, but it launches a full browser for every page and gets slow on large batches, so let's switch to faster methods that fit the project page structure.

First: Check if Project Pages Are Static (Most Likely!)

Looking at the sample link you shared, the core project details are static HTML, which means we can use rvest (way faster than Selenium) to pull data without launching a browser. Here's a step-by-step solution:

Step 1: Load Required Packages

library(rvest)
library(dplyr)
library(purrr)

Step 2: Build a Function to Scrape Single Project Details

This function will handle one project URL at a time, extract key details, and handle errors gracefully (like broken links):

scrape_project <- function(url) {
  # Add a 1-second delay to avoid overwhelming ADB's servers (critical to avoid being blocked!)
  Sys.sleep(1)
  
  tryCatch({
    # Fetch the page
    page <- read_html(url)
    
    # Extract details. These selectors are placeholders: confirm the real ones in your browser's dev tools
    project_title <- page %>% html_node("h1.page-title") %>% html_text(trim = TRUE)
    project_id <- page %>% html_node(".project-id") %>% html_text(trim = TRUE)
    country <- page %>% html_node(".project-country") %>% html_text(trim = TRUE)
    total_cost <- page %>% html_node(".project-cost") %>% html_text(trim = TRUE)
    description <- page %>% html_node(".project-description") %>% html_text(trim = TRUE)
    
    # Return as a tidy tibble
    tibble(
      url = url,
      title = project_title,
      project_id = project_id,
      country = country,
      total_cost = total_cost,
      description = description
    )
  }, error = function(e) {
    # Log errors and return NA values for failed pages
    message(paste("Oops, failed to scrape", url, ":", e$message))
    tibble(
      url = url,
      title = NA_character_,
      project_id = NA_character_,
      country = NA_character_,
      total_cost = NA_character_,
      description = NA_character_
    )
  })
}
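
Before running this against real pages, it can help to try the selectors on a tiny inline HTML snippet. The sketch below is only an illustration: the markup, the project title, and the ID are made up, and the real ADB page will differ. It uses html_element(), which is the current name for html_node() in rvest 1.0 and later, and shows that a selector with no match yields NA instead of an error.

```r
library(rvest)

# Made-up markup for illustration only; the real ADB page structure will differ
demo_page <- minimal_html('
  <h1 class="page-title"> Sample Road Project </h1>
  <span class="project-id">12345-001</span>
')

# Selectors that match return the trimmed text
demo_title <- demo_page %>% html_element("h1.page-title") %>% html_text(trim = TRUE)
demo_id <- demo_page %>% html_element(".project-id") %>% html_text(trim = TRUE)

# A selector with no match returns NA rather than throwing an error
demo_country <- demo_page %>% html_element(".project-country") %>% html_text(trim = TRUE)
```

So a column full of NA in your results usually means the selector is wrong, not that the request failed.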

Step 3: Scrape All Links in Batch

Use purrr::map_df() to apply the function to every URL in the pp_url column of your data frame (called full here) and row-bind the results:

# Scrape all project details
project_details <- full$pp_url %>% map_df(scrape_project)

# Merge with your original dataframe to have all data in one place
full_project_data <- full %>% left_join(project_details, by = c("pp_url" = "url"))
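
If the same link appears more than once in your data, you may want to scrape each distinct URL only once and let the join fan the results back out. Here is a minimal offline sketch of that idea; fake_scrape() and full_demo are hypothetical stand-ins for scrape_project() and your full data frame, and the example.org URLs are placeholders.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for scrape_project(), so the sketch runs without network access
fake_scrape <- function(url) {
  tibble(url = url, title = paste("Title for", basename(url)))
}

# Toy version of the data frame, with one link repeated
full_demo <- tibble(
  pp_url = c("https://example.org/p/1", "https://example.org/p/2", "https://example.org/p/1"),
  row_id = 1:3
)

# Scrape each distinct URL once, then bind the per-page tibbles together
details_demo <- full_demo$pp_url %>%
  unique() %>%
  map(fake_scrape) %>%
  bind_rows()

# The join keeps one row per original row, repeated links included
joined_demo <- full_demo %>% left_join(details_demo, by = c("pp_url" = "url"))
```

With the 1-second delay in place, skipping repeated links saves one second per duplicate, which adds up on long lists.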

If Some Pages Are Dynamic (Rare for ADB)

If a few pages load content dynamically (like hidden tabs that need clicking), use a headless browser tool like chromote. It drives Chrome directly through the DevTools Protocol, so it is lighter and faster than RSelenium:

library(chromote)

# Start a headless Chrome session (no visible browser window)
chrome_session <- ChromoteSession$new()

scrape_dynamic_project <- function(url) {
  Sys.sleep(1)
  tryCatch({
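    # Register the load-event promise before navigating, so a fast page load is not missed
    load_event <- chrome_session$Page$loadEventFired(wait_ = FALSE)
    chrome_session$Page$navigate(url, wait_ = FALSE)
    chrome_session$wait_for(load_event)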
    
    # Get the full page source (including dynamically loaded content)
    page_source <- chrome_session$Runtime$evaluate("document.documentElement.outerHTML")$result$value
    page <- read_html(page_source)
    
    # Extract details same as before
    project_title <- page %>% html_node("h1.page-title") %>% html_text(trim = TRUE)
    # ... add other fields you need
    
    tibble(url = url, title = project_title)
  }, error = function(e) {
    message(paste("Failed to scrape", url, ":", e$message))
    tibble(url = url, title = NA_character_)
  })
}

# Run the scrape
dynamic_details <- full$pp_url %>% map_df(scrape_dynamic_project)

# Close the Chrome session when done
chrome_session$close()

Pro Tips for Respectful Scraping

  • Always add delays: Sys.sleep(1) is a minimum, since ADB might block your IP if you hit their server too fast.
  • Identify yourself: Use the polite package to let the server know who you are (reduces block risk). bow() opens the session, nod() points it at a specific page, and scrape() fetches it:
    library(polite)
    session <- bow("https://www.adb.org/projects", user_agent = "Your Name/Your Project (your.email@example.com)")
    page <- scrape(nod(session, path = "your-project-url"))
    
  • Test selectors: Use your browser's dev tools (right-click > Inspect) to find the exact CSS selectors for the data you want.
  • Extract tables: For project tables (like budget breakdowns), use html_table():
    project_tables <- page %>% html_table(header = TRUE)
    budget_table <- project_tables[[1]] # Adjust index based on which table you need
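
Picking a table by position breaks as soon as a page has an extra table, so one option is to pick it by column name instead. The sketch below runs on made-up inline HTML; the column names (Milestone, Date, Source, Amount) and the values are assumptions for illustration, not ADB's actual table layout.

```r
library(rvest)
library(purrr)

# Made-up page with two tables; the real column names on ADB pages may differ
page_demo <- minimal_html('
  <table>
    <tr><th>Milestone</th><th>Date</th></tr>
    <tr><td>Approval</td><td>2020</td></tr>
  </table>
  <table>
    <tr><th>Source</th><th>Amount</th></tr>
    <tr><td>ADB</td><td>100</td></tr>
  </table>
')

# html_table() on a whole page returns a list with one tibble per <table>
tables_demo <- page_demo %>% html_table(header = TRUE)

# Keep only the tables that have an "Amount" column, then take the first one
budget_demo <- tables_demo %>%
  keep(~ "Amount" %in% names(.x)) %>%
  pluck(1)
```

If no table has the column, pluck(1) returns NULL, which is easy to check for before you use the result.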
    

This question comes from Stack Exchange; it was asked by truehm2.
