How can I use R to efficiently loop over already-scraped Asian Development Bank project links and scrape each project's details?

阿华AIGC实验室

2026-4-28

Hey there! Let's work out how to scrape the ADB project details efficiently from your list of links. RSelenium works, but it's slow for large batches, so let's switch to faster methods that fit the project page structure.

First: Check if Project Pages Are Static (Most Likely!)

Looking at the sample link you shared, the core project details are static HTML, which means we can use rvest (way faster than Selenium) to pull data without launching a browser. Here's a step-by-step solution:
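A quick way to test the "static" assumption is to parse the raw HTML without a browser and see whether the selector you care about matches anything. Here's a minimal sketch that uses an inline HTML string as a stand-in for a downloaded page; the helper name has_selector and the markup are made up for illustration and are not real ADB markup:

```r
library(rvest)

# Hypothetical helper: TRUE if `selector` matches at least one node in `page`.
has_selector <- function(page, selector) {
  length(html_elements(page, selector)) > 0
}

# Stand-in for the HTML a plain HTTP request would return (illustrative only).
raw_html <- '<html><body>
  <h1 class="page-title">Sample Project</h1>
  <div id="app"></div>
</body></html>'
page <- read_html(raw_html)

# Present in the raw HTML, so rvest alone can read it
title_is_static <- has_selector(page, "h1.page-title")

# Absent from the raw HTML: either the selector is wrong or the
# content is rendered later by JavaScript
cost_is_static <- has_selector(page, ".project-cost")
```

To try this on a real page, swap raw_html for one of your project URLs in read_html(). If a field you can see in the browser never matches in the raw HTML, that page probably needs the headless-browser route.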

Step 1: Load Required Packages

library(rvest)
library(dplyr)
library(purrr)

Step 2: Build a Function to Scrape Single Project Details

This function processes one project URL at a time, extracts the key details, and fails gracefully on errors such as broken links:

scrape_project <- function(url) {
  # Add a 1-second delay to avoid overwhelming ADB's servers (critical to avoid being blocked!)
  Sys.sleep(1)
  
  tryCatch({
    # Fetch the page
    page <- read_html(url)
    
    # Extract details. These selectors are placeholders: adjust them to the fields
    # you need (use browser dev tools to find them). A selector that matches
    # nothing yields NA rather than an error.
    project_title <- page %>% html_element("h1.page-title") %>% html_text(trim = TRUE)
    project_id <- page %>% html_element(".project-id") %>% html_text(trim = TRUE)
    country <- page %>% html_element(".project-country") %>% html_text(trim = TRUE)
    total_cost <- page %>% html_element(".project-cost") %>% html_text(trim = TRUE)
    description <- page %>% html_element(".project-description") %>% html_text(trim = TRUE)
    
    # Return as a tidy one-row tibble
    tibble(
      url = url,
      title = project_title,
      project_id = project_id,
      country = country,
      total_cost = total_cost,
      description = description
    )
  }, error = function(e) {
    # Log errors and return typed NA values for failed pages, so every
    # column stays character when the rows are combined
    message(paste("Oops, failed to scrape", url, ":", e$message))
    tibble(
      url = url,
      title = NA_character_,
      project_id = NA_character_,
      country = NA_character_,
      total_cost = NA_character_,
      description = NA_character_
    )
  })
}
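If the list of fields grows, repeating one line per selector gets tedious. One possible variation, sketched below, keeps the selectors in a named vector and loops over them. The helper name extract_fields, the selector names, and the sample markup are all illustrative assumptions; the real selectors still have to come from the browser's dev tools:

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical selectors; replace with ones confirmed in dev tools.
selectors <- c(
  title      = "h1.page-title",
  project_id = ".project-id",
  country    = ".project-country"
)

# Take the first match for each selector; unmatched selectors yield NA.
extract_fields <- function(page, selectors) {
  values <- map_chr(selectors, function(sel) {
    html_text(html_element(page, sel), trim = TRUE)
  })
  # A named character vector becomes a one-row tibble, one column per field
  as_tibble(as.list(values))
}

# Made-up markup with a title and an ID but no country element
page <- read_html('<html><body>
  <h1 class="page-title"> Sample Project </h1>
  <span class="project-id">12345-001</span>
</body></html>')

fields <- extract_fields(page, selectors)
```

Here the country column comes back as NA because nothing matches ".project-country", which is how a missing field differs from a failed request: the page parsed fine, one selector just found nothing.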

Step 3: Scrape All Links in Batch

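Use purrr::map() with dplyr::bind_rows() to apply the function to every unique URL in your full dataframe (map_df() still works, but it has been superseded since purrr 1.0.0):

# Scrape all project details, one row per unique URL so that duplicated
# links are neither fetched twice nor multiplied by the join below
project_details <- unique(full$pp_url) %>% map(scrape_project) %>% bind_rows()

# Merge with your original dataframe to have all data in one place
full_project_data <- full %>% left_join(project_details, by = c("pp_url" = "url"))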
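Because failed pages come back as rows of NA values, a second pass over just those URLs is cheap. The sketch below shows one way to do that; it uses a stand-in scraper (fake_scrape) and a tiny made-up project_details table so that it runs offline. In practice you would call scrape_project instead:

```r
library(dplyr)
library(purrr)

# Stand-in for the batch result: the second URL failed (title is NA).
project_details <- tibble(
  url   = c("https://example.org/a", "https://example.org/b"),
  title = c("Project A", NA_character_)
)

# Stand-in scraper so the sketch runs offline (hypothetical).
fake_scrape <- function(url) tibble(url = url, title = "Project B")

# Collect the URLs whose scrape failed and try them again
failed_urls <- project_details %>% filter(is.na(title)) %>% pull(url)
retried <- map(failed_urls, fake_scrape) %>% bind_rows()

# Swap the failed rows for the retried ones
project_details <- project_details %>%
  filter(!url %in% failed_urls) %>%
  bind_rows(retried)
```

Using an NA title as the failure signal is an assumption: it only works if every successfully scraped page has a title. If that is not guaranteed, a dedicated column set inside the error handler would be a safer flag.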

If Some Pages Are Dynamic (Rare for ADB)

If a few pages load content dynamically (like hidden tabs that need clicking), use a headless browser tool like chromote—it's lighter and faster than RSelenium:

library(chromote)

# Start a headless Chrome session (no visible browser window)
chrome_session <- ChromoteSession$new()

scrape_dynamic_project <- function(url) {
  Sys.sleep(1)
  tryCatch({
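    # Register the load-event promise before navigating, so a fast page load
    # cannot fire before we start listening, then block until it resolves
    load_event <- chrome_session$Page$loadEventFired(wait_ = FALSE)
    chrome_session$Page$navigate(url, wait_ = FALSE)
    chrome_session$wait_for(load_event)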
    
    # Get the full page source (including dynamically loaded content)
    page_source <- chrome_session$Runtime$evaluate("document.documentElement.outerHTML")$result$value
    page <- read_html(page_source)
    
    # Extract details same as before
    project_title <- page %>% html_element("h1.page-title") %>% html_text(trim = TRUE)
    # ... add other fields you need
    
    tibble(url = url, title = project_title)
  }, error = function(e) {
    message(paste("Failed to scrape", url, ":", e$message))
    tibble(url = url, title = NA_character_)
  })
}

# Run the scrape
dynamic_details <- unique(full$pp_url) %>% map(scrape_dynamic_project) %>% bind_rows()

# Close the Chrome session when done
chrome_session$close()

Pro Tips for Respectful Scraping

Always add delays: Sys.sleep(1) is a minimum—ADB might block your IP if you hit their server too fast.

Identify yourself: Use the polite package to let the server know who you are (reduces block risk):

library(polite)
session <- bow("https://www.adb.org/projects", user_agent = "Your Name/Your Project (your.email@example.com)")
# scrape() has no url argument; use nod() to point the session at a specific page first
page <- scrape(nod(session, path = "your-project-url"))

Test selectors: Use your browser's dev tools (right-click > Inspect) to find the exact CSS selectors for the data you want.

Extract tables: For project tables (like budget breakdowns), use html_table():

project_tables <- page %>% html_table(header = TRUE)
budget_table <- project_tables[[1]] # Adjust index based on which table you need
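To see what html_table() returns before pointing it at a real page, here's a small sketch on made-up markup (the table contents are invented, not ADB data). Called on a whole page, it gives back a list with one tibble per table element:

```r
library(rvest)

# Made-up markup standing in for a project page with one budget table.
page <- read_html('<html><body><table>
  <tr><th>Source</th><th>Amount</th></tr>
  <tr><td>ADB</td><td>100</td></tr>
  <tr><td>Government</td><td>50</td></tr>
</table></body></html>')

# One list element per <table> found on the page
tables <- html_table(page)
budget <- tables[[1]]

# The <th> cells become column names; numeric-looking cells are converted
total_amount <- sum(budget$Amount)
```

On pages with several tables, picking one by position is fragile. Selecting the table with a CSS selector first (html_element() followed by html_table()) tends to survive layout changes better.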

The question comes from Stack Exchange; it was asked by truehm2.
