How can I use Selenium to efficiently scrape swim club/team data from the USA Swimming map, where the details open in modals?
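1. Why modal detection fails, and how to fix it
I hit the same "modal not found" problem when I built a similar map scraper. It usually comes down to these causes:
- Wrong wait condition: presence_of_element_located only checks that the element exists in the DOM. The modal may still be running its show animation, so the element exists but is not yet visible or interactable, and working with it straight away fails.
- Locator not precise enough: the page may still hold the DOM of a previously opened modal, or other popups that share the same class, so the locator matches the wrong element.
- StaleElementReferenceException: clicking a marker updates the DOM, which invalidates element references obtained before the click.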
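Key points of the fix:
- Switch the wait condition to visibility_of_element_located, so the modal is actually displayed before you use it:

      modal = WebDriverWait(driver, 10).until(
          EC.visibility_of_element_located((By.CLASS_NAME, "popup-content-container"))
      )

- Use a more precise locator that also requires a characteristic element inside the modal, to avoid matching the wrong container:

      modal = WebDriverWait(driver, 10).until(
          EC.visibility_of_element_located(
              (By.XPATH, "//div[@class='popup-content-container' and .//div[@class='popupTitle']]")
          )
      )

- Replace every fixed time.sleep with WebDriverWait on the element state, which is both more stable and faster:

      driver.execute_script("arguments[0].click();", pin)
      # Wait for the modal to become visible instead of sleeping a fixed 3 seconds
      modal = WebDriverWait(driver, 10).until(
          EC.visibility_of_element_located((By.CLASS_NAME, "popup-content-container"))
      )

- Wrap the marker-handling logic in a try/except for StaleElementReferenceException, so a stale reference does not abort the whole script.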
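The last point can be sketched as a small retry helper. This is a minimal, library-agnostic sketch rather than anything Selenium provides: retry_on, attempts, and delay are names made up for illustration, and the exception type is passed in so the helper itself has no Selenium dependency.

```python
import time


def retry_on(exceptions, action, attempts=3, delay=0.5):
    """Call action() and retry when one of `exceptions` is raised.

    Returns the result of the first successful call and re-raises the
    last exception once all attempts are used up.
    (Hypothetical helper, not part of Selenium.)
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)  # give the DOM a moment to settle before retrying
```

With Selenium you would pass StaleElementReferenceException as the first argument and an action that locates the element again on every call (for example a lambda that calls driver.find_element and then clicks). Retrying with the old element reference would just fail the same way.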
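2. Traversing every marker across the US efficiently
Panning and zooming by hand is not realistic. The core problem is that the map only renders markers inside the current viewport, so collecting all maplibregl-marker elements only returns the ones visible on screen. Here are two workable approaches: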
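Approach 1: drive the map with JavaScript to sweep regions automatically
Because the page uses a MapLibre map, you can execute JavaScript to pan and zoom it directly, wait for new markers to load after each move, and repeat until nothing new appears:
- First type map in the browser console to confirm the variable name of the map instance (commonly map). If the instance is not exposed as a global, the script below will not work as written.
- Record a unique identifier for each processed marker (for example its aria-label attribute) so nothing is processed twice.
- Loop through "get the currently visible markers → process them → pan the map → wait for new markers to load":

      # Assumes: wait = WebDriverWait(driver, 15) and processed_ids = set()
      # Requires: from selenium.common.exceptions import TimeoutException
      while True:
          pins = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "maplibregl-marker")))
          new_pins = [p for p in pins if p.get_attribute("aria-label") not in processed_ids]
          if not new_pins:
              # No new markers: pan 500 px right and down (step size is adjustable).
              # MapLibre's panBy takes the offset as an [x, y] array.
              driver.execute_script("map.panBy([500, 500])")
              try:
                  wait.until(EC.staleness_of(pins[0]))  # wait for the markers to re-render
              except TimeoutException:
                  # Markers did not change after panning: treat everything as processed
                  break
              continue
          for pin in new_pins:
              processed_ids.add(pin.get_attribute("aria-label"))
              # ... click the pin and read its modal here ...

Panning only along the diagonal sweeps a single strip of the map. To cover the whole country, alternate horizontal passes with vertical steps.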
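One way to make the sweep systematic is to precompute a grid of viewport centers and jump to each one in turn. The following is a rough sketch under stated assumptions: the bounding box is only an approximation of the contiguous US (Alaska and Hawaii would need their own boxes), the step and zoom values are guesses to tune against the real viewport size, and the generated script assumes the map instance is reachable as a global named map. grid_centers and jump_script are hypothetical helper names.

```python
def grid_centers(west, south, east, north, step_deg):
    """Return (lng, lat) viewport centers that tile a bounding box row by row."""
    if step_deg <= 0:
        raise ValueError("step_deg must be positive")
    centers = []
    lat = south + step_deg / 2
    while lat - step_deg / 2 < north:
        lng = west + step_deg / 2
        while lng - step_deg / 2 < east:
            centers.append((round(lng, 6), round(lat, 6)))
            lng += step_deg
        lat += step_deg
    return centers


def jump_script(lng, lat, zoom):
    """Build JavaScript that recenters a MapLibre map (global `map` is assumed)."""
    return f"map.jumpTo({{center: [{lng}, {lat}], zoom: {zoom}}});"


# Approximate contiguous-US bounding box; an assumption, adjust as needed.
US_BOX = (-125.0, 24.0, -66.5, 49.5)

if __name__ == "__main__":
    for lng, lat in grid_centers(*US_BOX, step_deg=5.0):
        # With Selenium: driver.execute_script(jump_script(lng, lat, 6)),
        # then wait for markers and process the ones not yet seen.
        print(jump_script(lng, lat, 6))
```

Because each tile has a fixed center, a failed run can resume from the last completed tile, and the set of processed aria-label values still prevents duplicates where neighboring tiles overlap.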
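Approach 2: work region by region with the page's filter (easier to control)
The page should have a filter by state/region near the top. You can use Selenium to select each state in turn, so that state's markers are shown together, finish one state, then switch to the next. This is more efficient than sweeping the whole map and makes it harder to miss data.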
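3. The best option: get the data directly from the API
This is the most efficient method, and it can easily be 10x or more faster than Selenium. Checking the Network tab, as you did, was exactly the right direction. The steps are:
- Open DevTools > Network, refresh the page, filter by XHR requests, and look for requests whose URL contains team/club (for example an endpoint like /api/find-teams).
- Copy that request's URL and headers (such as User-Agent and Referer), call it directly with the requests library, and parse the returned JSON. No Selenium is needed at all.
- Example code (assuming the endpoint you find is usable; the URL and the field names below are placeholders to replace with what you actually see):

      import requests
      import pandas as pd

      headers = {
          "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
          "Referer": "https://www.usaswimming.org/home/find-a-team",
      }

      # Replace with the actual API endpoint you found
      response = requests.get("https://www.usaswimming.org/api/find-teams", headers=headers, timeout=30)
      response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
      teams_data = response.json()

      swim_club_data = []
      for team in teams_data["features"]:
          props = team["properties"]
          swim_club_data.append({
              "Name": props.get("name"),
              "Email": props.get("email"),
              "Phone": props.get("phone"),
              "Website": props.get("website"),
              "Club Size": props.get("size"),
              "Address": props.get("address"),
          })

      pd.DataFrame(swim_club_data).to_csv("swim_clubs.csv", index=False)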
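Since the real response layout is unknown until you inspect it, it can help to print a compact outline of the JSON before writing the extraction loop. Below is a small sketch of such an outline helper; describe_shape is a hypothetical name, and the sample payload is made up purely to show the output format, not taken from the site.

```python
import json


def describe_shape(value, max_depth=3, _depth=0):
    """Summarize nested JSON: keep dict keys, show only the first list item."""
    if _depth >= max_depth:
        return type(value).__name__
    if isinstance(value, dict):
        return {k: describe_shape(v, max_depth, _depth + 1) for k, v in value.items()}
    if isinstance(value, list):
        if not value:
            return "list[empty]"
        first = describe_shape(value[0], max_depth, _depth + 1)
        return [first, f"... {len(value)} items"]
    return type(value).__name__


if __name__ == "__main__":
    # Made-up sample in a GeoJSON-like layout, for illustration only.
    sample = {
        "type": "FeatureCollection",
        "features": [
            {"properties": {"name": "Club A", "size": "Medium"}},
            {"properties": {"name": "Club B", "size": "Small"}},
        ],
    }
    print(json.dumps(describe_shape(sample, max_depth=4), indent=2))
```

Running it on response.json() shows which key holds the list of clubs and what each record's fields are called, so the dictionary keys in the extraction loop can be matched to the real names.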
Optimized full Selenium code (if you still want to go the browser-rendering route)

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import StaleElementReferenceException, WebDriverException
    import pandas as pd


    def main():
        options = webdriver.ChromeOptions()
        # options.add_argument('--headless=new')  # keep commented out while debugging
        driver = webdriver.Chrome(options=options)
        wait = WebDriverWait(driver, 15)
        url = "https://www.usaswimming.org/home/find-a-team"
        driver.get(url)

        swim_club_data = []
        processed_pins = set()
        popup = (By.CLASS_NAME, "popup-content-container")
        close = (By.CLASS_NAME, "popup-close")

        try:
            # Wait for the map to load
            wait.until(EC.presence_of_element_located((By.CLASS_NAME, "maplibregl-map")))

            while True:
                # Collect the markers currently inside the viewport
                pins = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "maplibregl-marker")))
                new_pins = [pin for pin in pins if pin.get_attribute("aria-label") not in processed_pins]

                if not new_pins:
                    # No new markers: pan the map. Assumes a global `map`;
                    # MapLibre's panBy takes the offset as an [x, y] array.
                    try:
                        driver.execute_script("map.panBy([500, 500])")
                        wait.until(EC.staleness_of(pins[0]))  # wait for the markers to re-render
                        continue
                    except WebDriverException:
                        # Pan failed or nothing changed: treat all markers as processed
                        break

                for pin in new_pins:
                    pin_id = pin.get_attribute("aria-label")
                    processed_pins.add(pin_id)
                    try:
                        driver.execute_script("arguments[0].click();", pin)
                        # Wait until the modal is visible
                        modal = wait.until(EC.visibility_of_element_located(popup))
                        # Relative XPath, so only links inside this modal match
                        club_links = modal.find_elements(By.XPATH, ".//ul/li/a")

                        for link in club_links:
                            link_text = link.text
                            try:
                                driver.execute_script("arguments[0].click();", link)
                                # Wait for the details modal
                                details_modal = wait.until(EC.visibility_of_element_located(popup))

                                # Extract the fields
                                club_name = details_modal.find_element(By.CLASS_NAME, "popupTitle").text
                                email = details_modal.find_element(By.CSS_SELECTOR, "a[href^='mailto:']").get_attribute("href").replace("mailto:", "")
                                phone = details_modal.find_element(By.CSS_SELECTOR, "a[href^='tel:']").get_attribute("href").replace("tel:", "")
                                website = details_modal.find_element(By.CSS_SELECTOR, "a[target='_blank']").get_attribute("href")
                                # partition() yields "" instead of raising if ": " is missing
                                club_size = details_modal.find_element(By.XPATH, ".//li[contains(text(), 'Club Size')]").text.partition(": ")[2]
                                # Selenium cannot return XPath text() nodes, so read the
                                # node that follows ul.popupSubTitle through JavaScript.
                                address = driver.execute_script(
                                    "var ul = arguments[0].querySelector('ul.popupSubTitle');"
                                    "var node = ul ? ul.nextSibling : null;"
                                    "return node ? node.textContent.trim() : '';",
                                    details_modal,
                                )

                                swim_club_data.append({
                                    "Name": club_name,
                                    "Email": email,
                                    "Phone": phone,
                                    "Website": website,
                                    "Club Size": club_size,
                                    "Address": address,
                                })
                                print(f"Extracted: {club_name}")

                                # Close the details modal
                                close_btn = wait.until(EC.element_to_be_clickable(close))
                                close_btn.click()
                                wait.until(EC.invisibility_of_element_located(popup))
                            except Exception as e:
                                print(f"Failed to process club {link_text}: {e}")
                                # Force-close the modal
                                try:
                                    close_btn = wait.until(EC.element_to_be_clickable(close))
                                    close_btn.click()
                                except WebDriverException:
                                    pass
                                continue

                        # Close the marker modal
                        close_btn = wait.until(EC.element_to_be_clickable(close))
                        close_btn.click()
                        wait.until(EC.invisibility_of_element_located(popup))
                    except StaleElementReferenceException:
                        continue
                    except Exception as e:
                        print(f"Failed to process marker {pin_id}: {e}")
                        continue

            # Save the data
            if swim_club_data:
                pd.DataFrame(swim_club_data).to_csv("swim_clubs.csv", index=False)
                print("Data saved to swim_clubs.csv")
            else:
                print("No data was extracted")
        finally:
            driver.quit()


    if __name__ == "__main__":
        main()
Note: content sourced from Stack Exchange; question asked by Guill T.




