How to Scrape <a> Tags With a Specific Class From mbasic.facebook.com Using Python

阿华AIGC实验室

2026-5-7

Fixing Your Facebook mBasic Link Scraping Issue

Hey there! Let's break down why your regex attempts aren't working and walk through a couple of reliable solutions to grab those <a class="cf"> tags you need.

What's Wrong With Your Regex Code?

First, let's look at the issues in your existing attempts:

First Regex Attempt:
- You called re.compile(driver.page_source), which is backwards: re.compile() takes a regex pattern, not the text to search. The page source belongs in the search call instead.
- The pattern "<a class=\"cf\" href=\"*\">" uses * incorrectly. In regex, * is a quantifier (it matches zero or more of the preceding token, here the quote character), not a wildcard. You'd need .*? to match any characters inside the href, but even then, HTML tags often carry extra attributes (like target) or whitespace that would break such a rigid pattern.

Second Regex Attempt:
- Your pattern targets <td> elements containing an <a> with any class, not specifically the cf class. It also depends on exact formatting, so it will fail if the HTML has line breaks or unexpected spacing between attributes.

Regex is rarely a good choice for parsing HTML. HTML is structured, and dedicated parsers handle whitespace, attribute order, and nested elements far more reliably.
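
To make the quantifier problem concrete, here is a minimal sketch run against a made-up two-line snippet. The markup and the names html, broken, and fixed are assumptions for illustration, not the real Facebook page or your original variables.

```python
import re

# Hypothetical sample markup (not copied from Facebook). Both links have
# class "cf", but the second lists its attributes in a different order.
html = (
    '<td><a class="cf" href="/alice?fref=fr_tab">Alice</a></td>\n'
    '<td><a href="/bob?fref=fr_tab" class="cf">Bob</a></td>'
)

# Pattern from the first attempt: * repeats the preceding quote character,
# so it can only match an href with nothing but quotes in it.
broken = re.compile(r'<a class="cf" href="*">')

# Using .*? lets the href contain any characters, but the pattern still
# requires class to come immediately before href.
fixed = re.compile(r'<a class="cf" href=".*?">')

print(broken.findall(html))  # finds nothing
print(fixed.findall(html))   # finds only the first link
```

Even the corrected pattern misses the second link because its attributes appear in a different order, which is the kind of brittleness a real parser avoids.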

Solution 1: Use Selenium's Built-in Locators (Simplest)

Since you're already using Selenium, you can directly locate the elements you need without regex. Selenium has built-in methods to find elements by CSS selectors, which are perfect for this:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Initialize your driver (adjust for your browser, e.g., Firefox, Edge)
driver = webdriver.Chrome()

# Navigate to the target page
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')

# Find all <a> tags with class "cf" using CSS selector
target_links = driver.find_elements(By.CSS_SELECTOR, "a.cf")

# Print each full <a> tag HTML
for link in target_links:
    print(link.get_attribute('outerHTML'))

# Clean up
driver.quit()

Solution 2: Use BeautifulSoup for HTML Parsing

If you prefer parsing the page source directly, BeautifulSoup is a fantastic library for HTML manipulation. Pair it with Selenium to get the page source:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')

# Get the page source and parse it with BeautifulSoup
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')

# Find all <a> tags with class "cf"
target_links = soup.find_all('a', class_='cf')

# Print each full <a> tag HTML
for link in target_links:
    print(str(link))

driver.quit()

Why These Methods Work Better

- Both approaches handle variations in HTML formatting (like line breaks, extra spaces, or attribute order) that would break regex.
- They're more maintainable: if Facebook tweaks the page slightly (e.g., adds a new attribute to the <a> tag), these methods will still work as long as the cf class remains.
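
The formatting tolerance described above can be checked offline without a browser. The sketch below swaps in Python's standard-library html.parser module (not Selenium or BeautifulSoup) so that it runs with no extra installs. The sample markup and the CfLinkCollector class are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class CfLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose class list contains 'cf'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if "cf" in classes:
            self.hrefs.append(attr_map.get("href"))


# Hypothetical markup: the second link has reordered attributes, a line
# break inside the tag, and an extra target attribute.
sample = '''
<td><a class="cf" href="/alice">Alice</a></td>
<td><a href="/bob"
       class="cf" target="_blank">Bob</a></td>
<td><a class="other" href="/carol">Carol</a></td>
'''

collector = CfLinkCollector()
collector.feed(sample)
collector.close()
print(collector.hrefs)
```

Both cf links are collected regardless of attribute order or line breaks, and the link with a different class is skipped. The a.cf selector in Solution 1 and find_all('a', class_='cf') in Solution 2 give you the same tolerance with less code.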

The question in this post comes from Stack Exchange and was asked by Ben Daggers.