How to scrape <a> tags with a specific class from mbasic.facebook.com using Python
Hey there! Let's break down why your regex attempts aren't working and walk through a couple of reliable solutions to grab those <a class="cf"> tags you need.
What's Wrong With Your Regex Code?
First, let's look at the issues in your existing attempts:
First Regex Attempt:
- You used re.compile(driver.page_source), which is backwards: re.compile() takes a regex pattern, not the page source.
- The pattern "<a class=\"cf\" href=\"*\">" uses * incorrectly. In regex, * is a quantifier (it matches zero or more of the preceding element), not a wildcard. You'd need .*? to match any characters inside the href, but even then, HTML tags often have extra attributes (like target) or whitespace that would break this rigid pattern.
Second Regex Attempt:
- Your pattern is targeting <td> elements containing an <a> with any class, not specifically the cf class. It's also too broad and will fail if the HTML has line breaks or unexpected spacing between attributes.
Regex is rarely a good choice for parsing HTML—HTML is structured, and dedicated parsers handle things like whitespace, attribute order, and nested elements way better.
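To make the two regex problems concrete, here is a small sketch that runs the original pattern and the .*? variant against a made-up snippet. The markup and names in SAMPLE_HTML are hypothetical (the real page will differ); the point is only to show how attribute order and line breaks defeat a rigid pattern.

```python
import re

# Hypothetical sample markup; the real page's attributes may differ.
SAMPLE_HTML = '''<td><a class="cf" href="/alice?fref=fr_tab">Alice</a></td>
<td><a href="/bob?fref=fr_tab" class="cf">Bob</a></td>
<td><a class="cf"
       href="/carol?fref=fr_tab">Carol</a></td>'''

# Original pattern: "* means "zero or more quote characters", not "anything".
broken = re.compile(r'<a class="cf" href="*">')

# Lazy wildcard: matches any href value, but still assumes exact attribute
# order and single spaces between attributes.
lazy = re.compile(r'<a class="cf" href=".*?">')

print(broken.findall(SAMPLE_HTML))  # → []
print(lazy.findall(SAMPLE_HTML))    # → ['<a class="cf" href="/alice?fref=fr_tab">']
```

Even the corrected pattern finds only one of the three links: the second has its attributes in a different order and the third has a line break inside the tag.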
Solution 1: Use Selenium's Built-in Locators (Simplest)
Since you're already using Selenium, you can directly locate the elements you need without regex. Selenium has built-in methods to find elements by CSS selectors, which are perfect for this:
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Initialize your driver (adjust for your browser, e.g., Firefox, Edge)
    driver = webdriver.Chrome()

    # Navigate to the target page
    driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')

    # Find all <a> tags with class "cf" using CSS selector
    target_links = driver.find_elements(By.CSS_SELECTOR, "a.cf")

    # Print each full <a> tag HTML
    for link in target_links:
        print(link.get_attribute('outerHTML'))

    # Clean up
    driver.quit()
Solution 2: Use BeautifulSoup for HTML Parsing
If you prefer parsing the page source directly, BeautifulSoup is a fantastic library for HTML manipulation. Pair it with Selenium to get the page source:
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()
    driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')

    # Get the page source and parse it with BeautifulSoup
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find all <a> tags with class "cf"
    target_links = soup.find_all('a', class_='cf')

    # Print each full <a> tag HTML
    for link in target_links:
        print(str(link))

    driver.quit()
Why These Methods Work Better
- Both approaches handle variations in HTML formatting (like line breaks, extra spaces, or attribute order) that would break regex.
- They're more maintainable: if Facebook tweaks the page slightly (e.g., adds a new attribute to the <a> tag), these methods will still work as long as the cf class remains.
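
If you want to see this robustness without launching a browser, the following is a minimal sketch that uses Python's standard-library html.parser in place of Selenium or BeautifulSoup. The ClassLinkCollector class and the sample markup are hypothetical and only for illustration; the idea is that a real parser does not care about attribute order, extra attributes, or line breaks inside a tag.

```python
from html.parser import HTMLParser


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes:
            self.hrefs.append(attr_map.get("href"))


# Hypothetical sample markup with varied formatting.
SAMPLE_PAGE = '''<td><a class="cf" href="/alice">Alice</a></td>
<td><a href="/bob" target="_blank" class="cf">Bob</a></td>
<td><a class="cf"
       href="/carol">Carol</a></td>
<td><a class="other" href="/dave">Dave</a></td>'''

collector = ClassLinkCollector("cf")
collector.feed(SAMPLE_PAGE)
print(collector.hrefs)  # → ['/alice', '/bob', '/carol']
```

All three cf links are found regardless of how each tag is written, and the link with a different class is skipped.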
The question originally comes from Stack Exchange and was asked by Ben Daggers.




