
Python web scraping help: how do I scrape all the classes in an HTML document and print them?

Hey there! You're already close, so let's break down exactly how to scrape those elements and their associated data. Based on the HTML snippet you shared, I'll walk you through solutions using two common scraping tools, Python's BeautifulSoup and Node.js's Cheerio (both for static content), plus a note on handling dynamically loaded content in case that's the issue.

Solution 1: Python with BeautifulSoup (Static HTML)

If the page loads all the .sizedata divs and .selectData links directly in the initial HTML, this should work as-is. We parse the HTML, target the elements, and extract their classes and attributes:

from bs4 import BeautifulSoup

# Replace this with your actual page HTML (use requests.get() to fetch it first if needed)
html_content = """
<div class="sizedata"> <a class="selectData" data-branch-on="1" data-size="11" data-ifno="105124" id="25096"> </a> </div>
<div class="sizedata"> <a class="selectData" data-branch-on="1" data-size="12" data-ifno="173445" id="25097"> </a> </div>
"""

# Parse the HTML
soup = BeautifulSoup(html_content, "html.parser")

# Option 1: Target .sizedata divs first, then their child .selectData links
for sizedata_div in soup.find_all("div", class_="sizedata"):
    select_link = sizedata_div.find("a", class_="selectData")
    if select_link:
        # Print the link's class (BeautifulSoup returns it as a list, e.g. ['selectData'])
        print(f"Link class: {select_link.get('class')}")
        # Print all the data attributes and ID
        print("Data attributes:")
        print(f"  data-branch-on: {select_link.get('data-branch-on')}")
        print(f"  data-size: {select_link.get('data-size')}")
        print(f"  data-ifno: {select_link.get('data-ifno')}")
        print(f"  id: {select_link.get('id')}")
        print("---")

# Option 2: Directly target all .selectData links (simpler if you don't need the parent div)
for select_link in soup.find_all("a", class_="selectData"):
    print("All attributes for link:")
    for attr_name, attr_value in select_link.attrs.items():
        print(f"  {attr_name}: {attr_value}")
    print("---")
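
The question title asks for every class in the document, not only selectData. Here is a minimal sketch of one way to do that with the same BeautifulSoup setup. The helper name `collect_classes`, the sample HTML, and the extra `active` class are illustrative assumptions, not part of the original question:

```python
from bs4 import BeautifulSoup


def collect_classes(html):
    """Return the distinct class names used in `html`, in first-seen order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = []
    # class_=True matches any tag that has a class attribute at all
    for tag in soup.find_all(class_=True):
        # BeautifulSoup exposes class as a list, since one tag can carry several
        for name in tag.get("class", []):
            if name not in seen:
                seen.append(name)
    return seen


# Hypothetical sample; "active" is added only to show a multi-class tag
sample_html = """
<div class="sizedata"><a class="selectData active" id="25096"></a></div>
<div class="sizedata"><a class="selectData" id="25097"></a></div>
"""

for name in collect_classes(sample_html):
    print(name)
```

A list is used instead of a set so the output keeps document order; swap in a set if order does not matter.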

Solution 2: Node.js with Cheerio

If you prefer working in JavaScript, Cheerio works like jQuery for server-side scraping:

const cheerio = require('cheerio');

// Your page HTML content
const htmlContent = `
<div class="sizedata"> <a class="selectData" data-branch-on="1" data-size="11" data-ifno="105124" id="25096"> </a> </div>
<div class="sizedata"> <a class="selectData" data-branch-on="1" data-size="12" data-ifno="173445" id="25097"> </a> </div>
`;

// Load the HTML into Cheerio
const $ = cheerio.load(htmlContent);

// Traverse all .sizedata divs and their child links
$('div.sizedata a.selectData').each((index, element) => {
    const link = $(element);
    console.log(`Link class: ${link.attr('class')}`);
    console.log(`Data details:`);
    console.log(`  data-branch-on: ${link.data('branch-on')}`);
    console.log(`  data-size: ${link.data('size')}`);
    console.log(`  data-ifno: ${link.data('ifno')}`);
    console.log(`  id: ${link.attr('id')}`);
    console.log('---');
});

What if the content is dynamically loaded?

If you've been using static scraping tools but the .sizedata elements only appear after JavaScript runs (e.g., after scrolling or an AJAX call), you'll need a browser automation tool like Selenium (Python) to render the page first:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# Initialize the browser (Chrome in this case)
driver = webdriver.Chrome()
try:
    driver.get("YOUR_TARGET_PAGE_URL")

    # Wait up to 10 seconds for the .sizedata elements to load
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "sizedata")))

    # Get the fully rendered HTML
    rendered_html = driver.page_source
finally:
    # Clean up the browser even if the wait times out
    driver.quit()

# Now parse with BeautifulSoup like before
soup = BeautifulSoup(rendered_html, "html.parser")
# ... (use the same parsing code from Solution 1 here)
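
Before reaching for Selenium, it may be worth confirming that the elements really are missing from the raw HTML. Here is a rough sketch of such a check. The helper name `has_class_in_static_html` and both sample pages are hypothetical stand-ins for whatever your HTTP client returns:

```python
from bs4 import BeautifulSoup


def has_class_in_static_html(html, class_name):
    """Return True if any tag in the raw (unrendered) HTML carries class_name."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(class_=class_name) is not None


# Hypothetical raw responses, standing in for the text of an HTTP response
static_page = '<div class="sizedata"><a class="selectData"></a></div>'
dynamic_page = '<div id="root"></div><script src="app.js"></script>'

print(has_class_in_static_html(static_page, "sizedata"))   # True
print(has_class_in_static_html(dynamic_page, "sizedata"))  # False
```

If the check returns False for the page you fetched, the elements are probably injected by JavaScript and a rendering step like the one above is likely needed.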

The most likely sticking points were either the wrong syntax for targeting classes (e.g., writing class instead of class_ in BeautifulSoup) or not accounting for dynamic content. Give these a try, and you should have your scraped data in no time!
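
To make the class_ point concrete, here is a small sketch of three equivalent ways to match a class in BeautifulSoup. The one-line HTML is a trimmed, illustrative version of the snippet from the question:

```python
from bs4 import BeautifulSoup

html = '<div class="sizedata"><a class="selectData" data-size="11" id="25096"></a></div>'
soup = BeautifulSoup(html, "html.parser")

# Writing soup.find_all("a", class="selectData") is a SyntaxError,
# because class is a reserved word in Python.

by_keyword = soup.find_all("a", class_="selectData")          # trailing underscore
by_attrs = soup.find_all("a", attrs={"class": "selectData"})  # plain dict key
by_css = soup.select("div.sizedata a.selectData")             # CSS selector

# Each approach should find the same single link
for results in (by_keyword, by_attrs, by_css):
    print([link["id"] for link in results])
```

Which form to use is mostly a matter of taste; the CSS selector version reads closest to the Cheerio example.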

The question in this post comes from Stack Exchange; it was asked by CDNthe2nd.
