How to Use Jsoup (Java) to Get Table td Text Under a Specific tbody

Ahua AIGC Lab

2026-5-25

Extract Specific Table TD Text from Servientrega Tracking Page

Let's fix this scraping issue. You want only the text of the <td> elements inside the <tbody> with ID form01:tableEx4_data on the Servientrega tracking page, and your previous attempts also picked up unwanted content from linked <td>s. Here are two straightforward ways to limit the extraction to that one <tbody>:

Python Solution with BeautifulSoup

If you're using Python for automated scraping, this approach restricts the search to the target table body:

First, fetch the page HTML and parse it with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

# Target URL
url = "https://www.servientrega.com/RastreoContado/RastreoContado2.faces?idGuia=2003159909&idPais=1"

# Add a user-agent to avoid being blocked
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36'
}

# Fail fast on network problems or HTTP error responses
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')

Locate the target <tbody> and extract clean TD text:

# Find the exact tbody we need
target_tbody = soup.find('tbody', id='form01:tableEx4_data')

if target_tbody:
    # Loop through each row in the tbody
    for row in target_tbody.find_all('tr'):
        # Extract text from each TD, stripping extra whitespace
        td_contents = [td.get_text(strip=True) for td in row.find_all('td')]
        print(td_contents)
else:
    print("Couldn't find the target tbody element. Double-check the ID or page structure!")

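The get_text(strip=True) call drops the HTML tags inside each <td> and trims surrounding whitespace, but it keeps the text of nested elements, including link text. With strip=True alone, adjacent text fragments are joined with no separator, so pass one explicitly, for example get_text(' ', strip=True), if a cell mixes plain text and links.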
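To make the behavior of get_text concrete, here is a minimal sketch run against a made-up HTML fragment. The cell contents and the /detail link are assumptions for illustration only; the real page's markup may differ. It compares the default join, a space separator, and a variant that keeps only the text sitting directly in the <td> (skipping link text).

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup; only the tbody ID comes from the question.
sample_html = """
<table>
  <tbody id="form01:tableEx4_data">
    <tr>
      <td>Bogota <a href="/detail">Ver detalle</a></td>
      <td>  En transito  </td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
cells = soup.find("tbody", id="form01:tableEx4_data").find_all("td")

# strip=True alone: fragments are stripped, then joined with no separator.
joined = [td.get_text(strip=True) for td in cells]

# A separator keeps plain text and link text apart.
spaced = [td.get_text(" ", strip=True) for td in cells]

# Only the text nodes that are direct children of the <td>,
# which leaves out the text inside nested tags such as <a>.
own_text = [
    "".join(c for c in td.children if isinstance(c, NavigableString)).strip()
    for td in cells
]

print(joined)    # ['BogotaVer detalle', 'En transito']
print(spaced)    # ['Bogota Ver detalle', 'En transito']
print(own_text)  # ['Bogota', 'En transito']
```

Which variant fits depends on whether the link text counts as part of the cell's value for your purposes.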

Quick Browser Console (JavaScript) Method

If you just need to extract the data once without writing a full script, use your browser's developer tools:

Open the target page in Chrome/Firefox, right-click anywhere and select "Inspect" to open DevTools.
Go to the "Console" tab and paste this code:

// Grab the target tbody element
const targetTableBody = document.getElementById('form01:tableEx4_data');

if (targetTableBody) {
    // Get all rows in the tbody
    const tableRows = targetTableBody.querySelectorAll('tr');
    
    // Loop through rows and extract TD text
    tableRows.forEach(row => {
        const tdTexts = Array.from(row.querySelectorAll('td')).map(td => td.textContent.trim());
        console.log(tdTexts);
    });
} else {
    console.log("Target tbody not found—verify the element ID matches the page's current structure.");
}

Press Enter, and each row's <td> text is printed in the console as an array of strings. Note that textContent returns all text inside a cell, including the text of any nested links.

Quick Note

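If the table is filled in by JavaScript after the initial page load, the HTML returned by requests will not contain the rows. In that case, use a browser-automation tool such as Selenium for Python, or wait for the element to appear before running the console snippet. This particular Servientrega page appears to be static, so the methods above should work as written, but check that against the live page.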
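If the rows do turn out to be rendered dynamically, one possible arrangement is sketched below. The function names extract_rows and fetch_rendered_html are hypothetical helpers introduced here, and the sketch assumes Selenium and a local Chrome installation are available. The parsing step is kept separate from the fetching step so that it can be reused with HTML from either requests or a browser.

```python
from bs4 import BeautifulSoup

TBODY_ID = "form01:tableEx4_data"  # ID taken from the question


def extract_rows(html, tbody_id=TBODY_ID):
    """Return the text of each <td>, grouped by row, for the given tbody."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody", id=tbody_id)
    if tbody is None:
        return []
    return [
        [td.get_text(" ", strip=True) for td in row.find_all("td")]
        for row in tbody.find_all("tr")
    ]


def fetch_rendered_html(url, tbody_id=TBODY_ID, timeout=15):
    """Load the page in a browser and wait until the tbody is present."""
    # Imported lazily so extract_rows() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get(url)
        # Block until the element exists in the DOM, or raise on timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, tbody_id))
        )
        return driver.page_source
    finally:
        driver.quit()
```

With those two pieces, rows = extract_rows(fetch_rendered_html(url)) would yield one list of cell texts per table row, and extract_rows(response.text) covers the static case.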

The question originates from Stack Exchange; it was asked by Germán.