How do I use Python's Beautiful Soup to extract text, including the summary inside an HTML <p> tag?

Hey there! Let's work through both of your Beautiful Soup questions with practical examples based on the HTML snippet you shared.

1. General Approach: Extracting Target Text with Beautiful Soup

First, make sure you've got the library installed: run pip install beautifulsoup4 (you'll also need requests if you're pulling HTML from a live website, but we'll focus on local/already fetched HTML here).

The core idea is to parse the HTML into a Beautiful Soup object, then target the elements you want using tag names, classes, IDs, or other attributes. Here are common scenarios:

  • Extract text from a specific single element: Use soup.find() to locate the element, then .get_text() to pull its content. For example, to get the "Summary" text from your HTML:
    summary_label = soup.find('span', class_='text-bolder text-larger').get_text(strip=True)
    
  • Extract text from multiple elements: Use soup.find_all() to grab all matching elements, then loop through them to collect text:
    all_paragraphs = soup.find_all('p')
    for p in all_paragraphs:
        print(p.get_text(strip=True))
    
  • Extract all text from a parent element: If you want all text inside a container (like a <div>), call .get_text() directly on that parent element—this will combine text from all its child tags.
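The parent-element case from the last bullet can be sketched as follows. The markup and the id "profile" are made up for illustration; only the calls find() and get_text() come from Beautiful Soup itself. Passing separator=" " keeps text from neighbouring child tags from running together.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the snippet in the question
html = """
<div id="profile">
  <span class="text-bolder text-larger">Summary</span>
  <p>First paragraph.</p>
  <p>Second <b>bold</b> paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="profile")

# separator is placed between text fragments; strip=True trims each
# fragment and drops fragments that contain only whitespace
container_text = container.get_text(separator=" ", strip=True)
print(container_text)
```

Without a separator the fragments are joined with nothing between them, so a separator is usually worth passing when a container holds several child tags.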
2. Extracting Summary Text Between <p> and </p> Tags (Using Your HTML Example)

Your HTML snippet has a <p> tag with the summary content you want. Here's exactly how to pull that text:

from bs4 import BeautifulSoup

# Your provided HTML content
html = """
<div> 
  <div class="o-media__body"> 
    <span class="text-bolder text-larger">Summary</span> 
  </div> 
  <div> 
    <p>Hello, I m from Europe Macedonia, I came to USA 12 years ago, i got my citizenship 7 years ago, I m very happy person, i like to help people, I don't like to change jobs.In my life I worked only 3 jobs, First job I worked as a Nurse in Macedonia ...</p>
  </div>
</div>
"""

# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# Locate the <p> tag and extract its text
# Use find() for the first matching <p>, or find_all() if there are multiple
summary_text = soup.find('p').get_text(strip=True)

# Print or use the extracted text
print(summary_text)

What's happening here?

  • soup.find('p') finds the first <p> tag in your HTML. If there were multiple <p> tags and you wanted all of them, swap this with soup.find_all('p') and loop through the results.
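  • .get_text(strip=True) strips leading/trailing whitespace from each piece of text inside the tag and skips pieces that are whitespace only. It does not collapse runs of spaces inside a piece of text, and it joins the pieces with no separator unless you pass one (for example, get_text(" ", strip=True)).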
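To see what strip=True does and does not do, here is a small sketch on a made-up paragraph (the markup is an assumption, not part of your HTML). It compares the raw text, the stripped text, and a version with internal whitespace normalised using plain Python string methods.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with inline markup and irregular spacing
soup = BeautifulSoup("<p>  Hello,   <b>world</b> !  </p>", "html.parser")
p = soup.find("p")

# Text exactly as it appears in the document
raw = p.get_text()

# Each text fragment is stripped, then fragments are joined with ""
stripped = p.get_text(strip=True)

# Collapse every run of whitespace to a single space (plain Python, not a bs4 feature)
normalized = " ".join(p.get_text().split())

print(repr(raw))         # '  Hello,   world !  '
print(repr(stripped))    # 'Hello,world!'
print(repr(normalized))  # 'Hello, world !'
```

If the <p> contains only plain text, as in your snippet, strip=True alone is enough; the normalising step matters mostly when the text has inline tags or messy spacing.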

If you need to target a <p> tag that's nested inside a specific parent (like the second <div> in your example), you can narrow it down further:

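# Target the <p> inside the <div> that follows the "Summary" label's <div>
# (find_next_sibling() moves to the next <div> at the same level; find_next() would
# return the first nested <div> instead, which contains no <p>)
summary_text = soup.find('div', class_='o-media__body').find_next_sibling('div').find('p').get_text(strip=True)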
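A CSS selector is another way to say "the <p> inside the <div> right after the label's <div>". The sketch below assumes the same structure as your snippet, with placeholder paragraph text, and uses a hypothetical helper named extract_summary that returns None instead of raising when the element is missing.

```python
from bs4 import BeautifulSoup

# Same structure as the question's HTML; the paragraph text is a placeholder
html = """
<div>
  <div class="o-media__body">
    <span class="text-bolder text-larger">Summary</span>
  </div>
  <div>
    <p>Placeholder summary text.</p>
  </div>
</div>
"""


def extract_summary(soup):
    """Return the summary paragraph's text, or None if it is not found."""
    # "A + B" matches B only when it is the element right after sibling A
    p = soup.select_one("div.o-media__body + div p")
    return p.get_text(strip=True) if p is not None else None


summary_text = extract_summary(BeautifulSoup(html, "html.parser"))
missing = extract_summary(BeautifulSoup("<div></div>", "html.parser"))

print(summary_text)
print(missing)
```

Checking for None matters because find() and select_one() both return None when nothing matches, and calling .get_text() on None raises an AttributeError.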

Content sourced from Stack Exchange, question author Rakesh moorthy
