How do I use Python Beautiful Soup to extract text, including the summary inside a <p> tag, from HTML?
Hey there! Let's tackle both of your Beautiful Soup questions with practical examples based on the HTML snippet you shared.
First, make sure you've got the library installed: run pip install beautifulsoup4 (you'll also need requests if you're pulling HTML from a live website, but we'll focus on local/already fetched HTML here).
The core idea is to parse the HTML into a Beautiful Soup object, then target the elements you want using tag names, classes, IDs, or other attributes. Here are common scenarios:
- Extract text from a specific single element: Use soup.find() to locate the element, then .get_text() to pull its content. For example, to get the "Summary" text from your HTML:

      summary_label = soup.find('span', class_='text-bolder text-larger').get_text(strip=True)

- Extract text from multiple elements: Use soup.find_all() to grab all matching elements, then loop through them to collect text:

      all_paragraphs = soup.find_all('p')
      for p in all_paragraphs:
          print(p.get_text(strip=True))

- Extract all text from a parent element: If you want all text inside a container (like a <div>), call .get_text() directly on that parent element; this will combine text from all its child tags.
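To make the last scenario concrete, here is a minimal sketch of calling .get_text() on a container. The markup and the names sample_html, container, joined, and spaced are made up for illustration; the separator argument controls what is placed between the text pieces of the child tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the idea
sample_html = (
    "<div id='box'><h2>Title</h2>"
    "<p>First paragraph.</p><p>Second paragraph.</p></div>"
)

soup = BeautifulSoup(sample_html, "html.parser")
container = soup.find("div", id="box")

# Without a separator, text from adjacent child tags is joined directly
joined = container.get_text(strip=True)

# With a separator, the text of each child tag stays apart
spaced = container.get_text(separator=" ", strip=True)

print(joined)
print(spaced)
```

Passing a separator is usually what you want when the container holds several block-level tags, since otherwise the last word of one tag runs into the first word of the next.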
Your HTML snippet has a <p> tag with the summary content you want. Here's exactly how to pull that text:
    from bs4 import BeautifulSoup

    # Your provided HTML content
    html = """
    <div>
      <div class="o-media__body">
        <span class="text-bolder text-larger">Summary</span>
      </div>
      <div>
        <p>Hello, I m from Europe Macedonia, I came to USA 12 years ago, i got my citizenship 7 years ago, I m very happy person, i like to help people, I don't like to change jobs.In my life I worked only 3 jobs, First job I worked as a Nurse in Macedonia ...</p>
      </div>
    </div>
    """

    # Parse the HTML
    soup = BeautifulSoup(html, 'html.parser')

    # Locate the <p> tag and extract its text
    # Use find() for the first matching <p>, or find_all() if there are multiple
    summary_text = soup.find('p').get_text(strip=True)

    # Print or use the extracted text
    print(summary_text)
What's happening here?
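- soup.find('p') finds the first <p> tag in your HTML. If there were multiple <p> tags and you wanted all of them, swap this with soup.find_all('p') and loop through the results. Note that find() returns None when nothing matches, so calling .get_text() on the result would raise an AttributeError in that case.
- .get_text(strip=True) strips leading and trailing whitespace from each piece of text before joining the pieces together. It does not collapse runs of spaces inside a piece of text; if you need that, normalize the result afterwards, for example with ' '.join(text.split()).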
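A quick way to see how strip=True treats whitespace is to compare it with an explicit normalization step. This is only a sketch on made-up markup; messy_html, raw, stripped, and normalized are illustrative names, not part of your HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with extra spaces inside and around the text
messy_html = "<p>  Hello,   I am   from <b> Europe </b> </p>"

soup = BeautifulSoup(messy_html, "html.parser")
p = soup.find("p")

# Text exactly as it appears in the markup
raw = p.get_text()

# Each text piece is stripped at its edges, then the pieces are joined
stripped = p.get_text(strip=True)

# Join pieces with a space, then collapse every run of whitespace
normalized = " ".join(p.get_text(separator=" ").split())

print(repr(raw))
print(repr(stripped))
print(repr(normalized))
```

In the stripped result the inner runs of spaces survive and "from" is glued to "Europe", because the pieces are joined without a separator. The normalized result has single spaces throughout.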
If you need to target a <p> tag that's nested inside a specific parent (like the second <div> in your example), you can narrow it down further:
    # Target the <p> inside the <div> that follows the "Summary" header <div>
    header_div = soup.find('div', class_='o-media__body')
    summary_text = header_div.find_next_sibling('div').find('p').get_text(strip=True)
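If you prefer CSS selectors, the same narrowing can be sketched with select_one(). This assumes the structure from your snippet (a header <div> with class o-media__body immediately followed by the <div> that holds the <p>); the paragraph text below is a shortened placeholder, and p_tag is an illustrative name.

```python
from bs4 import BeautifulSoup

# Same structure as the snippet above, with placeholder paragraph text
html = """
<div>
  <div class="o-media__body">
    <span class="text-bolder text-larger">Summary</span>
  </div>
  <div>
    <p>Example summary text.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "+" selects the <div> right after the header <div>; "> p" selects its child <p>
p_tag = soup.select_one("div.o-media__body + div > p")

# select_one() returns None when nothing matches, so guard before get_text()
summary_text = p_tag.get_text(strip=True) if p_tag is not None else None

print(summary_text)
```

The None check matters on real pages, where the layout may differ from one profile to the next and a missing element would otherwise raise an AttributeError.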
Content sourced from Stack Exchange, question author Rakesh moorthy




