Help needed: scraping meta tags with BS4 or Newspaper3k fails in Python

阿华AIGC实验室

2026-5-7

Fixing BeautifulSoup failing to find meta tags

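I know exactly how frustrating this is: you have read everything you could find, tried every approach, swapped parsers, even brought in Newspaper3k, and the meta tags that clearly exist still come back as "No ... given" every time.

To recap your situation: the meta tags you want are properly closed and you have confirmed they are in the page, yet neither looping nor any of the different lookup styles returns them.

Your target HTML snippet: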

<meta name="description" content="Here is an exclusive we just got in regarding toda...." />
<meta property="og:description" content="Here is an exclusive we just got in regarding toda...." />
<meta property="article:section" content="Breaking News" />

Your original code:

# Import requisite libraries
from bs4 import BeautifulSoup
# Start it up (and note I have also tried lxml and html.parser)
soup = BeautifulSoup(corpus, 'html5lib')
# corpus is holding data from Newspaper3k. This aspect works.
# Following is just me trying different ways to find the same 2 things:
# Retrieve description AKA summary
description = soup.find("meta", property="og:description") # 1st way
summary = soup.find("meta", attrs={'name': "description"}) # 2nd way
# Retrieve category AKA section
category = soup.find("meta", property='article:section') # 1st way
section = soup.find("meta", attrs={'article': "section"}) # 2nd way
# Test and return result
print(description["content"] if description else "No description given")
print(summary["content"] if summary else "No summary given")
print(category["content"] if category else "No category given")
print(section["content"] if section else "No section given")

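Root cause analysis

My guess is that the main trap is **where corpus comes from**. You say corpus holds data from Newspaper3k, but note that Newspaper3k extracts the page's body text by default. If you initialize BeautifulSoup with that processed text (for example article.text), it contains no meta tags at all, because those tags live in the page head and were never part of the body text.

Separately, your last lookup, attrs={'article': "section"}, is wrong. It searches for an attribute named article with the value section, but the target tag's attribute is property="article:section". That lookup would find nothing even on the full HTML, which is one more reason section comes back empty.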
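Both effects can be reproduced without touching the network. The sketch below is only an illustration: full_html and body_text are made-up stand-ins for a real page and for extracted text such as article.text, and it uses the built-in html.parser backend so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page that reuses the meta tag shapes from the question.
full_html = """<html><head>
<meta name="description" content="Example summary" />
<meta property="og:description" content="Example summary" />
<meta property="article:section" content="Breaking News" />
</head><body><p>Body text of the story.</p></body></html>"""

# Stand-in for extracted body text (assumption: similar to article.text).
body_text = "Body text of the story."

soup_full = BeautifulSoup(full_html, "html.parser")
soup_text = BeautifulSoup(body_text, "html.parser")

# Plain text contains no tags, so the lookup has nothing to match.
found_in_text = soup_text.find("meta", attrs={"property": "og:description"})

# Searches for an attribute literally named "article"; no tag has one.
wrong_lookup = soup_full.find("meta", attrs={"article": "section"})

# Matches the attribute that is actually present on the tag.
right_lookup = soup_full.find("meta", attrs={"property": "article:section"})

print(found_in_text)            # None
print(wrong_lookup)             # None
print(right_lookup["content"])  # Breaking News
```

If your own corpus behaves like body_text here, the input is the problem and no choice of parser will bring the tags back.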

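Solution

1. Use the raw HTML from Newspaper3k

The Newspaper3k Article object has an html attribute that stores the page's complete raw HTML after download(). Initialize BeautifulSoup with that instead of the processed text.

2. Fix the incorrect lookup

Change attrs={'article': "section"} to attrs={'property': "article:section"}, or pass property as a keyword argument directly.

Revised code example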

from bs4 import BeautifulSoup
from newspaper import Article

# Assuming this is how you fetch the article
target_url = "your target page URL"
article = Article(target_url)
article.download()
article.parse()

# Key step: initialize BeautifulSoup with the raw HTML
soup = BeautifulSoup(article.html, 'html5lib')  # use article.html here instead of corpus

# Look up each meta tag
description = soup.find("meta", property="og:description")
summary = soup.find("meta", attrs={'name': "description"})
category = soup.find("meta", property='article:section')
section = soup.find("meta", attrs={'property': "article:section"})  # corrected lookup

# Print the results
print(description["content"] if description else "No description given")
print(summary["content"] if summary else "No summary given")
print(category["content"] if category else "No category given")
print(section["content"] if section else "No section given")
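Since some sites put the key in name and others in property, the repeated lookups can be folded into one helper. This is a minimal sketch, not part of BeautifulSoup or Newspaper3k: meta_content is a hypothetical name, and sample_html is invented data used only to show the behavior.

```python
from bs4 import BeautifulSoup


def meta_content(soup, key, default=None):
    """Return the content of the first <meta> whose name or property equals key."""
    for attr in ("name", "property"):
        tag = soup.find("meta", attrs={attr: key})
        # Skip tags that match the key but carry no content attribute.
        if tag is not None and tag.has_attr("content"):
            return tag["content"]
    return default


# Invented sample input; in the real script this would be article.html.
sample_html = """<html><head>
<meta name="description" content="Example summary" />
<meta property="article:section" content="Breaking News" />
</head><body></body></html>"""

soup = BeautifulSoup(sample_html, "html.parser")

print(meta_content(soup, "description"))                 # Example summary
print(meta_content(soup, "article:section"))             # Breaking News
print(meta_content(soup, "og:title", "No title given"))  # No title given
```

Returning a default instead of indexing the tag directly also avoids a TypeError when a tag is missing, which is what the conditional expressions in the code above guard against.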

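Additional troubleshooting tips

First print corpus (or article.html) and confirm it really contains the meta tags you are looking for. If it does not, the problem is in how the HTML was obtained, not in the parsing.
You can also call print(soup.prettify()) to dump the parsed HTML structure and check whether the parser recognized the meta tags. If one parser gives odd results, compare against another one such as lxml or html.parser; html5lib is the most lenient of the three but also the slowest.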
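For the first check, reading a full page dump by eye is tedious. One option is a small cross-check that lists every meta tag in the raw string using only the standard library, independent of BeautifulSoup. MetaCollector and list_meta_tags are hypothetical names for this sketch, and the sample strings are invented.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here by default.
        if tag == "meta":
            self.metas.append(dict(attrs))


def list_meta_tags(html):
    """Return a list of attribute dicts, one per <meta> tag in html."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.metas


# Invented inputs: one with a head section, one with body text only.
with_head = (
    '<head><meta name="description" content="Example summary" />'
    '<meta property="article:section" content="Breaking News" /></head>'
)
text_only = "Just the body text of the story."

for attrs in list_meta_tags(with_head):
    print(attrs)
print(list_meta_tags(text_only))  # []
```

An empty list for your real corpus means the tags never reached the parser; a non-empty list that BeautifulSoup still cannot match points back to the lookup arguments.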

The question originates from Stack Exchange; asked by Sage.