Help: scraping meta tags with BS4 or Newspaper3k fails in Python
Fixing BeautifulSoup failing to find meta tags
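I know this kind of head-scratching all too well: you dig through the docs, try every approach, swap parsers, even bring in Newspaper3k, and still cannot grab meta tags that are clearly there. Every run prints "No ... given", which is really frustrating.
First, a recap of your situation: the meta tags you want are all properly closed and you have confirmed that they exist, yet neither looping nor any of the lookup variants returns the right result.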
Your target HTML snippet:

    <meta name="description" content="Here is an exclusive we just got in regarding toda...." />
    <meta property="og:description" content="Here is an exclusive we just got in regarding toda...." />
    <meta property="article:section" content="Breaking News" />
Your original code:

    # Import requisite libraries
    from bs4 import BeautifulSoup

    # Start it up (and note I have also tried lxml and html.parser)
    soup = BeautifulSoup(corpus, 'html5lib')  # corpus is holding data from Newspaper3k. This aspect works.

    # Following is just me trying different ways to find the same 2 things:
    # Retrieve description AKA summary
    description = soup.find("meta", property="og:description")  # 1st way
    summary = soup.find("meta", attrs={'name': "description"})  # 2nd way

    # Retrieve category AKA section
    category = soup.find("meta", property='article:section')  # 1st way
    section = soup.find("meta", attrs={'article': "section"})  # 2nd way

    # Test and return result
    print(description["content"] if description else "No description given")
    print(summary["content"] if summary else "No summary given")
    print(category["content"] if category else "No category given")
    print(section["content"] if section else "No section given")
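Root cause analysis
My guess is that the biggest pitfall is **where corpus comes from**. You say corpus comes from Newspaper3k, but note that Newspaper3k's parsing step extracts the body text of the article. If you initialize BeautifulSoup with that processed text (for example article.text), it contains none of the meta tags from the page head, because those tags were never part of the body.
In addition, your last lookup, attrs={'article': "section"}, is wrong. The target meta tag carries the attribute property="article:section", not article="section", which is one more reason section is never found.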
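To make both causes concrete, here is a minimal sketch. It assumes a made-up page (`full_html`) and a made-up extracted body (`body_text`) as stand-ins for what Newspaper3k would hold; neither is the asker's real data. It uses the built-in `html.parser` builder only so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the full page source (not the asker's real page).
full_html = """<html><head>
<meta name="description" content="Short summary" />
<meta property="og:description" content="Short summary" />
<meta property="article:section" content="Breaking News" />
</head><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"""

# Hypothetical stand-in for extracted body text such as article.text.
body_text = "First paragraph.\nSecond paragraph."

soup_full = BeautifulSoup(full_html, "html.parser")
soup_text = BeautifulSoup(body_text, "html.parser")

# The extracted text has no <head> at all, so there is no tag to find.
found_in_text = soup_text.find("meta", property="og:description")

# The full HTML still contains the tag.
found_in_full = soup_full.find("meta", property="og:description")

# This filter looks for an attribute literally named "article", which no tag has.
wrong_lookup = soup_full.find("meta", attrs={"article": "section"})

# The attribute is named "property" and its value is "article:section".
right_lookup = soup_full.find("meta", attrs={"property": "article:section"})

print(found_in_text)
print(found_in_full["content"])
print(wrong_lookup)
print(right_lookup["content"])
```

If your own run behaves like `soup_text` here, the input string is the problem rather than the parser.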
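Solution
1. Use Newspaper3k's raw HTML
Newspaper3k's Article object has an html attribute that holds the complete raw HTML of the page. Initialize BeautifulSoup with that instead of the processed text.
2. Fix the incorrect lookup
Change attrs={'article': "section"} to attrs={'property': "article:section"}, or pass it directly as the property keyword argument.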
Revised code example

    from bs4 import BeautifulSoup
    from newspaper import Article

    # Assuming you fetch the article like this
    target_url = "your target page URL"
    article = Article(target_url)
    article.download()
    article.parse()

    # Key point: initialize BeautifulSoup with the raw HTML
    soup = BeautifulSoup(article.html, 'html5lib')  # use article.html here instead of corpus

    # Look up each meta tag correctly
    description = soup.find("meta", property="og:description")
    summary = soup.find("meta", attrs={'name': "description"})
    category = soup.find("meta", property='article:section')
    section = soup.find("meta", attrs={'property': "article:section"})  # corrected lookup

    # Print the results
    print(description["content"] if description else "No description given")
    print(summary["content"] if summary else "No summary given")
    print(category["content"] if category else "No category given")
    print(section["content"] if section else "No section given")
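If you end up reading several meta tags, the repeated find-then-fallback pattern can be folded into one small helper. The sketch below is only an illustration: `get_meta` is a hypothetical name, not part of bs4 or Newspaper3k, and `sample_html` is made-up markup standing in for article.html.

```python
from bs4 import BeautifulSoup


def get_meta(soup, key, default=None):
    """Return the content of the first <meta> whose property or name equals key."""
    for attr in ("property", "name"):
        tag = soup.find("meta", attrs={attr: key})
        # Skip tags that match but carry no content attribute.
        if tag is not None and tag.has_attr("content"):
            return tag["content"]
    return default


# Hypothetical sample markup standing in for article.html.
sample_html = """<html><head>
<meta name="description" content="Short summary" />
<meta property="og:description" content="Short summary" />
<meta property="article:section" content="Breaking News" />
</head><body><p>Body</p></body></html>"""

soup = BeautifulSoup(sample_html, "html.parser")

print(get_meta(soup, "og:description", "No description given"))
print(get_meta(soup, "description", "No summary given"))
print(get_meta(soup, "article:section", "No category given"))
print(get_meta(soup, "article:author", "No author given"))
```

Checking both `property` and `name` means the caller does not have to remember which attribute a given site uses for a given key.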
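Additional troubleshooting tips
- Print the contents of corpus (or article.html) first and confirm that it really contains the meta tags you are looking for. If it does not, the problem lies in how the HTML is obtained, not in the parsing.
- Use print(soup.prettify()) to dump the parsed HTML structure and check whether the parser recognized the meta tags. If one parser handles the page badly, try another such as lxml, since parsers differ in how they repair imperfect markup.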
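The first check can also be done without BeautifulSoup, which helps separate "the string has no meta tags" from "my lookup is wrong". Below is a rough sketch using only the standard library; `MetaCollector` and `list_meta_tags` are hypothetical names, and the sample strings are invented for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag seen in the input."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag == "meta":
            self.metas.append(dict(attrs))


def list_meta_tags(html):
    """Return one dict of attributes per <meta> tag found in html."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.metas


# Hypothetical inputs: one with a head section, one with body text only.
page_source = (
    '<html><head><meta property="article:section" content="Breaking News" />'
    "</head><body><p>Body</p></body></html>"
)
body_only = "Body text without any markup."

print(list_meta_tags(page_source))
print(list_meta_tags(body_only))
```

An empty list for your corpus would point at the HTML retrieval step, as described in the first tip.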
The question comes from Stack Exchange; asked by Sage.




