Help: Python HTML parser cannot get the link and publication date from an RSS feed

阿华AIGC实验室

2026-5-29

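Question: cannot get the article link and publication date from the Google News RSS feed

I am trying to parse the Google News RSS feed. I can already extract fields such as the title and the publisher, but I cannot get the article's actual link or its publication date. Here is my code: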

import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
import re
#import xml.etree.ElementTree as ET
rss_url="https://news.google.com/news/rss/search/section/q/australia/australia?hl=en-AU&gl=AU&ned=au"
Client=urlopen(rss_url)
xml_page=Client.read()
Client.close()
soup_page=soup(xml_page,"html.parser")
#soup_page=ET.parse(xml_page)
news_list=soup_page.findAll("item")
# Print news title, url and publish date
for news in news_list:
    #text=news.text
    title=news.title.text
    link=news.link.text
    pubdate=news.pubDate.text
    description=news.description.text
    publisher = re.findall('<font color="#6f6f6f">(.*?)</font>', description)
    article_link=link
    article_info=[title,publisher,link,pubdate]
    print(article_info)

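Solution

Root cause

Your code has two main problems:

It parses XML-formatted RSS with an HTML parser. html.parser lowercases tag names, so the case-sensitive XML tag <pubDate> is stored as pubdate and news.pubDate finds nothing. It also treats <link> as an empty HTML element, so news.link.text comes back as an empty string.
The content of the <link> tag in Google News RSS is a Google redirect link, not the article's original link; the real link is nested in the HTML content of <description>.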
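
The first problem can be reproduced without any network access. The following is a minimal sketch that uses only the standard library: html.parser.HTMLParser (the parser that BeautifulSoup's "html.parser" option builds on) and xml.etree.ElementTree. SAMPLE_ITEM and TagCollector are made-up names for illustration, and the sample item is not taken from the real feed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical RSS item used only for this demonstration.
SAMPLE_ITEM = (
    "<item>"
    "<link>https://example.com/story</link>"
    "<pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>"
    "</item>"
)


class TagCollector(HTMLParser):
    """Records tag names exactly as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(SAMPLE_ITEM)
collector.close()
html_tags = collector.tags

# An XML parser keeps the original spelling of each tag name.
xml_tags = [child.tag for child in ET.fromstring(SAMPLE_ITEM)]

print(html_tags)  # tag names as seen by the HTML parser (lowercased)
print(xml_tags)   # tag names as seen by the XML parser (case preserved)
```

The HTML parser reports pubdate while the XML parser reports pubDate, which is why a lookup by the original name only succeeds on an XML parse.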

Revised code

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup as soup

rss_url = "https://news.google.com/news/rss/search/section/q/australia/australia?hl=en-AU&gl=AU&ned=au"
Client = urlopen(rss_url)
xml_page = Client.read()
Client.close()

# Parse the RSS with the XML parser (requires the lxml package) so tag names keep their case
soup_page = soup(xml_page, "xml")
news_list = soup_page.find_all("item")

for news in news_list:
    # Fall back to a placeholder if <title> is missing
    title_tag = news.find("title")
    title = title_tag.text.strip() if title_tag else "title unavailable"

    # Fall back to a placeholder if <pubDate> is missing
    pubdate_tag = news.find("pubDate")
    pubdate = pubdate_tag.text.strip() if pubdate_tag else "date unavailable"

    # The description holds an HTML fragment; parse it to extract the article link
    description_tag = news.find("description")
    description = description_tag.text if description_tag else ""
    desc_soup = soup(description, "html.parser")
    link_tag = desc_soup.find("a", href=True)
    article_link = link_tag["href"] if link_tag else "link unavailable"

    # The publisher is the grey <font> text inside the description, if present
    publisher_match = re.findall('<font color="#6f6f6f">(.*?)</font>', description)
    publisher = publisher_match[0].strip() if publisher_match else "publisher unavailable"

    article_info = [title, publisher, article_link, pubdate]
    print(article_info)
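
If installing lxml is not an option, the same idea can be sketched with xml.etree.ElementTree, the module that is commented out in the question. Note that ET.parse expects a file name or file object, so bytes returned by urlopen(...).read() go through ET.fromstring instead. This is only a sketch under assumptions: SAMPLE_FEED is invented sample data standing in for the downloaded feed, and parse_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the second item deliberately lacks <pubDate>.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
"""


def parse_items(xml_bytes):
    """Return one dict per <item>, with None for any missing field."""
    root = ET.fromstring(xml_bytes)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child element does not exist
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
        })
    return items


items = parse_items(SAMPLE_FEED)
for entry in items:
    print(entry)
```

Because findtext returns None for a missing child, each field can be checked before use instead of raising an AttributeError halfway through the loop.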

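Key changes explained

Parser switch: html.parser is replaced with the xml parser (BeautifulSoup's "xml" option requires the lxml package). RSS is standard XML, where tag names are case sensitive. The XML parser keeps names such as pubDate intact, whereas the HTML parser lowercases them and applies HTML rules to tags like <link>, which causes tags to be missed or their content to be lost.
Publication date handling: find("pubDate") locates the tag by its exact name, and a None check prevents a crash when an RSS item is malformed and has no date.
Real link extraction: the <link> in Google News RSS is a redirect that goes through a Google News page. The article's original link is nested in the HTML content of <description>, so the description is parsed a second time with BeautifulSoup and the href attribute of the <a> tag is extracted.
Robustness: extracted fields fall back to a placeholder value when the tag or match is missing, so the program keeps running when it meets a malformed RSS item.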
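
The link extraction and date handling described above can also be sketched with the standard library alone. RSS dates use the RFC 822 style, which email.utils.parsedate_to_datetime can convert to a datetime. The names FirstLinkFinder, first_href, and parse_pubdate are hypothetical helpers, and sample_description is invented data that only imitates the shape described in the answer, not the real feed's markup.

```python
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class FirstLinkFinder(HTMLParser):
    """Remembers the href of the first <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


def first_href(description_html):
    """Return the first link in an HTML fragment, or None if there is none."""
    finder = FirstLinkFinder()
    finder.feed(description_html)
    finder.close()
    return finder.href


def parse_pubdate(text):
    """Convert an RFC 822 style date string to a datetime, or None on failure."""
    if not text:
        return None
    try:
        return parsedate_to_datetime(text.strip())
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError
        return None


# Hypothetical description fragment for illustration only.
sample_description = (
    '<a href="https://example.com/story">Example headline</a>'
    '<font color="#6f6f6f">Example Publisher</font>'
)

print(first_href(sample_description))
print(parse_pubdate("Mon, 01 Jan 2018 10:00:00 GMT"))
print(parse_pubdate("not a date"))
```

Returning None instead of a placeholder string leaves it to the caller to decide how a missing link or date should be displayed, and a real datetime makes sorting or filtering by date straightforward.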

The question in this post comes from Stack Exchange; it was asked by Ilumtics.
