How to Extract Only the First Href Link from HTML with Python Beautiful Soup


阿华AIGC实验室

2026-5-28

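Extracting the First Href from an HTML Fragment

Your current code iterates over every <a> tag whose href matches the regular expression and prints each one. To get only the first matching link, there are two simple changes you can make:

Method 1: Use `find()` instead of `findAll()`

find() returns the first matching element directly instead of a list, so no loop is needed: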

# Import only the modules this snippet actually uses
import re
import bs4 as bs
html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
soup = bs.BeautifulSoup(html,'lxml')
# Use find() to get the first matching <a> tag
first_link = soup.find('a', attrs={'href': re.compile("^http://")})
if first_link is not None:  # find() returns None when nothing matches, so check before using it
    temp = first_link.get('href')
    print(temp)

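Method 2: Take the first element of the `findAll()` result

In current Beautiful Soup releases this method is spelled find_all(); findAll() is the older name for the same method. If you would rather keep using it, take the first element of the list it returns, and check that the list is not empty first so you do not hit an IndexError: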

# Import only the modules this snippet actually uses
import re
import bs4 as bs
html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
soup = bs.BeautifulSoup(html,'lxml')
# findAll() returns a list of every matching <a> tag
links = soup.findAll('a', attrs={'href': re.compile("^http://")})
if links:  # make sure the list is not empty before indexing
    temp = links[0].get('href')
    print(temp)
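
If you want to stay with the list-returning call but avoid collecting every match, find_all() also accepts a limit argument that stops the search after that many results. The following is a minimal sketch rather than part of the original answer: the sample HTML and the example.com URLs are made up for illustration, and it uses the built-in "html.parser" so that it runs without lxml installed.

```python
import re
import bs4 as bs

# Hypothetical sample markup, not the MetroLyrics fragment from the question
html = (
    '<ul>'
    '<li><a href="/relative/path">skipped: href does not start with http://</a></li>'
    '<li><a href="http://example.com/first">first</a></li>'
    '<li><a href="http://example.com/second">second</a></li>'
    '</ul>'
)
soup = bs.BeautifulSoup(html, 'html.parser')

# limit=1 stops the search after the first match, so the list has at most one item
limited = soup.find_all('a', attrs={'href': re.compile("^http://")}, limit=1)
first_href = limited[0].get('href') if limited else None
print(first_href)
```

With limit=1 the result is still a list, so the emptiness check is needed just as in Method 2.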

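Additional notes

Both methods include an if check so that the code does not fail when the HTML contains no matching <a> tag. Without it, Method 1 would raise an AttributeError (find() returns None) and Method 2 would raise an IndexError (the list is empty).
If you are certain the HTML always contains a matching link, you can drop the if check. When scraping real pages, though, it is common to find no matching element, so keeping the check is the safer choice.

The question behind this content comes from Stack Exchange; it was asked by Ian-Fogelman.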
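
To make the missing-link case concrete, here is one possible way to wrap the lookup in a small helper that returns None instead of raising. This is only a sketch: the function name first_http_href, the sample HTML strings, and the choice of the built-in "html.parser" are assumptions made for the example, not something from the original question.

```python
import re
import bs4 as bs

HTTP_LINK = re.compile(r"^http://")


def first_http_href(html):
    """Return the href of the first <a> whose href starts with http://, or None."""
    soup = bs.BeautifulSoup(html, 'html.parser')
    link = soup.find('a', attrs={'href': HTTP_LINK})
    # find() gives None when nothing matches, so guard before calling .get()
    return link.get('href') if link is not None else None


# Hypothetical inputs: one with two links, one with none
with_links = '<p><a href="http://example.com/a">A</a><a href="http://example.com/b">B</a></p>'
without_links = '<p>no links here</p>'

print(first_http_href(with_links))
print(first_http_href(without_links))
```

The caller can then test the return value against None once, instead of repeating the check at every call site.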
