
How to extract only the first href link from HTML with Python Beautiful Soup

Extracting the first href from an HTML fragment

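Your current code iterates over every <a> tag that matches the regular expression and prints each href. To get only the first matching link, you can make either of two simple changes: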

Method 1: Use find() instead of findAll()

find() returns the first matching element directly instead of a list, so no loop is needed:

# Import only the modules this snippet uses
import bs4 as bs
import re
html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
soup = bs.BeautifulSoup(html,'lxml')
# Use find() to get the first matching <a> tag
first_link = soup.find('a', attrs={'href': re.compile("^http://")})
if first_link:  # Check that an element was found before reading its href
    temp = first_link.get('href')
    print(temp)

Method 2: Take the first element of the findAll() result

If you would rather keep findAll(), take the first element of the list it returns (check that the list is not empty first, to avoid an IndexError):

# Import only the modules this snippet uses
import bs4 as bs
import re
html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
soup = bs.BeautifulSoup(html,'lxml')
links = soup.findAll('a', attrs={'href': re.compile("^http://")})
if links:  # Make sure the list is not empty
    temp = links[0].get('href')
    print(temp)
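
As a variation on the findAll() approach above, the search can be told to stop after the first hit by passing limit=1 to find_all() (the current name for findAll() in Beautiful Soup 4). The following is only a sketch: the sample HTML and the example.com URLs are made up for illustration, and it uses the built-in "html.parser" backend rather than lxml so that it runs without extra dependencies.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, not the HTML from the question
html = (
    '<ul>'
    '<li><a href="http://example.com/one">one</a></li>'
    '<li><a href="http://example.com/two">two</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

# limit=1 stops the search after the first match; the result is still a list
links = soup.find_all("a", attrs={"href": re.compile(r"^http://")}, limit=1)

# The list may be empty, so guard before indexing
first_href = links[0].get("href") if links else None
print(first_href)
```

The result is still a list, so the emptiness check is needed just as before.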

Additional notes

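  • Both versions include an if check to handle HTML that contains no matching <a> tag. Without it, the find() version would raise AttributeError (find() returns None) and the findAll() version would raise IndexError (the list is empty).
  • If you are certain the HTML always contains a matching link, you can drop the if check. In real scraping, though, missing matches are common, so keeping the check is the safer choice.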
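
One way to keep the no-match handling described in these notes in a single place is to wrap the lookup in a small helper. This is a minimal sketch, not part of the original answer: the function name first_http_href, the sample markup, and the example.com URLs are all hypothetical, and "html.parser" is used instead of lxml so the snippet has no extra dependency.

```python
import re
from typing import Optional

from bs4 import BeautifulSoup


def first_http_href(html: str) -> Optional[str]:
    """Return the href of the first <a> whose href starts with http://, or None."""
    soup = BeautifulSoup(html, "html.parser")
    first_link = soup.find("a", attrs={"href": re.compile(r"^http://")})
    # find() returns None when nothing matches, so guard before calling .get()
    if first_link is None:
        return None
    return first_link.get("href")


# Hypothetical sample: the first <a> has a relative href and is skipped
sample = (
    '<p><a href="/relative">skip</a>'
    '<a href="http://example.com/a">A</a>'
    '<a href="http://example.com/b">B</a></p>'
)
print(first_http_href(sample))
print(first_http_href("<p>no links here</p>"))
```

Callers then test the return value against None instead of repeating the guard at every call site.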

The question comes from Stack Exchange; it was asked by Ian-Fogelman.
