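How to extract only the first href link from HTML with Python Beautiful Soup
Extracting the first href from an HTML fragment
Your current code loops over every <a> tag whose href matches the regular expression and prints each one. To get only the first matching link, there are two simple changes you can make:
Method 1: use find() instead of findAll()
find() returns the first matching element directly instead of a list, so no loop is needed: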
# import the modules this snippet actually uses
import re
import bs4 as bs

html = '<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'

soup = bs.BeautifulSoup(html, 'lxml')

# Use find() to get the first <a> tag whose href starts with http://
first_link = soup.find('a', attrs={'href': re.compile("^http://")})
if first_link is not None:  # find() returns None when nothing matches
    temp = first_link.get('href')
    print(temp)
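To see the "first match only" behaviour of find() on something easier to read, here is a minimal sketch using a smaller, well-formed fragment. The sample markup and the example.com URLs are made up for illustration and are not the HTML from the question; it also assumes the built-in "html.parser" backend instead of lxml so that only bs4 needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the HTML from the question.
sample_html = (
    '<ul>'
    '<li><a href="/relative/path">skipped: not an http:// link</a></li>'
    '<li><a href="http://example.com/first">first</a></li>'
    '<li><a href="http://example.com/second">second</a></li>'
    '</ul>'
)

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first <a> whose href matches the pattern, or None.
first_link = soup.find("a", attrs={"href": re.compile(r"^http://")})
first_href = first_link.get("href") if first_link is not None else None
print(first_href)  # http://example.com/first
```

The relative link is skipped because it does not match the pattern, and the second absolute link is never visited.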
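Method 2: take the first element of the findAll() result
If you would rather keep findAll() (the legacy spelling of find_all(), which Beautiful Soup 4 still accepts), take the first element of the returned list. Check that the list is not empty first, so the index lookup cannot go out of range: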
# import the modules this snippet actually uses
import re
import bs4 as bs

html = '<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'

soup = bs.BeautifulSoup(html, 'lxml')

# findAll() returns a list of every matching <a> tag
links = soup.findAll('a', attrs={'href': re.compile("^http://")})
if links:  # make sure the list is not empty before indexing
    temp = links[0].get('href')
    print(temp)
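Additional notes
- Both methods include an if check so the code does not fail when the HTML contains no matching <a> tag. In that case find() returns None, so calling .get() on it raises AttributeError, and indexing the empty findAll() result raises IndexError.
- If you are certain the HTML always contains a matching link, you can drop the if check. Pages scraped in practice often have no matching element, though, so keeping the check is the safer choice.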
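The two failure modes mentioned in the notes can be reproduced with a short sketch. The fragment below is a made-up example with no http:// link, and it assumes the built-in "html.parser" backend rather than lxml; find_all() is used as the current name for findAll().

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment that contains no absolute http:// link.
soup = BeautifulSoup('<p><a href="/local">local</a></p>', "html.parser")
pattern = re.compile(r"^http://")

missing = soup.find("a", attrs={"href": pattern})    # no match: None
empty = soup.find_all("a", attrs={"href": pattern})  # no match: empty list
print(missing)     # None
print(len(empty))  # 0

# Without the guard, each approach fails in its own way.
errors = []
try:
    missing.get("href")
except AttributeError as exc:
    errors.append(type(exc).__name__)
try:
    empty[0].get("href")
except IndexError as exc:
    errors.append(type(exc).__name__)
print(errors)  # ['AttributeError', 'IndexError']
```

This is why the guard differs slightly between the two methods: one protects against None, the other against an empty list.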
The question comes from Stack Exchange and was asked by Ian-Fogelman.




