
Scraping data for a specific TripAdvisor hotel by modifying the URL in a for loop to traverse pages

Hey fellow dev! It looks like you're trying to scrape reviews for the Hawthorn Suites by Wyndham Wichita East on TripAdvisor. Let's walk through how to handle the pagination and get the data you need.

Core idea: constructing the paginated URLs

First, let's break down the pagination pattern for this hotel's review pages:

  • The base review URL is:
    https://www.tripadvisor.com/Hotel_Review-g39143-d92240-Reviews-Hawthorn_Suites_by_Wyndham_Wichita_East-Wichita_Kansas.html
    
  • Pagination is controlled by inserting -or{X}- right after the -Reviews segment, so that it sits between "Reviews" and the hotel name. X is the review offset, a multiple of 5 (since each page shows 5 reviews).
    • Page 1 (default): No -or{X}- segment (equivalent to -or0-)
    • Page 2: -or5-
    • Page 3: -or10-
    • And so on, incrementing X by 5 for each subsequent page.

Code example: generating the paginated URLs automatically (Python)

Here's a quick snippet to generate URLs for multiple pages. Adjust the range based on how many pages of reviews exist:

# Base components of the URL
base_review_segment = "https://www.tripadvisor.com/Hotel_Review-g39143-d92240-Reviews"
hotel_identifier = "-Hawthorn_Suites_by_Wyndham_Wichita_East-Wichita_Kansas.html"

# Generate URLs for the first 6 pages (adjust the end value as needed)
for offset in range(0, 30, 5):
    if offset == 0:
        # First page uses the base URL without the offset
        full_url = f"{base_review_segment}{hotel_identifier}"
    else:
        full_url = f"{base_review_segment}-or{offset}{hotel_identifier}"
    print(f"Page URL: {full_url}")

Key things to watch when scraping

Don't forget these important details to avoid getting blocked or missing data:

  • Account for TripAdvisor's anti-scraping measures:
    • Add a valid User-Agent header to mimic a real browser request
    • Add delays between requests (e.g., 2-5 seconds per page) to avoid rate-limiting
    • Consider using proxy IPs if you're scraping a large number of pages
  • Parse reviews effectively: Use libraries like BeautifulSoup or Scrapy to extract key data points (review text, star ratings, reviewer names, stay dates) from each page
  • Stop at the right time: Check whether each page contains reviews. Once you hit a page with no reviews, you can stop generating new URLs.
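
The points above can be combined into one loop. What follows is only a rough sketch, not a tested TripAdvisor scraper: page_url, fetch_html and crawl are hypothetical helper names, the User-Agent string is a placeholder, and extract_reviews is a function you supply yourself (for example a BeautifulSoup parser), because the page's actual HTML structure is not covered here. It uses only the standard library for fetching and assumes the 5-reviews-per-page offset described earlier.

```python
import random
import time
import urllib.request
from typing import Callable, Iterator

BASE = "https://www.tripadvisor.com/Hotel_Review-g39143-d92240-Reviews"
HOTEL = "-Hawthorn_Suites_by_Wyndham_Wichita_East-Wichita_Kansas.html"
# Placeholder User-Agent (an assumption); substitute one from a current browser.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}


def page_url(offset: int) -> str:
    """Return the review-page URL for an offset; offset 0 is the first page."""
    middle = "" if offset == 0 else f"-or{offset}"
    return f"{BASE}{middle}{HOTEL}"


def fetch_html(url: str) -> str:
    """Download one page, sending a browser-like User-Agent header."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(
    extract_reviews: Callable[[str], list],
    fetch: Callable[[str], str] = fetch_html,
    max_pages: int = 50,
    delay: tuple = (2.0, 5.0),
) -> Iterator[tuple]:
    """Yield (url, reviews) for each page, stopping at the first empty page."""
    for page in range(max_pages):
        url = page_url(page * 5)
        reviews = extract_reviews(fetch(url))
        if not reviews:
            break  # a page with no reviews means we ran past the last page
        yield url, reviews
        # Random pause between requests to stay under rate limits.
        time.sleep(random.uniform(*delay))
```

To use it, pass your own parser, e.g. `for url, reviews in crawl(my_parser): ...`, where my_parser takes the page HTML and returns a list of reviews. The max_pages cap is a safety net in case the site keeps returning the last page instead of an empty one.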

The question originally comes from Stack Exchange; asked by CJ090.
