Scraping App Store review pages with Scrapy: unable to retrieve all user ratings
You're running into the App Store's lazy-loaded review list. The initial static HTML contains only the top 3 reviews; the rest load dynamically via AJAX as you scroll down, which is why a static request (a plain Scrapy request or a requests.get() call) returns only those first 3 ratings.
Here are two reliable solutions to get every review's rating:
1. Simulate Browser Scrolling with Selenium
This method mimics human behavior by scrolling the page to load all reviews, then extracts the data once everything's loaded. It's straightforward and doesn't require digging into API details.
First, install Selenium (and make sure you have a ChromeDriver matching your browser version):
pip install selenium
Then use this code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

url = "https://apps.apple.com/us/app/mathy-cool-math-learner-games/id1476596747#see-all/reviews"

# Initialize Chrome driver
driver = webdriver.Chrome()
driver.get(url)

# Scroll to load all reviews
last_scroll_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait for new reviews to load (adjust sleep time if needed)
    time.sleep(2)
    # Check if we've reached the end of the page
    new_scroll_height = driver.execute_script("return document.body.scrollHeight")
    if new_scroll_height == last_scroll_height:
        break
    last_scroll_height = new_scroll_height

# Extract all rating aria-label values
rating_elements = driver.find_elements(
    By.CSS_SELECTOR,
    "figure.we-star-rating.ember-view.we-customer-review__rating.we-star-rating--large"
)
all_ratings = [elem.get_attribute("aria-label") for elem in rating_elements]
print(all_ratings)

driver.quit()
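Once the aria-label strings are collected, you will probably want numbers rather than text. The following is a minimal sketch, assuming the labels look like "5 out of 5" (the format used elsewhere in this answer); `parse_rating` and `summarize` are hypothetical helper names, and labels that do not match are skipped rather than guessed at.

```python
import re
from collections import Counter

# Assumed label shape, e.g. "5 out of 5"; anything else is ignored.
_LABEL_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+out of\s+5\s*$")


def parse_rating(label):
    """Return the numeric rating from an aria-label, or None if it does not match."""
    if not label:  # get_attribute() can return None for a missing attribute
        return None
    match = _LABEL_RE.match(label)
    return float(match.group(1)) if match else None


def summarize(labels):
    """Return (count, average, histogram) for the labels that parse."""
    ratings = [r for r in map(parse_rating, labels) if r is not None]
    if not ratings:
        return 0, None, Counter()
    return len(ratings), sum(ratings) / len(ratings), Counter(ratings)


# Stand-in data; in practice pass the all_ratings list from the Selenium script.
sample = ["5 out of 5", "4 out of 5", None, "1 out of 5", "not a rating"]
count, average, histogram = summarize(sample)
print(count, round(average, 2), dict(histogram))
```

Keeping the parsing separate from the browser code means it can be checked without launching Chrome.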
2. Directly Call App Store's Review API
If you want a faster, more efficient approach, you can bypass the browser entirely by hitting App Store's internal API that serves the reviews.
How to find the API:
- Open your browser's DevTools (F12)
- Go to the Network tab, filter by "XHR"
- Scroll the reviews page; you'll see requests to an endpoint like
https://amp-api.apps.apple.com/v1/catalog/us/apps/1476596747/reviews
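If you scrape more than one app, the endpoint can be derived from the page URL instead of being copied by hand. This is a rough sketch, assuming the endpoint always follows the pattern of the single example above (country code and numeric app id); the API is internal and undocumented, so the template and the hypothetical `reviews_api_url` helper may need adjusting to match what DevTools shows.

```python
import re
from urllib.parse import urlsplit

# Template inferred from the one endpoint seen in DevTools; treat it as an assumption.
API_TEMPLATE = "https://amp-api.apps.apple.com/v1/catalog/{country}/apps/{app_id}/reviews"

# Matches paths such as /us/app/some-app-name/id1476596747
_PATH_RE = re.compile(r"^/([a-z]{2})/app/(?:[^/]+/)?id(\d+)/?$")


def reviews_api_url(page_url):
    """Build the reviews endpoint from an apps.apple.com page URL."""
    path = urlsplit(page_url).path  # drops the #see-all/reviews fragment
    match = _PATH_RE.match(path)
    if match is None:
        raise ValueError(f"Unrecognized App Store URL: {page_url!r}")
    country, app_id = match.groups()
    return API_TEMPLATE.format(country=country, app_id=app_id)


page = "https://apps.apple.com/us/app/mathy-cool-math-learner-games/id1476596747#see-all/reviews"
print(reviews_api_url(page))
```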
Here's a sample code snippet to fetch all reviews via the API (note: you'll need to grab a valid Authorization token from the browser's request headers):
import requests

# Replace with a valid token from your browser's network requests
HEADERS = {
    "Authorization": "Bearer YOUR_AUTH_TOKEN",
    "Accept": "application/json"
}
BASE_API_URL = "https://amp-api.apps.apple.com/v1/catalog/us/apps/1476596747/reviews"

all_ratings = []
offset = 0
limit = 20

while True:
    params = {
        "l": "en-US",
        "offset": offset,
        "limit": limit
    }
    response = requests.get(BASE_API_URL, headers=HEADERS, params=params, timeout=10)
    # Fail loudly on HTTP errors (e.g. a rejected token) instead of
    # silently treating the error body as "no more reviews"
    response.raise_for_status()
    data = response.json()

    reviews = data.get("data", [])
    if not reviews:
        break  # No more reviews to fetch

    # Extract rating and format it to match the aria-label text
    for review in reviews:
        rating = review["attributes"]["rating"]
        all_ratings.append(f"{rating} out of 5")

    offset += limit

print(all_ratings)
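The offset/limit loop is easier to verify when the HTTP call is passed in as a function. Here is a minimal sketch of that idea, assuming the response shape used above (a "data" list whose items carry attributes.rating); `collect_ratings`, `fake_fetch`, and the `max_pages` safety cap are hypothetical names added for illustration, not part of any App Store API.

```python
def collect_ratings(fetch_page, limit=20, max_pages=500):
    """Page through reviews until an empty page comes back.

    fetch_page(offset, limit) must return the decoded JSON payload as a dict.
    max_pages is a safety cap so an unexpected response cannot loop forever.
    """
    ratings = []
    offset = 0
    for _ in range(max_pages):
        payload = fetch_page(offset, limit)
        reviews = payload.get("data", [])
        if not reviews:
            break  # an empty page means there is nothing left to fetch
        for review in reviews:
            rating = review.get("attributes", {}).get("rating")
            if rating is not None:
                ratings.append(rating)
        offset += limit
    return ratings


def fake_fetch(offset, limit):
    """Stand-in for the real HTTP call, serving five reviews in pages."""
    all_reviews = [{"attributes": {"rating": r}} for r in (5, 4, 5, 3, 1)]
    return {"data": all_reviews[offset:offset + limit]}


print(collect_ratings(fake_fetch, limit=2))
```

In real use, `fetch_page` would wrap the `requests.get(...)` call with the headers and params you copied from DevTools and return `response.json()`.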
Quick note on the API method:
- The Authorization token can expire or change, so you'll need to refresh it periodically by checking the browser's network requests.
- API parameters (like limit or regional codes) might vary, so adjust them based on what you see in DevTools.
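Because the token has to be replaced from time to time, it may be more convenient to keep it out of the script. A small sketch of one way to do that, assuming you export the token copied from DevTools into an environment variable; the variable name `APP_STORE_BEARER_TOKEN` and the `build_headers` helper are made up for this example.

```python
import os

# Hypothetical variable name; pick whatever suits your setup.
TOKEN_ENV_VAR = "APP_STORE_BEARER_TOKEN"


def build_headers(environ=os.environ):
    """Build request headers from a token stored outside the script."""
    token = environ.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {TOKEN_ENV_VAR} to a token copied from the browser's request headers"
        )
    return {"Authorization": f"Bearer {token}", "Accept": "application/json"}


# Demonstration with an explicit mapping instead of the real environment.
print(build_headers({TOKEN_ENV_VAR: "example-token"})["Authorization"])
```

Refreshing the token then means updating the environment variable and rerunning, with no code change.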
Which method should you pick?
- Use Selenium if you want a low-maintenance solution that works without needing to reverse-engineer APIs.
- Use the API method if you need faster scraping and don't mind handling token updates.
The question originally comes from Stack Exchange; it was asked by Lucas Magalhães.




