Help needed: 429 Client Error when scraping a website with the Python Requests library
Hey there! A 429 (Too Many Requests) response usually means the server is rate-limiting you or has flagged the traffic as automated. Even with a session and basic headers, Groww's systems are likely treating your requests as non-human. Let's walk through some actionable fixes to get past this:
Key Issues & Solutions
1. Add Random Delays Between Requests
Even two back-to-back requests can trigger rate limits. Adding a small random pause mimics how a real user navigates:
    import time
    import random

    # After fetching the BASE_URL
    time.sleep(random.uniform(1.5, 3.5))  # Random delay between 1.5-3.5 seconds
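If a fixed pause is not enough, one common extension is to retry on 429 with a growing, jittered wait, and to use the server's `Retry-After` header when it is a plain number of seconds. The sketch below is only an illustration: `backoff_delay` and `get_with_backoff` are hypothetical helper names, the `base` and `cap` values are arbitrary, and whether Groww actually sends `Retry-After` is an assumption.

```python
import random
import time


def backoff_delay(attempt, retry_after=None, base=1.5, cap=60.0):
    """Return how many seconds to wait before retry number `attempt` (0-based)."""
    # Prefer the server's own hint when Retry-After is a plain number of seconds.
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff.
    # Exponential growth, capped so the wait never exceeds `cap` seconds.
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter: wait somewhere between half the ceiling and the full ceiling.
    return random.uniform(ceiling / 2, ceiling)


def get_with_backoff(fetch, max_attempts=4, sleep=time.sleep):
    """Call `fetch()` until it returns a non-429 response or attempts run out."""
    response = fetch()
    for attempt in range(max_attempts - 1):
        if response.status_code != 429:
            break
        sleep(backoff_delay(attempt, response.headers.get("Retry-After")))
        response = fetch()
    return response
```

With the names from your script, a call could look like `response = get_with_backoff(lambda: session.get(LISTING_URL, headers=HEADERS, params=PARAMS, timeout=20))`. Passing the request in as a callable keeps the retry logic independent of how the request is built.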
2. Rotate User-Agents
Sticking to one static user-agent is a red flag. Use a list of real, up-to-date user-agents and pick one randomly each run:
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
    ]

    # Update your HEADERS
    HEADERS = {
        'user-agent': random.choice(USER_AGENTS),
        # ... keep other existing headers
    }
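To keep the rotation in one place, the choice can be wrapped in a small helper that returns a fresh headers dict. This is a minimal sketch: `build_headers` is a hypothetical name, and the user-agent strings are the same examples as above.

```python
import random

# Same example user-agent strings as above; swap in current ones as needed.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
]


def build_headers(base_headers, user_agents=USER_AGENTS, rng=random):
    """Return a copy of `base_headers` with a randomly chosen user-agent."""
    headers = dict(base_headers)  # copy, so the caller's dict is not modified
    headers['user-agent'] = rng.choice(user_agents)
    return headers
```

It is probably best to call this once per session rather than once per request: a user-agent that changes while the cookies stay the same may look just as odd as a static one.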
3. Let Requests Session Manage Cookies Automatically
Your current code manually extracts and passes cookies, but requests.Session() already persists cookies across requests. The manual step might miss critical cookies or format them incorrectly, so remove it:
    # Remove these lines entirely:
    # cookies = dict(request.cookies)
    # ... and the `cookies=cookies` parameter in the second GET call

    # Just use the session directly for the second request:
    response = session.get(url=LISTING_URL, headers=HEADERS, params=PARAMS, timeout=20)
4. Add More Browser-Like Headers
Your headers are missing key fields that real browsers send. Update them to include these:
    HEADERS = {
        'user-agent': random.choice(USER_AGENTS),
        'accept-language': 'en,gu;q=0.9,hi;q=0.8',
        # 'br' is only decoded if the brotli (or brotlicffi) package is installed;
        # drop it from this list otherwise
        'accept-encoding': 'gzip, deflate, br',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Referer': BASE_URL,  # Tells the server you came from the main MF page
        'DNT': '1',  # Common "Do Not Track" header from real browsers
        'Connection': 'keep-alive'
    }
5. If All Else Fails: Mimic a Real Browser
Groww might set critical cookies or tokens via JavaScript, which requests (a non-JS client) can't handle. Try using requests-html (a JS-enabled requests wrapper) to render the page like a browser:
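    import random
    import time

    from requests_html import HTMLSession

    # Replace requests.Session() with HTMLSession()
    session = HTMLSession()
    request = session.get(BASE_URL, headers=HEADERS, timeout=20)

    # Runs the page's JavaScript in headless Chromium (downloaded on first use).
    # Note: cookies set by that JavaScript may not be copied back into the
    # session automatically, so check session.cookies if the 429 persists.
    request.html.render()

    time.sleep(random.uniform(1, 3))
    response = session.get(LISTING_URL, headers=HEADERS, params=PARAMS, timeout=20)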
Modified Full Code
Here's how your code looks with all the fixes applied:
    import requests
    import json
    import time
    import random

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
    ]

    if __name__ == '__main__':
        BASE_URL = "https://groww.in/mutual-funds"
        LISTING_URL = "https://groww.in/slr/v1/search/derived/scheme"

        HEADERS = {
            'user-agent': random.choice(USER_AGENTS),
            'accept-language': 'en,gu;q=0.9,hi;q=0.8',
            # 'br' is only decoded if the brotli (or brotlicffi) package is installed
            'accept-encoding': 'gzip, deflate, br',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Referer': BASE_URL,
            'DNT': '1',
            'Connection': 'keep-alive'
        }

        PARAMS = {
            'available_for_investment': 'true',
            'doc_type': 'scheme',
            'page': 0,
            'plan_type': 'Direct',
            'size': 16,
            'sort_by': 0
        }

        try:
            session = requests.Session()
            print('FETCHING & SETTING COOKIES...')
            request = session.get(BASE_URL, headers=HEADERS, timeout=20)

            # Add random delay to mimic human behavior
            time.sleep(random.uniform(1.5, 3.5))

            # Let session handle cookies automatically
            response = session.get(url=LISTING_URL, headers=HEADERS, params=PARAMS, timeout=20)
            response.raise_for_status()
        except requests.exceptions.HTTPError as err:
            raise SystemExit(err)

        dajs = json.loads(response.text)
        print("Success! Data fetched successfully.")
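One thing worth guarding against once the 429 is gone: a blocked request can come back as an HTML page with a 200 status, in which case `json.loads` fails with a confusing decode error. That Groww behaves this way is an assumption, and `parse_json_body` is a hypothetical helper name. A minimal sketch of a check that fails with a clearer message:

```python
import json


def parse_json_body(content_type, text):
    """Parse `text` as JSON, failing loudly if the server sent something else."""
    if 'json' not in (content_type or '').lower():
        # Show the start of the body so a challenge or error page is easy to spot.
        preview = text[:80].replace('\n', ' ')
        raise ValueError(f'Expected JSON but got {content_type!r}: {preview!r}')
    return json.loads(text)
```

In the script above this would replace the `json.loads(response.text)` call, for example `dajs = parse_json_body(response.headers.get('Content-Type'), response.text)`.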
A quick note: Always make sure you're complying with Groww's Terms of Service when scraping their data. Some sites prohibit automated scraping, so double-check that first.
This question originally comes from Stack Exchange; asked by BeingSuman.




