
Help needed: 429 Client Error when scraping a website with the Python Requests library

Fixing 429 Error When Scraping Groww Mutual Fund Data

Hey there! HTTP 429 means "Too Many Requests", and here it is most likely an anti-scraping response: even with a session and basic headers, Groww's systems appear to be flagging your requests as automated. Let's walk through some fixes to try, roughly in order of effort:

Key Issues & Solutions

1. Add Random Delays Between Requests

Even two back-to-back requests can trigger rate limits. Adding a small random pause mimics how a real user navigates:

import time
import random

# After fetching the BASE_URL
time.sleep(random.uniform(1.5, 3.5))  # Random delay between 1.5-3.5 seconds
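
If a 429 still comes back after adding a delay, it can help to wait and retry rather than fail straight away. The following is a minimal sketch, not something from the original code: `backoff_delay` and `get_with_backoff` are hypothetical helper names, and it assumes the server may send a `Retry-After` header with a number of seconds (Groww may not send one at all, in which case the sketch falls back to exponential backoff with jitter).

```python
import random
import time


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Return how many seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After is often a plain number of seconds
            return max(0.0, min(float(retry_after), cap))
        except ValueError:
            pass  # HTTP-date form is not parsed here; fall back to backoff
    # Exponential backoff (base, 2*base, 4*base, ...) with random jitter
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)


def get_with_backoff(session, url, max_retries=4, sleep=time.sleep, **kwargs):
    """GET `url` with `session`, retrying while the server answers 429."""
    for attempt in range(max_retries):
        response = session.get(url, **kwargs)
        if response.status_code != 429:
            return response
        sleep(backoff_delay(attempt, response.headers.get("Retry-After")))
    # Final attempt: return whatever the server sends
    return session.get(url, **kwargs)
```

With the names used in this answer, a call would look like `response = get_with_backoff(session, LISTING_URL, headers=HEADERS, params=PARAMS, timeout=20)`. The `sleep` parameter exists only so the waiting can be swapped out when testing.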

2. Rotate User-Agents

Sticking to one static user-agent is a red flag. Use a list of real, up-to-date user-agents and pick one randomly each run:

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
]

# Update your HEADERS
HEADERS = {
    'user-agent': random.choice(USER_AGENTS),
    # ... keep other existing headers
}

3. Let Requests Session Manage Cookies Automatically

Your current code manually extracts and passes cookies, but requests.Session() already persists cookies across requests. The manual step can miss critical cookies or format them incorrectly, so remove it:

# Remove these lines entirely:
# cookies = dict(request.cookies)
# ... and the `cookies=cookies` parameter in the second GET call

# Just use the session directly for the second request:
response = session.get(url=LISTING_URL, headers=HEADERS, params=PARAMS, timeout=20)

4. Add More Browser-Like Headers

Your headers are missing key fields that real browsers send. Update them to include these:

HEADERS = {
    'user-agent': random.choice(USER_AGENTS),
    'accept-language': 'en,gu;q=0.9,hi;q=0.8',
    # 'br' is left out: requests can only decode Brotli if the brotli package is installed
    'accept-encoding': 'gzip, deflate',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Referer': BASE_URL,  # Tells the server you came from the main MF page
    'DNT': '1',  # Common "Do Not Track" header from real browsers
    'Connection': 'keep-alive'
}

5. If All Else Fails: Mimic a Real Browser

Groww might set critical cookies or tokens via JavaScript, which requests (a non-JS client) can't handle. Try using requests-html (a JS-enabled requests wrapper) to render the page like a browser:

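import random
import time

from requests_html import HTMLSession

# Replace requests.Session() with HTMLSession()
session = HTMLSession()
request = session.get(BASE_URL, headers=HEADERS, timeout=20)
# Runs the page's JS in headless Chromium (downloaded on first use).
# Cookies set inside that browser may not be copied back into the session,
# so inspect session.cookies afterwards to confirm what you actually have.
request.html.render()
time.sleep(random.uniform(1, 3))
response = session.get(LISTING_URL, headers=HEADERS, params=PARAMS, timeout=20)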

Modified Full Code

Here's how your code looks with fixes 1 through 4 applied:

import requests
import time
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
]

if __name__ == '__main__':
    BASE_URL = "https://groww.in/mutual-funds"
    LISTING_URL = "https://groww.in/slr/v1/search/derived/scheme"
    HEADERS = {
        'user-agent': random.choice(USER_AGENTS),
        'accept-language': 'en,gu;q=0.9,hi;q=0.8',
        # 'br' is left out: requests can only decode Brotli if the brotli package is installed
        'accept-encoding': 'gzip, deflate',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Referer': BASE_URL,
        'DNT': '1',
        'Connection': 'keep-alive'
    }
    PARAMS = {
        'available_for_investment': 'true',
        'doc_type': 'scheme',
        'page': 0,
        'plan_type': 'Direct',
        'size': 16,
        'sort_by': 0
    }
    try:
        session = requests.Session()
        print('FETCHING & SETTING COOKIES...')
        # The first request only primes the session's cookie jar
        session.get(BASE_URL, headers=HEADERS, timeout=20)
        # Add random delay to mimic human behavior
        time.sleep(random.uniform(1.5, 3.5))
        # Let session handle cookies automatically
        response = session.get(url=LISTING_URL, headers=HEADERS, params=PARAMS, timeout=20)
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        # Covers HTTP errors (including 429) as well as timeouts and connection failures
        raise SystemExit(err)
    data = response.json()  # Parse the JSON body of the listing response
    print("Success! Data fetched successfully.")

A quick note: Always make sure you're complying with Groww's Terms of Service when scraping their data. Some sites prohibit automated scraping, so double-check that first.
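
One related check you can automate is the site's robots.txt. It is not the same thing as the Terms of Service, but it does show which paths the site asks crawlers to avoid. Here's a rough sketch using the standard library's `urllib.robotparser`; the rules and example.com URLs below are made up for illustration and are not Groww's actual robots.txt, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/public/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

Against a live site you would call `parser.set_url(...)` with the site's robots.txt URL and then `parser.read()` instead of parsing a string.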

This question originally comes from Stack Exchange; asked by BeingSuman.
