Help: 403 error when scraping CGI-generated data with Python
Hey there, let's work through this 403 issue you're hitting with that CGI endpoint. I've dealt with similar scraping roadblocks before, so here are some practical steps to get past this:
Double-check the request method
The target URL has query parameters (a=0&b=1), which makes me wonder if the CGI script expects a GET request instead of POST. Many older CGI endpoints are designed to accept parameters via the URL query string, not the request body. Try swapping to a GET request first, like this:

    import requests

    # Mimic browser headers (copy these directly from your browser's dev tools!)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        'Referer': 'https://www.xxxxx.com/',  # Critical: sites often block requests missing this
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
    }
    cookies = {
        # Paste your valid session cookies here
    }
    url = "https://www.xxxxx.com/cgi-bin/gameinfo_fcgi"
    params = {'a': 0, 'b': 1}  # Move query params to the 'params' argument

    response = requests.get(url, headers=headers, cookies=cookies, params=params)
    print(f"Status Code: {response.status_code}")
    print(response.text[:500])  # Print first 500 chars to check content

Ensure your headers are fully browser-compliant
403 errors often stem from missing or incorrect request headers. Open your browser's DevTools (F12), go to the Network tab, find the exact CGI request your browser makes, and copy all headers into your code. Pay special attention to Origin, Accept-Language, and Accept-Encoding; these are easy to overlook but frequently checked by anti-scraping systems.
Check for CSRF tokens
Some CGI scripts require a CSRF token to validate requests. If the page that triggers the CGI call has a hidden input field (e.g., <input name="csrf_token" value="xxx">), you'll need to scrape that token first and include it in your POST data. Here's how to do that with requests.Session (which also helps manage cookies automatically):

    from bs4 import BeautifulSoup
    import requests

    session = requests.Session()
    session.headers.update(headers)  # Reuse the headers from above

    # First, visit the page that loads the CGI endpoint
    page_response = session.get("https://www.xxxxx.com/page-with-game-info")
    soup = BeautifulSoup(page_response.text, 'html.parser')

    # find() returns None when the field is missing, so fail with a clear message
    token_input = soup.find('input', {'name': 'csrf_token'})
    if token_input is None:
        raise RuntimeError("csrf_token field not found; check the page HTML")
    csrf_token = token_input['value']

    # Now send the POST request with the token (url is the CGI endpoint from above)
    post_data = {
        'a': 0,
        'b': 1,
        'csrf_token': csrf_token
    }
    response = session.post(url, data=post_data)

Validate your cookies are fresh and valid
Cookies can expire quickly, especially if they're tied to a user session. Make sure you're using cookies copied right after logging into the site (if login is required). Using requests.Session will automatically persist cookies across requests, which is more reliable than manually copying them.
Watch for anti-scraping triggers
If none of the above works, the site might be blocking your requests for other reasons:
- Request frequency: Add a small delay between requests with time.sleep(2) to avoid triggering rate limits.
- IP blocking: Try using a proxy if you've made too many requests from your current IP.
- Session order: Some sites require you to navigate through pages in a specific order (e.g., visit the homepage first, then the game info page) before accessing the CGI endpoint. Using requests.Session helps maintain this context.
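The request-frequency advice above can be sketched as a small retry helper. Treat this as a rough illustration under stated assumptions rather than a guaranteed fix: the name fetch_with_backoff, its parameters, and the choice to retry on 403 and 429 are all hypothetical, and fetch is any zero-argument callable that returns an object with a status_code attribute (for example a requests call wrapped in a lambda).

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=2.0,
                       retry_statuses=(403, 429), sleep=time.sleep):
    """Call fetch() until it returns a status outside retry_statuses.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    (plus up to 0.5 s of random jitter) between attempts.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code not in retry_statuses:
            return response
        # Only sleep if another attempt is still coming
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
    return response  # still blocked after max_attempts
```

With the session from earlier, a call might look like fetch_with_backoff(lambda: session.get(url, params=params)). Retrying only helps when the 403 is rate-related; if the block comes from missing headers or a missing token, waiting longer won't change the result.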
The most important tip? Copy the exact request your browser makes—match the method, headers, parameters, and cookies perfectly. Tools like Postman can help you replicate the request before translating it into Python code.
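To make "copy the exact request" less error-prone, here is a minimal sketch that converts text pasted from the DevTools Network tab into the dicts that requests expects. The helper names parse_raw_headers and parse_cookie_header are made up for this example, and the list of headers to skip is my own assumption about what the HTTP client should fill in by itself.

```python
def parse_raw_headers(raw):
    """Turn a request-header block pasted from DevTools into a dict.

    Skips blank lines, HTTP/2 pseudo-headers (":authority", ":path", ...),
    and headers better left to the client or passed separately (Cookie).
    """
    skip = {"host", "content-length", "cookie"}
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith(":"):
            continue
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() in skip:
            continue
        headers[name.strip()] = value.strip()
    return headers


def parse_cookie_header(cookie_header):
    """Turn the value of a Cookie header ("a=1; b=2") into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```

The results can be passed straight through, e.g. requests.get(url, headers=parse_raw_headers(raw), cookies=parse_cookie_header(cookie_string), params=params), where raw and cookie_string are whatever you pasted from the browser. One caveat: if the copied Accept-Encoding value advertises br, the body may come back Brotli-compressed, and as far as I know requests only decodes that when a Brotli package is installed, so dropping br from the value is the safer default.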
The question comes from Stack Exchange; asked by rickyi.




