Help: 403 error when scraping CGI-generated data with Python

阿华AIGC实验室

2026-5-27

Troubleshooting 403 Error When Scraping a CGI-Generated Page in Python

Hey there, let's work through this 403 issue you're hitting with that CGI endpoint. I've dealt with similar scraping roadblocks before, so here are some practical steps to get past this:

Double-check the request method
The target URL has query parameters (a=0&b=1), which makes me wonder if the CGI script expects a GET request instead of POST. Many older CGI endpoints are designed to accept parameters via the URL query string, not the request body. Try swapping to a GET request first, like this:

import requests

# Mimic browser headers (copy these directly from your browser's dev tools!)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
    'Referer': 'https://www.xxxxx.com/',  # Critical: sites often block requests missing this
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

cookies = {
    # Paste your valid session cookies here
}

url = "https://www.xxxxx.com/cgi-bin/gameinfo_fcgi"
params = {'a': 0, 'b': 1}  # Move query params to the 'params' argument

# A timeout keeps the script from hanging forever if the server never answers
response = requests.get(url, headers=headers, cookies=cookies, params=params, timeout=10)
print(f"Status Code: {response.status_code}")
print(response.text[:500])  # Print first 500 chars to check content
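If you copied the full URL (query string included) from the browser, a small helper can split it into the base URL and a params dict, ready to pass as requests.get(url, params=params). This is only a sketch using the standard library; the name split_query is made up for illustration, and it assumes no parameter name appears twice (a plain dict would keep only the last one).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl

def split_query(full_url):
    """Return (base_url, params) for a URL that carries a query string."""
    parts = urlsplit(full_url)
    # keep_blank_values=True preserves parameters like "c=" that have no value
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    # Rebuild the URL without the query string or fragment
    base_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return base_url, params

base_url, params = split_query("https://www.xxxxx.com/cgi-bin/gameinfo_fcgi?a=0&b=1")
print(base_url)
print(params)
```

Note that the values come back as strings ('0' and '1'), which is what ends up in the query string anyway.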

Ensure your headers are fully browser-compliant
403 errors often stem from missing or incorrect request headers. Open your browser's DevTools (F12), go to the Network tab, find the exact CGI request your browser makes, and copy its request headers into your code. Pay special attention to Origin, Accept-Language, and Accept-Encoding: they are easy to overlook, and anti-scraping systems frequently check them.
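Retyping headers by hand is error-prone, so here is a rough sketch of a helper that turns header text pasted from DevTools into a dict. It assumes one "Name: value" pair per line; parse_raw_headers and the SKIP set are my own illustrative names, and the sample values are placeholders.

```python
# Headers the HTTP library normally sets itself; copying them can break the request.
SKIP = {"content-length", "host", "connection"}

def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from DevTools into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so URLs in the value stay intact
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() in SKIP:
            continue
        headers[name.strip()] = value.strip()
    return headers

raw = """
:authority: www.xxxxx.com
Accept-Language: en-US,en;q=0.9
Origin: https://www.xxxxx.com
Content-Length: 0
"""
headers = parse_raw_headers(raw)
print(headers)
```

The resulting dict can be passed as headers= to requests.get or merged into a session with session.headers.update(headers).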

Check for CSRF tokens
Some CGI scripts require a CSRF token to validate requests. If the page that triggers the CGI call has a hidden input field (e.g., <input name="csrf_token" value="xxx">), you'll need to scrape that token first and include it in your POST data. Here's how to do that with requests.Session (which also helps manage cookies automatically):

from bs4 import BeautifulSoup
import requests

session = requests.Session()
session.headers.update(headers)  # Reuse the headers from above

# First, visit the page that loads the CGI endpoint
page_response = session.get("https://www.xxxxx.com/page-with-game-info")
soup = BeautifulSoup(page_response.text, 'html.parser')
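token_input = soup.find('input', {'name': 'csrf_token'})
if token_input is None:
    # soup.find returns None when nothing matches; fail with a clear message
    raise RuntimeError("No csrf_token input found; check the field name in the page source")
csrf_token = token_input['value']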

# Now send the POST request with the token
post_data = {
    'a': 0,
    'b': 1,
    'csrf_token': csrf_token
}
response = session.post(url, data=post_data)
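The field name csrf_token is just an example; real sites use all sorts of names. If you're not sure which one applies, it can help to dump every hidden input on the page and look for something token-like. Here's a minimal sketch using only the standard library's html.parser; the class name HiddenInputCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)  # attribute names arrive lowercased
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""

sample_html = '<form><input type="hidden" name="csrf_token" value="xxx"><input name="q"></form>'
collector = HiddenInputCollector()
collector.feed(sample_html)
print(collector.fields)
```

In practice you would feed it the HTML of the page that triggers the CGI call and then add the relevant field to your POST data.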

Make sure your cookies are fresh
Cookies can expire quickly, especially when they are tied to a user session. If the site requires login, copy the cookies right after logging in. Better still, use requests.Session, which persists cookies across requests automatically and is more reliable than pasting them in by hand.
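If you do end up copying cookies from the browser, the Cookie request header comes as one long "k1=v1; k2=v2" string. A quick sketch of turning that into the dict that requests expects is below; cookie_header_to_dict is a hypothetical helper name and the cookie names are placeholders.

```python
def cookie_header_to_dict(cookie_header):
    """Convert the browser's Cookie header ('k1=v1; k2=v2') into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        # Split on the first '=' only, since values may contain '=' themselves
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies

cookies = cookie_header_to_dict("sessionid=abc123; lang=en")
print(cookies)
```

The result can be passed as cookies= to a request, or loaded into a session with session.cookies.update(cookies).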

Watch for anti-scraping triggers
If none of the above works, the site might be blocking your requests for other reasons:
- Request frequency: Add a small delay between requests with time.sleep(2) to avoid triggering rate limits.
- IP blocking: Try using a proxy if you've made too many requests from your current IP.
- Session order: Some sites require you to navigate through pages in a specific order (e.g., visit the homepage first, then the game info page) before accessing the CGI endpoint. Using requests.Session helps maintain this context.
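For the request-frequency point, a fixed time.sleep(2) works, but a delay that grows after each refusal is a bit gentler on the server. The sketch below is one possible way to do it, not a guaranteed fix: fetch_with_backoff and its parameters are illustrative names, and retrying only helps if the 403 is actually rate-related.

```python
import random
import time

def fetch_with_backoff(fetch, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns a status other than 403/429 or attempts run out.

    fetch is any zero-argument callable that returns an HTTP status code.
    """
    status = None
    for attempt in range(max_attempts):
        status = fetch()
        if status not in (403, 429):
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff (2s, 4s, 8s, ...) plus up to 1s of random jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return status

# Demo with a fake fetch and a fake sleep, so nothing touches the network or waits
responses = iter([403, 403, 200])
delays = []
result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, len(delays))
```

With a real session you would pass something like lambda: session.get(url, params=params).status_code as fetch and leave sleep at its default.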

The most important tip? Copy the exact request your browser makes: match the method, headers, parameters, and cookies exactly. Tools like Postman can help you replicate the request before you translate it into Python code.
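To check how closely your script matches the browser, you can compare the two sets of headers side by side. This is a small sketch with a made-up helper name (diff_headers) and placeholder header values.

```python
def diff_headers(browser, script):
    """Report headers that are missing from, or differ in, the script's request."""
    # Header names are case-insensitive, so compare on lowercase keys
    b = {k.lower(): v for k, v in browser.items()}
    s = {k.lower(): v for k, v in script.items()}
    missing = sorted(k for k in b if k not in s)
    different = sorted(k for k in b if k in s and b[k] != s[k])
    return missing, different

browser_headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.xxxxx.com/",
    "Origin": "https://www.xxxxx.com",
}
script_headers = {
    "user-agent": "python-requests/2.31.0",
    "Referer": "https://www.xxxxx.com/",
}
missing, different = diff_headers(browser_headers, script_headers)
print("missing:", missing)
print("different:", different)
```

With requests, the headers that were actually sent are available on response.request.headers, so you can pass that in as the script side.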

This question comes from Stack Exchange; asked by rickyi.