Python web scraping: resolving simplejson.errors.JSONDecodeError and a CSV writing problem
Let's break down what's causing the JSONDecodeError and why your 2020_data_2.csv is empty, then fix each issue step by step.
Key Issues Identified
- Invalid Request URL: The POST URL has an extra space after the `?` (`? service=...`), which makes the request invalid. This is the primary cause of the JSON decode error, because the server returns an error page instead of valid JSON.
- Missing Request Headers: Many websites block requests without a proper `User-Agent` header, which can also lead to non-JSON responses.
- Misplaced Exception Handling: Your try/except block only starts after the call to `.json()`, so the error is raised before the handler can catch it.
- Flawed Data Collection Logic: Checking `if r.keys() not in names` doesn't work as expected (a `dict_keys` object never compares equal to a list item), leading to missing or duplicate headers.
- Unchecked Regex Matches: If the regex in `Table()` fails to find a match, accessing `match.group()` raises an `AttributeError`.
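To see why the membership check misbehaves, here is a small sketch using a made-up record (the `record` dictionary and its keys are illustrative, not taken from the real response). It compares a `dict_keys` view and a plain list against the same stored header list.

```python
# Hypothetical record standing in for one parsed JSON response.
record = {"name": "A", "year": 2020}

# The header list is stored as a plain list, as in the script.
names = [list(record.keys())]

# A dict_keys view never compares equal to a list, so this is always False.
keys_view_found = record.keys() in names

# Converting to a list first gives the comparison you actually want.
keys_list_found = list(record.keys()) in names

print(keys_view_found, keys_list_found)
```

Because the view is never "found", a `not in` check passes on every iteration and the same header row gets appended again and again.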
Step-by-Step Fixes
1. Fix the Request URL
Remove the extra space in the post URL to ensure the server recognizes the service parameter correctly.
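One way to check this locally, without contacting the server, is to parse the query string with the standard library. This is only a sketch: how the server actually reacts to the malformed key is an assumption, but the parse shows that the parameter name is no longer `service`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://www.nmc.org.in/MCIRest/open/getDataFromService"

# URL with the stray space after "?", as described above.
bad_url = BASE + "? service=getDoctorDetailsByIdImr"

# Building the query with urlencode avoids hand-typed mistakes.
good_url = BASE + "?" + urlencode({"service": "getDoctorDetailsByIdImr"})

bad_params = parse_qs(urlsplit(bad_url).query)
good_params = parse_qs(urlsplit(good_url).query)

print(bad_params)   # the key carries a leading space
print(good_params)  # the key is exactly "service"
```

With `requests` you can get the same effect by passing `params={"service": "getDoctorDetailsByIdImr"}` and leaving the query string out of the URL.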
2. Add Proper Request Headers
Add a User-Agent header to mimic a browser request, which helps avoid being blocked by the server.
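A possible way to apply the header to every call is a `requests.Session`. The sketch below only prepares a request and never sends it, so you can inspect what would go over the wire; the payload values are placeholders, not real IDs.

```python
import requests

DETAILS_URL = (
    "https://www.nmc.org.in/MCIRest/open/getDataFromService"
    "?service=getDoctorDetailsByIdImr"
)

session = requests.Session()
# Headers set on the session are merged into every request it prepares.
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    ),
})

# Placeholder payload for illustration only.
request = requests.Request(
    "POST", DETAILS_URL, json={"doctorId": "0", "regdNoValue": "0"}
)
prepared = session.prepare_request(request)  # nothing is sent here

print(prepared.headers["User-Agent"])
print(prepared.headers["Content-Type"])
```

A session also reuses the underlying connection, which may help when posting thousands of detail requests to the same host.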
3. Move Exception Handling to Cover JSON Parsing
Wrap the .json() call in the try block to catch decode errors immediately, and skip faulty entries instead of crashing the script.
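A minimal sketch of that pattern follows. `FakeResponse` and `parse_json_or_none` are hypothetical names used so the example runs offline; a real `requests.Response` exposes the same `status_code` attribute and `json()` method. Catching `ValueError` is deliberate: both `json.JSONDecodeError` and simplejson's `JSONDecodeError` are subclasses of it, so the handler works whichever library `requests` ends up using.

```python
import json


class FakeResponse:
    """Hypothetical stand-in for requests.Response, for illustration only."""

    def __init__(self, text, status_code=200):
        self.text = text
        self.status_code = status_code

    def json(self):
        return json.loads(self.text)


def parse_json_or_none(response):
    """Return the decoded body, or None if the response is not usable JSON."""
    if response.status_code != 200:
        return None
    try:
        # The decode happens inside the try block, so a bad body is caught.
        return response.json()
    except ValueError:
        return None


ok = parse_json_or_none(FakeResponse('{"name": "A"}'))
bad = parse_json_or_none(FakeResponse("<html>error page</html>"))
failed = parse_json_or_none(FakeResponse("{}", status_code=500))
print(ok, bad, failed)
```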
4. Simplify Data Collection for Headers and Values
Initialize headers only once when you get the first valid response, then append all subsequent data rows. This ensures a clean, single header row followed by all matching data.
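The idea can be sketched with made-up records and an in-memory buffer instead of a file (the `records` list is illustrative). Using `record.get(key, "")` is an extra assumption on my part: it keeps columns aligned if a later response is missing a key.

```python
import csv
import io

# Hypothetical decoded responses; real ones come from r.json().
records = [
    {"name": "A", "year": 2020},
    {"name": "B", "year": 2020},
]

header = []
rows = []
for record in records:
    if not header:
        # Take the column names from the first valid record only.
        header = list(record.keys())
    # Follow the header order so every row lines up with it.
    rows.append([record.get(key, "") for key in header])

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```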
5. Add Regex Match Check
Add a guard clause to skip items where the regex doesn't find a valid match, preventing crashes.
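As a rough illustration, the guard can live in a small helper. The `extract_ids` name and the sample HTML cell are made up for this sketch; only the regex comes from the script.

```python
import re

PATTERN = re.compile(r"openDoctorDetailsnew\('([^']*)', '([^']*)'")


def extract_ids(cell):
    """Return (doctor_id, reg_value), or None if the cell has no match."""
    match = PATTERN.search(cell)
    if match is None:
        # Without this check, match.groups() would raise AttributeError.
        return None
    return match.groups()


# Hypothetical cell content, shaped like the link the regex expects.
sample = "<a onclick=\"openDoctorDetailsnew('123', '456')\">View</a>"

print(extract_ids(sample))
print(extract_ids("no link in this cell"))
```

Using `match.groups()` returns the two captured values directly, which is an alternative to splitting the matched text on quotes.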
Full Corrected Code
    import pandas as pd
    import csv
    import re
    import requests


    def Table():
        table = pd.read_json("https://www.nmc.org.in/MCIRest/open/getPaginatedData?service=getPaginatedDoctor&draw=1&columns%5B0%5D%5Bdata%5D=0&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=1&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=2&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=3&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=4&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=5&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=6&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=0&order%5B0%5D%5Bdir%5D=asc&start=20000&length=8751&search%5Bvalue%5D=&search%5Bregex%5D=false&name=&registrationNo=&smcId=&year=2020&_=1611587198138")['data']
        with open('C:\\Users\\SmartDB\\Desktop\\2020_out_2.csv', 'w', newline="") as f:
            writer = csv.writer(f)
            writer.writerow(
                ['Year Of The Info', 'Registration#', 'State Medical Councils', 'Name', 'FatherName'])
            data = []
            for item in table:
                writer.writerow(item[1:6])
                required = item[6]
                match = re.search(r"openDoctorDetailsnew\('([^']*)', '([^']*)'", required)
                if match:  # Skip if regex doesn't find a valid match
                    data.append(match.group().split("'")[1:4:2])
        print("Data Saved Into 2020_out_2.csv")
        return data


    def Details():
        # Add browser-like headers to avoid being blocked
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
        }
        names = []
        items = []
        data = Table()
        for doc, val in data:
            print(f"Extracting DoctorID# {doc}, RegValue# {val}")
            payload = {'doctorId': doc, 'regdNoValue': val}
            try:
                r = requests.post(
                    "https://www.nmc.org.in/MCIRest/open/getDataFromService?service=getDoctorDetailsByIdImr",
                    json=payload,
                    headers=headers
                )
                r_json = r.json()
                # Initialize headers only once with the first valid response
                if not names:
                    names.append(list(r_json.keys()))
                items.append(list(r_json.values()))
            except ValueError:
                # json.JSONDecodeError and simplejson's JSONDecodeError both subclass ValueError
                print(f"Failed to decode JSON for DoctorID# {doc} - skipping entry")
                continue
            except Exception as e:
                print(f"Unexpected error for DoctorID# {doc}: {e} - skipping entry")
                continue
        print("Done extracting details")
        return names, items


    def Save():
        with open('C:\\Users\\SmartDB\\Desktop\\2020_data_2.csv', 'w', newline="") as d:
            writer = csv.writer(d)
            n, i = Details()
            # Only write if we have valid data to avoid empty files
            if n and i:
                writer.writerows(n)
                writer.writerows(i)
                print("Data saved to 2020_data_2.csv")


    Save()
Additional Notes
- Renamed the `json` variable to `payload` to avoid shadowing the `json` module.
- Added checks to skip invalid entries instead of crashing the entire script.
- Added a guard in `Save()` to ensure we don't write empty content if no valid data was collected.
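Note that `Save()` opens the output file before `Details()` runs, so an empty file is still created when nothing is collected. If you would rather not create the file at all in that case, one possible variation is to collect first and open afterwards. The `save_rows` helper and the temporary paths below are hypothetical, used only to make the sketch runnable.

```python
import csv
import os
import tempfile


def save_rows(path, header, rows):
    """Write header and rows to path; create nothing if there is no data."""
    if not header or not rows:
        return False
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return True


with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.csv")
    full_path = os.path.join(tmp, "full.csv")

    wrote_empty = save_rows(empty_path, ["name"], [])
    wrote_full = save_rows(full_path, ["name"], [["A"]])

    empty_exists = os.path.exists(empty_path)
    full_exists = os.path.exists(full_path)

print(wrote_empty, empty_exists, wrote_full, full_exists)
```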
This question originally appeared on Stack Exchange, asked by Nehal Bendale.




