Help resolving simplejson.errors.JSONDecodeError and a CSV write problem in a Python web scraper

Fixing JSONDecodeError & Empty 2020_data_2.csv in Your Web Scraping Code

Let's break down what's causing the JSONDecodeError and why your 2020_data_2.csv is empty, then fix each issue step by step.

Key Issues Identified

  1. Invalid Request URL: The POST URL has an extra space after the ? (? service=...), so the server does not receive a parameter named service. This is the most likely cause of the JSONDecodeError, because the server then returns an error page instead of valid JSON.
  2. Missing Request Headers: Many websites block requests without proper User-Agent headers, which can also lead to non-JSON responses.
  3. Misplaced Exception Handling: The .json() call sits outside your try block, so the JSONDecodeError is raised before any except clause can catch it.
  4. Flawed Data Collection Logic: The check if r.keys() not in names is always true, because a dict_keys view never compares equal to a list stored in names. Every response therefore appends another copy of the header row.
  5. Unchecked Regex Matches: If the regex in Table() fails to find a match, it will throw an AttributeError when accessing match.group().
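To see why issue 4 bites, here is a small sketch with two hypothetical response dicts (the field names are made up for illustration and are not the real API fields). It shows the original membership test appending the header twice, and a list-to-list comparison behaving as intended.

```python
# Hypothetical parsed responses; the keys are placeholders, not the real API fields.
first = {"name": "A", "regNo": "1"}
second = {"name": "B", "regNo": "2"}

# Original pattern: a dict_keys view is compared against lists, which never matches.
names = []
for r in (first, second):
    if r.keys() not in names:
        names.append(list(r.keys()))

# Comparing a list against the stored lists works as intended.
fixed = []
for r in (first, second):
    header = list(r.keys())
    if header not in fixed:
        fixed.append(header)

print(len(names), len(fixed))
```

The first loop stores the header once per response, while the second stores it only once.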

Step-by-Step Fixes

1. Fix the Request URL

Remove the extra space in the post URL to ensure the server recognizes the service parameter correctly.
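As a rough illustration of what the stray space does, the sketch below parses both forms of the query string with the standard library only, so it runs without touching the network. The variable names are my own. Building the query with urlencode is one way to avoid hand-typed whitespace.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://www.nmc.org.in/MCIRest/open/getDataFromService"

# URL as described in the question: note the space right after the "?".
bad_url = BASE + "? service=getDoctorDetailsByIdImr"

# Letting urlencode build the query string avoids stray whitespace.
good_url = BASE + "?" + urlencode({"service": "getDoctorDetailsByIdImr"})

bad_params = parse_qs(urlsplit(bad_url).query)
good_params = parse_qs(urlsplit(good_url).query)

print(list(bad_params))   # the key carries a leading space
print(list(good_params))  # the key is exactly "service"
```

With the space in place, the parameter name becomes " service" rather than "service", which is presumably why the server cannot route the request.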

2. Add Proper Request Headers

Add a User-Agent header to mimic a browser request, which helps avoid being blocked by the server.

3. Move Exception Handling to Cover JSON Parsing

Wrap the .json() call in the try block to catch decode errors immediately, and skip faulty entries instead of crashing the script.
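A minimal sketch of the idea, using the standard json module on raw text so it can be run offline. The helper name parse_json_or_none is hypothetical. With requests, r.json() would take the place of json.loads(text) inside the same try block.

```python
import json


def parse_json_or_none(text):
    """Return the decoded object, or None when text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:
        # json.JSONDecodeError and simplejson's JSONDecodeError both subclass ValueError
        return None


error_page = "<html><body>Bad Request</body></html>"
valid_body = '{"doctorId": 1}'

print(parse_json_or_none(error_page))
print(parse_json_or_none(valid_body))
```

Catching ValueError rather than one specific JSONDecodeError class keeps the handler working whether or not simplejson is installed.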

4. Simplify Data Collection for Headers and Values

Initialize headers only once when you get the first valid response, then append all subsequent data rows—this ensures a clean, single header row followed by all matching data.
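The collection pattern can be sketched as follows, writing to an in-memory buffer instead of a file. The response dicts are hypothetical stand-ins, and None marks an entry whose JSON failed to decode. Looking values up by header key is an extra precaution I added so columns stay aligned if key order ever differs.

```python
import csv
import io

# Hypothetical parsed responses; None stands for an entry that failed to decode.
responses = [
    None,
    {"name": "A", "regNo": "1"},
    {"name": "B", "regNo": "2"},
]

header = None
rows = []
for r in responses:
    if r is None:
        continue  # skip faulty entries
    if header is None:
        header = list(r.keys())  # taken once, from the first valid response
    rows.append([r.get(key, "") for key in header])

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```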

5. Add Regex Match Check

Add a guard clause to skip items where the regex doesn't find a valid match, preventing crashes.
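For illustration, here is the guard applied to two made-up table cells, one containing the openDoctorDetailsnew(...) call and one without it. The cell contents are assumptions; the pattern is the one used in Table().

```python
import re

PATTERN = re.compile(r"openDoctorDetailsnew\('([^']*)', '([^']*)'")

# Hypothetical cell contents; only the first one contains the expected call.
cells = [
    "<a onclick=\"openDoctorDetailsnew('123', 'ABC-1')\">View</a>",
    "no link in this cell",
]

pairs = []
for cell in cells:
    match = PATTERN.search(cell)
    if match is None:
        continue  # nothing to extract, so skip instead of raising AttributeError
    pairs.append(match.groups())

print(pairs)
```

Only the first cell yields a (doctorId, regdNoValue) pair; the second is skipped quietly.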

Full Corrected Code

import pandas as pd
import csv
import re
import json
import requests

def Table():
    table = pd.read_json("https://www.nmc.org.in/MCIRest/open/getPaginatedData?service=getPaginatedDoctor&draw=1&columns%5B0%5D%5Bdata%5D=0&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=1&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=2&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=3&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=4&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=5&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=6&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=0&order%5B0%5D%5Bdir%5D=asc&start=20000&length=8751&search%5Bvalue%5D=&search%5Bregex%5D=false&name=&registrationNo=&smcId=&year=2020&_=1611587198138")['data']
    with open('C:\\Users\\SmartDB\\Desktop\\2020_out_2.csv', 'w', newline="") as f:
        writer = csv.writer(f)
        writer.writerow( ['Year Of The Info', 'Registration#', 'State Medical Councils', 'Name', 'FatherName'])
        data = []
        for item in table:
            writer.writerow(item[1:6])
            required = item[6]
            match = re.search(r"openDoctorDetailsnew\('([^']*)', '([^']*)'", required)
            if match:  # Skip if regex doesn't find a valid match
                data.append(list(match.groups()))  # [doctorId, regdNoValue] from the two capture groups
    print("Data Saved Into 2020_out_2.csv")
    return data

def Details():
    # Add browser-like headers to avoid being blocked
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    names = []
    items = []
    data = Table()
    for doc, val in data:
        print(f"Extracting DoctorID# {doc}, RegValue# {val}")
        payload = {'doctorId': doc, 'regdNoValue': val}
        try:
            r = requests.post(
                "https://www.nmc.org.in/MCIRest/open/getDataFromService?service=getDoctorDetailsByIdImr",
                json=payload,
                headers=headers,
                timeout=30,  # assumed limit in seconds so one stalled request cannot hang the run
            )
            r_json = r.json()
            # Initialize headers only once with the first valid response
            if not names:
                names.append(list(r_json.keys()))
            items.append(list(r_json.values()))
        except ValueError:  # base class of the json, simplejson and requests JSONDecodeError types
            print(f"Failed to decode JSON for DoctorID# {doc} - skipping entry")
            continue
        except Exception as e:
            print(f"Unexpected error for DoctorID# {doc}: {str(e)} - skipping entry")
            continue
    print("Done extracting details")
    return names, items

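def Save():
    # Collect everything first so the file is only created when there is data
    n, i = Details()
    if not (n and i):
        print("No valid data collected - 2020_data_2.csv was not written")
        return
    with open('C:\\Users\\SmartDB\\Desktop\\2020_data_2.csv', 'w', newline="") as d:
        writer = csv.writer(d)
        writer.writerows(n)  # single header row
        writer.writerows(i)  # one row per doctor
    print("Data saved to 2020_data_2.csv")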

Save()

Additional Notes

  • Renamed the json variable to payload to avoid shadowing the imported json module.
  • Added checks to skip invalid entries instead of crashing the entire script.
  • Added a guard in Save() to ensure we don't write empty content if no valid data was collected.

The question comes from Stack Exchange; asked by Nehal Bendale.
