How to Use Python to Convert a Raw Jupyter Notebook from a GitHub Source into a JSON File and a JSON List

Ahua AIGC Lab (阿华AIGC实验室)

2026-4-30

Convert GitHub Jupyter Notebook to JSON (and JSON List) with Python

Let's walk through how to convert a Jupyter Notebook hosted on GitHub into JSON, and how to turn specific parts of it (or the whole structure) into a JSON list using Python. Here's a practical, step-by-step breakdown:

Prerequisites

First, install the required libraries—these handle Notebook parsing and HTTP requests:

pip install nbformat requests

Step 1: Fetch the Raw Notebook from GitHub

A .ipynb file is itself a JSON document, so GitHub's raw file URL returns that JSON directly, while the regular web page URL returns an HTML page. You need the raw file URL. To get it:

Go to your Notebook's GitHub page
Click the "Raw" button in the top-right of the file view
Copy that URL
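
If only the web page URL is at hand, the raw URL can usually be derived from it by dropping the blob segment and switching the host. The helper below is a minimal sketch: to_raw_url is a hypothetical name, and it assumes the common github.com/<owner>/<repo>/blob/<ref>/<path> layout of a public repository.

```python
from urllib.parse import urlparse


def to_raw_url(page_url: str) -> str:
    """Turn a github.com file-page URL into its raw.githubusercontent.com form."""
    parsed = urlparse(page_url)
    parts = parsed.path.strip("/").split("/")
    # Assumed path layout: <owner>/<repo>/blob/<ref>/<path to file>
    if parsed.netloc != "github.com" or len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"Not a GitHub file page URL: {page_url}")
    owner, repo, _blob, *rest = parts
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


page_url = "https://github.com/your-username/your-repo/blob/main/notebooks/example.ipynb"
raw_url_from_page = to_raw_url(page_url)
```

Copying the URL from the "Raw" button remains the safer route when the repository layout differs from this assumption.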

Use requests to pull the raw content:

import requests
import nbformat
import json

# Replace with your Notebook's raw GitHub URL
raw_url = "https://raw.githubusercontent.com/your-username/your-repo/main/notebooks/example.ipynb"

try:
    response = requests.get(raw_url, timeout=30)
    response.raise_for_status()  # Raise an error on 4xx/5xx responses
    notebook_raw = response.text
except requests.exceptions.RequestException as e:
    # Stop here: the later steps need notebook_raw to be defined
    raise SystemExit(f"Error fetching Notebook: {e}")
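
A quick sanity check before parsing can catch the common mistake of downloading the HTML page instead of the raw file. This sketch uses only the standard library; looks_like_notebook is a hypothetical helper, and its check for top-level cells and nbformat keys assumes a version 4 Notebook.

```python
import json


def looks_like_notebook(text: str) -> bool:
    """Return True if text parses as JSON with the top-level keys of a v4 Notebook."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False  # e.g. an HTML page fetched from the non-raw URL
    return isinstance(data, dict) and "cells" in data and "nbformat" in data


html_page = "<!DOCTYPE html><html><body>Not a notebook</body></html>"
minimal_notebook = json.dumps(
    {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
)

is_html_ok = looks_like_notebook(html_page)
is_notebook_ok = looks_like_notebook(minimal_notebook)
```

In the script above, the same check could be applied to notebook_raw right after the download.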

Step 2: Convert the Notebook to JSON

The nbformat library is the official tool for working with Jupyter Notebook files. We'll parse the raw content and then serialize it to JSON:

# Parse the raw Notebook content into a Notebook object
notebook = nbformat.reads(notebook_raw, as_version=4)  # Version 4 is the current standard

# Convert the Notebook object to a JSON string
notebook_json = nbformat.writes(notebook)

# Save to a JSON file (write as UTF-8 so non-ASCII characters in the Notebook are preserved)
with open("notebook_output.json", "w", encoding="utf-8") as f:
    f.write(notebook_json)

This gives you a full JSON representation of the entire Notebook, including cells, metadata, outputs, etc.
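
Because a .ipynb file is already JSON, the standard library alone can also re-serialize it. The sketch below is an alternative under that assumption, not a replacement for nbformat: it skips nbformat's version conversion, and in the on-disk format a cell's source may be a list of lines rather than one string. The sample_raw text is a hand-written stand-in for the downloaded content.

```python
import json

# Stand-in for the downloaded file content (a tiny hand-written v4 Notebook)
sample_raw = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n", "Intro text"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('hi')"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

# json.loads yields plain dicts and lists; json.dumps pretty-prints them again
notebook_dict = json.loads(sample_raw)
pretty_json = json.dumps(notebook_dict, indent=2, ensure_ascii=False)
```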

Step 3: Convert to a JSON List

If you need a JSON list (e.g., a list of code cells, markdown cells, or cell metadata), you can extract specific parts of the Notebook object and format them as a list. Here are two common use cases:

Example 1: List of All Code Cell Contents

Extract just the code from every code cell and save as a JSON list:

code_cells = []
for cell in notebook.cells:
    if cell.cell_type == "code":
        # nbformat.reads returns each cell's source as a single string
        code_cells.append(cell.source)

# Convert the list to JSON
code_cells_json = json.dumps(code_cells, indent=2)

# Save to file
with open("code_cells_list.json", "w") as f:
    f.write(code_cells_json)
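
The same pattern works for any cell type. The sketch below generalizes it on the plain-dict form of a Notebook (as produced by json.loads); sources_by_type is a hypothetical helper, and it joins the source when the raw file stores it as a list of lines.

```python
import json


def sources_by_type(notebook_dict: dict, cell_type: str) -> list:
    """Collect the source text of every cell of the given type."""
    sources = []
    for cell in notebook_dict.get("cells", []):
        if cell.get("cell_type") != cell_type:
            continue
        source = cell.get("source", "")
        # Raw .ipynb files may store source as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


# Hand-written sample data for illustration
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Intro"]},
        {"cell_type": "code", "source": "x = 1\nprint(x)"},
        {"cell_type": "code", "source": ["y = 2"]},
    ]
}

markdown_list_json = json.dumps(sources_by_type(sample, "markdown"))
code_list = sources_by_type(sample, "code")
```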

Example 2: List of Full Cell Objects (as Dictionaries)

Create a list where each item is a dictionary containing all details of a cell (type, source, outputs, metadata):

cell_list = []
for cell in notebook.cells:
    cell_data = {
        "cell_type": cell.cell_type,
        "source": cell.source,  # already a single string after nbformat.reads
        "metadata": cell.metadata
    }
    # Add output details for code cells
    if cell.cell_type == "code":
        cell_data["execution_count"] = cell.execution_count
        # Output nodes are dict subclasses, so copy each one into a plain dict
        # (nbformat.writes expects a whole Notebook, not a single output)
        cell_data["outputs"] = [dict(out) for out in cell.outputs]
    
    cell_list.append(cell_data)

# Convert to JSON list
cell_list_json = json.dumps(cell_list, indent=2)

with open("notebook_cells_list.json", "w") as f:
    f.write(cell_list_json)
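
Outputs can make this list large, for example when cells embed images. If only the kind of each output matters, a lighter record can be built instead. This is a sketch on plain dicts; summarize_cell and the output_types field are illustrative names, not part of nbformat.

```python
import json


def summarize_cell(cell: dict) -> dict:
    """Build a compact record: type, source, and (for code cells) output types only."""
    source = cell.get("source", "")
    if isinstance(source, list):  # raw files may store source as a list of lines
        source = "".join(source)
    record = {"cell_type": cell["cell_type"], "source": source}
    if cell["cell_type"] == "code":
        record["execution_count"] = cell.get("execution_count")
        record["output_types"] = [
            out.get("output_type") for out in cell.get("outputs", [])
        ]
    return record


# Hand-written sample cells for illustration
sample_cells = [
    {"cell_type": "markdown", "source": "Notes"},
    {"cell_type": "code", "source": ["1 + 1"], "execution_count": 3,
     "outputs": [{"output_type": "execute_result", "data": {"text/plain": "2"}}]},
]

summary = [summarize_cell(c) for c in sample_cells]
summary_json = json.dumps(summary, indent=2)
```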

Key Notes

Always use the raw GitHub URL—the web page URL won't work (it's HTML, not raw Notebook content).
Handle exceptions (like network errors or invalid Notebook formats) to make your script robust.
nbformat can also read older Notebook format versions; passing as_version=4 converts whatever is read to version 4, the current and most widely adopted format.

Hope this helps you get exactly the JSON output you need. Feel free to adjust the extraction logic if you have specific requirements for the JSON list!

The question in this post comes from Stack Exchange; it was asked by big_cactus.