How to use Python to convert a raw Jupyter Notebook from a GitHub source into a JSON file and a JSON list
Let's walk through how to convert a Jupyter Notebook hosted on GitHub into JSON, and how to turn specific parts of it (or the whole structure) into a JSON list using Python. Here's a practical, step-by-step breakdown:
Prerequisites
First, install the required libraries—these handle Notebook parsing and HTTP requests:
pip install nbformat requests
Step 1: Fetch the Raw Notebook from GitHub
A Jupyter Notebook (.ipynb) file is already JSON under the hood, but you need the raw file URL (not the web page URL, which returns an HTML page). To get this:
- Go to your Notebook's GitHub page
- Click the "Raw" button in the top-right of the file view
- Copy that URL
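If all you have is the regular web page URL, the "Raw" link can usually be derived from it instead of clicking through. Below is a minimal sketch, assuming the common https://github.com/<user>/<repo>/blob/<branch>/<path> layout. The helper name to_raw_url is hypothetical (it is not part of any library), and it does not cover private repositories or other URL shapes.

```python
from urllib.parse import urlsplit


def to_raw_url(page_url: str) -> str:
    """Turn a github.com "blob" page URL into its raw.githubusercontent.com form.

    Hypothetical helper; assumes the /<user>/<repo>/blob/<branch>/<path> layout.
    """
    parts = urlsplit(page_url)
    if parts.netloc == "raw.githubusercontent.com":
        return page_url  # already a raw URL, nothing to do
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"Not a GitHub file page URL: {page_url}")
    user, repo, _blob, *rest = segments  # rest = [branch, path parts...]
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, *rest])


page_url = "https://github.com/your-username/your-repo/blob/main/notebooks/example.ipynb"
print(to_raw_url(page_url))
```

The result can be used as the raw URL in the fetch step.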
Use requests to pull the raw content:
```python
import json
import sys

import nbformat
import requests

# Replace with your Notebook's raw GitHub URL
raw_url = "https://raw.githubusercontent.com/your-username/your-repo/main/notebooks/example.ipynb"

try:
    response = requests.get(raw_url, timeout=30)
    response.raise_for_status()  # Raise an error for 4xx/5xx responses
    notebook_raw = response.text
except requests.exceptions.RequestException as e:
    # Stop here: the later steps need notebook_raw
    sys.exit(f"Error fetching Notebook: {e}")
```
Step 2: Convert the Notebook to JSON
The nbformat library is the official tool for working with Jupyter Notebook files. We'll parse the raw content and then serialize it to JSON:
```python
# Parse the raw Notebook content into a Notebook object
notebook = nbformat.reads(notebook_raw, as_version=4)  # Version 4 is the current standard

# Convert the Notebook object to a JSON string
notebook_json = nbformat.writes(notebook)

# Save to a JSON file
with open("notebook_output.json", "w", encoding="utf-8") as f:
    f.write(notebook_json)
```
This gives you a full JSON representation of the entire Notebook, including cells, metadata, outputs, etc.
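Since an .ipynb file is already a JSON document, the standard-library json module alone is enough when you only need the data and not nbformat's validation or version conversion. The following is a sketch under that assumption. SAMPLE_IPYNB is a made-up minimal notebook standing in for the text fetched in Step 1.

```python
import json

# Made-up minimal notebook text, standing in for the content fetched in Step 1.
SAMPLE_IPYNB = r"""
{
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n", "Some text"]},
    {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [],
     "source": ["print('hi')"]}
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 5
}
"""

notebook_dict = json.loads(SAMPLE_IPYNB)  # plain dict, no nbformat involved

# Inspect the top-level keys and the number of cells
print(sorted(notebook_dict))
print(notebook_dict["nbformat"], len(notebook_dict["cells"]))

# Re-serialize, e.g. pretty-printed with a fixed indent
pretty_json = json.dumps(notebook_dict, indent=2, ensure_ascii=False)
```

One difference to keep in mind: plain json leaves each cell's source exactly as stored on disk (often a list of lines), whereas nbformat.reads joins those lines into a single string.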
Step 3: Convert to a JSON List
If you need a JSON list (e.g., a list of code cells, markdown cells, or cell metadata), you can extract specific parts of the Notebook object and format them as a list. Here are two common use cases:
Example 1: List of All Code Cell Contents
Extract just the code from every code cell and save as a JSON list:
```python
code_cells = []
for cell in notebook.cells:
    if cell.cell_type == "code":
        # nbformat.reads has already joined the source lines into a single string
        code_cells.append(cell.source)

# Convert the list to JSON
code_cells_json = json.dumps(code_cells, indent=2)

# Save to file
with open("code_cells_list.json", "w", encoding="utf-8") as f:
    f.write(code_cells_json)
```
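The same pattern works for any cell type. Here is a small sketch that generalizes it to plain dictionaries (as produced by json.loads on the raw file), where source may be either a string or a list of lines. The function name sources_by_type and the sample data are illustrative only and not part of nbformat.

```python
import json


def sources_by_type(notebook_dict: dict, cell_type: str) -> list:
    """Return the source text of every cell of the given type."""
    sources = []
    for cell in notebook_dict.get("cells", []):
        if cell.get("cell_type") != cell_type:
            continue
        source = cell.get("source", "")
        # On disk, source is often a list of lines; join it into one string.
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


# Illustrative data only
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo\n", "intro"]},
        {"cell_type": "code", "source": ["x = 1\n", "print(x)"]},
        {"cell_type": "code", "source": "y = 2"},
    ]
}

markdown_json = json.dumps(sources_by_type(sample, "markdown"), indent=2)
code_json = json.dumps(sources_by_type(sample, "code"), indent=2)
print(code_json)
```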
Example 2: List of Full Cell Objects (as Dictionaries)
Create a list where each item is a dictionary containing all details of a cell (type, source, outputs, metadata):
```python
cell_list = []
for cell in notebook.cells:
    cell_data = {
        "cell_type": cell.cell_type,
        "source": cell.source,
        "metadata": cell.metadata,
    }
    # Add output details for code cells
    if cell.cell_type == "code":
        cell_data["execution_count"] = cell.execution_count
        # Outputs are dict-like NotebookNode objects, so json.dumps can serialize them directly
        cell_data["outputs"] = cell.outputs
    cell_list.append(cell_data)

# Convert to JSON list
cell_list_json = json.dumps(cell_list, indent=2)
with open("notebook_cells_list.json", "w", encoding="utf-8") as f:
    f.write(cell_list_json)
```
Key Notes
- Always use the raw GitHub URL—the web page URL won't work (it's HTML, not raw Notebook content).
- Handle exceptions (like network errors or invalid Notebook formats) to make your script robust.
- nbformat can also read older Notebook format versions; as_version=4 converts whatever it reads to version 4, the current and most widely adopted format.
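As a rough illustration of the first two notes, the check below rejects content that is not a notebook before any conversion is attempted. It is a sketch using only the standard library. parse_notebook_text is a hypothetical helper, and its checks (valid JSON plus a cells list) are a minimal assumption about what counts as a version 4 notebook, not a full schema validation like the one nbformat performs.

```python
import json


def parse_notebook_text(text: str) -> dict:
    """Parse fetched text as a notebook dict, or raise ValueError with a hint."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        hint = ""
        if text.lstrip().lower().startswith(("<!doctype", "<html")):
            hint = " (looks like HTML: was the web page URL used instead of the raw URL?)"
        raise ValueError(f"Content is not valid JSON{hint}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("cells"), list):
        raise ValueError("JSON parsed, but it has no 'cells' list (version 4 layout)")
    return data


# Try one HTML page and one minimal notebook
for candidate in ("<!DOCTYPE html><html></html>", '{"cells": [], "nbformat": 4}'):
    try:
        nb = parse_notebook_text(candidate)
        print("OK:", len(nb["cells"]), "cells")
    except ValueError as err:
        print("Rejected:", err)
```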
Hope this helps you get exactly the JSON output you need. Feel free to adjust the extraction logic if you have specific requirements for the JSON list!
This question originally comes from Stack Exchange; asked by big_cactus.




