How to List All Notebooks and Jobs in a Databricks Workspace and Load the Results into a DataFrame and a Managed Table in DBFS?

阿华AIGC实验室 (Ahua AIGC Lab)

2026-5-1

Yes, this is achievable with the Databricks SDK for Python: fetch the notebooks and jobs, convert the results into Spark DataFrames, and then persist them as managed tables. The steps below walk through it with code examples:

Solution Overview

We'll use two services exposed by the Databricks SDK's WorkspaceClient to gather the required metadata:

WorkspaceClient().workspace: To recursively traverse all workspace directories and collect notebook details
WorkspaceClient().jobs: To fetch all jobs (with automatic pagination handling for large job lists)

Then we'll convert the collected data into Spark DataFrames and save them as managed tables, whose storage location Databricks manages for you.

Prerequisites

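The listing calls return only the objects the caller is allowed to see, so ensure your Databricks user or service principal has at least read/view permission on the notebooks and jobs you want to inventory
The Databricks SDK for Python is pre-installed in recent Databricks Runtime versions. If it is missing, run %pip install databricks-sdk in a notebook cell to install it.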

Step 1: Fetch All Notebooks (Recursive Traversal)

We'll write a recursive function to walk through every directory in the workspace and collect key notebook metadata:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ObjectType

# Initialize the workspace client (inside a Databricks notebook it picks up
# the notebook's credentials automatically)
ws = WorkspaceClient()

def get_all_notebooks(root_path="/"):
    notebooks = []
    # List all objects directly under the current path
    for obj in ws.workspace.list(root_path):
        # object_type is an ObjectType enum, not a plain string
        if obj.object_type == ObjectType.NOTEBOOK:
            # Collect critical notebook metadata
            notebooks.append({
                "notebook_path": obj.path,
                "notebook_name": obj.path.split("/")[-1],
                # Store the enum's string value so Spark can infer the column type
                "language": obj.language.value if obj.language else None,
                "created_at": obj.created_at,
                # ObjectInfo exposes modified_at; it has no updated_at field
                "updated_at": obj.modified_at
            })
        elif obj.object_type == ObjectType.DIRECTORY:
            # Recursively check subdirectories
            notebooks.extend(get_all_notebooks(obj.path))
    return notebooks

# Fetch all notebooks in the workspace
notebooks_list = get_all_notebooks()
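The recursive walk above can also be expressed with an explicit stack, which avoids deep call chains in heavily nested workspaces. The following is a minimal sketch of that idea, not part of the SDK: `walk_notebooks`, the `list_children` callback, and the in-memory `FAKE_TREE` are hypothetical names introduced for illustration, and the plain "NOTEBOOK"/"DIRECTORY" strings are an assumption of the sketch.

```python
from typing import Callable, Iterable, List, Tuple

# Each entry is (path, kind), where kind is "NOTEBOOK" or "DIRECTORY".
# These plain strings are an assumption made for this sketch only.
Entry = Tuple[str, str]

def walk_notebooks(list_children: Callable[[str], Iterable[Entry]],
                   root: str = "/") -> List[str]:
    """Collect notebook paths under root using an explicit stack instead of recursion."""
    found: List[str] = []
    pending = [root]
    while pending:
        current = pending.pop()
        for path, kind in list_children(current):
            if kind == "NOTEBOOK":
                found.append(path)
            elif kind == "DIRECTORY":
                # Defer the subdirectory instead of recursing into it
                pending.append(path)
    return sorted(found)

# Hypothetical in-memory workspace standing in for a real listing call
FAKE_TREE = {
    "/": [("/Shared", "DIRECTORY"), ("/intro", "NOTEBOOK")],
    "/Shared": [("/Shared/etl", "NOTEBOOK"), ("/Shared/empty", "DIRECTORY")],
    "/Shared/empty": [],
}

paths = walk_notebooks(lambda p: FAKE_TREE.get(p, []))
print(paths)
```

In a notebook, the callback could wrap the real listing call, for example `lambda p: ((o.path, o.object_type.name) for o in ws.workspace.list(p))`, assuming every returned object has its `object_type` set. Recent SDK versions also appear to accept `recursive=True` on `ws.workspace.list`, which may make a hand-written walk unnecessary; check the version you have installed.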

Step 2: Fetch All Jobs (Handle Pagination)

The Jobs API returns results in pages, but the Databricks SDK simplifies this by returning an iterator that handles pagination automatically. We'll collect key job metadata, flattening nested fields for easier DataFrame handling:

from databricks.sdk import WorkspaceClient

# The SDK has no separate JobsClient; the Jobs API is exposed as WorkspaceClient().jobs
ws = WorkspaceClient()

def get_all_jobs():
    jobs = []
    # Iterate through all jobs (SDK manages pagination behind the scenes)
    for job in ws.jobs.list():
        settings = job.settings
        # settings.schedule is None for jobs without a cron schedule,
        # so test the value rather than using hasattr
        schedule = settings.schedule if settings else None
        # The objects returned by jobs.list() carry no state or job_type
        # field, so those columns are omitted here
        jobs.append({
            "job_id": job.job_id,
            "job_name": settings.name if settings else None,
            "created_at": job.created_time,  # epoch milliseconds
            "creator_user_name": job.creator_user_name,
            "schedule_cron": schedule.quartz_cron_expression if schedule else "NO_SCHEDULE",
            "timezone": schedule.timezone_id if schedule else None
        })
    return jobs

# Fetch all jobs in the workspace
jobs_list = get_all_jobs()
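The Jobs API reports a job's creation time as epoch milliseconds, which is hard to read in a table. A small helper can convert it before the data is loaded into Spark. This is a minimal sketch: `epoch_ms_to_utc` and `sample_job` are hypothetical names, and the sample values are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

def epoch_ms_to_utc(ms: Optional[int]) -> Optional[datetime]:
    """Convert epoch milliseconds to a timezone-aware UTC datetime; pass None through."""
    if ms is None:
        return None
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Hypothetical row shaped like one entry of jobs_list
sample_job = {"job_id": 1, "job_name": "nightly_etl", "created_at": 1700000000000}
sample_job["created_at_utc"] = epoch_ms_to_utc(sample_job["created_at"])
print(sample_job["created_at_utc"].isoformat())
```

Keeping the raw integer alongside the converted value is a cautious choice: the original number stays available if the conversion assumption turns out not to hold for a given field.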

Step 3: Convert to Spark DataFrames

Now convert the collected lists into Spark DataFrames for easy manipulation and storage:

# Convert notebooks list to a Spark DataFrame
notebooks_df = spark.createDataFrame(notebooks_list)

# Convert jobs list to a Spark DataFrame
jobs_df = spark.createDataFrame(jobs_list)

# Optional: Preview the data to verify
display(notebooks_df)
display(jobs_df)
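Schema inference from a list of dictionaries fails when the list is empty or when a column contains only None values, which can happen here if no job has a schedule. One way around this is to pass an explicit schema. The sketch below assumes a particular column layout: `JOB_COLUMNS`, `JOBS_SCHEMA_DDL`, and `to_rows` are hypothetical names, and the column types are assumptions to adjust to the keys you actually collect.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

# Assumed column layout for the jobs table (not dictated by the SDK)
JOB_COLUMNS = ["job_id", "job_name", "created_at",
               "creator_user_name", "schedule_cron", "timezone"]
JOBS_SCHEMA_DDL = (
    "job_id BIGINT, job_name STRING, created_at BIGINT, "
    "creator_user_name STRING, schedule_cron STRING, timezone STRING"
)

def to_rows(records: Iterable[Dict[str, Any]],
            columns: Sequence[str]) -> List[Tuple[Any, ...]]:
    """Project each dict onto a fixed column order; missing keys become None."""
    return [tuple(rec.get(col) for col in columns) for rec in records]

# Made-up sample record; note that it has no "timezone" key
sample = [{"job_id": 7, "job_name": "nightly_etl", "created_at": 1700000000000,
           "creator_user_name": "someone@example.com", "schedule_cron": "NO_SCHEDULE"}]
rows = to_rows(sample, JOB_COLUMNS)
print(rows)

# In a Databricks notebook, where `spark` is predefined:
# jobs_df = spark.createDataFrame(to_rows(jobs_list, JOB_COLUMNS), schema=JOBS_SCHEMA_DDL)
```

Because the rows are tuples in a fixed order, any extra keys in the dictionaries are silently dropped and missing ones become nulls, so the DataFrame always matches the declared schema.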

Step 4: Save to Managed Tables in DBFS

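Managed tables are tables whose data files Databricks stores and manages for you. In workspaces that use the legacy Hive metastore, that data lives under the DBFS root; with Unity Catalog, it lives in the managed storage location configured for Unity Catalog instead. Either way, use saveAsTable to persist the DataFrames: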

# Save notebooks to a managed table (replace `default` with your target database if needed)
notebooks_df.write.mode("overwrite").saveAsTable("default.workspace_notebooks")

# Save jobs to a managed table
jobs_df.write.mode("overwrite").saveAsTable("default.workspace_jobs")

Key Notes on Managed Tables:

Use mode("append") instead of overwrite if you want to add new data without replacing existing records
Databricks chooses the storage path for managed tables (under the DBFS root when the legacy Hive metastore is used), so you don't need to specify a path explicitly
You can query these tables directly using SQL or Spark APIs
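If you switch to append mode to keep a history of inventories, each run's rows need something that tells them apart. One option, sketched below under stated assumptions, is to stamp every record with the snapshot time before creating the DataFrame. `stamp_rows` and the `snapshot_at` column are hypothetical names, and the sample record is made up.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

def stamp_rows(records: List[Dict[str, Any]],
               snapshot_at: datetime) -> List[Dict[str, Any]]:
    """Return copies of the records with a snapshot_at column added."""
    stamp = snapshot_at.isoformat()
    return [{**rec, "snapshot_at": stamp} for rec in records]

# Fixed timestamp for a reproducible example; in practice use datetime.now(timezone.utc)
snapshot = datetime(2026, 5, 1, tzinfo=timezone.utc)
sample = [{"job_id": 7, "job_name": "nightly_etl"}]
stamped = stamp_rows(sample, snapshot)
print(stamped)
```

The stamped records can then be turned into a DataFrame and written with mode("append"). Note that the target table must already contain the extra column, so this works most smoothly when the column is present from the first write.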

Verify the Results

Confirm the data was saved correctly by querying the managed tables:

-- Check notebook data
SELECT * FROM default.workspace_notebooks LIMIT 10;

-- Check job data
SELECT * FROM default.workspace_jobs LIMIT 10;

This question originally comes from Stack Exchange; it was asked by Shruti.
