How do I list all notebooks and jobs in a Databricks workspace and load the results into a DataFrame and a managed table in DBFS?

You can do this with the Databricks SDK for Python: fetch the notebooks and jobs, convert the results into Spark DataFrames, and persist them as managed tables. The steps below walk through it with code examples:

Solution Overview

We'll use two APIs that the Databricks SDK exposes through its WorkspaceClient to gather the required metadata:

  • Workspace API (the client's workspace attribute): to recursively traverse all workspace directories and collect notebook details
  • Jobs API (the client's jobs attribute): to fetch all jobs (the SDK handles pagination for large job lists)

Then we'll convert the collected data into Spark DataFrames and save them as managed tables. Databricks manages the storage location of such tables; with the legacy Hive metastore that location is under the DBFS root.

Prerequisites

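  • Ensure your Databricks user or service principal can read the workspace folders and view the jobs you want to inventory. The APIs return only objects the caller has access to, so a workspace admin gets the most complete list.
  • The Databricks SDK is pre-installed in recent Databricks Runtime versions. If it is missing or outdated, run %pip install --upgrade databricks-sdk in a notebook cell.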

Step 1: Fetch All Notebooks (Recursive Traversal)

We'll write a recursive function to walk through every directory in the workspace and collect key notebook metadata:

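from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ObjectType

# Initialize workspace client (inside a notebook it picks up the notebook's credentials automatically)
ws = WorkspaceClient()

def get_all_notebooks(root_path="/"):
    """Recursively collect metadata for every notebook under root_path."""
    notebooks = []
    # List all objects directly under the current path
    for obj in ws.workspace.list(root_path):
        # object_type is an ObjectType enum, so compare against enum members, not strings
        if obj.object_type == ObjectType.NOTEBOOK:
            # Collect critical notebook metadata
            notebooks.append({
                "notebook_path": obj.path,
                "notebook_name": obj.path.split("/")[-1],
                # language is also an enum; store its string value so Spark can infer the type
                "language": obj.language.value if obj.language else None,
                # Both timestamps are epoch milliseconds; the SDK field is modified_at, not updated_at
                "created_at": obj.created_at,
                "updated_at": obj.modified_at
            })
        elif obj.object_type == ObjectType.DIRECTORY:
            # Recursively check subdirectories
            notebooks.extend(get_all_notebooks(obj.path))
    return notebooks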

# Fetch all notebooks in the workspace
notebooks_list = get_all_notebooks()

Step 2: Fetch All Jobs (Handle Pagination)

The Jobs API returns results in pages, but the Databricks SDK simplifies this by returning an iterator that handles pagination automatically. We'll collect key job metadata, flattening nested fields for easier DataFrame handling:

from databricks.sdk import WorkspaceClient

# The SDK has no separate JobsClient; the Jobs API is the `jobs` attribute of WorkspaceClient
jobs_client = WorkspaceClient()

def get_all_jobs():
    jobs = []
    # Iterate through all jobs (SDK manages pagination behind the scenes)
    for job in jobs_client.jobs.list():
        settings = job.settings
        # Unscheduled jobs have schedule set to None, so test for None rather than using hasattr
        schedule = settings.schedule if settings else None
        # Listed jobs carry no state or job_type field, so those columns are omitted
        jobs.append({
            "job_id": job.job_id,
            "job_name": settings.name if settings else None,
            # created_time is epoch milliseconds
            "created_at": job.created_time,
            "creator_user_name": job.creator_user_name,
            "schedule_cron": schedule.quartz_cron_expression if schedule else "NO_SCHEDULE",
            "timezone": schedule.timezone_id if schedule else None
        })
    return jobs

# Fetch all jobs in the workspace
jobs_list = get_all_jobs()
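
The creation time returned by the Jobs API, like the notebook timestamps in Step 1, is expressed in epoch milliseconds. Below is a small sketch of converting those values to UTC datetimes before building DataFrames. The helper names `epoch_ms_to_utc` and `with_readable_times` are hypothetical, and the sample row is made up for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List, Optional, Sequence


def epoch_ms_to_utc(ms: Optional[int]) -> Optional[datetime]:
    """Convert epoch milliseconds to a timezone-aware UTC datetime (None stays None)."""
    if ms is None:
        return None
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def with_readable_times(
    rows: Iterable[Dict[str, Any]], fields: Sequence[str] = ("created_at",)
) -> List[Dict[str, Any]]:
    """Return copies of the rows with the given epoch-millisecond fields converted."""
    converted = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for field in fields:
            if field in new_row:
                new_row[field] = epoch_ms_to_utc(new_row[field])
        converted.append(new_row)
    return converted


# Made-up sample row, for illustration only
sample_rows = [{"job_id": 1, "created_at": 1_700_000_000_000}]
readable_rows = with_readable_times(sample_rows)
print(readable_rows[0]["created_at"].isoformat())
```

Applied to the real data, this might look like `with_readable_times(jobs_list)`; Spark maps Python datetimes to a timestamp column.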

Step 3: Convert to Spark DataFrames

Now convert the collected lists into Spark DataFrames for easy manipulation and storage:

# Convert notebooks list to a Spark DataFrame
notebooks_df = spark.createDataFrame(notebooks_list)

# Convert jobs list to a Spark DataFrame
jobs_df = spark.createDataFrame(jobs_list)

# Optional: Preview the data to verify
display(notebooks_df)
display(jobs_df)
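
Schema inference in spark.createDataFrame fails when a column holds only None values (for instance, the timezone column when no job is scheduled) or when the list is empty. A hedged sketch of a pre-check in plain Python follows. The helper names `all_null_columns` and `rows_to_tuples` and the sample rows are hypothetical, introduced only to illustrate the idea.

```python
from typing import Any, Dict, List, Sequence, Tuple


def all_null_columns(rows: Sequence[Dict[str, Any]], columns: Sequence[str]) -> List[str]:
    """Return the columns that are None or missing in every row.

    With no rows at all, every column is reported, since nothing can be inferred.
    """
    return [c for c in columns if all(row.get(c) is None for row in rows)]


def rows_to_tuples(rows: Sequence[Dict[str, Any]], columns: Sequence[str]) -> List[Tuple[Any, ...]]:
    """Order each row's values by `columns`, filling missing keys with None."""
    return [tuple(row.get(c) for c in columns) for row in rows]


# Made-up sample data, for illustration only
job_columns = ["job_id", "job_name", "schedule_cron", "timezone"]
sample_jobs = [
    {"job_id": 1, "job_name": "nightly", "schedule_cron": "NO_SCHEDULE", "timezone": None},
    {"job_id": 2, "job_name": "hourly", "schedule_cron": "NO_SCHEDULE"},
]
print(all_null_columns(sample_jobs, job_columns))
print(rows_to_tuples(sample_jobs, job_columns))
```

If a column is reported, an explicit schema avoids the inference error, for example by passing the tuples together with a DDL string such as "job_id long, job_name string, schedule_cron string, timezone string" as the schema argument of spark.createDataFrame.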

Step 4: Save to Managed Tables in DBFS

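For a managed table, Databricks controls both the metadata and the data files. With the legacy Hive metastore the files live under the DBFS root, while Unity Catalog places them in the managed storage location of the metastore, catalog, or schema. Use saveAsTable to persist the DataFrames: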

# Save notebooks to a managed table (replace `default` with your target database if needed)
notebooks_df.write.mode("overwrite").saveAsTable("default.workspace_notebooks")

# Save jobs to a managed table
jobs_df.write.mode("overwrite").saveAsTable("default.workspace_jobs")

Key Notes on Managed Tables:

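  • Use mode("append") instead of overwrite if you want to add new data without replacing existing records. Each run then appends a full copy of the inventory, so deduplicate or filter by run when querying.
  • You don't need to specify a storage path explicitly: Databricks chooses the managed location (under the DBFS root when the legacy Hive metastore is used)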
  • You can query these tables directly using SQL or Spark APIs
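
With mode("append"), rows from different runs are indistinguishable unless each carries a marker. One possible approach, sketched under the assumption that the rows are still plain dicts (as in Steps 1 and 2), is to add a snapshot timestamp before creating the DataFrame. The function `stamp_rows` and the `snapshot_at` column are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List, Optional


def stamp_rows(
    rows: Iterable[Dict[str, Any]], snapshot_at: Optional[datetime] = None
) -> List[Dict[str, Any]]:
    """Return copies of the rows with a shared snapshot_at value added."""
    # One timestamp for the whole batch, so a run can be selected with a single filter
    stamp = snapshot_at or datetime.now(timezone.utc)
    return [{**row, "snapshot_at": stamp} for row in rows]


# Made-up rows and a fixed timestamp, for illustration only
fixed_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
stamped = stamp_rows([{"job_id": 1}, {"job_id": 2}], fixed_time)
print(stamped)
```

The most recent inventory could then be retrieved by filtering the table on the maximum snapshot_at value.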

Verify the Results

Confirm the data was saved correctly by querying the managed tables:

-- Check notebook data
SELECT * FROM default.workspace_notebooks LIMIT 10;

-- Check job data
SELECT * FROM default.workspace_jobs LIMIT 10;

The question comes from Stack Exchange; it was asked by Shruti.
