How do I list all notebooks and jobs in a Databricks workspace and load the results into a DataFrame and a managed table in DBFS?
Yes, you can do this with the Databricks SDK for Python: fetch the notebooks and jobs, convert the results into Spark DataFrames, and persist them as managed tables. The steps below walk through it with code examples:
We'll use two APIs that the SDK exposes on its WorkspaceClient to gather the required metadata:
- Workspace API (the client's workspace attribute): to traverse all workspace directories and collect notebook details
- Jobs API (the client's jobs attribute): to fetch all jobs (with automatic pagination handling for large job lists)
Then we'll convert the collected data into Spark DataFrames and save them as managed tables, for which Databricks chooses and manages the storage location.
Prerequisites
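- Ensure your Databricks user or service principal has read access to the workspace folders and view access to the jobs you want to list. Running as a workspace admin is the simplest way to get a complete inventory.
- The Databricks SDK for Python is pre-installed on recent Databricks Runtime versions (13.3 LTS and above). If it is missing, run %pip install databricks-sdk in a notebook cell to install it.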
Step 1: Fetch All Notebooks (Recursive Traversal)
We'll write a recursive function to walk through every directory in the workspace and collect key notebook metadata:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ObjectType

# Initialize the workspace client (inside a notebook it picks up the notebook's credentials)
ws = WorkspaceClient()

def get_all_notebooks(root_path="/"):
    notebooks = []
    # List all objects directly under the current path
    for obj in ws.workspace.list(root_path):
        # object_type is an enum, so compare against ObjectType members, not strings
        if obj.object_type == ObjectType.NOTEBOOK:
            # Collect notebook metadata; the language enum is stored as a plain string
            notebooks.append({
                "notebook_path": obj.path,
                "notebook_name": obj.path.split("/")[-1],
                "language": obj.language.value if obj.language else None,
                "created_at": obj.created_at,   # epoch milliseconds
                "updated_at": obj.modified_at,  # epoch milliseconds
            })
        elif obj.object_type == ObjectType.DIRECTORY:
            # Recursively check subdirectories
            notebooks.extend(get_all_notebooks(obj.path))
    return notebooks

# Fetch all notebooks in the workspace
notebooks_list = get_all_notebooks()
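The traversal logic can be tried out without a workspace. The sketch below is a stand-in, not SDK code: FakeObj, FAKE_TREE, fake_list, and walk_notebooks are hypothetical names, and object types are plain strings here whereas the real SDK reports them as an enum. The walk uses an explicit stack rather than recursion, so very deep folder trees cannot hit Python's recursion limit.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects a workspace listing returns
FakeObj = namedtuple("FakeObj", ["path", "object_type"])

# Hypothetical in-memory workspace: directory path -> direct children
FAKE_TREE = {
    "/": [FakeObj("/Shared", "DIRECTORY"), FakeObj("/intro", "NOTEBOOK")],
    "/Shared": [FakeObj("/Shared/etl", "NOTEBOOK"), FakeObj("/Shared/lib", "DIRECTORY")],
    "/Shared/lib": [
        FakeObj("/Shared/lib/utils", "NOTEBOOK"),
        FakeObj("/Shared/lib/data.csv", "FILE"),
    ],
}


def fake_list(path):
    """Return the direct children of a directory in the fake workspace."""
    return FAKE_TREE.get(path, [])


def walk_notebooks(list_fn, root="/"):
    """Collect notebook paths under root using an explicit stack instead of recursion."""
    found = []
    stack = [root]
    while stack:
        current = stack.pop()
        for obj in list_fn(current):
            if obj.object_type == "NOTEBOOK":
                found.append(obj.path)
            elif obj.object_type == "DIRECTORY":
                # Defer the subdirectory; it is listed on a later loop iteration
                stack.append(obj.path)
    return sorted(found)


notebook_paths = walk_notebooks(fake_list)
print(notebook_paths)
```

Against a real workspace you would pass a function that wraps the SDK's listing call in place of fake_list and adapt the type comparison to the SDK's enum.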
Step 2: Fetch All Jobs (Handle Pagination)
The Jobs API returns results in pages, but the Databricks SDK simplifies this by returning an iterator that handles pagination automatically. We'll collect key job metadata, flattening nested fields for easier DataFrame handling:
from databricks.sdk import WorkspaceClient

# The jobs API is exposed on WorkspaceClient; the SDK has no separate JobsClient
ws = WorkspaceClient()

def get_all_jobs():
    jobs = []
    # Iterate through all jobs (SDK manages pagination behind the scenes)
    for job in ws.jobs.list():
        settings = job.settings
        # schedule is None for jobs that have no cron schedule
        schedule = settings.schedule if settings else None
        jobs.append({
            "job_id": job.job_id,
            "job_name": settings.name if settings else None,
            "created_at": job.created_time,  # epoch milliseconds
            "creator_user_name": job.creator_user_name,
            "schedule_cron": schedule.quartz_cron_expression if schedule else "NO_SCHEDULE",
            "timezone": schedule.timezone_id if schedule else None,
        })
    return jobs

# Fetch all jobs in the workspace
jobs_list = get_all_jobs()
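Flattening is easier to check as a small pure function. The sketch below uses SimpleNamespace objects as hypothetical stand-ins for job records; flatten_job is an illustrative helper, and the attribute names (settings, schedule, quartz_cron_expression, timezone_id) are assumptions to verify against your installed SDK version.

```python
from types import SimpleNamespace


def flatten_job(job):
    """Turn a nested job record into a flat dict; missing nested parts become None."""
    settings = getattr(job, "settings", None)
    schedule = getattr(settings, "schedule", None)
    return {
        "job_id": getattr(job, "job_id", None),
        "job_name": getattr(settings, "name", None),
        "schedule_cron": getattr(schedule, "quartz_cron_expression", None),
        "timezone": getattr(schedule, "timezone_id", None),
    }


# Hypothetical stand-ins: one scheduled job and one without a schedule
scheduled = SimpleNamespace(
    job_id=1,
    settings=SimpleNamespace(
        name="nightly",
        schedule=SimpleNamespace(
            quartz_cron_expression="0 0 2 * * ?", timezone_id="UTC"
        ),
    ),
)
unscheduled = SimpleNamespace(
    job_id=2, settings=SimpleNamespace(name="adhoc", schedule=None)
)

rows = [flatten_job(job) for job in (scheduled, unscheduled)]
print(rows)
```

Using None rather than a placeholder string for missing values keeps each column to a single type, which tends to make the later DataFrame conversion more predictable.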
Step 3: Convert to Spark DataFrames
Now convert the collected lists into Spark DataFrames for easy manipulation and storage:
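# Explicit schemas avoid inference errors on empty lists or all-null columns
notebooks_schema = "notebook_path string, notebook_name string, language string, created_at bigint, updated_at bigint"
jobs_schema = "job_id bigint, job_name string, created_at bigint, creator_user_name string, schedule_cron string, timezone string"

# Convert notebooks list to a Spark DataFrame
notebooks_df = spark.createDataFrame(notebooks_list, schema=notebooks_schema)

# Convert jobs list to a Spark DataFrame
jobs_df = spark.createDataFrame(jobs_list, schema=jobs_schema)

# Optional: Preview the data to verify
display(notebooks_df)
display(jobs_df)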
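If you would rather store real timestamps than raw numbers, the rows can be converted before they reach Spark. This sketch assumes the created and updated values arrive as epoch milliseconds, which is worth confirming on your own data; millis_to_utc and with_timestamps are hypothetical helper names.

```python
from datetime import datetime, timezone


def millis_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime; pass None through."""
    if ms is None:
        return None
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def with_timestamps(rows, fields=("created_at", "updated_at")):
    """Return copies of the rows with the given epoch-millisecond fields converted."""
    return [
        {key: (millis_to_utc(value) if key in fields else value)
         for key, value in row.items()}
        for row in rows
    ]


# Sample row with one populated and one missing timestamp
sample_rows = [
    {"notebook_path": "/intro", "created_at": 1700000000000, "updated_at": None}
]
converted = with_timestamps(sample_rows)
print(converted[0]["created_at"].isoformat())
```

Rows converted this way carry datetime values, so any explicit schema you pass to Spark would need to declare those columns as timestamp rather than a numeric type.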
Step 4: Save to Managed Tables in DBFS
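Databricks manages both the metadata and the data files of a managed table. With the legacy Hive metastore, the data lands under the DBFS root (by default /user/hive/warehouse); with Unity Catalog, it lands in the managed storage location configured for the metastore, catalog, or schema rather than in DBFS. Use saveAsTable to persist the DataFrames: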
# Save notebooks to a managed table (replace `default` with your target database if needed)
notebooks_df.write.mode("overwrite").saveAsTable("default.workspace_notebooks")

# Save jobs to a managed table
jobs_df.write.mode("overwrite").saveAsTable("default.workspace_jobs")
Key Notes on Managed Tables:
- Use mode("append") instead of mode("overwrite") if you want to add new data without replacing existing records
- You don't need to specify a storage path explicitly; Databricks places the table data in its managed storage location
- You can query these tables directly using SQL or Spark APIs
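When appending, it can help to stamp every row with the time of the run so that successive inventories stay distinguishable in the table. This is a minimal sketch of that idea, not part of the original approach: tag_snapshot and the snapshot_at column are hypothetical names.

```python
from datetime import datetime, timezone


def tag_snapshot(rows, snapshot_at=None):
    """Return copies of rows with a snapshot_at value so appended runs stay distinguishable."""
    snapshot_at = snapshot_at or datetime.now(timezone.utc)
    return [{**row, "snapshot_at": snapshot_at} for row in rows]


# A fixed time keeps the example deterministic; omit it to use the current time
fixed_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
tagged = tag_snapshot([{"job_id": 1}, {"job_id": 2}], snapshot_at=fixed_time)
print(tagged)
```

Queries can then filter on the latest snapshot_at value to see the most recent inventory, or compare two values to spot added and removed items.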
Verify the Results
Confirm the data was saved correctly by querying the managed tables:
-- Check notebook data
SELECT * FROM default.workspace_notebooks LIMIT 10;

-- Check job data
SELECT * FROM default.workspace_jobs LIMIT 10;
The question comes from Stack Exchange; asked by Shruti.




