How can I use the GitHub API to track how contributors' PR submission frequency changes over time in a public project?
Yes, you can track how often each contributor submits PRs over time for a public GitHub repo using the GitHub REST API. Let's break this down into actionable steps, with code snippets and practical tips.
- Grab a Personal Access Token (PAT): Unauthenticated requests work, but GitHub caps them at 60 per hour, which is far too low for repos with hundreds of PRs. A PAT raises this limit to 5,000 requests per hour. Create one in your GitHub settings (under Developer settings > Personal access tokens). The `public_repo` scope is more than enough here; since we're only reading public repos, a classic token with no scopes selected should also work.
- Pick Your Tools: I'll use Python for the examples here because it's well suited to data wrangling and API calls. You'll need the `requests` library for API calls, plus `pandas` and `matplotlib` for analysis and visualization. Install them with: `pip install requests pandas matplotlib`
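Rather than pasting the token into the script, one option is to read it from an environment variable. The sketch below assumes a variable named `GITHUB_TOKEN` and a hypothetical helper called `build_headers`; GitHub requires neither name, they are only illustrative.

```python
import os

def build_headers(token=None):
    """Build request headers, adding authentication only when a token is given."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers

# GITHUB_TOKEN is an assumed variable name; without it, requests go out unauthenticated.
headers = build_headers(os.environ.get("GITHUB_TOKEN"))
print("authenticated" if "Authorization" in headers else "unauthenticated")
```

You would then run the script with the variable set in your shell, which keeps the token out of the source file and out of version control.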
The core endpoint to pull PRs is GET /repos/{owner}/{repo}/pulls. Here’s how to use it effectively:
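- Critical Parameters:
  - `state=all`: Fetches all PRs (open and closed, including merged ones). This endpoint only accepts `open`, `closed`, or `all`; there is no `merged` value, so if you only care about accepted contributions, fetch `closed` PRs and keep those with a non-null `merged_at`.
  - `per_page=100`: Maximizes results per request to cut down on API calls (100 is the maximum).
  - `page={page_number}`: Handles pagination. GitHub returns results in chunks, so you loop through pages until no PRs are returned.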
Example code to fetch all PRs:

    import requests

    GITHUB_TOKEN = "your_pat_here"
    OWNER = "repo_owner_username"
    REPO = "target_repo_name"

    headers = {"Authorization": f"token {GITHUB_TOKEN}"}
    all_prs = []
    current_page = 1

    while True:
        api_url = f"https://api.github.com/repos/{OWNER}/{REPO}/pulls?state=all&per_page=100&page={current_page}"
        response = requests.get(api_url, headers=headers)
        response.raise_for_status()  # Throw an error if the request fails

        page_prs = response.json()
        if not page_prs:
            break  # No more PRs to fetch

        all_prs.extend(page_prs)
        current_page += 1

From each PR, we only need two details: the contributor's username (`pr['user']['login']`) and the PR creation timestamp (`pr['created_at']`). Let's put this data into a DataFrame and group it by time interval (weekly or monthly works best):

    import pandas as pd

    # Convert raw PR data into a structured DataFrame
    pr_df = pd.DataFrame([
        {
            "contributor": pr["user"]["login"],
            "created_at": pd.to_datetime(pr["created_at"]),
        }
        for pr in all_prs
    ])

    # Group by contributor and weekly intervals
    # (for monthly, use "ME" on pandas 2.2+ or "M" on older versions)
    pr_frequency = pr_df.groupby([
        pd.Grouper(key="created_at", freq="W"),
        "contributor",
    ]).size().unstack(fill_value=0)

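To make the grouping step concrete, here is a dependency-free sketch of the same weekly bucketing, using only the standard library in place of pandas. The `sample_prs` records are made up for illustration and `week_start` is a hypothetical helper. It labels each week by its Monday start; as far as I know, pandas' `freq="W"` labels each bucket by its Sunday end instead, so the labels differ even though the buckets should cover the same Monday-to-Sunday span.

```python
from collections import Counter
from datetime import datetime, timedelta

# Invented records shaped like the two fields used above (not real data).
sample_prs = [
    {"user": {"login": "alice"}, "created_at": "2024-01-01T09:00:00Z"},
    {"user": {"login": "alice"}, "created_at": "2024-01-03T15:30:00Z"},
    {"user": {"login": "bob"}, "created_at": "2024-01-04T11:00:00Z"},
    {"user": {"login": "alice"}, "created_at": "2024-01-10T08:45:00Z"},
]

def week_start(timestamp):
    """Return the Monday (as a date) of the week containing an ISO 8601 UTC timestamp."""
    created = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return (created - timedelta(days=created.weekday())).date()

# Count PRs per (week, contributor) pair.
weekly_counts = Counter(
    (week_start(pr["created_at"]), pr["user"]["login"]) for pr in sample_prs
)

for (week, login), count in sorted(weekly_counts.items()):
    print(week, login, count)
```

With this sample, alice has two PRs and bob one in the week starting 2024-01-01, and alice has one in the week starting 2024-01-08.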
Now you can plot the data to see how each contributor's submission rate changes over time. Here's a quick matplotlib example:

    import matplotlib.pyplot as plt

    plt.figure(figsize=(12, 6))

    # Draw one line per contributor
    for contributor in pr_frequency.columns:
        plt.plot(pr_frequency.index, pr_frequency[contributor], label=contributor)

    plt.title("PR Submission Frequency Over Time by Contributor")
    plt.xlabel("Date")
    plt.ylabel("Number of PRs")
    plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.tight_layout()
    plt.show()

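A plot shows the trend visually, but if you want the change as numbers, one simple option is the difference between consecutive periods. The sketch below is only an illustration: `week_over_week_change` is a hypothetical helper and the counts are invented. On the DataFrame built earlier, `pr_frequency.diff()` should give the equivalent result for every contributor at once.

```python
# Invented weekly PR counts for one contributor, oldest week first.
weekly_counts = [2, 5, 3, 3, 0, 4]

def week_over_week_change(counts):
    """Return the difference between each week's count and the previous week's."""
    return [current - previous for previous, current in zip(counts, counts[1:])]

changes = week_over_week_change(weekly_counts)
print(changes)  # [3, -2, 0, -3, 4]
```

Positive values mean the contributor opened more PRs than in the previous week, negative values mean fewer.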
- Rate Limiting: Always check the `X-RateLimit-Remaining` header in API responses. If you're running low, add a small delay (like `time.sleep(1)`) between requests; once it reaches zero, wait until the time given in `X-RateLimit-Reset` before retrying, so you don't get blocked.
- Pagination: The example above loops until no PRs are returned, but you can also parse the `Link` header in responses to grab the next page URL directly for more precise control.
- Filtering: If you only want to count merged PRs, add a check for `pr['merged_at'] is not None` when building the DataFrame.
- Large Repos: For repos with tens of thousands of PRs, consider using the GitHub GraphQL API instead. It lets you request only the fields you need, which can mean fewer requests and much less data to transfer.
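The rate-limit and pagination tips can be sketched as two small helpers. This is a minimal illustration under stated assumptions, not a complete client: `seconds_until_reset` and `next_page_url` are hypothetical names, and the sample URLs in the usage lines are made up. `requests` also exposes the parsed header as `response.links`, so the manual parsing here is mainly to show what the header contains.

```python
import re
import time

def seconds_until_reset(headers, now=None):
    """Seconds to wait before the rate-limit window resets; 0.0 if quota remains."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", "0"))  # Unix epoch seconds
    now = time.time() if now is None else now
    return max(0.0, reset_at - now)

def next_page_url(link_header):
    """Extract the rel="next" URL from a Link header, or None on the last page."""
    for url, rel in re.findall(r'<([^>]+)>\s*;\s*rel="([^"]+)"', link_header or ""):
        if rel == "next":
            return url
    return None

# Usage with made-up header values:
sample_link = (
    '<https://api.github.com/repositories/1/pulls?page=2>; rel="next", '
    '<https://api.github.com/repositories/1/pulls?page=5>; rel="last"'
)
print(next_page_url(sample_link))
print(seconds_until_reset({"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "2000"}, now=1000))
```

In the fetch loop, you could call `time.sleep(seconds_until_reset(response.headers))` after each request and keep following `next_page_url(response.headers.get("Link"))` until it returns `None`.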
This question comes from Stack Exchange; it was asked by aardvark.




