How can I use the GitHub API to track changes in PR submission frequency for contributors to a public project?

阿华AIGC实验室

2026-5-27

Yes, you can track how often each contributor submits PRs over time for a public GitHub repo using the GitHub REST API. The steps below break this down, with code snippets and practical tips along the way.

1. Prep Work First

Grab a Personal Access Token (PAT): Unauthenticated requests work, but GitHub caps them at 60 per hour. At 100 PRs per page that covers roughly 6,000 PRs, and the quota is easy to exhaust if you rerun the script a few times. A PAT raises the limit to 5,000 requests per hour. Create one in your GitHub settings (under Developer settings > Personal access tokens). For read-only access to public repos, a classic token needs no scopes at all; the public_repo scope also grants write access to public repos, which this task doesn't need.
Pick Your Tools: I'll use Python for the examples here because it suits both API calls and data wrangling. You'll need the requests library for HTTP, plus pandas and matplotlib for analysis and visualization. Install them with:
```
pip install requests pandas matplotlib
```
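
To keep the token out of your source file, one option is to read it from the environment. The following is a minimal sketch, assuming the token is exported in a variable named GITHUB_TOKEN; that variable name and the helper build_headers are illustrative choices, not something GitHub requires.

```python
import os


def build_headers(token=None):
    """Return request headers for the GitHub REST API.

    Falls back to the GITHUB_TOKEN environment variable (an assumed name)
    and omits the Authorization header entirely when no token is available,
    in which case requests are unauthenticated and limited to 60 per hour.
    """
    token = token or os.environ.get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers
```

The returned dictionary can be passed as the headers argument of requests.get in the fetch loop.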

2. Fetch All PR Data via the GitHub API

The core endpoint to pull PRs is GET /repos/{owner}/{repo}/pulls. Here’s how to use it effectively:

Critical Parameters:
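- state=all: Fetches every PR, open and closed (merged PRs are a subset of closed ones). The endpoint accepts only open, closed, or all; there is no merged value, so to count only accepted contributions, fetch closed or all and keep the PRs whose merged_at is not null.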
- per_page=100: Maximize results per request to cut down on API calls.
- page={page_number}: Handle pagination (GitHub returns results in chunks, so you’ll loop through pages until there are no more PRs left).

Example code to fetch all PRs:

import requests

GITHUB_TOKEN = "your_pat_here"
OWNER = "repo_owner_username"
REPO = "target_repo_name"

headers = {"Authorization": f"token {GITHUB_TOKEN}"}
all_prs = []
current_page = 1

while True:
    api_url = f"https://api.github.com/repos/{OWNER}/{REPO}/pulls?state=all&per_page=100&page={current_page}"
    response = requests.get(api_url, headers=headers, timeout=30)  # Don't hang forever on a stalled connection
    response.raise_for_status()  # Raise requests.HTTPError on a 4xx/5xx response
    
    page_prs = response.json()
    if not page_prs:
        break  # No more PRs to fetch
    
    all_prs.extend(page_prs)
    current_page += 1

3. Extract & Structure Your Target Data

From each PR, we only need two key details: the contributor’s username (pr['user']['login']) and the PR creation timestamp (pr['created_at']). Let’s clean this data and group it by time intervals (weekly or monthly works best):

import pandas as pd

# Convert raw PR data into a structured DataFrame
pr_df = pd.DataFrame([
    {
        # "user" may be null (e.g. a deleted account); "ghost" is a placeholder label
        "contributor": (pr["user"] or {}).get("login", "ghost"),
        "created_at": pd.to_datetime(pr["created_at"])
    }
    for pr in all_prs
])

# Group by contributor and weekly intervals (for monthly, use 'ME' on pandas 2.2+ or 'M' on older versions)
pr_frequency = pr_df.groupby([
    pd.Grouper(key="created_at", freq="W"),
    "contributor"
]).size().unstack(fill_value=0)
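
The table above holds raw counts per period. To look at how the frequency changes, one possible follow-up is a rolling average plus a period-over-period difference. This is a sketch, not the only way to measure change: the helper name frequency_change, the default 4-period window, and the synthetic demo table are all assumptions made for illustration.

```python
import pandas as pd


def frequency_change(pr_frequency, window=4):
    """Return (rolling_mean, delta) for a period-by-contributor count table.

    rolling_mean smooths each contributor's counts over `window` periods;
    delta is the change in count versus the previous period (0 for the first).
    """
    rolling_mean = pr_frequency.rolling(window=window, min_periods=1).mean()
    delta = pr_frequency.diff().fillna(0)
    return rolling_mean, delta


# Tiny synthetic table standing in for pr_frequency (weekly bins ending on Sunday)
demo = pd.DataFrame(
    {"alice": [1, 3, 2], "bob": [0, 0, 4]},
    index=pd.date_range("2024-01-07", periods=3, freq="W"),
)
rolling_mean, delta = frequency_change(demo, window=2)
```

A positive value in delta means the contributor opened more PRs than in the previous period; the rolling mean is usually the easier of the two to read on a plot.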

4. Visualize PR Frequency Over Time

Now you can plot the data to see how each contributor’s submission rate changes over time. Here’s a quick matplotlib example:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
for contributor in pr_frequency.columns:
    plt.plot(pr_frequency.index, pr_frequency[contributor], label=contributor)

plt.title("PR Submission Frequency Over Time by Contributor")
plt.xlabel("Date")
plt.ylabel("Number of PRs")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()
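
On a repo with many contributors, one line per person makes the chart unreadable. A possible workaround is to keep only the most active contributors and fold everyone else into a single column before plotting. The sketch below assumes a helper named top_contributors and a catch-all column named "other"; both are illustrative, and the column name would collide if a contributor were actually called "other".

```python
import pandas as pd


def top_contributors(pr_frequency, n=10):
    """Keep the n contributors with the most PRs overall.

    All remaining contributors are summed into one "other" column so the
    per-period totals stay the same.
    """
    totals = pr_frequency.sum().sort_values(ascending=False)
    keep = list(totals.index[:n])
    trimmed = pr_frequency[keep].copy()
    rest = pr_frequency.drop(columns=keep)
    if rest.shape[1] > 0:
        trimmed["other"] = rest.sum(axis=1)
    return trimmed
```

Passing top_contributors(pr_frequency, n=10) to the plotting loop in place of pr_frequency keeps the legend to at most eleven entries.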

5. Key Things to Keep in Mind

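Rate Limiting: Check the X-RateLimit-Remaining header in API responses. When it approaches zero, pause until the time given in X-RateLimit-Reset (a UTC epoch timestamp in seconds) rather than relying on a fixed delay such as time.sleep(1), which slows the loop down but does not restore any quota.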
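
The rate-limit check can be sketched as a small helper that reads the two headers and returns how long to pause. The function name seconds_to_wait and the floor parameter are assumptions made for this example.

```python
import time


def seconds_to_wait(headers, now=None, floor=1):
    """Return how many seconds to pause before the next request.

    headers: response headers as a mapping of header name to string value.
    Returns 0 while more than `floor` requests remain in the current window;
    otherwise the time left until the X-RateLimit-Reset epoch timestamp.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", floor + 1))
    if remaining > floor:
        return 0
    now = time.time() if now is None else now
    reset_at = int(headers.get("X-RateLimit-Reset", now))
    return max(0, reset_at - now)
```

Inside the fetch loop this would be used as time.sleep(seconds_to_wait(response.headers)) after each request. The headers object that requests returns is case-insensitive; a plain dict is not, so the key spelling matters if you pass your own mapping.
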
Pagination: The example above loops until no PRs are returned, but you can also parse the Link header in responses to directly grab the next page URL for more precise control.
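
As a rough illustration of the Link-header approach, the helper below pulls the rel="next" URL out of a raw header value using only the standard library. The name next_page_url is an assumption. The requests library also exposes the parsed header as response.links, so in practice response.links.get("next", {}).get("url") gives the same answer.

```python
import re

# Matches entries of the form: <https://...>; rel="next"
_LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header):
    """Return the rel="next" URL from a Link header value, or None on the last page."""
    if not link_header:
        return None
    for url, rel in _LINK_PATTERN.findall(link_header):
        if rel == "next":
            return url
    return None
```

With this, the loop can request next_page_url(response.headers.get("Link")) until it returns None, instead of incrementing a page counter and waiting for an empty page.
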
Filtering: If you only want to count merged PRs, add a check for pr['merged_at'] is not None when building the DataFrame.
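
The merged-only filter might look like the sketch below, which builds the same contributor and timestamp records as before but skips PRs that were never merged. The helper name merged_pr_records and the "ghost" placeholder for a missing user are assumptions made for illustration.

```python
def merged_pr_records(prs):
    """Build contributor/timestamp records for merged PRs only.

    prs: list of PR dictionaries as returned by the pulls endpoint.
    """
    records = []
    for pr in prs:
        if pr.get("merged_at") is None:
            continue  # still open, or closed without being merged
        user = pr.get("user") or {}  # "user" may be null
        records.append({
            "contributor": user.get("login", "ghost"),
            "created_at": pr["created_at"],
        })
    return records
```

The result can be passed straight to pd.DataFrame, followed by pd.to_datetime on the created_at column.
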
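Large Repos: For repos with tens of thousands of PRs, consider using the GitHub GraphQL API instead. It is still paginated (up to 100 PRs per request), but it lets you ask for only the fields you need, such as the author login and creation time, so each response is far smaller than the full PR objects the REST endpoint returns.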
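
A GraphQL version could be sketched as follows. The query requests only the creation time, merge time, and author login for each PR. The helper names build_payload and extract_page and the "ghost" placeholder are assumptions; the HTTP call itself is left out so that the block runs without network access.

```python
PR_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    pullRequests(first: 100, after: $cursor,
                 orderBy: {field: CREATED_AT, direction: ASC}) {
      pageInfo { hasNextPage endCursor }
      nodes { createdAt mergedAt author { login } }
    }
  }
}
"""


def build_payload(owner, name, cursor=None):
    """Return the JSON body for one page of the query."""
    return {
        "query": PR_QUERY,
        "variables": {"owner": owner, "name": name, "cursor": cursor},
    }


def extract_page(response_json):
    """Return (rows, next_cursor); next_cursor is None on the last page."""
    connection = response_json["data"]["repository"]["pullRequests"]
    rows = [
        {
            # "author" is null when the account no longer exists
            "contributor": (node.get("author") or {}).get("login", "ghost"),
            "created_at": node["createdAt"],
            "merged": node["mergedAt"] is not None,
        }
        for node in connection["nodes"]
    ]
    info = connection["pageInfo"]
    next_cursor = info["endCursor"] if info["hasNextPage"] else None
    return rows, next_cursor
```

Each payload is sent as JSON in a POST request to https://api.github.com/graphql with the same token; unlike the REST endpoint, the GraphQL API does not accept unauthenticated requests. Loop by feeding next_cursor back into build_payload until it is None.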

The question in this post comes from Stack Exchange; it was asked by aardvark.