How to Extract GitHub Issues Programmatically? A Follow-Up Question to a Known Git Log Extraction Method

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-21

Hey there! Glad you've got the Git log extraction workflow sorted—let's dive into your GitHub Issues extraction questions next. Here's how to handle both tasks effectively:

1. Extracting GitHub Issues programmatically

The easiest way to do this programmatically is through the GitHub REST API. For Python developers, the PyGithub library wraps that API, so you don't have to build raw HTTP requests yourself. Here's a step-by-step breakdown:

First, install the library:
```
pip install pygithub
```
Next, create a GitHub Personal Access Token (PAT). A classic token with the repo scope lets you read private repos if needed, and authenticated requests get a much higher rate limit than unauthenticated ones.

Then, use this code snippet to fetch issues and export them to a CSV:

```
from github import Auth, Github
import csv

# Initialize with your PAT. Auth.Token needs PyGithub 1.59 or newer;
# older releases take the token as the first positional argument instead.
g = Github(auth=Auth.Token("YOUR_GITHUB_PAT"))
repo = g.get_repo("OWNER/REPO_NAME")  # Replace with your repo path (e.g., "opencv/opencv")

# Open a CSV file to write
with open("GitHub_Issues.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["Issue_Number", "Title", "Creator_Name", "Creator_Email", "Created_Date", "State", "Description"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    # Fetch all issues (adjust state to "open" or "closed" if you don't want all)
    for issue in repo.get_issues(state="all"):
        # Skip pull requests (GitHub's issues endpoint returns PRs as well)
        if issue.pull_request:
            continue
        creator = issue.user
        writer.writerow({
            "Issue_Number": issue.number,
            "Title": issue.title,
            # name and email are optional public profile fields and are often None
            "Creator_Name": creator.name or creator.login,
            "Creator_Email": creator.email or "",
            "Created_Date": issue.created_at.strftime("%Y-%m-%d %H:%M:%S"),
            "State": issue.state,
            # Flatten the body onto one line; splitlines() also handles "\r\n"
            "Description": " ".join(issue.body.splitlines()) if issue.body else ""
        })
```

Note: If you don't want to use PyGithub, you can send GET requests straight to the REST endpoint (/repos/OWNER/REPO_NAME/issues) with a library like requests, but you'll need to handle pagination and authentication yourself.
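To make the manual route concrete, here is a minimal sketch of what pagination and authentication look like without PyGithub. The helper names make_session and iter_issues are made up for this illustration, and the sketch assumes the requests package is installed; treat it as a starting point, not a finished client.

```python
from typing import Any, Dict, Iterator

API_ROOT = "https://api.github.com"


def make_session(token: str):
    """Build a requests.Session that sends the PAT with every call."""
    import requests  # third-party: pip install requests

    session = requests.Session()
    session.headers.update({
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    return session


def iter_issues(session, repo_path: str, state: str = "all") -> Iterator[Dict[str, Any]]:
    """Yield issue dicts for repo_path, following pagination and skipping PRs."""
    url = f"{API_ROOT}/repos/{repo_path}/issues"
    params = {"state": state, "per_page": 100}  # 100 is the largest page size
    while url:
        response = session.get(url, params=params)
        response.raise_for_status()
        for item in response.json():
            # Pull requests come back from this endpoint with a "pull_request" key
            if "pull_request" not in item:
                yield item
        # requests parses the Link response header into response.links
        url = response.links.get("next", {}).get("url")
        params = None  # the "next" URL already carries the query string
```

You would call it as iter_issues(make_session("YOUR_GITHUB_PAT"), "OWNER/REPO_NAME") and feed each dict to the same csv.DictWriter as before. Passing the session in as an argument is only a design choice that keeps the pagination loop easy to test with a stand-in object.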

2. Extracting GitHub Issues together with their comments

To include comments, you just need to extend the above code to fetch comments for each issue and link them back to the parent issue. Here's how to modify the script:

```
from github import Auth, Github
import csv

# Auth.Token needs PyGithub 1.59 or newer; older releases take the
# token as the first positional argument instead.
g = Github(auth=Auth.Token("YOUR_GITHUB_PAT"))
repo = g.get_repo("OWNER/REPO_NAME")

with open("GitHub_Issues_With_Comments.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["Issue_Number", "Issue_Title", "Comment_Author_Name", "Comment_Author_Email", "Comment_Date", "Comment_Content"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    for issue in repo.get_issues(state="all"):
        # Skip pull requests (GitHub's issues endpoint returns PRs as well)
        if issue.pull_request:
            continue

        # Optional: Write the issue's original content as a "comment" from the creator
        # writer.writerow({
        #     "Issue_Number": issue.number,
        #     "Issue_Title": issue.title,
        #     "Comment_Author_Name": "ISSUE_CREATOR",
        #     "Comment_Author_Email": issue.user.email or "",
        #     "Comment_Date": issue.created_at.strftime("%Y-%m-%d %H:%M:%S"),
        #     "Comment_Content": " ".join(issue.body.splitlines()) if issue.body else ""
        # })

        # Fetch and write all user comments for the issue
        for comment in issue.get_comments():
            author = comment.user
            writer.writerow({
                "Issue_Number": issue.number,
                "Issue_Title": issue.title,
                # name and email are optional public profile fields and are often None
                "Comment_Author_Name": author.name or author.login,
                "Comment_Author_Email": author.email or "",
                "Comment_Date": comment.created_at.strftime("%Y-%m-%d %H:%M:%S"),
                # Flatten the body onto one line; splitlines() also handles "\r\n"
                "Comment_Content": " ".join(comment.body.splitlines()) if comment.body else ""
            })
```

This script creates a CSV where each row is linked to an issue number—either the original issue content (if you uncomment that section) or a user comment. This makes it easy to track full conversations tied to specific issues.
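If you want to read that CSV back as conversations, something like the sketch below would do it. The function names group_comments and load_threads are hypothetical; the column names and the default file name are the ones used by the script above.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def group_comments(rows: Iterable[Dict[str, str]]) -> Dict[int, List[Dict[str, str]]]:
    """Group CSV rows by issue number and order each thread by Comment_Date."""
    threads = defaultdict(list)
    for row in rows:
        threads[int(row["Issue_Number"])].append(row)
    for thread in threads.values():
        # The "%Y-%m-%d %H:%M:%S" format sorts chronologically as plain text
        thread.sort(key=lambda r: r["Comment_Date"])
    return dict(threads)


def load_threads(path: str = "GitHub_Issues_With_Comments.csv") -> Dict[int, List[Dict[str, str]]]:
    """Read the exported CSV and return {issue_number: [comment rows]}."""
    with open(path, newline="", encoding="utf-8") as csvfile:
        return group_comments(csv.DictReader(csvfile))
```

Calling load_threads() and indexing the result by an issue number then gives you every row recorded for that issue, oldest first.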

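A quick heads-up: GitHub's API has rate limits. With a PAT you can make up to 5,000 requests per hour, which should cover most repos. Keep in mind that the comments script needs at least one extra request per issue, and reading a user's name or email may trigger another lookup on top of that. For very large repos with tens of thousands of issues and comments, add small delays or check the rate-limit headers so you don't get blocked.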
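As a rough idea of what checking the rate-limit headers can look like, here is a small sketch. It relies on the X-RateLimit-Remaining and X-RateLimit-Reset response headers that GitHub sends; the function name seconds_to_wait and the one-second safety margin are assumptions made for this example.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to sleep before the next request (0.0 if quota remains)."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", "0"))  # epoch seconds
    current = time.time() if now is None else now
    # One extra second of slack (an arbitrary choice) in case clocks differ
    return max(0.0, reset_at - current) + 1.0
```

With the requests library you could call time.sleep(seconds_to_wait(response.headers)) after each response; its header mapping is case-insensitive, whereas a plain dict needs the exact header spelling shown here.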

The question in this post comes from Stack Exchange, where it was asked by Sandy Pabilonia.