How to Fetch All Files in a GitHub Project Folder and Their Raw Contents with Java

阿华AIGC实验室

2026-4-28

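Fetching the raw contents of every file in a GitHub folder

I understand the problem you are running into: requesting the web link of a GitHub folder only returns an HTML page, so you cannot fetch file contents in bulk from it. Below are two workable approaches. I recommend the first because it is more stable and reliable.

Root cause

The link you are using, https://github.com/company/project/tree/master/folder1/tempFolder, is GitHub's web view. It returns a rendered HTML page rather than structured file-list data, so the raw contents of the files cannot be extracted from it directly. To fetch them in bulk, you need either the API that GitHub provides or HTML parsing (not recommended).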

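Option 1: Use the GitHub REST API (recommended)

GitHub provides a dedicated API for reading repository contents. For a folder path, it returns metadata for each entry directly inside that folder (including a download link to the raw content), and we can then loop over those links and fetch each file.

Step 1: Call the API to get the file list

API endpoint format: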

GET https://api.github.com/repos/{owner}/{repo}/contents/{path}

For your example, that is:

https://api.github.com/repos/company/project/contents/folder1/tempFolder
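
The mapping from the web link to the API endpoint can also be done in code. The following is a minimal sketch, not an official utility: GitHubUrlConverter and toContentsApiUrl are hypothetical names, and the pattern assumes the branch name contains no slash (a branch such as feature/x would be split incorrectly). It also appends the ref query parameter of the contents API so that the request is pinned to the branch in the link; without ref, the API uses the repository's default branch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GitHubUrlConverter {
    // Matches https://github.com/{owner}/{repo}/tree/{branch}/{path}
    // Assumption: the branch name has no "/" in it.
    private static final Pattern TREE_URL = Pattern.compile(
            "^https://github\\.com/([^/]+)/([^/]+)/tree/([^/]+)/(.+?)/?$");

    // Converts a folder web link into the matching contents API URL.
    static String toContentsApiUrl(String treeUrl) {
        Matcher m = TREE_URL.matcher(treeUrl);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a GitHub folder URL: " + treeUrl);
        }
        String owner = m.group(1);
        String repo = m.group(2);
        String branch = m.group(3);
        String path = m.group(4);
        return "https://api.github.com/repos/" + owner + "/" + repo
                + "/contents/" + path + "?ref=" + branch;
    }

    public static void main(String[] args) {
        System.out.println(toContentsApiUrl(
                "https://github.com/company/project/tree/master/folder1/tempFolder"));
    }
}
```

Links that do not follow the /tree/{branch}/{path} shape (for example a link to the repository root) are rejected with an IllegalArgumentException rather than guessed at.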

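This endpoint returns a JSON array in which each element has fields such as name (the file name), type (file or dir), and download_url (the link to the raw content).

Step 2: Java implementation

First add a JSON parsing library such as Jackson (Maven dependency):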

<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.15.2</version>
</dependency>

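Then modify your code to fetch the file list first and request each file's raw content afterwards. The example keeps Apache HttpClient 4.x (the org.apache.http packages):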

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class GitHubFileFetcher {
    public static void main(String[] args) throws IOException {
        // GitHub API endpoint: replace with your own repository and path
        final String folderApiUrl = "https://api.github.com/repos/company/project/contents/folder1/tempFolder";
        ObjectMapper objectMapper = new ObjectMapper();

        // try-with-resources closes the client and every response, even on early return or exception
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Step 1: fetch the list of entries in the folder
            JsonNode fileList;
            try (CloseableHttpResponse folderResponse = httpClient.execute(new HttpGet(folderApiUrl))) {
                if (folderResponse.getStatusLine().getStatusCode() != 200) {
                    System.err.println("Failed to get file list: " + folderResponse.getStatusLine());
                    return;
                }
                String folderResponseBody = EntityUtils.toString(folderResponse.getEntity(), StandardCharsets.UTF_8);
                fileList = objectMapper.readTree(folderResponseBody);
            }

            // Step 2: loop over the entries and fetch the raw content of each file
            for (JsonNode file : fileList) {
                // Only handle files; skip subfolders and other entry types
                if (!"file".equals(file.path("type").asText())) {
                    continue;
                }
                String fileName = file.path("name").asText();
                String rawFileUrl = file.path("download_url").asText();

                // Request the raw file content
                try (CloseableHttpResponse fileResponse = httpClient.execute(new HttpGet(rawFileUrl))) {
                    if (fileResponse.getStatusLine().getStatusCode() == 200) {
                        String fileContent = EntityUtils.toString(fileResponse.getEntity(), StandardCharsets.UTF_8);
                        System.out.println("=== File: " + fileName + " ===");
                        System.out.println(fileContent);
                        System.out.println("=====================\n");
                    } else {
                        System.err.println("Failed to fetch " + fileName + ": " + fileResponse.getStatusLine());
                    }
                }
            }
        }
    }
}
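
The fetcher above sends anonymous requests. If you have a personal access token, attaching it to the API call raises the rate limit to 5,000 requests per hour and is required for private repositories. The sketch below shows one way to build such a request. It uses the JDK's built-in java.net.http client (Java 11+) instead of Apache HttpClient so that it has no dependencies; GitHubRequests and contentsRequest are hypothetical names, and where the token comes from is up to you.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class GitHubRequests {
    // Builds a GET request for the contents API.
    // The token is optional: pass null or "" to stay anonymous.
    static HttpRequest contentsRequest(String apiUrl, String token) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(apiUrl))
                // Media type and API version header documented for the GitHub REST API
                .header("Accept", "application/vnd.github+json")
                .header("X-GitHub-Api-Version", "2022-11-28")
                .GET();
        if (token != null && !token.isEmpty()) {
            builder.header("Authorization", "Bearer " + token);
        }
        return builder.build();
    }
}
```

A typical call would read the token from an environment variable, for example contentsRequest(url, System.getenv("GITHUB_TOKEN")), where the variable name is only an assumption, and then pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).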

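Notes

Public repositories can be read through this API without authentication, but frequent requests may be rate limited (GitHub allows unauthenticated requests 60 calls per hour per IP address).
The code skips subfolders (entries whose type is dir). To include their contents, call the API again for each subfolder's path, recursively.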
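
The recursive traversal mentioned above can be sketched as follows. To keep the sketch runnable without network access or a JSON library, the actual API call is passed in as a function: listFolder is assumed to take a repository path and return the entries the contents API reports for it. RecursiveLister, Entry, and collectFiles are hypothetical names, and Entry mirrors only the type, path, and download_url fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class RecursiveLister {
    // One entry of a folder listing (a subset of the API's fields).
    static final class Entry {
        final String type;        // "file", "dir", "symlink" or "submodule"
        final String path;        // path relative to the repository root
        final String downloadUrl; // null for directories

        Entry(String type, String path, String downloadUrl) {
            this.type = type;
            this.path = path;
            this.downloadUrl = downloadUrl;
        }
    }

    // Depth-first walk: files are collected, directories are listed in turn,
    // and every other entry type is ignored.
    static List<Entry> collectFiles(String path, Function<String, List<Entry>> listFolder) {
        List<Entry> files = new ArrayList<>();
        for (Entry entry : listFolder.apply(path)) {
            if ("file".equals(entry.type)) {
                files.add(entry);
            } else if ("dir".equals(entry.type)) {
                files.addAll(collectFiles(entry.path, listFolder));
            }
        }
        return files;
    }
}
```

In real use, listFolder would wrap the request and JSON parsing from the fetcher above. Each subfolder costs one extra API call, which matters under the unauthenticated rate limit; for deep trees, the Git Trees API with recursive=1 may be a better fit.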

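Option 2: Parse the GitHub web page (not recommended)

If you would rather not use the API, you can parse the HTML of the folder page to extract file links. This approach is fragile: GitHub can change its page structure at any time and break the code. The example uses the Jsoup library for parsing:

Maven dependency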

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
</dependency>

Example code

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class GitHubHtmlParser {
    public static void main(String[] args) throws IOException {
        final String folderPageUrl = "https://github.com/company/project/tree/master/folder1/tempFolder";
        // Base of the raw-file links: replace /tree/master with /raw/master in the folder link
        final String rawBaseUrl = folderPageUrl.replace("/tree/master", "/raw/master");

        // try-with-resources closes the client and every response, even on exception
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Fetch the HTML of the folder page
            String pageHtml;
            try (CloseableHttpResponse pageResponse = httpClient.execute(new HttpGet(folderPageUrl))) {
                pageHtml = EntityUtils.toString(pageResponse.getEntity(), StandardCharsets.UTF_8);
            }
            Document doc = Jsoup.parse(pageHtml);

            // Extract file links (note: GitHub's page structure may change, so this selector may stop matching)
            Elements fileLinks = doc.select("div[role=row] a[title]");
            for (Element link : fileLinks) {
                String fileName = link.attr("title");
                // Names containing spaces or other special characters must be URL-encoded before this step
                String rawUrl = rawBaseUrl + "/" + fileName;

                // Request the raw content
                try (CloseableHttpResponse fileResponse = httpClient.execute(new HttpGet(rawUrl))) {
                    if (fileResponse.getStatusLine().getStatusCode() == 200) {
                        String content = EntityUtils.toString(fileResponse.getEntity(), StandardCharsets.UTF_8);
                        System.out.println("=== File: " + fileName + " ===");
                        System.out.println(content);
                        System.out.println("=====================\n");
                    } else {
                        System.err.println("Failed to fetch " + fileName + ": " + fileResponse.getStatusLine());
                    }
                }
            }
        }
    }
}
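
The parser above builds raw links by string concatenation, which fails for file names that contain spaces or non-ASCII characters. Below is a minimal sketch of a safer builder, assuming the owner, repository, branch, and file path are already known. It targets raw.githubusercontent.com, the host that serves raw file content, and lets java.net.URI percent-encode the path. RawUrlBuilder and rawUrl are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class RawUrlBuilder {
    // Builds https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{filePath}
    // with characters that are illegal in a URL path percent-encoded.
    static String rawUrl(String owner, String repo, String branch, String filePath) {
        String path = "/" + owner + "/" + repo + "/" + branch + "/" + filePath;
        try {
            // The multi-argument URI constructor quotes illegal characters in the path
            return new URI("https", "raw.githubusercontent.com", path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build raw URL for " + filePath, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rawUrl("company", "project", "master", "folder1/tempFolder/my file.txt"));
    }
}
```

The resulting string can be passed straight to new HttpGet(...) in either example.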