
How do I get a paper's citation count for the past 12 months in code? Practical questions about the OpenAlex and Semantic Scholar APIs

Solution

Root cause analysis

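  1. The OpenAlex result is too low: your test used a future date range (2025-2026), but OpenAlex only holds data up to the present, so citing works that have not been published yet cannot appear. With a real trailing 12-month window the result should match expectations. The rest of your code logic is correct, and the filter parameter format is fine.
  2. Limits of the Semantic Scholar API: the original code filters only on the year field, so it cannot express a custom month-level window. The free API also caps results at about 10k, although date-based early termination can cut unnecessary requests.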
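
To make the first point concrete, here is a minimal sketch of computing a trailing 12-month window and checking that it does not run past today. The helper names `trailing_window` and `is_future_window` are hypothetical, and treating "12 months" as 365 days is an assumption that mirrors the code later in this answer.

```python
from datetime import date, timedelta


def trailing_window(today: date, days: int = 365) -> tuple[date, date]:
    """Return (from_date, to_date) for the `days` days ending at `today`."""
    return today - timedelta(days=days), today


def is_future_window(to_date: date, today: date) -> bool:
    """True if the window ends after `today`, so it cannot be fully indexed yet."""
    return to_date > today


if __name__ == "__main__":
    today = date.today()
    from_date, to_date = trailing_window(today)
    print(from_date.isoformat(), to_date.isoformat())
    print("ends in the future:", is_future_window(to_date, today))
```

A window such as 2025-2026 queried before the end of 2026 would be flagged by `is_future_window`, which is the situation described in point 1.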

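Improvement 1: fix the OpenAlex API test

Change the date range to the real past 12 months (computed automatically by going back 12 months from today) and check the result: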

import requests
import time
from datetime import datetime, timedelta

openalex_id = "W2626778328"
# Build the date range for the past 12 months automatically
to_date = datetime.today().strftime("%Y-%m-%d")
from_date = (datetime.today() - timedelta(days=365)).strftime("%Y-%m-%d")

results_all = []
page = 1
per_page = 200

while True:
    url = (
        "https://api.openalex.org/works"
        f"?filter=referenced_works:{openalex_id},"
        # A comma separates filters, so the range needs the from_/to_ date filters
        f"from_publication_date:{from_date},to_publication_date:{to_date}"
        f"&per-page={per_page}&page={page}"
    )
    res = requests.get(url)
    print("status:", res.status_code)
    if res.status_code != 200:
        print(res.text)
        break
    data = res.json()
    results = data.get("results", [])
    if not results:
        break
    results_all.extend(results)
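    # Stop once every result has been fetched.
    # Note: page-based paging only reaches the first 10,000 results;
    # larger result sets need OpenAlex's cursor paging instead.
    if page * per_page >= data["meta"]["count"]:
        break
    page += 1
    time.sleep(0.1)  # stay within OpenAlex's rate limits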

print(f"Total citations in {from_date} to {to_date}: {len(results_all)}")

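Notes:

  • OpenAlex expresses a date range with the from_publication_date:YYYY-MM-DD and to_publication_date:YYYY-MM-DD filters, joined by a comma inside the filter parameter. A bare publication_date:YYYY-MM-DD,YYYY-MM-DD is not a range, because the comma separates filters.
  • With a real trailing 12-month window, the returned count should be in line with the recent citation numbers shown on the platform.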
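
If only the number is needed, downloading every citing work may be unnecessary, since the response already carries `meta.count` (the field the loop above reads). The sketch below builds that query with the standard library. The function names are hypothetical, the `referenced_works` filter is taken from the code above, and `from_publication_date` / `to_publication_date` are the date filters described in OpenAlex's documentation; treat it as an illustration rather than a tested client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_filter(openalex_id: str, from_date: str, to_date: str) -> str:
    """Join the three filters with commas (OpenAlex treats commas as AND)."""
    return ",".join([
        f"referenced_works:{openalex_id}",
        f"from_publication_date:{from_date}",
        f"to_publication_date:{to_date}",
    ])


def build_count_url(openalex_id: str, from_date: str, to_date: str) -> str:
    """Build a URL that asks for a single result; only meta.count is of interest."""
    query = urlencode({
        "filter": build_filter(openalex_id, from_date, to_date),
        "per-page": 1,
    })
    return f"{OPENALEX_WORKS_URL}?{query}"


def count_citing_works(openalex_id: str, from_date: str, to_date: str) -> int:
    """Perform one HTTP request and return meta.count from the response."""
    with urlopen(build_count_url(openalex_id, from_date, to_date), timeout=30) as res:
        return json.load(res)["meta"]["count"]
```

Calling `count_citing_works("W2626778328", from_date, to_date)` with the two date strings computed earlier would issue a single request. This also sidesteps the 10,000-result ceiling of page-based paging, which a heavily cited paper can exceed within one year.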

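Improvement 2: custom date windows with the Semantic Scholar API

Modify the code to request the publicationDate field of each citing paper and filter locally for citations inside the custom window, while also letting the paging loop stop early: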

import json
import requests
import re
import time
from datetime import datetime, timedelta  # timedelta is needed for FROM_DATE

# =========================
# CONFIG
# =========================
API_KEY = ""  # fill in your Semantic Scholar API key
ARXIV_PATH = "/content/drive/MyDrive/arxiv_data/arxiv_entries_2017_23.json"
TARGET_TITLE = "Attention Is All You Need"
# Custom 12-month window (example: the trailing 12 months)
TO_DATE = datetime.today()
FROM_DATE = TO_DATE - timedelta(days=365)
# Send the key header only when a key has been set
HEADERS = {"x-api-key": API_KEY} if API_KEY else {}

# =========================
# HTTP WITH RETRY
# =========================
def safe_get(url, params=None, retries=5):
    for i in range(retries):
        res = requests.get(url, params=params, headers=HEADERS)
        if res.status_code == 200:
            return res
        if res.status_code == 429:
            wait = 2 ** i
            print(f"[429] Rate limited. Sleeping {wait}s...")
            time.sleep(wait)
        else:
            raise ValueError(f"Request failed: {res.status_code} {res.text}")
    raise ValueError("Max retries exceeded")

# =========================
# LOAD ARXIV DATA & FIND PAPER
# =========================
with open(ARXIV_PATH, "r") as f:
    data = json.load(f)

paper_entry = None
for entry in data:
    if TARGET_TITLE.lower() in entry.get("title", "").lower():
        paper_entry = entry
        break
if paper_entry is None:
    raise ValueError("Paper not found")
print("Found:", paper_entry["title"])

# =========================
# EXTRACT ARXIV ID
# =========================
raw_id = paper_entry["id"]
match = re.search(r"abs/([0-9]+\.[0-9]+)(v\d+)?", raw_id)
if not match:
    raise ValueError("Could not parse arXiv ID")
arxiv_id = match.group(1)
print("Parsed arXiv ID:", arxiv_id)

# =========================
# RESOLVE TO SEMANTIC SCHOLAR
# =========================
url = f"https://api.semanticscholar.org/graph/v1/paper/ARXIV:{arxiv_id}"
params = {"fields": "paperId,title"}
res = safe_get(url, params)
paper_data = res.json()
if "paperId" not in paper_data:
    raise ValueError(f"Semantic Scholar lookup failed: {paper_data}")
paper_id = paper_data["paperId"]
print("Semantic Scholar ID:", paper_id)
print("Resolved title:", paper_data.get("title"))

# =========================
# FETCH CITATIONS WITH DATE FILTER
# =========================
url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}/citations"
params = {
    "fields": "citingPaper.publicationDate,citingPaper.paperId",
    "limit": 1000
}
offset = 0
in_range_count = 0
missing_date = 0
seen_ids = set()

while True:
    params["offset"] = offset
    res = safe_get(url, params)
    data = res.json()
    citations = data.get("data", [])
    if not citations:
        break

    page_in_range = 0
    page_latest = None  # newest publication date seen on this page
    for c in citations:
        paper = c.get("citingPaper") or {}
        pid = paper.get("paperId")
        if pid is None or pid in seen_ids:
            continue
        seen_ids.add(pid)

        pub_date_str = paper.get("publicationDate")
        if not pub_date_str:
            missing_date += 1
            continue

        # Parse the date (accepts both YYYY-MM-DD and YYYY-MM)
        try:
            if len(pub_date_str) == 7:  # YYYY-MM: strptime defaults to day 1
                pub_date = datetime.strptime(pub_date_str, "%Y-%m")
            else:
                pub_date = datetime.strptime(pub_date_str, "%Y-%m-%d")
        except ValueError:
            missing_date += 1
            continue

        # Track the newest date on this page
        if page_latest is None or pub_date > page_latest:
            page_latest = pub_date

        # Count the paper if it falls inside the target window
        if FROM_DATE <= pub_date <= TO_DATE:
            in_range_count += 1
            page_in_range += 1

    print(f"Fetched {len(citations)} at offset {offset} | in-range on page: {page_in_range}")

    # Early exit: every dated paper on this page predates the window.
    # This assumes citations come back newest first, which is not guaranteed.
    if page_latest is not None and page_latest < FROM_DATE:
        print("Stopping: No more recent citations in subsequent pages.")
        break

    offset += len(citations)
    # Free-tier cap: at most 10k results can be paged through
    if offset >= 10000:
        print("Stopping: Reached Semantic Scholar API free tier limit (10k results).")
        break

# =========================
# RESULTS
# =========================
print("\n--- RESULTS ---")
print(f"Custom time window: {FROM_DATE.strftime('%Y-%m-%d')} to {TO_DATE.strftime('%Y-%m-%d')}")
print("Citations in window:", in_range_count)
print("Missing/Invalid publication dates:", missing_date)

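Key improvements:

  1. Custom date window: request the citingPaper.publicationDate field, parse it, and compare it with the target range, so any contiguous 12-month period can be counted.
  2. Early termination: stop requesting further pages once the results are older than the window's start date, which cuts unnecessary API calls. This assumes the endpoint returns newer citations first, which is not guaranteed.
  3. Date format handling: both common publication date formats, YYYY-MM and YYYY-MM-DD, are parsed.
  4. Deduplication: the seen_ids set prevents counting the same citing paper twice.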
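
The date handling and the stop condition in points 1 to 3 can be pulled out into small pure functions, which makes them easy to check without calling the API. This is a sketch under stated assumptions: the names `parse_pub_date`, `count_in_window`, and `page_is_entirely_older` are hypothetical, and the input is assumed to be the raw publicationDate strings (or None) from one page of results.

```python
from datetime import datetime
from typing import Iterable, Optional


def parse_pub_date(value: Optional[str]) -> Optional[datetime]:
    """Parse 'YYYY-MM-DD' or 'YYYY-MM'; return None when missing or malformed."""
    if not value:
        return None
    for fmt in ("%Y-%m-%d", "%Y-%m"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


def count_in_window(dates: Iterable[Optional[str]],
                    start: datetime, end: datetime) -> tuple[int, int]:
    """Return (in_window, undated) for an iterable of raw date strings."""
    in_window = undated = 0
    for raw in dates:
        parsed = parse_pub_date(raw)
        if parsed is None:
            undated += 1
        elif start <= parsed <= end:
            in_window += 1
    return in_window, undated


def page_is_entirely_older(dates: Iterable[Optional[str]], start: datetime) -> bool:
    """True only if the page has dated papers and the newest one predates `start`."""
    parsed = [d for d in map(parse_pub_date, dates) if d is not None]
    return bool(parsed) and max(parsed) < start
```

Comparing the newest date on a page against the window start, rather than the oldest, keeps a page that mixes old and recent papers from ending the loop too soon.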

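Additional suggestions

  • Choosing a platform: OpenAlex has broader coverage and a looser free-API limit, which suits cases that need a lot of data; Semantic Scholar's citation data is updated faster, which suits cases that need the most recent citations.
  • Rate limits: the figures to work with are roughly 10 requests per second for OpenAlex and 100 requests per minute for the Semantic Scholar free tier. These limits change over time and depend on whether you use an API key, so check each service's current documentation. The sleep and retry logic in the code is meant to keep you within them.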
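
For the rate-limit point, one simple option is to space requests by a minimum interval before each call. The class below is a hypothetical helper, not part of either API's client library, and the interval you pass in (for example 0.1 s for about 10 requests per second) is an assumption to adjust against the limits that currently apply to you.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time at which the previous call was released

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` immediately before each HTTP request would replace the fixed `time.sleep(0.1)`; the exponential backoff on 429 responses in `safe_get` still covers the case where the server pushes back anyway.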

The question comes from Stack Exchange; it was asked by cheese.
