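How can I get a paper's citation count for the past 12 months in code? A question about using the OpenAlex and Semantic Scholar APIs in practice
Solution
Root cause analysis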
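- OpenAlex API results too low: your test used a future date range (2025-2026), but OpenAlex only holds data up to the present, so it naturally contains few citing works that have not been published yet. With a real trailing 12-month window, the results will be in line with expectations. Your code logic is otherwise correct, and the format of the filter parameter is fine.
- Semantic Scholar API limitation: the original code filters only on the year field, so it cannot support a custom month-level window. The free API also caps results at 10k, but date-based early termination can cut down on unnecessary requests.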
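To keep a test window from drifting into the future, the date range can be computed by a small helper instead of being typed by hand. The following is a minimal sketch; the name trailing_window is hypothetical, and it uses a flat 365 days for "12 months", the same simplification as the code in this answer.

```python
from datetime import date, timedelta


def trailing_window(today=None, days=365):
    """Return (from_date, to_date) as ISO strings for the window ending `today`.

    `trailing_window` is a hypothetical helper name, not part of any API.
    """
    end = today or date.today()
    # Refuse windows that end in the future: no index can hold
    # citing works that have not been published yet.
    if end > date.today():
        raise ValueError("window end lies in the future")
    start = end - timedelta(days=days)
    return start.isoformat(), end.isoformat()
```

Calling trailing_window() with no arguments gives the window ending today; passing an explicit date makes a run reproducible.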
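Improvement 1: fix the OpenAlex API test
Change the date range to the real trailing 12 months (computed automatically by going back 12 months from today) and check the result: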
    import time
    from datetime import datetime, timedelta

    import requests

    openalex_id = "W2626778328"

    # Build the trailing 12-month date range automatically
    to_date = datetime.today().strftime("%Y-%m-%d")
    from_date = (datetime.today() - timedelta(days=365)).strftime("%Y-%m-%d")

    results_all = []
    page = 1
    per_page = 200

    while True:
        url = (
            "https://api.openalex.org/works"
            f"?filter=referenced_works:{openalex_id},"
            # Commas separate filters, so a range needs two date filters
            f"from_publication_date:{from_date},"
            f"to_publication_date:{to_date}"
            f"&per-page={per_page}&page={page}"
        )
        res = requests.get(url)
        print("status:", res.status_code)
        if res.status_code != 200:
            print(res.text)
            break

        data = res.json()
        results = data.get("results", [])
        if not results:
            break
        results_all.extend(results)

        # Stop once every result has been fetched
        if page * per_page >= data["meta"]["count"]:
            break

        page += 1
        time.sleep(0.1)  # stay within OpenAlex's rate limit

    print(f"Total citations in {from_date} to {to_date}: {len(results_all)}")
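Notes:
- OpenAlex takes a date range as two filters, from_publication_date:YYYY-MM-DD and to_publication_date:YYYY-MM-DD. Commas in the filter string separate individual filters, so a single publication_date:start,end value is not a valid range.
- With a real trailing 12-month window, the returned results should be in line with the recent citation counts shown on the platform.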
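When only the number matters, paging through every citing work is more than is needed: the OpenAlex list response reports the total number of matches in meta.count, so one request with per-page=1 is enough. The sketch below uses only the standard library; build_count_url and count_citing_works are hypothetical helper names, and the window counts citing works by their publication date.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_count_url(openalex_id, from_date, to_date):
    """Build a works query whose meta.count is the number of citing works."""
    filters = ",".join([
        f"referenced_works:{openalex_id}",     # works that reference the target
        f"from_publication_date:{from_date}",  # inclusive lower bound
        f"to_publication_date:{to_date}",      # inclusive upper bound
    ])
    # Keep ":" and "," unescaped so the filter stays readable in logs.
    query = urlencode({"filter": filters, "per-page": 1}, safe=":,")
    return f"{OPENALEX_WORKS_URL}?{query}"


def count_citing_works(openalex_id, from_date, to_date, timeout=30):
    """Return the total from meta.count without downloading every result."""
    url = build_count_url(openalex_id, from_date, to_date)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["meta"]["count"]
```

For example, count_citing_works("W2626778328", from_date, to_date) would return the count for the work used in this answer. Fetching the full records is only necessary when the citing works themselves are of interest.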
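Improvement 2: custom date window with the Semantic Scholar API
Modify the code to fetch the publicationDate field of each citing paper, filter locally for citations inside the custom window, and tighten the paging logic so it can stop early: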
    import json
    import re
    import time
    from datetime import datetime, timedelta

    import requests

    # =========================
    # CONFIG
    # =========================
    API_KEY = ""  # fill in your Semantic Scholar API key
    ARXIV_PATH = "/content/drive/MyDrive/arxiv_data/arxiv_entries_2017_23.json"
    TARGET_TITLE = "Attention Is All You Need"

    # Custom 12-month window (example: the trailing 12 months)
    TO_DATE = datetime.today()
    FROM_DATE = TO_DATE - timedelta(days=365)

    # Send the key header only when a key is configured
    HEADERS = {"x-api-key": API_KEY} if API_KEY else {}

    # =========================
    # HTTP WITH RETRY
    # =========================
    def safe_get(url, params=None, retries=5):
        for i in range(retries):
            res = requests.get(url, params=params, headers=HEADERS)
            if res.status_code == 200:
                return res
            if res.status_code == 429:
                wait = 2 ** i  # exponential backoff
                print(f"[429] Rate limited. Sleeping {wait}s...")
                time.sleep(wait)
            else:
                raise ValueError(f"Request failed: {res.status_code} {res.text}")
        raise ValueError("Max retries exceeded")

    # =========================
    # LOAD ARXIV DATA & FIND PAPER
    # =========================
    with open(ARXIV_PATH, "r") as f:
        data = json.load(f)

    paper_entry = None
    for entry in data:
        if TARGET_TITLE.lower() in entry.get("title", "").lower():
            paper_entry = entry
            break
    if paper_entry is None:
        raise ValueError("Paper not found")
    print("Found:", paper_entry["title"])

    # =========================
    # EXTRACT ARXIV ID
    # =========================
    raw_id = paper_entry["id"]
    match = re.search(r"abs/([0-9]+\.[0-9]+)(v\d+)?", raw_id)
    if not match:
        raise ValueError("Could not parse arXiv ID")
    arxiv_id = match.group(1)
    print("Parsed arXiv ID:", arxiv_id)

    # =========================
    # RESOLVE TO SEMANTIC SCHOLAR
    # =========================
    url = f"https://api.semanticscholar.org/graph/v1/paper/ARXIV:{arxiv_id}"
    params = {"fields": "paperId,title"}
    res = safe_get(url, params)
    paper_data = res.json()
    if "paperId" not in paper_data:
        raise ValueError(f"Semantic Scholar lookup failed: {paper_data}")
    paper_id = paper_data["paperId"]
    print("Semantic Scholar ID:", paper_id)
    print("Resolved title:", paper_data.get("title"))

    # =========================
    # FETCH CITATIONS WITH DATE FILTER
    # =========================
    url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}/citations"
    params = {
        "fields": "citingPaper.publicationDate,citingPaper.paperId",
        "limit": 1000,
    }
    offset = 0
    in_range_count = 0
    missing_date = 0
    seen_ids = set()

    while True:
        params["offset"] = offset
        res = safe_get(url, params)
        data = res.json()
        citations = data.get("data", [])
        if not citations:
            break

        page_in_range = 0
        page_latest = None
        for c in citations:
            paper = c.get("citingPaper") or {}
            pid = paper.get("paperId")
            if pid is None or pid in seen_ids:
                continue
            seen_ids.add(pid)

            pub_date_str = paper.get("publicationDate")
            if not pub_date_str:
                missing_date += 1
                continue

            # Parse the date (accepts both YYYY-MM-DD and YYYY-MM)
            try:
                if len(pub_date_str) == 7:
                    # YYYY-MM resolves to the first day of that month
                    pub_date = datetime.strptime(pub_date_str, "%Y-%m")
                else:
                    pub_date = datetime.strptime(pub_date_str, "%Y-%m-%d")
            except ValueError:
                missing_date += 1
                continue

            # Track the newest date seen on this page
            if page_latest is None or pub_date > page_latest:
                page_latest = pub_date

            # Count the citation if it falls inside the target window
            if FROM_DATE <= pub_date <= TO_DATE:
                in_range_count += 1
                page_in_range += 1

        print(f"Fetched {len(citations)} at offset {offset} | in-range on page: {page_in_range}")

        # Early stop: even the newest paper on this page predates the window.
        # This assumes results arrive newest-first, which is not guaranteed.
        if page_latest and page_latest < FROM_DATE:
            print("Stopping: No more recent citations in subsequent pages.")
            break

        offset += len(citations)

        # Free API cap: at most 10k results can be retrieved
        if offset >= 10000:
            print("Stopping: Reached Semantic Scholar API free tier limit (10k results).")
            break

    # =========================
    # RESULTS
    # =========================
    print("\n--- RESULTS ---")
    print(f"Custom time window: {FROM_DATE.strftime('%Y-%m-%d')} to {TO_DATE.strftime('%Y-%m-%d')}")
    print("Citations in window:", in_range_count)
    print("Missing/Invalid publication dates:", missing_date)
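Key improvements:
- Custom date window: fetch the citingPaper.publicationDate field, parse it, and compare it against the target range, which supports counting over any consecutive 12 months.
- Early termination: stop paging once even the newest paper on a page is older than the window start, which saves unnecessary API calls. This only holds if citations come back newest-first; if the ordering is not reliable, drop the early stop and page through everything.
- Date format compatibility: handle both common publication date formats, YYYY-MM and YYYY-MM-DD.
- Deduplication: use the seen_ids set so the same citing paper is never counted twice.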
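The date parsing, deduplication, and window check listed above can be separated from the network code and tested on their own. The following is a minimal sketch under the assumption that each citation record has the shape {"citingPaper": {"paperId": ..., "publicationDate": ...}}; parse_pub_date and count_in_window are hypothetical helper names.

```python
from datetime import datetime


def parse_pub_date(value):
    """Parse 'YYYY-MM-DD' or 'YYYY-MM'; return None when missing or malformed."""
    if not value:
        return None
    for fmt in ("%Y-%m-%d", "%Y-%m"):
        try:
            # A 'YYYY-MM' value resolves to the first day of that month.
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


def count_in_window(citations, from_date, to_date):
    """Return (in_range, missing) for a list of citation records.

    Records without a paperId and repeated paperIds are skipped; records
    whose date is absent or unparseable are tallied as missing.
    """
    seen = set()
    in_range = 0
    missing = 0
    for item in citations:
        paper = item.get("citingPaper") or {}
        pid = paper.get("paperId")
        if pid is None or pid in seen:
            continue
        seen.add(pid)
        pub_date = parse_pub_date(paper.get("publicationDate"))
        if pub_date is None:
            missing += 1
        elif from_date <= pub_date <= to_date:
            in_range += 1
    return in_range, missing
```

Feeding each fetched page through count_in_window, with one shared set of seen IDs if pages can overlap, keeps the paging loop short and makes the counting rules easy to check against hand-made records.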
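Additional suggestions
- Choosing a platform: OpenAlex has broader coverage and no strict cap on the free API, which suits workloads that need a lot of data (basic page-based paging only reaches the first 10,000 results; beyond that, switch to cursor paging). Semantic Scholar's citation data updates faster, which suits cases where the most recent citations matter.
- Rate limits: OpenAlex recommends no more than 10 requests per second, and the Semantic Scholar free tier is capped at 100 requests per minute according to this answer (verify against the current documentation). The sleep and retry logic in the code above already accounts for this.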
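A fixed sleep works, but spacing requests by a minimum interval makes the limit explicit and reusable for both APIs. The sketch below is a minimal client-side throttle; the class name Throttle is hypothetical, and the rates in the usage note are the ones quoted in this answer, not values verified against the providers.

```python
import time


class Throttle:
    """Space out calls so that at most `max_calls` happen per `per_seconds`."""

    def __init__(self, max_calls, per_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = per_seconds / max_calls
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None    # time of the previous call, None before the first

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

With the figures from this answer, Throttle(10, 1.0) would fit OpenAlex and Throttle(100, 60.0) the Semantic Scholar free tier; call wait() right before each request. A throttle like this complements the 429 retry logic and does not replace it.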
The question comes from Stack Exchange and was asked by cheese.




