Technical question: computing a firm-year patent Herfindahl index (HHI) in Python
Got it, let's walk through how to compute the Herfindahl index (HHI) and the total patent count for each firm-year combination using pandas. The result is a DataFrame with one row per firm-year and columns for the patent count and the HHI.
Step 1: Import Pandas and Load Your Data
First, make sure pandas is imported, then create or load your DataFrame. For this example, we'll use the sample data you provided:
    import pandas as pd

    # Sample data (replace this with your actual dataset)
    data = {
        'patnum': [1706123, 1579247, 1579225, 1605442, 1699538,
                   1579325, 1579234, 1579268, 1665388, 1748147],
        'permno': [10006] * 10,
        'class': [251, 72, 137, 164, 198, 72, 74, 105, 105, 105],
        'year': [1921, 1922, 1922, 1922, 1922, 1923, 1923, 1923, 1923, 1923],
    }
    df = pd.DataFrame(data)
Step 2: Define a Function to Compute Metrics
We'll create a custom function that takes a group of data (one firm-year combination) and calculates both the total patent count and HHI:
    def calculate_firm_year_metrics(group):
        # Count how many patents are in each class for this firm-year
        class_counts = group['class'].value_counts()
        # Total patents filed by the firm this year
        total_patents = class_counts.sum()
        # HHI: sum over classes of (class patent count / total patents) squared
        hhi = sum((count / total_patents) ** 2 for count in class_counts)
        # Return the metrics as a Series (HHI rounded to 2 decimals like your example)
        return pd.Series({
            'patent_count': total_patents,
            'hhi': round(hhi, 2)
        })
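The share calculation inside the function can also be written with value_counts(normalize=True), which returns each class's share of the total rather than raw counts. The following is a minimal sketch of that variant, not part of the original answer; hhi_from_classes is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd


def hhi_from_classes(classes: pd.Series) -> float:
    """Hypothetical helper: HHI of a Series of patent class labels."""
    # normalize=True yields shares (count / total) instead of raw counts
    shares = classes.value_counts(normalize=True)
    # HHI is the sum of squared shares
    return float((shares ** 2).sum())


# Classes of the five 1923 patents in the sample data: shares 3/5, 1/5, 1/5
hhi_1923 = hhi_from_classes(pd.Series([72, 74, 105, 105, 105]))
print(round(hhi_1923, 2))
```

For the 1923 classes this gives roughly 0.44 before rounding, in line with the per-class arithmetic used in the function above.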
Step 3: Apply the Function to Each Firm-Year Group
Use pandas' groupby to split the data by permno (firm) and year, then apply our function to compute the metrics:
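    # Group by firm and year, compute the metrics, then reset the index to get a flat DataFrame
    result_df = df.groupby(['permno', 'year']).apply(calculate_firm_year_metrics).reset_index()
    # The returned Series mixes an int and a float, so both columns come back as floats;
    # cast the count back to an integer
    result_df['patent_count'] = result_df['patent_count'].astype(int)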
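If apply with a Python function turns out to be slow on a large dataset, the same numbers can be obtained with built-in groupby aggregations only. This is a sketch under the assumption that the columns are named as in the sample data; class_counts and result_vec are names introduced here, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'permno': [10006] * 10,
    'class': [251, 72, 137, 164, 198, 72, 74, 105, 105, 105],
    'year': [1921, 1922, 1922, 1922, 1922, 1923, 1923, 1923, 1923, 1923],
})

# Number of patents per firm-year-class
class_counts = df.groupby(['permno', 'year', 'class']).size().reset_index(name='n')

# Total patents per firm-year, broadcast back onto every class row
class_counts['total'] = class_counts.groupby(['permno', 'year'])['n'].transform('sum')

# Squared share of each class within its firm-year
class_counts['share_sq'] = (class_counts['n'] / class_counts['total']) ** 2

# Sum counts and squared shares within each firm-year
result_vec = (
    class_counts.groupby(['permno', 'year'])
    .agg(patent_count=('n', 'sum'), hhi=('share_sq', 'sum'))
    .reset_index()
)
print(result_vec.round({'hhi': 2}))
```

Because the count and the HHI are aggregated as separate columns, patent_count stays an integer here without an explicit cast.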
Step 4: Check the Result
Print the resulting DataFrame to verify it matches your expected output:
print(result_df)
This will output:
       permno  year  patent_count   hhi
    0   10006  1921             1  1.00
    1   10006  1922             4  0.25
    2   10006  1923             5  0.44
This matches the examples you provided:
- 1921 has 1 patent, so HHI is 1.00 (no diversification)
- 1922 has 4 patents across 4 classes, HHI = 0.25
- 1923 has 5 patents (3 in one class, 1 each in two others), HHI = 0.44
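For reference, the quantity being computed can be written out as follows. The notation is introduced here for illustration only: n is the number of patents of firm f in year t and class c, and N is the firm's total patent count in that year. The second line works through the 1923 row of the sample data.

```latex
\mathrm{HHI}_{f,t} = \sum_{c} \left( \frac{n_{f,t,c}}{N_{f,t}} \right)^{2}

\mathrm{HHI}_{10006,\,1923}
  = \left(\tfrac{3}{5}\right)^{2} + \left(\tfrac{1}{5}\right)^{2} + \left(\tfrac{1}{5}\right)^{2}
  = \tfrac{11}{25} = 0.44
```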
Notes for Your Actual Data
- If your dataset has duplicate patnum entries (even though you said it's unique), add df = df.drop_duplicates(subset='patnum') before grouping to avoid counting the same patent multiple times.
- The round(hhi, 2) is optional; you can remove it if you want more decimal precision.
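To see why duplicates matter, here is a small sketch with made-up data in which one patent number appears twice. The frame dup_df and the helper hhi are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data: patent 1579325 was recorded twice by mistake
dup_df = pd.DataFrame({
    'patnum': [1579325, 1579325, 1579234],
    'permno': [10006, 10006, 10006],
    'class': [72, 72, 74],
    'year': [1923, 1923, 1923],
})


def hhi(classes):
    # Sum of squared class shares
    shares = classes.value_counts(normalize=True)
    return float((shares ** 2).sum())


# With the duplicate, class 72 looks twice as large as class 74 (shares 2/3 and 1/3)
hhi_with_dups = hhi(dup_df['class'])

# After dropping repeated patent numbers, the two classes have equal shares
clean_df = dup_df.drop_duplicates(subset='patnum')
hhi_clean = hhi(clean_df['class'])

print(len(dup_df), round(hhi_with_dups, 2))
print(len(clean_df), round(hhi_clean, 2))
```

The duplicate inflates both the patent count (3 instead of 2) and the HHI (5/9 instead of 1/2), so deduplicating before grouping affects both output columns.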
This question comes from Stack Exchange; it was asked by Farid Mammadaliyev.




