Technical Q&A: Computing the Firm-Year Patent Herfindahl Index (HHI) with Python

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-14

Calculate Firm-Year Patent HHI and Count in Python

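Let's walk through how to compute the Herfindahl Index (HHI) and total patent count for each firm-year combination using pandas. For a given firm-year, the HHI is the sum over patent classes of the squared share of that firm-year's patents in each class, so it equals 1 when every patent falls in a single class and 1/k when the patents are spread evenly over k classes. The result is a DataFrame with one row per firm-year.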

Step 1: Import Pandas and Load Your Data

First, make sure pandas is imported, then create or load your DataFrame. For this example, we'll use the sample data you provided:

import pandas as pd

# Sample data (replace this with your actual dataset)
data = {
    'patnum': [1706123, 1579247, 1579225, 1605442, 1699538, 1579325, 1579234, 1579268, 1665388, 1748147],
    'permno': [10006]*10,
    'class': [251,72,137,164,198,72,74,105,105,105],
    'year': [1921,1922,1922,1922,1922,1923,1923,1923,1923,1923]
}
df = pd.DataFrame(data)
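
If your real data lives in a CSV file rather than a dictionary, loading it might look roughly like the sketch below. The column names are taken from the sample above; the in-memory text only stands in for a file, and the file name mentioned in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for a real file; in practice you would pass a path
# such as "patents.csv" (hypothetical name) to pd.read_csv instead.
csv_text = """patnum,permno,class,year
1706123,10006,251,1921
1579247,10006,72,1922
1579225,10006,137,1922
"""

# Read the identifier columns as integers so grouping keys compare cleanly
df_from_csv = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"patnum": "int64", "permno": "int64", "class": "int64", "year": "int64"},
)

# Fail early if a column that the later steps rely on is missing
required = {"patnum", "permno", "class", "year"}
missing = required - set(df_from_csv.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")

print(df_from_csv)
```

Checking the columns up front is a design choice: a misspelled header then fails at load time instead of surfacing as a KeyError inside the grouping step.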

Step 2: Define a Function to Compute Metrics

We'll create a custom function that takes a group of data (one firm-year combination) and calculates both the total patent count and HHI:

def calculate_firm_year_metrics(group):
    # Count how many patents are in each class for this firm-year
    class_counts = group['class'].value_counts()
    # Total patents filed by the firm this year
    total_patents = class_counts.sum()
    # Calculate HHI: sum of (class patent count / total patents) squared for all classes
    hhi = sum((count / total_patents) ** 2 for count in class_counts)
    # Return the metrics as a Series (we'll round HHI to 2 decimals like your example)
    return pd.Series({
        'patent_count': total_patents,
        'hhi': round(hhi, 2)
    })
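
To see the HHI part of that function in isolation, here is a minimal sketch of the same computation on a plain list of classes. The helper name hhi_from_classes is an assumption made for illustration; it relies on value_counts(normalize=True), which returns shares instead of raw counts.

```python
import pandas as pd

def hhi_from_classes(classes):
    """HHI of a list of patent classes: the sum of squared class shares."""
    # normalize=True gives each class's share of the total rather than its count
    shares = pd.Series(classes).value_counts(normalize=True)
    return float((shares ** 2).sum())

# One patent in one class: fully concentrated
print(hhi_from_classes([251]))
# Four patents in four different classes: evenly spread
print(hhi_from_classes([72, 137, 164, 198]))
# Five patents, three of them in class 105
print(hhi_from_classes([72, 74, 105, 105, 105]))
```

Trying the function on a few hand-checkable lists like these is a cheap way to catch a wrong denominator before running it on the full panel.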

Step 3: Apply the Function to Each Firm-Year Group

Use pandas' groupby to split the data by permno (firm) and year, then apply our function to compute the metrics:

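# Group by firm and year, compute metrics, then reset index to get a clean DataFrame
result_df = df.groupby(['permno', 'year']).apply(calculate_firm_year_metrics).reset_index()
# The function returns a float Series (int count + float HHI), so cast the count back to int
result_df['patent_count'] = result_df['patent_count'].astype(int)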
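
Because apply calls a Python function once per group, it can become slow when there are many firm-years. One possible alternative, sketched here on the same sample data, counts patents per firm-year-class first and then aggregates with built-in operations. The names class_counts, share_sq and vectorized_df are assumptions introduced for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'patnum': [1706123, 1579247, 1579225, 1605442, 1699538,
               1579325, 1579234, 1579268, 1665388, 1748147],
    'permno': [10006] * 10,
    'class': [251, 72, 137, 164, 198, 72, 74, 105, 105, 105],
    'year': [1921, 1922, 1922, 1922, 1922, 1923, 1923, 1923, 1923, 1923],
})

# Number of patents per firm-year-class
class_counts = (
    df.groupby(['permno', 'year', 'class']).size().rename('n').reset_index()
)

# Firm-year total, broadcast back onto every class row
totals = class_counts.groupby(['permno', 'year'])['n'].transform('sum')

# Squared share of each class within its firm-year
class_counts['share_sq'] = (class_counts['n'] / totals) ** 2

# Sum counts and squared shares within each firm-year
vectorized_df = (
    class_counts.groupby(['permno', 'year'])
    .agg(patent_count=('n', 'sum'), hhi=('share_sq', 'sum'))
    .reset_index()
)

print(vectorized_df.round({'hhi': 2}))
```

In this version patent_count stays an integer column without any cast, and the HHI is kept unrounded until display.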

Step 4: Check the Result

Print the resulting DataFrame to verify it matches your expected output:

print(result_df)

This will output:

   permno  year  patent_count   hhi
0   10006  1921             1  1.00
1   10006  1922             4  0.25
2   10006  1923             5  0.44

This matches the examples you provided:

1921 has 1 patent in 1 class, so HHI = 1.00 (no diversification)
1922 has 4 patents across 4 classes, so HHI = 4 × (1/4)² = 0.25
1923 has 5 patents (3 in class 105, 1 each in classes 72 and 74), so HHI = (3/5)² + 2 × (1/5)² = 0.44
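
If you want to double-check these three values without pandas and without floating-point rounding, a small sketch with exact fractions could look like this. The function name exact_hhi is an assumption made for illustration; the class lists are copied from the sample data.

```python
from collections import Counter
from fractions import Fraction

def exact_hhi(classes):
    """Sum of squared class shares, computed with exact rational arithmetic."""
    counts = Counter(classes)
    total = sum(counts.values())
    return sum(Fraction(n, total) ** 2 for n in counts.values())

# Patent classes per year for permno 10006, taken from the sample data
classes_by_year = {
    1921: [251],
    1922: [72, 137, 164, 198],
    1923: [72, 74, 105, 105, 105],
}

exact = {year: exact_hhi(classes) for year, classes in classes_by_year.items()}
for year, value in exact.items():
    # Show the exact fraction alongside its decimal value
    print(year, value, float(value))
```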

Notes for Your Actual Data

If your dataset might contain duplicate patnum entries (even though you said patnum is unique), add df = df.drop_duplicates(subset='patnum') before grouping so that the same patent is not counted more than once.
The round(hhi, 2) is optional; remove it if you want full precision. If the HHI feeds into further calculations, it is safer to keep the unrounded value and round only when displaying the result.
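
Both notes can be illustrated with a short sketch. The toy frame below contains a deliberately injected duplicate, and the single-row metrics frame is made up purely to show display-time rounding; neither comes from your data.

```python
import pandas as pd

# Toy frame in which patent 1579247 appears twice (injected duplicate)
raw = pd.DataFrame({
    "patnum": [1706123, 1579247, 1579247, 1579225],
    "permno": [10006] * 4,
    "class": [251, 72, 72, 137],
    "year": [1921, 1922, 1922, 1922],
})

# Keep the first occurrence of each patent number
deduped = raw.drop_duplicates(subset="patnum")
print(len(raw), len(deduped))

# Illustrative metrics row holding an unrounded HHI
metrics = pd.DataFrame({
    "permno": [10006],
    "year": [1923],
    "hhi": [(3 / 5) ** 2 + 2 * (1 / 5) ** 2],
})

# Round a copy for display; the stored column keeps full precision
display_df = metrics.round({"hhi": 2})
print(display_df)
```

Note that drop_duplicates(subset="patnum") keeps the first matching row, so if duplicate rows disagree on class or year you would want to inspect them before dropping.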

The question comes from Stack Exchange and was asked by Farid Mammadaliyev.