Pandas .isin() method works on one dataset but fails on another

阿华AIGC实验室

2026-5-21

Hey there! Since you already know .isin() works for this kind of filtering, let's adapt that approach to your two geolocation DataFrames. Here's a step-by-step solution:

Step 1: Define your valid states list

First, create a list of all 50 U.S. states plus Washington D.C. — adjust this to match the format (full name vs. abbreviation) used in your State column:

For full state names:

valid_states = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California",
    "Colorado", "Connecticut", "Delaware", "Florida", "Georgia",
    "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa",
    "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland",
    "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
    "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey",
    "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
    "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina",
    "South Dakota", "Tennessee", "Texas", "Utah", "Vermont",
    "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming",
    "District of Columbia"
]

For state abbreviations:

valid_state_abbrs = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
    "DC"
]
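
If you're not sure which format your State column uses, a quick check like the one below can help. It's a minimal sketch: the sample Series and the shortened lists are made up for illustration, so substitute your own column and the full lists from above.

```python
import pandas as pd

# Hypothetical sample values; replace with df1['State'] from your own data.
sample = pd.Series(["TX", "CA", "PR", "NY", "GU"])

# Shortened lists for illustration only; use the full lists in practice.
names_subset = ["Texas", "California", "New York"]
abbrs_subset = ["TX", "CA", "NY"]

# .isin() returns a boolean Series, so .mean() gives the share of matching rows.
share_names = sample.isin(names_subset).mean()
share_abbrs = sample.isin(abbrs_subset).mean()

# The list with the higher share is most likely the format your column uses.
print(f"full names: {share_names:.0%}, abbreviations: {share_abbrs:.0%}")
```

If both shares come out near zero, the column probably needs cleaning (stray spaces, different capitalization) before either list will match.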

Step 2: Filter your DataFrames

Use boolean indexing with .isin() to keep only rows where the State is in your valid list. Let's assume your two DataFrames are named df1 and df2:

# If using full state names
df1_filtered = df1[df1['State'].isin(valid_states)]
df2_filtered = df2[df2['State'].isin(valid_states)]

# If using abbreviations instead
# df1_filtered = df1[df1['State'].isin(valid_state_abbrs)]
# df2_filtered = df2[df2['State'].isin(valid_state_abbrs)]
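
To see the filter in action without your real data, here's a small self-contained sketch. The toy DataFrame, its Latitude column, and the shortened state list are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data standing in for df1; the Latitude values are made up.
df1 = pd.DataFrame({
    "State": ["Texas", "Puerto Rico", "Ohio", "Guam"],
    "Latitude": [31.0, 18.2, 40.4, 13.4],
})
valid_states = ["Texas", "Ohio"]  # shortened for the example

# Boolean Series: True where the State value appears in the list.
mask = df1["State"].isin(valid_states)

# .copy() gives an independent DataFrame, which avoids
# SettingWithCopyWarning if you modify the filtered data later.
df1_filtered = df1[mask].copy()
print(df1_filtered)
```

Keeping the mask in its own variable is optional, but it makes it easy to inspect which rows were rejected.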

Step 3: Verify the results

Double-check that you've removed the unwanted territories by inspecting the filtered data:

# Check the number of observations post-filter
df1_filtered.info()
df2_filtered.info()

# Confirm only valid states remain
print("Valid states in df1:", df1_filtered['State'].unique())
print("Valid states in df2:", df2_filtered['State'].unique())
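
It's also worth looking at what got dropped, not just what stayed. If .isin() works on one dataset and not the other, the rejected values usually show why. The sketch below uses made-up data and a shortened list:

```python
import pandas as pd

# Hypothetical data; note the deliberately messy "texas " entry.
df = pd.DataFrame({"State": ["Texas", "Puerto Rico", "texas ", "Guam", "Ohio"]})
valid_states = ["Texas", "Ohio"]  # shortened for the example

mask = df["State"].isin(valid_states)

# ~mask inverts the filter, so this counts each value that was rejected.
dropped = df.loc[~mask, "State"].value_counts()
print(dropped)

# repr() makes trailing spaces and odd capitalization visible.
print([repr(s) for s in dropped.index])
```

Anything in that output that looks like a real state is a sign the column needs cleaning before you filter.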

Pro Tip: Handle formatting inconsistencies

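If your State column has messy formatting (extra spaces, inconsistent capitalization), standardize it before filtering in Step 2. Since .isin() only matches exact strings, this is a common reason it works on one dataset and matches nothing on another:

# Strip whitespace and convert to title case (for the full name list)
df1['State'] = df1['State'].str.strip().str.title()
df2['State'] = df2['State'].str.strip().str.title()

# .str.title() produces "District Of Columbia", so restore the lowercase "of"
df1['State'] = df1['State'].replace("District Of Columbia", "District of Columbia")
df2['State'] = df2['State'].replace("District Of Columbia", "District of Columbia")

# For abbreviations, use .str.strip().str.upper() instead of .str.title()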
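
One thing to watch with title case: .str.title() capitalizes every word, so "District of Columbia" becomes "District Of Columbia" and no longer matches the list. An alternative is to lowercase both sides just for the comparison and leave the original column untouched. This is only a sketch with made-up data and a shortened list:

```python
import pandas as pd

valid_states = ["Texas", "District of Columbia"]  # shortened for the example

# Hypothetical messy input.
df = pd.DataFrame({"State": [" texas", "DISTRICT OF COLUMBIA ", "Guam"]})

# Lowercase copy of the valid list, used only for matching.
valid_lower = [s.lower() for s in valid_states]

# Normalize a temporary Series; the State column itself is not modified.
normalized = df["State"].str.strip().str.lower()

df_filtered = df[normalized.isin(valid_lower)].copy()
print(df_filtered)
```

If you want clean values in the output too, assign the stripped strings back to the State column before filtering.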

This approach mirrors the .isin() method you used successfully before, so it should work smoothly for your two datasets.

The question comes from Stack Exchange; asked by Justin Wiltshire.