An Efficient Python-Based Approach to Mapping 15,000 Multilingual City Entries to English Names

Ahua AIGC Lab

2026-5-14

Hey there, dealing with 15k messy multilingual city entries is a pain, and manual mapping is impractical at that scale. Let's break down scalable technical solutions for standardizing them to English names:

1. Batch Geocoding APIs (Most Reliable for Global Cities)

Geocoding APIs turn raw city strings into structured geographic data, including standardized English names. Tools like Nominatim (free, OpenStreetMap-based) or Google Maps Geocoding (paid, more robust) work great for this. You’ll need to handle rate limits and parse the returned data to extract the city name.

Here’s a Python example using geopy with Nominatim:

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import pandas as pd

# Initialize geocoder (user agent is required for Nominatim)
geolocator = Nominatim(user_agent="city_standardization_tool")
# Add rate limiting to avoid hitting API restrictions
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_standard_english_city(city_str):
    try:
        # Geocode the string; request English names and the structured
        # address breakdown (addressdetails is off by default in geopy,
        # so without it location.raw has no "address" key)
        location = geocode(city_str, exactly_one=True, language="en", addressdetails=True)
        if not location:
            return None
        
        # Extract the city/town name from address components
        address = location.raw.get("address", {})
        for key in ["city", "town", "municipality", "village"]:
            if key in address:
                # Return the name as-is: .title() would mangle names like "Rio de Janeiro"
                return address[key]
        
        # Fallback to the first part of the display name if no city key exists
        return location.raw["display_name"].split(",")[0].strip()
    except Exception as e:
        print(f"Failed to process '{city_str}': {str(e)}")
        return None

# Load your dataset and apply the function
df = pd.read_csv("your_cities_dataset.csv")
df["mapped_city"] = df["city_name"].apply(get_standard_english_city)

Notes:

The public Nominatim server allows at most 1 request per second, so the RateLimiter is critical. At that rate, 15,000 lookups take a little over 4 hours (15,000 s ≈ 4.2 h).
For commercial use or higher volume, Google Maps Geocoding is more reliable but requires an API key and has costs.
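This usually handles strings with country suffixes (like "bruxelles belgium") without extra cleanup, since the geocoder searches on the full string. Spot-check ambiguous names, though, because exactly_one=True simply returns the top-ranked result.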
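
Since every request costs at least a second, it can pay to geocode each distinct string only once. Here's a minimal sketch of that idea, not part of the original answer: normalize_city_string, unique_queries, and estimated_minutes are hypothetical helper names, and the normalization rules (NFKC, whitespace collapsing, case folding) are assumptions you may want to adjust for your data.

```python
import unicodedata


def normalize_city_string(raw):
    """Return a canonical key for a raw city string (hypothetical helper)."""
    # NFKC folds compatibility forms such as full-width Latin letters.
    text = unicodedata.normalize("NFKC", str(raw))
    # Collapse runs of whitespace and ignore case differences.
    return " ".join(text.split()).casefold()


def unique_queries(raw_values):
    """Map each normalized key to the first raw spelling seen for it."""
    queries = {}
    for raw in raw_values:
        key = normalize_city_string(raw)
        if key and key not in queries:
            queries[key] = " ".join(str(raw).split())
    return queries


def estimated_minutes(n_queries, delay_seconds=1.0):
    """Lower bound on runtime when requests are spaced delay_seconds apart."""
    return n_queries * delay_seconds / 60


sample = ["Bruxelles Belgium", "bruxelles  belgium", " München", "MÜNCHEN", "北京"]
queries = unique_queries(sample)
print(len(queries), "unique queries out of", len(sample))
print(f"{estimated_minutes(15_000):.0f} minutes for 15,000 requests at 1 req/s")
```

You would then geocode only the values of queries and map the results back to every row through normalize_city_string.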

2. Fuzzy Matching with a Standard City Database

If API calls aren't feasible (e.g., offline work), use a pre-built list of global cities (like GeoNames' cities15000.txt, which covers cities with a population above 15,000, plus capitals) and fuzzy matching to find the closest English name.

Here's how to do this with fuzzywuzzy (the package has since been renamed thefuzz; the calls used here are the same in both):

from fuzzywuzzy import process
import pandas as pd

# Load GeoNames' standard city list (pre-filtered to English names)
standard_cities = pd.read_csv("geonames_cities.csv", sep="\t")["Name"].tolist()

def fuzzy_match_standard_city(city_str):
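    # Guard against NaN/empty cells, which would otherwise raise or match noise
    if not isinstance(city_str, str) or not city_str.strip():
        return None
    # Keep the whole name: taking only the first word would turn "New York" into "new".
    # Country suffixes are left in place; stripping them reliably needs a country-name list.
    clean_city = city_str.strip()
    # score_cutoff makes extractOne return None when the best score is below the threshold
    # (80 is a starting point; tune this for your data)
    result = process.extractOne(clean_city, standard_cities, score_cutoff=80)
    if result is None:
        return None
    return result[0]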

df = pd.read_csv("your_cities_dataset.csv")
df["mapped_city"] = df["city_name"].apply(fuzzy_match_standard_city)
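
To get a feel for how a cutoff like the 80 used above behaves, here's a small standard-library sketch. Note the swap: it uses difflib.SequenceMatcher as a stand-in for fuzzywuzzy's scorers, so the numbers will not match what fuzzywuzzy reports. The helper names similarity and best_match and the three sample cities are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score two strings from 0 to 100, ignoring case."""
    return round(100 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio())


def best_match(query, choices, threshold):
    """Return (choice, score) for the top-scoring choice, or None if below threshold."""
    scored = [(choice, similarity(query, choice)) for choice in choices]
    choice, score = max(scored, key=lambda pair: pair[1])
    return (choice, score) if score >= threshold else None


choices = ["Brussels", "Bristol", "Basel"]
for threshold in (60, 80, 95):
    # Watch where the result flips from a match to None as the cutoff rises.
    print(threshold, best_match("bruxelles", choices, threshold))
```

Running this on a labelled sample of your own strings is a quick way to pick a cutoff before committing to the full 15k run.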

Notes:

Preprocess the GeoNames list to remove duplicates and focus on English primary names.
Adjust the score threshold: higher = stricter matches (fewer false positives), lower = more matches (more false positives).
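
One way to do that preprocessing is sketched below, under a few assumptions. The raw GeoNames dump is tab-separated with no header row, and the column positions used here (name, asciiname, alternatenames, population) follow the layout described in the GeoNames readme; double-check them against the file you download. build_name_index is a hypothetical helper, the sample rows are illustrative rather than real GeoNames records, and the name column is usually, but not always, the English form.

```python
import csv
import io

# Column positions assumed from the GeoNames readme:
# 1 name, 2 asciiname, 3 alternatenames (comma-separated), 14 population.
NAME, ASCIINAME, ALTERNATENAMES, POPULATION = 1, 2, 3, 14


def build_name_index(lines):
    """Map every known spelling (casefolded) to one primary name."""
    index = {}
    best_population = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= POPULATION:
            continue  # skip malformed rows
        primary = row[NAME]
        population = int(row[POPULATION] or 0)
        spellings = {row[NAME], row[ASCIINAME], *row[ALTERNATENAMES].split(",")}
        for spelling in spellings:
            key = spelling.strip().casefold()
            # When two cities share a spelling, keep the more populous one.
            if key and population >= best_population.get(key, -1):
                index[key] = primary
                best_population[key] = population
    return index


def _row(name, asciiname, alternates, population):
    """Build one 19-column sample line (illustrative values only)."""
    fields = [""] * 19
    fields[NAME], fields[ASCIINAME] = name, asciiname
    fields[ALTERNATENAMES], fields[POPULATION] = alternates, str(population)
    return "\t".join(fields)


sample = io.StringIO("\n".join([
    _row("Brussels", "Brussels", "Bruxelles,Brussel,布鲁塞尔", 1000000),
    _row("Munich", "Munich", "München,Monaco di Baviera", 1000000),
]))
index = build_name_index(sample)
print(index.get("bruxelles"), index.get("münchen"))
```

An exact lookup in an index like this can resolve many multilingual spellings before any fuzzy matching runs, which leaves the fuzzy step for genuine typos.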

3. Hybrid Approach (Best of Both Worlds)

For maximum accuracy, combine geocoding and fuzzy matching:

Use geocoding for most entries (it handles context like country suffixes best).
Use fuzzy matching on entries the geocoder couldn’t identify.
Manually review the small remaining subset (the estimate here is under 5% of your 15k entries, but measure it on your own data).

Example workflow:

# First pass: geocoding
df["mapped_city"] = df["city_name"].apply(get_standard_english_city)

# Isolate unmatched entries
unmatched = df[df["mapped_city"].isna()].copy()

# Second pass: fuzzy matching on unmatched entries
unmatched["mapped_city"] = unmatched["city_name"].apply(fuzzy_match_standard_city)

# Merge results back
df.update(unmatched)

# Export remaining unmatched entries for manual review
df[df["mapped_city"].isna()].to_csv("manual_review_cities.csv", index=False)
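
When reviewing the output, it helps to know which pass produced each mapping, since fuzzy matches deserve more scrutiny than geocoder hits. The sketch below shows the same two-pass flow in plain Python with that bookkeeping added. run_pipeline and the mapped_by field are hypothetical, and the two dictionary lookups are stand-ins for get_standard_english_city and fuzzy_match_standard_city so the example runs offline.

```python
from collections import Counter


def run_pipeline(names, matchers):
    """Try each (label, matcher) in order; record which one succeeded."""
    results = []
    for name in names:
        mapped, source = None, "manual_review"
        for label, matcher in matchers:
            mapped = matcher(name)
            if mapped is not None:
                source = label
                break
        results.append({"city_name": name, "mapped_city": mapped, "mapped_by": source})
    return results


# Stand-in matchers (assumptions for illustration only).
fake_geocoder = {"bruxelles belgium": "Brussels"}.get
fake_fuzzy = {"munchen": "Munich"}.get

results = run_pipeline(
    ["bruxelles belgium", "munchen", "xyzzy"],
    [("geocoding", fake_geocoder), ("fuzzy", fake_fuzzy)],
)
summary = Counter(row["mapped_by"] for row in results)
print(summary)
```

The summary counts also give you the real size of the manual-review pile instead of a guess.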

Pro Tips

Test each method on a small sample of your data first to tweak thresholds and ensure accuracy.
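For non-Latin script city names (e.g., Chinese, Arabic), geocoders such as Nominatim can usually resolve the native spelling and, with language="en", return the English name where one is recorded. This is a lookup rather than a transliteration, so verify a sample and expect the local name back when no English name exists.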
Cache geocoding results to avoid redundant API calls (save results to a CSV or database).
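
The caching tip could look something like this. It's a rough sketch that assumes a JSON file is good enough as the store; load_cache, cached_lookup, save_cache, and the file name geocode_cache.json are all hypothetical, and fake_lookup stands in for the real geocoding function so the example runs without network access.

```python
import json
import os
import tempfile


def load_cache(path):
    """Read previously saved results, or start empty if the file is missing."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    return {}


def cached_lookup(city_str, cache, lookup):
    """Call lookup only for strings that are not in the cache yet."""
    if city_str not in cache:
        cache[city_str] = lookup(city_str)
    return cache[city_str]


def save_cache(cache, path):
    """Write the cache to disk, keeping non-ASCII names readable."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(cache, fh, ensure_ascii=False, indent=2)


calls = []


def fake_lookup(city_str):
    """Stand-in for the geocoder; records each call it receives."""
    calls.append(city_str)
    return {"bruxelles belgium": "Brussels"}.get(city_str)


cache_path = os.path.join(tempfile.mkdtemp(), "geocode_cache.json")
cache = load_cache(cache_path)
for city in ["bruxelles belgium", "bruxelles belgium", "nowhere"]:
    cached_lookup(city, cache, fake_lookup)
save_cache(cache, cache_path)
reloaded = load_cache(cache_path)
print(len(calls), "lookups for 3 inputs")
```

Note that this version also caches failures (as null), so a rerun will not retry them. Drop those entries from the file first if you want a second attempt.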

This post is based on a question from Stack Exchange, asked by Raul Gonzales.