Efficiently Mapping 15,000 Multilingual City Entries to English Names with Python
Hey there, dealing with 15k messy multilingual city entries is a pain, and manual mapping is totally impractical here. Let’s break down scalable technical solutions to standardize them to English names:
1. Batch Geocoding APIs (Most Reliable for Global Cities)
Geocoding APIs turn raw city strings into structured geographic data, including standardized English names. Tools like Nominatim (free, OpenStreetMap-based) or Google Maps Geocoding (paid, more robust) work great for this. You’ll need to handle rate limits and parse the returned data to extract the city name.
Here’s a Python example using geopy with Nominatim:
    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter
    import pandas as pd

    # Initialize geocoder (user agent is required for Nominatim)
    geolocator = Nominatim(user_agent="city_standardization_tool")
    # Add rate limiting to avoid hitting API restrictions
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

    def get_standard_english_city(city_str):
        try:
            # Geocode the string, request results in English.
            # addressdetails=True is required, otherwise the response has no "address" breakdown.
            location = geocode(city_str, exactly_one=True, language="en", addressdetails=True)
            if not location:
                return None
            # Extract the city/town name from address components
            address = location.raw.get("address", {})
            for key in ["city", "town", "municipality", "village"]:
                if key in address:
                    # Returned as-is: .title() would mangle names like "Rio de Janeiro"
                    return address[key]
            # Fallback to the first part of the display name if no city key exists
            return location.raw["display_name"].split(",")[0].strip()
        except Exception as e:
            print(f"Failed to process '{city_str}': {e}")
            return None

    # Load your dataset and apply the function
    df = pd.read_csv("your_cities_dataset.csv")
    df["mapped_city"] = df["city_name"].apply(get_standard_english_city)
Notes:
- Nominatim has rate limits (1 request/second on the public server), so the RateLimiter is critical. At that pace, 15k lookups take a bit over four hours.
- For commercial use or higher volume, Google Maps Geocoding is more reliable but requires an API key and has costs.
- This handles strings with country suffixes (like "bruxelles belgium") automatically, since the geocoder will parse the full context.
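Since the rate limiter spaces requests at least a second apart, it can pay to geocode each distinct string only once. Here is a minimal sketch of that idea; `normalize` and `map_unique` are hypothetical helper names (not part of geopy), and `lookup` can be any callable, such as the `get_standard_english_city` function above:

```python
from typing import Callable, Dict, Iterable, List, Optional


def normalize(city_str: str) -> str:
    """Collapse whitespace and case so trivially different spellings share one key."""
    return " ".join(city_str.split()).casefold()


def map_unique(
    raw_values: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> List[Optional[str]]:
    """Call `lookup` once per distinct normalized string, then fan results back out."""
    values = list(raw_values)
    results: Dict[str, Optional[str]] = {}
    for value in values:
        key = normalize(value)
        if key not in results:
            results[key] = lookup(key)
    return [results[normalize(value)] for value in values]


# Demo with a stand-in lookup that records how often it is called.
calls: List[str] = []


def fake_lookup(key: str) -> Optional[str]:
    calls.append(key)
    return {"bruxelles belgium": "Brussels"}.get(key)


mapped = map_unique(["Bruxelles Belgium", "bruxelles  belgium", "??"], fake_lookup)
print(mapped, "lookups made:", len(calls))
```

With the DataFrame above, something like `df["mapped_city"] = map_unique(df["city_name"].astype(str), get_standard_english_city)` should slot in, assuming the lookup is case-insensitive (geocoders generally are).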
2. Fuzzy Matching with a Standard City Database
If API calls aren’t feasible (e.g., offline work), use a pre-built list of global cities (like GeoNames’ cities15000.txt, which includes all cities with population >15k) and fuzzy matching to find the closest English name.
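One thing to know going in: the GeoNames dump is tab-separated and ships without a header row. Below is a rough sketch of loading it with the standard library and indexing every listed name variant by the row's primary `name`. The column labels are my own naming for the documented field order, `build_name_index` is a hypothetical helper, and the "keep the more populous city" tie-break is a heuristic of mine. Note that the primary `name` is often, but not always, the English name.

```python
import csv
from typing import Dict, Iterable, Tuple

# Field order of the GeoNames "geoname" table dump (tab-separated, no header).
GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature_class", "feature_code", "country_code", "cc2", "admin1_code",
    "admin2_code", "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]


def build_name_index(lines: Iterable[str]) -> Dict[str, str]:
    """Map each case-folded name variant to the primary name of its row."""
    best: Dict[str, Tuple[int, str]] = {}  # variant -> (population, primary name)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(GEONAMES_COLUMNS):
            continue  # skip malformed lines rather than guess
        record = dict(zip(GEONAMES_COLUMNS, row))
        population = int(record["population"] or 0)
        variants = [record["name"], record["asciiname"]]
        variants += [v for v in record["alternatenames"].split(",") if v]
        for variant in variants:
            key = variant.casefold()
            # On a name collision, keep the more populous city (heuristic).
            if key not in best or population > best[key][0]:
                best[key] = (population, record["name"])
    return {key: name for key, (_, name) in best.items()}


# Demo on one made-up row; real use would be:
#   with open("cities15000.txt", encoding="utf-8", newline="") as fh:
#       index = build_name_index(fh)
sample_row = "\t".join([
    "1", "Brussels", "Brussels", "Bruxelles,Brussel", "0", "0", "P", "PPLC",
    "BE", "", "", "", "", "", "1000000", "", "", "Europe/Brussels", "2000-01-01",
])
index = build_name_index([sample_row])
print(index.get("bruxelles"))
```

An exact lookup in this index before any fuzzy matching catches known local spellings cheaply, and `sorted(set(index.values()))` gives a de-duplicated list of primary names to fuzzy-match against.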
Here’s how to do the matching with fuzzywuzzy (since renamed thefuzz; process.extractOne works the same way there):
    from fuzzywuzzy import process
    import pandas as pd

    # Load a standard city list, assumed to be pre-filtered to English names
    # and saved as a tab-separated file with a "Name" column
    standard_cities = pd.read_csv("geonames_cities.csv", sep="\t")["Name"].tolist()

    def fuzzy_match_standard_city(city_str):
        # Skip missing or blank values instead of crashing on them
        if not isinstance(city_str, str) or not city_str.strip():
            return None
        # Crude cleanup: keep only the first word to drop a trailing country name.
        # Caveat: this also truncates multi-word cities ("new york" -> "new").
        clean_city = city_str.split()[0].lower()
        # Find the best match with a score threshold (adjust based on accuracy needs)
        match, score = process.extractOne(clean_city, standard_cities)
        if score >= 80:  # 80 is a starting point; tune this for your data
            return match  # already cased as in the standard list
        return None

    df = pd.read_csv("your_cities_dataset.csv")
    df["mapped_city"] = df["city_name"].apply(fuzzy_match_standard_city)
Notes:
- Preprocess the GeoNames list to remove duplicates and focus on English primary names.
- Adjust the score threshold: higher = stricter matches (fewer false positives), lower = more matches (more false positives).
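To pick a threshold with some evidence behind it, one option is to hand-label a small sample and count outcomes at a few cutoffs. The sketch below does that with the standard library's `difflib.SequenceMatcher` as a stand-in scorer (so it runs without extra installs); it is not fuzzywuzzy's scorer, and the absolute scores will differ. `best_match`, `evaluate_threshold`, and the sample data are all made up for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


def best_match(query: str, choices: List[str]) -> Tuple[str, int]:
    """Return the closest choice and a 0-100 similarity score."""
    def score(choice: str) -> int:
        ratio = SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        return round(100 * ratio)

    best = max(choices, key=score)
    return best, score(best)


def evaluate_threshold(
    labeled: List[Tuple[str, Optional[str]]],
    choices: List[str],
    threshold: int,
) -> Dict[str, int]:
    """Count correct, wrong (false positive) and missed results on a labeled sample."""
    counts = {"correct": 0, "wrong": 0, "missed": 0}
    for raw, expected in labeled:
        match, score = best_match(raw, choices)
        predicted = match if score >= threshold else None
        if predicted == expected:
            counts["correct"] += 1
        elif predicted is None:
            counts["missed"] += 1
        else:
            counts["wrong"] += 1
    return counts


# Made-up sample: (raw input, expected standard name or None if no match exists).
choices = ["Brussels", "Berlin", "Bern"]
labeled = [("brusels", "Brussels"), ("berlin", "Berlin"), ("xyzzy", None)]

for threshold in (70, 80, 90, 95):
    print(threshold, evaluate_threshold(labeled, choices, threshold))
```

To evaluate the real pipeline, swap `best_match` for a thin wrapper around `process.extractOne` and label a few hundred rows of your own data.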
3. Hybrid Approach (Best of Both Worlds)
For maximum accuracy, combine geocoding and fuzzy matching:
- Use geocoding for most entries (it handles context like country suffixes best).
- Use fuzzy matching on entries the geocoder couldn’t identify.
- Manually review the tiny remaining subset (likely <5% of your 15k entries).
Example workflow:
    # First pass: geocoding
    df["mapped_city"] = df["city_name"].apply(get_standard_english_city)

    # Isolate unmatched entries
    unmatched = df[df["mapped_city"].isna()].copy()

    # Second pass: fuzzy matching on unmatched entries
    unmatched["mapped_city"] = unmatched["city_name"].apply(fuzzy_match_standard_city)

    # Merge results back
    df.update(unmatched)

    # Export remaining unmatched entries for manual review
    df[df["mapped_city"].isna()].to_csv("manual_review_cities.csv", index=False)
Pro Tips
- Test each method on a small sample of your data first to tweak thresholds and ensure accuracy.
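- For non-Latin script city names (e.g., Chinese, Arabic), requesting results in English (language="en" in the geocoding example) makes the geocoder return the English name where its data has one. It may fall back to the local-script name otherwise, so spot-check these entries.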
- Cache geocoding results to avoid redundant API calls (save results to a CSV or database).
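For the caching tip, here is a minimal sketch of a file-backed cache using only the standard library. `JsonCache` and the `geocode_cache.json` filename are hypothetical, and `compute` stands for whichever lookup function you use. One design caveat: this version also stores misses (`None`), which saves repeat calls but would also persist failures caused by a temporary network error.

```python
import json
import os
from typing import Callable, Dict, Optional


class JsonCache:
    """Tiny persistent cache: raw string -> mapped name (None for known misses)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.data: Dict[str, Optional[str]] = {}
        # Reload earlier results so an interrupted run can resume.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.data = json.load(fh)

    def get_or_compute(
        self, key: str, compute: Callable[[str], Optional[str]]
    ) -> Optional[str]:
        """Return the cached value, calling `compute` only on a cache miss."""
        if key not in self.data:
            self.data[key] = compute(key)
        return self.data[key]

    def save(self) -> None:
        """Write the cache to disk; ensure_ascii=False keeps non-Latin names readable."""
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.data, fh, ensure_ascii=False, indent=2)


# Hypothetical usage with the functions from this answer:
#   cache = JsonCache("geocode_cache.json")
#   df["mapped_city"] = df["city_name"].apply(
#       lambda s: cache.get_or_compute(s, get_standard_english_city)
#   )
#   cache.save()
```

For a multi-hour run, calling `save()` every few hundred rows rather than only at the end keeps progress if the job gets interrupted.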
The question originally comes from Stack Exchange, asked by Raul Gonzales.




