An Efficient Python-Based Approach to Mapping 15,000 Multilingual City Entries to English Names

Hey there, dealing with 15k messy multilingual city entries is a pain, and manual mapping is impractical at that scale. Here are some scalable technical approaches for standardizing them to English names:

1. Batch Geocoding APIs (Most Reliable for Global Cities)

Geocoding APIs turn raw city strings into structured geographic data, including standardized English names. Tools like Nominatim (free, OpenStreetMap-based) or Google Maps Geocoding (paid, more robust) work great for this. You’ll need to handle rate limits and parse the returned data to extract the city name.

Here’s a Python example using geopy with Nominatim:

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import pandas as pd

# Initialize geocoder (user agent is required for Nominatim)
geolocator = Nominatim(user_agent="city_standardization_tool")
# Add rate limiting to avoid hitting API restrictions
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_standard_english_city(city_str):
    try:
        # Geocode the string; request English results and the structured address
        # breakdown (Nominatim omits "address" unless addressdetails=True)
        location = geocode(city_str, exactly_one=True, language="en", addressdetails=True)
        if not location:
            return None
        
        # Extract the city/town name from address components
        address = location.raw.get("address", {})
        for key in ["city", "town", "municipality", "village"]:
            if key in address:
                # Return the name as-is: .title() would turn "Rio de Janeiro" into "Rio De Janeiro"
                return address[key]
        
        # Fallback to the first part of the display name if no city key exists
        return location.raw["display_name"].split(",")[0].strip()
    except Exception as e:
        print(f"Failed to process '{city_str}': {str(e)}")
        return None

# Load your dataset and apply the function
df = pd.read_csv("your_cities_dataset.csv")
df["mapped_city"] = df["city_name"].apply(get_standard_english_city)

Notes:

  • The public Nominatim server allows at most 1 request per second, so the RateLimiter is critical. At that rate, 15,000 lookups take a bit over four hours, so plan for a long-running job.
  • For commercial use or higher volume, Google Maps Geocoding is more reliable but requires an API key and has costs.
  • This usually handles strings with country suffixes (like "bruxelles belgium") without extra cleaning, since the geocoder interprets the full string as a free-form query.
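
Since every lookup costs at least a second, it can help to geocode each distinct string only once and copy the result to all duplicates. Below is a minimal sketch of that idea in plain Python. The names `normalize_key` and `resolve_unique` are hypothetical helpers, and `fake_resolver` is a stand-in for a real geocoding function such as `get_standard_english_city`.

```python
import re
import unicodedata

def normalize_key(raw):
    """Collapse Unicode form, whitespace, and case so trivial variants share one key."""
    text = unicodedata.normalize("NFKC", str(raw))
    return re.sub(r"\s+", " ", text).strip().casefold()

def resolve_unique(values, resolver):
    """Call `resolver` once per distinct normalized value, then map results back."""
    resolved = {}
    for value in values:
        key = normalize_key(value)
        if key and key not in resolved:
            # The first spelling seen is the one sent to the resolver
            resolved[key] = resolver(value)
    return [resolved.get(normalize_key(value)) for value in values]

# Stand-in resolver (assumption): records its calls instead of hitting an API
calls = []
def fake_resolver(value):
    calls.append(value)
    return "Brussels" if "bruxelles" in value.lower() else None

raw_values = ["Bruxelles Belgium", "bruxelles  belgium", " BRUXELLES BELGIUM ", "Atlantis"]
results = resolve_unique(raw_values, fake_resolver)
print(results)
print(f"{len(calls)} resolver calls for {len(raw_values)} rows")
```

With a pandas column, the same function could be applied as `df["mapped_city"] = resolve_unique(df["city_name"].tolist(), get_standard_english_city)`.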

2. Fuzzy Matching with a Standard City Database

If API calls aren’t feasible (e.g., offline work), use a pre-built list of global cities (like GeoNames’ cities15000.txt, which includes all cities with population >15k) and fuzzy matching to find the closest English name.
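
Before reaching for fuzzy matching, it may be worth trying an exact lookup first: each row of the GeoNames dump carries a comma-separated `alternatenames` column (the fourth tab-separated field) that often lists the city's name in other languages. The sketch below builds an alias index from that column. `build_alias_index` is a hypothetical helper, the two sample rows are abbreviated and made up for illustration, and the primary `name` column is usually, but not always, the English form.

```python
import csv
import io

# Column positions in the GeoNames dump (tab-separated, no header row)
NAME, ASCIINAME, ALTERNATENAMES = 1, 2, 3

def build_alias_index(lines):
    """Map every case-folded alias of a city to its primary GeoNames name."""
    index = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= ALTERNATENAMES:
            continue  # skip malformed or truncated rows
        primary = row[NAME]
        aliases = [row[NAME], row[ASCIINAME]] + row[ALTERNATENAMES].split(",")
        for alias in aliases:
            alias = alias.strip().casefold()
            if alias:
                # On collisions, the first city listed keeps the alias
                index.setdefault(alias, primary)
    return index

# Abbreviated, illustrative rows (real rows have 19 columns and real IDs)
sample = io.StringIO(
    "1\tBrussels\tBrussels\tBruxelles,Brussel,Brüssel,布鲁塞尔\t50.85\t4.35\n"
    "2\tParis\tParis\tParigi,Parijs,巴黎\t48.85\t2.35\n"
)
index = build_alias_index(sample)
print(index.get("bruxelles"), index.get("巴黎"), index.get("atlantis"))
```

For the real file, passing `open("cities15000.txt", encoding="utf-8", newline="")` in place of `sample` should work the same way. Entries that miss this exact lookup can then go to fuzzy matching.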

Here’s how to do this with fuzzywuzzy:

from fuzzywuzzy import process
import pandas as pd

# Load GeoNames' standard city list (pre-filtered to English names)
standard_cities = pd.read_csv("geonames_cities.csv", sep="\t")["Name"].tolist()

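def fuzzy_match_standard_city(city_str):
    # Skip missing or non-string values (e.g., NaN from pandas)
    if not isinstance(city_str, str) or not city_str.strip():
        return None
    # Clean the input but keep the whole string: taking only the first token
    # would truncate multi-word names such as "new york usa" to "new"
    clean_city = city_str.strip().lower()
    # extractOne returns None when no candidate reaches score_cutoff
    # (80 is a starting point; tune it for your data)
    result = process.extractOne(clean_city, standard_cities, score_cutoff=80)
    if result is None:
        return None
    match, score = result
    return match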

df = pd.read_csv("your_cities_dataset.csv")
df["mapped_city"] = df["city_name"].apply(fuzzy_match_standard_city)

Notes:

  • Preprocess the GeoNames list to remove duplicates and focus on English primary names.
  • Adjust the score threshold: higher = stricter matches (fewer false positives), lower = more matches (more false positives).
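
One way to pick the threshold is to hand-label a small sample and count correct, wrong, and rejected matches at several cutoffs. The sketch below illustrates this with the standard library's `difflib.SequenceMatcher` as a stand-in scorer, so its scores will differ from fuzzywuzzy's. The helper names and the tiny sample are assumptions for illustration.

```python
from difflib import SequenceMatcher

def best_match(query, choices):
    """Return (choice, score on a 0-100 scale) for the most similar choice."""
    scored = [
        (choice, 100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio())
        for choice in choices
    ]
    return max(scored, key=lambda pair: pair[1])

def evaluate_threshold(labeled_sample, choices, threshold):
    """Count correct, wrong, and rejected matches at a given threshold."""
    counts = {"correct": 0, "wrong": 0, "rejected": 0}
    for raw, expected in labeled_sample:
        match, score = best_match(raw, choices)
        if score < threshold:
            counts["rejected"] += 1   # below the cutoff: goes to manual review
        elif match == expected:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1      # accepted, but not the labeled answer
    return counts

choices = ["Brussels", "Paris", "Berlin"]
# (raw input, expected standard name); None means no valid match exists
labeled_sample = [("brusels", "Brussels"), ("pariss", "Paris"), ("bern", None)]

for threshold in (70, 85, 95):
    print(threshold, evaluate_threshold(labeled_sample, choices, threshold))
```

In this toy sample, the lowest cutoff wrongly accepts "bern" as Berlin, while the highest rejects everything, which is the trade-off the threshold controls.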

3. Hybrid Approach (Best of Both Worlds)

For maximum accuracy, combine geocoding and fuzzy matching:

  1. Use geocoding for most entries (it handles context like country suffixes best).
  2. Use fuzzy matching on entries the geocoder couldn’t identify.
  3. Manually review the tiny remaining subset (likely <5% of your 15k entries).

Example workflow:

# First pass: geocoding
df["mapped_city"] = df["city_name"].apply(get_standard_english_city)

# Isolate unmatched entries
unmatched = df[df["mapped_city"].isna()].copy()

# Second pass: fuzzy matching on unmatched entries
unmatched["mapped_city"] = unmatched["city_name"].apply(fuzzy_match_standard_city)

# Merge results back
df.update(unmatched)

# Export remaining unmatched entries for manual review
df[df["mapped_city"].isna()].to_csv("manual_review_cities.csv", index=False)
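
When reviewing results later, it can be useful to know which pass produced each mapping, since fuzzy matches generally deserve more scrutiny than geocoder hits. The following is a small sketch of the same fallback chain that records its source. `resolve_with_fallback` is a hypothetical helper, and the two dictionary lookups are stand-ins for the geocoding and fuzzy matching functions.

```python
def resolve_with_fallback(values, stages):
    """Try each (label, function) stage in order and record which one succeeded."""
    results = []
    for value in values:
        resolved, source = None, "manual_review"
        for label, func in stages:
            candidate = func(value)
            if candidate is not None:
                resolved, source = candidate, label
                break  # stop at the first stage that returns a result
        results.append({"input": value, "mapped_city": resolved, "source": source})
    return results

# Stand-ins (assumption) for get_standard_english_city and fuzzy_match_standard_city
geocode_stub = {"bruxelles belgium": "Brussels"}.get
fuzzy_stub = {"pariss": "Paris"}.get

rows = resolve_with_fallback(
    ["bruxelles belgium", "pariss", "xyzzy"],
    [("geocoder", geocode_stub), ("fuzzy", fuzzy_stub)],
)
for row in rows:
    print(row)
```

The list of dictionaries can be turned into a DataFrame with `pd.DataFrame(rows)` and filtered on the `source` column.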

Pro Tips

  • Test each method on a small sample of your data first to tweak thresholds and ensure accuracy.
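  • For non-Latin script city names (e.g., Chinese, Arabic), geocoding APIs can usually return the English name when you request English results (language="en" above). This relies on the English name being present in the underlying data rather than on transliteration, so spot-check these entries.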
  • Cache geocoding results to avoid redundant API calls (save results to a CSV or database).
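
As a rough illustration of the caching tip, here is a sketch of a JSON-file cache that survives restarts, so an interrupted multi-hour run can resume without repeating lookups. `JsonCache` and the file name are hypothetical, and `fake_resolver` stands in for the real geocoding function.

```python
import json
import os
import tempfile

class JsonCache:
    """Persist lookup results in a JSON file so reruns skip resolved inputs."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                self.data = json.load(handle)

    def lookup(self, key, resolver):
        # Failed lookups (None) are cached too, so they are not retried every run
        if key not in self.data:
            self.data[key] = resolver(key)
            self.save()
        return self.data[key]

    def save(self):
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(self.data, handle, ensure_ascii=False, indent=2)

# Stand-in resolver (assumption): records its calls instead of hitting an API
calls = []
def fake_resolver(key):
    calls.append(key)
    return "Brussels" if key == "bruxelles" else None

cache_path = os.path.join(tempfile.mkdtemp(), "geocode_cache.json")
first = JsonCache(cache_path).lookup("bruxelles", fake_resolver)
# A fresh instance reads the file, so the resolver is not called again
second = JsonCache(cache_path).lookup("bruxelles", fake_resolver)
print(first, second, len(calls))
```

Writing the file after every new lookup is cheap next to a one-second API delay. For much larger datasets, SQLite or batched writes would likely be a better fit.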

The question in this post comes from Stack Exchange; it was asked by Raul Gonzales.
