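Development requirements: downloading and cleaning weather data, storing it as CSV, and processing precipitation data for four US cities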
Let’s walk through how to solve this task, from fetching and cleaning the weather data to generating the 48 monthly files (4 stations × 12 months of 2017) and calculating the average precipitation. I’ll use Python since it’s the go-to for data tasks like this.
First, make sure you have the necessary libraries. We’ll use requests to pull data from the API, pandas for data cleaning and analysis, and os for file management. os ships with Python, so only requests and pandas need installing via pip:
    pip install requests pandas
Start by reading the stations.csv file to get the four city station IDs. Adjust the column name if your CSV uses something other than station_id:
    import os

    import pandas as pd
    import requests

    # Load station IDs from the CSV
    stations_df = pd.read_csv("stations.csv")
    station_ids = stations_df["station_id"].tolist()  # Update column name if needed
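If the column name in stations.csv is uncertain, it can help to fail early with a readable message instead of a bare KeyError. The following is a minimal sketch, not part of the original answer: the helper name `load_station_ids`, the sample IDs (`STN001`, `STN002`) and the `city` column are made up for illustration. Reading every column as a string is an assumption that protects IDs with leading zeros.

```python
import io

import pandas as pd


def load_station_ids(csv_source, column="station_id"):
    """Return the unique, non-empty station IDs found in one CSV column."""
    # dtype=str keeps IDs such as "00123" from being parsed as the number 123
    stations = pd.read_csv(csv_source, dtype=str)
    if column not in stations.columns:
        raise ValueError(
            f"Column {column!r} not found; available columns: {list(stations.columns)}"
        )
    ids = stations[column].dropna().str.strip()
    return ids[ids != ""].drop_duplicates().tolist()


# Hypothetical file content standing in for stations.csv
sample_csv = io.StringIO(
    "station_id,city\n"
    "STN001,City A\n"
    "STN002,City B\n"
    "STN001,City A\n"
)
station_ids_demo = load_station_ids(sample_csv)
print(station_ids_demo)
```

With a real file, the call would be `load_station_ids("stations.csv")`, passing a different `column` if the header differs.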
Next, we’ll loop through each station and each month of 2017, fetch the data, clean it, and save it as a separate CSV. You’ll need to replace the base_url with your actual weather data API’s URL template (adjust the placeholders to match how the API accepts station ID, year, and month):
    # Create a folder to store the monthly files (avoids clutter)
    os.makedirs("monthly_weather_data", exist_ok=True)

    # Replace this with your actual API URL template
    base_url = "https://your-weather-api-url.com/stations/{station_id}/data?year={year}&month={month}"

    for station_id in station_ids:
        for month in range(1, 13):
            # Format month as two digits (e.g., 01 for January)
            month_str = f"{month:02d}"
            # Build the full request URL
            request_url = base_url.format(station_id=station_id, year=2017, month=month_str)

            try:
                # Fetch the data from the API; the timeout keeps a stalled request from hanging the loop
                response = requests.get(request_url, timeout=30)
                response.raise_for_status()  # Raise an error if the request fails

                # Parse the data into a DataFrame (adjust this based on your API's output format)
                # Example for JSON data:
                data = response.json()
                df = pd.DataFrame(data["observations"])  # Update to match your API's structure

                # --- Data Cleaning Steps (customize these to your data!) ---
                # 1. Keep only the columns we need (date and precipitation)
                df = df[["date", "precipitation"]].copy()  # Replace with your actual column names
                # 2. Convert date to datetime format (critical for sorting/filtering)
                df["date"] = pd.to_datetime(df["date"])
                # 3. Ensure precipitation values are numeric (string entries that can't be parsed become NaN)
                df["precipitation"] = pd.to_numeric(df["precipitation"], errors="coerce")
                # 4. Drop rows with missing or non-numeric precipitation values
                df = df.dropna(subset=["precipitation"])

                # --- Save the cleaned data as a CSV ---
                output_filename = f"monthly_weather_data/{station_id}_2017_{month_str}.csv"
                df.to_csv(output_filename, index=False)
                print(f"Successfully saved: {output_filename}")

            except Exception as e:
                # Log the failure and move on to the next month
                print(f"Failed to process {station_id} - {month_str}/2017: {e}")
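Before sending any requests, it can be useful to list every URL and output path the loop will produce and confirm there are 48 of them. This is a rough sketch under assumptions: `build_download_plan` is a hypothetical helper, the four station IDs are invented, and the URL template is the same placeholder used above, not a real API.

```python
from itertools import product


def build_download_plan(station_ids, year, url_template, output_dir="monthly_weather_data"):
    """Return one (url, output_path) pair per station and month of the given year."""
    plan = []
    for station_id, month in product(station_ids, range(1, 13)):
        month_str = f"{month:02d}"  # two-digit month, e.g. 01 for January
        url = url_template.format(station_id=station_id, year=year, month=month_str)
        output_path = f"{output_dir}/{station_id}_{year}_{month_str}.csv"
        plan.append((url, output_path))
    return plan


# Hypothetical station IDs; the template is a placeholder, not a real endpoint
demo_ids = ["STN001", "STN002", "STN003", "STN004"]
demo_template = "https://your-weather-api-url.com/stations/{station_id}/data?year={year}&month={month}"

download_plan = build_download_plan(demo_ids, 2017, demo_template)
print(len(download_plan))  # 4 stations x 12 months
print(download_plan[0])
```

Printing the plan first makes it easy to spot a wrong placeholder in the template before any API quota is spent.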
Quick Notes on Data Cleaning:
- If your API returns CSV data instead of JSON, replace the parsing step with df = pd.read_csv(request_url)
- Adjust column names to match what your data uses (e.g., precip instead of precipitation, timestamp instead of date)
- Add extra steps if needed: filter out invalid values (like negative precipitation), convert units (inches to mm), or handle timezone differences
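The extra cleaning steps mentioned in these notes could look roughly like the sketch below. It assumes the columns are named `date` and `precipitation`, that negative values are invalid sentinels, and that the source unit may be inches; the function name `clean_precipitation` and the sample rows are invented for illustration. Timezone handling is left out because it depends on what the API returns.

```python
import pandas as pd

INCH_TO_MM = 25.4  # exact conversion factor


def clean_precipitation(df, date_col="date", precip_col="precipitation", in_inches=False):
    """Keep valid (date, precipitation) rows, optionally converting inches to mm."""
    cleaned = df[[date_col, precip_col]].copy()
    # Unparseable dates and non-numeric readings become NaT/NaN and are dropped below
    cleaned[date_col] = pd.to_datetime(cleaned[date_col], errors="coerce")
    cleaned[precip_col] = pd.to_numeric(cleaned[precip_col], errors="coerce")
    cleaned = cleaned.dropna(subset=[date_col, precip_col])
    # Negative precipitation is treated as invalid (assumption: e.g. -9999 sentinels)
    cleaned = cleaned[cleaned[precip_col] >= 0].copy()
    if in_inches:
        cleaned[precip_col] = cleaned[precip_col] * INCH_TO_MM
    return cleaned.sort_values(date_col).reset_index(drop=True)


# Made-up sample rows covering the problem cases
raw = pd.DataFrame({
    "date": ["2017-01-02", "2017-01-01", "not a date", "2017-01-03", "2017-01-04"],
    "precipitation": ["0.5", "1.0", "0.2", "-9999", "T"],
})
cleaned_demo = clean_precipitation(raw, in_inches=True)
print(cleaned_demo)
```

Only the two rows with a valid date and a non-negative numeric reading survive, sorted by date.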
Once you have all 48 monthly files, calculating the averages is straightforward. We’ll loop through each file, compute the mean precipitation, and save the results in a summary CSV:
    # Store average precipitation data in a list
    avg_precip_results = []

    for station_id in station_ids:
        for month in range(1, 13):
            month_str = f"{month:02d}"
            file_path = f"monthly_weather_data/{station_id}_2017_{month_str}.csv"

            # Skip months whose download failed, so a missing file doesn't crash the loop
            if not os.path.exists(file_path):
                print(f"No file for {station_id} - {month_str}/2017, skipping")
                continue

            # Read the cleaned monthly data
            df = pd.read_csv(file_path)
            # Calculate the monthly average
            monthly_avg = df["precipitation"].mean()

            # Add to our results list
            avg_precip_results.append({
                "station_id": station_id,
                "year": 2017,
                "month": month,
                "average_precipitation": round(monthly_avg, 2),  # Round to 2 decimal places for readability
            })

    # Convert results to a DataFrame and save
    avg_precip_df = pd.DataFrame(avg_precip_results)
    avg_precip_df.to_csv("2017_monthly_avg_precipitation.csv", index=False)
    print("Monthly average precipitation results saved to 2017_monthly_avg_precipitation.csv")
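An alternative is to drive the averaging from whatever files actually exist on disk, which avoids assuming that all 48 downloads succeeded. The sketch below assumes the `{station_id}_{year}_{month}.csv` naming used above and a `precipitation` column; `monthly_averages` is a hypothetical helper, and the demo writes made-up files to a temporary directory so it runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def monthly_averages(data_dir, year=2017):
    """Average the precipitation column of every monthly file found in data_dir."""
    rows = []
    for path in sorted(Path(data_dir).glob(f"*_{year}_*.csv")):
        # rsplit from the right so station IDs containing underscores still parse
        station_id, _, month = path.stem.rsplit("_", 2)
        df = pd.read_csv(path)
        if df.empty:
            continue  # header-only file: no observations survived cleaning
        rows.append({
            "station_id": station_id,
            "year": year,
            "month": int(month),
            "average_precipitation": round(df["precipitation"].mean(), 2),
        })
    return pd.DataFrame(
        rows, columns=["station_id", "year", "month", "average_precipitation"]
    )


# Demo with made-up values written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    pd.DataFrame({"date": ["2017-01-01", "2017-01-02", "2017-01-03"],
                  "precipitation": [1.0, 2.0, 3.0]}).to_csv(tmp_dir / "STN001_2017_01.csv", index=False)
    pd.DataFrame(columns=["date", "precipitation"]).to_csv(tmp_dir / "STN001_2017_02.csv", index=False)
    pd.DataFrame({"date": ["2017-01-01", "2017-01-02"],
                  "precipitation": [0.0, 0.5]}).to_csv(tmp_dir / "STN002_2017_01.csv", index=False)
    averages_demo = monthly_averages(tmp_dir)

print(averages_demo)
```

Months with no file or no rows simply do not appear in the summary, so a quick count of rows per station shows which months are missing.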
- API Rate Limits: If you hit API request limits, add a small delay between requests with time.sleep(1) (don’t forget to import time)
- Missing Data: Some months might have no data; the try-except block in the download loop logs the error and moves on, so no file is written for that month. Check that a file exists before reading it in the averaging step
- Data Types: Double-check that precipitation values are numeric; if you see errors, adjust the pd.to_numeric step to handle edge cases
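The rate-limit and missing-data tips can be combined into one small wrapper that pauses between requests and retries a failed one a few times before giving up. This is only a sketch: `fetch_all` and its parameters are hypothetical, and the `fetch` argument is a stand-in for whatever function performs the real request, so the demo runs without touching the network.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, retries=3, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between calls and retrying failures."""
    results = {}
    for url in urls:
        for attempt in range(1, retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception as exc:
                if attempt == retries:
                    print(f"Giving up on {url}: {exc}")
                else:
                    sleep(delay_seconds * attempt)  # wait a little longer after each failure
        sleep(delay_seconds)  # fixed pause before moving on to the next URL
    return results


# Demo with a fake fetcher that fails once for the second item
calls = []


def fake_fetch(url):
    calls.append(url)
    if url == "month-02" and calls.count(url) == 1:
        raise ConnectionError("temporary failure")
    return f"data for {url}"


pauses = []  # record the requested pauses instead of actually sleeping
demo_results = fetch_all(["month-01", "month-02"], fake_fetch, sleep=pauses.append)
print(demo_results)
```

In the download loop, `fetch` could be something like `lambda url: requests.get(url, timeout=30)`, with `raise_for_status()` called inside it so HTTP errors also trigger a retry.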
The question in this post comes from Stack Exchange; asked by Ma_




