如何用Python将BeautifulSoup爬取的多页数据导出为CSV/Excel
Hey there! Nice job getting that multi-page web scraping working—pulling data across pages is a solid win for a Python newbie. Let's turn those printed results into a usable CSV or Excel file. I'll walk you through two simple methods: one using Python's built-in tools, and another with pandas for easier data handling.
Method 1: Use Python's Built-in csv Module (No Extra Installs Needed)
This is great if you want to stick to the standard library without installing new packages. We'll modify your existing code to collect all the bid data into a list of dictionaries first, then write it to a CSV file.
Here's the updated code:
import requests import csv from urllib3.exceptions import InsecureRequestWarning requests.packages.urllib3.disable_warnings(InsecureRequestWarning) from bs4 import BeautifulSoup as bs def scrape_bid_data(): page_no = 1 # initial page number # Initialize an empty list to store all bid data bid_records = [] while True: print('Hold on creating URL to fetch data...') URL = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no) # fixed URL typo (amp; was unnecessary) print('URL created: ' + URL) scraped_data = requests.get(URL, verify=False) soup_data = bs(scraped_data.text, 'lxml') extracted_data = soup_data.find('div', {'id':'pagi_content'}) if not extracted_data or len(extracted_data) == 0: # Handle case where div isn't found break for idx in range(len(extracted_data)): if idx % 2 == 1: # Target odd indexes as before bid_data = extracted_data.contents[idx].text.strip().split('\n') # Clean up whitespace in each field bid_record = { "Bid Number": bid_data[0].strip(), "Items": bid_data[5].strip(), "Quantity Required": bid_data[6].strip(), "Department": (bid_data[10] + bid_data[12].strip()).strip(), "Start Date": bid_data[16].strip(), "End Date": bid_data[17].strip() } bid_records.append(bid_record) # Optional: Keep printing if you want to track progress print('-' * 100) print(bid_record["Bid Number"]) print('-' * 100) page_no +=1 # Write all collected data to CSV with open('gem_bid_data.csv', 'w', newline='', encoding='utf-8') as csv_file: # Define column headers matching our dictionary keys fieldnames = ["Bid Number", "Items", "Quantity Required", "Department", "Start Date", "End Date"] writer = csv.DictWriter(csv_file, fieldnames=fieldnames) writer.writeheader() # Write column names first writer.writerows(bid_records) # Write all rows print(f"Successfully saved {len(bid_records)} bid records to gem_bid_data.csv!") scrape_bid_data()
Key changes I made:
- Added a
bid_recordslist to store all data instead of just printing it - Converted each bid entry into a dictionary for clear column mapping
- Added whitespace cleaning with
.strip()to avoid messy data - Fixed a small typo in your URL (
&->&) - Added a check for
extracted_databeingNoneto avoid errors
Method 2: Use pandas (Easier for Excel & Data Manipulation)
If you want to export to Excel or do any quick data analysis later, pandas is the way to go. It's more concise and handles formatting automatically.
First, install the required packages:
pip install pandas openpyxl
Then use this updated code:
import requests import pandas as pd from urllib3.exceptions import InsecureRequestWarning requests.packages.urllib3.disable_warnings(InsecureRequestWarning) from bs4 import BeautifulSoup as bs def scrape_bid_data(): page_no = 1 bid_records = [] while True: print('Hold on creating URL to fetch data...') URL = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no) print('URL created: ' + URL) scraped_data = requests.get(URL, verify=False) soup_data = bs(scraped_data.text, 'lxml') extracted_data = soup_data.find('div', {'id':'pagi_content'}) if not extracted_data or len(extracted_data) == 0: break for idx in range(len(extracted_data)): if idx % 2 == 1: bid_data = extracted_data.contents[idx].text.strip().split('\n') bid_record = { "Bid Number": bid_data[0].strip(), "Items": bid_data[5].strip(), "Quantity Required": bid_data[6].strip(), "Department": (bid_data[10] + bid_data[12].strip()).strip(), "Start Date": bid_data[16].strip(), "End Date": bid_data[17].strip() } bid_records.append(bid_record) page_no +=1 # Convert list of dictionaries to a DataFrame df = pd.DataFrame(bid_records) # Save to CSV df.to_csv('gem_bid_data.csv', index=False, encoding='utf-8') # Save to Excel (requires openpyxl) df.to_excel('gem_bid_data.xlsx', index=False, engine='openpyxl') print(f"Successfully saved {len(bid_records)} bid records to CSV and Excel!") scrape_bid_data()
This method gives you both CSV and Excel files with minimal extra code. The index=False parameter ensures we don't add an unnecessary row number column.
Quick Tips:
- If you run into encoding issues, try adding
encoding='utf-8-sig'instead ofutf-8when saving files - Double-check the
bid_dataindexes (likebid_data[5]) to make sure you're pulling the right fields—sometimes website layouts change! - For large datasets,
pandaswill handle memory more efficiently than manual CSV writing
内容的提问来源于stack exchange,提问作者user14241072




