You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

如何用Python将BeautifulSoup爬取的多页数据导出为CSV/Excel

How to Export Scraped Bid Data to CSV/Excel in Python

Hey there! Nice job getting that multi-page web scraping working—pulling data across pages is a solid win for a Python newbie. Let's turn those printed results into a usable CSV or Excel file. I'll walk you through two simple methods: one using Python's built-in tools, and another with pandas for easier data handling.

Method 1: Use Python's Built-in csv Module (No Extra Installs Needed)

This is great if you want to stick to the standard library without installing new packages. We'll modify your existing code to collect all the bid data into a list of dictionaries first, then write it to a CSV file.

Here's the updated code:

import requests
import csv
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs

def scrape_bid_data():
    page_no = 1 # initial page number
    # Initialize an empty list to store all bid data
    bid_records = []
    
    while True:
        print('Hold on creating URL to fetch data...')
        URL = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no) # fixed URL typo (amp; was unnecessary)
        print('URL created: ' + URL)
        
        scraped_data = requests.get(URL, verify=False)
        soup_data = bs(scraped_data.text, 'lxml')
        extracted_data = soup_data.find('div', {'id':'pagi_content'})
        
        if not extracted_data or len(extracted_data) == 0: # Handle case where div isn't found
            break
        
        for idx in range(len(extracted_data)):
            if idx % 2 == 1: # Target odd indexes as before
                bid_data = extracted_data.contents[idx].text.strip().split('\n')
                # Clean up whitespace in each field
                bid_record = {
                    "Bid Number": bid_data[0].strip(),
                    "Items": bid_data[5].strip(),
                    "Quantity Required": bid_data[6].strip(),
                    "Department": (bid_data[10] + bid_data[12].strip()).strip(),
                    "Start Date": bid_data[16].strip(),
                    "End Date": bid_data[17].strip()
                }
                bid_records.append(bid_record)
                # Optional: Keep printing if you want to track progress
                print('-' * 100)
                print(bid_record["Bid Number"])
                print('-' * 100)
        
        page_no +=1
    
    # Write all collected data to CSV
    with open('gem_bid_data.csv', 'w', newline='', encoding='utf-8') as csv_file:
        # Define column headers matching our dictionary keys
        fieldnames = ["Bid Number", "Items", "Quantity Required", "Department", "Start Date", "End Date"]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        
        writer.writeheader() # Write column names first
        writer.writerows(bid_records) # Write all rows
    
    print(f"Successfully saved {len(bid_records)} bid records to gem_bid_data.csv!")

scrape_bid_data()

Key changes I made:

  • Added a bid_records list to store all data instead of just printing it
  • Converted each bid entry into a dictionary for clear column mapping
  • Added whitespace cleaning with .strip() to avoid messy data
  • Fixed a small typo in your URL (& -> &)
  • Added a check for extracted_data being None to avoid errors

Method 2: Use pandas (Easier for Excel & Data Manipulation)

If you want to export to Excel or do any quick data analysis later, pandas is the way to go. It's more concise and handles formatting automatically.

First, install the required packages:

pip install pandas openpyxl

Then use this updated code:

import requests
import pandas as pd
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs

def scrape_bid_data():
    page_no = 1
    bid_records = []
    
    while True:
        print('Hold on creating URL to fetch data...')
        URL = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no)
        print('URL created: ' + URL)
        
        scraped_data = requests.get(URL, verify=False)
        soup_data = bs(scraped_data.text, 'lxml')
        extracted_data = soup_data.find('div', {'id':'pagi_content'})
        
        if not extracted_data or len(extracted_data) == 0:
            break
        
        for idx in range(len(extracted_data)):
            if idx % 2 == 1:
                bid_data = extracted_data.contents[idx].text.strip().split('\n')
                bid_record = {
                    "Bid Number": bid_data[0].strip(),
                    "Items": bid_data[5].strip(),
                    "Quantity Required": bid_data[6].strip(),
                    "Department": (bid_data[10] + bid_data[12].strip()).strip(),
                    "Start Date": bid_data[16].strip(),
                    "End Date": bid_data[17].strip()
                }
                bid_records.append(bid_record)
        
        page_no +=1
    
    # Convert list of dictionaries to a DataFrame
    df = pd.DataFrame(bid_records)
    
    # Save to CSV
    df.to_csv('gem_bid_data.csv', index=False, encoding='utf-8')
    # Save to Excel (requires openpyxl)
    df.to_excel('gem_bid_data.xlsx', index=False, engine='openpyxl')
    
    print(f"Successfully saved {len(bid_records)} bid records to CSV and Excel!")

scrape_bid_data()

This method gives you both CSV and Excel files with minimal extra code. The index=False parameter ensures we don't add an unnecessary row number column.

Quick Tips:

  • If you run into encoding issues, try adding encoding='utf-8-sig' instead of utf-8 when saving files
  • Double-check the bid_data indexes (like bid_data[5]) to make sure you're pulling the right fields—sometimes website layouts change!
  • For large datasets, pandas will handle memory more efficiently than manual CSV writing

内容的提问来源于stack exchange,提问作者user14241072

火山引擎 最新活动