Question: How do I develop a web scraper and crawler project for football odds on cashpoint.com?

阿华AIGC实验室

2026-5-29

Hey there! Let's break down how you can build this scraper step by step—since you already have Python 3 and basic crawling under your belt, you're already halfway there. Your two initial ideas are solid, so let's refine them and add practical details to make this work smoothly.

Overall Project Workflow

We'll split this into 4 core stages: grab league entry links, scrape odds data from each page, clean & structure data, and store it in a database. Let's dive into each part.

1. Prep Your Tools

First, install the dependencies you'll need—these will handle browser automation, data handling, and database interactions:

pip install selenium webdriver-manager pandas sqlalchemy

- selenium: Handles JS-rendered pages (your go-to for cashpoint's dynamic content)
- webdriver-manager: Automatically manages browser drivers, so there are no manual downloads
- pandas: Makes data structuring and batch storage easy
- sqlalchemy: Simplifies database operations (works with SQLite, MySQL, and PostgreSQL; start with SQLite for simplicity)

2. Get All Football League Links

You can pick either of your ideas, or combine them for reliability:

Option A: Collect Links with Selenium (Your Second Idea)

This is straightforward since you're already comfortable with Selenium:

- Launch the browser and navigate to cashpoint's football homepage
- Wait for the left menu to load, then locate all league links using XPath/CSS selectors (example: //ul[contains(@class, 'league-nav')]/li/a)
- Either extract the href attribute directly, or click each link and save the current URL. To avoid race conditions, use WebDriverWait (from selenium.webdriver.support.ui) together with expected_conditions (from selenium.webdriver.support, usually imported as EC)
- Store all unique URLs in a list/set to avoid duplicates
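
As a rough sketch of the href-extraction and de-duplication steps above: the helper below reads href values from the menu links and keeps each absolute URL once. The XPath is the illustrative one from the list, not a verified selector for cashpoint.com, and collect_league_links is a hypothetical name. It accepts any Selenium 4 driver and assumes the menu has already finished loading.

```python
from urllib.parse import urldefrag, urljoin

# Illustrative selector from the step above; the real menu markup may differ.
LEAGUE_LINK_XPATH = "//ul[contains(@class, 'league-nav')]/li/a"


def collect_league_links(driver, base_url, xpath=LEAGUE_LINK_XPATH):
    """Return unique absolute league URLs in the order they appear in the menu."""
    seen = set()
    urls = []
    # "xpath" is the same string that Selenium's By.XPATH constant holds.
    for link in driver.find_elements("xpath", xpath):
        href = link.get_attribute("href")
        if not href:
            continue  # skip anchors without a target
        # Resolve relative links and drop "#fragment" parts before comparing.
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Keeping a list next to the set preserves menu order, which makes later logs easier to compare against the site.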

Option B: Crawl Links with a Spider (Your First Idea)

If you want to speed up link collection, use a crawler like Scrapy with the scrapy-selenium middleware to render JS. This lets you scrape links in bulk without manually clicking. But if Selenium is more familiar, stick with Option A first.

3. Scrape Odds Data from Each League Page

Once you have all league URLs, loop through each one to extract match details:

Load the league page and wait for the odds table to render (use WebDriverWait to wait for the table element to be present)
Traverse each row in the odds table to pull:
- League name
- Match title (e.g., "Man Utd vs Liverpool")
- Match date/time
- Home win, draw, away win odds

Structure each match's data into a dictionary for consistency:

match_data = {
    "league": "Premier League",
    "match_name": "Man Utd vs Liverpool",
    "match_time": "2024-05-20 19:00",
    "home_odds": 2.15,
    "draw_odds": 3.40,
    "away_odds": 3.25
}
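
Cell text read from a table row arrives as strings, so it needs converting before it fits the dictionary above. Here is a minimal sketch, assuming each row yields the match name, kick-off time, and three odds as text in that order. That ordering and the time format are assumptions about the page layout, not something verified against cashpoint.com, and parse_odds and build_match_data are hypothetical helper names.

```python
from datetime import datetime


def parse_odds(text):
    """Convert an odds cell such as '2.15' or '2,15' to float; None if not numeric."""
    try:
        return float(text.strip().replace(",", "."))
    except ValueError:
        return None


def build_match_data(league, cells):
    """Build one match_data dict from the text of a row's cells.

    Assumed cell order: match name, kick-off time, home, draw, away.
    """
    match_name, match_time, home, draw, away = (c.strip() for c in cells)
    # Validate the time format early; raises ValueError on unexpected input.
    datetime.strptime(match_time, "%Y-%m-%d %H:%M")
    return {
        "league": league,
        "match_name": match_name,
        "match_time": match_time,
        "home_odds": parse_odds(home),
        "draw_odds": parse_odds(draw),
        "away_odds": parse_odds(away),
    }
```

Returning None for non-numeric odds (for example a suspended market shown as "-") lets you decide later whether to skip or store such rows.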

Handle dynamic loading: Some pages load more matches as you scroll—simulate scrolling with driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") and wait for new content to load before scraping again.
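
One scroll is often not enough, so a loop that stops once the page height stops growing is a common pattern. The sketch below is one way to do it. The function name scroll_until_stable, the fixed pause, and the max_rounds cap are assumptions chosen for illustration; a WebDriverWait on the row count would be a more precise wait than a fixed pause.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for lazy-loaded rows to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new loaded, so we are done
        last_height = new_height
    return max_rounds
```

Call it once after the odds table first appears, then scrape the rows in a single pass.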

4. Store Data in a Database

Using SQLAlchemy makes this clean and scalable. Here's a quick example with SQLite:

Step 1: Define a Data Model

from sqlalchemy import create_engine, Column, Integer, String, Float, DateTime
# declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4;
# the old sqlalchemy.ext.declarative import path is deprecated.
from sqlalchemy.orm import declarative_base, sessionmaker
from datetime import datetime

Base = declarative_base()

class MatchOdds(Base):
    __tablename__ = "match_odds"
    id = Column(Integer, primary_key=True, autoincrement=True)
    league = Column(String(50))
    match_name = Column(String(100))
    match_time = Column(DateTime)
    home_odds = Column(Float)
    draw_odds = Column(Float)
    away_odds = Column(Float)

# Create database connection and table
engine = create_engine("sqlite:///cashpoint_odds.db")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

Step 2: Save Scraped Data

You can save individual entries or batch-save with Pandas (faster for large datasets):

# Individual entry
new_match = MatchOdds(
    league=match_data["league"],
    match_name=match_data["match_name"],
    match_time=datetime.strptime(match_data["match_time"], "%Y-%m-%d %H:%M"),
    home_odds=match_data["home_odds"],
    draw_odds=match_data["draw_odds"],
    away_odds=match_data["away_odds"]
)
session.add(new_match)
session.commit()

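# Batch save with Pandas
import pandas as pd
data_list = []  # Collect all match_data dicts here
df = pd.DataFrame(data_list)
if not df.empty:
    # Convert the time strings to datetimes so they match the DateTime column
    df["match_time"] = pd.to_datetime(df["match_time"], format="%Y-%m-%d %H:%M")
    df.to_sql("match_odds", engine, if_exists="append", index=False)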

5. Key Optimizations & Anti-Blocking Tips

Cashpoint might flag your scraper, so add these safeguards:

- Random wait times: Call time.sleep(random.uniform(1, 3)) between page loads to mimic human pacing (this needs import random and import time)
- Custom User-Agent: On Chrome or another Chromium-based driver, override the user agent with driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"})
- Error handling: Wrap scraping code in try-except blocks to catch missing elements or failed loads, so your script doesn't crash mid-run
- Modular code: Take inspiration from gingeleski's odds-portal-scraper and split your code into functions/classes such as get_leagues(), scrape_odds(), and save_to_db() to keep it maintainable.
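
The random-wait and error-handling tips combine naturally into a small retry wrapper. The following is a sketch under assumptions: polite_sleep and with_retries are hypothetical names, three attempts is an arbitrary default, and the broad except clause should be narrowed to the Selenium exceptions you actually see.

```python
import random
import time


def polite_sleep(low=1.0, high=3.0):
    """Pause for a random interval so requests are not evenly spaced."""
    time.sleep(random.uniform(low, high))


def with_retries(task, attempts=3, low=1.0, high=3.0):
    """Call task() up to `attempts` times, pausing randomly between failures.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:  # narrow this to Selenium exceptions in real code
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed: {exc!r}; retrying")
            polite_sleep(low, high)
```

A call might look like with_retries(lambda: scrape_odds(driver, url)), where scrape_odds is your own function for a single league page.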

6. Optional: Speed Up with Asynchronous Scraping

If you have hundreds of leagues, consider using playwright (supports async) instead of Selenium, or set up Selenium Grid for parallel scraping. But start with the synchronous version first—get it working reliably before optimizing speed.
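
If you do go async later, capping concurrency matters more than raw speed, since too many parallel page loads is an easy way to get blocked. Below is a minimal sketch of that pattern using only asyncio. scrape_league is a hypothetical placeholder that does no real scraping; its body is where Playwright (or any other async client) calls would go.

```python
import asyncio


async def scrape_league(url):
    """Hypothetical placeholder: replace the body with real async page scraping."""
    await asyncio.sleep(0)  # stands in for network and rendering time
    return {"url": url, "matches": []}


async def scrape_all(urls, max_concurrent=5):
    """Scrape every URL while running at most `max_concurrent` tasks at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await scrape_league(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Run it with results = asyncio.run(scrape_all(league_urls)), where league_urls is the list collected in step 2.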

This question comes from Stack Exchange; it was asked by user2939562.
