Question: How do I build a web scraper and crawler project for football odds on cashpoint.com?
Hey there! Let's break down how you can build this scraper step by step. Since you have Python 3 and basic crawling under your belt, you're already halfway there. Your two initial ideas are solid, so let's refine them and add practical details to make this work smoothly.
We'll split this into 4 core stages: grab league entry links, scrape odds data from each page, clean & structure data, and store it in a database. Let's dive into each part.
1. Prep Your Tools
First, install the dependencies you'll need—these will handle browser automation, data handling, and database interactions:
pip install selenium webdriver-manager pandas sqlalchemy
- selenium: Handles JS-rendered pages (your go-to for cashpoint's dynamic content)
- webdriver-manager: Automatically manages browser drivers (no manual downloads!)
- pandas: Makes data structuring and batch storage easy
- sqlalchemy: Simplifies database operations (works with SQLite, MySQL, PostgreSQL; start with SQLite for simplicity)
2. Get All Football League Links
You can pick either of your ideas, or combine them for reliability:
Option A: Traverse Left Menu via Selenium (Your Second Idea)
This is straightforward since you're already comfortable with Selenium:
- Launch the browser and navigate to cashpoint's football homepage
- Wait for the left menu to load, then locate all league links using XPath/CSS selectors (example: //ul[contains(@class, 'league-nav')]/li/a)
- Either extract the href attribute directly, or click each link and save the current URL (use WebDriverWait to avoid race conditions: from selenium.webdriver.support.ui import WebDriverWait; from selenium.webdriver.support import expected_conditions as EC)
- Store all unique URLs in a list/set to avoid duplicates
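Here's a minimal sketch of Option A, in case it helps to see the pieces together. The start URL and the XPath are assumptions (the XPath is just the example selector from above), and `unique_league_urls` and `collect_league_links` are hypothetical helper names, so adjust them after inspecting the real page.

```python
from urllib.parse import urljoin, urldefrag

# Assumptions: replace both after inspecting the real page in your browser's dev tools.
START_URL = "https://www.cashpoint.com/"
LEAGUE_LINK_XPATH = "//ul[contains(@class, 'league-nav')]/li/a"


def unique_league_urls(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments and duplicates, keep order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:
            continue  # missing href attribute
        href = href.strip()
        if not href or href.startswith("javascript:"):
            continue  # menu items that only trigger scripts
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def collect_league_links(driver, start_url=START_URL, timeout=15):
    """Open the start page, wait for the menu links, and return unique URLs."""
    # Imported lazily so the pure helper above can be used without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver.get(start_url)
    links = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.XPATH, LEAGUE_LINK_XPATH))
    )
    hrefs = [link.get_attribute("href") for link in links]
    return unique_league_urls(start_url, hrefs)
```

Reading href directly is usually more robust than clicking each link, because you never leave the menu page and the element references don't go stale.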
Option B: Crawl Links with a Spider (Your First Idea)
If you want to speed up link collection, use a crawler like Scrapy with the scrapy-selenium middleware to render JS. This lets you scrape links in bulk without manually clicking. But if Selenium is more familiar, stick with Option A first.
3. Scrape Odds Data from Each League Page
Once you have all league URLs, loop through each one to extract match details:
- Load the league page and wait for the odds table to render (use WebDriverWait to wait for the table element to be present)
- Traverse each row in the odds table to pull:
  - League name
  - Match title (e.g., "Man Utd vs Liverpool")
  - Match date/time
  - Home win, draw, away win odds
- Structure each match's data into a dictionary for consistency:
    match_data = {
        "league": "Premier League",
        "match_name": "Man Utd vs Liverpool",
        "match_time": "2024-05-20 19:00",
        "home_odds": 2.15,
        "draw_odds": 3.40,
        "away_odds": 3.25
    }
- Handle dynamic loading: Some pages load more matches as you scroll. Simulate scrolling with driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") and wait for new content to load before scraping again.
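To make the scrolling and cleaning steps concrete, here's a rough sketch. `scroll_until_stable`, `parse_odds`, and `build_match` are hypothetical helper names, and the cell order passed to `build_match` is an assumption about how you read each table row. Accepting comma decimals such as "2,15" is also a guess about how the site might format odds.

```python
import time


def scroll_until_stable(driver, pause=1.5, max_rounds=20, sleep=time.sleep):
    """Scroll to the bottom until the page height stops growing; return rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give lazily loaded rows time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was loaded
        last_height = new_height
    return max_rounds


def parse_odds(text):
    """Turn an odds string such as '2,15' or ' 3.40 ' into a float, or None."""
    try:
        return float(text.strip().replace(",", "."))
    except (AttributeError, ValueError):
        return None  # missing cell or a placeholder such as '-'


def build_match(league, cells):
    """Build a match dict; cells is assumed to be [name, time, home, draw, away]."""
    name, when, home, draw, away = (cell.strip() for cell in cells)
    return {
        "league": league,
        "match_name": name,
        "match_time": when,
        "home_odds": parse_odds(home),
        "draw_odds": parse_odds(draw),
        "away_odds": parse_odds(away),
    }
```

Comparing the page height before and after each scroll is a simple stop condition. If the site loads rows without changing the height, counting table rows instead would be the safer check.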
4. Store Data in a Database
Using SQLAlchemy makes this clean and scalable. Here's a quick example with SQLite:
Step 1: Define a Data Model
from datetime import datetime

from sqlalchemy import create_engine, Column, Integer, String, Float, DateTime
# declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4;
# the old sqlalchemy.ext.declarative import is deprecated.
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class MatchOdds(Base):
    __tablename__ = "match_odds"
    id = Column(Integer, primary_key=True, autoincrement=True)
    league = Column(String(50))
    match_name = Column(String(100))
    match_time = Column(DateTime)
    home_odds = Column(Float)
    draw_odds = Column(Float)
    away_odds = Column(Float)

# Create database connection and table
engine = create_engine("sqlite:///cashpoint_odds.db")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
Step 2: Save Scraped Data
You can save individual entries or batch-save with Pandas (faster for large datasets):
# Individual entry
new_match = MatchOdds(
    league=match_data["league"],
    match_name=match_data["match_name"],
    match_time=datetime.strptime(match_data["match_time"], "%Y-%m-%d %H:%M"),
    home_odds=match_data["home_odds"],
    draw_odds=match_data["draw_odds"],
    away_odds=match_data["away_odds"]
)
session.add(new_match)
session.commit()

# Batch save with Pandas
import pandas as pd

data_list = []  # Collect all match_data dicts here
df = pd.DataFrame(data_list)
if not df.empty:
    # Parse the time strings so the DateTime column receives real timestamps
    df["match_time"] = pd.to_datetime(df["match_time"], format="%Y-%m-%d %H:%M")
    df.to_sql("match_odds", engine, if_exists="append", index=False)
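One thing to watch: appending on every run will store the same match again each time you re-scrape. Here's a sketch of one way around that, written with the standard-library sqlite3 module instead of SQLAlchemy to keep it dependency-free. The UNIQUE constraint and the `save_matches` helper are my assumptions, not part of the model above, so try it on a fresh database file.

```python
import sqlite3

# Same columns as the model above, plus an assumed uniqueness rule:
# one row per (league, match_name, match_time).
SCHEMA = """
CREATE TABLE IF NOT EXISTS match_odds (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    league TEXT,
    match_name TEXT,
    match_time TEXT,
    home_odds REAL,
    draw_odds REAL,
    away_odds REAL,
    UNIQUE (league, match_name, match_time)
)
"""

INSERT = """
INSERT OR REPLACE INTO match_odds
    (league, match_name, match_time, home_odds, draw_odds, away_odds)
VALUES
    (:league, :match_name, :match_time, :home_odds, :draw_odds, :away_odds)
"""


def save_matches(conn, matches):
    """Insert match dicts, replacing rows for matches already stored; return row count."""
    with conn:  # commits on success, rolls back on error
        conn.execute(SCHEMA)
        conn.executemany(INSERT, matches)
    return conn.execute("SELECT COUNT(*) FROM match_odds").fetchone()[0]
```

Calling `save_matches(sqlite3.connect("cashpoint_odds.db"), data_list)` after each scrape keeps only the latest odds per match. If you want the history of odds movements instead, add a scrape timestamp column to the unique key.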
5. Key Optimizations & Anti-Blocking Tips
Cashpoint might flag your scraper, so add these safeguards:
- Random wait times: Use time.sleep(random.uniform(1, 3)) between requests to mimic human behavior
- Custom User-Agent: Spoof a real browser's user agent with driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"}) (execute_cdp_cmd only works with Chromium-based drivers such as Chrome and Edge)
- Error handling: Wrap scraping code in try-except blocks to catch missing elements or failed loads, so your script doesn't crash mid-run
- Modular code: Take inspiration from gingeleski's odds-portal-scraper and split your code into functions/classes for get_leagues(), scrape_odds(), and save_to_db() to keep it maintainable.
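The random waits and the error handling combine nicely into one small wrapper. This is only a sketch: `polite_call` is a hypothetical name, and the retry count and wait range are placeholder values you'd tune yourself.

```python
import random
import time


def polite_call(func, *args, attempts=3, min_wait=1.0, max_wait=3.0, sleep=time.sleep):
    """Call func(*args) after a random pause; retry on failure up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        sleep(random.uniform(min_wait, max_wait))  # human-like gap before each try
        try:
            return func(*args)
        except Exception as exc:  # e.g. a Selenium TimeoutException
            last_error = exc
            print(f"attempt {attempt}/{attempts} failed for {args}: {exc!r}")
    raise last_error  # every attempt failed, so surface the last error
```

You'd wrap each page visit, for example `polite_call(scrape_odds, driver, url)`, so one flaky league page gets a couple of retries instead of killing the whole run.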
6. Optional: Speed Up with Asynchronous Scraping
If you have hundreds of leagues, consider using Playwright (which has an async API) instead of Selenium, or set up Selenium Grid for parallel scraping. But start with the synchronous version first and get it working reliably before optimizing speed.
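If you do go async later, the main thing to get right is capping how many pages are in flight at once. Here's a sketch using plain asyncio. `scrape_all` is a hypothetical helper, and `scrape_one` stands for whatever coroutine you write to scrape a single league page (for example with Playwright's async API); neither comes from a library.

```python
import asyncio


async def scrape_all(urls, scrape_one, max_concurrent=4):
    """Run scrape_one(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with semaphore:  # wait here once max_concurrent pages are open
            try:
                return url, await scrape_one(url)
            except Exception as exc:
                return url, exc  # keep going; the caller inspects failures

    pairs = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(pairs)  # maps each URL to its result or its exception


async def fake_scrape(url):
    """Stand-in for a real page scrape, so the sketch runs without a browser."""
    await asyncio.sleep(0.01)
    return [{"league": url, "match_name": "placeholder"}]
```

Running `asyncio.run(scrape_all(league_urls, fake_scrape))` returns a dict keyed by URL. A small `max_concurrent` also doubles as rate limiting, which matters given the anti-blocking concerns above.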
This question comes from Stack Exchange and was asked by user2939562.




