Scrapy Crawler Stuck on Deep Site Crawling: Help Integrating Multiple Steps to Scrape All Video Links

阿华AIGC实验室

2026-5-20

How to Integrate Your Web Scraping Workflow for Full Video Link Extraction

Hey there, it sounds like you’ve already got all the critical building blocks. Now it’s just a matter of stitching them into a smooth, end-to-end pipeline. Let’s walk through how to combine your three existing steps into one process that pulls every video link you need.

Step 1: Recap Your Existing Functions

First, let’s assume you’ve already got these core functions defined (adjust the names if yours differ slightly):

generate_module_urls(): Spits out a list of all your module URLs (module_urls)
get_lesson_urls(module_url): Takes a single module URL and returns a list of lesson URLs (lesson_urls) from that page
get_video_url(lesson_url): Takes a single lesson URL and extracts the direct video link from that page

Step 2: Build the Integration Pipeline

Now we’ll create a main workflow that iterates through each level: modules → lessons → videos. Here’s a Python example that ties everything together (plug in your existing code where noted):

import time  # For rate limiting
from urllib.parse import urljoin  # Resolves relative hrefs against the page URL

import requests
from bs4 import BeautifulSoup  # Adjust if you use a different parser like lxml

# One shared session so every request below reuses the same connections
session = requests.Session()

# Your existing functions (replace with your actual code)
def generate_module_urls():
    # Add your code to generate the list of module URLs here
    return ["https://example.com/module-1", "https://example.com/module-2"]

def get_lesson_urls(module_url):
    # Add your code to scrape lesson links from the module page
    response = session.get(module_url, timeout=10)
    response.raise_for_status()  # Raise on 4xx/5xx so the caller can log the failure
    soup = BeautifulSoup(response.text, "html.parser")
    # Update the CSS selector to match the lesson links on your target site
    lesson_elements = soup.select("div.lesson-container a[href]")
    # Resolve relative links such as "/lesson-1" into absolute URLs
    return [urljoin(module_url, link["href"]) for link in lesson_elements]

def get_video_url(lesson_url):
    # Add your code to extract the video link from the lesson page
    response = session.get(lesson_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Update the selector to match where the video link lives (e.g., <video> tag, iframe, etc.)
    # This one matches <video src="..."> as well as <video><source src="..."></video>
    video_element = soup.select_one("video[src], video source[src]")
    return urljoin(lesson_url, video_element["src"]) if video_element else None

# Main integration function
def scrape_all_video_links():
    all_video_links = []

    # Step 1: Fetch all module URLs
    module_urls = generate_module_urls()

    # Step 2: Loop through each module
    for module_url in module_urls:
        print(f"Processing module: {module_url}")
        try:
            # Get lesson URLs for this module
            lesson_urls = get_lesson_urls(module_url)

            # Step 3: Loop through each lesson in the module
            for lesson_url in lesson_urls:
                print(f"Processing lesson: {lesson_url}")
                try:
                    # Add a small delay to avoid overwhelming the server
                    time.sleep(1)
                    # Fetch the video link for this lesson
                    video_url = get_video_url(lesson_url)
                    if video_url:
                        all_video_links.append(video_url)
                        print(f"Found video: {video_url}")
                    else:
                        print(f"No video found for lesson: {lesson_url}")
                except Exception as e:
                    print(f"Failed to process lesson {lesson_url}: {e}")
        except Exception as e:
            print(f"Failed to process module {module_url}: {e}")

    return all_video_links

# Run the full scraper
if __name__ == "__main__":
    video_links = scrape_all_video_links()
    
    # Save results to a text file for easy access
    with open("all_video_links.txt", "w") as f:
        for link in video_links:
            f.write(f"{link}\n")
    
    print(f"Done! Scraped {len(video_links)} video links total.")
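
One thing worth guarding against once all three levels are chained: lesson hrefs are often relative, and the same lesson can be linked more than once. The helper below is a minimal sketch under those assumptions, using only the standard library; normalize_links and the sample hrefs are hypothetical names made up for illustration, not part of your existing code.

```python
from urllib.parse import urljoin


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    # Skip empty hrefs, then turn relative paths into absolute URLs
    absolute = (urljoin(base_url, href) for href in hrefs if href)
    # dict keys keep insertion order, so this de-duplicates without reordering
    return list(dict.fromkeys(absolute))


if __name__ == "__main__":
    # Hypothetical hrefs as they might appear on a module page
    sample = ["/lessons/intro", "lessons/setup", "/lessons/intro", ""]
    for url in normalize_links("https://example.com/course/module-1", sample):
        print(url)
```

You could run the hrefs collected in get_lesson_urls through a helper like this before returning them, with the module URL as base_url.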

Key Tips for Smooth Execution:

Error Handling: The try-except blocks ensure that if one module or lesson fails, the scraper keeps running instead of crashing entirely.
Rate Limiting: The time.sleep(1) call adds a 1-second delay before each lesson request. Adjust it as needed to stay polite to the website’s server and avoid getting blocked.
Session Reuse: A requests.Session() keeps connections alive between requests, which is faster than calling requests.get each time. This only helps if the requests actually go through the session (session.get(...)); creating the session on its own changes nothing.
Selector Tweaks: Double-check the CSS selectors in get_lesson_urls and get_video_url—use your browser’s dev tools to inspect the actual HTML elements on the target site and update the selectors to match.
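
If failures tend to be temporary (timeouts, the occasional 5xx), the error-handling and rate-limiting tips can be combined into a retry with growing delays. This is only a sketch: fetch_with_retries is a hypothetical helper, and the retry count and delays are arbitrary starting values rather than recommendations for any particular site.

```python
import time


def fetch_with_retries(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with exponential backoff.

    fetch is any callable that takes a URL and either returns a result or
    raises (for example get_video_url). sleep is a parameter so the waits
    can be replaced with a stub when testing.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # Out of retries: let the caller's try/except log it
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
```

Inside the lesson loop you could then call fetch_with_retries(get_video_url, lesson_url) instead of get_video_url(lesson_url); the surrounding try-except still catches the final failure.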

Troubleshooting Common Hiccups:

If you’re getting empty lesson URL lists: Verify your selector in get_lesson_urls matches the actual class/id of the lesson links on the module page.
If video links aren’t showing up: Some sites load videos dynamically with JavaScript. If that’s the case, you might need a tool like Selenium or Playwright to render the page fully before scraping.
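
Before reaching for a browser-automation tool, it may be worth checking whether the raw HTML already holds a usable URL somewhere other than the video tag’s own src attribute, for instance in a nested source tag or an embedded iframe. The sketch below uses only the standard library; VideoCandidateParser and find_video_candidates are hypothetical names, and which tags matter on your target site is an assumption to verify in dev tools.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class VideoCandidateParser(HTMLParser):
    """Collect src attributes from <video>, <source>, and <iframe> tags."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        # Note: <source> also appears inside <audio> and <picture>,
        # so treat the results as candidates to inspect, not final answers
        if tag in {"video", "source", "iframe"}:
            src = dict(attrs).get("src")
            if src:
                self.candidates.append((tag, src))


def find_video_candidates(html, page_url):
    """Return (tag, absolute_url) pairs found in the static HTML of a lesson page."""
    parser = VideoCandidateParser()
    parser.feed(html)
    return [(tag, urljoin(page_url, src)) for tag, src in parser.candidates]
```

Feed it response.text from a lesson page. An empty result suggests, though it does not prove, that the player is injected by JavaScript and a rendering tool is needed.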

This pipeline should take you from module URLs all the way to collecting every video link. Just plug in your existing code for each function and tweak the selectors to fit your target site!

The question in this post comes from Stack Exchange; it was asked by user8298092.