
Python multiprocess crawling triggers an [Errno 32] Broken pipe error: how do I fix it?

Fixing Broken Pipe Errors in a Multiprocess Crawler and Keeping It Running

Hey there! Let's break down your problem and work through the solutions one by one. You're using multiprocessing.Pool to crawl a large number of URLs, but long runtime leads to browser connection timeouts and [Errno 32] Broken pipe errors. Here's how to fix this:

Answers to the Core Questions

  1. How do I keep the script running?
    [Errno 32] Broken pipe is raised when your process writes to a socket or pipe whose other end has already closed the connection. In a crawler that usually follows a timeout or a server-side rate limit. The fix is to catch this error in your worker function, log it, and let the script keep processing other URLs instead of crashing.

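  2. Can I suppress the error so the script keeps going?
    Absolutely! Wrap your risky download logic in a try-except block that catches OSError. In Python 3, Errno 32 surfaces as BrokenPipeError, a subclass of ConnectionError and therefore of OSError, so you can handle it gracefully without stopping the entire workflow.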

  3. Will the script still stop after the error is caught?
    No. If you catch the error inside the download_slick_slide_html function (the worker running in each subprocess), the exception never leaves the worker, and the Pool just hands that process the next task in the queue. The script stops only when an exception goes uncaught: starmap re-raises an uncaught worker exception in the main process, and that ends the script unless the main process handles it.

  4. Do I have to give up multiprocessing in long-running scripts?
    Definitely not! Multiprocessing remains a reasonable choice for I/O-bound tasks like web crawling. The issue here isn't multiprocessing itself, it's unhandled connection errors. Fixing the error handling will let you keep using multiprocessing effectively.
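
If you want to see the error class for yourself, here is a minimal sketch that provokes a broken pipe without any network at all. It assumes a POSIX system (Linux or macOS), and trigger_broken_pipe is just a hypothetical helper name for this demo, not something from your script.

```python
import errno
import os


def trigger_broken_pipe():
    """Write to a pipe whose read end is closed and report what is raised."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the "other side" goes away before we write
    try:
        os.write(write_fd, b"hello")
    except OSError as exc:  # BrokenPipeError is a subclass of OSError
        # Return the concrete exception class name and whether errno is EPIPE
        return type(exc).__name__, exc.errno == errno.EPIPE
    finally:
        os.close(write_fd)
    return None  # only reached if the write unexpectedly succeeded


if __name__ == "__main__":
    print(trigger_broken_pipe())
```

Because the handler is written as except OSError, the same clause would also catch the timeout and connection-reset variants, which is why the worker function checks errno before deciding whether to retry.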

Practical Solutions

1. Add Error Handling and Retries in the Worker Function

Modify your download_slick_slide_html function to catch connection errors, log them, and optionally retry failed requests. Here's an updated version:

import time
from itertools import repeat  # repeat lives in itertools, not multiprocessing
from multiprocessing import Pool

def download_slick_slide_html(f_snd_link_list, f_mode, f_path_to_ff, f_path_to_binaries, f_date_time, f_scraped_supplier, f_log_file):
    max_retries = 3
    retry_delay = 2  # Seconds between retries
    
    for attempt in range(max_retries):
        try:
            # Your existing downloading logic goes here
            # Example:
            # response = requests.get(f_snd_link_list, timeout=15)
            # response.raise_for_status()
            # ... rest of your processing code ...
            print(f"Successfully processed {f_snd_link_list}")
            return  # Exit function on successful crawl
        except OSError as e:
            if e.errno == 32:  # Target broken pipe error
                with open(f_log_file, "a") as log:
                    log.write(f"Broken pipe on {f_snd_link_list} (attempt {attempt+1}/{max_retries})\n")
                time.sleep(retry_delay)
            else:
                # Handle other OS-related errors
                with open(f_log_file, "a") as log:
                    log.write(f"OS Error {e.errno} on {f_snd_link_list}: {str(e)}\n")
                break  # Don't retry non-pipe errors
        except Exception as e:
            # Catch all other unexpected errors, log, and move on
            with open(f_log_file, "a") as log:
                log.write(f"Unexpected error on {f_snd_link_list}: {str(e)}\n")
            break
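    else:
        # for-else: reached only when every attempt ended in a broken pipe
        # (a successful return or a break above skips this branch)
        with open(f_log_file, "a") as log:
            log.write(f"Failed to process {f_snd_link_list} after {max_retries} attempts\n")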

# Your Pool setup (note: leaving the `with Pool()` block calls terminate(),
# which is safe here because starmap blocks until every task has finished)
if __name__ == "__main__":
    # Assume sndLinkList, mode, pathToFF, etc. are defined here
    with Pool(5) as p:
        p.starmap(
            download_slick_slide_html, 
            zip(
                sndLinkList, 
                repeat(mode), 
                repeat(pathToFF),
                repeat(pathToBinaries), 
                repeat(dateTime), 
                repeat(scrapedSupplier), 
                repeat(logfile)
            )
        )
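
Since the worker above has placeholder download logic, here is a small self-contained sketch of the same retry pattern that you can run and test on its own. call_with_retries and its (ok, attempts) return value are hypothetical names chosen for this illustration; the idea is only to show "retry on broken pipe, give up immediately on any other OS error".

```python
import errno
import time


def call_with_retries(task, max_retries=3, retry_delay=0.0):
    """Run task(); retry only on broken pipe. Returns (ok, attempts_used)."""
    for attempt in range(1, max_retries + 1):
        try:
            task()
            return True, attempt
        except OSError as exc:
            if exc.errno != errno.EPIPE:
                # Not a broken pipe: retrying is unlikely to help
                return False, attempt
            if attempt < max_retries:
                time.sleep(retry_delay)  # wait before the next attempt
    return False, max_retries  # every attempt hit a broken pipe


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_task():
        # Hypothetical task: fails twice with a broken pipe, then succeeds
        state["calls"] += 1
        if state["calls"] < 3:
            raise BrokenPipeError(errno.EPIPE, "Broken pipe")

    print(call_with_retries(flaky_task))
```

Returning a status instead of only logging lets the main process count failures once the pool has finished.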

2. Optimize Crawling Strategy

  • Add explicit timeouts: Ensure your download calls have timeouts (e.g., requests.get(url, timeout=15)) to avoid hanging connections that trigger pipe errors.
  • Rate limiting: Add small delays between requests (even in multiprocessing) to avoid overwhelming the target server, which often leads to connection closures.
  • Rotate User-Agents: Use different User-Agent strings for each request to avoid being flagged as a bot and blocked.
  • Use proxies: If you hit IP rate limits, distribute requests across proxies to avoid getting blocked.
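
The first three points can be sketched with nothing but the standard library. This uses urllib.request rather than requests so it runs without extra installs; the User-Agent strings, build_request, and polite_fetch are all made-up names for illustration, and proxies are left out.

```python
import time
import urllib.request

# Hypothetical User-Agent strings; substitute real ones for your crawl
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleCrawler/1.0",
]


def build_request(url, index):
    """Create a request whose User-Agent rotates with the task index."""
    user_agent = USER_AGENTS[index % len(USER_AGENTS)]
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def polite_fetch(url, index, delay=1.0, timeout=15):
    """Pause briefly, then download with an explicit timeout."""
    time.sleep(delay)  # crude rate limit, applied per worker
    request = build_request(url, index)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keep in mind that the delay is per worker process, so the effective request rate is roughly the pool size divided by the delay.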

3. Adjust Multiprocessing Parameters

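  • Reduce pool size: Try Pool(3) instead of Pool(5), since some servers block too many simultaneous connections from the same IP.
  • Use imap_unordered: If you don't need results in the order of your URL list, imap_unordered yields each result as soon as it is ready instead of building the whole result list first, which keeps memory use down for very large datasets. It passes a single argument to the worker, so a multi-argument function like download_slick_slide_html needs a small wrapper that unpacks the argument tuple.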
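
Here is a rough sketch of switching from starmap to imap_unordered with a smaller pool. process_url is a hypothetical stand-in for download_slick_slide_html, cut down to two arguments, and unpack_and_process is the wrapper that unpacks each argument tuple.

```python
from itertools import repeat
from multiprocessing import Pool


def process_url(url, log_file):
    """Hypothetical stand-in for the real worker; returns a small result."""
    return url, len(url)


def unpack_and_process(args):
    """imap_unordered passes one item per call, so unpack the tuple here."""
    return process_url(*args)


def crawl(urls, log_file, workers=3):
    """Yield-as-ready crawl: results arrive in completion order."""
    tasks = zip(urls, repeat(log_file))
    results = []
    with Pool(workers) as pool:
        # The loop consumes every result before the with-block exits
        for result in pool.imap_unordered(unpack_and_process, tasks, chunksize=10):
            results.append(result)
    return results
```

Call crawl(...) from under an if __name__ == "__main__": guard, as in the Pool setup above, and remember that the returned order is not the input order.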

4. Persist Crawl State

For massive URL lists, save the state of crawled URLs to a file or database (like SQLite). If the script stops unexpectedly, you can resume from where you left off instead of starting over.
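
A minimal sketch of that idea with SQLite might look like the following. The table name crawled and the helpers open_state, mark_done, and pending are assumptions made up for this example.

```python
import sqlite3


def open_state(db_path):
    """Open (or create) the state database with one row per finished URL."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS crawled (url TEXT PRIMARY KEY)")
    conn.commit()
    return conn


def mark_done(conn, url):
    """Record a finished URL; repeated calls for the same URL are harmless."""
    conn.execute("INSERT OR IGNORE INTO crawled (url) VALUES (?)", (url,))
    conn.commit()


def pending(conn, urls):
    """Return the URLs that have not been recorded yet, in input order."""
    done = {row[0] for row in conn.execute("SELECT url FROM crawled")}
    return [url for url in urls if url not in done]
```

One way to wire it in: filter your URL list through pending() before building the pool's task list, and call mark_done() in the main process as results come back, so that only one process writes to the database.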

5. Consider Asynchronous I/O (Optional)

If your crawl is heavily I/O-bound (spending most of its time waiting for server responses), you might get better efficiency with asyncio and aiohttp instead of multiprocessing. But this is a choice, not a requirement: multiprocessing works fine once you handle the errors.
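
To give a feel for the async shape, here is a sketch using only asyncio. fake_fetch is a placeholder that simulates a download so the example runs without a network; with aiohttp you would pass in a coroutine that performs the real request. The names fetch_one and crawl are hypothetical.

```python
import asyncio


async def fake_fetch(url):
    """Placeholder download: sleeps briefly instead of hitting the network."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_one(url, fetch, semaphore):
    """Fetch one URL under the concurrency limit, capturing OS-level errors."""
    async with semaphore:
        try:
            return url, await fetch(url)
        except OSError as exc:  # includes BrokenPipeError
            return url, exc


async def crawl(urls, fetch, limit=5):
    """Run all fetches concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)
    jobs = (fetch_one(url, fetch, semaphore) for url in urls)
    return await asyncio.gather(*jobs)  # results come back in input order


if __name__ == "__main__":
    print(asyncio.run(crawl(["page-1", "page-2"], fake_fetch, limit=2)))
```

The semaphore plays the role the pool size plays in the multiprocessing version: it caps how many requests are in flight at once.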

The question comes from Stack Exchange; asked by acincognito.
