Python multiprocessing crawl triggers [Errno 32] Broken pipe: how do I fix it?
Hey there! Let's break down your problem and work through the solutions one by one. You're using multiprocessing.Pool to crawl a large number of URLs, but long runtime leads to browser connection timeouts and [Errno 32] Broken pipe errors. Here's how to fix this:
Answers to the Core Questions
How can I keep the script running?
The Broken pipe error pops up when the connection to the target server gets abruptly closed, usually due to timeouts or server-side rate limits. The fix is to catch this error in your worker function, log it, and let the script keep processing other URLs instead of crashing.
Can the error be suppressed so the script keeps going?
Absolutely! Wrap your risky download logic in a try-except block to catch OSError (Errno 32 falls under this category) and handle it gracefully without stopping the entire workflow.
Does the script still stop after the error is caught?
No. If you catch the error inside the download_slick_slide_html function (the worker running in each subprocess), the subprocess won't terminate unexpectedly. The main process's Pool will just move on to the next task in the queue. The whole script only stops if an uncaught error hits the main process.
Do I have to give up multiprocessing for long-running scripts?
Definitely not! Multiprocessing is still a great choice for I/O-bound tasks like web crawling. The issue here isn't multiprocessing itself—it's unhandled connection errors. Fixing the error handling will let you keep using multiprocessing effectively.
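To make these answers concrete, here is a minimal sketch of a worker that catches a broken pipe itself, so the pool simply carries on with the remaining URLs. The fetch function, and its rule of failing on any URL containing "bad", are hypothetical stand-ins that simulate the error; they are not part of your script.

```python
import errno
from multiprocessing import Pool


def fetch(url):
    """Hypothetical stand-in for the real download; fails for URLs containing 'bad'."""
    if "bad" in url:
        # OSError built with errno 32 (EPIPE) comes out as a BrokenPipeError
        raise OSError(errno.EPIPE, "Broken pipe")
    return f"content of {url}"


def safe_worker(url):
    """Return (url, ok, detail) instead of letting the exception escape."""
    try:
        return (url, True, fetch(url))
    except OSError as e:
        return (url, False, f"errno {e.errno}")


if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/bad", "https://example.com/b"]
    with Pool(2) as pool:
        for url, ok, detail in pool.map(safe_worker, urls):
            print(url, "OK" if ok else "FAILED", detail)
```

Because the worker returns a status tuple rather than raising, the failed URL shows up as one FAILED row while the other URLs are still processed.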
Workable Solutions
1. Add Error Handling and Retries in the Worker Function
Modify your download_slick_slide_html function to catch connection errors, log them, and optionally retry failed requests. Here's an updated version:
    import time
    from itertools import repeat  # repeat lives in itertools, not multiprocessing
    from multiprocessing import Pool


    def download_slick_slide_html(f_snd_link_list, f_mode, f_path_to_ff, f_path_to_binaries,
                                  f_date_time, f_scraped_supplier, f_log_file):
        max_retries = 3
        retry_delay = 2  # Seconds between retries
        for attempt in range(max_retries):
            try:
                # Your existing downloading logic goes here
                # Example:
                # response = requests.get(f_snd_link_list, timeout=15)
                # response.raise_for_status()
                # ... rest of your processing code ...
                print(f"Successfully processed {f_snd_link_list}")
                return  # Exit function on successful crawl
            except OSError as e:
                if e.errno == 32:  # Target broken pipe error (errno.EPIPE)
                    with open(f_log_file, "a") as log:
                        log.write(f"Broken pipe on {f_snd_link_list} (attempt {attempt+1}/{max_retries})\n")
                    time.sleep(retry_delay)
                else:
                    # Handle other OS-related errors
                    with open(f_log_file, "a") as log:
                        log.write(f"OS Error {e.errno} on {f_snd_link_list}: {str(e)}\n")
                    break  # Don't retry non-pipe errors
            except Exception as e:
                # Catch all other unexpected errors, log, and move on
                with open(f_log_file, "a") as log:
                    log.write(f"Unexpected error on {f_snd_link_list}: {str(e)}\n")
                break
        else:
            # Reached only when every attempt ended in a broken pipe (no break, no return)
            with open(f_log_file, "a") as log:
                log.write(f"Failed to process {f_snd_link_list} after {max_retries} attempts\n")


    # Your Pool setup (note: starmap blocks until every task has finished;
    # leaving the `with` block then terminates the pool)
    if __name__ == "__main__":
        # Assume sndLinkList, mode, pathToFF, etc. are defined here
        with Pool(5) as p:
            p.starmap(
                download_slick_slide_html,
                zip(
                    sndLinkList,
                    repeat(mode),
                    repeat(pathToFF),
                    repeat(pathToBinaries),
                    repeat(dateTime),
                    repeat(scrapedSupplier),
                    repeat(logfile)
                )
            )
2. Optimize Crawling Strategy
- Add explicit timeouts: Ensure your download calls have timeouts (e.g., requests.get(url, timeout=15)) to avoid hanging connections that trigger pipe errors.
- Rate limiting: Add small delays between requests (even in multiprocessing) to avoid overwhelming the target server, which often leads to connection closures.
- Rotate User-Agents: Use different User-Agent strings for each request to avoid being flagged as a bot and blocked.
- Use proxies: If you hit IP rate limits, distribute requests across proxies to avoid getting blocked.
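As a rough sketch of the first three points (timeouts, delays, and User-Agent rotation), something like the following could work. It uses the standard library's urllib rather than requests so that it runs without extra dependencies. The User-Agent strings, the delay range, and the helper names build_request and polite_fetch are all assumptions for illustration. Proxies are not covered here.

```python
import itertools
import random
import time
import urllib.request

# Example User-Agent strings (illustrative values; pick ones suited to your crawl)
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleCrawler/1.0",
]
_ua_cycle = itertools.cycle(USER_AGENTS)


def build_request(url):
    """Create a request that carries the next User-Agent in the rotation."""
    return urllib.request.Request(url, headers={"User-Agent": next(_ua_cycle)})


def polite_fetch(url, timeout=15, min_delay=0.5, max_delay=1.5):
    """Sleep for a short random interval, then fetch with an explicit timeout."""
    time.sleep(random.uniform(min_delay, max_delay))
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Note that each worker process keeps its own copy of the rotation, so the delay applies per process rather than across the whole pool.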
3. Adjust Multiprocessing Parameters
- Reduce pool size: Try Pool(3) instead of Pool(5); some servers block too many simultaneous connections from the same IP.
- Use imap_unordered: If you don't need results in the order of your URL list, imap_unordered hands back results as soon as they're ready, which helps with memory management for very large datasets.
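A minimal sketch of both adjustments together, assuming a simplified single-argument worker. The crawl and crawl_all names, the example URLs, and chunksize=10 are illustrative choices, not values from your script.

```python
from multiprocessing import Pool


def crawl(url):
    """Hypothetical worker: returns (url, status) rather than raising."""
    try:
        # ... download logic would go here ...
        return (url, "ok")
    except OSError as e:
        return (url, f"failed: errno {e.errno}")


def crawl_all(urls, processes=3):
    """Yield results as workers finish, in completion order, not input order."""
    with Pool(processes) as pool:
        for result in pool.imap_unordered(crawl, urls, chunksize=10):
            yield result


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(50)]
    for url, status in crawl_all(urls):
        print(url, status)
```

Since results arrive one at a time, the main process can log or store each one immediately instead of holding the full result list in memory.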
4. Persist Crawl State
For massive URL lists, save the state of crawled URLs to a file or database (like SQLite). If the script stops unexpectedly, you can resume from where you left off instead of starting over.
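One way this could look with SQLite from the standard library is sketched below. The table layout, the default file name crawl_state.db, and the helper names are assumptions for illustration.

```python
import sqlite3


def open_state(path="crawl_state.db"):
    """Open (or create) the state database; the file name is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS done (url TEXT PRIMARY KEY)")
    conn.commit()
    return conn


def mark_done(conn, url):
    """Record a finished URL; repeated calls for the same URL are ignored."""
    conn.execute("INSERT OR IGNORE INTO done (url) VALUES (?)", (url,))
    conn.commit()


def pending(conn, urls):
    """Return the URLs from `urls` that are not yet recorded as done."""
    finished = {row[0] for row in conn.execute("SELECT url FROM done")}
    return [u for u in urls if u not in finished]
```

On startup you would pass only pending(conn, all_urls) to the pool. The simplest arrangement is for the main process to call mark_done as results come back, so that only one process writes to the database.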
5. Consider Asynchronous I/O (Optional)
If your crawl is heavily I/O-bound (spending most time waiting for server responses), you might get better efficiency with aiohttp instead of multiprocessing. But this is a choice, not a requirement—multiprocessing works perfectly once you handle the errors.
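For a feel of the asynchronous pattern, here is a sketch using only asyncio. The fake_fetch coroutine is a simulated request that stands in for a real aiohttp call, and the concurrency limit of 5 is an arbitrary assumption.

```python
import asyncio


async def fake_fetch(url):
    """Simulated request; a real crawler would await an aiohttp call here."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ConnectionError(f"simulated failure for {url}")
    return f"content of {url}"


async def fetch_one(url, limit):
    """Fetch one URL under the concurrency limit, returning (url, ok, detail)."""
    async with limit:
        try:
            return (url, True, await fake_fetch(url))
        except OSError as e:  # ConnectionError is a subclass of OSError
            return (url, False, str(e))


async def crawl(urls, max_concurrent=5):
    """Run all fetches concurrently; results come back in input order."""
    limit = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(fetch_one(u, limit) for u in urls))


if __name__ == "__main__":
    for row in asyncio.run(crawl(["https://example.com/a", "https://example.com/bad"])):
        print(row)
```

The semaphore plays the role that the pool size plays in the multiprocessing version: it caps how many requests are in flight at once.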
Question source: Stack Exchange; asked by acincognito.




