Downloading Files Efficiently and Asynchronously in Python with requests: Request Handling and Performance Optimization
Great question! The core problem is that your original code assumes requests finish in the order you started them. Network latency and server response times vary from request to request, so you can never count on the first URL finishing first. Let's walk through how to fix this properly, plus some tips to boost your download speeds.
Fixing the Completion Order Issue
The key solution here is to use tools that let you handle requests as they complete, regardless of the order you started them in. Here are two reliable approaches:
1. Use Async HTTP with aiohttp + asyncio.as_completed()
Since requests is a synchronous library, true asynchronous downloading requires switching to an async HTTP client like aiohttp. Pair it with asyncio.as_completed() to process tasks the moment they finish.
First, install aiohttp:
pip install aiohttp
Here's a corrected implementation:
```python
import asyncio
import aiohttp


async def download_file(session, semaphore, url, save_path):
    # Hold a semaphore slot for the whole transfer so only a few downloads run at once
    async with semaphore:
        async with session.get(url) as response:
            response.raise_for_status()  # Fail fast on HTTP errors
            with open(save_path, 'wb') as f:
                # Stream the file in chunks to avoid loading everything into memory
                async for chunk in response.content.iter_chunked(8192):
                    f.write(chunk)
    print(f"✅ Done: {url} saved to {save_path}")


async def main(url_list):
    # Limit concurrent requests to avoid overwhelming the server
    semaphore = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        # Create all download coroutines (the semaphore is passed in so it is actually used)
        tasks = [
            download_file(session, semaphore, url, f"download_{i}.bin")
            for i, url in enumerate(url_list)
        ]
        # Process tasks AS THEY COMPLETE (no order assumptions!)
        for completed_task in asyncio.as_completed(tasks):
            await completed_task


if __name__ == "__main__":
    urls = [
        "https://example.com/large_file1.bin",
        "https://example.com/large_file2.bin",
        "https://example.com/large_file3.bin"
    ]
    asyncio.run(main(urls))
```
asyncio.as_completed() iterates over tasks in the order they finish, so you never have to guess which request will be done first.
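To see the completion-order behavior without touching the network, here is a minimal sketch that swaps the real download for asyncio.sleep(). The names fake_download and collect_in_completion_order, and the delay values, are made up for illustration and are not part of the code above.

```python
import asyncio


async def fake_download(name: str, delay: float) -> str:
    """Stand-in for a real download: sleeps for `delay` seconds, then returns its name."""
    await asyncio.sleep(delay)
    return name


async def collect_in_completion_order(jobs: dict[str, float]) -> list[str]:
    """Start every job at once and record the names in the order they finish."""
    coros = [fake_download(name, delay) for name, delay in jobs.items()]
    finished = []
    for next_done in asyncio.as_completed(coros):
        finished.append(await next_done)
    return finished


if __name__ == "__main__":
    # Submitted as a, b, c; the simulated delays make them finish in a different order.
    jobs = {"a": 0.3, "b": 0.1, "c": 0.2}
    print(asyncio.run(collect_in_completion_order(jobs)))
```

The job with the shortest delay is reported first even though it was submitted second, which is exactly why the result handling should not assume submission order.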
2. Stick with requests Using concurrent.futures.ThreadPoolExecutor
If you don't want to switch libraries, you can use a thread pool to run synchronous requests calls concurrently. The concurrent.futures.as_completed() function works here too:
```python
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_file(url, save_path):
    # Stream the file to save memory
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(save_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return f"✅ Done: {url} saved to {save_path}"


def main(url_list):
    # Adjust max_workers based on your network and server limits
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Map futures to their URLs for easy error handling
        future_to_url = {
            executor.submit(download_file, url, f"download_{i}.bin"): url
            for i, url in enumerate(url_list)
        }
        # Process completed tasks in any order
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                result = future.result()
                print(result)
            except Exception as exc:
                print(f"❌ {url} failed: {str(exc)}")


if __name__ == "__main__":
    urls = [
        "https://example.com/large_file1.bin",
        "https://example.com/large_file2.bin",
        "https://example.com/large_file3.bin"
    ]
    main(urls)
```
This approach lets you keep using requests while avoiding the order assumption.
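One useful property of the future-to-URL mapping is that a single failed download does not stop the others. The sketch below illustrates that with a simulated blocking call instead of requests; fake_download, run_all, and the "broken" job name are hypothetical and exist only for this demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(name: str, delay: float) -> str:
    """Stand-in for a blocking requests call; fails for one hypothetical bad input."""
    time.sleep(delay)
    if name == "broken":
        raise RuntimeError("simulated HTTP error")
    return f"{name} done"


def run_all(jobs: dict[str, float], max_workers: int = 4):
    """Run every job in a thread pool and return (successes, failures) keyed by job name."""
    successes, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_name = {
            executor.submit(fake_download, name, delay): name
            for name, delay in jobs.items()
        }
        # Futures are handled as they finish; each exception stays attached to its own job.
        for future in as_completed(future_to_name):
            name = future_to_name[future]
            try:
                successes[name] = future.result()
            except Exception as exc:
                failures[name] = str(exc)
    return successes, failures


if __name__ == "__main__":
    ok, bad = run_all({"a": 0.05, "broken": 0.01, "b": 0.02})
    print("succeeded:", ok)
    print("failed:", bad)
```

Collecting failures into their own dictionary makes it straightforward to retry only the downloads that failed.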
Performance Optimization Tips
Beyond fixing the order issue, here are some ways to speed up your downloads:
- Stream large files: Always use stream=True (requests) or iter_chunked() (aiohttp) to avoid loading entire files into memory. This is critical for large downloads and reduces memory overhead.
- Limit concurrent requests: Use a semaphore (async) or adjust max_workers (threads) to avoid getting rate-limited or blocked by the server. Start with 4-8 concurrent requests and tweak based on results.
- Reuse connections: aiohttp does not support HTTP/2, and TCPConnector has no http2 parameter. What does reduce latency is sharing one ClientSession across all downloads so keep-alive connections are reused. You can cap the pool size on the connector:

  ```python
  connector = aiohttp.TCPConnector(limit=8, limit_per_host=4)
  async with aiohttp.ClientSession(connector=connector) as session:
      # ...
  ```

  If you specifically need HTTP/2, a client such as httpx supports it via httpx.AsyncClient(http2=True) when installed with its http2 extra.
- Add retries: Handle transient network errors with retry logic. For requests, use HTTPAdapter with urllib3's Retry, then call session.get() instead of requests.get() so the adapter is actually used:

  ```python
  from requests.adapters import HTTPAdapter
  from urllib3.util.retry import Retry

  session = requests.Session()
  retry_strategy = Retry(
      total=3,
      backoff_factor=0.5,
      status_forcelist=[429, 500, 502, 503, 504]
  )
  adapter = HTTPAdapter(max_retries=retry_strategy)
  session.mount("https://", adapter)
  session.mount("http://", adapter)
  ```

- Optimize chunk size: 8 KB (8192 bytes) is a safe default, but you can test larger sizes (like 16 KB or 32 KB) to see if they improve speed for your specific files.
- Use DNS caching: aiohttp caches DNS lookups by default (use_dns_cache=True with a 10-second TTL). For long-running batches against the same hosts, you can raise the TTL to avoid repeated lookups:

  ```python
  connector = aiohttp.TCPConnector(use_dns_cache=True, ttl_dns_cache=300)
  ```
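If you want to confirm that a semaphore really caps the number of simultaneous downloads, a small simulation can help. This is only a sketch: run_limited is a hypothetical helper, and asyncio.sleep() stands in for the network transfer.

```python
import asyncio


async def run_limited(n_jobs: int, limit: int) -> int:
    """Run n_jobs simulated downloads with at most `limit` in flight; return the observed peak."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def job() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            # Everything inside this block counts as an active "download"
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the network transfer
            in_flight -= 1

    await asyncio.gather(*(job() for _ in range(n_jobs)))
    return peak


if __name__ == "__main__":
    peak = asyncio.run(run_limited(n_jobs=20, limit=4))
    print(f"peak concurrent jobs: {peak}")
```

The same pattern works for tuning: rerun your real downloader with different limits and compare total wall-clock time before settling on a value.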
This question was originally asked on Stack Exchange by user2650277.




