Efficient asynchronous file downloads in Python with requests: request handling and performance optimization

Great question! The core problem is that your original code assumes requests complete in the order they were started. Network timing is unpredictable, so you can never count on the first URL finishing first. Let's walk through how to fix this properly, plus some tips to boost your download speeds.

Fixing the Completion Order Issue

The key solution here is to use tools that let you handle requests as they complete, regardless of the order you started them in. Here are two reliable approaches:

1. Use Async HTTP with aiohttp + asyncio.as_completed()

Since requests is a synchronous library, true asynchronous downloading requires switching to an async HTTP client like aiohttp. Pair it with asyncio.as_completed() to process tasks the moment they finish.

First, install aiohttp:

pip install aiohttp

Here's a corrected implementation:

import asyncio
import aiohttp

async def download_file(session, semaphore, url, save_path):
    # Hold a semaphore slot for the whole download so that at most
    # N downloads are in flight at the same time
    async with semaphore:
        async with session.get(url) as response:
            response.raise_for_status()  # Fail fast on HTTP errors
            with open(save_path, 'wb') as f:
                # Stream the file in chunks to avoid loading everything into memory
                async for chunk in response.content.iter_chunked(8192):
                    f.write(chunk)
    print(f"✅ Done: {url} saved to {save_path}")

async def main(url_list):
    # Limit concurrent requests to avoid overwhelming the server
    semaphore = asyncio.Semaphore(4)

    async with aiohttp.ClientSession() as session:
        # Create all download tasks
        tasks = [
            download_file(session, semaphore, url, f"download_{i}.bin")
            for i, url in enumerate(url_list)
        ]

        # Process tasks AS THEY COMPLETE (no order assumptions!)
        for completed_task in asyncio.as_completed(tasks):
            try:
                await completed_task
            except (aiohttp.ClientError, asyncio.TimeoutError, OSError) as exc:
                # One failed download should not abort the others
                print(f"❌ Download failed: {exc}")

if __name__ == "__main__":
    urls = [
        "https://example.com/large_file1.bin",
        "https://example.com/large_file2.bin",
        "https://example.com/large_file3.bin"
    ]
    asyncio.run(main(urls))

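asyncio.as_completed() yields awaitables in the order the underlying tasks finish, so you never have to guess which request will be done first. When iterated with a plain for loop, the yielded objects are not the coroutines you passed in, so have each coroutine print or return its own URL if you need to know which download just completed.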
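
To see the completion-order behavior without touching the network, here is a minimal sketch that replaces the real download with asyncio.sleep(). The names fake_download and demo, and the delays, are made up for illustration; they are not part of the original code.

```python
import asyncio

async def fake_download(name, delay):
    """Hypothetical stand-in for a download: wait `delay` seconds, return the name."""
    await asyncio.sleep(delay)
    return name

async def demo():
    # Started in the order first, second, third, but with different durations
    jobs = [("first", 0.3), ("second", 0.1), ("third", 0.2)]
    tasks = [fake_download(name, delay) for name, delay in jobs]

    finished = []
    for next_done in asyncio.as_completed(tasks):
        # Each awaitable resolves to the result of whichever task finishes next
        finished.append(await next_done)
    return finished

if __name__ == "__main__":
    print(asyncio.run(demo()))  # shortest delay finishes first
```

Because each coroutine returns its own name, the caller can tell which job finished even though the start order is not preserved.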

2. Stick with requests Using concurrent.futures.ThreadPoolExecutor

If you don't want to switch libraries, you can use a thread pool to run synchronous requests calls concurrently. The concurrent.futures.as_completed() function works here too:

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_file(url, save_path):
    # Stream the file to save memory
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(save_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return f"✅ Done: {url} saved to {save_path}"

def main(url_list):
    # Adjust max_workers based on your network and server limits
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Map futures to their URLs for easy error handling
        future_to_url = {
            executor.submit(download_file, url, f"download_{i}.bin"): url
            for i, url in enumerate(url_list)
        }
        
        # Process completed tasks in any order
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                result = future.result()
                print(result)
            except Exception as exc:
                print(f"❌ {url} failed: {str(exc)}")

if __name__ == "__main__":
    urls = [
        "https://example.com/large_file1.bin",
        "https://example.com/large_file2.bin",
        "https://example.com/large_file3.bin"
    ]
    main(urls)

This approach lets you keep using requests while avoiding the order assumption.
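
The same pattern can be exercised offline. The sketch below swaps the real download for time.sleep() and makes one job fail on purpose, to show that the future-to-name mapping lets you attribute both results and errors to the right job. fake_download, run_jobs, and the job tuples are hypothetical names used only for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(name, delay, fail=False):
    """Hypothetical stand-in for a blocking download: sleep, then succeed or raise."""
    time.sleep(delay)
    if fail:
        raise RuntimeError(f"simulated failure for {name}")
    return name

def run_jobs(jobs):
    """Run all jobs in a thread pool; return (succeeded, failed) in completion order."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=3) as executor:
        future_to_name = {
            executor.submit(fake_download, name, delay, fail): name
            for name, delay, fail in jobs
        }
        for future in as_completed(future_to_name):
            name = future_to_name[future]
            try:
                succeeded.append(future.result())
            except RuntimeError:
                # The mapping tells us which job the failed future belongs to
                failed.append(name)
    return succeeded, failed

if __name__ == "__main__":
    jobs = [("a", 0.3, False), ("b", 0.1, True), ("c", 0.2, False)]
    print(run_jobs(jobs))
```

A failure in one job does not stop the others: the exception only surfaces when future.result() is called for that specific future.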

Performance Optimization Tips

Beyond fixing the order issue, here are some ways to speed up your downloads:

  • Stream large files: Always use stream=True (requests) or iter_chunked() (aiohttp) to avoid loading entire files into memory. This is critical for large downloads and reduces memory overhead.
  • Limit concurrent requests: Use semaphores (async) or adjust max_workers (threads) to avoid getting rate-limited or blocked by the server. Start with 4-8 concurrent requests and tweak based on results.
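  • Reuse connections: aiohttp does not support HTTP/2, and TCPConnector has no http2 option. Connection reuse comes from sharing a single ClientSession across all requests, as in the aiohttp example above. You can tune the size of the connection pool on the connector (if you specifically need HTTP/2, a client such as httpx offers it):
    connector = aiohttp.TCPConnector(limit=20, limit_per_host=4)
    async with aiohttp.ClientSession(connector=connector) as session:
        # ...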
    
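  • Add retries: Handle transient network errors with retry logic. For requests, mount an HTTPAdapter configured with urllib3's Retry on a Session, then download through session.get() rather than requests.get() so the retry policy actually applies: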
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry
    
    session = requests.Session()
    retry_strategy = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    
  • Optimize chunk size: 8KB (8192 bytes) is a safe default, but you can test larger sizes (like 16KB or 32KB) to see if they improve speed for your specific files.
  • Use DNS caching: aiohttp already caches DNS lookups by default, with a 10-second TTL. For long download runs against the same hosts, raise the TTL on the connector to avoid repeated lookups:
    connector = aiohttp.TCPConnector(ttl_dns_cache=300)
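
The retry tip above covers requests only. For the async route, one possible approach is a small wrapper that re-runs a coroutine with exponential backoff. This is a sketch under assumptions, not an aiohttp feature: retry_async, its parameters, and the flaky test coroutine are all hypothetical names introduced here.

```python
import asyncio

async def retry_async(make_attempt, retries=3, base_delay=0.5,
                      retry_on=(OSError, asyncio.TimeoutError)):
    """Await make_attempt() up to retries + 1 times, doubling the wait after each failure.

    make_attempt is a zero-argument callable returning a fresh coroutine,
    because a coroutine object cannot be awaited twice.
    """
    for attempt in range(retries + 1):
        try:
            return await make_attempt()
        except retry_on:
            if attempt == retries:
                raise  # Out of retries: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)

async def demo():
    calls = 0

    async def flaky():
        # Hypothetical operation that fails twice, then succeeds
        nonlocal calls
        calls += 1
        if calls < 3:
            raise OSError("simulated transient error")
        return "ok"

    result = await retry_async(flaky, retries=3, base_delay=0.01)
    return result, calls

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With aiohttp you would wrap the download call in a lambda and pass retry_on=(aiohttp.ClientError, asyncio.TimeoutError). Since the file is opened in 'wb' mode, each retry restarts the download from the beginning rather than resuming it.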
    

The question in this post comes from Stack Exchange; it was asked by user2650277.
