如何优化Pandas DataFrame运行速度?附慢执行代码求优化方案
Optimizing Slow Pandas Code for Repeated Binance Kline Data Processing
I've got a Pandas script that fetches Binance 1-minute BTCUSDT klines, processes them to calculate RSI and Bollinger Bands, but it takes ~4 seconds per run. Since this code gets called repeatedly, the current performance just doesn't cut it. Can someone help me optimize this? Here's the original code:
self.dataframe = pd.DataFrame(columns=list(['O' ,'H' ,'L' ,'C' ,'RSI', 'Upper Band', 'Lower Band'])) BinanceHistoricalUrl = "https://api.binance.com/api/v1/klines?" BinanceHistoricalPayload = {'symbol' : 'BTCUSDT','interval': '1m','limit': 100} HistoricalRequestData = requests.get(url=BinanceHistoricalUrl, params=BinanceHistoricalPayload).json() Lenght = len(HistoricalRequestData) for i in range(Lenght): O = HistoricalRequestData[i][1] O = "{:.4f}".format(O) O = float(O) H = HistoricalRequestData[i][2] H = "{:.4f}".format(H) H = float(H) L = HistoricalRequestData[i][3] L = "{:.4f}".format(H) # Note: This is a bug! You used H instead of L here L = float(L) C = HistoricalRequestData[i][4] C = "{:.4f}".format(C) C = float(C) self.dataframe = self.dataframe.append({'O': O, 'H' : H, 'L' : L, 'C' : C}, ignore_index=True) make_RSI(self.dataframe) make_bollinger_bands(self.dataframe) RSI = self.dataframe['RSI'][99] RSI = float(RSI) UppBoll = self.dataframe['Upper Band'][99] UndBoll = self.dataframe['Lower Band'][99] previouscloseprice = self.dataframe['C'][99] MA = self.dataframe['20 Day MA'][99] DistanceUppBoll = UppBoll - MA DistanceUppBoll = float(DistanceUppBoll) DistanceUndBoll = UndBoll - MA DistanceUndBoll = float(DistanceUndBoll) self.dataframe = self.dataframe.iloc[0:0] def make_RSI(dataframe): delta = dataframe['C'].diff() dUp, dDown = delta.copy(), delta.copy() dUp[dUp < 0] = 0 dDown[dDown > 0] = 0 RolUp = dUp.rolling(14).mean() RolDown = dDown.rolling(14).mean().abs() RS = RolUp / RolDown dataframe['RSI'] = 100 - (100/(1+RS)) def make_bollinger_bands(dataframe): dataframe['20 Day MA'] = dataframe['C'].rolling(window=20).mean() dataframe['20 Day STD'] = dataframe['C'].rolling(window=20).std() dataframe['Upper Band'] = dataframe['20 Day MA'] + (dataframe['20 Day STD'] * 2) dataframe['Lower Band'] = dataframe['20 Day MA'] - (dataframe['20 Day STD'] * 2)
Let’s start by fixing the obvious bug first: in your loop, you accidentally used H instead of L when formatting the low price. That’s a critical error for your technical indicators! Now, onto the performance issues:
Main Bottlenecks in Your Code
- Row-wise
appendin a loop: PandasDataFrame.append()creates an entirely new DataFrame every time it’s called. For 100 rows, this leads to O(n²) time complexity—way slower than necessary. - Redundant string formatting: Converting numbers to strings and back to floats is a waste of cycles; Pandas can handle rounding directly with vectorized operations.
- Unnecessary DataFrame maintenance: You build a full DataFrame just to extract the last row’s values, then immediately clear it. This wastes memory and processing time.
Optimized Code with Explanations
import pandas as pd import requests def make_RSI(dataframe): delta = dataframe['C'].diff() # Use vectorized clip operations instead of manual value assignment dUp = delta.clip(lower=0) dDown = delta.clip(upper=0).abs() RolUp = dUp.rolling(14).mean() RolDown = dDown.rolling(14).mean() RS = RolUp / RolDown dataframe['RSI'] = 100 - (100/(1+RS)) def make_bollinger_bands(dataframe): dataframe['20 Day MA'] = dataframe['C'].rolling(window=20).mean() dataframe['20 Day STD'] = dataframe['C'].rolling(window=20).std() dataframe['Upper Band'] = dataframe['20 Day MA'] + (dataframe['20 Day STD'] * 2) dataframe['Lower Band'] = dataframe['20 Day MA'] - (dataframe['20 Day STD'] * 2) # Reusable session for repeated requests (cuts down on HTTP overhead) session = requests.Session() # Main processing logic BinanceHistoricalUrl = "https://api.binance.com/api/v1/klines?" BinanceHistoricalPayload = {'symbol' : 'BTCUSDT','interval': '1m','limit': 100} HistoricalRequestData = session.get(url=BinanceHistoricalUrl, params=BinanceHistoricalPayload).json() # Convert raw data to DataFrame in one go (no loops!) # Binance klines structure: [timestamp, open, high, low, close, volume, ...] df = pd.DataFrame( HistoricalRequestData, columns=['timestamp', 'O', 'H', 'L', 'C', 'volume', 'close_time', 'quote_vol', 'trades', 'taker_base', 'taker_quote', 'ignore'] ) # Keep only needed columns, convert to float, and round to 4 decimals (vectorized) df = df[['O', 'H', 'L', 'C']].astype(float).round(4) # Calculate indicators make_RSI(df) make_bollinger_bands(df) # Extract only the last row's values we care about last_row = df.iloc[-1] RSI = last_row['RSI'] UppBoll = last_row['Upper Band'] UndBoll = last_row['Lower Band'] previouscloseprice = last_row['C'] MA = last_row['20 Day MA'] DistanceUppBoll = UppBoll - MA DistanceUndBoll = UndBoll - MA # No need to clear the DataFrame—let it go out of scope automatically
Key Performance Improvements
Vectorized Data Conversion:
- We convert the entire Binance response to a DataFrame in a single step, eliminating the slow loop and
append()calls. This drops the time complexity from O(n²) to O(n). astype(float).round(4)replaces the redundant string formatting with fast, vectorized operations that Pandas is optimized for.
- We convert the entire Binance response to a DataFrame in a single step, eliminating the slow loop and
Optimized Indicator Calculation:
- Replaced manual value masking with
clip()for RSI calculation—this is a more efficient vectorized operation.
- Replaced manual value masking with
Reduced Overhead:
- Using a reusable
requests.Session()keeps HTTP connections alive between repeated calls, cutting down on connection setup time. - We only keep the columns we need, reducing memory usage and speeding up subsequent calculations.
- Instead of building and clearing a class-level DataFrame, we use a local DataFrame and extract only the last row’s values directly.
- Using a reusable
Bonus Tips for Even More Speed
- Skip redundant type conversions: The values from
last_roware already floats; you don’t need to wrap them infloat()unless you specifically need Python float objects instead of NumPy floats. - Use TA-Lib: For heavy technical analysis, the TA-Lib library has optimized C-based implementations of RSI and Bollinger Bands that are faster than custom Pandas code.
内容的提问来源于stack exchange,提问作者user10795232




