基于可变时间班次条件拆分DataFrame多行数据

阿华AIGC实验室

2026-4-29

Solution for Multi-Day Shift Splitting in Pandas

Problem Breakdown

Let's recap your requirements clearly to make sure we're aligned:

Shift Rules

We have 3 fixed daily shifts:

Shift A: 00:00:00 to 08:00:00
Shift B: 08:00:00 to 16:00:00
Shift C: 16:00:00 to 23:59:59

Core Requirement

Each row in your dataset has a start time (INITRA) and end time (FINTRA). For rows that span multiple days or shifts, you need to split them into multiple rows—each covering exactly one date-shift segment—while retaining all other column values. The challenge you're facing is handling multi-day spans, which your current solution can't address.

Example Scenario

Take the first row in your input: INITRA=2021-02-15 22:07:00 and FINTRA=2021-02-17 05:26:42. This should split into 5 rows covering:

2021-02-15, Shift C (from 22:07:00 to 23:59:59)
2021-02-16, Shift A (full shift)
2021-02-16, Shift B (full shift)
2021-02-16, Shift C (full shift)
2021-02-17, Shift A (from 00:00:00 to 05:26:42)

Proposed Solution

This solution uses Pandas with efficient datetime handling to process your data at scale—even for hundreds of thousands of rows. Here's how to implement it step by step:

Step 1: Define Shift Configuration

First, we'll formalize the shift rules into a structured format for easy lookup:

import pandas as pd
from datetime import datetime, timedelta

# Define shift rules with name, start time, end time
shift_rules = [
    ("A", "00:00:00", "08:00:00"),
    ("B", "08:00:00", "16:00:00"),
    ("C", "16:00:00", "23:59:59")
]

# Convert shift times to datetime.time objects for comparison
shift_rules = [
    (s, datetime.strptime(st, "%H:%M:%S").time(), datetime.strptime(et, "%H:%M:%S").time())
    for s, st, et in shift_rules
]

Step 2: Create the Row-Splitting Function

This function takes a single row, generates all relevant date-shift segments, and returns a list of split rows (as dictionaries):

def split_row(row):
    initra = row["INITRA"]
    fintra = row["FINTRA"]
    # Preserve all columns except INITRA/FINTRA
    base_data = row.drop(["INITRA", "FINTRA"]).to_dict()
    
    split_rows = []
    current_date = initra.date()
    end_date = fintra.date()
    
    while current_date <= end_date:
        # Convert current date to datetime bounds
        date_start = datetime.combine(current_date, datetime.min.time())
        date_end = datetime.combine(current_date, datetime.max.time())
        
        # Calculate the actual time segment for this date
        segment_start = max(initra, date_start)
        segment_end = min(fintra, date_end)
        
        # Check which shifts overlap with this segment
        for shift_name, shift_start, shift_end in shift_rules:
            shift_dt_start = datetime.combine(current_date, shift_start)
            shift_dt_end = datetime.combine(current_date, shift_end)
            
            # Only process overlapping shifts
            if segment_start < shift_dt_end and segment_end > shift_dt_start:
                actual_start = max(segment_start, shift_dt_start)
                actual_end = min(segment_end, shift_dt_end)
                
                # Add the split row to results
                split_rows.append({
                    **base_data,
                    "dateIni": current_date.strftime("%Y-%m-%d"),
                    "dateFin": current_date.strftime("%Y-%m-%d"),
                    "horaIni": actual_start.strftime("%H:%M:%S"),
                    "horaFin": actual_end.strftime("%H:%M:%S"),
                    "Shift": shift_name
                })
        
        # Move to the next day
        current_date += timedelta(days=1)
    
    return split_rows

Step 3: Apply to the Full Dataset

We'll use apply to run the function on every row, then flatten the results into a final DataFrame:

# Load your input data (ensure INITRA/FINTRA are datetime types)
df = pd.DataFrame({
    "col1": ["C2", "X12", "S44", "Z1", "O9", "O7", "H8", "C2"],
    "col2": ["Y5", "Y1", "S1", "J7", "H5", "H8", "Y5", "Q1"],
    "INITRA": pd.to_datetime(["2021-02-15 22:07:00", "2021-02-15 22:00:36", "2021-02-15 21:30:39", "2021-02-15 21:26:28", "2021-02-15 21:14:29", "2021-02-15 20:45:31", "2021-02-15 20:37:05", "2021-02-15 20:15:49"]),
    "FINTRA": pd.to_datetime(["2021-02-17 05:26:42", "2021-02-15 22:02:40", "2021-02-15 21:44:51", "2021-02-15 21:28:00", "2021-02-15 21:15:40", "2021-02-15 20:46:39", "2021-02-15 20:38:10", "2021-02-15 20:27:22"]),
    "col3": ["AA"]*8,
    "col4": ["AL"]*8,
    "col5": [111871, 111863, 111863, 111863, 111863, 111863, 111863, 111863]
})

# Apply splitting function and flatten results
split_results = df.apply(split_row, axis=1).explode()
final_df = pd.DataFrame(split_results.tolist()).reset_index(drop=True)

# Reorder columns to match your desired output
final_df = final_df[["col1", "col2", "col3", "col4", "col5", "dateIni", "dateFin", "horaIni", "horaFin", "Shift"]]

# Print or save the result
print(final_df)

Step 4: Optimizations for Large Datasets

Since you're working with hundreds of thousands of rows, here are ways to speed up processing:

Vectorization: For even faster performance, refactor the logic to use Pandas vectorized operations instead of row-wise apply (avoids Python-level loops).
Parallel Processing: Use the swifter library to automatically optimize apply (runs vectorized if possible, or uses Dask for parallelization):
```
import swifter
split_results = df.swifter.apply(split_row, axis=1).explode()
```
Chunk Processing: If your data is too large for memory, process it in chunks with pd.read_csv(chunksize=...) and save each chunk's results to a file before combining.

Verification

Running this code will produce exactly the output you described: the first row splits into 5 entries, and all shorter time spans remain as single rows.

内容的提问来源于stack exchange，提问作者karek77