Mac命令行处理大CSV：按列值提取后随机拆分至两个文件

阿华AIGC实验室

2026-5-19

解决Mac上大CSV文件的唯一值提取与随机拆分问题

Hey there! As someone who's fumbled through shell scripting on macOS before, let's break down how to tackle your large CSV file step by step. Here's a reliable approach tailored just for your use case:

Step 1: Extract rows with unique values in Column 1

First, we'll use awk to filter out duplicate entries based on Column 1—this will get you the ~138k unique rows you mentioned.

Basic version (for simple CSVs without commas inside quotes):

# Extract rows where Column 1 is unique (includes header if your file has one)
awk -F ',' '!seen[$1]++' original.csv > unique_col1.csv

-F ',': Tells awk to use commas as the field separator.
!seen[$1]++: Uses an associative array seen to track values in Column 1. The first time a value is encountered, seen[$1] is 0 (so !0 is true, and the row gets printed), then we increment the count to mark it as seen. Any subsequent duplicates get skipped automatically.

For CSVs with commas inside quoted fields (e.g., `"Doe, John"`):

If your data has columns wrapped in quotes that contain commas, use FPAT to correctly parse fields without breaking them:

awk -F ',' 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")"} !seen[$1]++' original.csv > unique_col1.csv

FPAT: Defines the pattern for valid fields—either a string without commas, or a quoted string that can safely include commas.

Step 2: Randomly split the unique rows into two files

macOS uses BSD sort which has a handy -R flag to randomly sort lines. We'll use this to shuffle the data, then split it roughly in half. We'll also handle headers if your file includes one (super common for CSVs!).

Script with header support:

# Set this to 1 if your CSV has a header row, 0 otherwise
has_header=1

if [ $has_header -eq 1 ]; then
    # Extract the header to reuse later
    head -n 1 unique_col1.csv > temp_header.csv
    # Shuffle only the data rows (skip the header)
    tail -n +2 unique_col1.csv | sort -R > shuffled_data.csv
    # Calculate how many rows to split into each file
    total_data_lines=$(wc -l < shuffled_data.csv)
    split_lines=$((total_data_lines / 2))
    # Build file1.csv (header + first half of shuffled data)
    cat temp_header.csv <(head -n $split_lines shuffled_data.csv) > file1.csv
    # Build file2.csv (header + second half of shuffled data)
    cat temp_header.csv <(tail -n +$((split_lines + 1)) shuffled_data.csv) > file2.csv
    # Clean up temporary files
    rm temp_header.csv shuffled_data.csv
else
    # Shuffle all lines and split directly (no header)
    sort -R unique_col1.csv > shuffled_data.csv
    total_lines=$(wc -l < shuffled_data.csv)
    split_lines=$((total_lines / 2))
    head -n $split_lines shuffled_data.csv > file1.csv
    tail -n +$((split_lines + 1)) shuffled_data.csv > file2.csv
    rm shuffled_data.csv
fi

sort -R: Randomizes the order of lines efficiently—way better than loading the whole 100k+ row file into memory for smaller tools.
We shuffle once first to ensure the split is truly random—avoiding separate shuffles for each file which could cause overlaps or missing rows.

Optional: Using `shuf` (if you have Homebrew installed)

If you've installed GNU coreutils via Homebrew (brew install coreutils), you can use gshuf (GNU's version of shuf) which might feel more intuitive:

# With header support
head -n 1 unique_col1.csv > temp_header.csv
tail -n +2 unique_col1.csv | gshuf -n $split_lines > temp_file1_data.csv
tail -n +2 unique_col1.csv | gshuf -n $((total_data_lines - split_lines)) > temp_file2_data.csv
cat temp_header.csv temp_file1_data.csv > file1.csv
cat temp_header.csv temp_file2_data.csv > file2.csv
rm temp_*.csv

内容的提问来源于stack exchange，提问作者colbythenoob