Mac命令行处理大CSV:按列值提取后随机拆分至两个文件
Hey there! As someone who's fumbled through shell scripting on macOS before, let's break down how to tackle your large CSV file step by step. Here's a reliable approach tailored just for your use case:
Step 1: Extract rows with unique values in Column 1
First, we'll use awk to filter out duplicate entries based on Column 1—this will get you the ~138k unique rows you mentioned.
Basic version (for simple CSVs without commas inside quotes):
# Extract rows where Column 1 is unique (includes header if your file has one) awk -F ',' '!seen[$1]++' original.csv > unique_col1.csv
-F ',': Tellsawkto use commas as the field separator.!seen[$1]++: Uses an associative arrayseento track values in Column 1. The first time a value is encountered,seen[$1]is 0 (so!0is true, and the row gets printed), then we increment the count to mark it as seen. Any subsequent duplicates get skipped automatically.
For CSVs with commas inside quoted fields (e.g., "Doe, John"):
If your data has columns wrapped in quotes that contain commas, use FPAT to correctly parse fields without breaking them:
awk -F ',' 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")"} !seen[$1]++' original.csv > unique_col1.csv
FPAT: Defines the pattern for valid fields—either a string without commas, or a quoted string that can safely include commas.
Step 2: Randomly split the unique rows into two files
macOS uses BSD sort which has a handy -R flag to randomly sort lines. We'll use this to shuffle the data, then split it roughly in half. We'll also handle headers if your file includes one (super common for CSVs!).
Script with header support:
# Set this to 1 if your CSV has a header row, 0 otherwise has_header=1 if [ $has_header -eq 1 ]; then # Extract the header to reuse later head -n 1 unique_col1.csv > temp_header.csv # Shuffle only the data rows (skip the header) tail -n +2 unique_col1.csv | sort -R > shuffled_data.csv # Calculate how many rows to split into each file total_data_lines=$(wc -l < shuffled_data.csv) split_lines=$((total_data_lines / 2)) # Build file1.csv (header + first half of shuffled data) cat temp_header.csv <(head -n $split_lines shuffled_data.csv) > file1.csv # Build file2.csv (header + second half of shuffled data) cat temp_header.csv <(tail -n +$((split_lines + 1)) shuffled_data.csv) > file2.csv # Clean up temporary files rm temp_header.csv shuffled_data.csv else # Shuffle all lines and split directly (no header) sort -R unique_col1.csv > shuffled_data.csv total_lines=$(wc -l < shuffled_data.csv) split_lines=$((total_lines / 2)) head -n $split_lines shuffled_data.csv > file1.csv tail -n +$((split_lines + 1)) shuffled_data.csv > file2.csv rm shuffled_data.csv fi
sort -R: Randomizes the order of lines efficiently—way better than loading the whole 100k+ row file into memory for smaller tools.- We shuffle once first to ensure the split is truly random—avoiding separate shuffles for each file which could cause overlaps or missing rows.
Optional: Using shuf (if you have Homebrew installed)
If you've installed GNU coreutils via Homebrew (brew install coreutils), you can use gshuf (GNU's version of shuf) which might feel more intuitive:
# With header support head -n 1 unique_col1.csv > temp_header.csv tail -n +2 unique_col1.csv | gshuf -n $split_lines > temp_file1_data.csv tail -n +2 unique_col1.csv | gshuf -n $((total_data_lines - split_lines)) > temp_file2_data.csv cat temp_header.csv temp_file1_data.csv > file1.csv cat temp_header.csv temp_file2_data.csv > file2.csv rm temp_*.csv
内容的提问来源于stack exchange,提问作者colbythenoob




