Speeding Up Bulk Cross-Server PostgreSQL Data Migration (Moving to MongoDB/Cassandra)

Ahua AIGC Lab

2026-5-22

Alright, let's break this down. Moving subsets of 20M+ PostgreSQL records between servers with correlated INSERT/UPDATE functions is bound to hit bottlenecks—those operations handle row-level logic and cross-table joins on the fly, which kills throughput even with optimizations. Let's walk through how to leverage MongoDB or Cassandra to speed this up, starting with universal best practices, then diving into database-specific steps.

Universal Pre-Migration Optimizations (Critical for Speed)

First, fix how you extract data from PostgreSQL—this is often the biggest bottleneck:

Ditch the function-based correlated logic: Instead of using functions to join and insert/update rows in real-time, extract your filtered data first in bulk.
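- Use PostgreSQL's COPY command to export directly to CSV. It is far faster than querying and processing rows one at a time in a function. COPY's documented formats are text, csv, and binary, so if you want JSON, build it in the inner query (for example with row_to_json) rather than expecting a JSON format option. Also note that COPY ... TO a file path writes on the database server and needs superuser or pg_write_server_files rights; from a client machine, use psql's \copy instead. For example, to export a filtered subset: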
```
COPY (
  SELECT t1.id, t1.value, t2.metadata 
  FROM source_table t1
  JOIN related_table t2 ON t1.id = t2.source_id
  WHERE t1.created_at >= '2023-01-01'
) TO '/tmp/filtered_data.csv' WITH (FORMAT csv, HEADER, DELIMITER ',');
```
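- For incremental migrations (not full dumps), use a logical replication slot to capture change events instead of re-scanning the entire table every time. This needs wal_level = logical and a slot created with pg_create_logical_replication_slot; you can then poll the slot with pg_logical_slot_get_changes.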
Parallelize everything: Split your exported data into chunks (by ID ranges, date buckets, etc.) and process multiple chunks simultaneously. Both MongoDB and Cassandra's bulk tools support parallel ingestion, so use all available CPU/network capacity.
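To make the chunking idea concrete, here's a minimal sketch of splitting an ID span into ranges that separate workers could export or load independently. The function name id_ranges and the chunk size are my own illustration, not part of any tool mentioned here, and the sketch assumes a numeric, reasonably dense id column.

```python
def id_ranges(min_id, max_id, chunk_size):
    """Yield inclusive (start, end) ID ranges that cover [min_id, max_id]."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start = min_id
    while start <= max_id:
        # The last range is clipped so it never runs past max_id.
        end = min(start + chunk_size - 1, max_id)
        yield (start, end)
        start = end + 1


if __name__ == "__main__":
    # Hypothetical numbers: 20M IDs split into 5M-row chunks.
    for start, end in id_ranges(1, 20_000_000, 5_000_000):
        print(f"WHERE id BETWEEN {start} AND {end}")
```

Each (start, end) pair can be dropped into a WHERE id BETWEEN ... filter on the export query and handed to its own worker. If your IDs are sparse or skewed, date buckets will probably give more even chunks.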

Migrating to MongoDB

MongoDB's document model and bulk import tools make this straightforward. Here are your best options:

Option 1: Bulk Load with `mongoimport`

This is the fastest path for large static datasets. Once you have your CSV/JSON export:

Run mongoimport with batch and parallelization flags to maximize throughput:

mongoimport --uri "mongodb://mongo-cluster-host:27017/target_db" \
  --collection target_collection \
  --type csv \
  --headerline \
  --file /tmp/filtered_data.csv \
  --batchSize 15000 \
  --numInsertionWorkers 6

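Tweak numInsertionWorkers based on the CPU cores available (start near the core count and measure), and batchSize to balance memory usage and write speed. Note that batchSize does not appear in mongoimport's documented option list, so check that your version of the database tools accepts it before relying on it.

If you exported newline-delimited JSON (one document per line), drop the --type csv and --headerline flags, since JSON is mongoimport's default type. If the file is a single JSON array of documents, also add --jsonArray.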

Option 2: Stream Data Directly (No Disk I/O)

For real-time or incremental migrations, skip writing to disk and stream data from PostgreSQL to MongoDB using a script. Here's a Python example (use server-side cursors to avoid loading all data into memory):

import psycopg2
from pymongo import MongoClient

BATCH_SIZE = 10000  # rows per fetch and per insert_many call

# Connect to PostgreSQL
pg_conn = psycopg2.connect("dbname=source_db user=postgres host=pg-host")
# Connect to MongoDB
mongo_client = MongoClient("mongodb://mongo-host:27017/")
mongo_coll = mongo_client["target_db"]["target_collection"]

try:
    # A named cursor is server-side, so rows are streamed rather than loaded at once
    with pg_conn.cursor(name='large_result_cursor') as pg_cursor:
        pg_cursor.itersize = BATCH_SIZE
        pg_cursor.execute("SELECT id, value, metadata FROM source_table WHERE ...")

        batch = []
        for row in pg_cursor:
            # Map PostgreSQL row to MongoDB document
            batch.append({"id": row[0], "value": row[1], "metadata": row[2]})

            # Insert in batches; unordered inserts keep going past a failed document
            if len(batch) == BATCH_SIZE:
                mongo_coll.insert_many(batch, ordered=False)
                batch = []
        # Insert remaining records
        if batch:
            mongo_coll.insert_many(batch, ordered=False)
finally:
    # Release both connections even if the load fails partway
    pg_conn.close()
    mongo_client.close()
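
The batching logic in the script above can also be pulled into a small reusable helper. This is just a sketch: batched and row_to_doc are hypothetical names, and the default column list assumes the three columns selected in the query above.

```python
from itertools import islice


def batched(rows, size):
    """Yield lists of up to `size` items from any iterable, e.g. a DB cursor."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def row_to_doc(row, columns=("id", "value", "metadata")):
    """Map a positional row tuple to a document keyed by column name."""
    return dict(zip(columns, row))
```

With these, the loop body shrinks to roughly: for each chunk in batched(pg_cursor, 10000), call insert_many on [row_to_doc(r) for r in chunk]. The final partial batch is handled by the generator, so there is no separate "insert the leftovers" step to forget.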

MongoDB-Specific Tweaks

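Defer secondary indexes during ingestion: MongoDB has no switch to disable an index, so create secondary indexes only after all data is loaded (the mandatory _id index stays). Maintaining indexes during bulk inserts slows them down drastically.
Use sharding for massive datasets: If your target collection will grow beyond a single node's capacity, shard the collection before migration so writes are distributed across nodes from the start.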
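
As a rough sketch of building indexes after the load, the helper below loops over index definitions and calls create_index on each one. It assumes a pymongo Collection (or anything exposing the same create_index method); the helper name and the example index specs are made up for illustration.

```python
def create_indexes_after_load(collection, index_specs):
    """Create each index in index_specs and return the created index names.

    collection  -- assumed to be a pymongo Collection, or any object exposing
                   create_index(keys, **options)
    index_specs -- list of (keys, options) pairs, where keys is a list of
                   (field, direction) tuples
    """
    created = []
    for keys, options in index_specs:
        created.append(collection.create_index(keys, **options))
    return created


# Hypothetical index definitions for the example collection.
EXAMPLE_INDEX_SPECS = [
    ([("id", 1)], {"unique": True}),  # 1 means ascending
    ([("value", 1)], {}),
]
```

Call it once, after the last insert_many has returned, for example create_indexes_after_load(mongo_coll, EXAMPLE_INDEX_SPECS).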

Migrating to Cassandra

Cassandra is built for high-throughput write workloads, but it requires careful schema design (especially partition keys) to avoid hotspots. Here's how to migrate efficiently:

Option 1: Bulk Load with DataStax Bulk Loader (`dsbulk`)

dsbulk is far faster than cqlsh COPY for large datasets—it's optimized for Cassandra's distributed architecture:

First, create your Cassandra table with a well-designed partition key (critical for performance). For example:

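CREATE TABLE target_keyspace.target_table (
    id UUID,
    value TEXT,
    metadata MAP<TEXT, TEXT>,
    created_at TIMESTAMP,
    -- id is the partition key; created_at must be a clustering column
    -- for CLUSTERING ORDER BY to be valid
    PRIMARY KEY ((id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);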

Export your PostgreSQL data to CSV (as covered earlier), making sure the export includes every column the Cassandra table expects, including created_at.

Run dsbulk to load the data:

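dsbulk load -k target_keyspace -t target_table \
  -url /tmp/filtered_data.csv \
  -header true \
  --batch.maxBatchStatements 32 \
  --executor.maxInFlight 120

Adjust batch.maxBatchStatements (statements per batch) and executor.maxInFlight (concurrent in-flight requests) based on your cluster's node count and resource capacity. Option names have changed between dsbulk releases, so confirm them against the documentation for your version.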

Option 2: Stream Data Directly

Use a script to stream batches from PostgreSQL to Cassandra. Here's a Python example using the Cassandra driver:

import psycopg2
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

CHUNK_SIZE = 5000   # rows pulled from PostgreSQL per round
CONCURRENCY = 100   # in-flight Cassandra writes per round

# Connect to Cassandra cluster
cluster = Cluster(['cassandra-node-1', 'cassandra-node-2'])
session = cluster.connect('target_keyspace')
# Prepare insert statement (reused for every row)
insert_stmt = session.prepare("""
    INSERT INTO target_table (id, value, metadata, created_at)
    VALUES (?, ?, ?, ?)
""")

# Connect to PostgreSQL
pg_conn = psycopg2.connect("dbname=source_db user=postgres host=pg-host")

with pg_conn.cursor(name='large_cursor') as pg_cursor:
    pg_cursor.itersize = CHUNK_SIZE
    pg_cursor.execute("SELECT id, value, metadata, created_at FROM source_table WHERE ...")

    # Concurrent single-row writes are used instead of a 5k-row BatchStatement:
    # large multi-partition batches hit Cassandra's batch size limits and
    # overload the coordinator node.
    rows = []
    for row in pg_cursor:
        rows.append(row)
        if len(rows) == CHUNK_SIZE:
            execute_concurrent_with_args(session, insert_stmt, rows, concurrency=CONCURRENCY)
            rows = []
    # Write remaining records
    if rows:
        execute_concurrent_with_args(session, insert_stmt, rows, concurrency=CONCURRENCY)

pg_conn.close()
cluster.shutdown()

Cassandra-Specific Tweaks

Optimize partition keys: Avoid hot partitions (e.g., don't use a single user ID as the partition key if that user has millions of records). Use composite keys or time-based bucketing if needed.
Tweak consistency temporarily: Set the write consistency level to LOCAL_ONE during bulk loads (instead of QUORUM) to speed up writes. Switch back to QUORUM after migration, and consider running a repair so replicas that missed writes catch up.
Disable autocompaction: Run nodetool disableautocompaction before loading, then re-enable and run a compaction afterward. Compaction during bulk loads can cripple performance.
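
Here's a small sketch of the time-based bucketing idea from the partition key tip: derive a month bucket from each row's timestamp and use it together with the ID as a composite partition key. The names month_bucket and bucketed_partition_key, and the month granularity, are assumptions for illustration; pick the bucket size from your own data volume.

```python
from datetime import datetime


def month_bucket(ts):
    """Return a 'YYYY-MM' bucket label for a datetime."""
    return ts.strftime("%Y-%m")


def bucketed_partition_key(entity_id, ts):
    """Combine an ID with its month bucket so one busy ID spans many partitions."""
    return (entity_id, month_bucket(ts))


if __name__ == "__main__":
    # Hypothetical row: the same user lands in a new partition each month.
    print(bucketed_partition_key("user-42", datetime(2023, 1, 15)))
    print(bucketed_partition_key("user-42", datetime(2023, 2, 3)))
```

On the CQL side this would correspond to something like PRIMARY KEY ((id, bucket), created_at), where bucket is a hypothetical extra TEXT column you populate during the load. The trade-off is that reads spanning several months have to query several partitions.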

MongoDB vs. Cassandra: Which to Choose?

Go with MongoDB if: You need flexible schemas, ad-hoc querying, or don't want to invest time in complex schema design upfront. It's easier to set up for one-off bulk migrations.
Go with Cassandra if: You need linear scalability for write-heavy workloads, tunable per-query consistency (from ONE up to QUORUM or ALL), or have time-series/streaming data. It's built for massive throughput but requires careful planning.

The question comes from Stack Exchange; asked by Vignesh Karthi.