R: Efficiently Filtering Edge Rows That Recur in ≥75% of the Matrices in a List

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-27

Got it, let's tackle this problem step by step. You've got 68 edge-list matrices, each with thousands of rows of gene-gene interactions (Node1, Node2, weight). You want a final table in which every edge that appears in at least 75% of the matrices (51 or more, since 68 × 0.75 = 51) is merged into a single row, even if its weight differs from matrix to matrix.

Python’s pandas library is ideal for this kind of tabular data processing. Here’s a practical, customizable approach:

Step 1: Import Required Libraries

First, grab the tools we need:

import pandas as pd
import glob

Step 2: Load All Matrix Files

Assuming your edge lists are stored as CSV/TSV files in a single folder, we’ll load them all into a combined dataframe and track which matrix each row comes from:

# Adjust the file path and pattern to match your files (e.g., "*.tsv" for tab-separated)
# sorted() keeps the Matrix_N numbering stable between runs; glob order is not guaranteed
file_list = sorted(glob.glob("path/to/your/matrix/files/*.csv"))

# Load each file and tag rows with their source matrix ID
all_matrices = []
for matrix_num, file in enumerate(file_list, 1):
    # If your files have headers, remove the `names=` argument
    df = pd.read_csv(file, names=["Node1", "Node2", "Weight"])
    df["Source_Matrix"] = f"Matrix_{matrix_num}"
    all_matrices.append(df)

# Combine all matrices into one big dataframe
combined_data = pd.concat(all_matrices, ignore_index=True)
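Numbered labels like Matrix_7 can be hard to trace back to a file later. Here is a minimal sketch, assuming you would rather tag each row with its file name. The file names liver.csv and brain.csv are made up for illustration, and they are written to a temporary folder only so the snippet runs on its own:

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Two tiny, hypothetical edge lists without header rows
    (folder / "liver.csv").write_text("GeneA,GeneB,0.9\n")
    (folder / "brain.csv").write_text("GeneB,GeneA,0.7\nGeneC,GeneD,0.4\n")

    frames = []
    # sorted() gives a reproducible file order
    for path in sorted(folder.glob("*.csv")):
        df = pd.read_csv(path, names=["Node1", "Node2", "Weight"])
        df["Source_Matrix"] = path.stem  # file name without the extension
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)

print(combined)
```

In your own pipeline you would point Path at your real folder and drop the temporary-file lines.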

Step 3: Standardize Edge Representation

Gene pairs like GeneA-GeneB and GeneB-GeneA are the same biological edge, so we’ll standardize them to avoid duplicate entries:

# Create a sorted tuple of Node1/Node2 to represent each unique edge
combined_data["Unique_Edge"] = combined_data.apply(
    lambda row: tuple(sorted([row["Node1"], row["Node2"]])),
    axis=1
)
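With 68 matrices of thousands of rows each, a row-wise apply() may become the slow part. A possible vectorized alternative is sketched below on toy data, assuming all node labels are strings. Edge_A and Edge_B are hypothetical column names (not part of the steps above) that hold the alphabetically smaller and larger node of each pair, used in place of a single tuple column:

```python
import numpy as np
import pandas as pd

# Toy stand-in for combined_data (gene names are made up)
toy = pd.DataFrame({
    "Node1": ["GeneB", "GeneA", "GeneC"],
    "Node2": ["GeneA", "GeneB", "GeneD"],
    "Weight": [0.9, 0.7, 0.4],
})

# True where the pair is stored in "reverse" alphabetical order
swap = toy["Node1"] > toy["Node2"]

# Put the smaller label in Edge_A and the larger one in Edge_B
toy["Edge_A"] = np.where(swap, toy["Node2"], toy["Node1"])
toy["Edge_B"] = np.where(swap, toy["Node1"], toy["Node2"])

print(toy[["Edge_A", "Edge_B", "Weight"]])
```

The first two rows end up with the same (Edge_A, Edge_B) pair, so they can be grouped as one edge by passing both columns to groupby().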

Step 4: Filter Edges That Meet the 75% Threshold

We’ll keep only edges that appear in 51 or more matrices:

# Count how many unique matrices each edge appears in
edge_matrix_counts = combined_data.groupby("Unique_Edge")["Source_Matrix"].nunique()

# Filter edges that meet the 75% requirement
qualified_edges = edge_matrix_counts[edge_matrix_counts >= 51].index

# Keep only these qualified edges in our dataset
filtered_data = combined_data[combined_data["Unique_Edge"].isin(qualified_edges)]
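If the edge is stored as two canonical columns rather than one tuple column, the same filter can be written in a single pass with groupby().transform(). This is only a sketch on toy data with 4 matrices, so 75% means 3. The column names Edge_A and Edge_B are assumptions and stand for the sorted node pair:

```python
import pandas as pd

# Toy data: GeneA-GeneB occurs in 3 matrices (twice in Matrix_3),
# GeneC-GeneD occurs in only 2
toy = pd.DataFrame({
    "Edge_A": ["GeneA", "GeneA", "GeneA", "GeneC", "GeneC", "GeneA"],
    "Edge_B": ["GeneB", "GeneB", "GeneB", "GeneD", "GeneD", "GeneB"],
    "Weight": [0.9, 0.7, 0.8, 0.4, 0.5, 0.6],
    "Source_Matrix": ["Matrix_1", "Matrix_2", "Matrix_3",
                      "Matrix_1", "Matrix_2", "Matrix_3"],
})

min_count = 3  # 75% of the 4 matrices in this toy example

# For every row, the number of distinct matrices its edge appears in
matrix_count = toy.groupby(["Edge_A", "Edge_B"])["Source_Matrix"].transform("nunique")

# Keep only rows whose edge reaches the threshold
kept = toy[matrix_count >= min_count]
print(kept)
```

Because nunique counts distinct matrices rather than rows, the repeated GeneA-GeneB row in Matrix_3 does not inflate the count.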

Step 5: Merge Edges into Single Rows

You’ve got two common options here, depending on what you need in your final table:

Option A: Wide Table (Show Weight from Each Matrix)

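This creates one row per edge, with one column per matrix holding that matrix's weight. Matrices that lack the edge get NaN in their column. Note that pivot() raises a ValueError if any edge occurs more than once within the same matrix (for example as both GeneA-GeneB and GeneB-GeneA), so remove or aggregate such duplicates first: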

# Pivot the data to wide format
wide_table = filtered_data.pivot(
    index="Unique_Edge",
    columns="Source_Matrix",
    values="Weight"
).reset_index()

# Split the sorted edge tuple back into Node1 and Node2 columns
wide_table[["Node1", "Node2"]] = pd.DataFrame(
    wide_table["Unique_Edge"].tolist(),
    index=wide_table.index
)

# Reorder columns to put gene names first
wide_table = wide_table[["Node1", "Node2"] + [col for col in wide_table.columns if col not in ["Unique_Edge", "Node1", "Node2"]]]
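If some matrices list the same edge more than once, one option is pivot_table(), which aggregates duplicates instead of failing. The sketch below uses toy data. Averaging the duplicate weights is an assumption (pick whatever aggregation suits your data), and Edge_A/Edge_B are hypothetical columns holding the sorted node pair:

```python
import pandas as pd

# Toy data: GeneA-GeneB is listed twice in Matrix_1
toy = pd.DataFrame({
    "Edge_A": ["GeneA", "GeneA", "GeneA", "GeneC"],
    "Edge_B": ["GeneB", "GeneB", "GeneB", "GeneD"],
    "Weight": [0.8, 0.6, 0.5, 0.4],
    "Source_Matrix": ["Matrix_1", "Matrix_1", "Matrix_2", "Matrix_1"],
})

# One row per edge, one column per matrix; duplicates are averaged
wide = toy.pivot_table(
    index=["Edge_A", "Edge_B"],
    columns="Source_Matrix",
    values="Weight",
    aggfunc="mean",
).reset_index()

# Drop the leftover "Source_Matrix" label on the column axis
wide.columns.name = None
print(wide)
```

GeneC-GeneD has no entry in Matrix_2, so that cell comes out as NaN.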

Option B: Aggregated Stats Table (Summarize Weight Variation)

If you just want summary stats (like min/max/mean weight) instead of every matrix’s value:

aggregated_table = filtered_data.groupby("Unique_Edge").agg(
    Matrix_Count=("Source_Matrix", "nunique"),
    Min_Weight=("Weight", "min"),
    Max_Weight=("Weight", "max"),
    Mean_Weight=("Weight", "mean"),
    Std_Weight=("Weight", "std")
).reset_index()

# Split edge tuple into Node1/Node2
aggregated_table[["Node1", "Node2"]] = pd.DataFrame(
    aggregated_table["Unique_Edge"].tolist(),
    index=aggregated_table.index
)

# Reorder columns for readability
aggregated_table = aggregated_table[["Node1", "Node2", "Matrix_Count", "Min_Weight", "Max_Weight", "Mean_Weight", "Std_Weight"]]

Step 6: Save the Final Table

Export your result to a CSV file (adjust the path as needed):

# For wide table
wide_table.to_csv("final_edges_wide.csv", index=False)

# For aggregated stats table
aggregated_table.to_csv("final_edges_aggregated.csv", index=False)

Quick Notes to Tweak for Your Setup:

If your files aren’t CSV, use pd.read_table() for TSV or adjust the read method to match your file type.
If your edges are already consistently ordered (e.g., Node1 is always alphabetically before Node2), you can skip the Unique_Edge standardization step.
Adjust the threshold number (51) if you need a different percentage (e.g., 70% would be 48 matrices).
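To avoid hard-coding the threshold, it can be computed from the number of matrices. Here is a small sketch; the helper name min_matrices is made up for illustration. It takes the percentage as a string and uses Fraction so that floating-point rounding cannot nudge the product past an integer before rounding up:

```python
import math
from fractions import Fraction


def min_matrices(n_matrices: int, fraction: str) -> int:
    """Smallest number of matrices that reaches `fraction` of n_matrices."""
    # Fraction("0.75") is exactly 3/4, so the product is exact
    return math.ceil(Fraction(fraction) * n_matrices)


print(min_matrices(68, "0.75"))  # 51
print(min_matrices(68, "0.70"))  # 48
```

In the steps above, the result would replace the literal 51, with len(file_list) passed as n_matrices.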

The question in this post comes from Stack Exchange and was asked by user9740934.