R: Efficiently filtering edge rows that recur in at least 75% of the matrices in a list
Let's tackle this step by step. You have 68 edge-list matrices, each with thousands of rows of gene-gene interactions (Node1, Node2, weight), and you want a final table in which every edge that appears in at least 75% of the matrices (51 or more, since 68 × 0.75 = 51) is merged into a single row, even if its weight differs from matrix to matrix.
Python’s pandas library is ideal for this kind of tabular data processing. Here’s a practical, customizable approach:
Step 1: Import Required Libraries
First, grab the tools we need:
    import glob

    import pandas as pd
Step 2: Load All Matrix Files
Assuming your edge lists are stored as CSV/TSV files in a single folder, we’ll load them all into a combined dataframe and track which matrix each row comes from:
    # Adjust the file path and pattern to match your files (e.g., "*.tsv" for tab-separated).
    # sorted() keeps the matrix numbering stable, because glob returns files in arbitrary order.
    file_list = sorted(glob.glob("path/to/your/matrix/files/*.csv"))

    # Load each file and tag rows with their source matrix ID
    all_matrices = []
    for matrix_num, file in enumerate(file_list, 1):
        # If your files have a header row, add `header=0` so it is replaced by these names
        df = pd.read_csv(file, names=["Node1", "Node2", "Weight"])
        df["Source_Matrix"] = f"Matrix_{matrix_num}"
        all_matrices.append(df)

    # Combine all matrices into one big dataframe
    combined_data = pd.concat(all_matrices, ignore_index=True)
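As a possible variation on the loading step, the sketch below tags each row with the file name instead of a running number, so that a column in the final table can be traced back to a specific file. The helper name `load_edge_lists` is hypothetical, and it assumes headerless three-column files with unique names in one folder.

```python
from pathlib import Path

import pandas as pd


def load_edge_lists(folder, pattern="*.csv"):
    """Read every headerless edge-list file in `folder` into one DataFrame.

    Hypothetical helper: assumes three columns (Node1, Node2, Weight)
    and that file names are unique within the folder.
    """
    frames = []
    # Sort the paths so the result is the same on every run.
    for path in sorted(Path(folder).glob(pattern)):
        df = pd.read_csv(path, names=["Node1", "Node2", "Weight"])
        df["Source_Matrix"] = path.stem  # file name without its extension
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")
    return pd.concat(frames, ignore_index=True)
```

Calling `load_edge_lists("path/to/your/matrix/files")` would then produce the same four columns as the loop above, with `Source_Matrix` holding file names.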
Step 3: Standardize Edge Representation
Gene pairs like GeneA-GeneB and GeneB-GeneA are the same biological edge, so we’ll standardize them to avoid duplicate entries:
    # Create a sorted tuple of Node1/Node2 to represent each unique edge
    combined_data["Unique_Edge"] = combined_data.apply(
        lambda row: tuple(sorted([row["Node1"], row["Node2"]])), axis=1
    )
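Row-wise `apply` can be slow once the combined table reaches hundreds of thousands of rows. A minimal sketch of a vectorized alternative follows. It sorts the two endpoint columns with NumPy and then zips them into tuples. The function name `add_unique_edge` and the demo genes are made up for illustration, and node names are cast to strings, which assumes there are no missing values.

```python
import numpy as np
import pandas as pd


def add_unique_edge(df, a="Node1", b="Node2"):
    """Return a copy of `df` with a `Unique_Edge` column of sorted (node, node) tuples."""
    out = df.copy()
    # Sort the two endpoint names within each row in a single NumPy call.
    # Node names are cast to str, which assumes there are no missing values.
    ends = np.sort(out[[a, b]].astype(str).to_numpy(), axis=1)
    out["Unique_Edge"] = list(zip(ends[:, 0], ends[:, 1]))
    return out


# Invented demo data: the first two rows describe the same undirected edge.
demo = pd.DataFrame(
    {
        "Node1": ["GeneB", "GeneA", "GeneC"],
        "Node2": ["GeneA", "GeneB", "GeneD"],
        "Weight": [0.9, 0.4, 0.7],
    }
)
print(add_unique_edge(demo)["Unique_Edge"].tolist())
# [('GeneA', 'GeneB'), ('GeneA', 'GeneB'), ('GeneC', 'GeneD')]
```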
Step 4: Filter Edges That Meet the 75% Threshold
We’ll keep only edges that appear in 51 or more matrices:
    # Count how many unique matrices each edge appears in
    edge_matrix_counts = combined_data.groupby("Unique_Edge")["Source_Matrix"].nunique()

    # Filter edges that meet the 75% requirement
    qualified_edges = edge_matrix_counts[edge_matrix_counts >= 51].index

    # Keep only these qualified edges in our dataset
    filtered_data = combined_data[combined_data["Unique_Edge"].isin(qualified_edges)]
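The cutoff of 51 is specific to 68 matrices at 75%. If the number of files or the percentage might change, the cutoff can be derived instead of hard-coded. The sketch below uses integer ceiling division. The function name `min_matrix_count` is an illustrative choice, and the percentage is passed as a whole number to avoid floating-point rounding.

```python
def min_matrix_count(n_matrices, percent):
    """Smallest whole number of matrices that is at least `percent`% of the total.

    Integer ceiling division avoids the floating-point surprises that can
    occur with expressions such as math.ceil(n * 0.07).
    """
    return -(-n_matrices * percent // 100)


print(min_matrix_count(68, 75))  # 51
print(min_matrix_count(68, 70))  # 48
```

In the filter above, `>= 51` could then be written as `>= min_matrix_count(combined_data["Source_Matrix"].nunique(), 75)`.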
Step 5: Merge Edges into Single Rows
You’ve got two common options here, depending on what you need in your final table:
Option A: Wide Table (Show Weight from Each Matrix)
This creates one row per edge, with columns for each matrix’s weight:
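    # Pivot the data to wide format. pivot() raises an error if an edge is listed
    # more than once in the same matrix, so keep one row per (edge, matrix) first
    # (this keeps the first weight encountered for such duplicates).
    deduplicated = filtered_data.drop_duplicates(subset=["Unique_Edge", "Source_Matrix"])
    wide_table = deduplicated.pivot(
        index="Unique_Edge", columns="Source_Matrix", values="Weight"
    ).reset_index()

    # Split the sorted edge tuple back into Node1 and Node2 columns
    wide_table[["Node1", "Node2"]] = pd.DataFrame(
        wide_table["Unique_Edge"].tolist(), index=wide_table.index
    )

    # Reorder columns to put gene names first
    matrix_columns = [
        col for col in wide_table.columns
        if col not in ["Unique_Edge", "Node1", "Node2"]
    ]
    wide_table = wide_table[["Node1", "Node2"] + matrix_columns]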
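To illustrate the shape of the wide table, here is a small made-up example with three matrices and two edges (all gene names and weights are invented). For simplicity it pivots on two node columns that are already in sorted order rather than on a tuple column, which needs pandas 1.1 or later. An edge that is absent from a matrix appears as NaN in that matrix's column, so counting the non-missing cells per row recovers how many matrices support each edge.

```python
import pandas as pd

# Invented toy data: edge GeneA-GeneB is in all three matrices,
# edge GeneC-GeneD only in Matrix_1 and Matrix_3.
toy = pd.DataFrame(
    {
        "Node1": ["GeneA"] * 3 + ["GeneC"] * 2,
        "Node2": ["GeneB"] * 3 + ["GeneD"] * 2,
        "Source_Matrix": ["Matrix_1", "Matrix_2", "Matrix_3", "Matrix_1", "Matrix_3"],
        "Weight": [0.9, 0.4, 0.6, 0.7, 0.2],
    }
)

# One row per edge, one column per matrix; missing combinations become NaN.
toy_wide = toy.pivot(index=["Node1", "Node2"], columns="Source_Matrix", values="Weight")

# Number of matrices in which each edge has a weight (non-NaN cells per row).
support = toy_wide.notna().sum(axis=1)
print(toy_wide)
print(support.tolist())  # [3, 2]
```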
Option B: Aggregated Stats Table (Summarize Weight Variation)
If you just want summary stats (like min/max/mean weight) instead of every matrix’s value:
    aggregated_table = filtered_data.groupby("Unique_Edge").agg(
        Matrix_Count=("Source_Matrix", "nunique"),
        Min_Weight=("Weight", "min"),
        Max_Weight=("Weight", "max"),
        Mean_Weight=("Weight", "mean"),
        Std_Weight=("Weight", "std"),
    ).reset_index()

    # Split edge tuple into Node1/Node2
    aggregated_table[["Node1", "Node2"]] = pd.DataFrame(
        aggregated_table["Unique_Edge"].tolist(), index=aggregated_table.index
    )

    # Reorder columns for readability
    aggregated_table = aggregated_table[
        ["Node1", "Node2", "Matrix_Count", "Min_Weight", "Max_Weight", "Mean_Weight", "Std_Weight"]
    ]
Step 6: Save the Final Table
Export your result to a CSV file (adjust the path as needed):
    # For wide table
    wide_table.to_csv("final_edges_wide.csv", index=False)

    # For aggregated stats table
    aggregated_table.to_csv("final_edges_aggregated.csv", index=False)
Quick Notes to Tweak for Your Setup:
- If your files aren't CSV, use `pd.read_table()` for TSV or adjust the read method to match your file type.
- If your edges are already consistently ordered (e.g., Node1 is always alphabetically before Node2), you can skip the sorting in the `Unique_Edge` standardization step. The column itself is still needed by the later steps, so build it directly from the Node1/Node2 pair.
- Adjust the threshold number (51) if you need a different percentage (e.g., 70% would be 48 matrices).
The question comes from Stack Exchange, where it was asked by user9740934.




