
R: Efficiently Filtering Edge Rows That Recur in ≥75% of the Matrices in a List

Got it, let's tackle this step by step. You have 68 edge-list matrices, each with thousands of rows of gene-gene interactions (Node1, Node2, weight), and you want a final table in which every edge that appears in at least 75% of the matrices (51 or more, since 68 × 0.75 = 51) is merged into a single row, even when its weight differs across matrices.

The question mentions R, but Python's pandas library handles this kind of tabular processing well. Here's a practical, customizable approach:

Step 1: Import Required Libraries

First, grab the tools we need:

import pandas as pd
import glob

Step 2: Load All Matrix Files

Assuming your edge lists are stored as CSV/TSV files in a single folder, we’ll load them all into a combined dataframe and track which matrix each row comes from:

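# Adjust the file path and pattern to match your files (e.g., "*.tsv" for tab-separated)
# sorted() keeps the Matrix_N numbering reproducible; glob returns files in arbitrary order
file_list = sorted(glob.glob("path/to/your/matrix/files/*.csv"))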

# Load each file and tag rows with their source matrix ID
all_matrices = []
for matrix_num, file in enumerate(file_list, 1):
    # If your files have a header row, add `header=0` so these names replace it
    # (removing `names=` only works if the header already reads Node1, Node2, Weight)
    df = pd.read_csv(file, names=["Node1", "Node2", "Weight"])
    df["Source_Matrix"] = f"Matrix_{matrix_num}"
    all_matrices.append(df)

# Combine all matrices into one big dataframe
combined_data = pd.concat(all_matrices, ignore_index=True)

Step 3: Standardize Edge Representation

Gene pairs like GeneA-GeneB and GeneB-GeneA are the same biological edge, so we’ll standardize them to avoid duplicate entries:

# Create a sorted tuple of Node1/Node2 to represent each unique edge
combined_data["Unique_Edge"] = combined_data.apply(
    lambda row: tuple(sorted([row["Node1"], row["Node2"]])),
    axis=1
)
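
Row-wise `apply` can get slow once the combined table holds millions of rows. As a rough sketch of a vectorized alternative, the toy example below compares the two node columns once and swaps them where needed. The column names `Edge_A` and `Edge_B` and the sample genes are made up for illustration; grouping on these two columns would stand in for grouping on a single tuple column.

```python
import numpy as np
import pandas as pd

# Toy edge list (hypothetical data): rows 0 and 1 are the same undirected edge
edges = pd.DataFrame({
    "Node1": ["GeneB", "GeneA", "GeneC"],
    "Node2": ["GeneA", "GeneB", "GeneD"],
    "Weight": [0.9, 0.8, 0.5],
})

# True where the pair is in reverse alphabetical order and must be swapped
swap = edges["Node1"] > edges["Node2"]

# Edge_A always holds the alphabetically smaller name, Edge_B the larger
edges["Edge_A"] = np.where(swap, edges["Node2"], edges["Node1"])
edges["Edge_B"] = np.where(swap, edges["Node1"], edges["Node2"])
```

After this, `GeneB-GeneA` and `GeneA-GeneB` share the same (`Edge_A`, `Edge_B`) pair, so they count as one edge.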

Step 4: Filter Edges That Meet the 75% Threshold

We’ll keep only edges that appear in 51 or more matrices:

# Count how many unique matrices each edge appears in
edge_matrix_counts = combined_data.groupby("Unique_Edge")["Source_Matrix"].nunique()

# Filter edges that meet the 75% requirement
qualified_edges = edge_matrix_counts[edge_matrix_counts >= 51].index

# Keep only these qualified edges in our dataset
filtered_data = combined_data[combined_data["Unique_Edge"].isin(qualified_edges)]
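
To make the counting logic concrete, here is a small sketch on invented data with 4 matrices instead of 68, so the 75% cutoff becomes 3. It assumes the edge is stored as two already-sorted columns (hypothetical names `Edge_A` and `Edge_B`) rather than one tuple column. Note that `nunique` counts distinct matrices, so an edge listed twice in the same matrix is only counted once.

```python
import pandas as pd

# Toy data (hypothetical): GeneA-GeneB is in 3 matrices,
# GeneC-GeneD is listed twice but only in Matrix_1
toy = pd.DataFrame({
    "Edge_A": ["GeneA", "GeneA", "GeneA", "GeneC", "GeneC"],
    "Edge_B": ["GeneB", "GeneB", "GeneB", "GeneD", "GeneD"],
    "Source_Matrix": ["Matrix_1", "Matrix_2", "Matrix_3", "Matrix_1", "Matrix_1"],
})

n_matrices = 4
min_count = (n_matrices * 75 + 99) // 100  # integer ceiling of 75% of 4, i.e. 3

# Number of distinct matrices each edge appears in
counts = toy.groupby(["Edge_A", "Edge_B"])["Source_Matrix"].nunique()

# Keep only edges that reach the cutoff
kept = counts[counts >= min_count].reset_index(name="Matrix_Count")
```

Only GeneA-GeneB survives here; GeneC-GeneD has two rows but a distinct-matrix count of 1.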

Step 5: Merge Edges into Single Rows

You’ve got two common options here, depending on what you need in your final table:

Option A: Wide Table (Show Weight from Each Matrix)

This creates one row per edge, with columns for each matrix’s weight:

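# Pivot the data to wide format.
# pivot() raises a ValueError if an edge occurs more than once in the same matrix
# (e.g., both A-B and B-A are listed), so keep only the first occurrence per matrix.
wide_table = filtered_data.drop_duplicates(
    subset=["Unique_Edge", "Source_Matrix"]
).pivot(
    index="Unique_Edge",
    columns="Source_Matrix",
    values="Weight"
).reset_index()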

# Split the sorted edge tuple back into Node1 and Node2 columns
wide_table[["Node1", "Node2"]] = pd.DataFrame(
    wide_table["Unique_Edge"].tolist(),
    index=wide_table.index
)

# Reorder columns to put gene names first
wide_table = wide_table[["Node1", "Node2"] + [col for col in wide_table.columns if col not in ["Unique_Edge", "Node1", "Node2"]]]
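
If the same edge can occur more than once within one matrix and you would rather average those weights than pick one, `pivot_table` with an aggregation function is one option. The sketch below uses invented data and assumes the edge is held in two sorted columns with the hypothetical names `Edge_A` and `Edge_B`; averaging is an assumption, not something the question specifies.

```python
import pandas as pd

# Toy data (hypothetical): GeneA-GeneB occurs twice in Matrix_1
toy = pd.DataFrame({
    "Edge_A": ["GeneA", "GeneA", "GeneA", "GeneC"],
    "Edge_B": ["GeneB", "GeneB", "GeneB", "GeneD"],
    "Source_Matrix": ["Matrix_1", "Matrix_1", "Matrix_2", "Matrix_2"],
    "Weight": [0.9, 0.7, 0.5, 0.4],
})

# One row per edge, one column per matrix; duplicates within a matrix are averaged
wide = toy.pivot_table(
    index=["Edge_A", "Edge_B"],
    columns="Source_Matrix",
    values="Weight",
    aggfunc="mean",
).reset_index()
```

Edges missing from a matrix show up as NaN in that matrix's column.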

Option B: Aggregated Stats Table (Summarize Weight Variation)

If you just want summary stats (like min/max/mean weight) instead of every matrix’s value:

aggregated_table = filtered_data.groupby("Unique_Edge").agg(
    Matrix_Count=("Source_Matrix", "nunique"),
    Min_Weight=("Weight", "min"),
    Max_Weight=("Weight", "max"),
    Mean_Weight=("Weight", "mean"),
    Std_Weight=("Weight", "std")
).reset_index()

# Split edge tuple into Node1/Node2
aggregated_table[["Node1", "Node2"]] = pd.DataFrame(
    aggregated_table["Unique_Edge"].tolist(),
    index=aggregated_table.index
)

# Reorder columns for readability
aggregated_table = aggregated_table[["Node1", "Node2", "Matrix_Count", "Min_Weight", "Max_Weight", "Mean_Weight", "Std_Weight"]]

Step 6: Save the Final Table

Export your result to a CSV file (adjust the path as needed):

# For wide table
wide_table.to_csv("final_edges_wide.csv", index=False)

# For aggregated stats table
aggregated_table.to_csv("final_edges_aggregated.csv", index=False)

Quick Notes to Tweak for Your Setup:

  • If your files aren’t CSV, use pd.read_table() for TSV or adjust the read method to match your file type.
  • If your edges are already consistently ordered (e.g., Node1 is always alphabetically before Node2), you can skip the Unique_Edge standardization step.
  • Adjust the threshold number (51) if you need a different percentage (e.g., 70% would be 48 matrices).
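
Rather than hard-coding the cutoff, you could compute it from the number of matrices. The helper below is a minimal sketch (the name `min_matrices` is made up); it takes the percentage as a whole number and uses integer arithmetic so the rounding up does not depend on floating-point behavior.

```python
def min_matrices(n_matrices: int, percent: int) -> int:
    """Smallest matrix count that is at least `percent`% of n_matrices."""
    # Integer ceiling division: ceil(n_matrices * percent / 100)
    return (n_matrices * percent + 99) // 100


threshold_75 = min_matrices(68, 75)  # 68 * 0.75 = 51 exactly
threshold_70 = min_matrices(68, 70)  # 68 * 0.70 = 47.6, rounded up
```

With 68 matrices this gives 51 for 75% and 48 for 70%, matching the numbers above.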

This question comes from Stack Exchange; it was asked by user9740934.
