
Combining GFF and Scaffold Information into a Target DataFrame with Python Pandas

To get the clean gene-scaffold association DataFrame you need, the key step is filtering the redundant intron/CDS entries out of your Augustus GFF. Let's break this down step by step.

Solution: Clean Gene-Scaffold Association with Pandas

1. First, Understand Your Data

Augustus GFF files have hierarchical entries: gene is the top-level feature, and introns/CDS/exons are child entries linked to a parent gene. We only care about the gene rows to avoid redundancy. Your scaffold tab file is a straightforward mapping of scaffold IDs to their attributes (length, depth, GC content) — we'll use this as a lookup table.
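To see the redundancy concretely, here is a minimal sketch that counts feature types in a tiny, made-up GFF3-style sample. The scaffold name, coordinates, and IDs are invented for illustration; your real file will have more feature types and rows.

```python
import io

import pandas as pd

# Hypothetical GFF3-style sample: one gene with its child features
sample_gff = "\n".join([
    "# made-up Augustus-like output for illustration",
    "scaffold_1\tAUGUSTUS\tgene\t100\t900\t0.9\t+\t.\tID=g1;",
    "scaffold_1\tAUGUSTUS\ttranscript\t100\t900\t0.9\t+\t.\tID=g1.t1;Parent=g1;",
    "scaffold_1\tAUGUSTUS\tCDS\t100\t400\t0.9\t+\t0\tID=g1.t1.CDS1;Parent=g1.t1;",
    "scaffold_1\tAUGUSTUS\tintron\t401\t599\t0.9\t+\t.\tParent=g1.t1;",
    "scaffold_1\tAUGUSTUS\tCDS\t600\t900\t0.9\t+\t2\tID=g1.t1.CDS2;Parent=g1.t1;",
])

gff_columns = ["seqid", "source", "feature", "start", "end",
               "score", "strand", "phase", "attributes"]

sample_df = pd.read_csv(io.StringIO(sample_gff), sep="\t",
                        comment="#", names=gff_columns)

# One gene produces several rows; only the "gene" row is needed downstream
feature_counts = sample_df["feature"].value_counts()
print(feature_counts)
```

Running the same value_counts() on your own file is a quick way to confirm which feature labels it actually contains before filtering.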

2. Optimized Python Script

Step 1: Import Required Libraries

import pandas as pd

Step 2: Load & Filter the Augustus GFF

We'll read the GFF, define column names, and keep only the gene-level entries. We'll also extract the gene ID from the messy attributes column.

# Define standard GFF column names
gff_columns = ["seqid", "source", "feature", "start", "end", "score", "strand", "phase", "attributes"]

# Load GFF, skip comment lines starting with #
gff_df = pd.read_csv("your_augustus_output.gff", sep="\t", comment="#", names=gff_columns)

# Filter to keep ONLY gene entries (this eliminates intron/CDS/exon redundancy)
gene_only_df = gff_df[gff_df["feature"] == "gene"].copy()

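# Extract gene ID from the attributes column (GFF3-style Augustus output, e.g. with --gff3=on, uses ID=gene_name;...)
# expand=False returns a Series instead of a one-column DataFrame
gene_only_df["gene_id"] = gene_only_df["attributes"].str.extract(r'ID=([^;]+)', expand=False)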
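After filtering, it is worth checking that every gene row yielded an ID and that the IDs are unique. The sketch below runs that check on a made-up two-gene sample; the data and the variable names missing_ids and ids_unique are illustrative, not part of the original script.

```python
import io

import pandas as pd

gff_columns = ["seqid", "source", "feature", "start", "end",
               "score", "strand", "phase", "attributes"]

# Hypothetical rows: two genes on two scaffolds, each with child features
rows = [
    ("scaffold_1", "gene", 100, 900, "ID=g1;"),
    ("scaffold_1", "transcript", 100, 900, "ID=g1.t1;Parent=g1;"),
    ("scaffold_1", "CDS", 100, 900, "ID=g1.t1.CDS1;Parent=g1.t1;"),
    ("scaffold_2", "gene", 50, 700, "ID=g2;"),
    ("scaffold_2", "transcript", 50, 700, "ID=g2.t1;Parent=g2;"),
    ("scaffold_2", "CDS", 50, 700, "ID=g2.t1.CDS1;Parent=g2.t1;"),
]
sample_gff = "\n".join(
    f"{seqid}\tAUGUSTUS\t{feature}\t{start}\t{end}\t0.9\t+\t.\t{attrs}"
    for seqid, feature, start, end, attrs in rows
)

gff_df = pd.read_csv(io.StringIO(sample_gff), sep="\t",
                     comment="#", names=gff_columns)

# Keep gene rows only, then pull the ID out of the attributes column
gene_only_df = gff_df[gff_df["feature"] == "gene"].copy()
gene_only_df["gene_id"] = gene_only_df["attributes"].str.extract(
    r"ID=([^;]+)", expand=False
)

# Sanity checks: no gene without an ID, no ID appearing twice
missing_ids = int(gene_only_df["gene_id"].isna().sum())
ids_unique = bool(gene_only_df["gene_id"].is_unique)
print(f"{len(gff_df)} rows -> {len(gene_only_df)} genes, "
      f"{missing_ids} missing IDs, unique={ids_unique}")
```

If missing_ids is greater than zero on real data, the regex probably does not match the attribute format your file uses.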

Step 3: Load the Scaffold Attribute Table

Adjust this based on whether your tab file has a header or not:

# If your tab file has a header (e.g., scaffold_id, length, coverage_depth, gc_content)
scaffold_attr_df = pd.read_csv("your_scaffold_stats.tab", sep="\t")

# If NO header, manually define column names:
# scaffold_attr_df = pd.read_csv("your_scaffold_stats.tab", sep="\t", names=["scaffold_id", "length", "coverage_depth", "gc_content"])
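One thing to watch when loading the lookup table: if scaffold IDs look numeric, pandas will parse them as integers, and they will then fail to match the string seqid values from the GFF. Below is a minimal sketch, assuming a headerless table with the four columns named above; the sample values are made up.

```python
import io

import pandas as pd

# Hypothetical headerless scaffold table with numeric-looking IDs
headerless_table = (
    "001\t5000\t31.5\t0.41\n"
    "002\t3000\t12.0\t0.38\n"
)

scaffold_columns = ["scaffold_id", "length", "coverage_depth", "gc_content"]

scaffold_attr_df = pd.read_csv(
    io.StringIO(headerless_table),
    sep="\t",
    names=scaffold_columns,
    dtype={"scaffold_id": str},  # keep IDs as text so "001" is not read as 1
)

# A lookup table should list each scaffold exactly once
has_duplicates = bool(scaffold_attr_df["scaffold_id"].duplicated().any())
print(scaffold_attr_df.dtypes)
print("duplicate scaffold IDs:", has_duplicates)
```

The same dtype argument can be passed for the seqid column when reading the GFF, so both merge keys are strings.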

Step 4: Merge Gene & Scaffold Data

We'll link the two DataFrames using the scaffold ID (called seqid in GFF, adjust the left_on/right_on params if your column names differ):

# Merge to associate each gene with its scaffold's attributes
final_association_df = pd.merge(
    gene_only_df,
    scaffold_attr_df,
    left_on="seqid",
    right_on="scaffold_id",  # Replace with your scaffold ID column name if different
    how="left"  # Keeps all genes even if a scaffold has no attribute data (handle missing values later if needed)
)

# Clean up the output to only keep useful columns (customize this to your needs)
final_association_df = final_association_df[["gene_id", "seqid", "length", "coverage_depth", "gc_content", "start", "end", "strand"]]
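To find out which genes did not get scaffold attributes, pd.merge accepts indicator=True, and validate="many_to_one" raises an error if a scaffold ID appears twice in the lookup table. The sketch below uses small made-up DataFrames; the gene and scaffold values are invented for illustration.

```python
import pandas as pd

# Hypothetical gene table (already filtered to gene rows)
genes = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3"],
    "seqid": ["scaffold_1", "scaffold_1", "scaffold_9"],
    "start": [100, 1500, 20],
    "end": [900, 2400, 600],
    "strand": ["+", "-", "+"],
})

# Hypothetical scaffold lookup table; scaffold_9 is deliberately absent
scaffolds = pd.DataFrame({
    "scaffold_id": ["scaffold_1", "scaffold_2"],
    "length": [5000, 3000],
    "coverage_depth": [31.5, 12.0],
    "gc_content": [0.41, 0.38],
})

merged = pd.merge(
    genes,
    scaffolds,
    left_on="seqid",
    right_on="scaffold_id",
    how="left",
    validate="many_to_one",  # many genes per scaffold, one row per scaffold
    indicator=True,          # adds a "_merge" column describing each row
)

# Genes whose scaffold was not found in the lookup table
unmatched = merged.loc[merged["_merge"] == "left_only", "gene_id"].tolist()
print("genes without scaffold attributes:", unmatched)
```

Drop the _merge column before saving once you have reviewed the unmatched genes.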

Step 5: Save the Final Result

final_association_df.to_csv("gene_scaffold_association.tsv", index=False, sep="\t")

3. Key Optimizations Explained

  • Filtering Gene Entries: The line gff_df[gff_df["feature"] == "gene"] is the fix for your redundancy problem — it discards all child features like introns/CDS.
  • Precise Gene ID Extraction: The regex in str.extract(r'ID=([^;]+)') captures everything after ID= up to the next semicolon, so you get the gene name alone without the rest of the attributes column.
  • Left Join Safety: Using how="left" ensures you don't lose any genes if a scaffold is missing from your attribute table (you can fill missing values with fillna() if needed).

4. Edge Case Fixes

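  • If your Augustus GFF uses gene_id= instead of ID= in the attributes column, adjust the regex to: gene_only_df["gene_id"] = gene_only_df["attributes"].str.extract(r'gene_id=([^;]+)'). Check a few gene rows first: Augustus output that is not in GFF3 mode typically writes only the bare gene name (e.g. g1) in the attributes column of gene rows, in which case you can use that column directly: gene_only_df["gene_id"] = gene_only_df["attributes"].str.strip()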
  • If your scaffold ID column name doesn't match seqid, update the left_on/right_on parameters in the merge step to match your actual column names.
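If you are not sure which attribute style your file uses, one option is a small helper that tries several patterns in turn. This is only a sketch: extract_gene_id is a hypothetical helper, and the quoted GTF style (gene_id "g1";) and the bare-name fallback are assumptions about formats you might encounter, so check them against your own file.

```python
import re

import pandas as pd

# Patterns tried in order; the GTF-style one is an assumption
_ID_PATTERNS = [
    re.compile(r"(?:^|;)\s*ID=([^;]+)"),       # GFF3 style: ID=g1;
    re.compile(r'gene_id\s+"([^"]+)"'),        # GTF style: gene_id "g1";
    re.compile(r"(?:^|;)\s*gene_id=([^;]+)"),  # key=value style: gene_id=g1;
]


def extract_gene_id(attributes):
    """Return the gene ID found in a GFF attributes string, or None."""
    text = str(attributes).strip()
    for pattern in _ID_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1).strip()
    # Fallback (assumption): a bare token such as "g1" is the gene name itself
    if text and "=" not in text and " " not in text:
        return text.rstrip(";")
    return None


attrs = pd.Series([
    "ID=g1;",
    'transcript_id "g2.t1"; gene_id "g2";',
    "gene_id=g3;Name=example",
    "g4",
    "Parent=g5.t1;",
])
gene_ids = attrs.map(extract_gene_id)
print(gene_ids.tolist())
```

On gene rows you would apply it as gene_only_df["attributes"].map(extract_gene_id); rows that return None point to a format the patterns do not cover.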

The question originates from Stack Exchange; asked by Grendel.
