Combining GFF and Scaffold Information into a Target DataFrame with Python Pandas

阿华AIGC实验室

2026-5-27

Let's work through this to get the clean gene-to-scaffold association DataFrame you need. The key issue is filtering the redundant intron/CDS entries out of your Augustus GFF, so we'll break it down step by step.

Solution: Clean Gene-Scaffold Association with Pandas

1. First, Understand Your Data

Augustus GFF files are hierarchical: gene is the top-level feature, each gene has one or more transcript rows, and intron/CDS/exon rows are children that trace back to a parent gene. We only need the gene rows, which gives one row per gene and avoids the redundancy. Your scaffold tab file is a straightforward mapping of scaffold IDs to their attributes (length, depth, GC content), so we'll use it as a lookup table.
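To make the hierarchy concrete, here is a minimal sketch that parses a tiny GFF3-style excerpt and counts rows per feature type. The excerpt is made up for illustration (scaffold name, coordinates, scores and IDs are all invented), so treat it as a picture of the layout rather than real Augustus output.

```python
import io

import pandas as pd

# Hypothetical GFF3-style excerpt; every value here is invented for illustration.
sample_gff = (
    "# a comment line, skipped by comment='#'\n"
    "scaffold_1\tAUGUSTUS\tgene\t100\t900\t0.95\t+\t.\tID=g1\n"
    "scaffold_1\tAUGUSTUS\ttranscript\t100\t900\t0.95\t+\t.\tID=g1.t1;Parent=g1\n"
    "scaffold_1\tAUGUSTUS\tCDS\t100\t400\t0.95\t+\t0\tID=g1.t1.cds;Parent=g1.t1\n"
    "scaffold_1\tAUGUSTUS\tintron\t401\t600\t0.95\t+\t.\tParent=g1.t1\n"
    "scaffold_1\tAUGUSTUS\tCDS\t601\t900\t0.95\t+\t0\tID=g1.t1.cds;Parent=g1.t1\n"
)

gff_columns = ["seqid", "source", "feature", "start", "end",
               "score", "strand", "phase", "attributes"]

sample_df = pd.read_csv(io.StringIO(sample_gff), sep="\t", comment="#", names=gff_columns)

# One gene row sits on top of several child rows describing the same locus.
feature_counts = sample_df["feature"].value_counts()
print(feature_counts)
```

Running the same value_counts() on your real file is a quick way to see how many gene rows to expect after filtering.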

2. Optimized Python Script

Step 1: Import Required Libraries

import pandas as pd

Step 2: Load & Filter the Augustus GFF

We'll read the GFF, define column names, and keep only the gene-level entries. We'll also extract the gene ID from the messy attributes column.

# Define standard GFF column names
gff_columns = ["seqid", "source", "feature", "start", "end", "score", "strand", "phase", "attributes"]

# Load GFF, skip comment lines starting with #
gff_df = pd.read_csv("your_augustus_output.gff", sep="\t", comment="#", names=gff_columns)

# Filter to keep ONLY gene entries (this eliminates intron/CDS/exon redundancy)
gene_only_df = gff_df[gff_df["feature"] == "gene"].copy()

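# Extract gene ID from the attributes column (GFF3-style attributes look like ID=gene_name;...)
# expand=False returns a Series, which is the natural type to assign to a single column
gene_only_df["gene_id"] = gene_only_df["attributes"].str.extract(r'ID=([^;]+)', expand=False)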
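Depending on the output options, the gene row's attributes column may not contain an ID= tag at all; some outputs appear to carry just the bare gene name there. The sketch below is one way to cover both cases, assuming those are the only two layouts in your file. Note that extract_gene_id is a hypothetical helper name, not part of pandas or Augustus.

```python
import pandas as pd


def extract_gene_id(attributes: pd.Series) -> pd.Series:
    """Return the value after 'ID=' when present, else the stripped attribute text.

    Assumption: a gene row either carries an ID=<name> tag or only the bare name.
    """
    from_id_tag = attributes.str.extract(r"ID=([^;]+)", expand=False)
    # Rows without an ID= tag come back as NaN; fall back to the raw text for those.
    return from_id_tag.fillna(attributes.str.strip())


# Invented attribute strings covering both assumed layouts.
example_attributes = pd.Series(["ID=g1;", "ID=g2;Name=foo", "g3"])
gene_ids = extract_gene_id(example_attributes)
print(gene_ids.tolist())
```

After extraction it is worth checking gene_ids.isna().sum() on your real data; any remaining NaN means the attributes follow a layout this sketch does not handle.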

Step 3: Load the Scaffold Attribute Table

Adjust this based on whether your tab file has a header or not:

# If your tab file has a header (e.g., scaffold_id, length, coverage_depth, gc_content)
scaffold_attr_df = pd.read_csv("your_scaffold_stats.tab", sep="\t")

# If NO header, manually define column names:
# scaffold_attr_df = pd.read_csv("your_scaffold_stats.tab", sep="\t", names=["scaffold_id", "length", "coverage_depth", "gc_content"])
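Since the scaffold table acts as a lookup table, it helps if each scaffold ID appears only once; a repeated ID would duplicate gene rows after the merge. Here is a small sketch of that check, using invented values and the column names assumed above:

```python
import io

import pandas as pd

# Invented scaffold table with one deliberately repeated ID (scaffold_2).
scaffold_text = (
    "scaffold_id\tlength\tcoverage_depth\tgc_content\n"
    "scaffold_1\t15000\t32.5\t0.41\n"
    "scaffold_2\t8200\t28.1\t0.39\n"
    "scaffold_2\t8200\t28.1\t0.39\n"
)
scaffold_attr_df = pd.read_csv(io.StringIO(scaffold_text), sep="\t")

# List the IDs that occur more than once.
is_repeat = scaffold_attr_df["scaffold_id"].duplicated()
duplicated_ids = scaffold_attr_df.loc[is_repeat, "scaffold_id"].unique().tolist()
print("Repeated scaffold IDs:", duplicated_ids)

# Keeping the first occurrence is one possible policy; inspect the repeats before choosing.
deduplicated_df = scaffold_attr_df.drop_duplicates(subset="scaffold_id", keep="first")
```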

Step 4: Merge Gene & Scaffold Data

We'll link the two DataFrames on the scaffold ID, which is called seqid in the GFF. Adjust the left_on/right_on parameters if your column names differ:

# Merge to associate each gene with its scaffold's attributes
final_association_df = pd.merge(
    gene_only_df,
    scaffold_attr_df,
    left_on="seqid",
    right_on="scaffold_id",  # Replace with your scaffold ID column name if different
    how="left"  # Keeps all genes even if a scaffold has no attribute data (handle missing values later if needed)
)

# Clean up the output to only keep useful columns (customize this to your needs)
final_association_df = final_association_df[["gene_id", "seqid", "length", "coverage_depth", "gc_content", "start", "end", "strand"]]
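If you want the merge itself to report problems, pandas' merge accepts validate and indicator arguments. The following is a sketch on invented toy data (gene and scaffold names are made up): validate="many_to_one" raises an error if a scaffold ID is repeated in the lookup table, and indicator=True adds a _merge column that marks genes whose scaffold was not found.

```python
import pandas as pd

# Toy inputs; all names and numbers are invented for illustration.
genes = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3"],
    "seqid": ["scaffold_1", "scaffold_1", "scaffold_9"],
    "start": [100, 2000, 50],
    "end": [900, 3500, 700],
    "strand": ["+", "-", "+"],
})
scaffolds = pd.DataFrame({
    "scaffold_id": ["scaffold_1", "scaffold_2"],
    "length": [15000, 8200],
    "coverage_depth": [32.5, 28.1],
    "gc_content": [0.41, 0.39],
})

merged = pd.merge(
    genes,
    scaffolds,
    left_on="seqid",
    right_on="scaffold_id",
    how="left",
    validate="many_to_one",  # many genes per scaffold, one row per scaffold
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Genes whose scaffold is absent from the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "gene_id"].tolist()
print("Genes without scaffold attributes:", unmatched)
```

Once you have reviewed the unmatched genes, the _merge column can be dropped before saving.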

Step 5: Save the Final Result

final_association_df.to_csv("gene_scaffold_association.tsv", index=False, sep="\t")

3. Key Optimizations Explained

Filtering Gene Entries: The line gff_df[gff_df["feature"] == "gene"] is the fix for your redundancy problem: it discards every child feature (transcript, intron, CDS, exon) and leaves one row per gene.
Precise Gene ID Extraction: The regex in str.extract(r'ID=([^;]+)') captures everything after ID= up to the next semicolon, so only the gene name is pulled from the free-form attributes column.
Left Join Safety: Using how="left" keeps every gene even if its scaffold is missing from your attribute table; the scaffold columns are NaN for those rows, and you can fill them with fillna() if needed.

4. Edge Case Fixes

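If your Augustus GFF uses gene_id= instead of ID= in the attributes column, adjust the regex to: gene_only_df["gene_id"] = gene_only_df["attributes"].str.extract(r'gene_id=([^;]+)'). GTF-style attributes are written as gene_id "g1"; (a space and quotes rather than =), so for those a pattern such as r'gene_id "([^"]+)"' is needed instead.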
If your scaffold ID column name doesn't match seqid, update the left_on/right_on parameters in the merge step to match your actual column names.
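As an alternative to left_on/right_on, you can rename the scaffold ID column to seqid before merging, which also avoids carrying two copies of the key column in the result. A minimal sketch, where contig is a hypothetical stand-in for whatever your column is actually called:

```python
import pandas as pd

# Toy inputs; "contig" is a made-up column name standing in for your real one.
genes = pd.DataFrame({"gene_id": ["g1"], "seqid": ["scaffold_1"]})
scaffolds = pd.DataFrame({"contig": ["scaffold_1"], "length": [15000]})

# Align the key name with the GFF side, then merge on the shared column.
renamed_scaffolds = scaffolds.rename(columns={"contig": "seqid"})
merged_on_seqid = genes.merge(renamed_scaffolds, on="seqid", how="left")

print(merged_on_seqid.columns.tolist())
```

With on="seqid" the result has a single key column, so no extra clean-up is needed to drop a duplicate scaffold ID column.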

The question discussed here comes from Stack Exchange, where it was asked by user Grendel.