Combining GFF and Scaffold Information with Python Pandas to Build a Target DataFrame
Got it, let's work through this to get you the clean gene-scaffold association DataFrame you need. The key issue is filtering the redundant intron/CDS entries out of your Augustus GFF, so let's break it down step by step.
1. First, Understand Your Data
Augustus GFF files have hierarchical entries: gene is the top-level feature, and introns/CDS/exons are child entries linked to a parent gene. We only care about the gene rows to avoid redundancy. Your scaffold tab file is a straightforward mapping of scaffold IDs to their attributes (length, depth, GC content) — we'll use this as a lookup table.
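To make the hierarchy concrete, here's a minimal sketch that parses a tiny GFF3-style fragment and counts the feature types. The scaffold name, coordinates and scores are invented for illustration and are not real Augustus output; your own file may use different feature names.

```python
import io
import pandas as pd

# Made-up GFF3-style fragment: one gene plus its child features.
sample_gff = (
    "# hypothetical example, not real Augustus output\n"
    "scaffold_1\tAUGUSTUS\tgene\t100\t900\t0.95\t+\t.\tID=g1;\n"
    "scaffold_1\tAUGUSTUS\tmRNA\t100\t900\t0.95\t+\t.\tID=g1.t1;Parent=g1;\n"
    "scaffold_1\tAUGUSTUS\tCDS\t100\t400\t0.95\t+\t0\tID=g1.t1.CDS1;Parent=g1.t1;\n"
    "scaffold_1\tAUGUSTUS\tintron\t401\t599\t0.95\t+\t.\tParent=g1.t1;\n"
    "scaffold_1\tAUGUSTUS\tCDS\t600\t900\t0.95\t+\t0\tID=g1.t1.CDS2;Parent=g1.t1;\n"
)

gff_columns = ["seqid", "source", "feature", "start", "end",
               "score", "strand", "phase", "attributes"]

# comment="#" skips the leading comment line.
gff_df = pd.read_csv(io.StringIO(sample_gff), sep="\t", comment="#", names=gff_columns)

# How many rows of each feature type does the file contain?
feature_counts = gff_df["feature"].value_counts()
print(feature_counts)

# Only the top-level gene rows are needed for the gene-scaffold table.
gene_only_df = gff_df[gff_df["feature"] == "gene"]
print(len(gff_df), "rows in total,", len(gene_only_df), "gene row(s)")
```

Running value_counts() on your real file first is a quick way to confirm which feature labels it actually uses before you filter.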
2. Optimized Python Script
Step 1: Import Required Libraries
import pandas as pd
Step 2: Load & Filter the Augustus GFF
We'll read the GFF, define column names, and keep only the gene-level entries. We'll also extract the gene ID from the messy attributes column.
# Define standard GFF column names
gff_columns = ["seqid", "source", "feature", "start", "end", "score", "strand", "phase", "attributes"]

# Load GFF, skip comment lines starting with #
gff_df = pd.read_csv("your_augustus_output.gff", sep="\t", comment="#", names=gff_columns)

# Filter to keep ONLY gene entries (this eliminates intron/CDS/exon redundancy)
gene_only_df = gff_df[gff_df["feature"] == "gene"].copy()

# Extract gene ID from the attributes column (Augustus uses format: ID=gene_name;...)
gene_only_df["gene_id"] = gene_only_df["attributes"].str.extract(r'ID=([^;]+)')
Step 3: Load the Scaffold Attribute Table
Adjust this based on whether your tab file has a header or not:
# If your tab file has a header (e.g., scaffold_id, length, coverage_depth, gc_content)
scaffold_attr_df = pd.read_csv("your_scaffold_stats.tab", sep="\t")

# If NO header, manually define column names:
# scaffold_attr_df = pd.read_csv("your_scaffold_stats.tab", sep="\t", names=["scaffold_id", "length", "coverage_depth", "gc_content"])
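If you'd rather not check for the header by hand, one possible approach is to peek at the first row and see whether its second field parses as a number. This is only a sketch: load_scaffold_table and DEFAULT_COLUMNS are hypothetical names introduced here, and the heuristic assumes the second column (length) is always numeric in the data rows.

```python
import io
import pandas as pd

# Hypothetical default column names; rename to match your own table.
DEFAULT_COLUMNS = ["scaffold_id", "length", "coverage_depth", "gc_content"]


def _is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def load_scaffold_table(source, names=DEFAULT_COLUMNS):
    """Read a tab-separated scaffold table from a path or seekable text buffer.

    The first row is treated as a header only when its second field is
    not numeric (assumption: data rows carry a numeric length there).
    """
    first_row = pd.read_csv(source, sep="\t", header=None, nrows=1, dtype=str)
    if hasattr(source, "seek"):
        source.seek(0)  # rewind in-memory buffers before the real read
    has_header = not _is_number(first_row.iloc[0, 1])
    if has_header:
        return pd.read_csv(source, sep="\t")
    return pd.read_csv(source, sep="\t", names=list(names))


# Demo with two in-memory tables (invented values).
with_header = io.StringIO(
    "scaffold_id\tlength\tcoverage_depth\tgc_content\n"
    "scaffold_1\t5000\t31.2\t0.41\n"
)
no_header = io.StringIO("scaffold_1\t5000\t31.2\t0.41\n")

table_a = load_scaffold_table(with_header)
table_b = load_scaffold_table(no_header)
print(table_a.equals(table_b))
```

When the file does have a header, the column names come from the file itself, so they may still differ from the ones used in the merge step.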
Step 4: Merge Gene & Scaffold Data
We'll link the two DataFrames using the scaffold ID (called seqid in GFF, adjust the left_on/right_on params if your column names differ):
# Merge to associate each gene with its scaffold's attributes
final_association_df = pd.merge(
    gene_only_df,
    scaffold_attr_df,
    left_on="seqid",
    right_on="scaffold_id",  # Replace with your scaffold ID column name if different
    how="left"  # Keeps all genes even if a scaffold has no attribute data (handle missing values later if needed)
)

# Clean up the output to only keep useful columns (customize this to your needs)
final_association_df = final_association_df[["gene_id", "seqid", "length", "coverage_depth", "gc_content", "start", "end", "strand"]]
Step 5: Save the Final Result
final_association_df.to_csv("gene_scaffold_association.tsv", index=False, sep="\t")
3. Key Optimizations Explained
- Filtering Gene Entries: The line gff_df[gff_df["feature"] == "gene"] is the fix for your redundancy problem; it discards all child features like introns/CDS.
- Precise Gene ID Extraction: Using regex (str.extract(r'ID=([^;]+)')) pulls the exact gene name from the unstructured attributes column, no extra noise.
- Left Join Safety: Using how="left" ensures you don't lose any genes if a scaffold is missing from your attribute table (you can fill missing values with fillna() if needed).
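To see what the left join does when a scaffold is missing from the attribute table, here's a small sketch on invented data. It uses two optional pd.merge parameters that are not part of the script above: indicator=True, which adds a _merge column marking unmatched rows, and validate="many_to_one", which raises an error if a scaffold ID appears more than once in the attribute table (duplicates would otherwise multiply your gene rows).

```python
import pandas as pd

# Invented example data: gene g3 sits on a scaffold with no attribute row.
genes = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3"],
    "seqid": ["scaffold_1", "scaffold_1", "scaffold_9"],
})
scaffolds = pd.DataFrame({
    "scaffold_id": ["scaffold_1"],
    "length": [5000],
    "coverage_depth": [31.2],
    "gc_content": [0.41],
})

merged = genes.merge(
    scaffolds,
    left_on="seqid",
    right_on="scaffold_id",
    how="left",              # keep every gene row
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # fail loudly if scaffold_id is duplicated
)

# Genes whose scaffold had no match in the attribute table.
unmatched = merged.loc[merged["_merge"] == "left_only", "gene_id"].tolist()
print("Genes without scaffold attributes:", unmatched)

# The row count is unchanged, and unmatched genes carry NaN attributes.
print(len(merged) == len(genes))
print(merged[["gene_id", "seqid", "length"]])
```

Listing the unmatched genes before calling fillna() tells you whether the gaps are expected or point to a naming mismatch between the two files.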
4. Edge Case Fixes
- If your Augustus GFF uses gene_id= instead of ID= in the attributes column, adjust the regex to: gene_only_df["gene_id"] = gene_only_df["attributes"].str.extract(r'gene_id=([^;]+)')
- If the scaffold ID column in your attribute table isn't named scaffold_id, update the left_on/right_on parameters in the merge step to match your actual column names.
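If you're not sure which attribute style your file uses, a single pattern can cover both keys. The sketch below is an assumption-laden example rather than a drop-in fix: it handles ID= and gene_id=, and falls back to the raw attribute text only when it contains no "=" at all (some Augustus output modes put just the bare gene name on gene rows). The sample strings are invented.

```python
import pandas as pd

# Invented attribute strings covering three possible styles.
attributes = pd.Series(["ID=g1;", "gene_id=g2;Name=foo", "g3"])

# Match either key at the start of the field or right after a semicolon.
extracted = attributes.str.extract(r"(?:^|;)\s*(?:ID|gene_id)=([^;]+)", expand=False)

# Fallback: keep the raw text only when it has no key=value pairs at all.
has_key_value = attributes.str.contains("=", regex=False, na=False)
bare_name = attributes.str.strip().where(~has_key_value)

gene_id = extracted.fillna(bare_name)
print(gene_id.tolist())
```

Any row that still ends up as NaN after this step has an attribute format the pattern doesn't cover, so it's worth inspecting those rows directly.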
This question comes from Stack Exchange; it was asked by Grendel.




