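Fraud Detection for Academic Research: A Technical Consultation on Feature Selection and Engineering for Transaction Datasets

Ahua AIGC Lab (阿华AIGC实验室)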

2026-5-27

Great question. Feature engineering and selection are often the unsung heroes of fraud detection, especially with transactional data such as credit card datasets. Most papers fixate on model architecture, but well-chosen features can let a simple model match or beat a far more complex one. Let’s break this down for credit card fraud specifically: key feature types, construction methods, and public datasets that retain enough raw structure for feature engineering.

1. Key Feature Types for Credit Card Fraud Detection

These are the categories of features you’ll want to prioritize, rooted in fraud detection domain knowledge:

Raw Transaction Features (directly from the dataset):
- Transaction amount, currency, merchant category code (MCC), card brand, card type (debit/credit), transaction channel (in-store, online, mobile)
- Why they matter: Fraudsters often target high-value transactions or specific high-risk MCCs (like luxury goods or money transfers)
Temporal Features (derived from transaction timestamps):
- Time of day (e.g., midnight to 6 a.m. is often treated as higher risk for online transactions, but check this against your own data), day of week, month of year
- Time since last transaction for the same card/user, time since first transaction for the card/user
- Transaction frequency in sliding windows (e.g., number of transactions in the last hour/day)
User/Card-Level Aggregate Features:
- Average, median, maximum, minimum transaction amount for the user over the last 30 days
- Standard deviation of transaction amounts (measures spending consistency)
- Number of unique merchants visited in the last week, number of different MCCs used
Anomaly-Detection Features:
- Transaction amount vs. user’s 95th percentile of historical amounts (flag if it’s an outlier)
- Whether the transaction occurred in a country/region the user has never used before
- Whether the merchant is new to the user
Meta Features:
- Card age (time since card issuance), user account age
- Number of previous fraud attempts (if available) for the card/user

2. Feature Engineering & Selection Techniques (Step-by-Step)

Feature Construction

Preprocess Raw Data:
- Handle missing values (e.g., impute missing MCC with "unknown" or mode)
- Transform skewed features like transaction amount with a log transformation (log(amount + 1) to avoid zero issues)
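The two preprocessing steps above can be sketched as follows. This is a minimal illustration on made-up data; the column names `mcc` and `amount` are assumptions, not fields of any particular dataset.

```python
import numpy as np
import pandas as pd

# Toy data; the column names `mcc` and `amount` are assumptions for illustration.
df = pd.DataFrame({
    "mcc": ["5411", None, "5812", None],
    "amount": [0.0, 12.5, 250.0, 9999.0],
})

# Keep missingness as its own category instead of guessing a value.
df["mcc"] = df["mcc"].fillna("unknown")

# log1p(x) = log(1 + x), so zero-amount transactions map to 0 rather than -inf.
df["log_amount"] = np.log1p(df["amount"])
```

Using "unknown" rather than the mode keeps the information that the MCC was missing, which can itself be a signal.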
Build Temporal Features:
- Extract hour/day/month from the timestamp column using datetime accessors (e.g., in Pandas: df['hour'] = df['transaction_time'].dt.hour)
- Calculate inter-transaction time, after sorting the frame by user_id and transaction_time so that each row is compared with the same user's previous transaction: df['time_since_last_transaction'] = df.groupby('user_id')['transaction_time'].diff().dt.total_seconds()
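A small end-to-end sketch of the temporal features described above. The data is invented, and the night-time window (00:00 to 05:59) is an assumed threshold that should be tuned on real data.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["a", "b", "a", "a"],
    "transaction_time": pd.to_datetime([
        "2024-01-01 23:30:00",
        "2024-01-02 08:00:00",
        "2024-01-02 00:15:00",
        "2024-01-06 12:00:00",
    ]),
})

# Sort so that diff() compares each transaction with the user's previous one.
df = df.sort_values(["user_id", "transaction_time"]).reset_index(drop=True)

df["hour"] = df["transaction_time"].dt.hour
df["day_of_week"] = df["transaction_time"].dt.dayofweek  # Monday = 0
df["is_night"] = df["hour"].between(0, 5)  # assumed 00:00-05:59 window

# Seconds since the same user's previous transaction; NaN for the first one.
df["time_since_last_transaction"] = (
    df.groupby("user_id")["transaction_time"].diff().dt.total_seconds()
)
```

The first transaction of each user has no predecessor, so the gap is NaN; tree-based models can usually handle that directly, while linear models need an explicit fill value or indicator.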
Create Aggregate Features with Sliding Windows:
- Use rolling window functions to compute metrics over time (e.g., 1-hour, 24-hour, 7-day windows):
```
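# Example: average amount in the last 24 hours per user (the window includes the current row)
# Series.rolling has no `on` argument, so use transaction_time as the index within each user group
df = df.sort_values(['user_id', 'transaction_time'])
df['avg_amount_last_24h'] = (
    df.set_index('transaction_time').groupby('user_id')['amount']
      .rolling('24h').mean().to_numpy()  # row order matches df because df is sorted the same way
)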
```
- Compute cumulative metrics (e.g., total transactions ever for the user, total fraud flags)
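Cumulative per-user metrics can be sketched like this. The column name `is_fraud` is an assumption, and the example presumes that fraud labels are known at the time of the next transaction, which in practice is often not true because chargebacks arrive with a delay.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "transaction_time": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-05",
    ]),
    "is_fraud": [0, 1, 0, 0, 0],
})
df = df.sort_values(["user_id", "transaction_time"]).reset_index(drop=True)

# Number of earlier transactions by the same user (0 for the first one).
df["prior_txn_count"] = df.groupby("user_id").cumcount()

# Fraud labels seen strictly before the current row:
# cumulative sum per user minus the current row's own label.
df["prior_fraud_count"] = (
    df.groupby("user_id")["is_fraud"].cumsum() - df["is_fraud"]
)
```

Subtracting the current label matters: including it would leak the target into the feature.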
Add Anomaly & Interaction Features:
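- Flag outliers: df['is_amount_outlier'] = df['amount'] > df.groupby('user_id')['amount'].transform(lambda x: x.quantile(0.95)). Note that this quantile is taken over each user's full history, including later transactions, so for modeling it is safer to compute it from prior transactions only
- Create interaction features (assuming is_new_merchant has already been built as a boolean column): df['high_amount_new_merchant'] = df['is_amount_outlier'] & df['is_new_merchant']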
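One way to build these flags without looking into the future is sketched below. It is an illustration under assumptions: the column names are invented, and the minimum history of 3 prior transactions is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["a"] * 5,
    "transaction_time": pd.date_range("2024-01-01", periods=5, freq="D"),
    "merchant_id": ["m1", "m1", "m2", "m1", "m3"],
    "amount": [10.0, 12.0, 11.0, 11.5, 500.0],
})
df = df.sort_values(["user_id", "transaction_time"]).reset_index(drop=True)

# 95th percentile of the user's earlier amounts only: shift(1) drops the current row.
prior_p95 = df.groupby("user_id")["amount"].transform(
    lambda s: s.shift(1).expanding(min_periods=3).quantile(0.95)
)
# Comparisons against NaN are False, so rows with too little history are not flagged.
df["is_amount_outlier"] = df["amount"] > prior_p95

# True the first time a (user, merchant) pair appears in the sorted history.
df["is_new_merchant"] = ~df.duplicated(["user_id", "merchant_id"])

df["high_amount_new_merchant"] = df["is_amount_outlier"] & df["is_new_merchant"]
```

In this toy data only the last row, a 500.0 payment at a merchant the user has never visited, triggers the interaction flag.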

Feature Selection

Since fraud datasets are highly imbalanced, focus on features that separate fraud from legitimate transactions:

Domain Knowledge Filtering: Drop irrelevant features (e.g., card color if it doesn’t correlate with fraud)
Statistical Tests: Use the chi-squared test for categorical features and mutual information for numerical features to measure dependence on the fraud label (mutual information also picks up non-linear relationships that a correlation coefficient misses)
Model-Based Selection: Use tree-based models (Random Forest, XGBoost) to get feature importance scores, or L1 regularization (Logistic Regression with penalty='l1' and a solver that supports it, such as liblinear or saga) to zero out irrelevant features
Class-Specific Analysis: Check if a feature has a significantly different distribution for fraud vs. legitimate transactions (e.g., fraud transactions are more likely to be in the middle of the night)
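The statistical and model-based checks above might look like this with scikit-learn. Everything here is synthetic: the feature names, the 5% fraud rate, and the effect sizes are assumptions chosen so that two features are informative and one is noise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.05).astype(int)  # roughly 5% fraud in this toy sample

X = pd.DataFrame({
    # Informative: fraud rows are mostly night-time in this synthetic data.
    "is_night": np.where(y == 1, rng.random(n) < 0.8, rng.random(n) < 0.2).astype(int),
    # Informative: fraud amounts are shifted upward.
    "log_amount": rng.normal(3.0, 1.0, n) + 2.0 * y,
    # Pure noise.
    "noise": rng.normal(0.0, 1.0, n),
})

# Chi-squared needs non-negative inputs, so apply it to the binary column only.
chi2_stat, chi2_p = chi2(X[["is_night"]], y)

# Mutual information between each feature and the label.
mi = pd.Series(
    mutual_info_classif(
        X, y, discrete_features=np.array([True, False, False]), random_state=0
    ),
    index=X.columns,
)

# L1-penalised logistic regression; smaller C means stronger shrinkage.
X_scaled = StandardScaler().fit_transform(X)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05, class_weight="balanced")
l1.fit(X_scaled, y)
coefs = pd.Series(l1.coef_[0], index=X.columns)
```

Comparing `mi` and `coefs` side by side is a quick sanity check: a feature that scores near zero on both is a candidate for removal, while disagreement between the two often points to a non-linear effect.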

3. Non-Anonymized Public Credit Card Transaction Datasets

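Anonymized datasets (like the widely used ULB credit card fraud dataset on Kaggle, whose features are PCA components V1-V28 plus Time and Amount, with no card or user ID) limit feature engineering because you can’t build user-level aggregates. Here are options that expose more raw structure: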

Synthetic Financial Datasets For Fraud Detection:
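- Synthetic data generated by the PaySim simulator. Note that it models mobile money transfers rather than card payments, and its columns are step (an hourly time step), type, amount, nameOrig, nameDest, the origin and destination balances before and after the transaction, isFraud, and isFlaggedFraud
- The nameOrig/nameDest account identifiers let you practice account-level aggregate features; check how often each account recurs before relying on them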
IEEE-CIS Fraud Detection Competition Dataset:
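- Combines a transaction table (TransactionDT, TransactionAmt, ProductCD, card1-card6, addr1/addr2, email domains) with an identity table holding device and network information (DeviceType, DeviceInfo, and the id_* columns)
- Only partly non-anonymized: many columns (the C, D, M, and V groups) are masked and there is no explicit user ID, so user-level aggregates usually rely on a proxy key built from columns such as card1 and addr1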
PayPal Fraud Detection Dataset:
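- Described as real-world transaction data that retains user/merchant identifiers for aggregation, with features like transaction status, payment method, and internal risk scores
- I am not aware of an official public release under this name from PayPal, so verify the source and license of any copy before building on it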

Pro Tips

Always validate features on a holdout set, ideally one that lies later in time than the training data, to avoid overfitting and look-ahead leakage
For imbalanced data, use metrics like precision-recall AUC instead of accuracy when evaluating feature utility
Don’t overlook categorical feature encoding: use target encoding for high-cardinality features like merchant_id (since one-hot encoding would explode dimensionality), and fit the encoding on training data only (or out-of-fold) so the label does not leak into the feature
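The three tips can be combined in one small sketch: a time-ordered holdout, a smoothed target encoding fitted on the training part only, and average precision (a common estimate of precision-recall AUC) as the utility metric. The data is synthetic, and the smoothing strength `m = 20` and the "risky merchants" rule are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
n = 5000
merchants = rng.integers(0, 50, n)
# Synthetic labels: merchants 0-4 are assumed to be risky (20% fraud vs 1%).
y = (rng.random(n) < np.where(merchants < 5, 0.20, 0.01)).astype(int)
df = pd.DataFrame({"merchant_id": merchants, "is_fraud": y})  # rows assumed time-ordered

# Time-based holdout: fit on the earlier 80%, score on the later 20%.
cut = int(n * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]

# Smoothed target encoding: shrink each merchant's fraud rate toward the global rate.
prior, m = train["is_fraud"].mean(), 20  # m is an assumed smoothing strength
stats = train.groupby("merchant_id")["is_fraud"].agg(["sum", "count"])
encoding = (stats["sum"] + m * prior) / (stats["count"] + m)

# Merchants unseen in training fall back to the global rate.
test_feature = test["merchant_id"].map(encoding).fillna(prior)

pr_auc = average_precision_score(test["is_fraud"], test_feature)
baseline = test["is_fraud"].mean()  # a random ranking scores about the fraud rate
```

Comparing `pr_auc` with `baseline` shows how much ranking power the single encoded feature carries. When the encoded feature is also needed for the training rows themselves, compute it out-of-fold rather than reusing `encoding`, since each training row contributed its own label to that table.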

The question behind this write-up comes from Stack Exchange, where it was asked by Diego.