PyTorch beginner help: how to build a speech recognition dataset from custom WAV files (without relying on the official torch.datasets)
Hey there! I get that you're just starting out with PyTorch and want to adapt your speech recognition code to use your own WAV files instead of the official SpeechCommands dataset. Let's break this down step by step; nothing too complex, we'll keep it straightforward.
First, the core idea for custom datasets in PyTorch is to inherit from `torch.utils.data.Dataset` and implement three key methods: `__init__`, `__len__`, and `__getitem__`. Let's assume your dataset has a common folder structure where each subfolder represents a speech category, with all WAV files for that category inside it (like the example below). If your structure is different, we can adjust the code later!
```
my_custom_speech_data/
├── cat/
│   ├── meow_1.wav
│   ├── meow_2.wav
│   └── ...
├── dog/
│   ├── bark_1.wav
│   ├── bark_2.wav
│   └── ...
└── bird/
    ├── chirp_1.wav
    └── ...
```
Here's the modified code with detailed explanations:
First, import the necessary libraries (we'll keep most of your original imports plus a few extras):
```python
import torch
from torch import nn, optim
import torch.nn.functional as F
import torchaudio
from torch.utils.data import Dataset, DataLoader, random_split
from pathlib import Path
```
Next, define your custom dataset class:
```python
class CustomSpeechDataset(Dataset):
    def __init__(self, data_dir, target_sample_rate=16000):
        super().__init__()
        self.data_dir = Path(data_dir)
        self.target_sample_rate = target_sample_rate

        # 1. Auto-detect speech categories from subfolder names
        self.classes = sorted(
            folder.name for folder in self.data_dir.iterdir() if folder.is_dir()
        )
        # 2. Map category names to numerical labels (required for model training)
        self.class_to_idx = {cls: idx for idx, cls in enumerate(self.classes)}

        # 3. Load all WAV file paths and their corresponding labels
        self.audio_paths = []
        self.labels = []
        for cls in self.classes:
            cls_dir = self.data_dir / cls
            # Grab all WAV files in the category folder
            for wav_path in cls_dir.glob("*.wav"):
                self.audio_paths.append(wav_path)
                self.labels.append(self.class_to_idx[cls])

    def __len__(self):
        # Return the total number of samples in the dataset
        return len(self.audio_paths)

    def __getitem__(self, idx):
        # Load and process a single sample
        audio_path = self.audio_paths[idx]
        label = self.labels[idx]

        # Read the WAV file
        waveform, sample_rate = torchaudio.load(audio_path)

        # Standardize the sample rate (critical for consistent model input)
        if sample_rate != self.target_sample_rate:
            waveform = torchaudio.functional.resample(
                waveform, orig_freq=sample_rate, new_freq=self.target_sample_rate
            )

        # Convert multi-channel audio (e.g. stereo) to mono if needed
        if waveform.shape[0] > 1:
            waveform = torch.mean(waveform, dim=0, keepdim=True)

        # Return a structure similar to your original code
        return waveform, self.target_sample_rate, label, audio_path.name
```
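One optional optimization: `torchaudio.functional.resample` computes its resampling kernel on every call, while `torchaudio.transforms.Resample` caches it. If most of your files share a known source rate, you can build the transform once and reuse it. A minimal sketch, assuming a 44.1 kHz source rate (the value is just an example):

```python
import torchaudio

# Build the resampler once; the cached kernel makes repeated calls faster.
# 44100 Hz is an assumed example source rate, not something from your data.
resampler = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)

waveform, sample_rate = torchaudio.load("my_custom_speech_data/cat/meow_1.wav")
if sample_rate == 44100:
    waveform = resampler(waveform)
```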
Now replace your original dataset initialization code with this:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Replace this path with your actual dataset folder path
DATA_DIR = "./my_custom_speech_data"

# Load the full dataset
full_dataset = CustomSpeechDataset(DATA_DIR)

# Split into training and testing sets (80-20 split example)
train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_set, test_set = random_split(full_dataset, [train_size, test_size])

# Test loading the first sample
waveform, sample_rate, label, filename = train_set[0]
print(f"Waveform shape: {waveform.shape}, Sample rate: {sample_rate}")
print(f"Label: {label} (class: {full_dataset.classes[label]}), Filename: {filename}")
```
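Note that this split is re-randomized on every run. If you want a reproducible split, `random_split` accepts a `generator` argument you can seed (the seed value 42 below is arbitrary):

```python
# Reproducible 80-20 split; the seed value is arbitrary
train_set, test_set = random_split(
    full_dataset,
    [train_size, test_size],
    generator=torch.Generator().manual_seed(42),
)
```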
A few things changed compared to the official dataset:

- No more dependency on the official `SPEECHCOMMANDS` class: we're building the dataset from scratch, so you have full control over data loading.
- Auto-detected categories: the code pulls category names directly from your folder structure, so there's no need for `validation_list.txt` or `testing_list.txt` (if you have pre-defined splits, you can modify the `__init__` method to read those files instead; see the sketch right after this list).
- Consistent audio formatting: we standardize sample rates and convert to mono, which is essential for speech recognition models to work correctly.
- Numerical labels: category names are mapped to integers, which is required for training classification models.
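For the pre-defined-splits case mentioned above, here's a minimal sketch. It assumes a SpeechCommands-style `testing_list.txt` with one relative WAV path per line (e.g. `cat/meow_1.wav`); the `read_split_list` helper and the `split` parameter are illustrative names, not part of any library:

```python
from pathlib import Path

def read_split_list(data_dir, list_name="testing_list.txt"):
    # One relative WAV path per line, e.g. "cat/meow_1.wav"
    return set((Path(data_dir) / list_name).read_text().split())

# Inside __init__, you could then filter each file by the requested split:
#   test_files = read_split_list(self.data_dir)
#   rel = wav_path.relative_to(self.data_dir).as_posix()
#   keep = (rel in test_files) if split == "test" else (rel not in test_files)
```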
Some ideas for going further:

- If your dataset already has separate train/test folders, modify `CustomSpeechDataset` to accept a `split` parameter (e.g., `"train"` or `"test"`) and load files from the corresponding folder.
- Add data augmentation (like background noise or random cropping) in the `__getitem__` method to improve model generalization; a small noise sketch follows the DataLoader example below.
- Use `DataLoader` for batch processing (just like you would with the official dataset):
```python
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)
```
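And here's the promised augmentation sketch: additive Gaussian noise applied during training. The `add_noise` helper and the `noise_level` value are illustrative, so tune them for your data:

```python
import torch

def add_noise(waveform, noise_level=0.005):
    # Add low-level Gaussian noise; noise_level is an arbitrary example value
    return waveform + noise_level * torch.randn_like(waveform)

# In __getitem__, after loading and resampling:
#   if self.augment:  # e.g. a flag you set in __init__ for the training split
#       waveform = add_noise(waveform)
```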
This setup will work seamlessly with your existing model training code; you're ready to train with your own WAV files!
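One caveat on "seamlessly": the default `DataLoader` collation stacks tensors, so it raises an error if your WAV files have different lengths. A minimal sketch of a padding `collate_fn` (`pad_collate` is an illustrative name), assuming the `(waveform, sample_rate, label, filename)` tuples returned above:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def pad_collate(batch):
    waveforms, sample_rates, labels, filenames = zip(*batch)
    # Right-pad every waveform to the longest one in the batch, then stack.
    # After resampling, all sample rates equal target_sample_rate, so we drop them.
    max_len = max(w.shape[-1] for w in waveforms)
    padded = torch.stack([F.pad(w, (0, max_len - w.shape[-1])) for w in waveforms])
    return padded, torch.tensor(labels), filenames

train_loader = DataLoader(train_set, batch_size=32, shuffle=True, collate_fn=pad_collate)
```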
This question comes from Stack Exchange; the original asker is Ariya Mirzaei.