PyTorch beginner help: how to build a speech recognition dataset from custom WAV files (without relying on the official torch.datasets)
Hey there! I get that you're just starting out with PyTorch and want to adapt your speech recognition code to use your own WAV files instead of the official SpeechCommands dataset. Let's break this down step by step; nothing too complex, we'll keep it straightforward.
First, the core idea for custom datasets in PyTorch is to inherit from `torch.utils.data.Dataset` and implement three key methods: `__init__`, `__len__`, and `__getitem__`. Let's assume your dataset has a common folder structure where each subfolder represents a speech category, with all WAV files for that category inside it (like the example below). If your structure is different, we can adjust the code later!
```
my_custom_speech_data/
├── cat/
│   ├── meow_1.wav
│   ├── meow_2.wav
│   └── ...
├── dog/
│   ├── bark_1.wav
│   ├── bark_2.wav
│   └── ...
└── bird/
    ├── chirp_1.wav
    └── ...
```
Here's the modified code with detailed explanations:
First, import the necessary libraries (we'll keep most of your original imports plus a few extras):
```python
import torch
from torch import nn, optim
import torch.nn.functional as F
import torchaudio
from torch.utils.data import Dataset, DataLoader, random_split
from pathlib import Path
```
Next, define your custom dataset class:
```python
class CustomSpeechDataset(Dataset):
    def __init__(self, data_dir, target_sample_rate=16000):
        super().__init__()
        self.data_dir = Path(data_dir)
        self.target_sample_rate = target_sample_rate

        # 1. Auto-detect speech categories from subfolder names
        self.classes = sorted(
            folder.name for folder in self.data_dir.iterdir() if folder.is_dir()
        )
        # 2. Map category names to numerical labels (required for model training)
        self.class_to_idx = {cls: idx for idx, cls in enumerate(self.classes)}

        # 3. Load all WAV file paths and their corresponding labels
        self.audio_paths = []
        self.labels = []
        for cls in self.classes:
            cls_dir = self.data_dir / cls
            # Grab all WAV files in the category folder
            for wav_path in cls_dir.glob("*.wav"):
                self.audio_paths.append(wav_path)
                self.labels.append(self.class_to_idx[cls])

    def __len__(self):
        # Return the total number of samples in the dataset
        return len(self.audio_paths)

    def __getitem__(self, idx):
        # Load and process a single sample
        audio_path = self.audio_paths[idx]
        label = self.labels[idx]

        # Read the WAV file
        waveform, sample_rate = torchaudio.load(audio_path)

        # Standardize the sample rate (critical for consistent model input)
        if sample_rate != self.target_sample_rate:
            waveform = torchaudio.functional.resample(
                waveform, orig_freq=sample_rate, new_freq=self.target_sample_rate
            )

        # Convert multi-channel audio (e.g. stereo) to mono if needed
        if waveform.shape[0] > 1:
            waveform = torch.mean(waveform, dim=0, keepdim=True)

        # Return a structure similar to your original code
        return waveform, self.target_sample_rate, label, audio_path.name
```
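One optional optimization: `torchaudio.functional.resample` computes its resampling kernel on every call, while `torchaudio.transforms.Resample` caches it. If most of your files share a known source rate, you can build the transform once and reuse it. A minimal sketch, assuming a 44.1 kHz source rate (the value is just an example):

```python
import torchaudio

# Build the resampler once; the cached kernel makes repeated calls faster.
# 44100 Hz is an assumed example source rate, not something from your data.
resampler = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)

waveform, sample_rate = torchaudio.load("my_custom_speech_data/cat/meow_1.wav")
if sample_rate == 44100:
    waveform = resampler(waveform)
```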
Now replace your original dataset initialization code with this:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Replace this path with your actual dataset folder path
DATA_DIR = "./my_custom_speech_data"

# Load the full dataset
full_dataset = CustomSpeechDataset(DATA_DIR)

# Split into training and testing sets (80-20 split example)
train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_set, test_set = random_split(full_dataset, [train_size, test_size])

# Test loading the first sample
waveform, sample_rate, label, filename = train_set[0]
print(f"Waveform shape: {waveform.shape}, Sample rate: {sample_rate}")
print(f"Label: {label} (class: {full_dataset.classes[label]}), Filename: {filename}")
```
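Note that this split is re-randomized on every run. If you want a reproducible split, `random_split` accepts a `generator` argument you can seed (the seed value 42 below is arbitrary):

```python
# Reproducible 80-20 split; the seed value is arbitrary
train_set, test_set = random_split(
    full_dataset,
    [train_size, test_size],
    generator=torch.Generator().manual_seed(42),
)
```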
A few things changed compared to the official dataset:

- No more dependency on the official `SPEECHCOMMANDS` class: we're building the dataset from scratch, so you have full control over data loading.
- Auto-detected categories: the code pulls category names directly from your folder structure, so there's no need for `validation_list.txt` or `testing_list.txt` (if you have pre-defined splits, you can modify the `__init__` method to read those files instead; see the sketch right after this list).
- Consistent audio formatting: we standardize sample rates and convert to mono, which is essential for speech recognition models to work correctly.
- Numerical labels: category names are mapped to integers, which is required for training classification models.
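For the pre-defined-splits case mentioned above, here's a minimal sketch. It assumes a SpeechCommands-style `testing_list.txt` with one relative WAV path per line (e.g. `cat/meow_1.wav`); the `read_split_list` helper and the `split` parameter are illustrative names, not part of any library:

```python
from pathlib import Path

def read_split_list(data_dir, list_name="testing_list.txt"):
    # One relative WAV path per line, e.g. "cat/meow_1.wav"
    return set((Path(data_dir) / list_name).read_text().split())

# Inside __init__, you could then filter each file by the requested split:
#   test_files = read_split_list(self.data_dir)
#   rel = wav_path.relative_to(self.data_dir).as_posix()
#   keep = (rel in test_files) if split == "test" else (rel not in test_files)
```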
Some ideas for going further:

- If your dataset already has separate train/test folders, modify `CustomSpeechDataset` to accept a `split` parameter (e.g., `"train"` or `"test"`) and load files from the corresponding folder.
- Add data augmentation (like background noise or random cropping) in the `__getitem__` method to improve model generalization; a small noise sketch follows the DataLoader example below.
- Use `DataLoader` for batch processing (just like you would with the official dataset):
```python
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)
```
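And here's the promised augmentation sketch: additive Gaussian noise applied during training. The `add_noise` helper and the `noise_level` value are illustrative, so tune them for your data:

```python
import torch

def add_noise(waveform, noise_level=0.005):
    # Add low-level Gaussian noise; noise_level is an arbitrary example value
    return waveform + noise_level * torch.randn_like(waveform)

# In __getitem__, after loading and resampling:
#   if self.augment:  # e.g. a flag you set in __init__ for the training split
#       waveform = add_noise(waveform)
```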
This setup will work seamlessly with your existing model training code; you're ready to train with your own WAV files!
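One caveat on "seamlessly": the default `DataLoader` collation stacks tensors, so it raises an error if your WAV files have different lengths. A minimal sketch of a padding `collate_fn` (`pad_collate` is an illustrative name), assuming the `(waveform, sample_rate, label, filename)` tuples returned above:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def pad_collate(batch):
    waveforms, sample_rates, labels, filenames = zip(*batch)
    # Right-pad every waveform to the longest one in the batch, then stack.
    # After resampling, all sample rates equal target_sample_rate, so we drop them.
    max_len = max(w.shape[-1] for w in waveforms)
    padded = torch.stack([F.pad(w, (0, max_len - w.shape[-1])) for w in waveforms])
    return padded, torch.tensor(labels), filenames

train_loader = DataLoader(train_set, batch_size=32, shuffle=True, collate_fn=pad_collate)
```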
This question comes from Stack Exchange; the original asker is Ariya Mirzaei.