PyTorch DataLoader uses only one CPU core even with num_workers > 1
I am trying to fine-tune a BERT model for a multi-label classification task (Jigsaw toxic comments). I created a custom dataset and DataLoader as follows:
```python
import numpy as np
import torch
from torch.utils.data import Dataset

class CustomDataSet(Dataset):
    def __init__(self, features: np.ndarray, labels: np.ndarray, token_max: int, tokenizer):
        self.features = features
        self.labels = labels
        self.tokenizer = tokenizer
        self.token_max = token_max

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index: int):
        comment_id, comment_text = self.features[index]
        labels = self.labels[index]
        encoding = self.tokenizer.encode_plus(
            comment_text,
            add_special_tokens=True,
            max_length=self.token_max,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt')
        return dict(
            comment_text=comment_text,
            comment_id=comment_id,
            input_ids=encoding['input_ids'].squeeze(0),
            attention_mask=encoding['attention_mask'].squeeze(0),
            labels=torch.Tensor(labels))
```
The tokenizer I am using is:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
```
Then I create the dataset and the corresponding DataLoader from the custom class:
```python
from torch.utils.data import DataLoader

train_dataset = CustomDataSet(X_train, y_train, tokenizer=tokenizer, token_max=256)
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    pin_memory=True,
    num_workers=16,
    persistent_workers=False
)
```
My model is defined as follows:
```python
from transformers import BertModel

class MultiLabelBERT(torch.nn.Module):
    def __init__(self, num_labels):
        super(MultiLabelBERT, self).__init__()
        self.bert = BertModel.from_pretrained(
            "bert-base-uncased",
            torch_dtype=torch.float16,
            attn_implementation="sdpa")
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, num_labels)
        self.classifier = self.classifier.to(torch.float16)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        logits = self.classifier(pooled_output)
        return logits

first_BERT = MultiLabelBERT(6)
```
First I tried iterating over all batches of the training set and running a forward pass:
```python
# The loop runs at module level, so the bare `self.device` from the original
# snippet is replaced with a plain `device`.
device = torch.device('cuda')
first_BERT = first_BERT.to(device)

for batch_idx, item in enumerate(train_loader):
    input_ids = item['input_ids'].to(device)
    attention_mask = item['attention_mask'].to(device)
    labels = item['labels'].to(device)
    logits = first_BERT(input_ids=input_ids, attention_mask=attention_mask)
```
Even though I set num_workers=16 on the DataLoader, only a single CPU core is used to load data onto the GPU, which slows the whole pipeline down considerably. I have already tried:
- Reducing the batch size
- Reducing the maximum token count (token_max)
- Tokenizing the entire dataset up front, to make sure the tokenizer is not the bottleneck
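For reference, the "tokenize the entire dataset up front" attempt can be sketched roughly like this. The `encode` callable is my stand-in for `tokenizer.encode_plus`, and the class name is hypothetical; the point is only that all tokenization happens once in `__init__`, so `__getitem__` (which runs inside each DataLoader worker) does nothing but indexing:

```python
# Hedged sketch: pre-tokenize once at construction time.
# `encode` is a stand-in for tokenizer.encode_plus (text -> dict),
# so the idea is testable without the transformers library installed.

class PreTokenizedDataset:
    """Encodings are computed once up front; per-item access is pure indexing."""

    def __init__(self, texts, labels, encode):
        self.encodings = [encode(t) for t in texts]  # one-time tokenization cost
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = dict(self.encodings[i])  # copy so callers can't mutate the cache
        item['labels'] = self.labels[i]
        return item
```

With the real tokenizer, `encode` would be something like `lambda t: tokenizer.encode_plus(t, max_length=256, padding='max_length', truncation=True)`.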
When I comment out the forward pass, the DataLoader uses all CPU workers as expected; as soon as the forward pass is added back, the whole pipeline seems to stall.
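To pin down where the stall happens, a small stdlib-only helper (the name and structure are my own, not from any library) can split each iteration into time spent waiting on the data iterator versus time spent in the per-batch step:

```python
import time

def profile_loop(batches, step, warmup=1):
    """Measure loader-wait time vs. per-batch `step` time over an iterable.

    If the summed wait time dominates, data loading is the bottleneck;
    if the summed step time dominates, the forward pass is.
    The first `warmup` batches are excluded from the totals.
    """
    wait, compute, n = 0.0, 0.0, 0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)   # time spent blocked on the (worker) iterator
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)            # e.g. device transfer + forward pass
        t2 = time.perf_counter()
        if n >= warmup:
            wait += t1 - t0
            compute += t2 - t1
        n += 1
    return wait, compute
```

In this setup it could be called as `profile_loop(train_loader, step=lambda item: first_BERT(input_ids=item['input_ids'].to(device), attention_mask=item['attention_mask'].to(device)))` to see whether the workers or the model dominate.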
My GPU configuration:
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   40C    P0             588W / 115W |       9MiB /  8188MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1285      G   /usr/lib/xorg/Xorg                              4MiB |
```
Does anyone know what might be causing this?
Note: sourced from Stack Exchange; original question asked by Hyppolite.




