PyTorch DataLoader uses only one CPU core even with num_workers > 1
I am trying to fine-tune a BERT model for a multi-label classification task (Jigsaw toxic comments). I created a custom dataset and DataLoader as follows:
```python
import numpy as np
import torch
from torch.utils.data import Dataset

class CustomDataSet(Dataset):
    def __init__(self, features: np.ndarray, labels: np.ndarray, token_max: int, tokenizer):
        self.features = features
        self.labels = labels
        self.tokenizer = tokenizer
        self.token_max = token_max

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index: int):
        comment_id, comment_text = self.features[index]
        labels = self.labels[index]
        encoding = self.tokenizer.encode_plus(
            comment_text,
            add_special_tokens=True,
            max_length=self.token_max,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt')
        return dict(
            comment_text=comment_text,
            comment_id=comment_id,
            input_ids=encoding['input_ids'].squeeze(0),
            attention_mask=encoding['attention_mask'].squeeze(0),
            labels=torch.Tensor(labels))
```
The tokenizer I am using is:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
```
Then I create the dataset and the corresponding DataLoader from the custom class:
```python
from torch.utils.data import DataLoader

train_dataset = CustomDataSet(X_train, y_train, tokenizer=tokenizer, token_max=256)
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    pin_memory=True,
    num_workers=16,
    persistent_workers=False
)
```
My model is defined as follows:
```python
from transformers import BertModel

class MultiLabelBERT(torch.nn.Module):
    def __init__(self, num_labels):
        super(MultiLabelBERT, self).__init__()
        self.bert = BertModel.from_pretrained(
            "bert-base-uncased",
            torch_dtype=torch.float16,
            attn_implementation="sdpa")
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, num_labels)
        self.classifier = self.classifier.to(torch.float16)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        logits = self.classifier(pooled_output)
        return logits

first_BERT = MultiLabelBERT(6)
```
First I tried iterating over all batches of the training set and running a forward pass:
```python
# The loop runs at module level, so the bare `self.device` from the original
# snippet is replaced with a plain `device`.
device = torch.device('cuda')
first_BERT = first_BERT.to(device)

for batch_idx, item in enumerate(train_loader):
    input_ids = item['input_ids'].to(device)
    attention_mask = item['attention_mask'].to(device)
    labels = item['labels'].to(device)
    logits = first_BERT(input_ids=input_ids, attention_mask=attention_mask)
```
Even though I set num_workers=16 on the DataLoader, only a single CPU core is used to load data onto the GPU, which slows the whole pipeline down considerably. I have already tried:
- Reducing the batch size
- Reducing the maximum token count (token_max)
- Tokenizing the entire dataset up front, to make sure the tokenizer is not the bottleneck
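For reference, the "tokenize the entire dataset up front" attempt can be sketched roughly like this. The `encode` callable is my stand-in for `tokenizer.encode_plus`, and the class name is hypothetical; the point is only that all tokenization happens once in `__init__`, so `__getitem__` (which runs inside each DataLoader worker) does nothing but indexing:

```python
# Hedged sketch: pre-tokenize once at construction time.
# `encode` is a stand-in for tokenizer.encode_plus (text -> dict),
# so the idea is testable without the transformers library installed.

class PreTokenizedDataset:
    """Encodings are computed once up front; per-item access is pure indexing."""

    def __init__(self, texts, labels, encode):
        self.encodings = [encode(t) for t in texts]  # one-time tokenization cost
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = dict(self.encodings[i])  # copy so callers can't mutate the cache
        item['labels'] = self.labels[i]
        return item
```

With the real tokenizer, `encode` would be something like `lambda t: tokenizer.encode_plus(t, max_length=256, padding='max_length', truncation=True)`.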
When I comment out the forward pass, the DataLoader uses all CPU workers as expected; as soon as the forward pass is added back, the whole pipeline seems to stall.
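To pin down where the stall happens, a small stdlib-only helper (the name and structure are my own, not from any library) can split each iteration into time spent waiting on the data iterator versus time spent in the per-batch step:

```python
import time

def profile_loop(batches, step, warmup=1):
    """Measure loader-wait time vs. per-batch `step` time over an iterable.

    If the summed wait time dominates, data loading is the bottleneck;
    if the summed step time dominates, the forward pass is.
    The first `warmup` batches are excluded from the totals.
    """
    wait, compute, n = 0.0, 0.0, 0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)   # time spent blocked on the (worker) iterator
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)            # e.g. device transfer + forward pass
        t2 = time.perf_counter()
        if n >= warmup:
            wait += t1 - t0
            compute += t2 - t1
        n += 1
    return wait, compute
```

In this setup it could be called as `profile_loop(train_loader, step=lambda item: first_BERT(input_ids=item['input_ids'].to(device), attention_mask=item['attention_mask'].to(device)))` to see whether the workers or the model dominate.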
My GPU configuration:
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   40C    P0             588W / 115W |       9MiB /  8188MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1285      G   /usr/lib/xorg/Xorg                              4MiB |
```
Does anyone know what might be causing this?
Note: sourced from Stack Exchange; original question asked by Hyppolite.




