PyTorch DataLoader with num_workers > 1 still only uses a single CPU core

I'm trying to fine-tune a BERT model for a multi-label classification task (the Jigsaw toxic comments dataset). I created a custom dataset and DataLoader as follows:

import numpy as np
import torch
from torch.utils.data import Dataset


class CustomDataSet(Dataset):
    def __init__(self,
                 features: np.ndarray,
                 labels: np.ndarray,
                 token_max: int,
                 tokenizer):
    
        self.features = features
        self.labels = labels  
        self.tokenizer = tokenizer
        self.token_max = token_max
        
    def __len__(self):
        return len(self.features)

    def __getitem__(self,
                    index: int):
        comment_id, comment_text = self.features[index]
        labels = self.labels[index]

        encoding = self.tokenizer.encode_plus(
            comment_text,
            add_special_tokens=True,
            max_length=self.token_max,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt')

        return dict(
            comment_text=comment_text,
            comment_id=comment_id,
            input_ids=encoding['input_ids'].squeeze(0),
            attention_mask=encoding['attention_mask'].squeeze(0),
            labels=torch.Tensor(labels))

The tokenizer I'm using is:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Then I use the custom class to create the dataset and the corresponding DataLoader:

from torch.utils.data import DataLoader

train_dataset = CustomDataSet(X_train, y_train, tokenizer=tokenizer, token_max=256)
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True, pin_memory=True, num_workers=16, persistent_workers=False
)

My model is defined as follows:

from transformers import BertModel


class MultiLabelBERT(torch.nn.Module):
    def __init__(self, num_labels):
        super(MultiLabelBERT, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased", torch_dtype=torch.float16, attn_implementation="sdpa")
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, num_labels)
        self.classifier = self.classifier.to(torch.float16)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output 
        logits = self.classifier(pooled_output)
        return logits

first_BERT = MultiLabelBERT(6)

First, I tried iterating over all batches of the training set and running a forward pass:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for batch_idx, item in enumerate(train_loader):
    input_ids = item['input_ids'].to(device)
    attention_mask = item['attention_mask'].to(device)
    labels = item['labels'].to(device)
    logits = first_BERT(input_ids=input_ids,
                        attention_mask=attention_mask)

Even though I set num_workers=16 in the DataLoader, only a single CPU core is used to load data onto the GPU, which slows the whole pipeline down considerably. I've already tried the following:

  • Reducing the batch size
  • Reducing the maximum token count (token_max)
  • Tokenizing the entire dataset in advance, to make sure the tokenizer isn't the bottleneck
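By pre-tokenizing I mean roughly the following (an illustrative sketch, not my exact code): run the tokenizer once over all texts up front, then serve the resulting tensors from a dataset whose `__getitem__` does no tokenizer work at all.

```python
import torch
from torch.utils.data import Dataset


class PreTokenizedDataset(Dataset):
    """Serves tensors produced by a single up-front tokenization pass,
    so __getitem__ only indexes into pre-built tensors."""

    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, index):
        return dict(input_ids=self.input_ids[index],
                    attention_mask=self.attention_mask[index],
                    labels=self.labels[index])


# One-off batched tokenization (pads/truncates everything in a single call):
# enc = tokenizer(texts, padding='max_length', truncation=True,
#                 max_length=256, return_tensors='pt')
# train_dataset = PreTokenizedDataset(enc['input_ids'], enc['attention_mask'],
#                                     torch.tensor(y_train, dtype=torch.float))
```

Even with this version, where each item is just a tensor slice, the behavior I describe below is the same.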

When I comment out the forward pass, the DataLoader uses all CPU workers as expected; but as soon as the forward pass is added back, the whole pipeline seems to stall.
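To show what I mean by "the DataLoader on its own is fine", this is the kind of isolated test I ran: iterate one full pass with no model attached and watch CPU usage (a sketch with a synthetic dataset that mimics my item shapes; all names here are illustrative):

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


class SyntheticSet(Dataset):
    """Synthetic items with the same shapes as CustomDataSet's output."""

    def __init__(self, n=1024, seq_len=256, num_labels=6):
        self.n, self.seq_len, self.num_labels = n, seq_len, num_labels

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        return dict(input_ids=torch.zeros(self.seq_len, dtype=torch.long),
                    attention_mask=torch.ones(self.seq_len, dtype=torch.long),
                    labels=torch.zeros(self.num_labels))


def time_epoch(num_workers):
    """One full pass over the loader with no forward pass; returns elapsed seconds."""
    loader = DataLoader(SyntheticSet(), batch_size=32, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start


# Compare e.g. time_epoch(0) against time_epoch(8); in this no-model loop
# all worker processes show activity in htop, unlike my training loop.
```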

My GPU configuration is:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   40C    P0            588W /  115W |       9MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1285      G   /usr/lib/xorg/Xorg                              4MiB |

Does anyone know what might be causing this?

Note: this content comes from Stack Exchange; the question was asked by Hyppolite.
