
[CODE] Using Multi-GPU

kalelpark 2023. 3. 19. 17:00

Problem

I wanted to train SSL with the batch size pushed as high as possible, but it didn't work out, so I looked into distributed processing. Honestly, the two don't seem all that related: DDP scales the effective (global) batch across GPUs, but each GPU's memory still limits the per-GPU batch.. (impossible. ㅠ)

Conclusion: I realized that if our lab wants to do SSL, we need to get ourselves a Tesla V100..

import os

import torch
import torch.distributed as dist
import torchvision
import torchvision.transforms as T
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision.models import efficientnet_b0

# torchrun sets LOCAL_RANK for every process it spawns
local_rank = int(os.environ["LOCAL_RANK"])

# one process per GPU; NCCL backend for GPU-to-GPU communication
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

transforms = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor()
])

train_path = "/home/psboys/shared/hdd_ext/nvme1/vision/imageNet/train/"
train_dataset = torchvision.datasets.ImageFolder(train_path, transforms)

# DistributedSampler gives each process a disjoint shard of the dataset,
# so batch_size here is the per-GPU batch size
train_sampler = DistributedSampler(train_dataset, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=4096, sampler=train_sampler)

model = efficientnet_b0(num_classes=1000).cuda(local_rank)
# wrap the model in DDP so gradients are all-reduced across GPUs on backward()
model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion_CE = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    # reshuffle the shards differently every epoch
    train_sampler.set_epoch(epoch)
    for img, label in train_loader:
        img = img.cuda(local_rank)
        label = label.cuda(local_rank)
        output = model(img)
        loss = criterion_CE(output, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # only the main process logs
    if local_rank == 0:
        print(epoch)
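
After training, it is common to save the checkpoint from rank 0 only, so the processes do not all write the same file, and then tear down the process group. A minimal sketch under that convention (the file name is just an example):

# save only from the main process; model.module is the unwrapped EfficientNet
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "efficientnet_b0_ddp.pth")  # example path

# release the NCCL process group once training is done
dist.destroy_process_group()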

# CUDA_VISIBLE_DEVICES=1,2,5,7 torchrun --nproc_per_node 3 dummy.py
# CUDA_VISIBLE_DEVICES=1,2,5,6,7 torchrun --nproc_per_node 5 ddp.py
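
For reference, torchrun launches one Python process per --nproc_per_node and exports RANK, LOCAL_RANK, and WORLD_SIZE to each of them; the script above reads LOCAL_RANK to pick its GPU. A quick way to see this (check_ranks.py is a made-up file name):

# check_ranks.py -- print the environment torchrun hands to each process
import os

print("global rank", os.environ["RANK"],
      "of", os.environ["WORLD_SIZE"], "processes,",
      "local rank", os.environ["LOCAL_RANK"])

# launch: CUDA_VISIBLE_DEVICES=1,2,5,7 torchrun --nproc_per_node 4 check_ranks.py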

Reference

PyTorch DataParallel, DistributedDataParallel, apex, amp summary (rlawjdghek.github.io)
https://rlawjdghek.github.io/pytorch%20&%20tensorflow%20&%20coding/DataParallel/

A Comprehensive Tutorial to Pytorch DistributedDataParallel (medium.com)
https://medium.com/codex/a-comprehensive-tutorial-to-pytorch-distributeddataparallel-1f4b42bb1b51#854f

Getting Started with Distributed Data Parallel (tutorials.pytorch.kr, Korean translation)
https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html
