[PyTorch] Mixed-Precision Computation

Preface

Most of this post is adapted from other sources; see the references at the end.

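Introduction to AMP

An aside on why I wrote this post: I am short on money. Renting a multi-GPU server burns through cash quickly, and GPU memory still runs short, so I badly needed a trick that lowers memory use and speeds up training.

Back to the topic. With a large dataset and a deep network, training is slow. To speed it up, we can use PyTorch's AMP (autocast and GradScaler). This post shows how to use AMP (automatic mixed precision) to accelerate model training, comparing autocast alone with autocast plus GradScaler.

Note that PyTorch 1.6+ ships torch.cuda.amp built in, so NVIDIA's apex library (half-precision acceleration) is no longer required. Because apex is troublesome to install, this post uses torch.cuda.amp instead.

AMP (Automatic Mixed Precision): so what is automatic mixed precision?

First, some history: NVIDIA's apex came first, and NVIDIA developers later contributed it to PyTorch, which produced torch.cuda.amp in version 1.6. [This is my own summary and may be wrong; please leave a comment if so.]

In more detail: by default, most deep learning frameworks train with 32-bit floating-point arithmetic. In 2017, NVIDIA researchers proposed a mixed-precision training method (later packaged in apex) that combines single precision (FP32) with half precision (FP16) during training. With the same hyperparameters it reaches nearly the same accuracy as FP32 training and runs considerably faster.

Then came the AMP era (meaning torch.cuda.amp specifically). The name has two key parts. "Automatic" means the dtype of a Tensor changes automatically: the framework adjusts each tensor's dtype as needed, although some places may need manual intervention. "Mixed precision" means tensors of more than one precision are used: torch.FloatTensor and torch.HalfTensor. As the name torch.cuda.amp suggests, this feature works only on CUDA.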
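
To make the two precisions concrete, here is a minimal sketch that uses only Python's standard library (the struct formats "f" and "e" are IEEE single and half precision). It estimates the storage for one input batch at the sizes used later in this post (batch size 64, 227x227x3 images). The helper name batch_bytes is made up for illustration and is not part of PyTorch.

```python
import struct

BYTES_FP32 = struct.calcsize("f")  # 4 bytes per single-precision value
BYTES_FP16 = struct.calcsize("e")  # 2 bytes per half-precision value


def batch_bytes(batch_size, channels, height, width, bytes_per_value):
    """Storage needed for one dense batch tensor of the given shape."""
    return batch_size * channels * height * width * bytes_per_value


fp32_bytes = batch_bytes(64, 3, 227, 227, BYTES_FP32)
fp16_bytes = batch_bytes(64, 3, 227, 227, BYTES_FP16)
print(fp32_bytes, fp16_bytes)  # 39574272 19787136
```

The same halving applies to every activation that is kept in FP16, which is where most of the memory saving comes from.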

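Why use AMP (automatic mixed precision)?

  1. Lower GPU memory usage (an FP16 advantage).
  2. Faster training and inference (an FP16 advantage).
  3. Tensor Cores (NVIDIA Tensor Core) are now widespread and accelerate low-precision (FP16) math.
  4. Mixed-precision training mitigates rounding error (a weakness of FP16 that FP32 avoids).
  5. Loss scaling: mixed-precision training may still fail to converge because activation gradients are small enough to underflow in FP16. torch.cuda.amp.GradScaler scales up the loss to prevent this gradient underflow.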
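
Points 4 and 5 can be reproduced without a GPU. The sketch below is an illustration only: it round-trips Python floats through the standard library's half-precision ("e") and single-precision ("f") struct formats, and the helper names to_fp16 and to_fp32 are made up here. The scale factor 2**16 matches the default init_scale of GradScaler.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Rounding error: a small weight update vanishes in FP16 but survives in FP32.
weight, update = 1.0, 1e-4
print(to_fp16(weight + update) == 1.0)  # True: the update is rounded away
print(to_fp32(weight + update) > 1.0)   # True: FP32 keeps the update

# Underflow: a tiny gradient becomes exactly zero in FP16.
grad = 1e-8
print(to_fp16(grad))  # 0.0

# Loss scaling: multiply before the FP16 cast, divide again in higher precision.
scale = 2.0 ** 16
scaled = to_fp16(grad * scale)
recovered = scaled / scale
print(recovered > 0.0)  # True: the gradient is no longer lost
```

This is why the weights and the unscaling step stay in FP32 while the bulk of the arithmetic runs in FP16.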

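To be clear, this post is about how to speed up training, not about the underlying theory. The examples use AlexNet as the network (it expects 227x227x3 input images), CIFAR10 as the dataset, AdamW as the optimizer, and ReduceLROnPlateau as the learning-rate schedule. The machine is a Legion laptop with a 2060 GPU; it is weak, but good enough for these tests.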
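
For readers who have not used ReduceLROnPlateau, the following is a simplified sketch of its core rule: lower the learning rate once the monitored value has failed to improve for more than `patience` epochs. ToyPlateauScheduler is a hypothetical stand-in written for this post; it ignores the threshold, cooldown, and min_lr options of the real scheduler. The values lr=1e-4 and patience=3 are the ones used in the scripts below, and factor=0.1 is the PyTorch default.

```python
class ToyPlateauScheduler:
    """Simplified stand-in for ReduceLROnPlateau (mode='min', no threshold or cooldown)."""

    def __init__(self, lr, patience=3, factor=0.1):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce only after more than `patience` epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = ToyPlateauScheduler(lr=1e-4, patience=3)
epoch_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.8]
lrs = [scheduler.step(loss) for loss in epoch_losses]
print(lrs)  # the last entry is ten times smaller than the others
```

Because the rule reacts to the value passed to step, that value should summarize the whole epoch (for example the mean batch loss) rather than a single batch.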

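This post covers three setups:

  1. Training and evaluation code without DP or DDP (AMP added afterwards)
  2. DataParallel (DP) training and evaluation code (AMP added afterwards)
  3. Single-process multi-GPU DDP training and evaluation code (AMP added afterwards)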

Using AMP

Basic AMP

autocast

General flow for adding autocast to an existing model

from torch.cuda.amp import autocast

...

# Create the model; parameters default to torch.FloatTensor
model = Net().cuda()

# SGD, Adam, AdamW, ...
optimizer = optim.XXX(model.parameters(), ..)

...

for imgs, targets in dataloader:
    imgs, targets = imgs.cuda(), targets.cuda()

    ...
    # Only the forward pass and the loss computation run under autocast
    with autocast():
        outputs = model(imgs)
        loss = loss_fn(outputs, targets)
    ...
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

...

autocast and GradScaler

General flow for adding autocast and GradScaler to an existing model

from torch.cuda.amp import autocast
from torch.cuda.amp import GradScaler
...

# Create the model; parameters default to torch.FloatTensor
model = Net().cuda()

# SGD, Adam, AdamW, ...
optimizer = optim.XXX(model.parameters(), ..)
scaler = GradScaler()

...

for imgs, targets in dataloader:
    imgs, targets = imgs.cuda(), targets.cuda()
    ...
    optimizer.zero_grad()
    ...
    # Only the forward pass and the loss computation run under autocast
    with autocast():
        outputs = model(imgs)
        loss = loss_fn(outputs, targets)

    # Scale the loss, backpropagate, then let the scaler drive the optimizer
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
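
What the scaler does between scale, step, and update can be modeled in a few lines. The class below is a simplified, hypothetical ToyGradScaler written for illustration; it folds step and update into one method and works on plain Python floats, so it is not the real GradScaler. The default values (init_scale 2**16, growth_factor 2.0, backoff_factor 0.5, growth_interval 2000) mirror the defaults of torch.cuda.amp.GradScaler.

```python
import math


class ToyGradScaler:
    """Simplified model of dynamic loss scaling; not the real GradScaler."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            # Overflow: skip the optimizer step and shrink the scale.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return None
        self.good_steps += 1
        unscaled = [g / self.scale for g in scaled_grads]
        if self.good_steps == self.growth_interval:
            # A long run without overflow: try a larger scale.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return unscaled


scaler = ToyGradScaler(growth_interval=2)
print(scaler.step([float("inf"), 1.0]))  # None, and the scale halves to 32768.0
print(scaler.step([32768.0]))            # [1.0]
print(scaler.step([32768.0]))            # [1.0], then the scale doubles to 65536.0
```

The practical consequence is that a few skipped optimizer steps early in training are expected while the scale settles.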

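DP with AMP

Basic flow

With DP, applying autocast only as in the basic AMP flow is not enough; it has no effect. In PyTorch 1.6, autocast state is thread-local and DataParallel runs each replica's forward pass in a separate thread, so an autocast context entered in the main thread does not reach the replicas. autocast must therefore also be applied inside the model's forward. (Later PyTorch releases propagate the autocast state to these threads, so check the documentation for your version.)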

# Option 1: decorate forward with autocast
class Model(nn.Module):
    @autocast()
    def forward(self, input):
        ...

# Option 2: enter autocast inside forward
class Model(nn.Module):
    def forward(self, input):
        with autocast():
            ...

# Training flow
...
model = Model()
model = torch.nn.DataParallel(model)
with autocast():
    output = model(imgs)
    loss = loss_fn(output, targets)
...
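
The reason the context has to be entered inside forward is easiest to see with an analogy. The sketch below does not use PyTorch at all: it stores a flag in a standard-library threading.local object and shows that worker threads do not see a value set in the main thread, which is how thread-local autocast state behaves in the versions where it is not propagated. All names here (state, autocast_enabled, run_in_threads) are made up for illustration.

```python
import threading

state = threading.local()  # each thread sees its own attributes


def forward_without_context(results, idx):
    # The worker thread never saw the main thread's setting.
    results[idx] = getattr(state, "autocast_enabled", False)


def forward_with_context(results, idx):
    # Enabling the flag inside the function makes it visible in this thread.
    state.autocast_enabled = True
    results[idx] = getattr(state, "autocast_enabled", False)


def run_in_threads(fn, n_threads=2):
    """Run fn once per thread and collect what each thread observed."""
    results = [None] * n_threads
    threads = [threading.Thread(target=fn, args=(results, i)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


state.autocast_enabled = True  # set in the main thread only
print(run_in_threads(forward_without_context))  # [False, False]
print(run_in_threads(forward_with_context))     # [True, True]
```

Decorating forward, or entering autocast inside it, plays the role of forward_with_context here.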

Complete code for reference

DP + autocast + GradScaler

import time
import torch
from alexnet import alexnet
import torchvision
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torch.cuda.amp import autocast
from torch.cuda.amp import GradScaler
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()


args = parse_args()

# 1.Create SummaryWriter
if args.tensorboard:
    writer = SummaryWriter(args.tensorboard_log)

# 2.Ready dataset
if args.dataset == 'CIFAR10':
    train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
        [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
else:
    raise ValueError("Dataset is not CIFAR10")
cuda = torch.cuda.is_available()
print('CUDA available: {}'.format(cuda))

# 3.Length
train_dataset_size = len(train_dataset)
print("the train dataset size is {}".format(train_dataset_size))

# 4.DataLoader
train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)

# 5.Create model
# With DP, alexnet's forward is assumed to apply autocast itself (see the basic flow above)
model = alexnet()

if args.cuda == cuda:
    model = model.cuda()
    model = torch.nn.DataParallel(model).cuda()
else:
    model = torch.nn.DataParallel(model)

# 6.Create loss
cross_entropy_loss = nn.CrossEntropyLoss()

# 7.Optimizer
optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
scaler = GradScaler()
# 8. Set some parameters to control loop
# epoch
iter = 0
t0 = time.time()
for epoch in range(args.epochs):
    t1 = time.time()
    print(" -----------------the {} number of training epoch --------------".format(epoch))
    model.train()
    epoch_losses = []  # per-batch losses, averaged for logging and the scheduler
    for data in train_dataloader:
        imgs, targets = data
        optim.zero_grad()
        if args.cuda == cuda:
            cross_entropy_loss = cross_entropy_loss.cuda()
            imgs, targets = imgs.cuda(), targets.cuda()
        with autocast():
            outputs = model(imgs)
            loss_train = cross_entropy_loss(outputs, targets)
        epoch_losses.append(loss_train.item())
        if args.tensorboard:
            writer.add_scalar("train_loss", loss_train.item(), iter)

        # Scale the loss, backpropagate, step the optimizer, then update the scale
        scaler.scale(loss_train).backward()
        scaler.step(optim)
        scaler.update()

        iter = iter + 1
        if iter % 100 == 0:
            print(
                "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                        np.mean(epoch_losses)))
    if args.tensorboard:
        writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
    # Step the scheduler on the mean loss of the whole epoch
    scheduler.step(np.mean(epoch_losses))
    t2 = time.time()
    h = (t2 - t1) // 3600
    m = ((t2 - t1) % 3600) // 60
    s = ((t2 - t1) % 3600) % 60
    print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))

    if epoch % 1 == 0:
        print("Save state, iter: {} ".format(epoch))
        torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))

torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
t3 = time.time()
h_t = (t3 - t0) // 3600
m_t = ((t3 - t0) % 3600) // 60
s_t = ((t3 - t0) % 3600) % 60  # seconds are the remainder, not a second "// 60"
print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
if args.tensorboard:
    writer.close()
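
The hour/minute/second arithmetic used for the timing printouts can be written more compactly with divmod. This is only a sketch; format_elapsed is a helper name made up here and does not appear in the script.

```python
def format_elapsed(seconds):
    """Split a duration in seconds into whole hours, minutes, and seconds."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return "{}h{}m{}s".format(hours, minutes, secs)


print(format_elapsed(3725.4))  # 1h2m5s
```

Calling format_elapsed(time.time() - t0) would replace the three separate expressions for hours, minutes, and seconds.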

DDP with AMP

autocast

Training

import time
import torch
from alexnet import alexnet
import torchvision
from torch import nn
import torch.distributed as dist
from torchvision import transforms
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=str, default="12355")
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()


args = parse_args()


def train():
    dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank,
                            world_size=args.world_size)
    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)

    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)

    else:
        raise ValueError("Dataset is not CIFAR10")

    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))

    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))

    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size, sampler=train_sampler,
                                  num_workers=2,
                                  pin_memory=True)

    # 5.Create model
    model = alexnet()

    if args.cuda == cuda:
        model = model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(model).cuda()
    else:
        model = torch.nn.parallel.DistributedDataParallel(model)

    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()

    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)

    # 8. Set some parameters to control loop
    # epoch
    iter = 0
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        # Reshuffle the per-rank shards differently in every epoch
        train_sampler.set_epoch(epoch)
        model.train()
        epoch_losses = []  # per-batch losses, averaged for logging and the scheduler
        for data in train_dataloader:
            imgs, targets = data
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            with autocast():
                outputs = model(imgs)
                loss_train = cross_entropy_loss(outputs, targets)
            epoch_losses.append(loss_train.item())
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)

            optim.zero_grad()
            loss_train.backward()
            optim.step()
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(epoch_losses)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        # Step the scheduler on the mean loss of the whole epoch
        scheduler.step(np.mean(epoch_losses))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))

        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))

    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) % 60  # seconds are the remainder, not a second "// 60"
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()


if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: {}".format(local_size))
    train()
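
The DistributedSampler in this script is what gives each rank a different slice of CIFAR10. A simplified sketch of that partitioning, without shuffling, is shown below. shard_indices is a hypothetical helper written for this post; the padding-then-striding scheme follows the sampler's default behavior (drop_last=False), but the real class also shuffles by epoch.

```python
import math


def shard_indices(dataset_size, world_size, rank):
    """Indices one rank would see, without shuffling (simplified)."""
    per_rank = math.ceil(dataset_size / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_size))
    # Pad by repeating the first indices so every rank gets the same count.
    indices += indices[: total - dataset_size]
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total:world_size]


print(shard_indices(10, 4, 0))  # [0, 4, 8]
print(shard_indices(10, 4, 3))  # [3, 7, 1]
```

With world_size=1, the default in this script, the single rank simply sees every index.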

Evaluation

import torch
import torchvision
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchvision.transforms import transforms
from alexnet import alexnet
# from torchvision.models.alexnet import alexnet
import argparse


# eval
def parse_args():
    parser = argparse.ArgumentParser(description='CV Evaluation')
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=str, default="12355")
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()


args = parse_args()


def eval():
    dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank,
                            world_size=args.world_size)
    # 1.Create model
    model = alexnet()
    model = torch.nn.parallel.DistributedDataParallel(model)

    # Load the checkpoint once, before the evaluation loop
    device = torch.device('cpu')
    model_load = torch.load("{}/AlexNet.pth".format(args.checkpoint), map_location=device)
    model.load_state_dict(model_load)

    # 2.Ready Dataset
    if args.dataset == 'CIFAR10':
        test_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=False,
                                                    transform=transforms.Compose(
                                                        [transforms.Resize(args.img_size),
                                                         transforms.ToTensor()]),
                                                    download=True)

    else:
        raise ValueError("Dataset is not CIFAR10")

    # 3.Length
    test_dataset_size = len(test_dataset)
    print("the test dataset size is {}".format(test_dataset_size))
    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)

    # 4.DataLoader
    test_dataloader = DataLoader(dataset=test_dataset, sampler=test_sampler, batch_size=args.batch_size,
                                 num_workers=2,
                                 pin_memory=True)

    # 5. Set some parameters for testing the network
    total_accuracy = 0

    # test
    model.eval()
    with torch.no_grad():
        for data in test_dataloader:
            imgs, targets = data
            imgs, targets = imgs.to(device), targets.to(device)
            outputs = model(imgs)
            outputs = outputs.to(device)
            # Count the correct predictions in this batch
            accuracy = (outputs.argmax(1) == targets).sum()
            total_accuracy = total_accuracy + accuracy
        accuracy = total_accuracy / test_dataset_size
        print("the total accuracy is {}".format(accuracy))


if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: {}".format(local_size))
    eval()
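
One caveat about the accuracy figure: with a DistributedSampler, each rank evaluates only its own shard, so dividing one rank's correct count by the full test-set size is only right when world_size is 1. The sketch below uses made-up counts for two ranks to show the difference; accuracy_from_shards is a hypothetical helper. In a real run the per-rank counts would have to be summed across processes first, for example with a collective such as dist.all_reduce.

```python
def accuracy_from_shards(correct_per_rank, dataset_size):
    """Overall accuracy from per-rank correct counts (sum before dividing)."""
    return sum(correct_per_rank) / dataset_size


# Hypothetical counts: 2 ranks, each evaluating half of the 10,000 CIFAR10 test images.
correct_per_rank = [4100, 4000]
print(accuracy_from_shards(correct_per_rank, 10000))  # 0.81
print(correct_per_rank[0] / 10000)                    # 0.41, what a single rank alone would report
```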

autocast and GradScaler

Training

import time
import torch
from alexnet import alexnet
import torchvision
from torch import nn
import torch.distributed as dist
from torchvision import transforms
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast
from torch.cuda.amp import GradScaler
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=str, default="12355")
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()


args = parse_args()


def train():
    dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank,
                            world_size=args.world_size)
    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)

    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")

    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))

    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))

    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size, sampler=train_sampler,
                                  num_workers=2,
                                  pin_memory=True)

    # 5.Create model
    model = alexnet()

    if args.cuda == cuda:
        model = model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(model).cuda()
    else:
        model = torch.nn.parallel.DistributedDataParallel(model)

    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()

    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
    scaler = GradScaler()
    # 8. Set some parameters to control loop
    # epoch
    iter = 0
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        # Reshuffle the per-rank shards differently in every epoch
        train_sampler.set_epoch(epoch)
        model.train()
        epoch_losses = []  # per-batch losses, averaged for logging and the scheduler
        for data in train_dataloader:
            imgs, targets = data
            optim.zero_grad()
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            with autocast():
                outputs = model(imgs)
                loss_train = cross_entropy_loss(outputs, targets)
            epoch_losses.append(loss_train.item())
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)

            # Scale the loss, backpropagate, step the optimizer, then update the scale
            scaler.scale(loss_train).backward()
            scaler.step(optim)
            scaler.update()

            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(epoch_losses)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        # Step the scheduler on the mean loss of the whole epoch
        scheduler.step(np.mean(epoch_losses))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))

        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))

    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) % 60  # seconds are the remainder, not a second "// 60"
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()


if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: {}".format(local_size))
    train()

The evaluation script is the same as in the autocast-only case.

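References

  1. What to do when GPU memory runs out while training a deep learning model?
  2. Speed up training by 60% with just 5 lines of code
  3. PyTorch automatic mixed precision (AMP) training
  4. PyTorch distributed training basics: using DDP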