Paper16-2. Unsupervised Feature Learning via Non-Parametric Instance Discrimination
PyTorch Code
Source: zhirongw GitHub
Since the full code is long, only the important parts are excerpted and examined below.
Data Loader
This is one example of a data loader. In addition to the basic input X and the label, it also returns each sample's index so that individual instances can be distinguished.
class CIFAR10Instance(datasets.CIFAR10):
    """CIFAR10Instance Dataset."""

    def __getitem__(self, index):
        if self.train:
            img, target = self.train_data[index], self.train_labels[index]
        else:
            img, target = self.test_data[index], self.test_labels[index]

        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img)

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target, index
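As a minimal usage sketch (the transform and DataLoader settings here are assumptions, not taken from the repository, and it assumes the older torchvision API with train_data / train_labels that the class above relies on), the dataset plugs into a standard DataLoader and each batch then carries the extra index tensor:

import torch
import torchvision.transforms as transforms

# hypothetical minimal transform; the repository's actual augmentation pipeline is richer
transform_train = transforms.Compose([
    transforms.ToTensor(),
])

trainset = CIFAR10Instance(root='./data', train=True, download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

for img, target, index in trainloader:
    # img:    [128, 3, 32, 32] input batch
    # target: [128] class labels (not used for unsupervised training)
    # index:  [128] instance indices, used as the per-sample "class" id
    break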
Model
This is the ResNet backbone that produces \(v = f_{\theta}(x)\). The experiments were run with a 128-dimensional output by default, and an additional normalization layer is included so that \(\|v\| = 1\) is satisfied.
class Normalize(nn.Module):

    def __init__(self, power=2):
        super(Normalize, self).__init__()
        self.power = power

    def forward(self, x):
        norm = x.pow(self.power).sum(1, keepdim=True).pow(1./self.power)
        out = x.div(norm)
        return out

class ResNet(nn.Module):
    def __init__(self, block, num_blocks, low_dim=128):
        super(ResNet, self).__init__()
        self.in_planes = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        self.linear = nn.Linear(512*block.expansion, low_dim)
        self.l2norm = Normalize(2)

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = F.avg_pool2d(out, 4)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        out = self.l2norm(out)
        return out
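As a quick sanity check (a sketch only; the BasicBlock class and the [2, 2, 2, 2] block configuration for a ResNet-18-style backbone are assumptions, not shown in the excerpt), the network output should be a 128-dimensional, L2-normalized vector:

import torch

# BasicBlock is assumed to be the usual CIFAR-style residual block defined alongside ResNet
net = ResNet(BasicBlock, [2, 2, 2, 2], low_dim=128)

x = torch.randn(4, 3, 32, 32)   # dummy CIFAR-sized batch
v = net(x)

print(v.shape)                  # torch.Size([4, 128])
print(v.norm(p=2, dim=1))       # every entry is approximately 1 because of the Normalize layer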
Noise-Contrastive Estimation - 1
$$P(i|v) = \frac{\text{exp}(v^T f_i/\tau)}{Z_i}$$
$$Z \approx Z_i \approx \frac{n}{m} \sum_{k=1}^m \text{exp} (v_{jk}^T f_i / \tau)$$
The code below implements the equations above.
Since the method does not learn by multiplying a trainable weight matrix but instead needs the \(v\) computed in the previous iteration, the forward and backward passes are customized.
Reference: the PyTorch Example page shows how to customize forward and backward, with examples.
Reference - 1
self.register_buffer
: Used to register a buffer that should not be considered a model parameter. For example, BatchNorm's 'running_mean' is not a parameter but is kept as part of the module's state.
Args:
name (string): name of the buffer. The buffer can be accessed
from this module using the given name
tensor (Tensor): buffer to be registered.
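A minimal sketch of how register_buffer behaves (the module and buffer names below are illustrative only):

import torch
import torch.nn as nn

class BufferExample(nn.Module):
    def __init__(self):
        super(BufferExample, self).__init__()
        self.register_buffer('running_mean', torch.zeros(10))  # state, not a parameter

m = BufferExample()
print(list(m.parameters()))   # [] -> buffers get no gradient and are not passed to the optimizer
print(m.state_dict().keys())  # odict_keys(['running_mean']) -> but they are saved, loaded and moved with .cuda()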
Reference - 2
self.save_for_backward
: Any object that will be needed in backward can be saved (cached) here.
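A minimal custom Function sketch (not from the repository) showing how save_for_backward caches tensors in forward and how they are read back in backward:

import torch
from torch.autograd import Function

class Square(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # cache x for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors        # retrieve the cached tensor
        return grad_output * 2 * x    # d(x^2)/dx = 2x

x = torch.randn(3, requires_grad=True)
Square.apply(x).sum().backward()
print(torch.allclose(x.grad, 2 * x))  # True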
The code can be broadly divided into three parts.

1. Multinomial random sampling
idx = self.multinomial.draw(batchSize * (self.K+1)).view(batchSize, -1)
: To understand NCEAverage precisely, the AliasMethod defined in the repository has to be understood first. AliasMethod is simply a construction for sampling from a multinomial distribution. Its output is batchSize * (K+1) indices (number of negative samples + 1) drawn, according to the given probabilities, from 0 ~ nLem (number of samples). In other words, the multinomial distribution is sampled over the whole dataset, not just over the batch (see the sketch after this list).

2. NCEFunction - Forward
- 2.1: idx.select(1,0).copy_(y.data)
: In forward, index 0 of idx (batchSize x (K+1)) is overwritten with the actual label. That is, the first sample becomes the positive sample and every other one becomes a negative sample.
- 2.2: weight = torch.index_select(memory, 0, idx.view(-1)) \(\rightarrow\) out = torch.bmm(weight, x.data.resize_(batchSize, inputSize, 1)) = \(v^T f_i\)
: weight maps the sampled indices to the values stored in the memory bank (\(v\)) from the previous iteration, i.e. \(f_i \rightarrow v_i\).
- 2.3: out.div_(T).exp_() = \(\text{exp}(v^T f_i/\tau)\)
- 2.4: params[2] = out.mean() * outputSize = \(n \cdot \frac{1}{m}\sum_{k=1}^m \text{exp}(v_{j_k}^T f_i / \tau)\)
- 2.5: out.div_(Z).resize_(batchSize, K+1) = \(P(i|v) = \frac{\text{exp}(v^T f_i/ \tau)}{Z_i}\)

3. NCEFunction - Backward
: This part is customized so that the gradient is propagated back to the feature-extractor output (\(f_{\theta}(\cdot)\)).
- 3.1: memory.index_copy_(0, y, updated_weight)
: Updates the corresponding instances in the memory bank.
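As a sketch of what the sampling step produces (using torch.multinomial with uniform weights in place of the repository's AliasMethod, which draws from the same distribution more efficiently; the sizes below roughly match the CIFAR-10 setup and are assumptions):

import torch

nLem = 50000        # number of samples in the dataset
K = 4096            # number of negative samples
batchSize = 128

unigrams = torch.ones(nLem)   # uniform noise distribution, as in NCEAverage
idx = torch.multinomial(unigrams, batchSize * (K + 1), replacement=True).view(batchSize, -1)

print(idx.shape)    # torch.Size([128, 4097]) -> K+1 indices per sample; column 0 is later overwritten with the positive index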
class NCEFunction(Function):
    @staticmethod
    def forward(self, x, y, memory, idx, params):
        K = int(params[0].item())
        T = params[1].item()
        Z = params[2].item()

        momentum = params[3].item()
        batchSize = x.size(0)
        outputSize = memory.size(0)
        inputSize = memory.size(1)

        # sample positives & negatives
        idx.select(1,0).copy_(y.data)

        # sample corresponding weights
        weight = torch.index_select(memory, 0, idx.view(-1))
        weight.resize_(batchSize, K+1, inputSize)

        # inner product
        out = torch.bmm(weight, x.data.resize_(batchSize, inputSize, 1))
        out.div_(T).exp_() # batchSize * self.K+1
        x.data.resize_(batchSize, inputSize)

        if Z < 0:
            params[2] = out.mean() * outputSize
            Z = params[2].item()
            print("normalization constant Z is set to {:.1f}".format(Z))

        out.div_(Z).resize_(batchSize, K+1)

        self.save_for_backward(x, memory, y, weight, out, params)

        return out

    @staticmethod
    def backward(self, gradOutput):
        x, memory, y, weight, out, params = self.saved_tensors
        K = int(params[0].item())
        T = params[1].item()
        Z = params[2].item()
        momentum = params[3].item()
        batchSize = gradOutput.size(0)

        # gradients d Pm / d linear = exp(linear) / Z
        gradOutput.data.mul_(out.data)
        # add temperature
        gradOutput.data.div_(T)

        gradOutput.data.resize_(batchSize, 1, K+1)

        # gradient of linear
        gradInput = torch.bmm(gradOutput.data, weight)
        gradInput.resize_as_(x)

        # update the non-parametric data
        weight_pos = weight.select(1, 0).resize_as_(x)
        weight_pos.mul_(momentum)
        weight_pos.add_(torch.mul(x.data, 1-momentum))
        w_norm = weight_pos.pow(2).sum(1, keepdim=True).pow(0.5)
        updated_weight = weight_pos.div(w_norm)
        memory.index_copy_(0, y, updated_weight)

        return gradInput, None, None, None, None

class NCEAverage(nn.Module):

    def __init__(self, inputSize, outputSize, K, T=0.07, momentum=0.5, Z=None):
        super(NCEAverage, self).__init__()
        self.nLem = outputSize
        self.unigrams = torch.ones(self.nLem)
        self.multinomial = AliasMethod(self.unigrams)
        self.multinomial.cuda()
        self.K = K

        self.register_buffer('params', torch.tensor([K, T, -1, momentum]))
        stdv = 1. / math.sqrt(inputSize/3)
        self.register_buffer('memory', torch.rand(outputSize, inputSize).mul_(2*stdv).add_(-stdv))

    def forward(self, x, y):
        batchSize = x.size(0)
        idx = self.multinomial.draw(batchSize * (self.K+1)).view(batchSize, -1)
        out = NCEFunction.apply(x, y, self.memory, idx, self.params)
        return out
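The memory-bank update performed in backward (step 3.1 above) boils down to a momentum-averaged, re-normalized write. The following is a standalone sketch of that rule, not repository code:

import torch
import torch.nn.functional as F

def update_memory(memory, y, f, momentum=0.5):
    # Sketch of the non-parametric update: v_i <- normalize(m * v_i + (1 - m) * f_i)
    v_old = memory.index_select(0, y)                # representations stored for this batch
    v_new = momentum * v_old + (1.0 - momentum) * f  # exponential moving average
    v_new = F.normalize(v_new, p=2, dim=1)           # project back onto the unit sphere
    memory.index_copy_(0, y, v_new)                  # write back into the memory bank
    return memory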
Noise-Contrastive Estimation - 2
lnPmt = torch.div(Pmt, Pmt_div)
= \(h(i,v) := P(D=1 | i,v) = \frac{P(i|v)}{P(i|v) + m P_n(i)}\)

lnPon = torch.div(Pon, Pon_div)
= \(1-h(i, v^{'})\)

loss = - (lnPmtsum + lnPonsum) / batchSize
= \(J_{NCE}(\theta) = -\mathbb{E}_{P_d}[\text{log}h(i,v)] - m \cdot \mathbb{E}_{P_n}[\text{log}(1-h(i,v^{'}))]\)
The Proximal Regularization term is not defined in the loss; a similar effect is achieved in the optimization instead. Details are posted in a GitHub issue by the paper's author.
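For reference, the proximal-regularized objective in the paper adds a penalty between the current and previous representation of the same instance, roughly

$$-\text{log}h(i, v_i^{(t-1)}) + \lambda \|v_i^{(t)} - v_i^{(t-1)}\|_2^2$$

and the momentum update of the memory bank in the code plays this smoothing role.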
class NCECriterion(nn.Module):

    def __init__(self, nLem):
        super(NCECriterion, self).__init__()
        self.nLem = nLem

    def forward(self, x, targets):
        batchSize = x.size(0)
        K = x.size(1)-1
        Pnt = 1 / float(self.nLem)
        Pns = 1 / float(self.nLem)

        # eq 5.1 : P(origin=model) = Pmt / (Pmt + k*Pnt)
        Pmt = x.select(1,0)
        Pmt_div = Pmt.add(K * Pnt + eps)
        lnPmt = torch.div(Pmt, Pmt_div)

        # eq 5.2 : P(origin=noise) = k*Pns / (Pms + k*Pns)
        Pon_div = x.narrow(1,1,K).add(K * Pns + eps)
        Pon = Pon_div.clone().fill_(K * Pns)
        lnPon = torch.div(Pon, Pon_div)

        # equation 6 in ref. A
        lnPmt.log_()
        lnPon.log_()

        lnPmtsum = lnPmt.sum(0)
        lnPonsum = lnPon.view(-1, 1).sum(0)

        loss = - (lnPmtsum + lnPonsum) / batchSize

        return loss
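A toy sketch of what NCECriterion computes (eps is assumed to be the small module-level constant defined in the repository, e.g. 1e-7, which the excerpt above relies on; the sizes are illustrative):

import torch

eps = 1e-7
nLem = 50000                  # dataset size -> Pn(i) = 1 / nLem
batchSize, K = 2, 4096

# fake NCEAverage output: column 0 holds P(i|v) for the positive, the remaining K columns for the noise samples
x = torch.rand(batchSize, K + 1)

Pn = 1.0 / nLem
h_pos = x.select(1, 0) / (x.select(1, 0) + K * Pn + eps)        # h(i, v)
h_neg = (K * Pn) / (x.narrow(1, 1, K) + K * Pn + eps)           # 1 - h(i, v')

loss = -(torch.log(h_pos).sum() + torch.log(h_neg).sum()) / batchSize
print(loss)                   # equivalent to the value computed by NCECriterion above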
Model Train
args.low_dim: feature dimension
args.nce_k: number of negative samples for NCE
args.nce_t: temperature parameter for the softmax
args.nce_m: momentum for the non-parametric updates
ndata: number of data samples
index: the dataset's unique index for each sample; think of it as the label that distinguishes each sample as its own instance.
feature = \(v = f_{\theta}(x)\)
The model is trained with the NCEAverage and NCECriterion described above.
lemniscate = NCEAverage(args.low_dim, ndata, args.nce_k, args.nce_t, args.nce_m).cuda()
criterion = NCECriterion(ndata).cuda()

def train(train_loader, model, lemniscate, criterion, optimizer, epoch):
    # switch to train mode
    model.train()

    end = time.time()
    optimizer.zero_grad()
    for i, (input, _, index) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)
        index = index.cuda(non_blocking=True)  # `async=True` in the original repo; renamed for Python 3.7+

        # compute output
        feature = model(input)
        output = lemniscate(feature, index)
        loss = criterion(output, index) / args.iter_size
        loss.backward()

        # measure accuracy and record loss
        losses.update(loss.item() * args.iter_size, input.size(0))

        if (i+1) % args.iter_size == 0:
            # compute gradient and do SGD step
            optimizer.step()
            optimizer.zero_grad()
Reference: Weighted kNN
net: \(f_{\theta}(\cdot)\)
trainFeatures = lemniscate.memory.t(): \(v_i\)
dist = torch.mm(features, trainFeatures): \(s_i = \text{cos}(v_i, \hat{f})\) // since the normalization layer guarantees \(\|v_i\| = 1, \|\hat{f}\|=1\), the inner product is the cosine similarity
yd_transform = yd.clone().div_(sigma).exp_(): \(\alpha_i = \text{exp}(s_i/\tau)\)
probs = torch.sum(torch.mul(retrieval_one_hot.view(batchSize, -1 , C), yd_transform.view(batchSize, -1, 1)), 1): \(w_c = \sum_{i \in N_k} \alpha_i \cdot 1(c_i = c)\)
def kNN(epoch, net, lemniscate, trainloader, testloader, K, sigma, recompute_memory=0):
    net.eval()
    net_time = AverageMeter()
    cls_time = AverageMeter()
    total = 0
    testsize = testloader.dataset.__len__()

    trainFeatures = lemniscate.memory.t()
    if hasattr(trainloader.dataset, 'imgs'):
        trainLabels = torch.LongTensor([y for (p, y) in trainloader.dataset.imgs]).cuda()
    else:
        trainLabels = torch.LongTensor(trainloader.dataset.train_labels).cuda()
    C = trainLabels.max() + 1

    if recompute_memory:
        transform_bak = trainloader.dataset.transform
        trainloader.dataset.transform = testloader.dataset.transform
        temploader = torch.utils.data.DataLoader(trainloader.dataset, batch_size=100, shuffle=False, num_workers=1)
        for batch_idx, (inputs, targets, indexes) in enumerate(temploader):
            targets = targets.cuda(non_blocking=True)  # `async=True` in the original repo; renamed for Python 3.7+
            batchSize = inputs.size(0)
            features = net(inputs)
            trainFeatures[:, batch_idx*batchSize:batch_idx*batchSize+batchSize] = features.data.t()
        trainLabels = torch.LongTensor(temploader.dataset.train_labels).cuda()
        trainloader.dataset.transform = transform_bak

    top1 = 0.
    top5 = 0.
    end = time.time()
    with torch.no_grad():
        retrieval_one_hot = torch.zeros(K, C).cuda()
        for batch_idx, (inputs, targets, indexes) in enumerate(testloader):
            end = time.time()
            targets = targets.cuda(non_blocking=True)
            batchSize = inputs.size(0)
            features = net(inputs)
            net_time.update(time.time() - end)
            end = time.time()

            dist = torch.mm(features, trainFeatures)

            yd, yi = dist.topk(K, dim=1, largest=True, sorted=True)
            candidates = trainLabels.view(1,-1).expand(batchSize, -1)
            retrieval = torch.gather(candidates, 1, yi)

            retrieval_one_hot.resize_(batchSize * K, C).zero_()
            retrieval_one_hot.scatter_(1, retrieval.view(-1, 1), 1)
            yd_transform = yd.clone().div_(sigma).exp_()
            probs = torch.sum(torch.mul(retrieval_one_hot.view(batchSize, -1 , C), yd_transform.view(batchSize, -1, 1)), 1)
            _, predictions = probs.sort(1, True)

            # Find which predictions match the target
            correct = predictions.eq(targets.data.view(-1,1))
            cls_time.update(time.time() - end)

            top1 = top1 + correct.narrow(1,0,1).sum().item()
            top5 = top5 + correct.narrow(1,0,5).sum().item()

            total += targets.size(0)

            print('Test [{}/{}]\t'
                  'Net Time {net_time.val:.3f} ({net_time.avg:.3f})\t'
                  'Cls Time {cls_time.val:.3f} ({cls_time.avg:.3f})\t'
                  'Top1: {:.2f}  Top5: {:.2f}'.format(
                  total, testsize, top1*100./total, top5*100./total, net_time=net_time, cls_time=cls_time))

    print(top1*100./total)

    return top1/total
Reference: Unsupervised Feature Learning via Non-Parametric Instance Discrimination
Reference: zhirongw GitHub
Reference: PyTorch Example
If you find a problem with the code or have any questions, please send an e-mail to wjddyd66@naver.com.