Paper16-2. Unsupervised Feature Learning via Non-Parametric Instance Discrimination
PyTorch Code
Source: zhirongw GitHub
Since the full code is long, only the important parts are excerpted and examined below.
Data Loader
This is one example of a data loader. In addition to the basic input X and the label, it also returns each sample's index so that individual instances can be distinguished.
class CIFAR10Instance(datasets.CIFAR10):
    """CIFAR10Instance Dataset."""

    def __getitem__(self, index):
        if self.train:
            img, target = self.train_data[index], self.train_labels[index]
        else:
            img, target = self.test_data[index], self.test_labels[index]

        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img)

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target, index
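As a minimal usage sketch (the transform and DataLoader settings here are assumptions, not taken from the repository, and it assumes the older torchvision API with train_data / train_labels that the class above relies on), the dataset plugs into a standard DataLoader and each batch then carries the extra index tensor:

import torch
import torchvision.transforms as transforms

# hypothetical minimal transform; the repository's actual augmentation pipeline is richer
transform_train = transforms.Compose([
    transforms.ToTensor(),
])

trainset = CIFAR10Instance(root='./data', train=True, download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

for img, target, index in trainloader:
    # img:    [128, 3, 32, 32] input batch
    # target: [128] class labels (not used for unsupervised training)
    # index:  [128] instance indices, used as the per-sample "class" id
    break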
Model
This is the ResNet backbone that produces \(v = f_{\theta}(x)\). The experiments were run with a 128-dimensional output by default, and an additional normalization layer is included so that \(\|v\| = 1\) is satisfied.
class Normalize(nn.Module):

    def __init__(self, power=2):
        super(Normalize, self).__init__()
        self.power = power

    def forward(self, x):
        norm = x.pow(self.power).sum(1, keepdim=True).pow(1./self.power)
        out = x.div(norm)
        return out

class ResNet(nn.Module):
    def __init__(self, block, num_blocks, low_dim=128):
        super(ResNet, self).__init__()
        self.in_planes = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        self.linear = nn.Linear(512*block.expansion, low_dim)
        self.l2norm = Normalize(2)

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = F.avg_pool2d(out, 4)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        out = self.l2norm(out)
        return out
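As a quick sanity check (a sketch only; the BasicBlock class and the [2, 2, 2, 2] block configuration for a ResNet-18-style backbone are assumptions, not shown in the excerpt), the network output should be a 128-dimensional, L2-normalized vector:

import torch

# BasicBlock is assumed to be the usual CIFAR-style residual block defined alongside ResNet
net = ResNet(BasicBlock, [2, 2, 2, 2], low_dim=128)

x = torch.randn(4, 3, 32, 32)   # dummy CIFAR-sized batch
v = net(x)

print(v.shape)                  # torch.Size([4, 128])
print(v.norm(p=2, dim=1))       # every entry is approximately 1 because of the Normalize layer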
Noise-Contrastive Estimation - 1
$$P(i|v) = \frac{\text{exp}(v^T f_i/\tau)}{Z_i}$$
$$Z \approx Z_i \approx \frac{n}{m} \sum_{k=1}^m \text{exp} (v_{jk}^T f_i / \tau)$$
The code below implements the equations above.
Since the method does not learn by multiplying a trainable weight matrix but instead needs the \(v\) computed in the previous iteration, the forward and backward passes are customized.
Reference: the PyTorch Example page shows how to customize forward and backward, with examples.
Reference - 1
self.register_buffer
: Used to register a buffer that should not be considered a model parameter. For example, BatchNorm's 'running_mean' is not a parameter but is kept as part of the module's state.
Args:
name (string): name of the buffer. The buffer can be accessed
from this module using the given name
tensor (Tensor): buffer to be registered.
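A minimal sketch of how register_buffer behaves (the module and buffer names below are illustrative only):

import torch
import torch.nn as nn

class BufferExample(nn.Module):
    def __init__(self):
        super(BufferExample, self).__init__()
        self.register_buffer('running_mean', torch.zeros(10))  # state, not a parameter

m = BufferExample()
print(list(m.parameters()))   # [] -> buffers get no gradient and are not passed to the optimizer
print(m.state_dict().keys())  # odict_keys(['running_mean']) -> but they are saved, loaded and moved with .cuda()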
Reference - 2
self.save_for_backward
: Any object that will be needed in backward can be saved (cached) here.
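A minimal custom Function sketch (not from the repository) showing how save_for_backward caches tensors in forward and how they are read back in backward:

import torch
from torch.autograd import Function

class Square(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # cache x for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors        # retrieve the cached tensor
        return grad_output * 2 * x    # d(x^2)/dx = 2x

x = torch.randn(3, requires_grad=True)
Square.apply(x).sum().backward()
print(torch.allclose(x.grad, 2 * x))  # True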
The code can be broadly divided into three parts.

1. Multinomial random sampling
idx = self.multinomial.draw(batchSize * (self.K+1)).view(batchSize, -1)
: To understand NCEAverage precisely, the AliasMethod defined in the repository has to be understood first. AliasMethod is simply a construction for sampling from a multinomial distribution. Its output is batchSize * (K+1) indices (number of negative samples + 1) drawn, according to the given probabilities, from 0 ~ nLem (number of samples). In other words, the multinomial distribution is sampled over the whole dataset, not just over the batch (see the sketch after this list).

2. NCEFunction - Forward
- 2.1: idx.select(1,0).copy_(y.data)
: In forward, index 0 of idx (batchSize x (K+1)) is overwritten with the actual label. That is, the first sample becomes the positive sample and every other one becomes a negative sample.
- 2.2: weight = torch.index_select(memory, 0, idx.view(-1)) \(\rightarrow\) out = torch.bmm(weight, x.data.resize_(batchSize, inputSize, 1)) = \(v^T f_i\)
: weight maps the sampled indices to the values stored in the memory bank (\(v\)) from the previous iteration, i.e. \(f_i \rightarrow v_i\).
- 2.3: out.div_(T).exp_() = \(\text{exp}(v^T f_i/\tau)\)
- 2.4: params[2] = out.mean() * outputSize = \(n \cdot \frac{1}{m}\sum_{k=1}^m \text{exp}(v_{j_k}^T f_i / \tau)\)
- 2.5: out.div_(Z).resize_(batchSize, K+1) = \(P(i|v) = \frac{\text{exp}(v^T f_i/ \tau)}{Z_i}\)

3. NCEFunction - Backward
: This part is customized so that the gradient is propagated back to the feature-extractor output (\(f_{\theta}(\cdot)\)).
- 3.1: memory.index_copy_(0, y, updated_weight)
: Updates the corresponding instances in the memory bank.
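As a sketch of what the sampling step produces (using torch.multinomial with uniform weights in place of the repository's AliasMethod, which draws from the same distribution more efficiently; the sizes below roughly match the CIFAR-10 setup and are assumptions):

import torch

nLem = 50000        # number of samples in the dataset
K = 4096            # number of negative samples
batchSize = 128

unigrams = torch.ones(nLem)   # uniform noise distribution, as in NCEAverage
idx = torch.multinomial(unigrams, batchSize * (K + 1), replacement=True).view(batchSize, -1)

print(idx.shape)    # torch.Size([128, 4097]) -> K+1 indices per sample; column 0 is later overwritten with the positive index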
class NCEFunction(Function):
    @staticmethod
    def forward(self, x, y, memory, idx, params):
        K = int(params[0].item())
        T = params[1].item()
        Z = params[2].item()

        momentum = params[3].item()
        batchSize = x.size(0)
        outputSize = memory.size(0)
        inputSize = memory.size(1)

        # sample positives & negatives
        idx.select(1,0).copy_(y.data)

        # sample corresponding weights
        weight = torch.index_select(memory, 0, idx.view(-1))
        weight.resize_(batchSize, K+1, inputSize)

        # inner product
        out = torch.bmm(weight, x.data.resize_(batchSize, inputSize, 1))
        out.div_(T).exp_() # batchSize * self.K+1
        x.data.resize_(batchSize, inputSize)

        if Z < 0:
            params[2] = out.mean() * outputSize
            Z = params[2].item()
            print("normalization constant Z is set to {:.1f}".format(Z))

        out.div_(Z).resize_(batchSize, K+1)

        self.save_for_backward(x, memory, y, weight, out, params)

        return out

    @staticmethod
    def backward(self, gradOutput):
        x, memory, y, weight, out, params = self.saved_tensors
        K = int(params[0].item())
        T = params[1].item()
        Z = params[2].item()
        momentum = params[3].item()
        batchSize = gradOutput.size(0)

        # gradients d Pm / d linear = exp(linear) / Z
        gradOutput.data.mul_(out.data)
        # add temperature
        gradOutput.data.div_(T)

        gradOutput.data.resize_(batchSize, 1, K+1)

        # gradient of linear
        gradInput = torch.bmm(gradOutput.data, weight)
        gradInput.resize_as_(x)

        # update the non-parametric data
        weight_pos = weight.select(1, 0).resize_as_(x)
        weight_pos.mul_(momentum)
        weight_pos.add_(torch.mul(x.data, 1-momentum))
        w_norm = weight_pos.pow(2).sum(1, keepdim=True).pow(0.5)
        updated_weight = weight_pos.div(w_norm)
        memory.index_copy_(0, y, updated_weight)

        return gradInput, None, None, None, None

class NCEAverage(nn.Module):

    def __init__(self, inputSize, outputSize, K, T=0.07, momentum=0.5, Z=None):
        super(NCEAverage, self).__init__()
        self.nLem = outputSize
        self.unigrams = torch.ones(self.nLem)
        self.multinomial = AliasMethod(self.unigrams)
        self.multinomial.cuda()
        self.K = K

        self.register_buffer('params', torch.tensor([K, T, -1, momentum]))
        stdv = 1. / math.sqrt(inputSize/3)
        self.register_buffer('memory', torch.rand(outputSize, inputSize).mul_(2*stdv).add_(-stdv))

    def forward(self, x, y):
        batchSize = x.size(0)
        idx = self.multinomial.draw(batchSize * (self.K+1)).view(batchSize, -1)
        out = NCEFunction.apply(x, y, self.memory, idx, self.params)
        return out
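The memory-bank update performed in backward (step 3.1 above) boils down to a momentum-averaged, re-normalized write. The following is a standalone sketch of that rule, not repository code:

import torch
import torch.nn.functional as F

def update_memory(memory, y, f, momentum=0.5):
    # Sketch of the non-parametric update: v_i <- normalize(m * v_i + (1 - m) * f_i)
    v_old = memory.index_select(0, y)                # representations stored for this batch
    v_new = momentum * v_old + (1.0 - momentum) * f  # exponential moving average
    v_new = F.normalize(v_new, p=2, dim=1)           # project back onto the unit sphere
    memory.index_copy_(0, y, v_new)                  # write back into the memory bank
    return memory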
Noise-Contrastive Estimation - 2
lnPmt = torch.div(Pmt, Pmt_div)
= \(h(i,v) := P(D=1 | i,v) = \frac{P(i|v)}{P(i|v) + m P_n(i)}\)

lnPon = torch.div(Pon, Pon_div)
= \(1-h(i, v^{'})\)

loss = - (lnPmtsum + lnPonsum) / batchSize
= \(J_{NCE}(\theta) = -\mathbb{E}_{P_d}[\text{log}h(i,v)] - m \cdot \mathbb{E}_{P_n}[\text{log}(1-h(i,v^{'}))]\)
The Proximal Regularization term is not defined in the loss; a similar effect is achieved in the optimization instead. Details are posted in a GitHub issue by the paper's author.
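For reference, the proximal-regularized objective in the paper adds a penalty between the current and previous representation of the same instance, roughly

$$-\text{log}h(i, v_i^{(t-1)}) + \lambda \|v_i^{(t)} - v_i^{(t-1)}\|_2^2$$

and the momentum update of the memory bank in the code plays this smoothing role.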
class NCECriterion(nn.Module):

    def __init__(self, nLem):
        super(NCECriterion, self).__init__()
        self.nLem = nLem

    def forward(self, x, targets):
        batchSize = x.size(0)
        K = x.size(1)-1
        Pnt = 1 / float(self.nLem)
        Pns = 1 / float(self.nLem)

        # eq 5.1 : P(origin=model) = Pmt / (Pmt + k*Pnt)
        Pmt = x.select(1,0)
        Pmt_div = Pmt.add(K * Pnt + eps)
        lnPmt = torch.div(Pmt, Pmt_div)

        # eq 5.2 : P(origin=noise) = k*Pns / (Pms + k*Pns)
        Pon_div = x.narrow(1,1,K).add(K * Pns + eps)
        Pon = Pon_div.clone().fill_(K * Pns)
        lnPon = torch.div(Pon, Pon_div)

        # equation 6 in ref. A
        lnPmt.log_()
        lnPon.log_()

        lnPmtsum = lnPmt.sum(0)
        lnPonsum = lnPon.view(-1, 1).sum(0)

        loss = - (lnPmtsum + lnPonsum) / batchSize

        return loss
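A toy sketch of what NCECriterion computes (eps is assumed to be the small module-level constant defined in the repository, e.g. 1e-7, which the excerpt above relies on; the sizes are illustrative):

import torch

eps = 1e-7
nLem = 50000                  # dataset size -> Pn(i) = 1 / nLem
batchSize, K = 2, 4096

# fake NCEAverage output: column 0 holds P(i|v) for the positive, the remaining K columns for the noise samples
x = torch.rand(batchSize, K + 1)

Pn = 1.0 / nLem
h_pos = x.select(1, 0) / (x.select(1, 0) + K * Pn + eps)        # h(i, v)
h_neg = (K * Pn) / (x.narrow(1, 1, K) + K * Pn + eps)           # 1 - h(i, v')

loss = -(torch.log(h_pos).sum() + torch.log(h_neg).sum()) / batchSize
print(loss)                   # equivalent to the value computed by NCECriterion above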
Model Train
args.low_dim: feature dimension
args.nce_k: number of negative samples for NCE
args.nce_t: temperature parameter for the softmax
args.nce_m: momentum for the non-parametric updates
ndata: number of data samples
index: the dataset's unique index for each sample; think of it as the label that distinguishes each sample as its own instance.
feature = \(v = f_{\theta}(x)\)
The model is trained with the NCEAverage and NCECriterion described above.
lemniscate = NCEAverage(args.low_dim, ndata, args.nce_k, args.nce_t, args.nce_m).cuda()
criterion = NCECriterion(ndata).cuda()

def train(train_loader, model, lemniscate, criterion, optimizer, epoch):
    # switch to train mode
    model.train()

    end = time.time()
    optimizer.zero_grad()
    for i, (input, _, index) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)
        index = index.cuda(non_blocking=True)  # `async=True` in the original repo; renamed for Python 3.7+

        # compute output
        feature = model(input)
        output = lemniscate(feature, index)
        loss = criterion(output, index) / args.iter_size
        loss.backward()

        # measure accuracy and record loss
        losses.update(loss.item() * args.iter_size, input.size(0))

        if (i+1) % args.iter_size == 0:
            # compute gradient and do SGD step
            optimizer.step()
            optimizer.zero_grad()
Reference: Weighted kNN
net: \(f_{\theta}(\cdot)\)
trainFeatures = lemniscate.memory.t(): \(v_i\)
dist = torch.mm(features, trainFeatures): \(s_i = \text{cos}(v_i, \hat{f})\) // since the normalization layer guarantees \(\|v_i\| = 1, \|\hat{f}\|=1\), the inner product is the cosine similarity
yd_transform = yd.clone().div_(sigma).exp_(): \(\alpha_i = \text{exp}(s_i/\tau)\)
probs = torch.sum(torch.mul(retrieval_one_hot.view(batchSize, -1 , C), yd_transform.view(batchSize, -1, 1)), 1): \(w_c = \sum_{i \in N_k} \alpha_i \cdot 1(c_i = c)\)
def kNN(epoch, net, lemniscate, trainloader, testloader, K, sigma, recompute_memory=0):
    net.eval()
    net_time = AverageMeter()
    cls_time = AverageMeter()
    total = 0
    testsize = testloader.dataset.__len__()

    trainFeatures = lemniscate.memory.t()
    if hasattr(trainloader.dataset, 'imgs'):
        trainLabels = torch.LongTensor([y for (p, y) in trainloader.dataset.imgs]).cuda()
    else:
        trainLabels = torch.LongTensor(trainloader.dataset.train_labels).cuda()
    C = trainLabels.max() + 1

    if recompute_memory:
        transform_bak = trainloader.dataset.transform
        trainloader.dataset.transform = testloader.dataset.transform
        temploader = torch.utils.data.DataLoader(trainloader.dataset, batch_size=100, shuffle=False, num_workers=1)
        for batch_idx, (inputs, targets, indexes) in enumerate(temploader):
            targets = targets.cuda(non_blocking=True)  # `async=True` in the original repo; renamed for Python 3.7+
            batchSize = inputs.size(0)
            features = net(inputs)
            trainFeatures[:, batch_idx*batchSize:batch_idx*batchSize+batchSize] = features.data.t()
        trainLabels = torch.LongTensor(temploader.dataset.train_labels).cuda()
        trainloader.dataset.transform = transform_bak

    top1 = 0.
    top5 = 0.
    end = time.time()
    with torch.no_grad():
        retrieval_one_hot = torch.zeros(K, C).cuda()
        for batch_idx, (inputs, targets, indexes) in enumerate(testloader):
            end = time.time()
            targets = targets.cuda(non_blocking=True)
            batchSize = inputs.size(0)
            features = net(inputs)
            net_time.update(time.time() - end)
            end = time.time()

            dist = torch.mm(features, trainFeatures)

            yd, yi = dist.topk(K, dim=1, largest=True, sorted=True)
            candidates = trainLabels.view(1,-1).expand(batchSize, -1)
            retrieval = torch.gather(candidates, 1, yi)

            retrieval_one_hot.resize_(batchSize * K, C).zero_()
            retrieval_one_hot.scatter_(1, retrieval.view(-1, 1), 1)
            yd_transform = yd.clone().div_(sigma).exp_()
            probs = torch.sum(torch.mul(retrieval_one_hot.view(batchSize, -1 , C), yd_transform.view(batchSize, -1, 1)), 1)
            _, predictions = probs.sort(1, True)

            # Find which predictions match the target
            correct = predictions.eq(targets.data.view(-1,1))
            cls_time.update(time.time() - end)

            top1 = top1 + correct.narrow(1,0,1).sum().item()
            top5 = top5 + correct.narrow(1,0,5).sum().item()

            total += targets.size(0)

            print('Test [{}/{}]\t'
                  'Net Time {net_time.val:.3f} ({net_time.avg:.3f})\t'
                  'Cls Time {cls_time.val:.3f} ({cls_time.avg:.3f})\t'
                  'Top1: {:.2f}  Top5: {:.2f}'.format(
                  total, testsize, top1*100./total, top5*100./total, net_time=net_time, cls_time=cls_time))

    print(top1*100./total)

    return top1/total
Reference: Unsupervised Feature Learning via Non-Parametric Instance Discrimination
Reference: zhirongw GitHub
Reference: PyTorch Example
If you find a problem with the code or have any questions, please send an e-mail to wjddyd66@naver.com.