[Paper Notes] Facial Expression Recognition in the Wild via Deep Attentive Center Loss

Information

Title: Facial Expression Recognition in the Wild via Deep Attentive Center Loss
Author: Amir Hossein Farzaneh and Xiaojun Qi
Institution: Department of Computer Science, Utah State University, Logan, UT 84322, USA
Year: 2021
Journal: WACV
Source: PDF, Official code
Idea: use an attention mechanism to weight the sparse center loss

@InProceedings{Farzaneh_2021_WACV,
    author    = {Farzaneh, Amir Hossein and Qi, Xiaojun},
    title     = {Facial Expression Recognition in the Wild via Deep Attentive Center Loss},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {2402-2411}
}

Abstract

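Problem addressed: in metric learning, supervising all features equally, including irrelevant ones, degrades the model's generalization.

Proposed solution: DACL (Deep Attentive Center Loss), which adaptively selects the salient feature elements.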

Introduction

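Attention weights are estimated from the features extracted by the convolutional layers, and are used to guide the sparse center loss module so that samples of the same class are pulled together and different classes are pushed apart.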

Method

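(The LaTeX source of this paper is not available, so I won't type out the formulas; here is a brief summary.) The center loss is the sum of squared distances from each sample to its class center (WCSS, Within Cluster Sum of Squares). The authors argue that some feature dimensions are unimportant for the target task, so the distance between the \(d\)-dimensional CNN feature of a sample and its center is weighted dimension by dimension, and the weights come from an attention mechanism.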
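Since the formulas are omitted here, the weighted loss can be written down from the `SparseCenterLoss` code quoted later in these notes. This is a reading of that code, not a transcription of the paper; the symbols \(m\) (batch size), \(a_{ij}\) (attention weight of dimension \(j\) for sample \(i\)), \(c_{y_i}\) (center of the class of sample \(i\)) and \(\lambda\) (the `lamb` setting) are my own notation.

\[
\mathcal{L}_{sc} = \frac{1}{2m} \sum_{i=1}^{m} \sum_{j=1}^{d} a_{ij} \, \left( x_{ij} - c_{y_i j} \right)^2,
\qquad
\mathcal{L} = \mathcal{L}_s + \lambda \, \mathcal{L}_{sc}
\]

Setting every \(a_{ij} = 1\) gives back the ordinary (unweighted) center loss, while \(a_{ij}\) close to 0 effectively removes dimension \(j\) of sample \(i\) from the sum.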

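As shown in the figure, \(\mathcal{L}_s\) is an ordinary classification loss, and the dashed box at the top is the authors' proposed method. Here \(x_i^*\) is a \(d\)-dimensional feature extracted by the CNN; the CE-Unit extracts the relevant information \(e_i\) from it, the multi-head classifier then computes the weights, and finally the weights are applied to the sparse center loss. The idea is not very complicated, but one key point remains: how the attention network \(\mathcal{A}\) is designed.

The attention network \(\mathcal{A}\) consists of two modules:

Context Encoder Unit (CE-Unit): takes the feature map extracted by the CNN as context and produces a latent feature. It is essentially three fully connected layers: \(flatten \rightarrow fc \rightarrow BN \rightarrow relu \rightarrow fc \rightarrow BN \rightarrow relu \rightarrow fc \rightarrow BN \rightarrow tanh\).
Multi-head binary classifier: takes the latent feature from the CE-Unit as input and estimates the attention weights. Each head outputs two scores, one for including the corresponding feature element and one for excluding it; a softmax is applied over the two scores, and the probability of "include" is used as the final weight.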
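To make the multi-head step concrete, here is a minimal pure-Python sketch of what a single head does. The function names and the example scores are hypothetical; the only thing taken from the notes is that each head applies a softmax over two scores and keeps the "include" entry (index 1 in the code quoted further down). It also shows that a two-way softmax is the same as a sigmoid of the score difference.

```python
import math

def head_weight(score_exclude, score_include):
    """Softmax over one head's two scores; return the 'include' probability."""
    m = max(score_exclude, score_include)  # subtract the max for numerical stability
    e_ex = math.exp(score_exclude - m)
    e_in = math.exp(score_include - m)
    return e_in / (e_ex + e_in)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Three hypothetical heads, i.e. three feature dimensions,
# each given as (exclude score, include score).
scores = [(0.0, 0.0), (2.0, -1.0), (-1.0, 2.0)]
weights = [head_weight(ex, inc) for ex, inc in scores]

for (ex, inc), w in zip(scores, weights):
    # The two-way softmax equals sigmoid(include - exclude).
    print(f"scores=({ex}, {inc})  weight={w:.4f}  sigmoid={sigmoid(inc - ex):.4f}")
```

Each weight lies strictly between 0 and 1 and is computed independently per dimension, so the weights do not have to sum to 1 across dimensions.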

Experiment

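The experiments use two datasets, RAF-DB and AffectNet, with a ResNet-18 pre-trained on MS-Celeb-1M as the backbone. See the paper for detailed results. Compared with the plain center loss the gain is modest: roughly \(1\%\) on both datasets.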

Conclusion

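DACL is a method that uses an attention mechanism to adaptively control the strength of the feature representation in deep metric learning. The attention mechanism is fully parameterized by a customizable neural network; it estimates the probability that each feature dimension contributes and passes these probabilities to the sparse center loss as attention weights.

The method does work, but the added parameters seem out of proportion to the gain: the attention network is clearly not small, while the improvement is limited. The idea is still worth borrowing.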

Others

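Code walkthrough

The model code shows that the attention network is indeed heavy: the first fc layer alone has tens of millions of parameters (512 × 7 × 7 × 3584, about 90 million weights).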

class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        ...
        # DACL attention network
        self.nb_head = 512
        self.attention = nn.Sequential(
            nn.Linear(512 * 7 * 7, 3584),
            nn.BatchNorm1d(3584),
            nn.ReLU(inplace=True),
            nn.Linear(3584, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 64),
            nn.BatchNorm1d(64),
            nn.Tanh(),
        )
        self.attention_heads = nn.Linear(64, 2 * self.nb_head)
    
    def forward(self, x):
        ...
        # DACL attention
        x_flat = torch.flatten(x, 1)
        E = self.attention(x_flat)
        A = self.attention_heads(E).reshape(-1, 512, 2).softmax(dim=-1)[:, :, 1]

        x = self.avgpool(x)
        f = torch.flatten(x, 1)
        out = self.fc(f)

        return f, out, A
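As a quick sanity check on the size of the attention branch, the parameter counts can be worked out by hand from the layer sizes in the code above. The helper names below are my own; the sketch counts only learnable parameters (weights and biases of the linear layers, scale and shift of the batch-norm layers) and assumes the default `bias=True` and `affine=True` settings.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n):
    """Learnable scale and shift of a BatchNorm1d layer."""
    return 2 * n

# Layer sizes copied from the attention network shown above.
first_fc = linear_params(512 * 7 * 7, 3584)
ce_unit = (first_fc + batchnorm_params(3584)
           + linear_params(3584, 512) + batchnorm_params(512)
           + linear_params(512, 64) + batchnorm_params(64))
heads = linear_params(64, 2 * 512)
total = ce_unit + heads

print(f"first fc layer:          {first_fc:,}")
print(f"CE-Unit:                 {ce_unit:,}")
print(f"attention heads:         {heads:,}")
print(f"attention network total: {total:,}")
```

Almost all of the roughly 92 million parameters sit in the first fc layer. For comparison, a standard ResNet-18 has on the order of 11 to 12 million parameters, so the attention branch is several times larger than the backbone itself.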

Loss function

criterion = {
    'softmax': nn.CrossEntropyLoss().to(device),
    'center': SparseCenterLoss(7, feat_size).to(device)
}
optimizer = {
    'softmax': torch.optim.SGD(model.parameters(), cfg['lr'],
                               momentum=cfg['momentum'],
                               weight_decay=cfg['weight_decay']),
    'center': torch.optim.SGD(criterion['center'].parameters(), cfg['alpha'])
}
    
# ---- train ----
# compute output
feat, output, A = model(images)
l_softmax = criterion['softmax'](output, target)
l_center = criterion['center'](feat, A, target)
l_total = l_softmax + cfg['lamb'] * l_center
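The excerpt above stops at `l_total`, so here is a hedged sketch of how one update step with two optimizers could look. Everything in it is a stand-in for illustration: `ToyModel`, `ToyCenterLoss`, the tensor sizes, the learning rates and the factor `0.01` are hypothetical, and the toy loss relies on plain autograd rather than the hand-written backward of `SparseCenterLoss`. It is not the official training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    """Hypothetical stand-in returning (feature, logits, attention weights)."""
    def __init__(self, in_dim=8, feat_dim=4, num_classes=7):
        super().__init__()
        self.backbone = nn.Linear(in_dim, feat_dim)
        self.attn = nn.Linear(in_dim, feat_dim)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        f = self.backbone(x)
        a = torch.sigmoid(self.attn(x))  # weights in (0, 1)
        return f, self.fc(f), a

class ToyCenterLoss(nn.Module):
    """Attention-weighted center loss using plain autograd (an assumption)."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feat, A, label):
        diff = feat - self.centers[label]
        return (A * diff.pow(2)).sum() / 2.0 / feat.size(0)

model = ToyModel()
criterion = {'softmax': nn.CrossEntropyLoss(), 'center': ToyCenterLoss(7, 4)}
optimizer = {
    # Network weights and class centers get separate optimizers,
    # so the centers can use their own learning rate.
    'softmax': torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    'center': torch.optim.SGD(criterion['center'].parameters(), lr=0.5),
}

images = torch.randn(16, 8)
target = torch.randint(0, 7, (16,))
centers_before = criterion['center'].centers.detach().clone()

feat, output, A = model(images)
l_total = (criterion['softmax'](output, target)
           + 0.01 * criterion['center'](feat, A, target))

# One backward pass feeds both optimizers.
for opt in optimizer.values():
    opt.zero_grad()
l_total.backward()
for opt in optimizer.values():
    opt.step()

centers_changed = not torch.equal(centers_before,
                                  criterion['center'].centers.detach())
```

The point of the split is that the class centers are parameters of the loss module, not of the model, so `model.parameters()` never sees them; the second optimizer is what actually moves the centers.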

So the key lies in the loss function, although I don't fully understand the code below.

import torch
import torch.nn as nn
from torch.autograd.function import Function


class SparseCenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, size_average=True):
        super(SparseCenterLoss, self).__init__()
        self.centers = nn.Parameter(torch.FloatTensor(num_classes, feat_dim))
        self.sparse_centerloss = SparseCenterLossFunction.apply
        self.feat_dim = feat_dim
        self.size_average = size_average
        self.reset_params()

    def reset_params(self):
        nn.init.kaiming_normal_(self.centers.data.t())

    def forward(self, feat, A, label):
        batch_size = feat.size(0)
        feat = feat.view(batch_size, -1)
        # To check the dim of centers and features
        if feat.size(1) != self.feat_dim:
            raise ValueError("Center's dim: {0} should be equal to input feature's \
                            dim: {1}".format(self.feat_dim, feat.size(1)))
        batch_size_tensor = feat.new_empty(1).fill_(batch_size if self.size_average else 1)
        loss = self.sparse_centerloss(feat, A, label, self.centers, batch_size_tensor)
        return loss


class SparseCenterLossFunction(Function):
    @staticmethod
    def forward(ctx, feature, A, label, centers, batch_size):
        ctx.save_for_backward(feature, A, label, centers, batch_size)
        centers_batch = centers.index_select(0, label.long())
        return (A * (feature - centers_batch).pow(2)).sum() / 2.0 / batch_size

    @staticmethod
    def backward(ctx, grad_output):
        feature, A, label, centers, batch_size = ctx.saved_tensors
        centers_batch = centers.index_select(0, label.long())
        diff = feature - centers_batch
        # init every iteration
        counts = centers.new_ones(centers.size(0))
        ones = centers.new_ones(label.size(0))
        grad_centers = centers.new_zeros(centers.size())

        # A gradient
        grad_A = diff.pow(2) / 2.0 / batch_size

        counts.scatter_add_(0, label.long(), ones)
        grad_centers.scatter_add_(0, label.unsqueeze(1).expand(feature.size()).long(), - A * diff)
        grad_centers = grad_centers / counts.view(-1, 1)
        return grad_output * A * diff / batch_size, grad_output * grad_A, None, grad_centers, None
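For readers puzzled by the hand-written `backward`, here is my reading of what it returns, written as formulas. The notation is my own (\(m\) is the batch size, \(a_i\) the attention weights of sample \(i\), \(\odot\) the element-wise product, \(n_k\) the number of samples of class \(k\) in the batch, \(g\) the incoming `grad_output`); treat it as an interpretation of the code above rather than a statement from the paper.

\[
\frac{\partial \mathcal{L}_{sc}}{\partial x_i} = \frac{g}{m} \, a_i \odot (x_i - c_{y_i}),
\qquad
\frac{\partial \mathcal{L}_{sc}}{\partial a_i} = \frac{g}{2m} \, (x_i - c_{y_i})^{2}
\]

\[
\Delta c_k = \frac{\sum_{i \,:\, y_i = k} -\, a_i \odot (x_i - c_k)}{1 + n_k}
\]

The first two are the true gradients of the forward expression with respect to the features and the attention weights. The third is not the plain gradient with respect to the centers: it is not multiplied by \(g\) and not divided by \(m\), and it is normalized per class by \(1 + n_k\) (`counts` starts at one and then accumulates the class counts). This matches the center update rule of the original center loss, with the attention weights added. The dedicated SGD optimizer for the centers then applies \(c_k \leftarrow c_k - \alpha \, \Delta c_k\), which explains why the centers have their own learning rate `cfg['alpha']`, and why a custom `Function` is used instead of autograd.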
