
CAAI Transactions on Intelligent Systems, 2019, Vol. 14, Issue (6): 1064-1072. DOI: 10.11992/tis.201906045

### Cite this article

LIU Bing, LI Ruilin, FENG Jufu. A brief introduction to deep metric learning[J]. CAAI Transactions on Intelligent Systems, 2019, 14(6): 1064-1072. DOI: 10.11992/tis.201906045.


A brief introduction to deep metric learning
LIU Bing 1,2, LI Ruilin 1,2, FENG Jufu 1,2
1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China;
2. Key Laboratory of Machine Perception (MOE), Peking University, Beijing 100871, China
Abstract: Recently, deep metric learning (DML) has become one of the most active research areas in machine learning. Its key problem is learning an effective deep metric to measure the similarity between samples. For existing loss functions built on pairs or triplets, the number of possible positive and negative combinations grows combinatorially as the training data increases, so a reasonable solution is to sample only the positive and negative examples that are meaningful for training, a strategy known as hard example mining. To alleviate the computational cost of mining meaningful samples, the proxy loss instead learns a proxy set that is much smaller than the sample set. This review summarizes representative DML algorithms in chronological order and discusses their relationship with softmax classification, finding that these two seemingly parallel lines of research share a consistent underlying idea. The paper further explores several improved algorithms designed to strengthen the discriminative power of softmax and introduces them into metric learning, so as to further reduce intra-class distance, enlarge inter-class distance, and ultimately improve the discriminative performance of the algorithm.
Key words: deep metric learning    deep learning    machine learning    contrastive loss    triplet loss    proxy loss    softmax classification    temperature

1 Deep metric learning

1.1 Contrastive loss

 $l_{\rm contrast}(X_i, X_j) = y_{i,j}\, d_{i,j}^2 + (1 - y_{i,j})\,[\alpha - d_{i,j}^2]_+$ (1)
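For concreteness, the following is a minimal PyTorch sketch of Eq. (1); the margin value and the convention that $y_{i,j}=1$ marks a same-class pair are illustrative assumptions:

```python
import torch

def contrastive_loss(x_i, x_j, y_ij, alpha=1.0):
    """Eq. (1): y*d^2 + (1-y)*[alpha - d^2]_+ over a batch of pairs.
    y_ij is 1 for positive (same-class) pairs and 0 for negative pairs."""
    d2 = ((x_i - x_j) ** 2).sum(dim=1)          # squared Euclidean distance d_{i,j}^2
    return (y_ij * d2 + (1 - y_ij) * torch.clamp(alpha - d2, min=0)).mean()

# toy usage: 4 embedding pairs of dimension 8
x_i, x_j = torch.randn(4, 8), torch.randn(4, 8)
y = torch.tensor([1., 0., 1., 0.])
print(contrastive_loss(x_i, x_j, y))
```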

1.2 Triplet loss

 $l_{\rm triplet}(X_a, X_p, X_n) = [d_{a,p}^2 + m - d_{a,n}^2]_+$ (2)
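A corresponding sketch of Eq. (2), with an illustrative margin $m$:

```python
import torch

def triplet_loss(x_a, x_p, x_n, m=0.2):
    """Eq. (2): [d_{a,p}^2 + m - d_{a,n}^2]_+ averaged over a batch of triplets."""
    d2_ap = ((x_a - x_p) ** 2).sum(dim=1)       # anchor-positive squared distance
    d2_an = ((x_a - x_n) ** 2).sum(dim=1)       # anchor-negative squared distance
    return torch.clamp(d2_ap + m - d2_an, min=0).mean()

x_a, x_p, x_n = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
print(triplet_loss(x_a, x_p, x_n))
```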

1.3 Lifted structured loss

 $\begin{gathered} J = \dfrac{1}{2|\hat P|}\sum\limits_{(i,j) \in \hat P} \max (0, J_{i,j})^2 \\ J_{i,j} = \max \Big( \max\limits_{(i,k) \in \hat N} \alpha - D_{i,k},\ \max\limits_{(j,l) \in \hat N} \alpha - D_{j,l} \Big) + D_{i,j} \end{gathered}$ (3)
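A direct (unvectorized) PyTorch sketch of Eq. (3); practical implementations vectorize the double loop, but this form mirrors the equation:

```python
import torch

def lifted_structured_loss(emb, labels, alpha=1.0):
    """Eq. (3): for each positive pair (i, j), J_{i,j} adds D_{i,j} to the
    hardest margin-violating negative of i or j; the loss averages
    max(0, J_{i,j})^2 over all positive pairs, with the 1/2 factor."""
    D = torch.cdist(emb, emb)                       # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    losses = []
    for i in range(emb.size(0)):
        for j in range(i + 1, emb.size(0)):
            if not same[i, j]:
                continue                            # only positive pairs (i, j)
            neg_i = (alpha - D[i][~same[i]]).max()  # hardest negative of i
            neg_j = (alpha - D[j][~same[j]]).max()  # hardest negative of j
            J_ij = torch.max(neg_i, neg_j) + D[i, j]
            losses.append(torch.clamp(J_ij, min=0) ** 2)
    return torch.stack(losses).mean() / 2

emb = torch.randn(6, 8)
labels = torch.tensor([0, 0, 1, 1, 2, 2])
print(lifted_structured_loss(emb, labels))
```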

1.4 Multi-class N-pair loss

Sohn et al.[15] attribute the slow convergence of the contrastive and triplet losses to the fact that only one negative example is mined at each training step, so the interaction with the other negatives is lost. They therefore proposed the multi-class N-pair loss[15]: the distance between same-class samples should not be smaller than just one inter-class distance at a time, but smaller than n-1 inter-class distances simultaneously, so that intra-class pair similarity is significantly higher than all inter-class pair similarities. The loss borrows its form from neighborhood component analysis (NCA)[6], as shown in Eq. (4):

 $l(X, y) = \dfrac{-1}{|P|}\sum\limits_{(i,j) \in P} \log \dfrac{\exp \{ S_{i,j}\} }{\exp \{ S_{i,j}\} + \sum\limits_{k: y[k] \ne y[i]} \exp \{ S_{i,k}\} } + \dfrac{\lambda}{m} \sum\limits_i^m \| f(X_i)\|_2$ (4)
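A sketch of Eq. (4) under the standard N-pair batch construction (one anchor-positive pair per class, so the negatives of anchor $i$ are the positives of the other classes); the regularization weight is an illustrative value:

```python
import torch

def n_pair_loss(anchors, positives, lam=0.002):
    """Eq. (4): each anchor i is pulled toward its own positive and pushed
    away from the positives of all other classes, plus an L2 regularizer.
    Row i of `anchors` and `positives` share a class; classes are distinct."""
    S = anchors @ positives.t()                    # similarity matrix S_{i,k}
    pos = S.diag()                                 # S_{i,i}: anchor-positive scores
    # denominator of Eq. (4): exp(S_{i,i}) plus all negatives = the full row
    loss = (-pos + torch.logsumexp(S, dim=1)).mean()
    m = 2 * anchors.size(0)                        # total embeddings in the batch
    reg = lam / m * (anchors.norm(dim=1) + positives.norm(dim=1)).sum()
    return loss + reg

anchors, positives = torch.randn(5, 8), torch.randn(5, 8)
print(n_pair_loss(anchors, positives))
```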

1.5 Pairwise clustering loss

 Fig. 1 Two different cases when building triplets[41]

 $c^p = \dfrac{1}{N^p}\sum\limits_i^{N^p} f(X_i^p)$ (5)

 $L(W, X^p, X^n) = \sum\limits_i^{N^p} \dfrac{1}{2}\max \{ 0,\ \| f(X_i^p) - c^p\|^2 + \alpha - \| f(X_*^n) - c^p\|^2\}$ (6)
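A sketch of Eqs. (5) and (6); reading the $*$ subscript in Eq. (6) as the negative nearest to the center $c^p$ is our assumption:

```python
import torch

def pairwise_clustering_loss(pos_emb, neg_emb, alpha=1.0):
    """Eqs. (5)-(6): pull positives toward their mean c^p while keeping the
    nearest negative X_*^n at least a margin alpha farther from c^p."""
    c_p = pos_emb.mean(dim=0)                            # Eq. (5): cluster center
    d2_pos = ((pos_emb - c_p) ** 2).sum(dim=1)           # ||f(X_i^p) - c^p||^2
    d2_neg = ((neg_emb - c_p) ** 2).sum(dim=1).min()     # nearest negative to c^p
    return 0.5 * torch.clamp(d2_pos + alpha - d2_neg, min=0).sum()

pos_emb, neg_emb = torch.randn(4, 8), torch.randn(6, 8)
print(pairwise_clustering_loss(pos_emb, neg_emb))
```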

1.6 Center loss

 $L = L_S + L_C = -\sum\limits_{i = 1}^m \log \dfrac{\exp \{ W_{y_i}^{\rm T} X_i + b_{y_i}\} }{\sum\limits_{j = 1}^n \exp \{ W_j^{\rm T} X_i + b_j\} } + \dfrac{\lambda}{2} \sum\limits_i^m \| X_i - c_{y_i}\|_2^2$ (7)
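A sketch of Eq. (7); here the class centers are passed in as a learnable tensor, whereas the original method maintains them with a separate update rule, and batch means replace the sums for convenience:

```python
import torch
import torch.nn.functional as F

def center_loss(features, logits, labels, centers, lam=0.5):
    """Eq. (7): softmax cross-entropy L_S plus the center term
    L_C = (lambda/2) * sum_i ||X_i - c_{y_i}||_2^2."""
    l_s = F.cross_entropy(logits, labels)                   # L_S, batch-averaged
    l_c = ((features - centers[labels]) ** 2).sum(dim=1).mean()
    return l_s + lam / 2 * l_c

feat = torch.randn(4, 8)                  # deep features X_i
logits = torch.randn(4, 3)                # W^T X + b from the classifier layer
labels = torch.tensor([0, 2, 1, 0])
centers = torch.randn(3, 8)               # learnable class centers c_j
print(center_loss(feat, logits, labels, centers))
```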

1.7 Facility location loss

Oh et al.[45] argue that most existing methods[15, 18, 25, 46] focus only on local regions of the data (e.g., pairs, triplets, or n-tuples) and ignore global structural information, which degrades clustering and retrieval performance.

 $F(X, S;\theta ) = - \sum\limits_{i \in |X|} \min\limits_{j \in S} \| f(X_i;\theta ) - f(X_j;\theta )\|$ (8)

 $\tilde F(X, y^*;\theta ) = \sum\limits_{k \in |Y|} \max\limits_{j \in \{ i: y^*[i] = k\} } F(X_{\{ i: y^*[i] = k\} }, \{ j\} ;\theta )$ (9)

 $l(X, y^*) = \Big[\max\limits_{S \subset \{ 1, \cdots ,n\} } \{ F(X, S;\theta ) + \gamma \Delta (y, y^*)\} - \tilde F(X, y^*;\theta )\Big]_+$ (10)

 $y[i] = \mathop {\arg \min }\limits_{j \in S} \| f(X_i;\theta ) - f(X_j;\theta )\|$
 $\Delta (y, y^*) = 1 - {\rm NMI}(y, y^*), \quad {\rm NMI}(y_1, y_2) = \dfrac{{\rm MI}(y_1, y_2)}{\sqrt {H(y_1)H(y_2)} }$ (11)
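The structured loss in Eq. (10) scores candidate clusterings with the NMI-based $\Delta$ of Eq. (11); a minimal NumPy sketch of that metric, computed from the empirical joint distribution of labels:

```python
import numpy as np

def nmi(y1, y2):
    """Eq. (11): NMI(y1, y2) = MI(y1, y2) / sqrt(H(y1) * H(y2))."""
    n = len(y1)
    joint = np.zeros((int(y1.max()) + 1, int(y2.max()) + 1))
    for a, b in zip(y1, y2):
        joint[a, b] += 1
    joint /= n                                     # empirical joint distribution
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)  # marginals
    mask = joint > 0
    mi = (joint[mask] * np.log(joint[mask] / np.outer(p1, p2)[mask])).sum()
    h1 = -(p1[p1 > 0] * np.log(p1[p1 > 0])).sum()
    h2 = -(p2[p2 > 0] * np.log(p2[p2 > 0])).sum()
    return mi / np.sqrt(h1 * h2)

y_pred = np.array([0, 0, 1, 1, 2, 2])
y_true = np.array([1, 1, 0, 0, 2, 2])    # permuted labels: NMI is still 1.0
print(nmi(y_pred, y_true))
```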

1.8 Proxy loss

 $p(X) = \arg \min\limits_{p \in P} d^2(X, p)$ (12)

 $l^{\rm proxy}(X, P) = - \log \dfrac{\exp ( - d^2(X, p(X)))}{\sum\limits_{z \in P\backslash \{ p(X)\} } \exp ( - d^2(X, z))}$ (13)
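A sketch of Eqs. (12) and (13); with one proxy per class and proxies assigned by the ground-truth label (the static-assignment case), the nearest-proxy rule of Eq. (12) reduces to indexing by the label:

```python
import torch

def proxy_loss(x, proxies, y):
    """Eqs. (12)-(13): NCA-style loss over squared distances to proxies,
    with the assigned proxy p(X) excluded from the denominator."""
    d2 = torch.cdist(x, proxies) ** 2                  # d^2(X, p_j) for all proxies
    idx = torch.arange(x.size(0))
    pos = d2[idx, y]                                   # d^2(X, p(X))
    neg = d2.clone()
    neg[idx, y] = float('inf')                         # drop p(X): exp(-inf) = 0
    return (pos + torch.logsumexp(-neg, dim=1)).mean()

x = torch.randn(4, 8)
proxies = torch.randn(3, 8)                            # one learnable proxy per class
y = torch.tensor([0, 2, 1, 0])
print(proxy_loss(x, proxies, y))
```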

1.9 Other losses

2 Deep metric learning and softmax classification

2.1 Relationship between proxy loss and softmax

 $l(X, P) = - \log \dfrac{\exp ( - d^2(X, p(X)))}{\sum\limits_{z \in P} \exp ( - d^2(X, z))}$ (14)

 $q(p_i|X) = \dfrac{\exp ( - d^2(X, p_i))}{\sum\limits_{j = 1}^m \exp ( - d^2(X, p_j))}$ (15)

 $d^2(X, p_i) = \| X\|_2^2 + \| p_i\|_2^2 - 2X^{\rm T}p_i = 2s^2 - 2X^{\rm T}p_i$ (16)

 $q(p_i|X) = \dfrac{\exp (2X^{\rm T}p_i - 2s^2)}{\sum\limits_{j = 1}^m \exp (2X^{\rm T}p_j - 2s^2)} = \dfrac{\exp (2X^{\rm T}p_i)}{\sum\limits_{j = 1}^m \exp (2X^{\rm T}p_j)}$ (17)
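A quick numerical check of the equivalence in Eqs. (15)-(17): with embeddings and proxies normalized to a common norm $s$, the constant $2s^2$ in Eq. (16) cancels inside the softmax:

```python
import torch
import torch.nn.functional as F

s = 1.0
x = F.normalize(torch.randn(8), dim=0) * s          # embedding with norm s
P = F.normalize(torch.randn(5, 8), dim=1) * s       # 5 proxies, each with norm s

q_dist = torch.softmax(-torch.cdist(x[None], P)[0] ** 2, dim=0)   # Eq. (15)
q_inner = torch.softmax(2 * (P @ x), dim=0)                       # Eq. (17)
print(torch.allclose(q_dist, q_inner, atol=1e-6))                 # True
```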

2.2 Temperature scaling

The softmax loss separates the features of different classes well, but is not sufficiently discriminative. Several variants[44, 63-71] have therefore been proposed to strengthen its discriminative power. As early as 2015, to address model compression[72] and knowledge transfer, Hinton introduced the concept of a temperature value[28]. He argued that the relations between classes are not all-or-nothing (intuitively, misclassifying a cat as a dog should incur a smaller loss than misclassifying a cat as a car), so blunt one-hot encoding throws away the extra information carried by inter-class and intra-class relations. The authors therefore proposed a temperature-scaled softmax that softens the differences between classes. The loss function is:

 $L_{\rm soft} = - \sum\limits_{i = 1}^K p_i\log q_i = - \sum\limits_{i = 1}^K p_i\log \dfrac{{\rm e}^{z_i/T}}{\sum\limits_j {\rm e}^{z_j/T}}$ (18)
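A small demonstration of the temperature $T$ in Eq. (18): raising $T$ softens the output distribution, while lowering it sharpens the distribution toward one-hot:

```python
import torch

z = torch.tensor([3.0, 1.0, 0.2])                  # logits z_i
for T in (0.5, 1.0, 4.0):
    print(T, torch.softmax(z / T, dim=0))          # Eq. (18): higher T, softer q
```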

 $L_{\rm cls}(X, p_y, p_Z, \sigma ) = - \log \dfrac{\exp (X^{\rm T}p_y/\sigma )}{\sum\limits_{p_z \in p_Z \cup \{ p_y\} } \exp (X^{\rm T}p_z/\sigma )}$ (19)

Zhang et al.[29] analyzed how the temperature affects the feature distribution from the perspective of the gradients of the sample features. For convenience, they set $\alpha = 1/T$, i.e.:

 $q(m|X) = \dfrac{{\rm e}^{z_m/T}}{\sum\limits_{j = 1}^M {\rm e}^{z_j/T}} = \dfrac{{\rm e}^{\alpha z_m}}{\sum\limits_{j = 1}^M {\rm e}^{\alpha z_j}}$ (20)

 $L(X, \alpha ) = - \sum\limits_{m = 1}^M \log \big(q(m|X, \alpha )\big)\,p(m|X)$ (21)

 $\dfrac{\partial L}{\partial z_m} = \alpha \big(q(m|X) - p(m|X)\big)$ (22)

 $\dfrac{\partial L}{\partial f} = \sum\limits_{m = 1}^M \dfrac{\partial L}{\partial z_m}\dfrac{\partial z_m}{\partial f} = \alpha \sum\limits_{m = 1}^M \big(q(m|X) - p(m|X)\big)\,w_m$ (23)
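An autograd check of the gradient in Eq. (22), assuming $p$ is a fixed target distribution: the gradient scale grows linearly with $\alpha$, which is how the temperature shapes the feature distribution.

```python
import torch

alpha = 4.0
z = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
p = torch.tensor([1.0, 0.0, 0.0])                  # fixed target distribution
q = torch.softmax(alpha * z, dim=0)                # q(m|X) from Eq. (20)
L = -(p * torch.log(q)).sum()                      # Eq. (21)
L.backward()
print(torch.allclose(z.grad, alpha * (q.detach() - p)))   # True, per Eq. (22)
```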

3 Conclusion