Comparing the two models in Sec. 3.4, we believe the performance gain is largely due to the superior regularization effects of the SVM loss function, rather than an advantage from better parameter optimization.
2. The model

2.1. Softmax

For classification problems using deep learning techniques, it is standard to use the softmax or 1-of-K encoding at the top. For example, given 10 possible classes, the softmax layer has 10 nodes denoted by p_i, where i = 1, ..., 10. p_i specifies a discrete probability distribution, therefore \sum_{i=1}^{10} p_i = 1.

Let h be the activation of the penultimate layer nodes, and let W be the weight matrix connecting the penultimate layer to the softmax layer. The total input into softmax node i, denoted a_i, is

    a_i = \sum_k h_k W_{ki},    (1)

and the probabilities are

    p_i = \exp(a_i) / \sum_{j=1}^{10} \exp(a_j).    (2)

The predicted class \hat{i} would be

    \hat{i} = \arg\max_i p_i = \arg\max_i a_i.    (3)
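To make Eqs. 1-3 concrete, here is a minimal numpy sketch of the softmax forward pass and prediction; the names h, W, and softmax_predict are ours for illustration and are not from any released code.

```python
import numpy as np

def softmax_predict(h, W):
    """Softmax probabilities (Eq. 2) and predicted class (Eq. 3).

    h : (K_penult,) activations of the penultimate layer
    W : (K_penult, 10) weights from the penultimate layer to the 10 softmax nodes
    """
    a = h @ W                          # total input a_i = sum_k h_k W_ki   (Eq. 1)
    a = a - a.max()                    # shift for numerical stability; does not change p_i
    p = np.exp(a) / np.exp(a).sum()    # discrete distribution summing to 1 (Eq. 2)
    return p, int(np.argmax(a))        # argmax of p_i equals argmax of a_i (Eq. 3)
```

Note that the argmax can be taken over the activations a directly, since the softmax is monotonic in each a_i.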
2.2. Support Vector Machines

Linear support vector machines (SVMs) were originally formulated for binary classification. Given training data and its corresponding labels (x_n, t_n), n = 1, ..., N, x_n \in \mathbb{R}^D, t_n \in \{-1, +1\}, SVM learning consists of the following constrained optimization:

    \min_{w, \xi_n} \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n    (4)
    s.t.  w^T x_n t_n \ge 1 - \xi_n  \quad \forall n
          \xi_n \ge 0  \quad \forall n

The \xi_n are slack variables which penalize data points that violate the margin requirements. Note that we can include the bias by augmenting all data vectors x_n with a scalar value of 1. The corresponding unconstrained optimization problem is the following:

    \min_{w} \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n t_n, 0)    (5)

The objective of Eq. 5 is known as the primal form problem of L1-SVM, with the standard hinge loss. Since the hinge loss of L1-SVM is not differentiable, a popular variation known as the L2-SVM instead minimizes the squared hinge loss:

    \min_{w} \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n t_n, 0)^2    (6)

L2-SVM is differentiable and imposes a bigger (quadratic vs. linear) loss on points which violate the margin. To predict the class label of a test datum x:

    \arg\max_t (w^T x) t    (7)

For kernel SVMs, optimization must be performed in the dual. However, scalability is a problem with kernel SVMs, and in this paper we will only be using linear SVMs with standard deep learning models.
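For illustration, the unconstrained objectives of Eqs. 5 and 6 can be written directly in numpy. The sketch below uses our own names (l1_svm_objective, l2_svm_objective) and assumes the bias has been absorbed into w by appending a constant 1 to each x_n, as described above.

```python
import numpy as np

def l1_svm_objective(w, X, t, C):
    """Primal L1-SVM objective with the standard hinge loss (Eq. 5).

    w : (D,) weight vector (bias folded in via a constant-1 feature)
    X : (N, D) training data, one row per example
    t : (N,) labels in {-1, +1}
    """
    margins = X @ w * t                        # w^T x_n t_n for every example
    hinge = np.maximum(1.0 - margins, 0.0)     # hinge loss per example
    return 0.5 * w @ w + C * hinge.sum()

def l2_svm_objective(w, X, t, C):
    """Primal L2-SVM objective with the squared hinge loss (Eq. 6)."""
    margins = X @ w * t
    hinge = np.maximum(1.0 - margins, 0.0)
    return 0.5 * w @ w + C * (hinge ** 2).sum()
```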
2.3. Multiclass SVMs

The simplest way to extend SVMs to multiclass problems is the so-called one-vs-rest approach (Vapnik, 1995). For a K-class problem, K linear SVMs are trained independently, where the data from the other classes form the negative cases. Hsu & Lin (2002) discuss other alternative multiclass SVM approaches, but we leave those to future work. Denoting the output of the k-th SVM as

    a_k(x) = w^T x,    (8)

the predicted class is

    \arg\max_k a_k(x).    (9)

Note that prediction using SVMs is exactly the same as using the softmax of Eq. 3. The only difference between softmax and multiclass SVMs is in their objectives, parametrized by all of the weight matrices W. The softmax layer minimizes cross-entropy, or equivalently maximizes the log-likelihood, while SVMs simply try to find the maximum margin between data points of different classes.
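A small sketch of the one-vs-rest prediction rule of Eqs. 8-9 follows, assuming the K per-class weight vectors are stacked as rows of a matrix W (our own convention, for illustration only).

```python
import numpy as np

def one_vs_rest_predict(x, W):
    """Predict the class of x with K independently trained linear SVMs.

    x : (D,) test example (bias folded in)
    W : (K, D) row k holds the weight vector of the k-th one-vs-rest SVM
    Returns the index k maximizing a_k(x) = w_k^T x  (Eqs. 8-9).
    """
    scores = W @ x            # a_k(x) for k = 1..K
    return int(np.argmax(scores))
```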
2.4. Deep Learning with Support Vector Machines

Most deep learning methods for classification using fully connected layers and convolutional layers have used the softmax layer objective to learn the lower-level parameters. There are exceptions, notably the papers by Zhong & Ghosh (2000), Collobert & Bengio (2004), and Nagi et al. (2012), supervised embedding with nonlinear NCA (Salakhutdinov & Hinton, 2007), and semi-supervised deep embedding (Weston et al., 2008).
To learn the lower-layer weights by backpropagation, we need the gradient of the SVM objective with respect to the activation of the penultimate layer. Let l(w) denote the objective of Eq. 5 with the input x_n replaced by the penultimate activation h_n. Then

    \partial l(w) / \partial h_n = - C t_n w \, I\{1 > w^T h_n t_n\},    (10)

where I\{\cdot\} is the indicator function. Likewise, for the L2-SVM of Eq. 6 we have

    \partial l(w) / \partial h_n = - 2 C t_n w \max(1 - w^T h_n t_n, 0).    (11)
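For a whole minibatch, the backpropagated signals of Eqs. 10 and 11 can be computed in a few lines. The following is a sketch under our own naming conventions, not the authors' implementation.

```python
import numpy as np

def svm_grad_wrt_activations(H, t, w, C, squared=False):
    """Gradient of the SVM objective w.r.t. the penultimate activations h_n.

    H : (N, D) penultimate-layer activations, one row per example
    t : (N,) labels in {-1, +1}
    w : (D,) top-layer SVM weights
    C : penalty parameter
    squared=False -> L1-SVM subgradient (Eq. 10); True -> L2-SVM gradient (Eq. 11).
    """
    margins = H @ w * t                                          # w^T h_n t_n
    if squared:
        coeff = -2.0 * C * t * np.maximum(1.0 - margins, 0.0)    # L2-SVM
    else:
        coeff = -C * t * (margins < 1.0)                         # indicator I{1 > w^T h_n t_n}
    return coeff[:, None] * w[None, :]                           # (N, D): one gradient row per example
```

The same routine covers both losses; the L1-SVM case is a subgradient, since the hinge is non-differentiable exactly at the margin.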
                              Softmax    DLSVM L2
    Training cross validation  67.6%      68.9%
    Public leaderboard         69.3%      69.4%
    Private leaderboard        70.1%      71.2%

                 ConvNet+Softmax    ConvNet+SVM
    Test error        14.0%            11.9%

Table 2. Comparisons of the models in terms of % error on the test set.

In the literature, the state-of-the-art result (at the time of writing) is around 9.5% (Snoek et al., 2012). However, that model is different as it includes contrast normalization layers and uses Bayesian optimization to tune its hyperparameters.

3.4. Regularization or Optimization

To see whether the gain of DLSVM is due to the superiority of the objective function or to the ability to optimize better, we looked at each of the two final models' loss under its own objective function as well as under the other objective. The results are in Table 3.
                          ConvNet+Softmax    ConvNet+SVM
    Test error                 14.0%             11.9%
    Avg. cross entropy         0.072             0.353
    Hinge loss squared         213.2             0.313

Table 3. Training objective including the weight costs.
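Concretely, the comparison in Table 3 amounts to scoring each trained model's outputs under both criteria. A rough sketch of that bookkeeping is given below; the helper names, and the assumption that one has access to the softmax probabilities, the penultimate activations, and one-vs-rest targets, are ours.

```python
import numpy as np

def avg_cross_entropy(P, y):
    """Average cross-entropy of predicted distributions P (N, K) against integer labels y (N,)."""
    return float(-np.mean(np.log(P[np.arange(len(y)), y] + 1e-12)))

def total_squared_hinge(H, T, W, C):
    """Squared hinge loss (data term of Eq. 6, summed over classes) for one-vs-rest weights.

    H : (N, D) penultimate activations; T : (N, K) one-vs-rest targets in {-1, +1};
    W : (K, D) per-class weight vectors.
    """
    margins = (H @ W.T) * T
    return float(C * np.sum(np.maximum(1.0 - margins, 0.0) ** 2))
```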
It is interesting to note here that lower cross entropy actually led to a higher error, as seen in the middle row. In addition, we also initialized a ConvNet+Softmax model with the weights of the DLSVM that had 11.9% error. As further training was performed, the network's error rate gradually increased towards 14%.

This gives limited evidence that the gain of DLSVM is largely due to a better objective function.

4. Conclusions

In conclusion, we have shown that DLSVM works better than softmax on 2 standard datasets and a recent dataset. Switching from softmax to SVMs is incredibly simple and appears to be useful for classification tasks. Further research is needed to explore other multiclass SVM formulations and to better understand where and how much of the gain is obtained.

References

Boser, Bernhard E., Guyon, Isabelle M., and Vapnik, Vladimir N. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144-152. ACM Press, 1992.

Ciresan, D., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. High-performance neural networks for visual object classification. CoRR, abs/1102.0183, 2011.

Coates, Adam, Ng, Andrew Y., and Lee, Honglak. An analysis of single-layer networks in unsupervised feature learning. Journal of Machine Learning Research - Proceedings Track, 15:215-223, 2011.

Collobert, R. and Bengio, S. A gentle Hessian for efficient gradient descent. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2004.

Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. Phone recognition with the mean-covariance restricted Boltzmann machine. In NIPS 23, 2010.

Hadsell, Raia, Chopra, Sumit, and LeCun, Yann. Dimensionality reduction by learning an invariant mapping. In Proc. Computer Vision and Pattern Recognition Conference (CVPR'06). IEEE Press, 2006.

Hinton, Geoffrey E. Connectionist learning procedures. Artif. Intell., 40(1-3):185-234, 1989.

Hsu, Chih-Wei and Lin, Chih-Jen. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415-425, 2002.

Huang, F. J. and LeCun, Y. Large-scale learning with SVM and convolutional for generic object categorization. In CVPR, pp. I: 284-291, 2006. URL http://dx.doi.org/10.1109/CVPR.2006.164.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In Proc. Intl. Conf. on Computer Vision (ICCV'09). IEEE, 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1106-1114, 2012.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Intl. Conf. on Machine Learning, pp. 609-616, 2009.

Lee, Yuh-Jye and Mangasarian, O. L. SSVM: A smooth support vector machine for classification. Comp. Opt. and Appl., 20(1):5-22, 2001.
Quoc, L., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and Ng, A. Tiled convolutional neural networks. In NIPS 23, 2010.