Discriminative Unsupervised Feature Learning with Convolutional Neural Networks

Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller and Thomas Brox

Department of Computer Science

University of Freiburg

79110 Freiburg im Breisgau, Germany

{dosovits,springj,riedmiller,brox}@cs.uni-freiburg.de

Abstract

Current methods for training convolutional neural networks depend on large amounts of labeled samples for supervised training. In this paper we present an approach for training a convolutional neural network using only unlabeled data. We train the network to discriminate between a set of surrogate classes. Each surrogate class is formed by applying a variety of transformations to a randomly sampled 'seed' image patch. We find that this simple feature learning algorithm is surprisingly successful when applied to visual object recognition. The feature representation learned by our algorithm achieves classification results matching or outperforming the current state-of-the-art for unsupervised learning on several popular datasets (STL-10, CIFAR-10, Caltech-101).

1 Introduction

Convolutional neural networks (CNNs) trained via backpropagation were recently shown to perform well on image classification tasks with millions of training images and thousands of categories [1, 2]. The feature representation learned by these networks achieves state-of-the-art performance not only on the classification task for which the network was trained, but also on various other visual recognition tasks, for example: classification on Caltech-101 [2, 3], Caltech-256 [2] and the Caltech-UCSD birds dataset [3]; scene recognition on the SUN-397 database [3]; detection on the PASCAL VOC dataset [4]. This capability to generalize to new datasets makes supervised CNN training an attractive approach for generic visual feature learning.

The downside of supervised training is the need for expensive labeling, as the amount of required labeled samples grows quickly the larger the model gets. The large performance increase achieved by methods based on the work of Krizhevsky et al. [1] was, for example, only possible due to massive efforts on manually annotating millions of images. For this reason, unsupervised learning, although currently underperforming, remains an appealing paradigm, since it can make use of raw unlabeled images and videos. Furthermore, on vision tasks outside classification it is not even certain whether training based on object class labels is advantageous. For example, unsupervised feature learning is known to be beneficial for image restoration [5], and recent results show that it outperforms supervised feature learning also on descriptor matching [6].

In this work we combine the power of a discriminative objective with the major advantage of unsupervised feature learning: cheap data acquisition. We introduce a novel training procedure for convolutional neural networks that does not require any labeled data. Instead, it relies on an automatically generated surrogate task. The task is created by taking the idea of data augmentation, which is commonly used in supervised learning, to the extreme. Starting with trivial surrogate classes consisting of one random image patch each, we augment the data by applying a random set of transformations to each patch. Then we train a CNN to classify these surrogate classes. We refer to this method as exemplar training of convolutional neural networks (Exemplar-CNN).

The feature representation learned by Exemplar-CNN is, by construction, discriminative and invariant to typical transformations. We confirm this both theoretically and empirically, showing that this approach matches or outperforms all previous unsupervised feature learning methods on the standard image classification benchmarks STL-10, CIFAR-10, and Caltech-101.

1.1 Related Work

Our approach is related to a large body of work on unsupervised learning of invariant features and training of convolutional neural networks.

Convolutional training is commonly used in both supervised and unsupervised methods to utilize the invariance of image statistics to translations (e.g. LeCun et al. [7], Kavukcuoglu et al. [8], Krizhevsky et al. [1]). Similar to our approach, the current surge of successful methods employing convolutional neural networks for object recognition often relies on data augmentation to generate additional training samples for their classification objective (e.g. Krizhevsky et al. [1], Zeiler and Fergus [2]). While we share the architecture (a convolutional neural network) with these approaches, our method does not rely on any labeled training data.

In unsupervised learning, several studies on learning invariant representations exist. Denoising autoencoders [9], for example, learn features that are robust to noise by trying to reconstruct data from randomly perturbed input samples. Zou et al. [10] learn invariant features from video by enforcing a temporal slowness constraint on the feature representation learned by a linear autoencoder. Sohn and Lee [11] and Hui [12] learn features invariant to local image transformations. In contrast to our discriminative approach, all these methods rely on directly modeling the input distribution and are typically hard to use for jointly training multiple layers of a CNN.

The idea of learning features that are invariant to transformations has also been explored for supervised training of neural networks. The research most similar to ours is early work on tangent propagation [13] (and the related double backpropagation [14]), which aims to learn invariance to small predefined transformations in a neural network by directly penalizing the derivative of the output with respect to the magnitude of the transformations. In contrast, our algorithm does not regularize the derivative explicitly, and is thus less sensitive to the magnitude of the applied transformation. This work is also loosely related to the use of unlabeled data for regularizing supervised algorithms, for example self-training [15] or entropy regularization [16]. In contrast to these semi-supervised methods, Exemplar-CNN training does not require any labeled data.

Finally, the idea of creating an auxiliary task in order to learn a good data representation was used by Ahmed et al. [17] and Collobert et al. [18].

2 Creating Surrogate Training Data

The input to the training procedure is a set of unlabeled images, which come from roughly the same distribution as the images to which we later aim to apply the learned features. We randomly sample N ∈ [50, 32000] patches of size 32×32 pixels from different images at varying positions and scales, forming the initial training set X = {x_1, ..., x_N}. We are interested in patches containing objects or parts of objects, hence we sample only from regions containing considerable gradients.
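As an illustration, the following minimal sketch samples such 'seed' patches at random positions and scales, keeping only patches with considerable gradient energy. It assumes NumPy/SciPy and float RGB images in [0, 1]; the threshold `min_grad` is a hypothetical parameter chosen here for illustration, not a value specified in the paper.

```python
import numpy as np
from scipy import ndimage

def sample_seed_patches(images, n_patches, patch_size=32, min_grad=0.1, rng=None):
    """Sample 'seed' patches at random positions/scales from gradient-rich regions.

    A sketch of the sampling described in Section 2; `min_grad` is an assumed
    threshold on the mean gradient magnitude, used only for illustration.
    """
    rng = rng or np.random.default_rng()
    patches = []
    while len(patches) < n_patches:
        img = images[rng.integers(len(images))]
        # random scale: crop a square of random side length, then resize to patch_size
        side = rng.integers(patch_size, min(img.shape[:2]) + 1)
        y = rng.integers(0, img.shape[0] - side + 1)
        x = rng.integers(0, img.shape[1] - side + 1)
        crop = img[y:y + side, x:x + side]
        patch = ndimage.zoom(crop, (patch_size / side, patch_size / side, 1), order=1)
        # keep only patches containing considerable gradients
        gray = patch.mean(axis=2)
        gy, gx = np.gradient(gray)
        if np.sqrt(gx ** 2 + gy ** 2).mean() > min_grad:
            patches.append(patch.astype(np.float32))
    return np.stack(patches)
```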

We define a family of transformations {T_α | α ∈ A} parameterized by vectors α ∈ A, where A is the set of all possible parameter vectors. Each transformation T_α is a composition of elementary transformations from the following list:

• translation: vertical or horizontal translation by a distance within 0.2 of the patch size;

• scaling: multiplication of the patch scale by a factor between 0.7 and 1.4;

• rotation: rotation of the image by an angle up to 20 degrees;

• contrast 1: multiply the projection of each patch pixel onto the principal components of the set of all pixels by a factor between 0.5 and 2 (factors are independent for each principal component and the same for all pixels within a patch);

• contrast 2: raise saturation and value (S and V components of the HSV color representation) of all pixels to a power between 0.25 and 4 (same for all pixels within a patch), multiply these values by a factor between 0.7 and 1.4, add to them a value between −0.1 and 0.1;

• color: add a value between −0.1 and 0.1 to the hue (H component of the HSV color representation) of all pixels in the patch (the same value is used for all pixels within a patch).

Figure 1: Exemplary patches sampled from the STL unlabeled dataset which are later augmented by various transformations to obtain surrogate data for the CNN training.

Figure 2: Several random transformations applied to one of the patches extracted from the STL unlabeled dataset. The original ('seed') patch is in the top left corner.

All numerical parameters of the elementary transformations, when concatenated together, form a single parameter vector α. For each initial patch $x_i \in X$ we sample $K \in [1, 300]$ random parameter vectors $\{\alpha_i^1, \dots, \alpha_i^K\}$ and apply the corresponding transformations $\mathcal{T}_i = \{T_{\alpha_i^1}, \dots, T_{\alpha_i^K}\}$ to the patch $x_i$. This yields the set of its transformed versions $S_{x_i} = \mathcal{T}_i x_i = \{T x_i \mid T \in \mathcal{T}_i\}$. Afterwards we subtract the mean of each pixel over the whole resulting dataset. We do not apply any other preprocessing. Exemplary patches sampled from the STL-10 unlabeled dataset are shown in Fig. 1. Examples of transformed versions of one patch are shown in Fig. 2.
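For concreteness, here is a minimal sketch of how one such random composition of transformations could be sampled and applied (SciPy and Matplotlib's HSV conversion; parameter ranges follow the list above, but the composition order, border handling, and helper names are our own illustration, not the authors' implementation; contrast 1, the PCA-based variant, is omitted but follows the same pattern).

```python
import numpy as np
from scipy import ndimage
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def random_transform(patch, rng):
    """Apply one random composition of the elementary transformations listed above.

    `patch` is an HxWx3 float RGB array in [0, 1]. Border handling ('reflect')
    and the order of elementary transformations are illustrative assumptions.
    """
    size = patch.shape[0]
    # translation: up to 0.2 of the patch size in each direction
    dy, dx = rng.uniform(-0.2, 0.2, size=2) * size
    out = ndimage.shift(patch, (dy, dx, 0), mode='reflect')
    # scaling: factor between 0.7 and 1.4, then crop/pad back to the original size
    factor = rng.uniform(0.7, 1.4)
    out = _center_crop_or_pad(ndimage.zoom(out, (factor, factor, 1), order=1), size)
    # rotation: up to 20 degrees
    out = ndimage.rotate(out, rng.uniform(-20, 20), reshape=False, mode='reflect')
    # contrast 2 and color: power/scale/offset on S and V, additive shift on H
    hsv = rgb_to_hsv(np.clip(out, 0, 1))
    hsv[..., 1:] = hsv[..., 1:] ** rng.uniform(0.25, 4)          # same power for all pixels
    hsv[..., 1:] = hsv[..., 1:] * rng.uniform(0.7, 1.4) + rng.uniform(-0.1, 0.1)
    hsv[..., 0] = (hsv[..., 0] + rng.uniform(-0.1, 0.1)) % 1.0   # hue shift
    return hsv_to_rgb(np.clip(hsv, 0, 1))

def _center_crop_or_pad(img, size):
    # helper (illustrative): bring a rescaled patch back to size x size
    h, w = img.shape[:2]
    if h >= size:
        y0, x0 = (h - size) // 2, (w - size) // 2
        return img[y0:y0 + size, x0:x0 + size]
    return np.pad(img, ((0, size - h), (0, size - w), (0, 0)), mode='reflect')
```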

3 Learning Algorithm

Given the sets of transformed image patches, we declare each of these sets to be a class by assigning label $i$ to the class $S_{x_i}$. We next train a CNN to discriminate between these surrogate classes. Formally, we minimize the following loss function:

$$L(X) = \sum_{x_i \in X} \sum_{T \in \mathcal{T}_i} l(i, T x_i), \qquad (1)$$

where $l(i, T x_i)$ is the loss on the transformed sample $T x_i$ with (surrogate) true label $i$. We use a CNN with a softmax output layer and optimize the multinomial negative log likelihood of the network output, hence in our case

$$l(i, T x_i) = M(\mathbf{e}_i, f(T x_i)), \qquad M(\mathbf{y}, \mathbf{f}) = -\langle \mathbf{y}, \log \mathbf{f} \rangle = -\sum_{k} y_k \log f_k, \qquad (2)$$

where $f(\cdot)$ denotes the function computing the values of the output layer of the CNN given the input data, and $\mathbf{e}_i$ is the $i$-th standard basis vector. We note that in the limit of an infinite number of transformations per surrogate class, the objective function (1) takes the form

$$\widehat{L}(X) = \sum_{x_i \in X} \mathbb{E}_{\alpha}\bigl[ l(i, T_\alpha x_i) \bigr], \qquad (3)$$

which we shall analyze in the next section.

Intuitively, the classification problem described above serves to ensure that different input samples can be distinguished. At the same time, it enforces invariance to the specified transformations. In the following sections we provide a foundation for this intuition. We first present a formal analysis of the objective, separating it into a well defined classification problem and a regularizer that enforces invariance (resembling the analysis in Wager et al. [19]). We then discuss the derived properties of this classification problem and compare it to common practices for unsupervised feature learning.
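To make the objective concrete, here is a minimal training-loop sketch under this interpretation: each 'seed' patch index serves as its class label, and the network is trained with the usual multinomial negative log likelihood of Eq. (2). The sketch uses PyTorch purely for illustration (the original implementation was based on Caffe [25]); the optimizer and all hyperparameters shown are our own assumptions.

```python
import torch
import torch.nn as nn

def train_exemplar_cnn(net, surrogate_loader, epochs=100, lr=0.01):
    """Minimize Eq. (1): cross-entropy between network outputs and surrogate labels.

    `surrogate_loader` is assumed to yield (transformed_patch, seed_index) batches,
    i.e. every transformed version T x_i carries the label i of its seed patch.
    """
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)   # assumed optimizer
    loss_fn = nn.CrossEntropyLoss()  # multinomial negative log likelihood, Eq. (2)
    for epoch in range(epochs):
        for patches, seed_idx in surrogate_loader:
            opt.zero_grad()
            loss = loss_fn(net(patches), seed_idx)  # sum of l(i, T x_i) over the batch
            loss.backward()
            opt.step()
    return net
```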

3.1 Formal Analysis

We denote by $\alpha \in \mathcal{A}$ the random vector of transformation parameters, by $g(x)$ the vector of activations of the second-to-last layer of the network when presented the input patch $x$, by $W$ the matrix of the weights of the last network layer, by $h(x) = W g(x)$ the last layer activations before applying the softmax, and by $f(x) = \mathrm{softmax}(h(x))$ the output of the network. By plugging in the definition of the softmax activation function

$$\mathrm{softmax}(\mathbf{z}) = \exp(\mathbf{z}) / \|\exp(\mathbf{z})\|_1 \qquad (4)$$

the objective function (3) with loss (2) takes the form

$$\sum_{x_i \in X} \mathbb{E}_{\alpha}\Bigl[ -\langle \mathbf{e}_i, h(T_\alpha x_i) \rangle + \log \|\exp(h(T_\alpha x_i))\|_1 \Bigr]. \qquad (5)$$

With $\widehat{g}_i = \mathbb{E}_{\alpha}[g(T_\alpha x_i)]$ being the average feature representation of transformed versions of the image patch $x_i$ we can rewrite Eq. (5) as

$$\sum_{x_i \in X} \Bigl[ -\langle \mathbf{e}_i, W \widehat{g}_i \rangle + \log \|\exp(W \widehat{g}_i)\|_1 \Bigr] + \sum_{x_i \in X} \Bigl[ \mathbb{E}_{\alpha}\bigl[\log \|\exp(h(T_\alpha x_i))\|_1\bigr] - \log \|\exp(W \widehat{g}_i)\|_1 \Bigr]. \qquad (6)$$

The first sum is the objective function of a multinomial logistic regression problem with input-target pairs $(\widehat{g}_i, \mathbf{e}_i)$. This objective falls back to the transformation-free instance classification problem $L(X) = \sum_{x_i \in X} l(i, x_i)$ if $g(x_i) = \mathbb{E}_{\alpha}[g(T_\alpha x_i)]$. In general, this equality does not hold and thus the first sum enforces correct classification of the average representation $\mathbb{E}_{\alpha}[g(T_\alpha x_i)]$ for a given input sample. For a truly invariant representation, however, the equality is achieved. Similarly, if we suppose that $T_\alpha x = x$ for $\alpha = 0$, that for small values of $\alpha$ the feature representation $g(T_\alpha x_i)$ is approximately linear with respect to $\alpha$, and that the random variable $\alpha$ is centered, i.e. $\mathbb{E}_{\alpha}[\alpha] = 0$, then $\widehat{g}_i = \mathbb{E}_{\alpha}[g(T_\alpha x_i)] \approx \mathbb{E}_{\alpha}\bigl[g(x_i) + \nabla_\alpha (g(T_\alpha x_i))\big|_{\alpha=0}\, \alpha\bigr] = g(x_i)$.

The second sum in Eq. (6) can be seen as a regularizer enforcing all $h(T_\alpha x_i)$ to be close to their average value, i.e., the feature representation is sought to be approximately invariant to the transformations $T_\alpha$. To show this we use the convexity of the function $\log\|\exp(\cdot)\|_1$ and Jensen's inequality, which yields (proof in the supplementary material)

$$\mathbb{E}_{\alpha}\bigl[\log \|\exp(h(T_\alpha x_i))\|_1\bigr] - \log \|\exp(W \widehat{g}_i)\|_1 \geq 0. \qquad (7)$$

If the feature representation is perfectly invariant, then $h(T_\alpha x_i) = W \widehat{g}_i$ and inequality (7) turns to equality, meaning that the regularizer reaches its global minimum.
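For completeness, a brief sketch of the Jensen step behind (7) (the full proof is in the supplementary material): the log-sum-exp function is convex, so the expectation can be pulled inside at the cost of an inequality, and linearity of the last layer identifies the resulting argument with $W \widehat{g}_i$.

```latex
% Sketch: \varphi(z) := \log\|\exp(z)\|_1 = \log\sum_k \exp(z_k) is convex in z.
\begin{aligned}
\mathbb{E}_{\alpha}\bigl[\log \|\exp(h(T_\alpha x_i))\|_1\bigr]
  &\ge \log \bigl\|\exp\bigl(\mathbb{E}_{\alpha}[h(T_\alpha x_i)]\bigr)\bigr\|_1
     && \text{(Jensen, convexity of log-sum-exp)}\\
  &= \log \|\exp(W \widehat{g}_i)\|_1
     && \text{since } \mathbb{E}_{\alpha}[h(T_\alpha x_i)]
        = W\,\mathbb{E}_{\alpha}[g(T_\alpha x_i)] = W \widehat{g}_i .
\end{aligned}
```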

3.2 Conceptual Comparison to Previous Unsupervised Learning Methods

Suppose we want to learn, without supervision, a feature representation useful for a recognition task, for example classification. The mapping from input images x to a feature representation g(x) should then satisfy two requirements: (1) there must be at least one feature that is similar for images of the same category y (invariance); (2) there must be at least one feature that is sufficiently different for images of different categories (ability to discriminate).

Most unsupervised feature learning methods aim to learn such a representation by modeling the input distribution p(x). This is based on the assumption that a good model of p(x) contains information about the category distribution p(y|x). That is, if a representation is learned from which a given sample can be reconstructed perfectly, then the representation is expected to also encode information about the category of the sample (ability to discriminate). Additionally, the learned representation should be invariant to variations in the samples that are irrelevant for the classification task, i.e., it should adhere to the manifold hypothesis (see e.g. Rifai et al. [20] for a recent discussion). Invariance is classically achieved by regularization of the latent representation, e.g., by enforcing sparsity [8] or robustness to noise [9].

In contrast, the discriminative objective in Eq. (1) does not directly model the input distribution p(x) but learns a representation that discriminates between input samples. The representation is not required to reconstruct the input, which is unnecessary in a recognition or matching task. This leaves more degrees of freedom to model the desired variability of a sample. As shown in our analysis (see Eq. (7)), we achieve partial invariance to transformations applied during surrogate data creation by forcing the representation g(T_α x_i) of the transformed image patch to be predictive of the surrogate label assigned to the original image patch x_i.

It should be noted that this approach assumes that the transformations T_α do not change the identity of the image content. If we, for example, use a color transformation we will force the network to be invariant to this change and cannot expect the extracted features to perform well in a task relying on color information (such as differentiating black panthers from pumas)¹.

¹Such cases could be covered either by careful selection of applied transformations or by combining features from multiple networks trained with different sets of transformations and letting the final classifier choose which features to use.

4 Experiments

To compare our discriminative approach to previous unsupervised feature learning methods, we report classification results on the STL-10 [21], CIFAR-10 [22] and Caltech-101 [23] datasets. Moreover, we assess the influence of the augmentation parameters on the classification performance and study the invariance properties of the network.

4.1 Experimental Setup

The datasets we test on differ in the number of classes (10 for CIFAR and STL, 101 for Caltech) and the number of samples per class. STL is especially well suited for unsupervised learning as it contains a large set of 100,000 unlabeled samples. In all experiments (except for the dataset transfer experiment in the supplementary material) we extracted surrogate training data from the unlabeled subset of STL-10. When testing on CIFAR-10, we resized the images from 32×32 pixels to 64×64 pixels so that the scale of the depicted objects roughly matches the two other datasets.

We worked with two network architectures. A "small" network was used to evaluate the influence of different components of the augmentation procedure on classification performance. It consists of two convolutional layers with 64 filters each, followed by a fully connected layer with 128 neurons. This last layer is succeeded by a softmax layer, which serves as the network output. A "large" network, consisting of three convolutional layers with 64, 128 and 256 filters respectively followed by a fully connected layer with 512 neurons, was trained to compare our method to the state of the art. In both models all convolutional filters are connected to a 5×5 region of their input. 2×2 max-pooling was performed after the first and second convolutional layers. Dropout [24] was applied to the fully connected layers. We trained the networks using an implementation based on Caffe [25]. Details on the training, the hyperparameter settings, and an analysis of the performance depending on the network architecture are provided in the supplementary material. Our code and training data are available at http://lmb.informatik.uni-freiburg.de/resources.
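As a concrete reference, here is a sketch of the two architectures as we read the description above (PyTorch-style for illustration; the filter counts, 5×5 receptive fields, 2×2 max-pooling, and dropout follow the text, but padding, activation placement, and the dropout rate are our assumptions, not the authors' specification).

```python
import torch.nn as nn

def exemplar_cnn(conv_filters, fc_units, num_surrogate_classes, dropout=0.5):
    """Build the 'small' (64c5-64c5-128f) or 'large' (64c5-128c5-256c5-512f) network.

    5x5 convolutions, 2x2 max-pooling after the first and second conv layers,
    dropout on the fully connected layers, and a linear output layer over the
    surrogate classes (softmax is applied inside the cross-entropy loss).
    Padding and the dropout rate are illustrative assumptions.
    """
    layers, in_ch = [], 3
    for k, out_ch in enumerate(conv_filters):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2), nn.ReLU()]
        if k < 2:  # max-pooling after the first and second convolutional layers
            layers += [nn.MaxPool2d(2)]
        in_ch = out_ch
    layers += [nn.Flatten(),
               nn.Dropout(dropout), nn.LazyLinear(fc_units), nn.ReLU(),
               nn.Dropout(dropout), nn.Linear(fc_units, num_surrogate_classes)]
    return nn.Sequential(*layers)

small = exemplar_cnn([64, 64], 128, num_surrogate_classes=8000)
large = exemplar_cnn([64, 128, 256], 512, num_surrogate_classes=16000)
```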

We applied the feature representation to images of arbitrary size by convolutionally computing the responses of all the network layers except the top softmax. To each feature map, we applied the pooling method that is commonly used for the respective dataset: 1) 4-quadrant max-pooling, resulting in 4 values per feature map, which is the standard procedure for STL-10 and CIFAR-10 [26, 10, 27, 12]; 2) 3-level spatial pyramid, i.e. max-pooling over the whole image as well as within 4 quadrants and within the cells of a 4×4 grid, resulting in 1 + 4 + 16 = 21 values per feature map, which is the standard for Caltech-101 [28, 10, 29]. Finally, we trained a linear support vector machine (SVM) on the pooled features.
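The following minimal sketch illustrates the two pooling schemes on a single 2-D feature map (NumPy; it assumes the feature map side length is divisible by 4, and the function and dataset names are our own).

```python
import numpy as np

def grid_max_pool(fmap, grid):
    """Max-pool a 2-D feature map over a grid x grid partition -> grid*grid values."""
    h, w = fmap.shape
    cells = [fmap[i * h // grid:(i + 1) * h // grid,
                  j * w // grid:(j + 1) * w // grid].max()
             for i in range(grid) for j in range(grid)]
    return np.array(cells)

def pool_features(fmap, dataset):
    """4-quadrant pooling (STL-10/CIFAR-10) or 3-level spatial pyramid (Caltech-101)."""
    if dataset in ('STL-10', 'CIFAR-10'):
        return grid_max_pool(fmap, 2)                        # 4 values per feature map
    return np.concatenate([grid_max_pool(fmap, g) for g in (1, 2, 4)])  # 1 + 4 + 16 = 21
```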

On all datasets we used the standard training and test protocols. On STL-10 the SVM was trained on 10 pre-defined folds of the training data. We report the mean and standard deviation achieved on the fixed test set. For CIFAR-10 we report two results: (1) training the SVM on the whole CIFAR-10 training set ('CIFAR-10'); (2) the average over 10 random selections of 400 training samples per class ('CIFAR-10(400)'). For Caltech-101 we followed the usual protocol of selecting 30 random samples per class for training and not more than 50 samples per class for testing. This was repeated 10 times.

4.2 Classification Results

In Table 1 we compare Exemplar-CNN to several unsupervised feature learning methods, including the current state-of-the-art on each dataset. We also list the state-of-the-art for supervised learning (which is not directly comparable). Additionally we show the dimensionality of the feature vectors produced by each method before final pooling. The small network was trained on 8000 surrogate classes containing 150 samples each and the large one on 16000 classes with 100 samples each.

Table 1: Classification accuracies on several datasets (in percent). † Average per-class accuracy² 78.0% ± 0.4%. ‡ Average per-class accuracy 84.4% ± 0.6%.

| Algorithm | STL-10 | CIFAR-10(400) | CIFAR-10 | Caltech-101 | #features |
|---|---|---|---|---|---|
| Convolutional K-means Network [26] | 60.1±1 | 70.7±0.7 | 82.0 | — | 8000 |
| Multi-way local pooling [28] | — | — | — | 77.3±0.6 | 1024×64 |
| Slowness on videos [10] | 61.0 | — | — | 74.6 | 556 |
| Hierarchical Matching Pursuit (HMP) [27] | 64.5±1 | — | — | — | 1000 |
| Multipath HMP [29] | — | — | — | 82.5±0.5 | 5000 |
| View-Invariant K-means [12] | 63.7 | 72.6±0.7 | 81.9 | — | 6400 |
| Exemplar-CNN (64c5-64c5-128f) | 67.1±0.3 | 69.7±0.3 | 75.7 | 79.8±0.5† | 256 |
| Exemplar-CNN (64c5-128c5-256c5-512f) | 72.8±0.4 | 75.3±0.2 | 82.0 | 85.5±0.4‡ | 960 |
| Supervised state of the art | 70.1 [30] | — | 91.2 [31] | 91.44 [32] | — |

²On Caltech-101 one can either measure average accuracy over all samples (average overall accuracy) or calculate the accuracy for each class and then average these values (average per-class accuracy). These differ, as some classes contain fewer than 50 test samples. Most researchers in ML use average overall accuracy.

The features extracted from the larger network match or outperform the best prior result on all datasets. This is despite the fact that the dimensionality of the feature vector is smaller than that of most other approaches and that the networks are trained on the STL-10 unlabeled dataset (i.e. they are used in a transfer learning manner when applied to CIFAR-10 and Caltech-101). The increase in performance is especially pronounced when only few labeled samples are available for training the SVM (as is the case for all the datasets except full CIFAR-10). This is in agreement with previous evidence that, with increasing feature vector dimensionality and number of labeled samples, training an SVM becomes less dependent on the quality of the features [26, 12]. Remarkably, on STL-10 we achieve an accuracy of 72.8%, which is a large improvement over all previously reported results.

4.3 Detailed Analysis

We performed additional experiments (using the "small" network) to study the effect of three design choices in Exemplar-CNN training and to validate the invariance properties of the learned features. Experiments on sampling 'seed' patches from different datasets can be found in the supplementary material.

4.3.1 Number of Surrogate Classes

We varied the number N of surrogate classes between 50 and 32000. As a sanity check, we also tried classification with random filters. The results are shown in Fig. 3.

Clearly, the classification accuracy increases with the number of surrogate classes until it reaches an optimum at about 8000 surrogate classes, after which it does not change or even decreases. This is to be expected: the larger the number of surrogate classes, the more likely it is to draw very similar or even identical samples, which are hard or impossible to discriminate. A few such cases are not detrimental to the classification performance, but as soon as such collisions dominate the set of surrogate labels, the discriminative loss is no longer reasonable and training the network on the surrogate task no longer succeeds. To check the validity of this explanation we also plot in Fig. 3 the classification error on the validation set (taken from the surrogate data) computed after training the network. It rapidly grows as the number of surrogate classes increases. We also observed that the optimal number of surrogate classes increases with the size of the network (not shown in the figure), but eventually saturates. This demonstrates the main limitation of our approach of randomly sampling 'seed' patches: it does not scale to arbitrarily large amounts of unlabeled data. However, we do not see this as a fundamental restriction and discuss possible solutions in Section 5.

Figure 3: Influence of the number of surrogate training classes. The validation error on the surrogate data is shown in red. Note the different y-axes for the two curves.

Figure 4: Classification performance on STL for different numbers of samples per class. Random filters can be seen as '0 samples per class'.

4.3.2 Number of Samples per Surrogate Class

Fig. 4 shows the classification accuracy when the number K of training samples per surrogate class varies between 1 and 300. The performance improves with more samples per surrogate class and saturates at around 100 samples. This indicates that this amount is sufficient to approximate the formal objective from Eq. (3), hence further increasing the number of samples does not significantly change the optimization problem. On the other hand, if the number of samples is too small, there is insufficient data to learn the desired invariance properties.

4.3.3 Types of Transformations

Figure 5: Influence of removing groups of transformations during generation of the surrogate training data. Baseline ('0' value) is applying all transformations. Each group of three bars corresponds to removing some of the transformations.

We varied the transformations used for creating the surrogate data to analyze their influence on the final classification performance. The set of 'seed' patches was fixed. The result is shown in Fig. 5. The value '0' corresponds to applying random compositions of all elementary transformations: scaling, rotation, translation, color variation, and contrast variation. Different columns of the plot show the difference in classification accuracy as we discarded some types of elementary transformations.

Several tendencies can be observed. First, rotation and scaling have only a minor impact on the performance, while translations, color variations and contrast variations are significantly more important. Secondly, the results on STL-10 and CIFAR-10 consistently show that spatial invariance and color-contrast invariance are approximately of equal importance for the classification performance. This indicates that variations in color and contrast, though often neglected, may also improve performance in a supervised learning scenario. Thirdly, on Caltech-101 color and contrast transformations are much more important compared to spatial transformations than on the two other datasets. This is not surprising, since Caltech-101 images are often well aligned, and this dataset bias makes spatial invariance less useful.

4.3.4 Invariance Properties of the Learned Representation

In a final experiment, we analyzed to what extent the representation learned by the network is invariant to the transformations applied during training. We randomly sampled 500 images from the STL-10 test set and applied a range of transformations (translation, rotation, contrast, color) to each image. To avoid empty regions beyond the image boundaries when applying spatial transformations, we cropped the central 64×64 pixel sub-patch from each 96×96 pixel image. We then applied two measures of invariance to these patches.

Figure 6: Invariance properties of the feature representation learned by Exemplar-CNN. (a): Normalized Euclidean distance between feature vectors of the original and the translated image patches vs. the magnitude of the translation. (b)-(d): classification performance on transformed image patches vs. the magnitude of the transformation, for various magnitudes of transformations applied for creating surrogate data. (b): rotation, (c): additive color change, (d): multiplicative contrast change.

First, as an explicit measure of invariance, we calculated the normalized Euclidean distance between normalized feature vectors of the original image patch and the transformed one [10] (see the supplementary material for details). The downside of this approach is that the distance between extracted features does not take into account how informative and discriminative they are. We therefore evaluated a second measure, classification performance depending on the magnitude of the transformation applied to the classified patches, which does not suffer from this problem. To compute the classification accuracy, we trained an SVM on the central 64×64 pixel patches from one fold of the STL-10 training set and measured classification performance on all transformed versions of 500 samples from the test set.
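A sketch of the first measure as we understand it (following [10]): L2-normalize the feature vectors, compute the Euclidean distance between original and transformed versions, and divide by the average distance between features of unrelated patches so that unrelated patches sit at distance 1. The normalization constant is our reading of "normalized distance" and is an assumption; the exact definition is given in the supplementary material of the paper.

```python
import numpy as np

def normalized_feature_distance(feats_orig, feats_trans):
    """Normalized Euclidean distance between normalized feature vectors.

    feats_orig, feats_trans: (N, D) arrays of features of the original patches
    and their transformed versions. Distances are divided by the mean distance
    between features of *different* patches (assumed normalization).
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    a, b = l2norm(feats_orig), l2norm(feats_trans)
    paired = np.linalg.norm(a - b, axis=1)           # distance original <-> transformed
    # baseline: average distance between features of different (unrelated) patches
    baseline = np.linalg.norm(a - np.roll(a, 1, axis=0), axis=1).mean()
    return paired / (baseline + 1e-8)
```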

The results of both experiments are shown in Fig. 6. Due to space restrictions we show only a few representative plots. Overall, the experiment empirically confirms that the Exemplar-CNN objective leads to learning invariant features. Features in the third layer and the final pooled feature representation compare favorably to a HOG baseline (Fig. 6 (a)). Furthermore, adding stronger transformations in the surrogate training data leads to more invariant classification with respect to these transformations (Fig. 6 (b)-(d)). However, adding too much contrast variation may deteriorate classification performance (Fig. 6 (d)). One possible reason is that the level of contrast can itself be a useful feature: for example, strong edges in an image are usually more important than weak ones.

5 Discussion

We have proposed a discriminative objective for unsupervised feature learning by training a CNN without class labels. The core idea is to generate a set of surrogate labels via data augmentation. The features learned by the network yield a large improvement in classification accuracy compared to features obtained with previous unsupervised methods. These results strongly indicate that a discriminative objective is superior to objectives previously used for unsupervised feature learning.

One potential shortcoming of the proposed method is that in its current state it does not scale to arbitrarily large datasets. Two probable reasons for this are that (1) as the number of surrogate classes grows larger, many of them become similar, which contradicts the discriminative objective, and (2) the surrogate task we use is relatively simple and does not allow the network to learn invariance to complex variations, such as 3D viewpoint changes or inter-instance variation. We hypothesize that the presented approach could learn more powerful higher-level features if the surrogate data were more diverse. This could be achieved by using additional weak supervision, for example by means of video or a small number of labeled samples. Another possible way of obtaining richer surrogate training data, and at the same time avoiding similar surrogate classes, would be (unsupervised) merging of similar surrogate classes. We see these as interesting directions for future work.

Acknowledgements

We acknowledge funding by the ERC Starting Grant VideoLearn (279401); the work was also partly supported by the BrainLinks-BrainTools Cluster of Excellence funded by the German Research Foundation (DFG, grant number EXC 1086).

References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[2] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[5] K. Cho. Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In ICML. JMLR Workshop and Conference Proceedings, 2013.
[6] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. Pre-print, arXiv:1405.5769v1 [cs.CV], 2014.
[7] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[8] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In NIPS, 2010.
[9] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008.
[10] W. Y. Zou, A. Y. Ng, S. Zhu, and K. Yu. Deep learning of invariant features via simulated fixations in video. In NIPS, pages 3212–3220, 2012.
[11] K. Sohn and H. Lee. Learning invariant representations with local transformations. In ICML, 2012.
[12] K. Y. Hui. Direct modeling of complex invariances for visual object features. In ICML, 2013.
[13] P. Simard, B. Victorri, Y. LeCun, and J. S. Denker. Tangent Prop - a formalism for specifying selected invariances in an adaptive network. In NIPS, 1992.
[14] H. Drucker and Y. LeCun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991–997, 1992.
[15] M.-R. Amini and P. Gallinari. Semi supervised logistic regression. In ECAI, pages 390–394, 2002.
[16] Y. Grandvalet and Y. Bengio. Entropy regularization. In Semi-Supervised Learning, pages 151–168. MIT Press, 2006.
[17] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing. Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks. In ECCV (3), pages 69–82, 2008.
[18] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.
[19] S. Wager, S. Wang, and P. Liang. Dropout training as adaptive regularization. In NIPS, 2013.
[20] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. In NIPS, 2011.
[21] A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
[22] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[23] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR WGMBV, 2004.
[24] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. Pre-print, arXiv:1207.0580v3, 2012.
[25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[26] A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In NIPS, pages 2528–2536, 2011.
[27] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for RGB-D based object recognition. In ISER, June 2012.
[28] Y. Boureau, N. Le Roux, F. Bach, J. Ponce, and Y. LeCun. Ask the locals: multi-way local pooling for image recognition. In ICCV, 2011.
[29] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In CVPR, pages 660–667, 2013.
[30] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In NIPS, 2013.
[31] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[32] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
