
Hierarchical Convolutional Features for Visual Tracking


Chao Ma

SJTU

Jia-Bin Huang

UIUC


Xiaokang Yang

SJTU


Ming-Hsuan Yang

UC Merced


Abstract

Visual object tracking is challenging as target objects often undergo significant appearance changes caused by deformation, abrupt motion, background clutter and occlusion. In this paper, we exploit features extracted from deep convolutional neural networks trained on object recognition datasets to improve tracking accuracy and robustness. The outputs of the last convolutional layers encode the semantic information of targets, and such representations are robust to significant appearance variations. However, their spatial resolution is too coarse to precisely localize targets. In contrast, earlier convolutional layers provide more precise localization but are less invariant to appearance changes. We interpret the hierarchies of convolutional layers as a nonlinear counterpart of an image pyramid representation and exploit these multiple levels of abstraction for visual tracking. Specifically, we adaptively learn correlation filters on each convolutional layer to encode the target appearance. We hierarchically infer the maximum response of each layer to locate targets. Extensive experimental results on a large-scale benchmark dataset show that the proposed algorithm performs favorably against state-of-the-art methods.

1. Introduction

Visual object tracking is one of the fundamental problems in computer vision with numerous applications [33, 26]. A typical scenario of visual tracking is to track an unknown target object, specified by a bounding box in the first frame. Despite significant progress in recent decades, visual tracking is still a challenging problem, mainly due to large appearance changes caused by occlusion, deformation, abrupt motion, illumination variation, and background clutter. Recently, features based on convolutional neural networks (CNNs) have demonstrated state-of-the-art results on a wide range of visual recognition tasks [20, 11]. It is thus of great interest to understand how to best exploit the rich feature hierarchies in CNNs for robust visual tracking.

Existing deep learning based trackers [30, 21, 29, 18] typically draw positive and negative training samples around the estimated target location to incrementally learn a classifier over features extracted from a CNN. Two issues ensue with such approaches. The first issue lies in the use of neural networks as an online classifier following recent object recognition algorithms, where only the outputs of the last layer are used to represent targets. For high-level visual recognition problems, it is effective to use features from the last layer as they are most closely related to category-level semantics and most invariant to nuisance variables such as intra-class variations and precise locations. However, the objective of visual tracking is to locate targets precisely rather than to infer their semantic categories. Using only the features from the last layer is thus not the optimal representation for targets. The second issue concerns extracting training samples. Training a robust classifier requires a considerably large number of positive and negative samples, which are not available in visual tracking. In addition, there lies ambiguity in determining a decision boundary since positive and negative samples are highly correlated due to sampling near a target.

In this work, we address these two issues by (i) using features from hierarchical layers of CNNs rather than only the last layer to represent targets; and (ii) learning adaptive correlation filters on each CNN layer without the need for sampling. Our approach builds on the observation that although the last layers of CNNs are more effective in capturing semantics, they are insufficient for capturing fine-grained spatial details such as object positions. The earlier layers, on the other hand, are precise in localization but do not capture semantics, as illustrated in Figure 1. This observation suggests that reasoning with multiple layers of CNN features for visual tracking is of great importance, as semantics are robust to significant appearance variations and spatial details are effective for precise localization. We exploit both the hierarchical features from the recent advances in CNNs and the inference approach across multiple levels used in classical computer vision problems. For example, computing optical flow from the coarse levels of an image pyramid is efficient, but finer levels are required for obtaining an accurate and detailed flow field. A coarse-to-fine searching strategy is often adopted for best results [23].

Figure 1. Convolutional layers of a typical CNN model, e.g., AlexNet [20], provide multiple levels of abstraction in the feature hierarchies. The features in the earlier layers retain higher spatial resolution for precise localization, with low-level visual information (e.g., intensity, responses similar to Gabor filters [4]). On the other hand, features in the latter layers (e.g., fc7 in AlexNet) capture more semantic information and less fine-grained spatial detail. Our approach exploits the semantic information of the last layers to handle large appearance changes, and alleviates drifting by using features of earlier layers for precise localization.

In light of this connection, we learn one adaptive correlation filter [3, 16, 7, 35, 6, 17] over the features extracted from each CNN layer and use these multi-level correlation response maps to collaboratively infer the target location. We consider all the shifted versions of features as training samples and regress them to a Gaussian function with a small spatial bandwidth, thereby alleviating the sampling ambiguity of training a binary discriminative classifier.

We make the following three contributions. First, we propose to use the rich feature hierarchies of CNNs as target representations for visual tracking, where both semantics and fine-grained details are simultaneously exploited to handle large appearance variations and avoid drifting. Second, we adaptively learn linear correlation filters on each CNN layer to alleviate the sampling ambiguity. We infer the target location using the multi-level correlation response maps in a coarse-to-fine fashion. Third, we carry out extensive experiments on a large-scale benchmark dataset [32] with 100 challenging image sequences and demonstrate that the proposed tracking algorithm performs favorably against existing state-of-the-art methods in terms of accuracy and robustness.

2. Related Work

In this section, we discuss tracking methods closely related to this work. We refer the readers to comprehensive reviews on visual tracking in [33, 22, 26].

Tracking by Binary Classifiers. Visual tracking can be posed as a repeated detection problem in a local window (known as tracking-by-detection), where classifiers are often learned online. For each frame, a set of positive and negative training samples is collected for incrementally learning a discriminative classifier to separate a target from its background. However, the sampling ambiguity problem arises with such approaches that draw samples for learning online classifiers. Slight inaccuracies in labeling samples affect the classifier and gradually cause the trackers to drift. Considerable efforts have been made to alleviate these model update problems caused by sample ambiguity. The core idea of these algorithms lies in how to properly update a discriminative classifier to reduce drift. Examples include multiple instance learning (MIL) [1], semi-supervised learning [10, 12], and P-N learning [19]. Instead of learning only one single classifier, Zhang et al. [34] combine multiple classifiers with different learning rates. On the other hand, Hare et al. [13] show that the objective of label prediction using a classifier is not explicitly coupled to the objective of tracking (accurate position estimation) and pose tracking as a joint structured output prediction problem. By alleviating the sampling ambiguity problem, these methods perform well in a recent benchmark study [31]. We address the sample ambiguity with correlation filters, where training samples are regressed to the soft labels of a Gaussian function rather than binary labels for discriminative classifier learning.

Tracking by Correlation Filters. Correlation filters for visual tracking have attracted considerable attention due to their high computational efficiency with the use of fast Fourier transforms. Tracking methods based on correlation filters regress all the circular-shifted versions of the input features to a target Gaussian function, and thus no hard-thresholded samples of target appearance are needed. Bolme et al. [3] learn a minimum output sum of squared error filter over the luminance channel for fast visual tracking. Several extensions have been proposed to considerably improve tracking accuracy, including kernelized correlation filters [16], multi-dimensional features [17, 7], context learning [35] and scale estimation [6]. In this work, we propose to learn correlation filters over multi-dimensional features in a way similar to existing methods [7, 17]. The main differences lie in the use of learned CNN features rather than hand-crafted features (e.g., HOG [5] or color attributes [7]), and in constructing multiple correlation filters on hierarchical convolutional layers as opposed to the single filter used by existing approaches.

Tracking by CNNs. Visual representations are of great importance for object tracking. Numerous hand-crafted features have been used to represent the target appearance, such as subspace representations [24] and color histograms [37]. Recent years have witnessed significant advances of CNNs on visual recognition problems. Wang and Yeung [30] propose a deep learning tracker (DLT) using a multi-layer autoencoder network. This network is pre-trained on part of the 80M Tiny Images dataset [27] in an unsupervised fashion. On the other hand, Wang et al. [29] propose to learn a two-layer neural network on a video repository [39], where temporal slowness constraints are imposed for feature learning. Li et al. [21] construct multiple CNN classifiers on different instances of target objects to rule out noisy samples during model update.

Figure 2. Visualization of the CNN features of an image with a horizontal step edge. The first three principal components of the features from three layers (conv3-4, conv4-4, and conv5-4) at the dashed lines are visualized as normalized intensity profiles. Note that the conv5-4 layer is less effective in locating the step edge due to its low spatial resolution, while the conv3-4 layer is more useful for precise localization.

The DeepTrack method [21] learns two-layer CNN classifiers from binary samples and does not require a pre-training procedure. Hong et al. [18] learn a target-specific saliency map using a pre-trained CNN. We note that the aforementioned CNN trackers all rely on positive and negative training samples and exploit only the features from the last layer. In contrast, our approach builds on adaptive correlation filters which regress the dense, circularly shifted samples with soft labels and effectively alleviate sampling ambiguity. In addition, we exploit the features from multiple convolutional layers to encode target appearance. We extract CNN features using the VGG-Net [25], which is trained on the large-scale ImageNet dataset [8] with category-level labels. We also note that the DLT [30] and DeepTrack [21] methods update the appearance models by fine-tuning CNNs online, while Wang et al. [29] and our algorithm use classifier learning for model update.

3. Overview

Our approach builds on the observation that the last layers of CNNs encode the semantic abstraction of targets and their outputs are robust to appearance variations. On the other hand, the early layers retain more fine-grained spatial details and thus are useful for precise localization. We show in Figure 2 an image of a horizontal step edge and visualize the CNN features of the third, fourth, and fifth convolutional layers, where the fifth convolutional layer is less effective in locating the sharp boundary due to its low spatial resolution, while the third layer is more useful for locating it precisely. Our goal is to exploit the best of both semantics and fine-grained details for visual object tracking. Figure 3 illustrates the main steps of our algorithm: we learn an adaptive linear correlation filter over the outputs of each convolutional layer and search the multi-level correlation response maps in a coarse-to-fine fashion to infer the location of targets.

Figure 3. Main steps of the proposed algorithm. Given an image, we first crop the search window centered at the estimated position in the previous frame and use the third, fourth and fifth convolutional layers (conv3, conv4, conv5) as our target representations. Each layer indexed by i is then convolved with the learned linear correlation filter w^(i) to generate a response map, whose location of maximum value indicates the estimated target position. We search the multi-level response maps to infer the target location in a coarse-to-fine fashion.

4. Proposed Algorithm

In this section, we first present the CNN features used in this work, technical details on learning linear correlation filters, and the coarse-to-fine searching strategy. Finally, we introduce the online model update scheme.

4.1. Convolutional Features

We use the convolutional feature maps from a CNN, e.g., AlexNet [20] or VGG-Net [25], to encode target appearance. Along the CNN forward propagation, the semantic discrimination between objects from different categories is strengthened, while the spatial resolution needed for precise localization is gradually reduced (see also Figure 1). For visual object tracking, we are interested in accurate locations of a target object. We thus ignore the fully-connected layers as they retain little spatial resolution, i.e., 1×1.

Due to the pooling operators used in CNNs, the spatial resolution is gradually reduced with the increase of the depth of convolutional layers. For example, the convolutional feature maps of pool5 in the VGG-Net are of spatial size 7×7, which is 1/32 of the input image size 224×224. Such low spatial resolution is insufficient to locate targets accurately. We alleviate this issue by resizing each feature map to a fixed larger size with bilinear interpolation. Let h denote the feature map and x the upsampled feature map. The feature vector for the i-th location is:

x_i = \sum_k \alpha_{ik} h_k,    (1)

where the interpolation weight \alpha_{ik} depends on the positions of the i-th and the k-th neighboring feature vectors, respectively. Note that this interpolation takes place in the spatial domain and can be seen as an interpolation of locations.
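As an illustration, the upsampling in (1) can be sketched with bilinear interpolation over a multi-channel feature map. The NumPy function below is a minimal example; the layer sizes shown are placeholders rather than the exact VGG-Net dimensions used in the paper.

```python
import numpy as np

def upsample_bilinear(h, out_h, out_w):
    """Bilinearly resize a feature map h of shape (H, W, D) to (out_h, out_w, D).

    Each output vector x_i is a weighted sum of neighboring feature vectors
    h_k, as in Eq. (1); the weights depend only on spatial position.
    """
    H, W, D = h.shape
    # Sample positions in the input grid for each output location.
    ys = np.linspace(0, H - 1, out_h)
    xs = np.linspace(0, W - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None, None]   # fractional offsets act as the weights alpha_ik
    wx = (xs - x0)[None, :, None]
    top = (1 - wx) * h[y0][:, x0] + wx * h[y0][:, x1]
    bot = (1 - wx) * h[y1][:, x0] + wx * h[y1][:, x1]
    return (1 - wy) * top + wy * bot

# e.g., upsample a 7x7x512 pool5-like map to a 56x56 grid (sizes are illustrative)
x = upsample_bilinear(np.random.randn(7, 7, 512), 56, 56)
```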

Figure 4. Visualization of convolutional layers. (a) Four frames (#015, #040, #060, #100) from the challenging MotorRolling sequence. (b)-(d) Features extracted on convolutional layers conv3-4, conv4-4, and conv5-4 using the VGG-Net [25]. The yellow bounding boxes indicate the tracking results of our method. Notice that although the appearance of the target changes significantly, the features from the conv5-4 layer (d) are able to discriminate it readily even when the background has changed dramatically. The conv4-4 (c) and conv3-4 (b) layers encode more fine-grained details and are useful for locating the target precisely.

We visualize the upsampled outputs of the third, fourth, and fifth layers by projecting the features onto their corresponding first three principal components on the MotorRolling sequence in Figure 4. As shown in Figure 4(d), features on the fifth layer are effective in discriminating the targets even with dramatic background changes. We note that this insight is also exploited in [14] for segmentation and fine-grained localization using CNNs, in which features from multiple layers are concatenated together. However, this feature representation ignores the coarse-to-fine hierarchy in the CNN architecture, and does not work well for visual tracking as shown in our experimental validation (see Section 6).

4.2. Correlation Filters

A typical correlation tracker [3, 16, 7, 35, 6] learns a discriminative classifier and estimates the translation of target objects by searching for the maximum value of the correlation response map. In this work, the outputs of each convolutional layer are used as multi-channel features [9, 2, 17]. Denote x as the l-th layer of feature vector of size M×N×D, where M, N, and D indicate the width, height, and the number of channels, respectively. Here we concisely denote x^{(l)} as x and ignore the dependence of M, N, and D on the layer index l. We consider all the circular shifts of x along the M and N dimensions as training samples. Each shifted sample x_{m,n}, (m, n) ∈ {0, 1, ..., M-1}×{0, 1, ..., N-1}, has a Gaussian function label

y(m, n) = e^{-\frac{(m - M/2)^2 + (n - N/2)^2}{2\sigma^2}},

where \sigma is the kernel width. A correlation filter w with the same size as x is then learned by solving the following minimization problem:

w^* = \arg\min_w \sum_{m,n} \| w \cdot x_{m,n} - y(m, n) \|^2 + \lambda \| w \|_2^2,    (2)

where \lambda is a regularization parameter (\lambda ≥ 0) and the inner product is induced by a linear kernel in the Hilbert space, e.g., w \cdot x_{m,n} = \sum_{d=1}^{D} w_{m,n,d} x_{m,n,d}. As the label y(m, n) is soft (not binary), no hard-thresholded samples are required. Notice that the minimization problem in (2) is akin to training the vector correlation filters in [2], and can be solved in each individual feature channel using the fast Fourier transform (FFT). Let capital letters denote the corresponding Fourier transformed signals. The learned filter in the frequency domain on the d-th (d ∈ {1, ..., D}) channel can be written as

W^d = \frac{Y \odot \bar{X}^d}{\sum_{i=1}^{D} X^i \odot \bar{X}^i + \lambda}.    (3)

In (3), Y is the Fourier transform of y = { y(m, n) | (m, n) ∈ {0, 1, ..., M-1}×{0, 1, ..., N-1} }, and the bar denotes complex conjugation. The operator \odot is the Hadamard (element-wise) product. Given an image patch in the next frame, the feature vector on the l-th layer is denoted by z, of size M×N×D. The l-th correlation response map is computed by

f_l = \mathcal{F}^{-1} \Big( \sum_{d=1}^{D} W^d \odot \bar{Z}^d \Big).    (4)

The operator \mathcal{F}^{-1} denotes the inverse FFT. The target location on the l-th convolutional layer can then be estimated by searching for the position of the maximum value of the correlation response map f_l of size M×N.
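For concreteness, the closed-form filter of (3) and the response map of (4) can be sketched in NumPy as below. This is a minimal single-layer sketch, not the full MATLAB implementation: gaussian_labels builds the soft targets y(m, n), learn_filter keeps the numerator and denominator of (3) separately (which also serves the update rule in Section 4.4), and response_map evaluates (4).

```python
import numpy as np

def gaussian_labels(M, N, sigma):
    """Soft regression targets y(m, n): a Gaussian peaked at the center (M/2, N/2)."""
    m = np.arange(M)[:, None]
    n = np.arange(N)[None, :]
    return np.exp(-((m - M / 2) ** 2 + (n - N / 2) ** 2) / (2 * sigma ** 2))

def learn_filter(x, y):
    """Closed-form solution of (2) per Eq. (3), kept as numerator/denominator.

    x: (M, N, D) features of the target patch; y: (M, N) Gaussian labels.
    """
    X = np.fft.fft2(x, axes=(0, 1))
    Y = np.fft.fft2(y)
    A = Y[..., None] * np.conj(X)              # per-channel numerator Y * conj(X^d)
    B = np.sum(X * np.conj(X), axis=2).real    # shared denominator sum_i X^i * conj(X^i)
    return A, B

def response_map(A, B, z, lam=1e-4):
    """Correlation response f_l of Eq. (4) for a new (M, N, D) feature map z."""
    Z = np.fft.fft2(z, axes=(0, 1))
    W = A / (B[..., None] + lam)               # Eq. (3)
    return np.real(np.fft.ifft2(np.sum(W * np.conj(Z), axis=2)))
```

The location of the maximum of response_map(...) then gives the translation estimate on that layer.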

4.3. Coarse-to-Fine Translation Estimation

Given the set of correlation response maps {f_l}, we hierarchically infer the target translation on each layer, i.e., the location of the maximum value in the last layer is used as a regularization to search for the maximum value of an earlier layer. Let (\hat{m}, \hat{n}) = \arg\max_{m,n} f_l(m, n) indicate the location of the maximum value on the l-th layer. The optimal location of the target in the (l-1)-th layer is formulated as:

\arg\max_{m,n} f_{l-1}(m, n) + \gamma f_l(m, n),    (5)
s.t. |m - \hat{m}| + |n - \hat{n}| ≤ r.

The constraint indicates that only the r×r neighboring regions of (\hat{m}, \hat{n}) are searched in the (l-1)-th correlation response map. The response values from the later layers are weighted by the regularization term \gamma and then back-propagated to the response maps of earlier layers. The target location is finally estimated by maximizing (5) on the layer with the finest spatial resolution.
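A sketch of this hierarchical inference follows: starting from the coarsest response map, each finer map is searched only within a neighborhood of the current estimate, with the coarser response added as a weighted regularizer as in (5). A box neighborhood is used here for simplicity, and the chaining of the accumulated map across layers is one plausible reading of the back-propagation described above.

```python
import numpy as np

def coarse_to_fine(responses, gammas, r):
    """Hierarchical inference per Eq. (5).

    responses: list of (M, N) response maps ordered coarse (conv5-4) to
    fine (conv3-4); gammas: weight of the coarser map at each finer level;
    r: search radius around the current estimate.
    """
    f = responses[0]
    m_hat, n_hat = np.unravel_index(np.argmax(f), f.shape)
    for f_fine, gamma in zip(responses[1:], gammas):
        g = f_fine + gamma * f                      # objective of Eq. (5)
        masked = np.full_like(g, -np.inf)
        m0, m1 = max(m_hat - r, 0), m_hat + r + 1   # restrict to the neighborhood
        n0, n1 = max(n_hat - r, 0), n_hat + r + 1
        masked[m0:m1, n0:n1] = g[m0:m1, n0:n1]
        m_hat, n_hat = np.unravel_index(np.argmax(masked), masked.shape)
        f = g                                       # propagate weighted responses
    return m_hat, n_hat
```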

4.4. Model Update

An optimal filter on the l-th layer can be updated by minimizing the output error over all tracked results so far, as described in [2]. However, this involves solving a D×D linear system of equations per location (m, n), which is computationally expensive as the number of channels is usually large for CNN features (e.g., D = 512 in the conv5-4 and conv4-4 layers of the VGG-Net). To obtain a robust approximation, we update the numerator A^d and denominator B^d of the correlation filter W^d in (3) separately using a moving average:

A^d_t = (1 - \eta) A^d_{t-1} + \eta Y \odot \bar{X}^d_t,    (6a)
B^d_t = (1 - \eta) B^d_{t-1} + \eta \sum_{i=1}^{D} X^i_t \odot \bar{X}^i_t,    (6b)
W^d_t = \frac{A^d_t}{B^d_t + \lambda},    (6c)

where t is the frame index and \eta is a learning rate.
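Continuing the learn_filter/response_map sketch above, the moving-average update of (6) amounts to keeping the numerator and denominator as running state:

```python
import numpy as np

def update_filter(A, B, x_t, y, eta=0.01):
    """Moving-average update of Eqs. (6a)-(6b).

    A, B: current numerator/denominator; x_t: (M, N, D) features at the
    newly estimated position in frame t; y: Gaussian labels; eta: learning rate.
    """
    X = np.fft.fft2(x_t, axes=(0, 1))
    Y = np.fft.fft2(y)
    A = (1 - eta) * A + eta * (Y[..., None] * np.conj(X))          # Eq. (6a)
    B = (1 - eta) * B + eta * np.sum(X * np.conj(X), axis=2).real  # Eq. (6b)
    return A, B   # the filter is W = A / (B[..., None] + lam), Eq. (6c)
```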

5. Implementation Details

We present the main steps of the proposed tracking algorithm in Algorithm 1 and the implementation details as follows. We adopt the VGG-Net-19 [25] trained on ImageNet [8] for feature extraction. We first remove the fully-connected layers and use the outputs of the conv3-4, conv4-4 and conv5-4 convolutional layers as our features. Notice that we do not use the outputs of the pooling layers because we want to retain more spatial resolution on each convolutional layer. Given an image frame with a search window of size M×N (e.g., 1.8 times the target size), we set a fixed spatial size of M/4 × N/4 to resize the features from each convolutional layer.

The parameters for training correlation filters on each layer are kept the same. We set the regularization parameter of (2) to \lambda = 10^{-4}, and use a kernel width of 0.1 for generating the Gaussian function labels. The learning rate \eta in (6) is set to 0.01. To remove boundary discontinuities, the extracted feature channels of each convolutional layer are weighted by a cosine window [3]. We set the value of \gamma to 1, 0.5 and 0.02 for the conv4-4, conv3-4 and conv5-4 layers, respectively. We observe that the results are not sensitive to the parameter r of the neighborhood search constraint. This amounts to simply summing over the weighted response maps from multiple layers to infer the target location.

Algorithm 1: Proposed tracking algorithm.

Input: Initial target position p_0.
Output: Estimated object position p_t = (x_t, y_t), and learned correlation filters {w^l_t}, l ∈ {5, 4, 3}.

1: repeat
2:   Crop out the search window in frame t centered at (x_{t-1}, y_{t-1}) and extract convolutional features with spatial interpolation using (1);
3:   foreach layer l do: compute the confidence score f_l using w^{(l)}_{t-1} and (4);
4:   Estimate the new position (x_t, y_t) on the response map set {f_l} in a coarse-to-fine fashion using (5);
5:   Crop out a new patch centered at p_t = (x_t, y_t) and extract convolutional features with interpolation;
6:   foreach layer l do: update the correlation filters {w^l_t} using (6);
7: until end of video sequence;
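Putting the pieces together, one iteration of Algorithm 1 might look like the sketch below, reusing the helpers defined earlier. Here extract_features and to_image_coords are assumed helpers (cropping the search window, running the CNN forward pass with the upsampling of Eq. (1), and mapping response-map coordinates back to pixels), and the gamma values are illustrative rather than the exact per-layer settings above.

```python
def track_frame(frame, pos, models, layers=(5, 4, 3), gammas=(0.5, 0.5), r=4):
    """One iteration of Algorithm 1 (a sketch; parameter values are illustrative).

    models[l] holds (A, B, y) for layer l, e.g., initialized on the first
    frame with learn_filter. extract_features(frame, pos, l) is an assumed
    helper returning the windowable (M, N, D) feature map of layer l.
    """
    # Steps 2-3: compute a response map per layer with the learned filters.
    responses = []
    for l in layers:                              # coarse (conv5) to fine (conv3)
        z = window_features(extract_features(frame, pos, l))
        A, B, y = models[l]
        responses.append(response_map(A, B, z))
    # Step 4: coarse-to-fine translation estimation via Eq. (5).
    m_hat, n_hat = coarse_to_fine(responses, gammas, r)
    pos = to_image_coords(pos, m_hat, n_hat)      # assumed helper: map to pixels
    # Steps 5-6: re-extract features at the new position and update the filters.
    for l in layers:
        x = window_features(extract_features(frame, pos, l))
        A, B, y = models[l]
        A, B = update_filter(A, B, x, y)
        models[l] = (A, B, y)
    return pos
```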

6. Experimental Validations

We evaluate the proposed method on a large benchmark dataset [32] containing 100 videos, with comparisons to state-of-the-art methods. For completeness, we also report the results on the benchmark dataset [31] with 50 videos (a subset of [32]). We quantitatively evaluate trackers using distance precision rate, overlap ratio, and center location error. We follow the protocol in [32] and use the same parameter values for all the sequences and all sensitivity analyses. More results can be found in the supplementary material. We implement our tracker in MATLAB on an Intel i7-4770 3.40 GHz CPU with 32 GB RAM, and use the MatConvNet toolbox [28], where the computation of forward propagation on CNNs is transferred to a GeForce GTX Titan GPU. The source code is publicly available on our project page (https://sites.google.com/site/chaoma99/iccv15-tracking).

Quantitative Evaluation. We evaluate the proposed algorithm with comparisons to 12 state-of-the-art trackers. These trackers can be broadly categorized into three classes: (i) the deep learning tracker DLT [30]; (ii) correlation filter trackers, including CSK [16], STC [35], and KCF [17]; and (iii) representative tracking algorithms using single or multiple online classifiers, including the MIL [1], Struck [13], CT [36], LSHT [15], TLD [19], SCM [38], MEEM [34], and TGPR [10] methods.

Figure 5 shows the results under the one-pass evaluation (OPE), temporal robustness evaluation (TRE) and spatial robustness evaluation (SRE) using the distance precision rate and overlap success rate. Additional comparisons of OPE, SRE and TRE on the first 50 sequences can be found in the supplementary material.

Table 1. Comparisons with state-of-the-art trackers on the first 50 (I) [31] and entire 100 (II) [32] benchmark sequences. Our approach performs favorably against existing methods in distance precision (DP) rate at a threshold of 20 pixels, overlap success (OS) rate at an overlap threshold of 0.5, and center location error (CLE). Values are reported as I / II.

Tracker       DP rate (%)   OS rate (%)   CLE (pixel)   Speed (FPS)
Ours          89.1 / 83.7   74.0 / 65.5   15.7 / 22.8   11.0 / 10.4
DLT [30]      54.8 / 52.6   47.8 / 43.0   65.2 / 66.5   8.59 / 8.43
KCF [17]      74.1 / 69.2   62.2 / 54.8   35.5 / 45.0   245  / 243
STC [35]      54.7 / 50.7   36.5 / 31.4   80.5 / 86.2   687  / 653
Struck [13]   65.6 / 63.5   55.9 / 51.6   50.6 / 47.1   10.0 / 9.84
SCM [38]      64.9 / 57.2   61.6 / 51.2   54.1 / 61.6   0.37 / 0.36
CT [36]       40.6 / 35.9   34.1 / 27.8   78.9 / 80.1   38.8 / 44.4
LSHT [15]     56.1 / 49.7   45.7 / 38.8   55.7 / 68.2   39.6 / 39.9
CSK [16]      54.5 / 51.6   44.3 / 41.3   88.8 / 305    269  / 248
MIL [1]       47.5 / 43.9   37.3 / 33.1   62.3 / 72.1   28.1 / 28.0
TLD [19]      60.8 / 59.2   52.1 / 49.7   48.1 / 60.0   21.7 / 23.3
MEEM [34]     83.0 / 78.1   69.6 / 62.2   20.9 / 27.7   20.8 / 20.8
TGPR [10]     70.5 / 64.3   62.8 / 53.5   51.3 / 55.5   0.66 / 0.64

Figure 5. Distance precision and overlap success plots over 100 benchmark sequences using one-pass evaluation (OPE), temporal robustness evaluation (TRE) and spatial robustness evaluation (SRE). The legend of distance precision contains threshold scores at 20 pixels, while the legend of overlap success contains the area-under-the-curve score for each tracker. Our proposed algorithm performs favorably against the state-of-the-art trackers.

Overall, the proposed algorithm performs favorably against the state-of-the-art methods in all three metrics: OPE, TRE and SRE. We present the quantitative comparisons of distance precision rate at 20 pixels, overlap success rate at 0.5, center location errors, and tracking speed in Table 1. We report both the results on the first 50 sequences (Benchmark I) [31] and on all 100 sequences (Benchmark II) [32]. Table 1 shows that our algorithm performs well against state-of-the-art trackers in distance precision (DP) rate, overlap success (OS) rate and center location error (CLE). Notice that with the entire 100 sequences, Benchmark II is more challenging, and all the compared trackers perform worse than on Benchmark I. Among the state-of-the-art trackers, the MEEM method [34] achieves the second best results. The proposed method achieves a lower CLE of 22.8 pixels over the 100 video sequences, compared to the second best result of 27.7 pixels from the MEEM tracker. Our tracker runs at around 10 frames per second. The main computational load of our tracker is the forward propagation process to extract features (around 45% of the computing time for each frame).

Attribute-based Evaluation. We further analyze the tracker performance under different video attributes (e.g., background clutter, occlusion, fast motion) annotated in the benchmark [32]. Figure 6 shows the OPE for eight main video attributes. From Figure 6, we have the following observations. First, our method is effective in handling background clutter, which can be explained by the use of features with semantics and spatial details from the hierarchical layers of CNNs. In contrast, the DLT method pre-trains the network with an unsupervised model and only uses the output of the last layer of the trained neural network as features. This suggests that CNN features (e.g., VGG-Net) learned with category-level supervision are more effective in discriminating targets from the background. Second, our method performs well in the presence of scale variations, as the last layer of the pre-trained model retains semantic information insensitive to scale changes. Third, our method does not perform as well in the presence of occlusion and object deformation. This can be attributed to the holistic feature representation used in our model. Re-detection modules or part-based models will be considered in our future work.

Figure 6. Distance precision plots for eight main video attributes, such as illumination variation, occlusion, in-plane rotation, and low resolution. The legend contains the scores at a threshold of 20 pixels for each tracker. Our proposed algorithm performs favorably against the state-of-the-art trackers on these eight challenging attributes.

Figure 7. Performance evaluation of using different convolutional layers as features. Each single layer (c5, c4 and c3), the combination of the conv5-4 and conv4-4 layers (c5-c4), and the concatenation of three layers (c543) are evaluated using both VGG-Net and AlexNet.

with 100sequences.We ?rst test on each single layer (c5,c4and c3),and then perform coarse-to-?ne search on the ?fth and fourth layers (c5-c4).We also concatenate these three layers together (c543)as the hypercolumns used in [14].However,such concatenation breaks down the hier-archies over CNN layers and thus does not perform well for visual tracking.In addition,we test the features extracted from the AlexNet [20]with a same scheme.Figure 7shows the top 10performing methods with different features us-ing OPE,where the values at the legends with DP is based on the threshold of 20pixels while values at the legend of OS is based on area under the curve (AUC).Note that the features extracted from the VGG-Net are more effective than the AlexNet for tracking because strengthened seman-tic with deeper architecture is more insensitive to signi?cant appearance change.In addition,the tracking performance is improved with hierarchically inference on the translation

cues using multiple CNN layers.

Qualitative Evaluation. Figure 8 shows some tracking results of the top performing tracking methods: MEEM [34], KCF [17], DLT [30], Struck [13], and the proposed algorithm on 12 challenging sequences. The MEEM tracker performs well in sequences with deformation, rotation and occlusion (Basketball, Bolt, Jogging-1, and Skiing), but fails when background clutter and fast motion occur (Board, Soccer, Diving, MotorRolling, and Human9), as the quantized color channel features are less effective in handling cluttered backgrounds. We also note that the MEEM tracker drifts quickly in the Freeman4 sequence, as it uses only luminance intensity features. The KCF tracker learns a kernelized correlation filter with a Gaussian kernel over HOG features. It performs well in sequences with partial deformation and fast motion (Basketball, Bolt), but drifts when target objects undergo heavy occlusion (Jogging-1) and rotation (MotorRolling and Skiing). The DLT method does not fully exploit the semantic and fine-grained information as we do, and thus fails to track the targets in the selected challenging sequences. The Struck method does not perform well in sequences with deformation, background clutter and rotation (Basketball, Bolt, MotorRolling and Skiing), and heavy occlusion (Jogging-1). Although the use of structured outputs effectively alleviates the sampling ambiguity issues, the representation with hand-crafted features is not effective in accounting for large appearance changes.

The reasons that the proposed algorithm performs well can be explained by two main aspects. First, the visual representation using hierarchical convolutional features learned from a large-scale dataset is more effective than conventional hand-crafted features.

Figure 8. Qualitative evaluation of the proposed algorithm and the MEEM [34], KCF [17], DLT [30], and Struck [13] methods on twelve challenging sequences (from left to right and top to bottom: Basketball, Board, Bolt, Diving, Jogging-1, Human9, Freeman4, Matrix, MotorRolling, Skating2-1, Skiing and Soccer, respectively).

With CNN features from multiple levels, our representation contains both category-level semantics and fine-grained details, which account for appearance changes caused by deformations, rotations and background clutter (Board, Soccer, Diving, Skiing, and Human9). It is worth mentioning that for the most challenging MotorRolling sequence, none of the 12 state-of-the-art methods is able to track the target well, whereas our method achieves a distance precision rate of 94.5%. Second, the linear correlation filters trained on convolutional features are updated properly to account for appearance variations.

Failure Cases. We show a few failure cases in Figure 9. For the Girl2 and Lemming sequences, when long-term occlusions occur, the proposed tracker fails to follow the targets, as the proposed method is not equipped with a re-detection module as opposed to the TLD and MEEM methods. An alternative implementation with a conservative update of the correlation filters using (6) succeeds in following the targets. For the Singer2 sequence, it is not effective to use semantic features to differentiate the dark foreground from the bright background. In such cases, the use of features extracted from the first layer of CNNs alone is able to track the target well, as fine-grained spatial details weigh more in this sequence.

Figure 9. Failure cases (Girl2, Lemming and Singer2). Red boxes show our results and the green ones are ground truth.

7. Conclusions

In this paper, we propose an effective algorithm for visual object tracking by exploiting the rich feature hierarchies of CNNs learned from a large-scale dataset. The last convolutional layers of CNNs retain the semantics of target objects, which are robust to significant appearance variations. The early convolutional layers encode more fine-grained spatial details, which are useful for precise localization. Both features with semantics and fine-grained details are simultaneously exploited for visual tracking. We train a linear correlation filter on each convolutional layer and infer the target position with a coarse-to-fine searching approach. Extensive experimental results show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of accuracy and robustness.

Acknowledgment. C. Ma and X. Yang are supported in part by NSFC Grants (61527804, 61025005, 61129001, and 61221001), STCSM Grants (14XD1402100, and 135********), and 111 Program Grant (B07022). M.-H. Yang is supported in part by NSF CAREER Grant (1149783) and NSF IIS Grant (1152576).

References

[1] B. Babenko, M.-H. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. TPAMI, 33(8), 2011.
[2] V. N. Boddeti, T. Kanade, and B. V. K. V. Kumar. Correlation filters for object alignment. In CVPR, 2013.
[3] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In CVPR, 2010.
[4] A. C. Bovik, M. Clark, and W. S. Geisler. Multichannel texture analysis using localized spatial filters. TPAMI, 12(1):55-73, 1990.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[6] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In BMVC, 2014.
[7] M. Danelljan, F. S. Khan, M. Felsberg, and J. van de Weijer. Adaptive color attributes for real-time visual tracking. In CVPR, 2014.
[8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] H. K. Galoogahi, T. Sim, and S. Lucey. Multi-channel correlation filters. In ICCV, 2013.
[10] J. Gao, H. Ling, W. Hu, and J. Xing. Transfer learning based visual tracking with Gaussian processes regression. In ECCV, 2014.
[11] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[12] H. Grabner, C. Leistner, and H. Bischof. Semi-supervised on-line boosting for robust tracking. In ECCV, 2008.
[13] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured output tracking with kernels. In ICCV, 2011.
[14] B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[15] S. He, Q. Yang, R. W. H. Lau, J. Wang, and M.-H. Yang. Visual tracking via locality sensitive histograms. In CVPR, 2013.
[16] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, 2012.
[17] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. TPAMI, 37(3):583-596, 2015.
[18] S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. In ICML, 2015.
[19] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. TPAMI, 34(7):1409-1422, 2012.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[21] H. Li, Y. Li, and F. Porikli. DeepTrack: Learning discriminative feature representations by convolutional neural networks for visual tracking. In BMVC, 2014.
[22] X. Li, W. Hu, C. Shen, Z. Zhang, A. R. Dick, and A. van den Hengel. A survey of appearance models in visual object tracking. ACM TIST, 4(4):58, 2013.
[23] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI, 1981.
[24] D. A. Ross, J. Lim, R. Lin, and M. Yang. Incremental learning for robust visual tracking. IJCV, 77(1-3):125-141, 2008.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[26] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. TPAMI, 36(7):1442-1468, 2014.
[27] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. TPAMI, 30(11):1958-1970, 2008.
[28] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.
[29] L. Wang, T. Liu, G. Wang, K. L. Chan, and Q. Yang. Video tracking using learned hierarchical features. TIP, 24(4):1424-1435, 2015.
[30] N. Wang and D. Yeung. Learning a deep compact image representation for visual tracking. In NIPS, 2013.
[31] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In CVPR, 2013.
[32] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. TPAMI, PrePrints.
[33] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys, 38(4), 2006.
[34] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV, 2014.
[35] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M.-H. Yang. Fast visual tracking via dense spatio-temporal context learning. In ECCV, 2014.
[36] K. Zhang, L. Zhang, and M.-H. Yang. Real-time compressive tracking. In ECCV, 2012.
[37] Q. Zhao, Z. Yang, and H. Tao. Differential earth mover's distance with its applications to visual tracking. TPAMI, 32(2):274-287, 2010.
[38] W. Zhong, H. Lu, and M.-H. Yang. Robust object tracking via sparse collaborative appearance model. TIP, 23(5):2356-2368, 2014.
[39] W. Y. Zou, A. Y. Ng, S. Zhu, and K. Yu. Deep learning of invariant features via simulated fixations in video. In NIPS, 2012.
