Online tracking based on efficient transductive learning with sample matching costs

doi:10.1016/j.neucom.2015.10.046

Neurocomputing

Volume 175, Part A, 29 January 2016, Pages 166-176

https://doi.org/10.1016/j.neucom.2015.10.046

Abstract

Visual tracking has long been a popular topic in computer vision. In recent decades, many challenging problems in object tracking have been effectively resolved by learning-based tracking strategies. A number of investigations in learning theory have found that when labeled samples are limited, learning performance can be significantly improved by exploiting unlabeled ones. Therefore, one of the most important issues for semi-supervised learning is how to assign labels to the unlabeled samples, which is also the principal focus of transductive learning. Unfortunately, given the efficiency requirement of online tracking, the optimization scheme employed by traditional transductive learning is hard to apply to online tracking problems because of its large computational cost during sample labeling. In this paper, we propose an efficient transductive learning method for online tracking that utilizes the correspondences among the generated unlabeled and labeled samples. These variational correspondences are modeled by a matching cost function to achieve more efficient learning of representative separators. With a fixed budget for support vectors, the proposed learner is updated using a weighted accumulative average of model coefficients. We evaluated the proposed tracker on a benchmark database; the experimental results demonstrate outstanding performance in comparison with other state-of-the-art trackers.

Introduction

Tracking arbitrary objects in realistic scenarios is hard because of the large and unpredictable variance in object appearance. Motivated by solutions to recognition and classification [2], [3], [4], many conventional tracking approaches assumed that the target objects were known in advance and designed appearance models for offline learning using diverse prior knowledge [1], [5], [12]. This pre-trained tracking strategy has shown its effectiveness in a limited number of vision applications. Unfortunately, most realistic scenarios require objects to be specified at runtime, which means that the pre-trained model needs to adapt to appearance changes on the fly; as a result, pre-trained trackers often fail. To overcome the limitations of offline model adaptation, semi-supervised online tracking strategies have appeared more and more frequently in recent tracking studies [6], [22], [7].

One common way to conduct online tracking is to update the appearance model during tracking so that it remains suitable for distinguishing the target from the background. Because the learner has insufficient knowledge before learning starts, cursory constraints on the labeled and unlabeled training data might degrade performance when learning the classifiers [6]. Thus, to introduce more rational paradigms for classifier updating by label assignment, Kalal et al. proposed a P-N learning scheme and applied it to online tracking-by-learning [6]. P-N learning establishes the structure of the training samples by exploiting positive and negative constraints, which restrict the data labeling operation. Kalal's framework also helps to guide the design of more sophisticated structural constraints that can improve learning stability. However, its limitation resides in the use of inaccurate detections to balance the final output decision.

Tracking failure is still hard to avoid in many cases because of inaccurate estimation between unreliable video frames. Following the label assignment in transductive learning [8] and the idea of data selection in self-learning [9], Supancic and Ramanan proposed a self-paced learning approach named ‘SPLTT’ [7] that selects trustworthy frames as additional training samples. By retrospectively selecting and editing previous frames for learning, SPLTT is able to handle challenging situations including occlusion or absence and large scale changes of the object during long-term tracking. However, in some cases, handling historical information without considering a budget still leaves SPLTT suffering from accumulated noise.

Besides using constraints between consecutive frames, learning with structural constraints on samples has demonstrated promising performance for whole-object tracking in recent work [10], [11], [12]. Considering that the relative geometric relationships inside and between objects can be effectively represented and incorporated into classifier learning, Zhang et al. proposed the structure-preserving object tracker (SPOT) [11] within a pictorial-structures framework [10]. The SPOT tracker not only substantially improved tracking performance in multi-object scenarios, but also improved single-object tracking by incorporating additional object part detection into the tracking framework. To avoid depending on a heuristic intermediate step that produces labeled binary samples during classifier update, Hare et al. proposed a learning-based structured output tracking (Struck) strategy [22]. Their method avoids the intermediate classification step by explicitly allowing the output space to express the tracker's requirements. With kernelized structured online SVM learning, Struck achieves state-of-the-art tracking performance on many benchmark videos containing challenging scenarios [12], [34].

From the discussion of the state-of-the-art tracking strategies above, it can be summarized that a good tracking design needs to consider the following indispensable factors: (1) unlabeled data can play as essential a role in self-learning strategies as labeled data; (2) structural constraints on the learning samples heavily influence tracking performance in semi-supervised learning; (3) a reasonable budget scheme for historical learning results should benefit classifier updating. Therefore, in this study, inspired by the large-margin theory of semi-supervised learning with representative samples in [28], we introduce for the first time a practical and robust tracker that improves the online learning process with the factors discussed above. The proposed learning strategy employs pixel-wise and descriptor-wise variational correspondences between data samples for label assignment. Compared with sample labeling by intrinsic clustering convergence during online learning, using matching cost constraints has proven efficient and reliable. With a weighted accumulative average scheme for updating the coefficients of a fixed budget of support vectors, the proposed tracker is more robust in many challenging scenes, including rotation, intrinsic compression or stretching, and aspect ratio changes.
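The update rule mentioned at the end of the paragraph above can be illustrated with a minimal Python sketch. The exact weighting and budget-maintenance rules of the paper are not given in this excerpt, so everything here is an assumption for illustration only: the class name `BudgetedAverager`, the function `prune_to_budget`, the use of a weighted cumulative mean, and the rule of discarding the support vectors with the smallest coefficient magnitude are all hypothetical.

```python
import numpy as np


class BudgetedAverager:
    """Hypothetical helper: weighted cumulative average of the coefficients
    of a fixed number (budget) of support vectors."""

    def __init__(self, budget):
        self.budget = budget
        self.avg = np.zeros(budget)   # running weighted average
        self.total_weight = 0.0       # sum of all weights seen so far

    def update(self, coeffs, weight):
        """Fold a new coefficient vector into the average with the given weight."""
        coeffs = np.asarray(coeffs, dtype=float)
        if coeffs.shape != (self.budget,):
            raise ValueError("expected exactly `budget` coefficients")
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.total_weight += weight
        # Incremental form of (sum_i w_i * c_i) / (sum_i w_i).
        self.avg += (weight / self.total_weight) * (coeffs - self.avg)
        return self.avg.copy()


def prune_to_budget(coeffs, budget):
    """Assumed pruning rule: keep the indices of the `budget` coefficients
    with the largest magnitude (the paper's criterion may differ)."""
    coeffs = np.asarray(coeffs, dtype=float)
    if coeffs.size <= budget:
        return np.arange(coeffs.size)
    keep = np.argsort(-np.abs(coeffs))[:budget]
    return np.sort(keep)
```

In this sketch a frame judged more reliable would be given a larger `weight`, so that it contributes more to the averaged coefficients; how reliability is measured is left open here.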

The paper is organized as follows: Section 2 introduces the proposed learning strategy, Section 3 presents our online learning tracking framework, Section 4 reports experimental results and discussion, and Section 5 concludes the paper.

Section snippets

Notation definitions

At the beginning of this section, we list the notation and acronyms used in the following content for the reader's reference.

TSVMs - transductive support vector machines
$\tilde{L}$ - unlabeled data pairs
$\phi(x)$ - a feature mapping function
$C_{1}, C_{2}$ - the regularization parameters
$\tilde{h}(\|\cdot\|)$ - symmetric hinge loss for unlabeled data
$\eta$ - a large constant
$\{f_{k}\}_{k=1}^{K}$ - large margin low density separators
$E_{\Psi}$ - pixel-wise matching cost
$E_{int}^{\Psi}$ - deviations penalty between points
$E_{smooth}^{Ψ}$ - a

Proposed large margin learning with sample matching costs

To handle unpredictable appearance variation during online tracking, unlabeled data can be used effectively to improve learning performance when labeled data are limited, as verified by the transductive support vector machines (TSVMs) in [13]. The TSVM scheme learns a large-margin hyperplane from the labeled data while simultaneously keeping this hyperplane away from the unlabeled data. For arbitrary tracking in realistic scenarios, the
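As an illustration of the TSVM idea described in this snippet, the sketch below evaluates a standard TSVM-style objective: a margin regularizer, a hinge loss weighted by $C_{1}$ on labeled samples, and a symmetric hinge loss weighted by $C_{2}$ on unlabeled samples, which penalizes unlabeled points that fall inside the margin. It assumes a linear decision function (that is, $\phi(x) = x$), and the function names are hypothetical; the paper's own formulation with matching costs is not reproduced in this excerpt and may differ.

```python
import numpy as np


def hinge(z):
    """Standard hinge loss max(0, 1 - z), applied element-wise."""
    return np.maximum(0.0, 1.0 - z)


def tsvm_objective(w, b, X_lab, y_lab, X_unl, C1, C2):
    """TSVM-style objective for a linear separator f(x) = w.x + b (assumed form).

    Labeled samples pay hinge(y * f(x)); unlabeled samples pay the symmetric
    hinge hinge(|f(x)|), which is large when the hyperplane passes near them.
    """
    w = np.asarray(w, dtype=float)
    f_lab = np.asarray(X_lab, dtype=float) @ w + b
    f_unl = np.asarray(X_unl, dtype=float) @ w + b
    regularizer = 0.5 * float(w @ w)
    labeled_loss = C1 * hinge(np.asarray(y_lab, dtype=float) * f_lab).sum()
    unlabeled_loss = C2 * hinge(np.abs(f_unl)).sum()
    return regularizer + labeled_loss + unlabeled_loss


# Two labeled points outside the margin, one unlabeled point inside it.
value = tsvm_objective(
    w=[1.0, 0.0], b=0.0,
    X_lab=[[2.0, 0.0], [-2.0, 0.0]], y_lab=[1, -1],
    X_unl=[[0.5, 0.0], [3.0, 0.0]],
    C1=1.0, C2=0.5,
)
```

The symmetric hinge term is what makes the problem non-convex and costly to optimize, which is the efficiency concern raised in the abstract.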

Tracking framework

We build our tracking framework by referring to the tracking-learning-detection scheme, as shown in Fig. 1. In our framework, online tracking consists of an appearance model and a motion model. After the initial location of the target is specified, the motion model is used to generate the consistent feature (e.g., brightness, SIFT or SURF) of a pixel or patch between consecutive frames, which is regarded as the variational correspondence among the learning samples. It has been verified by

Experiment and discussion

The proposed tracker is evaluated on 20 benchmark videos from the tracking evaluation dataset [12], which contains various tracking targets (e.g., face, pedestrian, athlete, car, toy) and different challenging scenarios, such as scale and illumination change, occlusion, fast movement, out-of-plane rotation and background clutter; details are listed in Table 1. To demonstrate the tracking performance of the proposed tracker, we compared it with 11 representative

Conclusion

In this paper, an efficient learning strategy is proposed for online object tracking. By thoroughly investigating the variational sample correspondence for label assignment during tracking, the proposed learning strategy is able to substantially decrease the computational cost of generating the representative separators of traditional transductive learning. With a novel optional scheme to balance the decisions between tracking and detection, the proposed tracking is able to achieve more robust

Acknowledgments

This research is supported by the Research Fund for the Doctoral Program of Higher Education of China (20126102120055), the National Natural Science Foundation of China (61301194, 61571362, 61231016 and 61175018), and a foundation grant from NWPU (3102014JSJ0014).

Peng Zhang received the B.E. degree from Xi'an Jiaotong University, China, in 2001. He received his Ph.D. from Nanyang Technological University, Singapore, in 2011. He is now an associate professor in the School of Computer Science, Northwestern Polytechnical University, China. His current research interests include signal processing, multimedia security and pattern recognition. He has published more than 30 highly ranked international conference and journal papers. He is a member of ACM.

References (34)

A. Yilmaz et al., Object tracking: a survey, ACM Comput. Surv. (2006)
J.C. Niebles et al., Unsupervised learning of human action categories using spatial-temporal words, Int. J. Comput. Vis. (IJCV) (2008)
P. Viola et al., Robust real-time face detection, Int. J. Comput. Vis. (IJCV) (2004)
F.F. Li, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: IEEE International...
S. Gu, Y. Zheng, C. Tomasi, Linear time offline tracking and lower envelope algorithms, in: IEEE International...
Z. Kalal et al., Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) (2010)
J.S. Supancic III, D. Ramanan, Self-paced learning for long-term tracking, in: IEEE International Conference on...
V. Vapnik, The Nature of Statistical Learning Theory,...
M.P. Kumar, B. Packer, D. Koller, Self-paced learning for latent variable models, in: J.D. Lafferty, C.K.I....
P. Felzenszwalb et al., Object detection with discriminatively trained part based models, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) (2010)
L. Zhang, L. Maaten, Structure preserving object tracker, in: IEEE International Conference on Computer Vision and...
Y. Wu, J. Lim, M.-H. Yang, Online object tracking: a benchmark, in: IEEE International Conference on Computer Vision and...
R. Collobert, F. Sinz, J. Weston, L. Bottou, Large scale transductive SVMs, J. Mach. Learn. Res. (JMLR), 2006, pp....
T. Brox et al., Large displacement optical flow: descriptor matching in variational motion estimation, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) (2011)
J.B. Zin, R. Dupont, A. Bartoli, A general dense image matching framework combining direct and feature-based costs, in:...
J. Kim, C. Liu, F. Sha, K. Grauman, Deformable spatial pyramid matching for fast dense correspondence, in: IEEE...
K. Zhang, L. Zhang, M.-H. Yang, Real-time compressive tracking, European Conference on Computer Vision (ECCV),...

Tao Zhuo received the B.S. degree in Computer Science and Technology from Xi'an Shiyou University, Xi'an, China, in 2009, and the Master's degree in Computer Science and Technology from Northwestern Polytechnical University, Xi'an, China, in 2012. Currently, he is a Ph.D. candidate in the School of Computer Science, Northwestern Polytechnical University, and is also working as an intern Ph.D. student at the National University of Singapore (NUS). His current research interests include visual object tracking, machine learning and computer vision.

Yanning Zhang is currently a Professor in the School of Computer Science, Northwestern Polytechnical University, China. She received her Ph.D. from Northwestern Polytechnical University, China, in 1996. Her current research interests are in signal processing, multimedia and computer vision. She has been an active member of the technical program committees of several international conferences and a reviewer for several reputed journals and conferences, such as IEEE Transactions on Systems, Man, and Cybernetics (T-SMC) and Pattern Recognition Letters. She has also been the organization chair of the Ninth Asian Conference on Computer Vision (ACCV09). She is currently a Senior Member of IEEE.

Dapeng Tao received a B.E. degree from Northwestern Polytechnical University and a Ph.D. degree from South China University of Technology, respectively. He is currently with School of Information Science and Engineering, Yunnan University, Kunming, China, as an engineer. He has authored and co-authored more than 30 scientific articles. He has served more than 10 international journals including IEEE TNNLS, IEEE TMM, IEEE SPL, and PLOS-ONE. Over the past years, his research interests include machine learning, computer vision and cloud computing.

Jun Cheng received the B.Eng. and M.Eng. degrees from the University of Science and Technology of China, in 1999 and 2002, and the Ph.D. degree from the Chinese University of Hong Kong, in 2006. He is currently with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, as a Professor and the Director of the Laboratory for Human Machine Control. His current research interests include computer vision, robotics, machine intelligence, and control.
