Elsevier

Neurocomputing

Volume 175, Part A, 29 January 2016, Pages 166-176

Online tracking based on efficient transductive learning with sample matching costs

https://doi.org/10.1016/j.neucom.2015.10.046

Abstract

Visual tracking has long been a popular topic in computer vision. In recent decades, many challenging problems in object tracking have been effectively resolved by learning-based tracking strategies. A number of investigations in learning theory have found that when labeled samples are limited, learning performance can be substantially improved by exploiting unlabeled ones. One of the most important issues in semi-supervised learning is therefore how to assign labels to the unlabeled samples, which is also the principal focus of transductive learning. Unfortunately, given the efficiency requirement of online tracking, the optimization scheme employed by traditional transductive learning is hard to apply to online tracking problems because of its large computational cost during sample labeling. In this paper, we propose an efficient transductive learning method for online tracking that utilizes the correspondences among the generated unlabeled and labeled samples. These variational correspondences are modeled by a matching cost function to achieve more efficient learning of representative separators. With a fixed budget for support vectors, the proposed learner is updated using a weighted accumulative average of the model coefficients. We evaluated the proposed tracker on a benchmark database, and the experimental results demonstrate outstanding performance in comparison with other state-of-the-art trackers.

Introduction

Tracking an arbitrary object in realistic scenarios is hard because of the large and unpredictable variance in the object's appearance. Motivated by solutions to recognition/classification [2], [3], [4], many conventional tracking approaches assumed the target objects to be known in advance and designed appearance models for offline learning using diverse prior knowledge [1], [5], [12]. This pre-trained tracking strategy has shown its effectiveness in a limited number of vision applications. Unfortunately, most realistic scenarios require objects to be specified at runtime, which means the pre-trained model needs to adapt to appearance changes on the fly; as a result, pre-trained trackers often fail. To overcome the limitations of offline model adaptation, semi-supervised online tracking strategies have appeared more and more frequently in recent tracking studies [6], [22], [7].

One common way to conduct online tracking is to update the appearance model during tracking so that it continues to distinguish the target from the background. Because the learner has insufficient knowledge before learning starts, cursory constraints on the labeled/unlabeled training data might degrade performance when learning the classifiers [6]. Thus, in order to introduce more rational paradigms for classifier updating by label assignment, Kalal et al. proposed a P-N learning scheme and applied it to online tracking-by-learning [6]. P-N learning establishes the structure of the training samples by exploiting positive and negative constraints, which restrict the data labeling operation. Kalal's framework also helps to guide the design of more sophisticated structural constraints that can improve learning stability. However, its limitation resides in its use of inaccurate detections to balance the final output decision.

Tracking failure is in many cases still hard to avoid because of inaccurate estimation between unreliable video frames. Following the label assignment in transductive learning [8] and the idea of data selection in self-learning [9], Supancic and Ramanan proposed a self-paced learning approach named ‘SPLTT’ to select trustworthy frames using additional training samples. By retrospectively selecting and editing previous frames for learning, SPLTT is able to handle challenging situations including occlusion/absence and large scale changes of the object during long-term tracking. However, in some cases, handling historical information without considering a budget still leaves SPLTT suffering from accumulated noise.

Besides using constraints between consecutive frames, learning via structural constraints on samples has demonstrated promising performance for whole-object tracking in recent work [10], [11], [12]. Considering that the relative geometric relationships inside and between objects can be effectively represented and incorporated into classifier learning, Zhang et al. proposed the structure-preserving object tracker (SPOT) [11] within a pictorial-structures framework [10]. The SPOT tracker not only substantially improved tracking performance in multi-object scenarios, but also improved single-object tracking by incorporating additional object part detection into the tracking framework. To avoid depending on a heuristic intermediate step for producing labeled binary samples during classifier update, Hare et al. proposed a structured output tracking (Struck) strategy, which also relies on learning [22]. Their method avoids the intermediate classification step by explicitly allowing the output space to express the tracker's requirements. With kernelized structured online SVM learning, Struck achieves state-of-the-art tracking performance on many benchmark videos containing challenging scenarios [12], [34].

From the discussion of the state-of-the-art tracking strategies above, it can be summarized that a good tracking design needs to consider the following indispensable factors: (1) unlabeled data can play as essential a role in self-learning strategies as labeled data; (2) structural constraints on learning samples heavily influence tracking performance in semi-supervised learning; (3) a reasonable budget scheme for historical learning results should benefit classifier updating. Therefore, in this study, inspired by the large-margin theory of semi-supervised learning with representative samples in [28], we introduce, for the first time, a practical and robust tracker that improves the online learning process with the factors discussed above. The proposed learning strategy employs pixel-wise and descriptor-wise variational correspondences between data samples for label assignment. Compared with sample labeling by intrinsic clustering convergence during online learning, using the matching cost constraints has proved efficient and reliable. With a weighted accumulative average scheme that updates the coefficients of a fixed budget of support vectors, the proposed tracker is more robust in many challenging scenes, including rotation, intrinsic compression/stretching, and aspect ratio changes.
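The idea of updating a fixed budget of support-vector coefficients by a weighted accumulative average can be illustrated with a minimal Python sketch. The exact update rule is not given in this excerpt, so the blending weight `rho`, the magnitude-based pruning rule, and the function name `update_coefficients` are all assumptions made purely for illustration.

```python
def update_coefficients(average, current, rho, budget):
    """Blend stored and newly learned coefficients, then enforce a budget.

    average, current: dicts mapping a support-vector id to its coefficient.
    rho: assumed blending weight in [0, 1] given to the newest coefficients.
    budget: assumed maximum number of support vectors to keep.
    """
    merged = {}
    for key in set(average) | set(current):
        old = average.get(key, 0.0)   # absent vectors contribute zero
        new = current.get(key, 0.0)
        merged[key] = (1.0 - rho) * old + rho * new

    # Assumed pruning rule: keep the `budget` coefficients largest in magnitude.
    kept = sorted(merged, key=lambda k: abs(merged[k]), reverse=True)[:budget]
    return {key: merged[key] for key in kept}


if __name__ == "__main__":
    stored = {"a": 1.0, "b": 0.2}
    latest = {"a": 0.0, "c": 1.0}
    print(update_coefficients(stored, latest, rho=0.5, budget=2))
```

In this toy run, vector "b" ends up with the smallest blended coefficient and is dropped, so the model never holds more than `budget` support vectors regardless of how many frames have been processed.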

The paper is organized as follows: Section 2 introduces the proposed learning strategy, and Section 3 presents our online learning tracking framework. Section 4 presents experimental results and discussion, and Section 5 concludes the paper.

Section snippets

Notation definitions

At the beginning of this section, we first list all the notations and acronyms used in the following content for the reader's easy reference.

TSVMs - transductive support vector machines
$\tilde{L}$ - unlabeled data pairs
$\phi(x)$ - a feature mapping function
$C_1, C_2$ - the regularization parameters
$\tilde{h}(|\cdot|)$ - symmetric hinge loss for unlabeled data
$\eta$ - a large constant
$\{f_k\}_{k=1}^{K}$ - large margin low density separators
EΨ - pixel-wise matching cost
EintΨ - deviations penalty between points
EsmoothΨ - a

Proposed large margin learning with sample matching costs

To handle unpredictable appearance variation during online tracking, unlabeled data can be used effectively to improve learning performance when labeled data are limited, as has been verified by transductive support vector machines (TSVMs) in [13]. TSVMs learn a large-margin hyperplane from the labeled data while simultaneously keeping this hyperplane away from the unlabeled data. For arbitrary tracking in realistic scenarios, the
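As a rough illustration of the TSVM idea described in this section, the Python sketch below evaluates a linear TSVM-style objective: a margin term, a hinge loss on labeled samples, and a symmetric hinge loss on unlabeled samples that penalizes points lying close to the hyperplane. It follows the standard TSVM formulation rather than the specific formulation of this paper. The linear decision function with an identity feature map and the default values of `c1` and `c2` are assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def hinge(margin):
    """Hinge loss max(0, 1 - y * f(x)) for a labeled sample."""
    return max(0.0, 1.0 - margin)


def symmetric_hinge(score):
    """Symmetric hinge loss max(0, 1 - |f(x)|) for an unlabeled sample."""
    return max(0.0, 1.0 - abs(score))


def tsvm_objective(w, b, labeled, unlabeled, c1=1.0, c2=0.5):
    """Evaluate 0.5*||w||^2 + c1 * labeled loss + c2 * unlabeled loss.

    labeled: list of (x, y) pairs with y in {-1, +1}.
    unlabeled: list of feature vectors x.
    Assumes f(x) = w . x + b, i.e. an identity feature map.
    """
    def f(x):
        return dot(w, x) + b

    margin_term = 0.5 * dot(w, w)
    labeled_term = c1 * sum(hinge(y * f(x)) for x, y in labeled)
    unlabeled_term = c2 * sum(symmetric_hinge(f(x)) for x in unlabeled)
    return margin_term + labeled_term + unlabeled_term


if __name__ == "__main__":
    labeled = [([2.0, 0.0], 1), ([-2.0, 0.0], -1)]
    unlabeled = [[0.5, 0.0], [3.0, 0.0]]
    print(tsvm_objective([1.0, 0.0], 0.0, labeled, unlabeled))
```

Only the unlabeled point at distance 0.5 from the hyperplane contributes a loss here, which is how the objective pushes the separator toward low-density regions of the unlabeled data.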

Tracking framework

We build our tracking framework with reference to the tracking-learning-detection scheme, as shown in Fig. 1. In our framework, online tracking consists of an appearance model and a motion model. After the initial location of the target is specified, the motion model is used to generate the consistent feature (e.g., brightness, SIFT or SURF) of a pixel or patch between consecutive frames, which is regarded as the variational correspondence among the learning samples. It has been verified by
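To make the notion of correspondence-based label assignment more concrete, the Python sketch below assigns a label to an unlabeled patch by finding the labeled patch with the lowest brightness matching cost. This is a simplified stand-in, not the paper's cost function: it uses only a mean absolute brightness difference and omits any deviation or smoothness terms. The function names and the `max_cost` rejection threshold are hypothetical.

```python
def matching_cost(patch_a, patch_b):
    """Mean absolute brightness difference between two equal-size patches.

    Each patch is a list of rows, each row a list of pixel intensities.
    """
    total = 0.0
    count = 0
    for row_a, row_b in zip(patch_a, patch_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count


def assign_label(unlabeled_patch, labeled_samples, max_cost):
    """Return the label of the best-matching labeled patch, or None.

    labeled_samples: list of (patch, label) pairs.
    max_cost: assumed threshold above which no label is assigned.
    """
    best_label = None
    best_cost = float("inf")
    for patch, label in labeled_samples:
        cost = matching_cost(unlabeled_patch, patch)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label if best_cost <= max_cost else None


if __name__ == "__main__":
    target = [[10, 10], [10, 10]]
    background = [[200, 200], [200, 200]]
    labeled = [(target, 1), (background, -1)]
    print(assign_label([[12, 9], [10, 11]], labeled, max_cost=20))
```

Because each unlabeled sample needs only one pass over the labeled set, this kind of assignment avoids the iterative label optimization of a traditional TSVM, which is the efficiency argument made in the text.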

Experiment and discussion

The proposed tracker is evaluated on 20 benchmark videos from the tracking evaluation dataset [12], which contains various tracking targets (e.g., face, pedestrian, athlete, car, toy) and different challenging scenarios, such as scale and illumination change, occlusion, fast movement, out-of-plane rotation and background clutter; the details are listed in Table 1. To demonstrate the tracking performance of the proposed tracker, we compared it with 11 representative

Conclusion

In this paper, an efficient learning strategy is proposed for online object tracking. By thoroughly investigating the variational sample correspondence for label assignment during tracking, the proposed learning strategy is able to greatly decrease the computational cost of generating the representative separators of traditional transductive learning. With a novel optional scheme to balance the decisions between tracking and detection, the proposed tracker is able to achieve more robust

Acknowledgments

This research is supported by the Research Fund for the Doctoral Program of Higher Education of China (20126102120055), the National Natural Science Foundation of China (61301194, 61571362, 61231016 and 61175018), and a foundation grant from NWPU (3102014JSJ0014).

Peng Zhang received the B.E. degree from Xi'an Jiaotong University, China, in 2001. He received his Ph.D. from Nanyang Technological University, Singapore, in 2011. He is now an associate professor in the School of Computer Science, Northwestern Polytechnical University, China. His current research interests include signal processing, multimedia security and pattern recognition. He has published more than 30 highly ranked international conference and journal papers. He is a member of ACM.

References (34)

  • A. Yilmaz et al., Object tracking: a survey, ACM Comput. Surv. (2006)
  • J.C. Niebles et al., Unsupervised learning of human action categories using spatial-temporal words, Int. J. Comput. Vis. (IJCV) (2008)
  • P. Viola et al., Robust real-time face detection, Int. J. Comput. Vis. (IJCV) (2004)
  • F.F. Li, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: IEEE International...
  • S. Gu, Y. Zheng, C. Tomasi, Linear time offline tracking and lower envelope algorithms, in: IEEE International...
  • Z. Kalal et al., Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) (2010)
  • J.S. Supancic III, D. Ramanan, Self-paced learning for long-term tracking, in: IEEE International Conference on...
  • V. Vapnik, The Nature of Statistical Learning Theory,...
  • M.P. Kumar, B. Packer, D. Koller, Self-paced learning for latent variable models, in: J.D. Lafferty, C.K.I....
  • P. Felzenszwalb et al., Object detection with discriminatively trained part based models, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) (2010)
  • L. Zhang, L. Maaten, Structure preserving object tracker, in: IEEE International Conference on Computer Vision and...
  • Y. Wu, J. Lim, M.-H. Yang, Online object tracking: a benchmark, in: IEEE International Conference on Computer Vision and...
  • R. Collobert, F. Sinz, J. Weston, L. Bottou, Large scale transductive SVMs, J. Mach. Learn. Res. (JMLR), 2006, pp....
  • T. Brox et al., Large displacement optical flow: descriptor matching in variational motion estimation, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) (2011)
  • J.B. Zin, R. Dupont, A. Bartoli, A general dense image matching framework combining direct and feature-based costs, in:...
  • J. Kim, C. Liu, F. Sha, K. Grauman, Deformable spatial pyramid matching for fast dense correspondence, in: IEEE...
  • K. Zhang, L. Zhang, M.-H. Yang, Real-time compressive tracking, European Conference on Computer Vision (ECCV),...

    Tao Zhuo received the B.S. degree in Computer Science and Technology from Xi'an Shiyou University, Xi'an, China, in 2009, and the Master's degree in Computer Science and Technology from Northwestern Polytechnical University, Xi'an, China, in 2012. He is currently a Ph.D. candidate in the School of Computer Science, Northwestern Polytechnical University, and is also working as an intern Ph.D. student at the National University of Singapore (NUS). His current research interests include visual object tracking, machine learning and computer vision.

    Yanning Zhang is currently a Professor in the School of Computer Science, Northwestern Polytechnical University, China. She received her Ph.D. from Northwestern Polytechnical University, China, in 1996. Her current research interests are in signal processing, multimedia and computer vision. She has been an active member of the technical program committees of several international conferences and a reviewer for several reputed journals and conferences, such as IEEE Transactions on Systems, Man, and Cybernetics (T-SMC) and Pattern Recognition Letters. She was also the organization chair of the Ninth Asian Conference on Computer Vision (ACCV09). She is currently a Senior Member of IEEE.

    Dapeng Tao received a B.E. degree from Northwestern Polytechnical University and a Ph.D. degree from South China University of Technology. He is currently with the School of Information Science and Engineering, Yunnan University, Kunming, China, as an engineer. He has authored and co-authored more than 30 scientific articles. He has served more than 10 international journals, including IEEE TNNLS, IEEE TMM, IEEE SPL, and PLOS ONE. His research interests in recent years include machine learning, computer vision and cloud computing.

    Jun Cheng received the B.Eng. and M.Eng. degrees from the University of Science and Technology of China, in 1999 and 2002, and the Ph.D. degree from the Chinese University of Hong Kong, in 2006. He is currently with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, as a Professor and the Director of the Laboratory for Human Machine Control. His current research interests include computer vision, robotics, machine intelligence, and control.
