Sunday 30 November 2014

Video Analytics – I

    Today every nook and corner of public places is under surveillance cameras. They work round the clock, 365 days a year, and thus a huge amount of video data is generated. Monitoring a video feed for more than 20 minutes is tedious for human beings, and security officers often fail to detect abnormal activities. Thus there is a need for “machine assistance” for security officers. Watching video over the Internet has become the norm. Popular video servers like YouTube, Dailymotion and Metacafe provide video free of charge; Netflix is a subscription-based video service. Sixty hours of video are uploaded every minute to YouTube alone [1], so one can imagine the quantum of video content available on the Internet. In the years to come, more than half of Internet traffic will be due to video. We face the problem of choosing, or cherry-picking, the 'right' video from the huge pile scattered across the Internet. Here too we require some form of “machine assistance” for the ordinary viewer.

Video is made up of consecutive sequences of frames or images. Each image contains a large amount of pixel information, yet images offer very little prior structure to work with [2]. Earlier video databases were small, and manual annotation was a possible solution; today it is no longer feasible. Present-day computing power can manage huge volumes of data, but automatic analysis of video requires artificial intelligence, and video analytics is a baby step in that direction. Video analytics deals with the extraction of information from video with the aid of machine assistance. Video processing, in contrast, means performing some image processing (like resampling or colour correction) on the video content. Most of the time the extracted information is overlaid on the video for better human interpretation. Big data analytics is the latest buzzword in the technical world, and video analytics is considered a subset of big data analytics.
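
To make the frame-level view concrete, here is a minimal sketch that walks through a video frame by frame using OpenCV's Python bindings (the file name 'sample.mp4' is a placeholder of mine, not a file from this post):

    import cv2

    # Open a video file; a video is nothing but a sequence of images (frames).
    cap = cv2.VideoCapture('sample.mp4')

    frame_count = 0
    while True:
        ret, frame = cap.read()   # ret becomes False when the video ends
        if not ret:
            break
        frame_count += 1
        # Each frame is an ordinary image: a height x width x 3 (BGR) pixel array.
        height, width, channels = frame.shape

    cap.release()
    print('Total frames read:', frame_count)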


I downloaded a survey article from the Internet, published in IEEE Transactions on Systems, Man, and Cybernetics in 2011. It spans more than 20 pages and has nearly 300 references [2]. Its length need not be an impediment for novices like us. It deals with shot boundary detection, key frame extraction and scene segmentation. Methods to extract static features (like colour, texture and shape), object features and motion features from key frames are discussed. Armed with this information extracted from the video content, it discusses how to mine, classify and annotate videos. The objective of this post is to provide an overall idea about video analytics within a 1000-word constraint. For details please refer to paper [2] or use Google Scholar to find more papers on this topic.

Introduction to shots

Movies or video clips can be broken down hierarchically into sequences, scenes, shots and frames. A frame is a single image or picture present in the video; put another way, video is a collection of temporal (time-ordered) images. A shot consists of a consecutive sequence of frames captured in a single 'record' instance of a video camera. In the film industry it is called a take. It may last from a few seconds to a few minutes. Within a shot, frames exhibit a high level of correlation, and this property is extensively used in video annotation and video retrieval tasks. A scene is a collection of contiguous shots that are correlated in some sense; the semantic level of a scene is higher than that of a shot. A collection of scenes is called a sequence. For further information on shots, scenes and sequences, please refer to [3].

Shots are woven together to make a movie. Techniques like cut, dissolve, crossfade and wipe are used to combine shots. A cut is an abrupt change, while the others are gradual changes from one shot to another; thus detection of cuts is easier than detection of the other transitions. Each technique conveys a unique meaning: for example, a dissolve is often used to represent the 'passage of time'. Figure 1 comprises four frames spanning two shots. The first shot shows a natural water pool and the second shot contains a bird near the waterfront. If the scene contained only frame 1 (Fig. 1(a)) and frame 4 (Fig. 1(d)), it would mean the 'cut' technique was employed. As the figure also contains the intermediate blended frames 2 (Fig. 1(b)) and 3 (Fig. 1(c)), the scene has employed the 'dissolve' technique. Here the dissolve does not represent a passage of time.
Figure 1. Two shots combined with a dissolve transition. Image courtesy: Wikipedia
Shot boundary detection
Shot boundary detection algorithms have three phases: visual feature extraction from each frame, similarity measurement, and decision making. Similarity between the extracted visual features of frames is calculated, and based on the similarity score a decision is made on whether a shot boundary is present. Features like colour histograms, edge change ratio, Scale Invariant Feature Transform (SIFT) descriptors, corner points and motion vectors are extracted from frames. Each feature comes with its own merits and demerits. Colour histograms are relatively easy to extract and insensitive to small camera motion, but they become very sensitive under large camera motion. Edge and motion-vector features are more robust to illumination changes, but they are computationally complex.
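
To illustrate the feature-extraction phase, the following sketch (again assuming OpenCV's Python bindings; the bin counts are one common choice of mine, not values from the survey) computes a normalised colour histogram for every frame:

    import cv2

    cap = cv2.VideoCapture('sample.mp4')   # placeholder file name
    histograms = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # Hue-saturation histogram (8 x 8 bins); less sensitive to brightness changes
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [8, 8], [0, 180, 0, 256])
        cv2.normalize(hist, hist)   # normalise so frame size does not matter
        histograms.append(hist)
    cap.release()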

Popular similarity metrics are the Euclidean distance, histogram intersection, chi-squared similarity, earth mover's distance and mutual information. Similarity measurement across frames within a window (i.e. more than two frames) is also used; this technique is robust to local noise but computationally intensive.
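
As a sketch of the similarity-measurement phase, the functions below compare two of the histograms computed above using three of the metrics named here (earth mover's distance and mutual information are left out for brevity); cv2.compareHist supplies the intersection and chi-squared measures:

    import numpy as np
    import cv2

    def euclidean(h1, h2):
        # lower value = more similar
        return np.linalg.norm(h1.flatten() - h2.flatten())

    def intersection(h1, h2):
        # higher value = more similar
        return cv2.compareHist(h1, h2, cv2.HISTCMP_INTERSECT)

    def chi_squared(h1, h2):
        # lower value = more similar
        return cv2.compareHist(h1, h2, cv2.HISTCMP_CHISQR)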

With the help of the similarity measures a decision is made. The decision-making algorithms can be broadly grouped into the following categories: threshold-based, supervised learning and unsupervised learning methods. The threshold-based approach, as expected, uses a predefined threshold to arrive at a decision. The threshold may be calculated globally or locally; the local technique uses a sliding window, and the threshold value varies from window to window.
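
The local (sliding window) idea can be sketched as follows; the window size and the multiplier k are illustrative choices of mine, not values from the survey:

    import numpy as np

    def detect_cuts(scores, window=15, k=3.0):
        # scores: frame-to-frame dissimilarity values, one per frame pair
        cuts = []
        for i, s in enumerate(scores):
            lo = max(0, i - window)
            hi = min(len(scores), i + window + 1)
            neighbourhood = np.array(scores[lo:hi])
            # declare a boundary when a score stands far above its local mean
            if s > neighbourhood.mean() + k * neighbourhood.std():
                cuts.append(i)
        return cuts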

Supervised and unsupervised learning are collectively called statistical or machine learning techniques. They view the boundary detection problem as a classification problem. Among supervised learning techniques, Support Vector Machine (SVM) and AdaBoost classifiers are extensively used. The kernel function of the SVM can combine several visual features to overcome the influence of illumination changes and object movement. Normally, class-labelled (shot boundary present or absent) training data is provided to the SVM to learn from. Each training example is a vector, for example {501, 12, 67, 56}, where each numeric value represents a visual feature. For more information refer to [4]. For gradual transitions, two-sliding-window SVMs and SVM-with-threshold techniques are reported in the literature [2]. The merits of SVMs are that they handle a large number of features for classification and maintain good generalization. Like the SVM, the AdaBoost classifier is also used extensively. Other supervised learning algorithms like k-Nearest Neighbour (kNN) and Hidden Markov Models (HMM) are used for boundary detection as well. Supervised learning methods do not require setting a threshold, and detection accuracy can be improved by the right combination of visual features.
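
A toy sketch of the supervised view, using scikit-learn's SVM (the survey does not prescribe a library, and the feature values and labels below are made up to mirror the {501, 12, 67, 56} example):

    from sklearn import svm

    # Each vector holds visual features for a frame pair (values invented);
    # label 1 = shot boundary present, label 0 = absent.
    X = [[501, 12, 67, 56],
         [498, 11, 70, 54],
         [120, 95, 30, 11],
         [115, 90, 28, 13]]
    y = [0, 0, 1, 1]

    clf = svm.SVC(kernel='rbf')   # the kernel can blend several visual features
    clf.fit(X, y)

    print(clf.predict([[500, 13, 66, 55]]))   # expected: [0], i.e. no boundary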

Unsupervised algorithms learn to classify from the dataset itself; they do not require labelled training data. We know that visual similarity within a shot is high: when successive frames within a shot are compared, their similarity score is high, whereas a pair that straddles a shot boundary yields a poor similarity score. Clustering algorithms like K-means and fuzzy K-means are used [2]. Windows spanning several frames are used for classification, with the window size always smaller than a shot, and K-means is used to classify the windows: if a window falls within a shot its similarity score will be high, otherwise it will be low. Unsupervised shot boundary detection can be performed on compressed or uncompressed video. Compressed-domain detection avoids time-consuming and computationally intensive decompression; it uses features like Discrete Cosine Transform (DCT) coefficients, Macro Blocks (MB, image blocks of 16 x 16 pixels) and motion vectors. Its demerits are that detection performance depends on the compression standard and is less accurate than uncompressed-domain methods.
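
A sketch of the unsupervised view: cluster frame-pair dissimilarity scores into two groups with K-means and treat the cluster with the larger centre as candidate boundaries (the scores below are invented for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    # Dissimilarity between consecutive frames; spikes suggest shot boundaries.
    scores = np.array([0.02, 0.03, 0.01, 0.85, 0.02, 0.04, 0.90, 0.03])
    km = KMeans(n_clusters=2, n_init=10).fit(scores.reshape(-1, 1))

    # The cluster with the larger centre corresponds to candidate boundaries.
    boundary_cluster = np.argmax(km.cluster_centers_)
    print(np.where(km.labels_ == boundary_cluster)[0])   # prints [3 6]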

To be continued…

Sources

[1] 30 Mind Numbing YouTube Facts, Figures and Statistics – Infographic [Online] http://www.jeffbullas.com/2012/05/23/35-mind-numbing-youtube-facts-figures-and-statistics-infographic/
[2] Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng, and Stephen Maybank, “A Survey on Visual Content-Based Video Indexing and Retrieval,” IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, vol. 41, no. 6, pp. 797-819, Nov. 2011. (http://www.dcs.bbk.ac.uk/~sjmaybank/survey video indexing.pdf, PDF, 360 KB)
[3] Sequence-Scene Definitions - ScreenWriting Science [Online] http://screenwritingscience.com/sequence-scene-definitions/
[4] Introduction to Support Vector Machines — OpenCV 2.4.9.0 documentation [Online] http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html