Wednesday 31 December 2014

Video Indexing and Retrieval

Advancements in digitization, video technology, and communication networks have made the Internet a huge repository of text, audio, and video documents. In the years to come, searching for video content on the Internet and playing it on a mobile device is going to be the norm. Implementing a video search engine poses a big challenge for two reasons: first, video files are very large; second, the structure of a video is not as explicit as that of text. A text document can easily be parsed into paragraphs, lines, and words. Much of video search engine technology has evolved from text search engines.


Search engines for text documents evolved in the 1990s. They work in three stages: Web crawling, indexing, and searching. In the Web crawling stage, the search engine visits all possible Web sites. Next, the engine performs a 'full-text search' on the crawled pages and collects a list of possible search terms called an index. When a query is posted to the search engine, the engine examines its index and provides a list of hyperlinks; this stage is called 'search.' Each search engine ranks Web pages by its own criteria, and the best-matching pages are presented to the user.

Video indexing is a complicated process because the structure of a video is not explicit. Video search engine developers can use visual features, audio features, and text extracted from the video (e.g., in news broadcasts, text is displayed on screen as well as read out) as indices. In this post, three visual features, namely key frames, objects, and motion, are discussed, along with the types of video query reported in the literature [1]. For a quick grasp of the subject, refer to [2].
[Figure: Visual query from client to video server. Adapted from [3]]
1. Key Frame Features
A key frame is an image that can be treated as a representative frame of a video sequence. By using key frames, video retrieval can be reduced to 'image retrieval,' so existing image retrieval techniques can be harnessed. Key frames can be identified by comparing frames on three attributes that precisely characterise every object: colour, texture, and shape.
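A minimal key frame selection sketch, assuming OpenCV (cv2) is available; shot changes are detected by a simple colour-histogram difference, and the threshold and bin counts are illustrative choices rather than values from [1]:

```python
import cv2

def extract_key_frames(video_path, threshold=0.4):
    """Keep frames whose colour histogram differs sharply from the
    last key frame (histogram-difference shot detection)."""
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is None or cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            key_frames.append(frame)          # new shot: keep as key frame
            prev_hist = hist
    cap.release()
    return key_frames
```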

(i) Colour
Colour histograms, colour correlograms, and colour moments are potential candidates for colour-based features. These features can be extracted from various colour spaces, such as RGB, HSV, and YCbCr, and from either image blocks or the entire image. Colour histograms and colour moments are simple to compute and are very good descriptors. Colour-based features mimic human visual perception, and the computational complexity of extracting them is low. Their limitation is that they carry no texture or shape information.
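A minimal sketch of the two simplest descriptors above, a global colour histogram and per-channel colour moments, assuming OpenCV and NumPy; the HSV space and bin count are illustrative choices:

```python
import cv2
import numpy as np

def colour_features(image_bgr, bins=8):
    """Global HSV colour histogram plus first two colour moments per channel."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    # 3-D histogram over H, S and V, flattened and normalised to sum to 1
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [bins] * 3,
                        [0, 180, 0, 256, 0, 256]).flatten()
    hist /= hist.sum()
    # colour moments: per-channel mean and standard deviation
    pixels = hsv.reshape(-1, 3).astype(np.float64)
    moments = np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])
    return hist, moments
```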

(ii) Texture
Surfaces such as brick, wood, tiger skin, and marble exhibit distinct patterns. In all of these surfaces one can find an element of randomness as well as orderliness. The pattern is independent of colour and shape. Such patterns are called texture, a property associated with a pixel's neighbourhood [4].

Texture can be characterised in spatial or perceptual mode. In the spatial mode, statistics-based, stochastic-model-based, and structure-based approaches are used. In the statistical approach, spatial properties such as contrast, entropy, and homogeneity are computed from the image. In the stochastic approach, texture is treated as the outcome of a random process; by varying the parameters of the process, different textures are generated, so the parameters serve as texture descriptors. In the structural approach, the entire textured area is assumed to be generated by repetitive placement of a texture element. A brick wall or a tiled wall is a good example of structural texture.
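A sketch of the statistical approach using a grey-level co-occurrence matrix, assuming scikit-image and NumPy; the distance, angle, and quantisation settings are illustrative:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_texture_features(gray_image):
    """Contrast, homogeneity, and entropy from a grey-level
    co-occurrence matrix (GLCM) of an 8-bit grey image."""
    img = (gray_image // 4).astype(np.uint8)    # quantise 256 -> 64 levels
    glcm = graycomatrix(img, distances=[1], angles=[0], levels=64,
                        symmetric=True, normed=True)
    contrast = graycoprops(glcm, 'contrast')[0, 0]
    homogeneity = graycoprops(glcm, 'homogeneity')[0, 0]
    p = glcm[:, :, 0, 0]                        # normalised joint probabilities
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return contrast, homogeneity, entropy
```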

Texture is a good descriptor and an important feature in human perception, so image classification and categorization algorithms use it extensively. In 1978, H. Tamura and his associates [5] developed a list of texture features that can easily be computed from a given image: coarseness, contrast, directionality, line-likeness, regularity, and roughness. Tamura features are very close to human perception and are widely used. Co-occurrence matrices and simultaneous autoregressive models are also used to compute texture features.

(iii) Shape 
Shape is also an important feature in human perception. The usual approach is to first detect edges in the image and then build boundaries from them; from the boundaries, objects can be extracted from the image. In the literature, the Edge Histogram Descriptor (EHD) is used to capture the distribution of edges in a video sequence [1].
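A simplified edge-orientation histogram in the spirit of, but not identical to, the MPEG-7 EHD, assuming OpenCV and NumPy:

```python
import cv2
import numpy as np

def edge_histogram(gray_image, bins=8):
    """Orientation histogram of strong edges (a simplified stand-in
    for the MPEG-7 Edge Histogram Descriptor)."""
    gx = cv2.Sobel(gray_image, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray_image, cv2.CV_32F, 0, 1)
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)                # edge orientation in radians
    strong = magnitude > magnitude.mean()     # keep only strong edges
    hist, _ = np.histogram(angle[strong], bins=bins,
                           range=(-np.pi, np.pi), weights=magnitude[strong])
    return hist / (hist.sum() + 1e-9)         # normalised histogram
```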

2. Object Features 
A human face can be treated as an object. An object has the following attributes: colour, texture, and shape. Human skin colour (from the African to the Caucasian race) occupies a small region of the colour space; to put it colloquially, it varies from dark chocolate to very light chocolate. Human facial skin texture is smooth. The human face is oval in shape, with two sockets for the eyes. This description fits all human beings. Video retrieval systems use a face as a query image, and the video segments containing that face are returned as results. Other objects are characterized in the same manner.
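The observation that skin colour occupies a small region of the colour space can be turned into a rough detector. A sketch assuming OpenCV; the HSV bounds are common heuristic values, not a standard:

```python
import cv2
import numpy as np

def skin_mask(image_bgr):
    """Rough skin-colour mask: human skin falls in a compact HSV region.
    The bounds below are heuristic, illustrative values."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 255, 255], dtype=np.uint8)
    return cv2.inRange(hsv, lower, upper)   # 255 where the pixel looks like skin
```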

3. Motion Features
Motion is the feature that distinguishes video from still images, and motion features are the most appropriate for semantic concepts. Any video frame can be partitioned into foreground and background: the foreground contains an object, or a collection of objects, that changes position from frame to frame, while the background contains the rest of the image, which stays nearly static within a video shot. In cartoon videos, for example, the characters (objects) move while the background remains static. Motion-based features can be broadly classified as camera-based and object-based.

Camera movements such as 'zoom in or zoom out,' 'pan left or pan right,' and 'tilt up or tilt down' change the background and are aptly called camera-based motion. Object-based features deal with the movement of the object or objects in the foreground, and the literature treats them with greater interest. Object-based motion features can represent statistical parameters, trajectory information, or the spatial relationships among objects within a video clip.
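A sketch of a simple motion descriptor using dense optical flow, assuming OpenCV; the histogram of flow directions loosely captures the statistical kind of object-based motion feature mentioned above:

```python
import cv2
import numpy as np

def motion_descriptor(prev_gray, next_gray, bins=8):
    """Histogram of dense optical-flow directions between two frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    moving = magnitude > 1.0                  # ignore near-static pixels
    hist, _ = np.histogram(angle[moving], bins=bins, range=(0, 2 * np.pi),
                           weights=magnitude[moving])
    return hist / (hist.sum() + 1e-9)
```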

Query and Retrieval
On receiving a query (text or video), a similarity measure is computed between the query and the indices (plural of index). The index that most closely matches the query is selected, and the corresponding video contents are listed for the user. The result list is refined by either explicit or implicit feedback from users.
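A minimal ranking sketch, assuming each stored video is indexed by a single feature vector (a hypothetical layout; real systems index many features per video):

```python
import numpy as np

def rank_videos(query_feature, index):
    """Rank indexed videos by cosine similarity to the query feature.
    'index' maps video id -> feature vector (hypothetical layout)."""
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    scores = {vid: cosine(query_feature, feat) for vid, feat in index.items()}
    return sorted(scores, key=scores.get, reverse=True)   # best match first
```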

A video query can be broadly classified as semantic or non-semantic. The literal meaning of 'semantics' is the study of meaning. If the query to a search engine is 'push up,' the engine has to understand the query (its semantics) and respond with push-up training videos. Query by example, query by object, and query by sketch fall under non-semantic queries; semantic queries include query by keywords and query by natural language.

In query by example, i.e., the visual query method, a sample video is posted as the query, and the server returns a list of videos similar to it. (Query by example image is available in Google; do try it once.) In this technique, key frames or visual features are extracted from the sample video and matched against the stored indices. In query by sketch, a motion trajectory is provided as the query, and the server returns a list of videos that closely match the given trajectory. In query by object, an object is given as the query, and the server is expected to return the relevant videos as well as the locations where the object appears in them.
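Tying the earlier sketches together, a hypothetical query-by-example flow might look as follows; extract_key_frames, colour_features, and rank_videos are the illustrative helpers defined above, and stored_index is an assumed precomputed mapping from video ids to feature vectors:

```python
import numpy as np

# Query by example: index the sample video's key frames, then rank
# stored videos against the averaged query feature.
query_frames = extract_key_frames("sample_query.mp4")   # illustrative file name
query_feature = np.mean([colour_features(f)[0] for f in query_frames], axis=0)
results = rank_videos(query_feature, stored_index)      # stored_index: id -> feature
print(results[:10])                                     # ten best-matching video ids
```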

Video metadata, visual concepts, and transcripts are treated as keywords; if such words are entered as a query, it is called query by keywords. Query by natural language is the most natural and convenient for the user, but extracting semantics from natural language is very difficult. Beyond that, extracting semantics from the video and representing it as an index is the real challenge.

Source
[1] Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng, and Stephen Maybank, "A Survey on Visual Content-Based Video Indexing and Retrieval," IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, vol. 41, no. 6, pp. 797-819, Nov. 2011. (http://www.dcs.bbk.ac.uk/~sjmaybank/survey video indexing.pdf, PDF, 360 KB)
[2] Shih-Fu Chang, "Video Indexing, Summarization, and Adaptation," presentation. http://www.ee.columbia.edu/~sfchang/papers/talk-dimacs-video-mining-1102.pdf
[3] BilVideo-7: MPEG-7 Compatible Multimodal Video Indexing and Retrieval System, http://www.cs.bilkent.edu.tr/~bilmdg/bilvideo-7/
[4] Texture, http://www.micc.unifi.it/delbimbo/wp-content/uploads/2011/10/slide_corso/A12_texture_detectors.pdf (PDF, 2.6 MB)
[5] Hideyuki Tamura, Shunji Mori, and Takashi Yamawaki, "Textural Features Corresponding to Visual Perception," IEEE Transactions on Systems, Man, and Cybernetics, vol. 8, no. 6, pp. 460-473, Jun. 1978.