
M Sai Raghu Ram
SRM Institute of Science & Technology,
Kattankulathur, Chennai
[email protected]

M Raghavendra
SRM Institute of Science & Technology,
Kattankulathur, Chennai
[email protected]

S Iniyan
Asst. Professor,
SRM Institute of Science & Technology, Chennai
[email protected]


INTRODUCTION:

With the rapid advance of the Internet and smart phones, action recognition in personal videos produced by users has become an important research topic due to its wide applications, such as automatic video tracking [1], [2] and video annotation [3]. Consumer videos on the Web are uploaded by users and produced by hand-held cameras or smart phones, and so they may contain considerable camera shake, occlusion, and cluttered background.

Thus, these videos contain large intra-class variations within the same semantic category, and it is a challenging task to recognize human actions in such videos. Many action recognition methods follow the conventional framework. First, a large number of local motion features (e.g., space-time interest points (STIP) [4], [5], motion scale-invariant feature transform (MoSIFT) [6], etc.) are extracted from videos. Then, all local features are quantized into a histogram vector using the bag-of-words (BoW) representation [7], [8]. Finally, vector-based classifiers (e.g., the support vector machine [9]) are used to perform recognition on testing videos. When the videos are simple, these action recognition methods have achieved promising results.
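For illustration, the following MATLAB sketch shows the quantization step of this conventional pipeline. It is a minimal sketch assuming the local descriptors (e.g., STIP or MoSIFT) have already been extracted and pooled into matrices; the variable names trainDesc and videoDesc, the codebook size K, and the use of k-means for codebook learning are illustrative assumptions, not the exact procedure of [7], [8].

% Bag-of-words quantization sketch (assumed inputs):
%   trainDesc - N-by-D matrix of local descriptors pooled from training videos
%   videoDesc - M-by-D matrix of local descriptors from one video
% Requires the Statistics Toolbox (kmeans, knnsearch).
K = 1000;                                 % codebook size (assumed)
[~, codebook] = kmeans(trainDesc, K, 'MaxIter', 100, 'EmptyAction', 'singleton');
words = knnsearch(codebook, videoDesc);   % nearest visual word for each descriptor
bow   = histc(words, 1:K);                % word-count histogram of the video
bow   = bow / max(sum(bow), 1);           % L1-normalised BoW vector
% The normalised histogram would then be fed to a vector-based classifier
% such as an SVM (e.g., LIBSVM [9]).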

However, noise and uncorrelated information may be incorporated into the BoW representation during the extraction and quantization of the local features [10]. Therefore, these methods are usually not robust and do not generalize well when the videos contain considerable camera shake, occlusion, cluttered background, and so on. In order to improve recognition accuracy, meaningful components of actions, e.g., related objects, human appearance, and posture, should be utilized to form a clearer semantic interpretation of human actions. Recent efforts [11], [12] have demonstrated the effectiveness of leveraging related objects or human poses.

However, these methods may require a training process with a large number of videos to obtain good performance, especially for real-world videos, and it is quite challenging to collect enough labeled videos that cover a diverse range of action poses.

SCOPE OF THE PROJECT:

Currently, most knowledge adaptation algorithms require sufficient labeled data in the target domain. In real-world applications, however, most videos are unlabeled or weakly labeled, and collecting well-labeled videos is time consuming and labor intensive. For example, 111 researchers from 23 institutes spent more than 220 h to collect only 63 h of the TRECVID 2003 development corpus [18]. Previous studies [19], [20] have shown that simultaneously utilizing labeled and unlabeled data is beneficial for video action recognition.

In order to enhance the performance of action recognition, we explore how to utilize semi-supervised learning to leverage unlabeled data and thus learn a more accurate classifier.

LITERATURE SURVEY:

1. MoSIFT: Recognizing Human Actions in Surveillance Videos, CMU-CS-09-161, Ming-yu Chen and Alex Hauptmann:

The goal of this paper is to build robust human action recognition for real-world surveillance videos. Local spatio-temporal features around interest points provide compact but descriptive representations for video analysis and motion recognition. Older approaches tend to extend spatial descriptions.

They capture the motion information only implicitly, by adding a temporal component to the appearance descriptor. In our approach, we propose the MoSIFT algorithm, which detects interest points and explicitly models local motion in addition to encoding local appearance. The main idea is the detection of distinctive local features through both local appearance and motion. We construct MoSIFT feature descriptors in the spirit of the well-known SIFT descriptors, making them robust to small deformations through grid aggregation. In order to capture more of the global structure of actions, we introduce a bigram model to capture the correlation between local features. The method advances the result on the KTH dataset to an accuracy of 95.8%.

We also applied our approach to nearly 100 hours of surveillance data as part of the TRECVID Event Detection task, with very promising results on recognizing human actions in real-world surveillance videos.

2. Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition, Abhinav Gupta, Member, IEEE, Aniruddha Kembhavi, Member, IEEE, and Larry S. Davis, Fellow, IEEE:

Interpretation of images and videos containing humans interacting with different objects is a daunting task. It involves understanding the scene/event, analyzing human movements, recognizing manipulable objects, and observing the effect of the human movement on those objects. While each of these perceptual tasks can be conducted independently, the recognition rate improves when interactions between them are considered. Motivated by psychological studies of human perception, we present a Bayesian approach which integrates the various perceptual tasks involved in understanding human-object interactions.

Previous approaches to object and action recognition rely on static shape/appearance feature matching and motion analysis, respectively. Our approach goes beyond these traditional approaches and applies spatial and functional constraints on each of the perceptual elements for coherent semantic interpretation. Such constraints allow us to recognize objects and actions when the appearances are not discriminative enough. We also demonstrate the use of such constraints in recognition of actions from static images without using any motion information.

3. Visual Event Recognition in Videos by Learning from Web Data, Lixin Duan, Dong Xu, Ivor W. Tsang (School of Computer Engineering, Nanyang Technological University, {S080003, DongXu, IvorTsang}@ntu.edu.sg) and Jiebo Luo (Kodak Research Labs, Eastman Kodak Company, Rochester, NY, USA):

We propose a visual event recognition framework for consumer-domain videos by leveraging a large amount of loosely labeled web videos (e.g., from YouTube). First, we propose a new aligned space-time pyramid matching method to measure the distances between two video clips, where each video clip is divided into space-time volumes over multiple levels. We calculate the pairwise distances between any two volumes and further integrate the information from different volumes with integer-flow Earth Mover's Distance (EMD) to explicitly align the volumes.

Second, we propose a new cross-domain learning method in order to 1) fuse the information from multiple pyramid levels and features (i.e., the space-time feature and the static SIFT feature) and 2) cope with the considerable variation in feature distributions between videos from the two domains (i.e., the web domain and the consumer domain). For each pyramid level and each type of local feature, we train a set of SVM classifiers based on the combined training set from the two domains using multiple base kernels of different kernel types and parameters, which are fused with equal weights to obtain an average classifier.

Finally, we propose a cross-domain learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), to learn an adapted classifier based on the multiple base kernels and the prelearned average classifiers by minimizing both the structural risk functional and the mismatch between data distributions from the two domains. Extensive experiments demonstrate the effectiveness of our proposed framework, which requires only a small number of labeled consumer videos by leveraging web data.

4. Action Recognition Using Nonnegative Action Component Representation and Sparse Basis Selection, Haoran Wang, Chunfeng Yuan, Weiming Hu, Haibin Ling, Wankou Yang, and Changyin Sun:

In this paper, we propose using high-level action units to represent human actions in videos and, based on such units, a novel sparse model is developed for human action recognition. There are three interconnected components in our approach.

First, we propose a new context-aware spatial-temporal descriptor, named locally weighted word context, to improve the discriminability of the traditionally used local spatial-temporal descriptors. Second, from the statistics of the context-aware descriptors, we learn action units using graph-regularized nonnegative matrix factorization, which leads to a part-based representation and encodes the geometrical information. These units effectively bridge the semantic gap in action recognition.

Third, we propose a sparse model based on a joint l2,1-norm to preserve the representative items and suppress noise in the action units. Intuitively, when learning the dictionary for action representation, the sparse model captures the fact that actions from the same class share similar units. The proposed approach is evaluated on several publicly available data sets, and the experimental results and analysis clearly demonstrate its effectiveness.

5. Recognizing Human Actions in Still Images: A Study of Bag-of-Features and Part-Based Representations:

Recognition of human actions is usually addressed in the scope of video interpretation. Meanwhile, common human actions such as "reading a book", "playing a guitar" or "writing notes" also provide a natural description for many still images.

In addition, some actions in video, such as "taking a photograph", are static by their nature and may require recognition methods based on static cues only. Motivated by the potential impact of recognizing actions in still images and the little attention this problem has received in computer vision so far, we address recognition of human actions in consumer photographs. We construct a new dataset with seven classes of actions in 968 Flickr images representing natural variations of human actions in terms of camera viewpoint, human pose, clothing, occlusions and scene background.

We study action recognition in still images using state-of-the-art bag-of-features methods as well as their combination with the part-based latent SVM approach of Felzenszwalb et al. [6]. In particular, we investigate the role of background scene context and demonstrate that improved action recognition performance can be achieved by (i) combining the statistical and part-based representations, and (ii) integrating a person-centric description with the background scene context. We show results on our newly collected dataset of seven common actions as well as demonstrate improved performance over existing methods on the datasets of Gupta et al. [8] and Yao and Fei-Fei [20].

MODULES:

1. Image Feature Extraction.
2. Video Feature Extraction.
3. PCA.
4. Classifier.

BLOCK DIAGRAM:

MODULE DESCRIPTION:

1. Image Feature Extraction:

In our method, we extract the image (static) feature from both images and key frames of videos.

Considering computational efficiency, we extract key frames using a shot boundary detection algorithm [36]. An example of key frame extraction is shown in Fig. 3.

The main steps of key frame extraction are as follows. First, the color histogram is calculated for every fifth frame. Second, this histogram is subtracted from that of the previous frame. Third, the frame is marked as a shot boundary if the difference is larger than an empirically set threshold. Once we obtain a shot, the frame in the middle of the shot is taken as a key frame.
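A minimal MATLAB sketch of this procedure is given below. The video file name, the 32-bin per-channel colour histogram, the five-frame stride, and the threshold value are illustrative assumptions, an RGB input video is assumed, and the shot boundary detector of [36] may differ in its details.

% Shot-boundary-based key frame extraction (sketch); assumes an RGB video.
v      = VideoReader('sample.avi');   % file name is an assumption
nFrame = v.NumberOfFrames;
edges  = 0:8:256;                     % 32 bins per colour channel
thr    = 0.3;                         % empirical threshold (assumed)
prevH  = [];
bounds = 1;                           % the first frame opens the first shot
for k = 1:5:nFrame                    % examine every fifth frame
    f = double(read(v, k));
    h = [histc(reshape(f(:,:,1),[],1), edges); ...
         histc(reshape(f(:,:,2),[],1), edges); ...
         histc(reshape(f(:,:,3),[],1), edges)];
    h = h / sum(h);                   % normalised colour histogram
    if ~isempty(prevH) && sum(abs(h - prevH)) > thr
        bounds(end+1) = k;            % large histogram change -> shot boundary
    end
    prevH = h;
end
bounds(end+1) = nFrame;
keyIdx = round((bounds(1:end-1) + bounds(2:end)) / 2);  % middle frame of each shot

Each index in keyIdx points to one key frame, from which the image (static) feature is then extracted.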

2. Video Feature Extraction:

The video (motion) feature is extracted from the video domain and is combined with the image feature; the image feature is therefore a subset of the combined feature.

3. PCA:

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (or sometimes, principal modes of variation). The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations.

The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors form an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables. In the proposed framework, we use the kernel principal component analysis (KPCA) method [37] to map the combined features and the image features. The KPCA method can explore the principal knowledge of the mapped Hilbert spaces. Therefore, the training process is more efficient, which makes the IVA more suitable for real-world applications.
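The following MATLAB sketch illustrates the kind of kernel PCA mapping described here. It assumes an RBF kernel, a feature matrix X with one sample per row, a bandwidth sigma, and d retained components; none of these choices are fixed by the text, and the exact KPCA formulation of [37] may differ.

% Kernel PCA (KPCA) sketch with an RBF kernel.
% Assumed input: X is an n-by-p feature matrix (rows = samples).
n     = size(X, 1);
sigma = 1;                                   % RBF bandwidth (assumed)
d     = 50;                                  % retained components (assumed, d <= n)
D2 = bsxfun(@plus, sum(X.^2, 2), sum(X.^2, 2)') - 2 * (X * X');  % squared distances
K  = exp(-D2 / (2 * sigma^2));               % RBF kernel matrix
J  = ones(n) / n;
Kc = K - J*K - K*J + J*K*J;                  % centre the kernel matrix in feature space
[V, E] = eig((Kc + Kc') / 2);                % symmetrise for numerical stability
[ev, order] = sort(real(diag(E)), 'descend');
V = real(V(:, order(1:d)));
V = bsxfun(@rdivide, V, sqrt(max(ev(1:d)', eps)));  % scale eigenvectors by 1/sqrt(lambda)
Z = Kc * V;                                  % mapped features, n-by-d

Depending on whether the image features or the combined features are supplied as X, the rows of Z give the corresponding mapped feature space used in the following steps.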

We obtain the common feature A by mapping the image features into one Hilbert space, and the heterogeneous feature AB by mapping the combined features into another Hilbert space.

4. Classifier:

The knowledge can be adapted based on the shared space of the common features and then used to optimize classifier A.

In order to make use of unlabeled videos, a semi-supervised classifier AB is trained based on the heterogeneous features in the video domain. We integrate the two classifiers into a joint optimization framework, and the final recognition results on testing videos are improved by fusing the outputs of these two classifiers.

SOFTWARE REQUIREMENT:

- MATLAB 7.14 (R2012a)

MATLAB:

The MATLAB high-performance language for technical computing integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Typical uses include:

- Data acquisition, exploration, and analysis
- Engineering drawing and scientific graphics
- Algorithm design, development, and analysis
- Mathematical and computational functions
- Simulation, prototyping, and modeling
- Application development, including GUI building

Using MATLAB, you can solve technical computing problems faster than with traditional programming languages such as C, C++, and FORTRAN.

FUTURE ENHANCEMENT:

The reliability and computational efficiency of the proposed method allow the creation of an effective tool that can easily be incorporated into practical applications.

CONCLUSION:

To achieve good performance in video action recognition, we propose the IVA classifier, which can borrow knowledge adapted from images based on the common visual features. Meanwhile, it can fully utilize the heterogeneous features of unlabeled videos to enhance the performance of action recognition in videos. In our experiments, we validate that the knowledge learned from images can influence the recognition accuracy on videos and that different recognition results are obtained by using different visual cues. Experimental results show that the proposed IVA achieves better video action recognition performance than state-of-the-art methods, and its performance remains promising when only a few labeled training videos are available.

REFERENCES:

[1] B. Ma, L. Huang, J. Shen, and L. Shao, "Discriminative tracking using tensor pooling," IEEE Trans. Cybern., to be published, doi: 10.1109/TCYB.2015.2477879.

[2] L. Liu, L. Shao, X. Li, and K. Lu, "Learning spatio-temporal representations for action recognition: A genetic programming approach," IEEE Trans. Cybern., vol. 46, no. 1, pp. 158–170, Jan. 2016.

[3] A. Khan, D. Windridge, and J. Kittler, "Multilevel Chinese takeaway process and label-based processes for rule induction in the context of automated sports video annotation," IEEE Trans. Cybern., vol. 44, no. 10, pp. 1910–1923, Oct. 2014.

[4] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," in Proc. Brit. Mach. Vis. Conf., London, U.K., 2009, pp. 124.1–124.11.

[5] L. Shao, X. Zhen, D. Tao, and X. Li, "Spatio-temporal Laplacian pyramid coding for action recognition," IEEE Trans. Cybern., vol. 44, no. 6, pp. 817–827, Jun. 2014.

[6] M.-Y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-09-161, 2009.

[7] M. Yu, L. Liu, and L. Shao, "Structure-preserving binary representations for RGB-D action recognition," IEEE Trans. Pattern Anal. Mach. Intell., to be published, doi: 10.1109/TPAMI.2015.2491925.

[8] L. Shao, L. Liu, and M. Yu, "Kernelized multiview projection for robust action recognition," Int. J. Comput. Vis., 2015, doi: 10.1007/s11263-015-0861-6.

[9] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27, Apr. 2011.

[10] Y. Han et al., "Semisupervised feature selection via spline regression for video semantic recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 2, pp. 252–264, Feb. 2015.
