Trajectons: Action Recognition Through the Motion Analysis of Tracked Features
Pyry Matikainen (1), Martial Hebert (1), Rahul Sukthankar (2,1)
(1) Robotics Institute, Carnegie Mellon University    (2) Intel Labs Pittsburgh
pmatikai@cs.cmu.edu, hebert@ri.cmu.edu, rahuls@cs.cmu.edu
Abstract
The defining feature of video compared to still images is motion, and as such the selection of good motion features for action recognition is crucial, especially for bag-of-words techniques that rely heavily on their features. Existing motion techniques either assume that a difficult problem like background/foreground segmentation has already been solved (contour/silhouette based techniques) or are computationally expensive and prone to noise (optical flow). We present a technique for motion based on quantized trajectory snippets of tracked features. These quantized snippets, or trajectons, rely only on simple feature tracking and are computationally efficient. We demonstrate that within a bag-of-words framework trajectons can match state-of-the-art results, slightly outperforming histogram of optical flow features on the Hollywood Actions dataset. Additionally, we present qualitative results in a video search task on a custom dataset of challenging YouTube videos.
Figure 1. Simple KLT feature tracking is used to track as many features as possible within a video. Each tracked point produces a fixed-length trajectory snippet every frame, consisting of the last L (usually ten) positions in its trajectory. These snippets are quantized to a library of trajectons.
1. Introduction
The recent rise in popularity of the bag-of-words paradigm for action recognition in video has led to significant gains in performance and the introduction of more challenging datasets to tackle. Indeed, some of these techniques can achieve near perfect performance where more principled approaches only produce mediocre results. Nevertheless, their success has highlighted the fact that significant work remains to be done in the area of features for video, particularly with regard to motion.
Coming from work with static images, it is no surprise that the familiar techniques of that area have come to be applied here, such as the ever-popular histograms of oriented gradients. When dealing with the appearance of frames in video, these types of features are entirely appropriate. But actions in video comprise both an appearance aspect and a motion aspect, and motion features have not had the benefit of inheriting a set of canonical techniques honed over decades.
This leads us to believe that there are still significant gains to be captured with motion features, and since it seems that actions are defined more by their motion than by their incidental appearance, this problem is doubly important.
Present bag-of-words techniques for motion are largely extensions of appearance-based techniques. Since motion information cannot be read directly from the video as in the case of appearance, it is necessary to either explicitly estimate this latent information or to implicitly encapsulate it within an appearance measure. In the latter case, image derivatives treating time as simply another dimension (producing an image cube) implicitly encode some amount of motion information; this approach amounts to augmenting 2D filters to 3D. However, as the frame rate (that is, the sampling frequency in the time dimension) is rather coarse compared to the image resolution, any significant motion will be poorly represented through 3D image gradients. The former approach, explicit estimation of motion, has proved to be more popular, most often in the form of optical flow.
However, optical flow is itself a difficult problem to tackle, and even the best algorithms are noisy and computationally expensive. While in principle it should be possible to track a feature using optical flow by simply integrating a trajectory over the time-varying flow field, in practice the significant noise will quickly overwhelm any estimated trajectory. In contrast, even simple KLT trackers [2] can, with relative robustness, track a sparse set of features over relatively large periods of time at only a fraction of the computational cost of optical flow.
Compared with a cube of optical flow vectors, the key advantage of a trajectory is that it is attached to a particular moving feature. That is, in video deriving from the movement of physical bodies through space, a properly tracked feature (and hence trajectory) automatically gains foreground-background separation. In contrast, histogramming over a cube of optical flow vectors will blend the various sources of motion within that cube. For simple videos, such as a single person against a static background, this conflation of foreground and background may not matter. For more complicated videos with background motion and multiple occluding actors, this conflation comes at a cost.
To this end we introduce new trajectory-based motion features, which we call trajectons. We demonstrate that within a bag-of-words framework our simple and computationally efficient trajecton features are competitive with state-of-the-art motion features on simple datasets and outperform them on complicated ones.
2. Related Work
The idea of textons, or quantized filter bank responses, originated with texture classification but quickly grew to be applied to object and scene recognition in static images [21, 6]. These approaches are backed by some degree of psychological research suggesting that even human vision may employ unstructured statistical approaches at early stages [15], and can scale well even to large datasets when used with sparse features [5]. By analogy to the idea that textual works are composed from a set of discrete repeating elements (words), techniques that model data using a library of quantized features are generally known as bag-of-words approaches.
Recently, bag-of-words techniques have gained significant popularity in video interpretation [11]. In the case of video appearance, the same features that work for static images still apply to video (e.g., histograms of oriented gradients, filter responses). However, in the case of motion it is not possible to directly read off the motion of the video in the same way as appearance can be directly read from pixels, and so the choice of motion features is complicated significantly.
The philosophically closest measure to motion pixels is dense optical flow, which has been a very common representation of motion [11, 8, 10, 18].
Dense optical flow has the benefit of intuitively representing our notion of motion in a way directly analogous to pixels and appearance, but the actual calculation of optical flow from sequential frames is itself a difficult problem, and even the best algorithms are plagued by severe artifacts and noise. In an attempt to sidestep this problem, many techniques avoid the actual calculation of optical flow, either by implicitly encoding motion through temporal derivatives [16, 12, 7], or by producing the information that would be required to compute optical flow but refraining from the final step [18].

A more radical approach is to discard the notion of a dense motion field computed over the entire video and only compute motion at locations where it can be effectively estimated. Since most applications are interested in classifying the motion of actors distinct from their backgrounds, a natural desire is to only compute that motion which is relevant to those actors. Due to the intuitive appeal of the idea that the evolution of a silhouette through time is enough to capture the motion of an action, silhouette-based features have been common as well [23, 3, 10, 19]. However, silhouettes cannot represent motion that occurs within the silhouette boundary (like a person clapping with both hands in front of her body), so the natural extension has been from silhouettes to visual hulls [22, 20].
As silhouette extraction is not trivial either, the next step is to discard both density and semantic meaning to simply find special locations of interest (wherever they may be, on a person or not) for which the motion can be effectively computed. When these locations are point-like, that is, occurring at a single time, the result is spacetime interest points around which various measures can be calculated, such as optical flow [7, 23].
When these locations extend in time they become trajectories, most frequently arising from tracked features. Often it is assumed that these trajectories are from known, fixed features, such as particular landmarks (e.g., elbows, hands) on a human body [1, 14, 9]. If the trajectories are not on known features, then if they are very long and robust, it is potentially possible to extract full 3D information even from a single view [13]. The duration and coherence of trajectories means that each potentially contains a great deal of information.
Our contribution is to bring trajectories into a bag-of-words framework while avoiding the pitfalls of existing trajectory-based motion features: the assumption that trajectories are long, noise free, or tracking known body landmarks. At the same time, we do not discard the fundamental time-series nature of trajectories by treating them as merely an unrelated series of derivatives to be binned. Our method deals with short (< 20 frames) and inconsistent trajectories, and we are able to use computationally efficient stock feature tracking methods such as KLT even on complex video.
Since we are able to match the performance of optical flow based methods even with our naive system, we believe there are still significant gains to be made with the combination of tracked feature point trajectories and bag-of-words techniques.

3. Method

Our method proceeds according to the standard bag-of-words approach: first, features are tracked over the video using a KLT tracker [2] to produce feature trajectories (x and y positions over time) for a number of features. These trajectories are slightly transformed (cropped and filtered, as described later) to produce a number of trajectory snippets for each video. Given a training set of videos, a dictionary of trajectory words, or trajectons, is first produced by clustering a sample set of trajectory snippets into a specified number (k) of clusters, the centers of which are retained as the trajecton library. Next, for each video, either training or test, its trajectory snippets are assigned the label of the nearest center in the trajecton library, and these labels are accumulated over the video to produce a histogram with k bins. This k-length vector is normalized to sum to one for each video, and the training set of histograms, along with training class labels, is used to train a support vector machine (SVM) to classify videos into action categories. This SVM is then used to classify the test set. The experiments shown below employ the standard LIBSVM [4] implementation of support vector machines.
We propose two variants of trajectons, which differ in their construction of trajectory snippets. For vanilla trajectons, each trajectory snippet is simply a concatenated vector of (dx, dy) derivatives for the trajectory, whereas in Affine-Augmented (AA) trajectons, this vector of derivatives is concatenated with a vector of local affine transforms that describe the motion around each trajectory point.
3.1. Vanilla trajectons
3.1.1 Feature Tracking
A standard KLT tracker is used to track features (using "good features to track") over a video. In our implementation, we track a fixed number of features (typically 100), with features replaced as necessary when tracks are lost. The output of this tracking is a trace of (x, y) pairs for each feature. For convenience of notation, we can assume that feature indices are never reused; then we can express a feature i's position at time t as $X_i^t = (x_i^t, y_i^t)$.
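Since the tracker is a stock component, the tracking loop is easy to sketch. The following minimal illustration uses OpenCV's pyramidal Lucas-Kanade routines rather than the Birchfield KLT implementation [2] used in the paper; parameter values and function names are our own placeholders, not the settings from the experiments.

```python
import cv2
import numpy as np

def track_features(video_path, n_features=100):
    """Track a fixed pool of KLT features through a video.

    Returns a list of trajectories; each trajectory is a list of
    (frame_index, x, y) samples, and feature indices are never reused."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=n_features,
                                  qualityLevel=0.01, minDistance=8)
    trajectories = [[(0, float(x), float(y))] for ((x, y),) in pts]
    active = list(range(len(trajectories)))   # trajectory index per tracked point
    t = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t += 1
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        survivors, surv_idx = [], []
        for p, s, idx in zip(new_pts, status.ravel(), active):
            if s:                              # the track survived this frame
                x, y = p.ravel()
                trajectories[idx].append((t, float(x), float(y)))
                survivors.append(p)
                surv_idx.append(idx)
        # Replace lost tracks with freshly detected "good features to track".
        n_missing = n_features - len(survivors)
        if n_missing > 0:
            fresh = cv2.goodFeaturesToTrack(gray, maxCorners=n_missing,
                                            qualityLevel=0.01, minDistance=8)
            for ((x, y),) in (fresh if fresh is not None else []):
                trajectories.append([(t, float(x), float(y))])
                surv_idx.append(len(trajectories) - 1)
                survivors.append(np.array([[x, y]], dtype=np.float32))
        if not survivors:                      # nothing left to track
            break
        pts = np.array(survivors, dtype=np.float32)
        active = surv_idx
        prev = gray
    return trajectories
```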
3.1.2 Trajectory Snippet Production
Then, for each frame, each feature that exists during this frame produces a trajectory snippet that consists of the discrete derivatives of the feature point locations in time.
Figure 2. Example trajectons in the trajecton library computed from the KTH dataset. Many, such as the long and straight trajectories and the curving arcs, correspond to stereotypical portions of specific actions in the KTH dataset (running, waving arms).
In other words, given a frame time t and feature i, and a maximum snippet length L, the trajectory snippet produced is:
$$T_i^t = \left\{ X_i^t - X_i^{t-1},\; X_i^{t-1} - X_i^{t-2},\; \ldots,\; X_i^{t-L+1} - X_i^{t-L} \right\},$$

where if $X_i^j$ does not exist for a given time, any terms containing it are set to zero.
Since $X_i$ includes both x and y position, the full flattened vector will be of length 2L. If the number of tracked features is fixed at n, and a video has f frames, this means that the total number of trajectory snippets (and hence eventually trajectons) will be nf. Also, note that if a feature is tracked for longer than L frames, every window of size L in that trajectory produces its own snippet.
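To make the construction concrete, a short sketch (function and variable names are ours, not from the paper) that turns one tracked trajectory into its per-frame, zero-padded 2L-dimensional snippets:

```python
import numpy as np

def trajectory_snippets(track, L=10):
    """Turn one trajectory, given as a time-ordered list of (x, y)
    positions, into a 2L-dimensional snippet of (dx, dy) derivatives
    for every frame of the track.  Terms that would reach back before
    the start of the track are left at zero, as in the text."""
    snippets = []
    for t in range(1, len(track)):            # skip the degenerate all-zero first frame
        snippet = np.zeros(2 * L, dtype=np.float32)
        for k in range(L):
            a, b = t - k, t - k - 1           # X^{t-k} - X^{t-k-1}
            if b < 0:
                break                         # the feature did not exist yet
            snippet[2 * k] = track[a][0] - track[b][0]       # dx
            snippet[2 * k + 1] = track[a][1] - track[b][1]   # dy
        snippets.append(snippet)
    return snippets
```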
3.1.3 Trajectory Snippet Clustering and Quantization
Next, these trajectory snippets are clustered into a library, and that library is used to quantize snippets to a set of labels. Examples, selected at random, from the trajecton library computed for the KTH dataset [17] are shown in Fig. 2. Given a sample set of trajectory snippets (vectors of length 2L), these snippets are clustered using k-means with the standard Euclidean distance metric into k clusters, and the cluster centers are stored in a library of k trajectons. These trajectons represent archetypal trajectories within the video set.
The trajectory snippets of each video (both training and test) are quantized using the trajecton library by assigning each trajectory snippet the index of the trajecton to which it has the smallest Euclidean distance.
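A minimal sketch of the library construction and quantization steps follows; the paper specifies only Euclidean k-means, so the use of scikit-learn here and the default parameters are our own illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_trajecton_library(sample_snippets, k=1000, seed=0):
    """Cluster a sample of 2L-dimensional snippets with Euclidean
    k-means; the k cluster centers form the trajecton library."""
    snips = np.asarray(sample_snippets, dtype=np.float32)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(snips)
    return km.cluster_centers_                 # shape (k, 2L)

def quantize_snippets(snippets, library):
    """Label each snippet with the index of its nearest trajecton."""
    X = np.asarray(snippets, dtype=np.float32)
    d2 = ((X[:, None, :] - library[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)                   # one label per snippet
```

In this sketch, build_trajecton_library would be run once over snippets pooled from the training videos, and quantize_snippets applied to every video, training or test.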
Note that no attempt is made to explicitly induce scale invariance, either spatial or temporal.
Figure 3. Example computed motion clusters for a video of a man jogging. Point color and shape indicate cluster assignment. The jogging person is oversegmented into four clusters; however, as each cluster's points are largely correct, this oversegmentation will have no effect on the end result. Note that each frame's motion segmentation is independent.
If a particular action occurs at different scales or speeds, then instances of that action are initially represented by different sets of trajectons, and it is at the classification stage that these instances are grouped together under a single label. This idea is consistent with typical bag-of-words approaches and allows the representation to discriminate between similar types of motion when necessary (e.g., running vs. jogging).

3.1.4 Video Classification
Following the standard bag-of-words framework, these trajecton labels are binned to produce a fixed-length histogram that is the final feature vector for a video. Given the nf trajectory snippets and associated trajecton labels for each video, the trajecton labels are accumulated over the entire video into a histogram with one bin per label, for k total bins. Each video's histogram is normalized to sum to one. Finally, videos are classified using support vector machines. A multi-class SVM is trained on the set of training histograms to produce video action classifications.
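The classification stage can be sketched in the same spirit; scikit-learn's SVC (a LIBSVM wrapper) stands in here for the LIBSVM setup reported in the paper, and the kernel and C value shown are placeholders rather than the settings used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC

def trajecton_histogram(labels, k):
    """Accumulate one video's trajecton labels into a normalized k-bin histogram."""
    hist = np.bincount(labels, minlength=k).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def train_action_classifier(train_histograms, train_labels, C=1.0):
    """Fit a multi-class SVM on the per-video trajecton histograms."""
    clf = SVC(kernel="linear", C=C)            # multi-class handled one-vs-one internally
    clf.fit(np.vstack(train_histograms), train_labels)
    return clf

# e.g. predictions = train_action_classifier(H_train, y_train).predict(np.vstack(H_test))
```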
3.2. Affine-Augmented Trajectons
Vanilla trajectons suffer from the deficit, relative to histograms of optical flow, that each trajecton contains only information about a single point while ignoring the motion of neighboring points. Since we want to preserve the property that a trajecton encodes information that is attached to a particular body, we cannot simply histogram derivatives of nearby trajectories, since that would confuse the trajectories of points co-located on the same body and foreign trajectories. Instead, we propose to first cluster the motions within the video into sets of trajectories which can be well described with mutually shared transforms; these motion clusters ideally fall within a single moving body. Each trajecton can then calculate local movement around itself according to the transforms for its motion cluster. Some example motion clusters can be seen in Fig. 3, in which a person moving his head and arm in independent ways has them properly assigned into different clusters.
For k motion clusters, the goal is to produce a set of assignments of trajectory snippets to clusters, and a set of cluster transforms, such that the error between how a trajectory is expected to evolve (as calculated by successively transforming the first found location of a given feature point according to a cluster center) and its actual historical record is minimized. This goal is achieved in a k-means-like manner in which trajectory snippets are first assigned to the centers that minimize error, and then center transforms are refined according to their assigned trajectory snippets. These two steps repeat until either convergence is reached or a fixed number of iterations have elapsed (in our implementation we limit to 20 iterations).
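The following sketch shows one plausible form of this alternation. It is a simplification of Secs. 3.2.1-3.2.2 below: for compactness the error is computed only through the cumulative transforms that map each past position onto the current frame, the transform model includes a translation term, and the cluster count, iteration cap, and initialization are illustrative rather than taken from the paper.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine A such that A [x, y, 1]^T ~ dst."""
    X = np.hstack([src, np.ones((len(src), 1))])        # (n, 3)
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)          # (3, 2)
    return A.T                                            # (2, 3)

def apply_affine(A, pts):
    return pts @ A[:, :2].T + A[:, 2]

def motion_clusters(snippets, n_clusters=6, n_iter=20, seed=0):
    """Alternate point-to-center assignment and center refinement.

    `snippets` has shape (n, L + 1, 2): absolute positions over the last
    L + 1 frames, with snippets[:, -1] being the current frame.  The error
    uses only the cumulative transforms onto the current frame."""
    n, T1, _ = snippets.shape
    assign = np.random.default_rng(seed).integers(n_clusters, size=n)
    for _ in range(n_iter):
        # Center refinement: one cumulative affine per past frame per cluster.
        centers = np.zeros((n_clusters, T1 - 1, 2, 3))
        for c in range(n_clusters):
            members = snippets[assign == c]
            if len(members) < 3:                          # too few points to fit
                centers[c, :, :, :2] = np.eye(2)          # fall back to identity
                continue
            for t in range(T1 - 1):
                centers[c, t] = fit_affine(members[:, t], members[:, -1])
        # Point-to-center assignment by total reprojection error.
        errors = np.zeros((n, n_clusters))
        for c in range(n_clusters):
            for t in range(T1 - 1):
                pred = apply_affine(centers[c, t], snippets[:, t])
                errors[:, c] += np.linalg.norm(pred - snippets[:, -1], axis=1)
        new_assign = errors.argmin(axis=1)
        if np.array_equal(new_assign, assign):            # converged
            break
        assign = new_assign
    return assign, centers
```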
3.2.1 Point to Center Assignment
In the center assignment step, each trajectory snippet is assigned to the center which minimizes the error between its predicted trajectory according to those transforms and its actual trajectory.
For a given trajectory $X_i$, let $X_i^t$ denote the location of feature i at time t, where time 0 is the current frame. Let $T^j_{a \to b}$ be the transform for center j from time $t_a$ to time $t_b$. In particular, let $T^j_{a \to 0}$ be the cumulative transform from a given time $t_a$ to the current frame $t_0$.

Then the error for a trajectory $X_i$ to a center $T^j$ is given by

$$e(X_i, T^j) = \frac{\sum_{t = s_{i,j}+1}^{0} \left\| T^j_{s_{i,j} \to t}\, X_i^{s_{i,j}} - X_i^t \right\|}{|s_{i,j}|},$$

where $s_{i,j}$ is the earliest time for which both the trajectory and the center have information. This is simply the average Euclidean distance between where a trajectory's next point is expected to be according to its transforms and its recorded position.

Each trajectory is simply assigned the center to which it has the least error:

$$a_i = \arg\min_j \; e(X_i, T^j).$$

3.2.2 Center Refinement

In the center refinement step, given a number of assigned trajectory snippets, each center's transforms (a set of affine transforms, one per frame) are re-estimated by solving the least squares minimization for the transforms.

Given a center with transforms $T^j$ and assigned points $X_1, X_2, \ldots, X_k$, we can refine the transforms by solving for the cumulative transforms $T^j_{t \to 0}$ according to

$$T^j_{t \to 0} \begin{bmatrix} X_1^t & X_2^t & \cdots & X_k^t \end{bmatrix} = \begin{bmatrix} X_1^0 & X_2^0 & \cdots & X_k^0 \end{bmatrix}.$$

Any other needed transforms can be calculated from the cumulative transforms.

3.2.3 Trajectory Snippet Augmentation

Each trajectory snippet now has an associated center assignment. For each frame in the snippet, along with the (dx, dy) information that it already contains, the snippet is augmented with the affine matrix $A_t^{C(i)}$ at that frame for the center $C(i)$ associated with it. Each trajectory snippet is of length 6L: 2L for the derivatives for each of L frames, and 4L for the parameters of the affine matrices.

4. Experiments

We evaluate our proposed method quantitatively on the Hollywood Actions [11] dataset and qualitatively on a custom YouTube dataset.

4.1. Hollywood Actions

Table 1. Hollywood Actions Results

Action         Ours    Ours (Lax)   Laptev et al. [11]
Total          31.1%   27.2%        27.1%
SitDown         4.5%   13.6%        20.7%
StandUp        69.0%   42.9%        40.0%
Kiss           71.4%   42.9%        36.5%
AnswerPhone     0.0%   35.0%        24.6%
HugPerson       0.0%   23.5%        17.4%
HandShake       5.3%    5.3%        12.1%
SitUp          11.1%   11.1%         5.7%
GetOutCar       7.7%    7.7%        14.9%

We evaluate on the Hollywood Actions dataset [11] in order to gauge the performance in a difficult scenario. We train and test on the "clean" (manually annotated) training and testing sets. We track 100 features which are clustered into 1000 trajectons using the AA-trajecton method with six motion clusters per frame. Classification is performed using an SVM with a linear kernel.

Per-class classification accuracies are presented in Table 1 with a comparison to Laptev et al.'s HOF features. As can be seen, with aggressive SVM settings we outperform HOF at the cost of concentrating most of the discriminative ability into a few classes (Hollywood Actions is an imbalanced dataset, with some actions representing significantly more than 1/8 of the total instances). With less aggressive SVM settings (labeled "lax" in Table 1), we still outperform HOF, and our gains can be seen in five out of the eight classes.

Figure 4. Sample frames from our YouTube dataset.
4.2. YouTube Dataset

As an exploration of how our trajecton motion representation can fare in a difficult retrieval task, we evaluate qualitative search results on a custom YouTube dataset. This dataset is composed of 2019 YouTube videos with an average length of approximately 2 minutes each (or approximately 66 hours of video), some frames of which can be seen in Fig. 4. Each raw YouTube video is split into overlapping sequences of 2 s such that one sequence starts each second of each video. A library of 400 trajectons was used, with 100 tracked features and 6 motion clusters, using the AA-trajecton method. Trajecton histograms are accumulated over these 2 s windows, and search is performed over all of these windows; as a result we are effectively searching over more than 230,000 clips.

4.2.1 Video Similarity with Trajectons

For video search, since we are doing a direct comparison between videos without an intermediate machine learning step, direct trajecton histogram comparisons with a chi-squared distance would be dominated by the more common trajecton bins. To account for this, following Chum et al. [5] we downweight trajectons that occur frequently using tf-idf weighting, the weights being placed into the diagonal matrix W (or equivalently, all the trajecton histograms are elementwise multiplied with the weighting vector w), and the histogram distance is simply given by the chi-squared distance between histograms weighted by w. The distance between two videos is then given by

$$d(m_1, m_2) = \text{chisqr}(W m_1, W m_2), \qquad (1)$$

where $m_i$ is the trajecton histogram for a video i. The actual search is performed by calculating the motion feature vector $m_i$ for the query video clip and then comparing it to all the dataset video clips using the distance metric in Eq. 1. The video clips with the smallest distance are then returned as matches.

We randomly chose 25 windows to act as queries; we have identified a number of common and interesting cases in the clips.
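Before turning to the qualitative cases, the retrieval comparison of Eq. 1 is simple enough to sketch. The idf-style weighting below is an illustrative stand-in for the tf-idf scheme of Chum et al. [5] that the paper follows; the function names and the top-n interface are ours.

```python
import numpy as np

def idf_weights(clip_histograms):
    """Downweight trajectons that occur in many clips (idf-style);
    clip_histograms has shape (n_clips, k)."""
    doc_freq = (clip_histograms > 0).sum(axis=0)
    return np.log((len(clip_histograms) + 1.0) / (doc_freq + 1.0))

def chi_squared(a, b, eps=1e-12):
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

def search(query_hist, clip_histograms, w, top_n=3):
    """Rank all clips by the weighted chi-squared distance of Eq. 1."""
    d = [chi_squared(w * query_hist, w * h) for h in clip_histograms]
    return np.argsort(d)[:top_n]               # indices of the closest clips
```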
4.2.2 Human Body Motion

Performance on whole-body human motions (such as raising one's hands up) is reasonable. In the pictured example (top row of Fig. 5), the best match is a false positive of people bouncing up and down in a roller coaster (mimicking the up and down motion of the query video person's arms). However, the following two matches are correct.

4.2.3 Camera Motion (not shown here)

Performance in finding similar camera movements (e.g., panning, zooming) is very good, as expected. However, since camera motions are so simple there is often little semantic agreement between the matches (e.g., a POV shot from a car driving along a road is motion-wise a match for a zooming camera shot of a soccer game, but is semantically unrelated).

4.2.4 Small Motion (Shaky Cam Failure Case)

Many scenes are largely static but filmed with handheld cameras, which introduces small whole-scene movements. Since these movements are relatively strong and correlated over the entire scene, the whole-scene shaking dominates the motion distance. Any semantically important motions are lost in the movement, and the returned matches are scenes where the camera shaking shares similar statistical properties rather than semantically interesting matches. The tf-idf weighting is unable to help, since trajectons produced by strong random motion are uncorrelated with each other and hence individually uncommon, so they will not be downweighted by tf-idf.

In the pictured example (third row of Fig. 5), a video of a baby taken with a handheld camera is incorrectly matched to a number of unrelated scenes also filmed through handheld cameras. The third match of the motion-only case is arguably a correct match. These failure cases can largely be avoided by preprocessing the video using a standard stabilization technique that eliminates dominant frame motion.

4.2.5 Static Scenes (not shown here)

The most common types of video in our YouTube sample are those showing static images to accompany music. Such static scenes are trivially matched to other static scenes by our technique, but trajectons alone (since they focus exclusively on motion) are insufficient to determine whether the content present in different static scenes is semantically related.

4.2.6 Interviews

Another common case is a person or group of people talking directly to a camera. This results in a combination of handheld camera movement and person-like movement. In the "Interview 1" section of Fig. 5, a small video of a cowboy musician interview has been correctly matched to two interviews and incorrectly to a basketball game with similar camera movement.

In the "Interview 2 (talking)" example the camera is completely stationary and the person is quite still as well, with mouth motion being the most prominent. This is correctly matched to two scenes of people talking, and surprisingly correctly matched to a cartoon character talking (likely because of the exaggerated mouth movements produced by cartoon characters).

5. Conclusions

We present a novel and concise method for using the trajectories of tracked feature points in a bag-of-words paradigm for video action recognition. Compared to existing motion features (optical flow, silhouettes, derivatives), our quantized trajectory words, or trajectons, are able to take advantage of the positive features of each class: the computational efficiency of derivatives and sparse features, the performance of optical flow, and the deep structure of silhouettes. However, on this third point we have barely scratched the surface of using the structured information content of trajectories, and we believe that there are still significant gains to be made in this area. Future work will concentrate on how to use the information within trajectories without making assumptions on trajectory consistency or location.

5.1. Acknowledgements

This work was supported in part by NSF Grant IIS-0534962 and by the ERC program under Grant No. EEC-0540865.

Figure 5. Top matches for query videos on the YouTube dataset.

References

[1] S. Ali, A. Basharat, and M. Shah. Chaotic invariants for human action recognition. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2007.
[2] S. Birchfield. KLT: An implementation of the Kanade-Lucas-Tomasi feature tracker, 2007.
[3] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In Proceedings of International Conference on Computer Vision, 2005.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
[5] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total Recall: Automatic query expansion with a generative feature model for object retrieval. In Proceedings of International Conference on Computer Vision, 2007.
[6] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
[7] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.
[8] A. Fathi and G. Mori. Action recognition by learning mid-level motion features. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[9] I. N. Junejo, E. Dexter, I. Laptev, and P. Pérez. Cross-view action recognition from temporal self-similarities. In Proceedings of European Conference on Computer Vision, 2008.
[10] Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In Proceedings of International Conference on Computer Vision, June 2007.
[11] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[12] O. Masoud and N. Papanikolopoulos. A method for human action recognition. Image and Vision Computing, 21:729–743, 2003.
[13] V. Rabaud and S. Belongie. Re-thinking non-rigid structure from motion. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[14] C. Rao, A. Yilmaz, and M. Shah. View-invariant representation and recognition of actions. International Journal of Computer Vision, 50(2), 2002.
[15] L. W. Renninger and J. Malik. When is scene identification just texture recognition? Vision Research, 44(19), 2004.
[16] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[17] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In Proceedings of International Conference on Pattern Recognition, 2004.
[18] E. Shechtman and M. Irani. Space-time behavior-based correlation—or—how to tell if two underlying motion fields are similar without computing them? IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11), 2007.
[19] S. Vitaladevuni, V. Kellokumpu, and L. Davis. Action recognition using ballistic dynamics. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[20] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2), 2006.
[21] J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In Proceedings of International Conference on Computer Vision, 2005.
[22] P. Yan, S. Khan, and M. Shah. Learning 4D action feature models for arbitrary view action recognition. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[23] A. Yilmaz and M. Shah. Actions sketch: A novel action representation. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2005.