Trajectons: Action Recognition Through the Motion Analysis of Tracked Features
Pyry Matikainen (1), Martial Hebert (1), Rahul Sukthankar (2,1)
(1) Robotics Institute, Carnegie Mellon University    (2) Intel Labs Pittsburgh
pmatikai@cs.cmu.edu, hebert@ri.cmu.edu, rahuls@cs.cmu.edu
Abstract
The defining feature of video compared to still images is motion, and as such the selection of good motion features for action recognition is crucial, especially for bag-of-words techniques that rely heavily on their features. Existing motion techniques either assume that a difficult problem like background/foreground segmentation has already been solved (contour/silhouette based techniques) or are computationally expensive and prone to noise (optical flow). We present a technique for motion based on quantized trajectory snippets of tracked features. These quantized snippets, or trajectons, rely only on simple feature tracking and are computationally efficient. We demonstrate that within a bag-of-words framework trajectons can match state-of-the-art results, slightly outperforming histogram of optical flow features on the Hollywood Actions dataset. Additionally, we present qualitative results in a video search task on a custom dataset of challenging YouTube videos.
Figure 1. Simple KLT feature tracking is used to track as many features as possible within a video. Each tracked point produces a fixed-length trajectory snippet every frame, consisting of the last L (usually ten) positions in its trajectory. These snippets are quantized to a library of trajectons.
1. Introduction
The recent rise in popularity of the bag-of-words paradigm for action recognition in video has led to significant gains in performance and the introduction of more challenging datasets to tackle. Indeed, some of these techniques can achieve near perfect performance where more principled approaches only produce mediocre results. Nevertheless, their success has highlighted the fact that significant work remains to be done in the area of features for video, particularly with regard to motion.
Coming from work with static images, it is no surprise that the familiar techniques of that area have come to be applied here, such as the ever-popular histograms of oriented gradients. When dealing with the appearance of frames in video, these types of features are entirely appropriate. But actions in video comprise both an appearance aspect and a motion aspect, and motion features have not had the benefit of inheriting a set of canonical techniques honed over decades.
This leads us to believe that there are still significant gains to be captured with motion features, and since it seems that actions are defined more by their motion than by their incidental appearance, this problem is doubly important.
Present bag-of-words techniques for motion are largely extensions of appearance-based techniques. Since motion information cannot be read directly from the video as in the case of appearance, it is necessary to either explicitly estimate this latent information or to implicitly encapsulate it within an appearance measure. In the latter case, image derivatives treating time as simply another dimension (producing an image cube) implicitly encode some amount of motion information; this approach amounts to augmenting 2D filters to 3D. However, as the frame rate (that is, the sampling frequency in the time dimension) is rather coarse compared to the image resolution, any significant motion will be poorly represented through 3D image gradients. The former approach, explicit estimation of motion, has proved to be more popular, most often in the form of optical flow.
However, optical flow is itself a difficult problem to tackle, and even the best algorithms are noisy and computationally expensive. While in principle it should be possible to track a feature using optical flow by simply integrating a trajectory over the time-varying flow field, in practice the significant noise will quickly overwhelm any estimated trajectory. In contrast, even simple KLT trackers [2] can, with relative robustness, track a sparse set of features over relatively large periods of time at only a fraction of the computational cost of optical flow.
Compared with a cube of optical flow vectors, the key advantage of a trajectory is that it is attached to a particular moving feature. That is, in video deriving from the movement of physical bodies through space, a properly tracked feature (and hence trajectory) automatically gains foreground-background separation. In contrast, histogramming over a cube of optical flow vectors will blend the various sources of motion within that cube. For simple videos, such as a single person against a static background, this conflation of foreground and background may not matter. For more complicated videos with background motion and multiple occluding actors, this conflation comes at a cost.
To this end we introduce new trajectory-based motion features, which we call trajectons. We demonstrate that within a bag-of-words framework our simple and computationally efficient trajecton features are competitive with state-of-the-art motion features on simple datasets and outperform them on complicated ones.
2. Related Work
The idea of textons, or quantized filter bank responses, originated with texture classification but quickly grew to be applied to object and scene recognition in static images [21, 6]. These approaches are backed by some degree of psychological research suggesting that even human vision may employ unstructured statistical approaches at early stages [15], and can scale well even to large datasets when used with sparse features [5]. By analogy to the idea that textual works are composed from a set of discrete repeating elements (words), techniques that model data using a library of quantized features are generally known as bag-of-words approaches.
Recently, bag-of-words techniques have gained significant popularity in video interpretation [11]. In the case of video appearance, the same features that work for static images still apply to video (e.g., histograms of oriented gradients, filter responses). However, in the case of motion it is not possible to directly read off the motion of the video in the same way as appearance can be directly read from pixels, and so the choice of motion features is complicated significantly.
The philosophically closest measure to motion pixels is dense optical flow, which has been a very common representation of motion [11, 8, 10, 18].
Dense optical flow has the benefit of intuitively representing our notion of motion in a way directly analogous to pixels and appearance, but the actual calculation of optical flow from sequential frames is itself a difficult problem, and even the best algorithms are plagued by severe artifacts and noise. In an attempt to sidestep this problem, many techniques avoid the actual calculation of optical flow, either by implicitly encoding motion through temporal derivatives [16, 12, 7], or by producing the information that would be required to compute optical flow but refraining from the final step [18].

A more radical approach is to discard the notion of a dense motion field computed over the entire video and only compute motion at locations where it can be effectively estimated. Since most applications are interested in classifying the motion of actors distinct from their backgrounds, a natural desire is to only compute that motion which is relevant to those actors. Due to the intuitive appeal of the idea that the evolution of a silhouette through time is enough to capture the motion of an action, silhouette-based features have been common as well [23, 3, 10, 19]. However, silhouettes cannot represent motion that occurs within the silhouette boundary (like a person clapping with both hands in front of her body), so the natural extension has been from silhouettes to visual hulls [22, 20].
As silhouette extraction is not trivial either, the next step is to discard both density and semantic meaning to simply find special locations of interest (wherever they may be, on a person or not) for which the motion can be effectively computed. When these locations are point-like, that is, occurring at a single time, the result is spacetime interest points around which various measures can be calculated, such as optical flow [7, 23].
When these locations extend in time they become trajectories, most frequently arising from tracked features. Often it is assumed that these trajectories are from known, fixed features, such as particular landmarks (e.g., elbows, hands) on a human body [1, 14, 9]. If the trajectories are not on known features, then if they are very long and robust, it is potentially possible to extract full 3D information even from a single view [13]. The duration and coherence of trajectories means that each potentially contains a great deal of information.
Our contribution is to bring trajectories into a bag-of-words framework while avoiding the pitfalls of existing trajectory-based motion features: the assumption that trajectories are long, noise free, or tracking known body landmarks. At the same time, we do not discard the fundamental time-series nature of trajectories by treating them as merely an unrelated series of derivatives to be binned. Our method deals with short (< 20 frames) and inconsistent trajectories, and we are able to use computationally efficient stock feature tracking methods such as KLT even on complex video.
Since we are able to match the performance of optical flow based methods even with our naive system, we believe there are still significant gains to be made with the combination of tracked feature point trajectories and bag-of-words techniques.

3. Method

Our method proceeds according to the standard bag-of-words approach: first, features are tracked over the video using a KLT tracker [2] to produce feature trajectories (x and y positions over time) for a number of features. These trajectories are slightly transformed (cropped and filtered, as described later) to produce a number of trajectory snippets for each video. Given a training set of videos, a dictionary of trajectory words, or trajectons, is first produced by clustering a sample set of trajectory snippets into a specified number (k) of clusters, the centers of which are retained as the trajecton library. Next, for each video, either training or test, its trajectory snippets are assigned the label of the nearest center in the trajecton library, and these labels are accumulated over the video to produce a histogram with k bins. This k-length vector is normalized to sum to one for each video, and the training set of histograms, along with training class labels, is used to train a support vector machine (SVM) to classify videos into action categories. This SVM is then used to classify the test set. The experiments shown below employ the standard LIBSVM [4] implementation of support vector machines.
We propose two variants of trajectons, which differ in their construction of trajectory snippets. For vanilla trajectons, each trajectory snippet is simply a concatenated vector of (dx, dy) derivatives for the trajectory, whereas in Affine-Augmented (AA) trajectons, this vector of derivatives is concatenated with a vector of local affine transforms that describe the motion around each trajectory point.
3.1. Vanilla trajectons
3.1.1 Feature Tracking
A standard KLT tracker is used to track features (using "good features to track") over a video. In our implementation, we track a fixed number of features (typically 100), with features replaced as necessary when tracks are lost. The output of this tracking is a trace of (x, y) pairs for each feature. For convenience of notation, we can assume that feature indices are never reused; then we can express a feature i's position at time t as $X_i^t = (x_i^t, y_i^t)$.
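Since the tracker is a stock component, the tracking loop is easy to sketch. The following minimal illustration uses OpenCV's pyramidal Lucas-Kanade routines rather than the Birchfield KLT implementation [2] used in the paper; parameter values and function names are our own placeholders, not the settings from the experiments.

```python
import cv2
import numpy as np

def track_features(video_path, n_features=100):
    """Track a fixed pool of KLT features through a video.

    Returns a list of trajectories; each trajectory is a list of
    (frame_index, x, y) samples, and feature indices are never reused."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=n_features,
                                  qualityLevel=0.01, minDistance=8)
    trajectories = [[(0, float(x), float(y))] for ((x, y),) in pts]
    active = list(range(len(trajectories)))   # trajectory index per tracked point
    t = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t += 1
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        survivors, surv_idx = [], []
        for p, s, idx in zip(new_pts, status.ravel(), active):
            if s:                              # the track survived this frame
                x, y = p.ravel()
                trajectories[idx].append((t, float(x), float(y)))
                survivors.append(p)
                surv_idx.append(idx)
        # Replace lost tracks with freshly detected "good features to track".
        n_missing = n_features - len(survivors)
        if n_missing > 0:
            fresh = cv2.goodFeaturesToTrack(gray, maxCorners=n_missing,
                                            qualityLevel=0.01, minDistance=8)
            for ((x, y),) in (fresh if fresh is not None else []):
                trajectories.append([(t, float(x), float(y))])
                surv_idx.append(len(trajectories) - 1)
                survivors.append(np.array([[x, y]], dtype=np.float32))
        if not survivors:                      # nothing left to track
            break
        pts = np.array(survivors, dtype=np.float32)
        active = surv_idx
        prev = gray
    return trajectories
```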
3.1.2 Trajectory Snippet Production
Then, for each frame, each feature that exists during this frame produces a trajectory snippet that consists of the discrete derivatives of the feature point locations in time.
Figure 2. Example trajectons in the trajecton library computed from the KTH dataset. Many, such as the long and straight trajectories and the curving arcs, correspond to stereotypical portions of specific actions in the KTH dataset (running, waving arms).
In other words, given a frame time t and feature i, and a maximum snippet length L, the trajectory snippet produced is:
$$T_i^t = \left\{ X_i^t - X_i^{t-1},\; X_i^{t-1} - X_i^{t-2},\; \ldots,\; X_i^{t-L+1} - X_i^{t-L} \right\},$$

where if $X_i^j$ does not exist for a given time, any terms containing it are set to zero.
Since $X_i$ includes both x and y position, the full flattened vector will be of length 2L. If the number of tracked features is fixed at n, and a video has f frames, this means that the total number of trajectory snippets (and hence eventually trajectons) will be nf. Also, note that if a feature is tracked for longer than L frames, every window of size L in that trajectory produces its own snippet.
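To make the construction concrete, a short sketch (function and variable names are ours, not from the paper) that turns one tracked trajectory into its per-frame, zero-padded 2L-dimensional snippets:

```python
import numpy as np

def trajectory_snippets(track, L=10):
    """Turn one trajectory, given as a time-ordered list of (x, y)
    positions, into a 2L-dimensional snippet of (dx, dy) derivatives
    for every frame of the track.  Terms that would reach back before
    the start of the track are left at zero, as in the text."""
    snippets = []
    for t in range(1, len(track)):            # skip the degenerate all-zero first frame
        snippet = np.zeros(2 * L, dtype=np.float32)
        for k in range(L):
            a, b = t - k, t - k - 1           # X^{t-k} - X^{t-k-1}
            if b < 0:
                break                         # the feature did not exist yet
            snippet[2 * k] = track[a][0] - track[b][0]       # dx
            snippet[2 * k + 1] = track[a][1] - track[b][1]   # dy
        snippets.append(snippet)
    return snippets
```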
3.1.3 Trajectory Snippet Clustering and Quantization
Next, these trajectory snippets are clustered into a library, and that library is used to quantize snippets to a set of labels. Examples, selected at random, from the trajecton library computed for the KTH dataset [17] are shown in Fig. 2. Given a sample set of trajectory snippets (vectors of length 2L), these snippets are clustered using k-means with the standard Euclidean distance metric into k clusters, and the cluster centers are stored in a library of k trajectons. These trajectons represent archetypal trajectories within the video set.
The trajectory snippets of each video (both training and test) are quantized using the trajecton library by assigning each trajectory snippet the index of the trajecton to which it has the smallest Euclidean distance.
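A minimal sketch of the library construction and quantization steps follows; the paper specifies only Euclidean k-means, so the use of scikit-learn here and the default parameters are our own illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_trajecton_library(sample_snippets, k=1000, seed=0):
    """Cluster a sample of 2L-dimensional snippets with Euclidean
    k-means; the k cluster centers form the trajecton library."""
    snips = np.asarray(sample_snippets, dtype=np.float32)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(snips)
    return km.cluster_centers_                 # shape (k, 2L)

def quantize_snippets(snippets, library):
    """Label each snippet with the index of its nearest trajecton."""
    X = np.asarray(snippets, dtype=np.float32)
    d2 = ((X[:, None, :] - library[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)                   # one label per snippet
```

In this sketch, build_trajecton_library would be run once over snippets pooled from the training videos, and quantize_snippets applied to every video, training or test.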
Note that no attempt is made to explicitly induce scale invariance, either spatial or temporal.
Figure 3. Example computed motion clusters for a video of a man jogging. Point color and shape indicate cluster assignment. The jogging person is oversegmented into four clusters; however, as each cluster's points are largely correct, this oversegmentation will have no effect on the end result. Note that each frame's motion segmentation is independent.
If a particular action occurs at different scales or speeds, then instances of that action are initially represented by different sets of trajectons, and it is at the classification stage that these instances are grouped together under a single label. This idea is consistent with typical bag-of-words approaches and allows the representation to discriminate between similar types of motion when necessary (e.g., running vs. jogging).

3.1.4 Video Classification
Following the standard bag-of-words framework, these trajecton labels are binned to produce a fixed-length histogram that is the final feature vector for a video. Given the nf trajectory snippets and associated trajecton labels for each video, the trajecton labels are accumulated over the entire video into a histogram with one bin per label, for k total bins. Each video's histogram is normalized to sum to one. Finally, videos are classified using support vector machines. A multi-class SVM is trained on the set of training histograms to produce video action classifications.
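The classification stage can be sketched in the same spirit; scikit-learn's SVC (a LIBSVM wrapper) stands in here for the LIBSVM setup reported in the paper, and the kernel and C value shown are placeholders rather than the settings used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC

def trajecton_histogram(labels, k):
    """Accumulate one video's trajecton labels into a normalized k-bin histogram."""
    hist = np.bincount(labels, minlength=k).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def train_action_classifier(train_histograms, train_labels, C=1.0):
    """Fit a multi-class SVM on the per-video trajecton histograms."""
    clf = SVC(kernel="linear", C=C)            # multi-class handled one-vs-one internally
    clf.fit(np.vstack(train_histograms), train_labels)
    return clf

# e.g. predictions = train_action_classifier(H_train, y_train).predict(np.vstack(H_test))
```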
3.2. Affine-Augmented Trajectons
Vanilla trajectons suffer from the deficit, relative to histograms of optical flow, that each trajecton contains only information about a single point while ignoring the motion of neighboring points. Since we want to preserve the property that a trajecton encodes information that is attached to a particular body, we cannot simply histogram derivatives of nearby trajectories, since that would confuse the trajectories of points co-located on the same body and foreign trajectories. Instead, we propose to first cluster the motions within the video into sets of trajectories which can be well described with mutually shared transforms; these motion clusters ideally fall within a single moving body. Each trajecton can then calculate local movement around itself according to the transforms for its motion cluster. Some example motion clusters can be seen in Fig. 3, in which a person moving his head and arm in independent ways has them properly assigned into different clusters.
For k motion clusters, the goal is to produce a set of assignments of trajectory snippets to clusters, and a set of cluster transforms, such that the error between how a trajectory is expected to evolve (as calculated by successively transforming the first found location of a given feature point according to a cluster center) and its actual historical record is minimized. This goal is achieved in a k-means-like manner in which trajectory snippets are first assigned to the centers that minimize error, and then center transforms are refined according to their assigned trajectory snippets. These two steps repeat until either convergence is reached or a fixed number of iterations have elapsed (in our implementation we limit to 20 iterations).
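The following sketch shows one plausible form of this alternation. It is a simplification of Secs. 3.2.1-3.2.2 below: for compactness the error is computed only through the cumulative transforms that map each past position onto the current frame, the transform model includes a translation term, and the cluster count, iteration cap, and initialization are illustrative rather than taken from the paper.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine A such that A [x, y, 1]^T ~ dst."""
    X = np.hstack([src, np.ones((len(src), 1))])        # (n, 3)
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)          # (3, 2)
    return A.T                                            # (2, 3)

def apply_affine(A, pts):
    return pts @ A[:, :2].T + A[:, 2]

def motion_clusters(snippets, n_clusters=6, n_iter=20, seed=0):
    """Alternate point-to-center assignment and center refinement.

    `snippets` has shape (n, L + 1, 2): absolute positions over the last
    L + 1 frames, with snippets[:, -1] being the current frame.  The error
    uses only the cumulative transforms onto the current frame."""
    n, T1, _ = snippets.shape
    assign = np.random.default_rng(seed).integers(n_clusters, size=n)
    for _ in range(n_iter):
        # Center refinement: one cumulative affine per past frame per cluster.
        centers = np.zeros((n_clusters, T1 - 1, 2, 3))
        for c in range(n_clusters):
            members = snippets[assign == c]
            if len(members) < 3:                          # too few points to fit
                centers[c, :, :, :2] = np.eye(2)          # fall back to identity
                continue
            for t in range(T1 - 1):
                centers[c, t] = fit_affine(members[:, t], members[:, -1])
        # Point-to-center assignment by total reprojection error.
        errors = np.zeros((n, n_clusters))
        for c in range(n_clusters):
            for t in range(T1 - 1):
                pred = apply_affine(centers[c, t], snippets[:, t])
                errors[:, c] += np.linalg.norm(pred - snippets[:, -1], axis=1)
        new_assign = errors.argmin(axis=1)
        if np.array_equal(new_assign, assign):            # converged
            break
        assign = new_assign
    return assign, centers
```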
3.2.1 Point to Center Assignment
In the center assignment step, each trajectory snippet is assigned to the center which minimizes the error between its predicted trajectory according to those transforms and its actual trajectory.
For a given trajectory $X_i$, let $X_i^t$ denote the location of feature i at time t, where time 0 is the current frame. Let $T^j_{a \to b}$ be the transform for center j from time $t_a$ to time $t_b$. In particular, let $T^j_{a \to 0}$ be the cumulative transform from a given time $t_a$ to the current frame $t_0$.

Then the error for a trajectory $X_i$ to a center $T^j$ is given by

$$e(X_i, T^j) = \frac{\sum_{t = s_{i,j}+1}^{0} \left\| T^j_{s_{i,j} \to t}\, X_i^{s_{i,j}} - X_i^t \right\|}{|s_{i,j}|},$$

where $s_{i,j}$ is the earliest time for which both the trajectory and the center have information. This is simply the average Euclidean distance between where a trajectory's next point is expected to be according to its transforms and its recorded position.

Each trajectory is simply assigned the center to which it has the least error:

$$a_i = \arg\min_j \; e(X_i, T^j).$$

3.2.2 Center Refinement

In the center refinement step, given a number of assigned trajectory snippets, each center's transforms (a set of affine transforms, one per frame) are re-estimated by solving the least squares minimization for the transforms.

Given a center with transforms $T^j$ and assigned points $X_1, X_2, \ldots, X_k$, we can refine the transforms by solving for the cumulative transforms $T^j_{t \to 0}$ according to

$$T^j_{t \to 0} \begin{bmatrix} X_1^t & X_2^t & \cdots & X_k^t \end{bmatrix} = \begin{bmatrix} X_1^0 & X_2^0 & \cdots & X_k^0 \end{bmatrix}.$$

Any other needed transforms can be calculated from the cumulative transforms.

3.2.3 Trajectory Snippet Augmentation

Each trajectory snippet now has an associated center assignment. For each frame in the snippet, along with the (dx, dy) information that it already contains, the snippet is augmented with the affine matrix $A_t^{C(i)}$ at that frame for the center $C(i)$ associated with it. Each trajectory snippet is of length 6L: 2L for the derivatives for each of L frames, and 4L for the parameters of the affine matrices.

4. Experiments

We evaluate our proposed method quantitatively on the Hollywood Actions [11] dataset and qualitatively on a custom YouTube dataset.

4.1. Hollywood Actions

Table 1. Hollywood Actions Results

Action         Ours    Ours (Lax)   Laptev et al. [11]
Total          31.1%   27.2%        27.1%
SitDown         4.5%   13.6%        20.7%
StandUp        69.0%   42.9%        40.0%
Kiss           71.4%   42.9%        36.5%
AnswerPhone     0.0%   35.0%        24.6%
HugPerson       0.0%   23.5%        17.4%
HandShake       5.3%    5.3%        12.1%
SitUp          11.1%   11.1%         5.7%
GetOutCar       7.7%    7.7%        14.9%

We evaluate on the Hollywood Actions dataset [11] in order to gauge the performance in a difficult scenario. We train and test on the "clean" (manually annotated) training and testing sets. We track 100 features which are clustered into 1000 trajectons using the AA-trajecton method with six motion clusters per frame. Classification is performed using an SVM with a linear kernel.

Per-class classification accuracies are presented in Table 1 with a comparison to Laptev et al.'s HOF features. As can be seen, with aggressive SVM settings we outperform HOF at the cost of concentrating most of the discriminative ability into a few classes (Hollywood Actions is an imbalanced dataset, with some actions representing significantly more than 1/8 of the total instances). With less aggressive SVM settings (labeled "lax" in Table 1), we still outperform HOF, and our gains can be seen in five out of the eight classes.

Figure 4. Sample frames from our YouTube dataset.
4.2. YouTube Dataset

As an exploration of how our trajecton motion representation can fare in a difficult retrieval task, we evaluate qualitative search results on a custom YouTube dataset. This dataset is composed of 2019 YouTube videos with an average length of approximately 2 minutes each (or approximately 66 hours of video), some frames of which can be seen in Fig. 4. Each raw YouTube video is split into overlapping sequences of 2 s such that one sequence starts each second of each video. A library of 400 trajectons was used, with 100 tracked features and 6 motion clusters, using the AA-trajecton method. Trajecton histograms are accumulated over these 2 s windows, and search is performed over all of these windows; as a result we are effectively searching over more than 230,000 clips.

4.2.1 Video Similarity with Trajectons

For video search, since we are doing a direct comparison between videos without an intermediate machine learning step, direct trajecton histogram comparisons with a chi-squared distance would be dominated by the more common trajecton bins. To account for this, following Chum et al. [5] we downweight trajectons that occur frequently using tf-idf weighting, the weights being placed into the diagonal matrix W (or equivalently, all the trajecton histograms are elementwise multiplied with the weighting vector w), and the histogram distance is simply given by the chi-squared distance between histograms weighted by w. The distance between two videos is then given by

$$d(m_1, m_2) = \text{chisqr}(W m_1, W m_2), \qquad (1)$$

where $m_i$ is the trajecton histogram for a video i. The actual search is performed by calculating the motion feature vector $m_i$ for the query video clip and then comparing it to all the dataset video clips using the distance metric in Eq. 1. The video clips with the smallest distance are then returned as matches.

We randomly chose 25 windows to act as queries; we have identified a number of common and interesting cases in the clips.
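Before turning to the qualitative cases, the retrieval comparison of Eq. 1 is simple enough to sketch. The idf-style weighting below is an illustrative stand-in for the tf-idf scheme of Chum et al. [5] that the paper follows; the function names and the top-n interface are ours.

```python
import numpy as np

def idf_weights(clip_histograms):
    """Downweight trajectons that occur in many clips (idf-style);
    clip_histograms has shape (n_clips, k)."""
    doc_freq = (clip_histograms > 0).sum(axis=0)
    return np.log((len(clip_histograms) + 1.0) / (doc_freq + 1.0))

def chi_squared(a, b, eps=1e-12):
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

def search(query_hist, clip_histograms, w, top_n=3):
    """Rank all clips by the weighted chi-squared distance of Eq. 1."""
    d = [chi_squared(w * query_hist, w * h) for h in clip_histograms]
    return np.argsort(d)[:top_n]               # indices of the closest clips
```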
4.2.2 Human Body Motion

Performance on whole-body human motions (such as raising one's hands up) is reasonable. In the pictured example (top row of Fig. 5), the best match is a false positive of people bouncing up and down in a roller coaster (mimicking the up and down motion of the query video person's arms). However, the following two matches are correct.

4.2.3 Camera Motion (not shown here)

Performance in finding similar camera movements (e.g., panning, zooming) is very good, as expected. However, since camera motions are so simple there is often little semantic agreement between the matches (e.g., a POV shot from a car driving along a road is motion-wise a match for a zooming camera shot of a soccer game, but is semantically unrelated).

4.2.4 Small Motion (Shaky Cam Failure Case)

Many scenes are largely static but filmed with handheld cameras, which introduces small whole-scene movements. Since these movements are relatively strong and correlated over the entire scene, the whole-scene shaking dominates the motion distance. Any semantically important motions are lost in the movement, and the returned matches are scenes where the camera shaking shares similar statistical properties rather than semantically interesting matches. The tf-idf weighting is unable to help, since trajectons produced by strong random motion are uncorrelated with each other and hence individually uncommon, so they will not be downweighted by tf-idf.

In the pictured example (third row of Fig. 5), a video of a baby taken with a handheld camera is incorrectly matched to a number of unrelated scenes also filmed through handheld cameras. The third match of the motion-only case is arguably a correct match. These failure cases can largely be avoided by preprocessing the video using a standard stabilization technique that eliminates dominant frame motion.

4.2.5 Static Scenes (not shown here)

The most common types of video in our YouTube sample are those showing static images to accompany music. Such static scenes are trivially matched to other static scenes by our technique, but trajectons alone (since they focus exclusively on motion) are insufficient to determine whether the content present in different static scenes is semantically related.

4.2.6 Interviews

Another common case is a person or group of people talking directly to a camera. This results in a combination of handheld camera movement and person-like movement. In the "Interview 1" section of Fig. 5, a small video of a cowboy musician interview has been correctly matched to two interviews and incorrectly to a basketball game with similar camera movement.

In the "Interview 2 (talking)" example the camera is completely stationary and the person is quite still as well, with mouth motion being the most prominent. This is correctly matched to two scenes of people talking, and surprisingly correctly matched to a cartoon character talking (likely because of the exaggerated mouth movements produced by cartoon characters).

5. Conclusions

We present a novel and concise method for using the trajectories of tracked feature points in a bag-of-words paradigm for video action recognition. Compared to existing motion features (optical flow, silhouettes, derivatives), our quantized trajectory words, or trajectons, are able to take advantage of the positive features of each class: the computational efficiency of derivatives and sparse features, the performance of optical flow, and the deep structure of silhouettes. However, on this third point we have barely scratched the surface of using the structured information content of trajectories, and we believe that there are still significant gains to be made in this area. Future work will concentrate on how to use the information within trajectories without making assumptions on trajectory consistency or location.

5.1. Acknowledgements

This work was supported in part by NSF Grant IIS-0534962 and by the ERC program under Grant No. EEC-0540865.

Figure 5. Top matches for query videos on the YouTube dataset.

References

[1] S. Ali, A. Basharat, and M. Shah. Chaotic invariants for human action recognition. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2007.
[2] S. Birchfield. KLT: An implementation of the Kanade-Lucas-Tomasi feature tracker, 2007.
[3] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In Proceedings of International Conference on Computer Vision, 2005.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
[5] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total Recall: Automatic query expansion with a generative feature model for object retrieval. In Proceedings of International Conference on Computer Vision, 2007.
[6] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
[7] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.
[8] A. Fathi and G. Mori. Action recognition by learning mid-level motion features. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[9] I. N. Junejo, E. Dexter, I. Laptev, and P. Pérez. Cross-view action recognition from temporal self-similarities. In Proceedings of European Conference on Computer Vision, 2008.
[10] Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In Proceedings of International Conference on Computer Vision, June 2007.
[11] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[12] O. Masoud and N. Papanikolopoulos. A method for human action recognition. Image and Vision Computing, 21:729–743, 2003.
[13] V. Rabaud and S. Belongie. Re-thinking non-rigid structure from motion. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[14] C. Rao, A. Yilmaz, and M. Shah. View-invariant representation and recognition of actions. International Journal of Computer Vision, 50(2), 2002.
[15] L. W. Renninger and J. Malik. When is scene identification just texture recognition? Vision Research, 44(19), 2004.
[16] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[17] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In Proceedings of International Conference on Pattern Recognition, 2004.
[18] E. Shechtman and M. Irani. Space-time behavior-based correlation—or—how to tell if two underlying motion fields are similar without computing them? IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11), 2007.
[19] S. Vitaladevuni, V. Kellokumpu, and L. Davis. Action recognition using ballistic dynamics. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[20] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2), 2006.
[21] J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In Proceedings of International Conference on Computer Vision, 2005.
[22] P. Yan, S. Khan, and M. Shah. Learning 4D action feature models for arbitrary view action recognition. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[23] A. Yilmaz and M. Shah. Actions sketch: A novel action representation. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2005.