Journal of Quantitative Linguistics
2010, Volume 17, Number 2, pp. 142–166
DOI: 10.1080/092961710033213

Chinese Word Frequency Approximation Based on Multitype Corpora*

Wei Qiao¹, Maosong Sun¹ and Wolfgang Menzel²
¹Tsinghua University, China; ²Hamburg University, Germany

ABSTRACT
Due to the nature of Chinese, a perfect word-segmented Chinese corpus that is ideal for the task of word frequency estimation may never exist. Therefore, a reliable estimation of Chinese word frequencies remains a challenge. Currently, three types of corpora can be considered for this purpose: raw corpora, automatically word-segmented corpora, and manually word-segmented corpora. As each type has its own advantages and drawbacks, none of them is sufficient alone. In this article, we propose a hybrid scheme which utilizes existing corpora of different types for word frequency approximation. Experiments have been performed from statistical and application-oriented perspectives. We demonstrate that, compared with other schemes, the proposed scheme is the most effective one and leads to better word frequency approximation results.

*Address correspondence to: Wei Qiao, Department of Computer Science and Technology, National Lab on Information Science and Technology, Tsinghua University, Beijing, 100084, China. E-mail: qiaow04@mails.tsinghua.edu.cn

1. INTRODUCTION

Chinese word frequencies play important roles in many applications related to natural language processing. For example, they are important for establishing a national standard for Chinese word segmentation and for obtaining a Chinese word list which serves as a foundation for Chinese computing. Nowadays, there are many manually segmented Chinese corpora which have been produced by different organizations and institutions. However, since different segmentation standards have been applied, the results are not easy to integrate. This is a serious obstacle for the development of
Chinese information processing systems. Only if reliable word frequency information becomes available can the coverage rate of the word list, and thus the consistency of manually segmented corpora, be guaranteed. In 1992 a national standard was established which defines the Chinese word as "combined tightly and stably, commonly used in daily life". But this definition obviously leaves room for interpretation and is hard to implement in practical applications. The lack of a practical standard for Chinese words leads to inconsistency problems in corpus construction. To solve the problem, we attempt to construct a Chinese word list which could support the establishment of a national standard. Since word frequency information is a prerequisite for constructing such a word list, Chinese word frequency estimation is a vital task in the process.

While the estimation of word frequencies is easy for English, it is difficult for Chinese because, unlike English, plain Chinese texts do not contain delimiters to indicate word boundaries. We therefore cannot obtain word frequencies as usual by simply counting word token occurrences in the raw data.

To properly estimate word frequencies, it would be ideal to have a "fully" correct word-segmented Chinese corpus (Liu, 1973). However, we face two fundamental difficulties. The first one is caused by a serious inconsistency problem among manually segmented corpora, even when the same segmentation standard is adopted for annotation. Due to the characteristics of Chinese word-formation (Dai, 1992; Chen, 1994), it is very difficult to construct a "fully" correct corpus even by manual segmentation, although the definition of "word" seems to be clear from a linguistic perspective (Zhu, 1982; Tang, 1992). For example, the Chinese character sequence "猪肉" (pork) can either be considered as a compound, "pork", or as a phrase consisting of two single-character words "猪" (pig) and "肉" (meat). Thus, the word frequency of "猪肉" (pork) could be quite high if it is treated in the former way, but could also be zero if it is treated in the latter way. The second difficulty arises from the frequency distribution of words. Because of the overwhelming majority of extremely rare words, obtaining a statistically reliable word frequency estimation, even for a medium-sized Chinese word list, requires a balanced corpus of several hundred million characters rather than several million characters. But manually constructing a segmented corpus of that size is almost impossible: it would be both labour-intensive and time-consuming.
On the one hand, a "fully" correct segmented corpus is not feasible. On the other hand, even an "imperfect" manually segmented corpus can be useful for word frequency estimation. Besides the manually segmented corpus, we could further exploit three other types of information source.

The first type is the "perfect" automatically segmented corpus: a "perfect" word segmenter is used to segment the corpus automatically, leading to a "perfect" automatically segmented corpus, and word frequencies can then be estimated on it. Clearly, this would be ideal if a very powerful word segmenter were available (Liu & Liang, 1986). Unfortunately, state-of-the-art Chinese word segmenters are still not satisfactory in performance. In the First International Chinese Word Segmentation Bakeoff (Sproat & Emerson, 2003) organized by SIGHAN, the highest F-scores for word segmentation in the open test on four small-scale corpora were 95.9%, 95.6%, 90.4% and 91.2%, respectively. In the Second and Third SIGHAN International Chinese Word Segmentation Bakeoffs (Emerson, 2005; Levow, 2006) the situation remained essentially unchanged, despite minor improvements in segmentation performance. A side-effect of such systems is that they try to resolve segmentation ambiguities and recognize unknown words in context, producing many unexpected inconsistencies in segmentation, which are obviously not favoured by the task at hand.

The second type is the MM-segmented corpus: an MM-segmented corpus is produced by the simplest method for Chinese word segmentation, "maximal matching" (MM). Liu and Liang (1986) first used MM to handle large-scale texts. According to the direction of sentence scanning, MM can be further subcategorized into forward MM (FMM) and backward MM (BMM). Experiments showed that MM is both effective and efficient (Liang, 1987). Sun and T'sou (1995) distinguished four cases in which FMM and BMM were both considered, providing very strong evidence that MM-based schemes yield reasonable estimations of word frequencies. Another advantage of MM-based schemes is their high consistency in word segmentation. The weak point of MM is that segmentation errors inevitably exist, and out-of-vocabulary words lead to a severe drop in performance.
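To make the maximal-matching procedure concrete, here is a minimal dictionary-based sketch of FMM and BMM segmentation in Python. It illustrates the general technique only and is not the implementation used in the experiments; the word list, the maximum word length of six characters and the function names are our assumptions.

```python
def fmm_segment(text, wordlist, max_len=6):
    """Forward maximal matching: scan left to right, always taking the
    longest dictionary word that starts at the current position."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in wordlist:
                words.append(candidate)
                i += length
                break
    return words

def bmm_segment(text, wordlist, max_len=6):
    """Backward maximal matching: scan right to left, always taking the
    longest dictionary word that ends at the current position."""
    words, j = [], len(text)
    while j > 0:
        for length in range(min(max_len, j), 0, -1):
            candidate = text[j - length:j]
            if length == 1 or candidate in wordlist:
                words.append(candidate)
                j -= length
                break
    return list(reversed(words))

# Word frequencies are then obtained by counting the output tokens,
# e.g. collections.Counter(fmm_segment(text, wordlist)).
```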
The third type is the raw corpus: here, the frequencies of all strings of characters (string frequencies) are used as an approximation of word frequencies (notice that we deliberately use the term "approximation" here) (Sun et al., 1998). Such an approximation has the advantage that it can be derived directly from arbitrary textual data. Obviously, string frequencies are always at least as large as the desired word frequencies in a given corpus. Therefore, this scheme may seriously overestimate the frequencies of some words (in particular monosyllabic words), but it also has two attractive properties: it does not need any kind of word segmentation component, and the raw data can easily be obtained in arbitrary amounts.

According to the analysis above, for the task of word frequency estimation a "perfect" word-segmented corpus would be ideal but does not exist, whether manually or automatically segmented. What we have is a variety of imperfect corpora as well as unsegmented data. Each type of corpus comes with its own advantages and drawbacks, so none of them alone can fit the task of word frequency estimation. Therefore, we have to consider a hybrid strategy which tries to utilize all the imperfect data sources available so far, ranging from manually segmented and MM-segmented corpora to raw corpora, and to combine them by means of word frequency approximation instead of word frequency estimation.

The rest of this article is organized as follows: Section 2 proposes the construction procedure of the new approximation scheme. Section 3 introduces the data sets we used for parameter adjustment and evaluation. Parameter adjustment is described in Section 4. Section 5 presents the experiments to evaluate the performance of the proposed scheme. In Section 6 we conclude our work.
2. THE APPROXIMATION SCHEME

In this section we propose a hybrid scheme using corpora of different types. The whole process consists of the following three steps: first, we combine the approximation results derived from raw and MM-segmented corpora; second, manually segmented corpora are used for word frequency approximation; and finally, the results derived from the previous two steps are combined to accomplish the final approximation. We will introduce the three steps one by one.

2.1 Combining the approximation results derived from raw and MM-segmented corpora

Suppose we have a predefined word list denoted as WL and a raw corpus denoted as C. By using an FMM and a BMM segmenter to segment C, we can construct two MM-segmented corpora. Thus, the following three statistics for each word w_i in WL can be obtained from C:

f_FMM(w_i, C): the word frequency of w_i obtained from the FMM-segmented C.
f_BMM(w_i, C): the word frequency of w_i obtained from the BMM-segmented C.
f_RAW(w_i, C): the string frequency of w_i obtained directly from C.

The work of Sun et al. (2006) showed that, in the framework of MM, the average of f_FMM(w_i, C) and f_BMM(w_i, C) gives the best approximation of word frequencies for one- to four-character words, f_BMM(w_i, C) is the best for five-character words, and f_RAW(w_i, C) is the best for words with more than five characters. Here we simply follow this claim.
Using F_RFB(w_i, C) to represent the result of word frequency approximation by jointly considering f_RAW(w_i, C), f_FMM(w_i, C) and f_BMM(w_i, C), we thus have:

For words with one to four characters:

$$F_{RFB}(w_i, C) = \frac{1}{2}\left[ f_{FMM}(w_i, C) + f_{BMM}(w_i, C) \right]. \quad (1)$$

For words with five characters:

$$F_{RFB}(w_i, C) = f_{BMM}(w_i, C). \quad (2)$$

For words with more than five characters:

$$F_{RFB}(w_i, C) = f_{RAW}(w_i, C). \quad (3)$$

This word frequency approximation scheme is called RFB.
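As an illustration of Equations (1)–(3), the RFB combination can be sketched as follows, assuming the three frequency counts have already been collected into dictionaries keyed by word; the function and variable names are illustrative, not the authors' code.

```python
def rfb_frequency(word, f_fmm, f_bmm, f_raw):
    """Word frequency approximation F_RFB(w, C), Equations (1)-(3):
    average of the FMM/BMM counts for 1-4 character words, the BMM
    count for 5-character words, and the raw string count otherwise."""
    if len(word) <= 4:
        return 0.5 * (f_fmm.get(word, 0) + f_bmm.get(word, 0))   # Eq. (1)
    elif len(word) == 5:
        return f_bmm.get(word, 0)                                 # Eq. (2)
    else:
        return f_raw.get(word, 0)                                 # Eq. (3)
```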
2.2 Combining the approximation results derived from manually segmented corpora

Here we define MC as a corpus set which contains N different manually segmented corpora, denoted as MC_j (j = 1, 2, ..., N). By simply counting word token occurrences, we can obtain the corpus-specific word frequency for every word w_i in WL, denoted as f_MS(w_i, MC_j). The word frequency for w_i derived from the manually segmented corpora, denoted F_MS(w_i, MC), can be calculated using the following equation:

$$F_{MS}(w_i, MC) = \sum_{j=1}^{N} f_{MS}(w_i, MC_j). \quad (4)$$

This word frequency approximation scheme is called MS.
2.3 Combining F_RFB(w_i, C) and F_MS(w_i, MC)

So far, two different word frequency approximation results have been obtained: F_RFB(w_i, C) and F_MS(w_i, MC). In order to combine the two results appropriately, two problems should be considered.

One is related to the fact that these two statistics are derived from corpora of different sizes: the size of C, denoted as N_C, is generally much larger than that of MC, denoted as N_MC. We thus introduce a parameter a to balance the size difference. As an initial trial, we simply take the ratio of the size of C to that of MC as the value of a (a > 1). We will show how the value of a can be automatically adjusted by means of a genetic algorithm (GA) in Section 4.
In this article, we expect to integrate the manually segmented corpora (MC) and the raw corpus (C) into a "new" corpus with a size of 2N_MC. The size of C is reduced to N_C/a; as a consequence, the size of MC changes to N'_MC:

$$N'_{MC} = 2N_{MC} - N_C/a. \quad (5)$$

Accordingly, F_RFB(w_i, C) turns into F'_RFB(w_i, C):

$$F'_{RFB}(w_i, C) = F_{RFB}(w_i, C)/a. \quad (6)$$

And F_MS(w_i, MC) should be modified as:

$$F'_{MS}(w_i, MC) = F_{MS}(w_i, MC) \times \frac{2N_{MC} - N_C/a}{N_{MC}}. \quad (7)$$
The second problem concerns the relationship between word length and the precision of the approximation results for Chinese word frequencies: the shorter the word, the less reliable the approximation results on a raw corpus will be. Taking this observation into account, we introduce a weighting factor b. The shorter the word is, the more weight should be given to F_MS(w_i, MC). For that purpose we divide the Chinese words into four classes: one-character words, two-character words, three-character words, and words having four or more characters. To each class an individual b is assigned:

$$b = \begin{cases} b_1 & \text{for one-character words,} \\ b_2 & \text{for two-character words,} \\ b_3 & \text{for three-character words,} \\ b_4 = 0 & \text{otherwise.} \end{cases}$$

According to the analysis above, these parameters have to satisfy the global constraint $b_1 \geq b_2 \geq b_3 > b_4$.
To adapt the word frequency approximation to the new weighting factors, Equation (6) shall be further changed to:

$$F''_{RFB}(w_i, C) = F'_{RFB}(w_i, C) \times \frac{1}{1+b} = F_{RFB}(w_i, C) \times \frac{1}{a(1+b)}. \quad (8)$$

Similarly, Equation (7) has to be changed to:

$$F''_{MS}(w_i, MC) = F_{MS}(w_i, MC) \times \frac{2N_{MC} - \frac{N_C}{a(1+b)}}{N_{MC}}. \quad (9)$$

Based on Equations (8) and (9), we obtain our final hybrid scheme, denoted as F_RFB+MS(w_i, C+MC):

$$F_{RFB+MS}(w_i, C+MC) = F''_{MS}(w_i, MC) + F''_{RFB}(w_i, C) = F_{MS}(w_i, MC) \times \frac{2N_{MC} - \frac{N_C}{a(1+b)}}{N_{MC}} + F_{RFB}(w_i, C) \times \frac{1}{a(1+b)}. \quad (10)$$

This word frequency approximation scheme is called RFB+MS.
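The sketch below puts Equations (8)–(10) together. It assumes that the RFB frequencies and the summed manual-corpus frequencies of Equation (4) are available as dictionaries, and that the size parameter a and the length-dependent weights b have already been fixed (for example by the GA of Section 4); all names are illustrative.

```python
def rfb_ms_frequency(word, F_rfb, F_ms, a, b_by_length, n_c, n_mc):
    """Hybrid approximation F_RFB+MS(w, C+MC), Equation (10)."""
    # Length-dependent weight b (b_4 = 0 for words of 4 or more characters).
    b = b_by_length.get(min(len(word), 4), 0.0)
    # Equation (8): scaled contribution of the raw/MM-based estimate.
    rfb_part = F_rfb.get(word, 0) / (a * (1.0 + b))
    # Equation (9): re-scaled contribution of the manually segmented corpora.
    ms_part = F_ms.get(word, 0) * (2 * n_mc - n_c / (a * (1.0 + b))) / n_mc
    # Equation (10): final hybrid approximation.
    return ms_part + rfb_part

# Example weights respecting b1 >= b2 >= b3 > b4 = 0, e.g. the final GA
# values of Table 1: b_by_length = {1: 1.0, 2: 0.5, 3: 0.3, 4: 0.0}.
```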
3. DATA SETS
In this section we introduce the data sets we use for parameter adjustment and performance evaluation.

(1) Two manually word-segmented corpora. The first one is HUAYU, a balanced Chinese corpus consisting of 1,040,190 words with 1,763,762 characters, developed by Tsinghua University and Beijing Language and Culture University. The second one is the BEIDA newswire corpus consisting of 7,286,870 words with 13,030,237 characters, developed by Peking University. Together, the manually word-segmented corpora comprise 8,327,060 words with 14,793,999 characters.

(2) The gold-standard corpus. We use a manually word-segmented balanced corpus developed by the National Language Commission of China, denoted YUWEI, which contains 25,000,309 words with 51,311,659 characters. As the YUWEI corpus is a noted authority and relatively large in size, we take it as the gold standard for experimental comparisons in this article. A word list is derived from this corpus and the corresponding word frequencies are obtained. We delete all words with a frequency of less than four from this original word list to form our standard word list, which is denoted YWL and contains 99,660 entries.

(3) A raw corpus. We use a very large raw corpus, denoted RC, which contains 1,019,986,721 characters. The string frequency of every word in YWL is obtained from RC.

(4) MM-segmented corpora. Using YWL, we segment the raw corpus RC with an FMM segmenter and a BMM segmenter separately, resulting in two MM-segmented corpora. We denote them RC_FMM and RC_BMM, respectively.

In total, we therefore have two manually segmented corpora of moderate size, HUAYU and BEIDA, whose combination HUAYU+BEIDA is denoted as HB, one very large raw corpus (RC), two MM-segmented corpora (RC_FMM and RC_BMM), and a gold-standard corpus (YUWEI).
4. PARAMETER ADJUSTMENT

In this section a genetic algorithm (Holland, 1975) is used to adjust the parameters a and b (b1, b2 and b3) that have been introduced in RFB+MS.
The specific GA used here has been developed by us. It is based on non-overlapping populations and optional elitism. For each generation, the algorithm creates an entirely new population of individuals. The mutation probability is variable, with a minimum value of 0.03. The crossover probability is 0.75 in our implementation. The gold-standard corpus, YUWEI, has been adopted for parameter adjustment. Training and test sets are constructed by randomly dividing YUWEI into two parts, YUWEI_1 and YUWEI_2, with 28,536,843 characters and 22,774,816 characters respectively. YUWEI_1 is used as a training set while YUWEI_2 is used for the open test.

We derived two original word lists from YUWEI_1 and YUWEI_2 and calculated the corresponding word frequencies for them. Two derived word lists, denoted YWL_1 and YWL_2, are obtained by deleting all the words with a frequency of less than 4 from the original word lists. YWL_1 has 83,101 entries and YWL_2 has 76,514 entries. Here, YUWEI_1 is used to adjust the parameters.

A rank number is assigned to each of the 83,101 entries in YWL_1 after the list has been sorted by descending word frequency. This ranking is denoted RYW1. Similarly, we can also obtain a rank sequence for all the entries of YWL_1 in terms of F_RFB+MS(w_i, RC+HB), denoted R_RFB+MS(YWL_1).

To define a fitness function for the genetic algorithm, we propose two metrics which measure the performance of the word frequency approximation scheme.

The first metric describes the closeness between the gold-standard rank RYW1 and that derived from our new scheme. The closer the two ranks are, the better the scheme is. This choice is motivated by the observation that words in a rank list do not strictly follow the same order even for balanced corpora of similar size. Rank closeness allows the rank differences to vary within a certain range.
We decided to use the Spearman Coefficient of Rank Correlation (SCRC) to measure the closeness. Suppose there are two rank sequences R1 and R2 for a word list denoted as WL. The value of SCRC for R1 and R2 is calculated by:

$$SCRC_{WL}(R_1, R_2) = 1 - \frac{6 \sum_{i=1}^{N} d_{w_i}^2}{N(N^2 - 1)},$$
where d_{w_i} is the difference between the two rank numbers of w_i with respect to R1 and R2, and N is the length of WL.

The SCRC value, ranging from 0 to 1, describes the similarity between two rank sequences. Although the rank order may differ substantially, we argue that the general tendency will be properly reflected.
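For illustration, the SCRC computation can be sketched as below, assuming two rank assignments over the same word list are given as dictionaries mapping each word to its integer rank; this is not the authors' code.

```python
def scrc(rank1, rank2):
    """Spearman Coefficient of Rank Correlation between two rank
    assignments (dicts mapping each word to its integer rank)."""
    words = list(rank1)          # both dicts cover the same word list
    n = len(words)
    d_squared = sum((rank1[w] - rank2[w]) ** 2 for w in words)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```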
With respect to the first metric, the optimization problem can be formalized as:

$$\arg\max_{a,\,b} \; SCRC_{YWL\_1}\big(R_{YW1},\, R_{RFB+MS}(YWL\_1)\big).$$

We denote the GA fitness function defined by the first metric as FIT1. In order to demonstrate that SCRC is an effective measurement, two commonly used English corpora (for which accurate word frequency estimates can easily be obtained) have been chosen to calculate the SCRC. One is the LOB corpus with 53,823 word types; the other is the Brown corpus with 55,734 word types. Overall, 27,807 words which are common to the two corpora have been extracted. The SCRC value between the two rank sequences derived from Brown and LOB is 0.72. If we delete the words with a frequency of less than four (as we also do in our experiments for Chinese), 11,357 common words remain and the SCRC value rises to 0.79. This shows that SCRC is an effective measurement of the closeness between two rank sequences.
The second metric is the coverage rate of the top N high-frequency words in a rank sequence R with respect to a corpus C, denoted Coverage(R, N, C) and given by:

$$Coverage(R, N, C) = \frac{\#\ \text{characters in}\ C\ \text{covered by the top}\ N\ \text{high-frequency words in}\ R}{\text{total}\ \#\ \text{characters in}\ C}.$$

With respect to the second metric, the optimization problem is given by:

$$\arg\max_{a,\,b} \; Coverage\big(R_{RFB+MS}(YWL\_1),\, N,\, YUWEI\_1\big).$$
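A minimal sketch of the coverage-rate metric, assuming the corpus is available as a list of word tokens and the ranking as a list of words ordered by decreasing frequency; the names are illustrative.

```python
def coverage_rate(ranked_words, corpus_tokens, n):
    """Coverage(R, N, C): fraction of the characters in corpus C that
    belong to tokens of the top-N high-frequency words in ranking R."""
    top_n = set(ranked_words[:n])
    covered = sum(len(tok) for tok in corpus_tokens if tok in top_n)
    total = sum(len(tok) for tok in corpus_tokens)
    return covered / total
```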
For the experiments described in this article, the parameter N is set to 50,000, since in YUWEI_2 the 50,000 most frequent words already have a coverage of 98%. We denote the GA fitness function defined by the second metric as FIT2.

Table 1. Parameter values determined by GA (using MM segmenter).

                              a       b1     b2     b3     SCRC     Coverage rate
Parameter values for FIT1     15.8    0.9    0.6    0.3    0.748    96.3%
Parameter values for FIT2     17.0    1.2    0.4    0.2    0.743    96.6%
Final parameter values        16.4    1.0    0.5    0.3    0.746    96.5%

Using FIT1 and FIT2 as the fitness functions, two groups of optimal parameter values are obtained after the GA has run for 150 generations with a population size of 30.
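To make the parameter search concrete, here is a compact sketch of a generational GA over candidate (a, b1, b2, b3) vectors. It only mirrors the settings reported above (population 30, 150 generations, crossover probability 0.75, minimum mutation probability 0.03); the fitness function, the search ranges and the way the constraint b1 >= b2 >= b3 is enforced are our assumptions, not the authors' implementation.

```python
import random

POP_SIZE, GENERATIONS = 30, 150
P_CROSSOVER, P_MUTATION_MIN = 0.75, 0.03
RANGES = {"a": (1.0, 50.0), "b1": (0.0, 2.0), "b2": (0.0, 2.0), "b3": (0.0, 2.0)}

def repair(ind):
    # Enforce the global constraint b1 >= b2 >= b3 (> b4 = 0).
    ind["b1"], ind["b2"], ind["b3"] = sorted(
        (ind["b1"], ind["b2"], ind["b3"]), reverse=True)
    return ind

def random_individual():
    return repair({k: random.uniform(*r) for k, r in RANGES.items()})

def crossover(p1, p2):
    if random.random() < P_CROSSOVER:
        return {k: random.choice((p1[k], p2[k])) for k in RANGES}
    return dict(p1)

def mutate(ind, p_mut):
    for k, (lo, hi) in RANGES.items():
        if random.random() < p_mut:
            ind[k] = random.uniform(lo, hi)
    return repair(ind)

def run_ga(fitness, p_mut=P_MUTATION_MIN):
    """fitness(ind) would compute SCRC (FIT1) or the coverage rate (FIT2)
    of the RFB+MS approximation under the candidate parameters."""
    population = [random_individual() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = sorted(population, key=fitness, reverse=True)
        elite = scored[0]                    # optional elitism
        parents = scored[:POP_SIZE // 2]
        population = [elite] + [
            mutate(crossover(*random.sample(parents, 2)), p_mut)
            for _ in range(POP_SIZE - 1)
        ]
    return max(population, key=fitness)
```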
The optimized parameter values are shown in Table 1, together with the corresponding SCRC values and coverage rates. The results show that using the first group of parameters we can achieve a higher SCRC value at the cost of a lower coverage rate, while with the second group a lower SCRC value is obtained with a higher coverage rate. Here we make a trade-off and take the averages of the two groups as the final parameter values, which are listed in the last row of Table 1.

To check whether the parameter values are sensitive to changes in word segmentation, we replace the MM segmenters with a widely used Chinese word segmenter, ICTCLAS 1.0,1 which exploits sophisticated mechanisms to deal with segmentation ambiguities and unknown words, to segment C. We have:
f_ICTCLAS(w_i, C): the word frequency of w_i obtained from the ICTCLAS 1.0-segmented C.

For words with one to five characters the scheme RFB thus changes to:

$$F_{RFB}(w_i, C) = f_{ICTCLAS}(w_i, C).$$

And for words with more than five characters, we still use:

$$F_{RFB}(w_i, C) = f_{RAW}(w_i, C).$$

1 ICTCLAS 1.0: http://www.nlp.org.cn
Parameters are re-adjusted using the genetic algorithm. The parameter values determined by the GA, together with the corresponding SCRC values and coverage rates, are given in Table 2.

Compared with Table 1, we can see that the parameters do not change much when a different word segmenter is used. We also note that using a "high-quality" word segmenter does not lead to a higher SCRC value or a better coverage rate. This confirms our previous claim that an MM segmenter is effective for this task.

By now, the parameter values (a and b) have been determined. We will apply this fixed scheme in a series of experiments to evaluate its utility and effectiveness.
5. EXPERIMENTS

In this section, the open test set YUWEI_2 is taken as a kind of gold-standard corpus, on which we compare our proposed scheme RFB+MS with three other schemes:

(1) RFB: the scheme using a raw corpus and MM-segmented corpora.
(2) MS: the scheme using only manually segmented corpora.
(3) CRF+MS: the scheme using a high-quality CRF-based Chinese word segmenter, trained on manually segmented corpora, to segment the raw corpus.

To train the word segmenter, we first concatenated the automatically and manually segmented corpora to form a new segmented corpus and counted word frequencies on it. Specifically, in this article a word segmenter based on conditional random fields (CRF) (Lafferty et al., 2001; Peng et al., 2004) is adopted. It has been reported as the best one in the SIGHAN word segmentation competition (Levow, 2006).
Table 2. Parameter values determined by GA (using the ICTCLAS 1.0 word segmenter).

                              a       b1     b2     b3     SCRC     Coverage rate
Parameter values for FIT1     16.2    1.0    0.5    0.2    0.738    95.8%
Parameter values for FIT2     16.8    1.2    0.5    0.2    0.724    96.9%
Final parameter values        16.5    1.1    0.5    0.2    0.734    96.5%
The implementation of the CRF-based word segmenter uses the CRF++ version 0.50 package provided by Taku Kudo.2 The window size is set to five, and the four-tag set "S" (single-character word), "L" (left boundary of a word), "M" (non-boundary) and "R" (right boundary of a word) is used to indicate the position of a character within a word. The basic feature templates adopted by Low et al. (2005) are used, which we restate here to make the article self-contained:

(a) C_n, n = -2, -1, 0, 1, 2
(b) C_n C_{n+1}, n = -2, -1, 0, 1
(c) C_{-1} C_1
(d) Pu(C_0)
(e) T(C_{-2}) T(C_{-1}) T(C_0) T(C_1) T(C_2)

where C refers to a Chinese character, C_n refers to the character n positions to the right (left) of the current character, and Pu(C_0) checks whether C_0 is a punctuation mark. Four character types are further defined in Low et al. (2005);3 T(C_n) denotes the type the character C_n belongs to.
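As an illustration of features (a)–(e), the sketch below extracts them for a single character position. It assumes simple helper predicates for punctuation and for the four character types, and it produces plain feature strings rather than an actual CRF++ template file; all names are our assumptions.

```python
def char_type(c):
    """The four types of Low et al. (2005): numbers, date characters
    (day/month/year), English letters, and everything else."""
    if c.isdigit():
        return "NUM"
    if c in "日月年":
        return "DATE"
    if c.isascii() and c.isalpha():
        return "LATIN"
    return "OTHER"

def features(sent, i, punct="，。、！？；："):
    """Features (a)-(e) for the character at position i of sent,
    using a window of two characters on each side."""
    pad = ["<B>", "<B>"] + list(sent) + ["<E>", "<E>"]
    j = i + 2                                  # index into the padded sentence
    window = pad[j - 2:j + 3]                  # C_-2 .. C_2
    feats = [f"C{n}={c}" for n, c in zip(range(-2, 3), window)]                   # (a)
    feats += [f"C{n}C{n+1}={window[n+2]}{window[n+3]}" for n in range(-2, 2)]     # (b)
    feats.append(f"C-1C1={window[1]}{window[3]}")                                 # (c)
    feats.append(f"Pu(C0)={sent[i] in punct}")                                    # (d)
    feats.append("T=" + "".join(char_type(c) for c in window))                    # (e)
    return feats
```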
The CRF model is trained on a portion of the manually segmented corpora HUAYU and BEIDA, since they are too large to be used in their entirety. Sentences are randomly selected from these corpora to compose a training set with a size of 15 Mb, which is the largest size for which CRF training can still be performed on a computer with 4 Gb of memory. Performance comparisons are conducted from five different perspectives.

5.1 Perspective 1: The Spearman Coefficient of Rank Correlation

In this subsection, the SCRC value is used to measure the performance of the word frequency approximation schemes. From YUWEI_2 we obtain a rank sequence for the 76,514 entries of YWL_2, denoted RYW2, using the same procedure as for obtaining RYW1 and R_RFB+MS(YWL_1) described in Section 4. Another four rank sequences, denoted R_MS(YWL_2), R_RFB(YWL_2), R_RFB+MS(YWL_2) and R_CRF+MS(YWL_2), are generated using F_MS(w_i, RC+HB), F_RFB(w_i, RC+HB), F_RFB+MS(w_i, RC+HB) and F_CRF+MS(w_i, RC+HB) correspondingly.
2 http://chasen.org/taku/software/CRF++/.
3 The four types are Numbers, Dates (the Chinese characters for "day", "month" and "year", respectively), English letters and Others.
Then we calculate the SCRC value between RYW2 and each of the four rank sequences. Table 3 shows the results.

The SCRC value of the rank sequence R_RFB+MS(YWL_2) is the highest, indicating a closer correspondence to the rank sequence RYW2 than any of the sequences R_MS(YWL_2), R_RFB(YWL_2) and R_CRF+MS(YWL_2).

The experiment has been repeated on the subsets of YWL_2 with word frequencies ≥ 10 and ≥ 200. Table 4 shows that, no matter whether we exclude very low-frequency words or consider only high-frequency words, the proposed scheme outperforms the other three schemes.

Table 5 summarizes the improvements on the different subsets of YWL_2.
Table 3. Comparison of the SCRC values among the four schemes over YWL_2.

                              SCRC_YWL_2
(RYW2, R_MS(YWL_2))           0.66
(RYW2, R_RFB(YWL_2))          0.73
(RYW2, R_RFB+MS(YWL_2))       0.77
(RYW2, R_CRF+MS(YWL_2))       0.73

Table 4. SCRC values over YWL_2 with word frequency ≥ 10 and ≥ 200.

                              SCRC_YWL_2 (frequency ≥ 10)    SCRC_YWL_2 (frequency ≥ 200)
(RYW2, R_MS(YWL_2))           0.67                           0.71
(RYW2, R_RFB(YWL_2))          0.75                           0.79
(RYW2, R_RFB+MS(YWL_2))       0.79                           0.83
(RYW2, R_CRF+MS(YWL_2))       0.76                           0.80

Table 5. Improvement of SCRC values over different parts of YWL_2.

                      Words with frequency ≥ 4    Words with frequency ≥ 10    Words with frequency ≥ 200
Number of words       76,514                      47,270                       5,725
RFB+MS vs. MS         0.11                        0.12                         0.12
RFB+MS vs. RFB        0.04                        0.04                         0.04
RFB+MS vs. CRF+MS     0.04                        0.03                         0.03
We note that although manually segmented corpora are usually more accurate, they suffer from the drawbacks of small size and inconsistent segmentation standards, which leads to worse performance.

Another point worth noting is that, although the performance of CRF+MS is slightly better than that of RFB, CRF training is very time-consuming compared with the fast maximal matching segmentation. Since the raw corpora we use are usually of huge size, the efficiency of MM segmentation is appreciable. This can also be seen as an advantage of the proposed scheme.

5.2 Perspective 2: Rank sequence deviation

In this subsection, we examine the previous experiment from another perspective. We define the rank sequence deviation with respect to the two rank sequences RYW2 and R_scheme, denoted σ_scheme for short, as:

$$\sigma_{scheme} = \sum_{i} \big| R_{scheme}(w_i) - R_{YW2}(w_i) \big|,$$

where i ranges over a subset of YWL_2. Here, R_scheme is a rank sequence derived from one of the four schemes RFB+MS, MS, RFB and CRF+MS. We then calculate σ_MS, σ_RFB, σ_RFB+MS and σ_CRF+MS.

The values of (σ_RFB+MS − σ_MS)/σ_MS, (σ_RFB+MS − σ_RFB)/σ_RFB and (σ_RFB+MS − σ_CRF+MS)/σ_CRF+MS describe the relative change of the σ value of the proposed scheme compared to the other three schemes, respectively. They are listed in Table 6, where we can see that the proposed scheme achieves the best results for one- to three-character words in YWL_2, but shows lower performance for words of four or more characters.

In order to see more detail, words of the same length are further divided into three groups, i.e. high-, medium- and low-frequency words. Figure 1 shows the coverage rate of the N most frequent words of YUWEI_2, which is used to determine the thresholds for high-, medium- and low-frequency words.

In terms of this coverage rate curve, we choose the point HM to separate high- from medium-frequency words and the point ML to separate medium- from low-frequency words.
Table 6. Comparison of rank sequence deviations with respect to word length.

Word length                          1          2          3          4+
(σ_RFB+MS − σ_MS) / σ_MS             −23.5%     −20.8%     −15.9%     10.2%
(σ_RFB+MS − σ_RFB) / σ_RFB           −18.5%     −16.2%     −9.1%      11.5%
(σ_RFB+MS − σ_CRF+MS) / σ_CRF+MS     −17.2%     −16.5%     −8.6%      10.9%

Fig. 1. The coverage rate of the top N most frequent words of YUWEI_2.
High-frequency words are the 6384 most frequent words (ranks 1–HM), with word frequency > 174 and a coverage of 80% of YUWEI_2. Medium-frequency words range between ranks 6385 and 31,794 (HM–ML), with word frequency > 18; they raise the coverage to 95%. Low-frequency words are the remaining words (ML–76,514), with word frequency > 3. Tables 7, 8 and 9 present the rank sequence deviations for high-, medium- and low-frequency words respectively.

From Tables 7, 8 and 9 we can see that in most of the cases (10 out of 12) our scheme achieves the best result. But for low-frequency words, especially one-character words and words of four or more characters, the results get worse. Low-frequency words, in particular, contribute most to the performance degradation on words of four or more characters (see Table 6). This can be partly explained by the gold-standard corpus we used: although YUWEI is relatively large and balanced, its size is still far from what would be needed to reliably estimate the frequencies of low-frequency words.

Here we give some examples of words which are reasonably ranked higher in R_RFB+MS(YWL_2) than in RYW2: the words glossed as "friend", "ferry", "sunny", "Shenzhen Stock Exchange", "Motorola Inc." and "The State Department Information Bureau", etc.
Table 7. Comparison of rank sequence deviations for high-frequency words in YWL_2 with respect to word length.

                                     1-character   2-character   3-character   4+-character
(σ_RFB+MS − σ_MS) / σ_MS             −45.0%        −38.6%        −67.8%        −7.0%
(σ_RFB+MS − σ_RFB) / σ_RFB           −34.8%        −32.9%        −58.7%        −82.0%
(σ_RFB+MS − σ_CRF+MS) / σ_CRF+MS     −32.5%        −31.0%        −56.3%        −76.8%
Is RFB+MS the best scheme?           yes           yes           yes           yes

Table 8. Comparison of rank sequence deviations for medium-frequency words in YWL_2 with respect to word length.

                                     1-character   2-character   3-character   4+-character
(σ_RFB+MS − σ_MS) / σ_MS             −32.4%        −15.8%        −8.3%         −14.2%
(σ_RFB+MS − σ_RFB) / σ_RFB           −19.2%        −8.2%         −10.1%        −11.2%
(σ_RFB+MS − σ_CRF+MS) / σ_CRF+MS     −32.5%        −7.0%         −10.0%        −10.8%
Is RFB+MS the best scheme?           yes           yes           yes           yes

Table 9. Comparison of rank sequence deviations for low-frequency words in YWL_2 with respect to word length.

                                     1-character   2-character   3-character   4+-character
(σ_RFB+MS − σ_MS) / σ_MS             25.8%         −25.3%        −19.2%        36.8%
(σ_RFB+MS − σ_RFB) / σ_RFB           −4.2%         −12.4%        −5.7%         12.9%
(σ_RFB+MS − σ_CRF+MS) / σ_CRF+MS     −3.5%         −12.0%        −5.3%         8.2%
Is RFB+MS the best scheme?           no            yes           yes           no
Table 10. The coverage rate of the top 50,000 high-frequency words over YUWEI_2.

Scheme     Coverage rate
MS         94.0%
RFB        95.0%
RFB+MS     96.9%
CRF+MS     94.8%
5.3 Perspective 3: The coverage rate

To measure the coverage rate over YUWEI_2, the top 50,000 high-frequency words have been selected from the word lists provided by the four schemes MS, RFB, RFB+MS and CRF+MS. From Table 10 we can see that the coverage rate of the proposed scheme increases by 2.9%, 1.9% and 2.1% compared to the schemes MS, RFB and CRF+MS, respectively.

5.4 Perspective 4: Sample analysis

Taking RYW2 as the gold-standard rank sequence, we compared R_RFB+MS(YWL_2) against R_MS(YWL_2). There are 51,052 words in R_RFB+MS(YWL_2) whose ranks are adjusted closer to the standard sequence RYW2 than their ranks in R_MS(YWL_2); we call these words positive samples. Meanwhile, there are 25,447 words whose ranks are adjusted away from RYW2; we call them negative samples. There remain 15 words that have the same rank in R_MS(YWL_2) and R_RFB+MS(YWL_2). Tables 11 and 12 show the distributions of positive and negative samples over the different frequency regions (high, medium and low).

Here we give some positive samples, such as the words glossed as "biologic technology", "knowledge economy", "information thruway" and "greenhouse effect". These words are comparatively new with respect to the basic vocabulary of Chinese and have a high frequency nowadays; when our scheme is used, their ranks are properly adjusted. We also give some negative samples, such as "Winston Churchill", "Central Red Army" and "maidservant". These words are used less often today, but are still ranked high by our scheme, due to the fact that they were frequently used historically, as reflected in RC, a large-scale raw corpus which covers the linguistic phenomena of that time span more intensively than HUAYU, BEIDA and YUWEI.
Table 11. The distribution of positive samples over the different word frequency regions.

Word frequency region        High-frequency words   Medium-frequency words   Low-frequency words
Total words                  6384                   25,410                   44,720
Number of positive samples   5024                   17,914                   28,129
Proportion of the region     78.7%                  70.5%                    62.9%

Table 12. The distribution of negative samples over the different word frequency regions.

Word frequency region        High-frequency words   Medium-frequency words   Low-frequency words
Total words                  6384                   25,410                   44,720
Number of negative samples   1360                   7496                     16,591
Proportion of the region     21.3%                  29.5%                    37.1%
5.5 Perspective 5: Performance evaluation on the Chinese word segmentation task

Chinese word segmentation is a fundamental technology for many NLP-related applications. In the past decades it has drawn a large body of research within the Chinese language processing community. A variety of methods have been exploited, ranging from rule-based (Palmer, 1997; Cheng et al., 1999) to statistics-based (Sproat et al., 1996; Sun et al., 1998; Teahan et al., 2000), from word-based to character-based (Xue, 2003), from supervised (Peng et al., 2004; Low et al., 2005) to unsupervised learning (Goldwater et al., 2006), and also including hybrid approaches (Gao et al., 2005).

In order to verify the performance of the proposed Chinese word frequency approximation scheme from an application point of view, it is applied to the task of Chinese word segmentation.

N-gram models, which can be traced back to Claude Shannon's work in information theory (Shannon, 1948), are widely used in statistical natural language processing, including Chinese word segmentation (Kit, 2004; Carpenter, 2005; Mao et al., 2007). An N-gram model predicts the current state x_i based on the former n states x_{i−1}, x_{i−2}, ..., x_{i−n}; in probability terms, this is nothing but P(x_i | x_{i−1}, x_{i−2}, ..., x_{i−n}). When used for language modelling, independence assumptions are made so that each word depends only on the last n words. The unigram model is the special case of the N-gram model where N is set to 1.
It has also been used for Chinese word segmentation tasks (Chen, 2005). Suppose there is a sentence S and a segmentation of S which can be represented as n words: w_1, w_2, ..., w_n. Under the unigram model, the sentence S with this segmentation has the probability

$$prob(S) = \prod_{i=1}^{n} prob(w_i),$$

where prob(w_i) is the probability that the word w_i appears in running text.

Given a training corpus, the value of prob(w_i) is usually estimated by the ratio of the word frequency of w_i to the total number of words in the training corpus. A dynamic programming algorithm is used to find the optimal segmentation, i.e. the one that maximizes prob(S). Thus, the segmentation performance depends only on the accuracy of the word frequency estimates derived from the training corpus: the more accurate the word frequency approximation, the better the segmentation results that can be obtained from a unigram model. In this article, a unigram model is adopted for Chinese word segmentation to test the performance of the word frequency approximation schemes. In contrast to the perspectives discussed in the previous subsections, this is an evaluation from an application point of view. Notice that we do not attempt here to improve the performance of Chinese word segmentation but just use this task as a measurement of the quality of the frequency approximation.
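For illustration, a unigram segmenter of this kind can be sketched as below. It assumes a dictionary of word probabilities (derived from whichever frequency approximation scheme is being tested) and a maximum word length; the Viterbi-style recurrence and all names are ours, not the authors' implementation.

```python
import math

def unigram_segment(sentence, word_prob, max_len=6, oov_logprob=-20.0):
    """Return the segmentation maximizing prod_i prob(w_i) under a
    unigram model, via dynamic programming over character positions."""
    n = len(sentence)
    best = [(-math.inf, 0)] * (n + 1)   # (best log-probability, backpointer)
    best[0] = (0.0, 0)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = sentence[i:j]
            lp = math.log(word_prob[word]) if word in word_prob else oov_logprob
            score = best[i][0] + lp
            if score > best[j][0]:
                best[j] = (score, i)
    # Recover the word boundaries from the backpointers.
    words, j = [], n
    while j > 0:
        i = best[j][1]
        words.append(sentence[i:j])
        j = i
    return list(reversed(words))
```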
In the following experiments the F1-measure is used to evaluate the performance of Chinese word segmentation. It is defined as

$$F_1 = \frac{2 P \cdot R}{P + R},$$

where P denotes precision and R denotes recall.
In order to achieve a comprehensive evaluation, all four data sets of the SIGHAN Bakeoff'05 (Emerson, 2005) are used. They have been compiled by Microsoft Research Asia (MSRA), Peking University (PKU), Academia Sinica (AS) and the City University of Hong Kong (CityU). A summary of the corpora is shown in Table 13. The standard Bakeoff scoring program is used to calculate the F1-measure.

Each data set contains a training set and a corresponding test set. A unigram model is trained on each of the four training sets respectively. Table 14 shows the results of word segmentation.
In addition to the four schemes MS, RFB, RFB+MS and CRF+MS, another scheme, denoted STR, is also included in the comparison; it uses as approximation the string frequencies derived from the raw corpus. Therefore, in the following experiments five schemes are compared: STR, MS, RFB, RFB+MS and CRF+MS.

Taking YWL as the original word list, we obtain word frequency estimates from these five schemes respectively. For a specific scheme, some words in YWL may have zero frequency. In order to ensure a fair comparison, the words in YWL that do not receive a zero frequency estimate from any of the five schemes are extracted to form a new word list, denoted CL, which contains 62,639 entries. While the earlier four perspectives use YUWEI as the gold-standard corpus, here YUWEI is treated differently: it is incorporated into the set of manually segmented corpora to make the training data for frequency approximation as large as possible. As a consequence, the word frequency approximation given by the scheme MS changes from F_MS(w_i, HB) to F_MS(w_i, HB+YUWEI), as defined in Equation (4), and the final word frequency approximation is given by F_RFB+MS(w_i, RC+(HB+YUWEI)), as defined in Equation (10).

Table 15 shows the performance of the unigram word segmentation models using the five frequency approximation schemes for CL on the SIGHAN Bakeoff'05 data sets. From Table 15 we can see that the proposed scheme RFB+MS outperforms all the other word frequency approximation schemes.
Table 13. The statistics of the SIGHAN Bakeoff'05 data sets.

Corpus                        MSRA       PKU       AS         CityU
Size of training set (Kb)     12,257     692       6,988      6085
Size of test set (Kb)         368        331       313        3
Size of word list             74,608     51,113    141,338    69,085
Table 14. F1-measure of the unigram word segmentation model using the individual word list and word frequencies from the SIGHAN training set, tested on the four SIGHAN Bakeoff'05 data sets.

Corpus    F1 (%)
MSRA      91.4
PKU       88.4
AS        88.3
CityU     86.1
Unfortunately, this result is not directly comparable to the results shown in Table 14, since our proposed scheme uses the same word list on all four data sets, whereas the results of Table 14 are obtained with different individual word lists extracted from the corresponding training sets.

In order to compensate for the influence of the individual word lists and to facilitate a fair comparison, we further run a unigram word segmentation model which uses the individual word list from each SIGHAN training set, with the word frequencies replaced by the approximation results obtained from the proposed scheme RFB+MS. A simple smoothing method, Laplace smoothing, is used to estimate the probability of words which do not appear in CL. Table 16 shows the number of shared words between the individual SIGHAN word lists and CL, as well as the performance of the resulting unigram word segmentation models on the SIGHAN Bakeoff'05 data sets.

The results indicate that such a strategy only decreases the word segmentation performance by 0.1–0.3 percentage points. We also note that the proportion of shared words is fairly low, so that roughly half of the words receive their frequency estimate by Laplace smoothing.
Table 15. F1-measure of the unigram word segmentation models using the five frequency approximation schemes for CL, tested on the four SIGHAN Bakeoff'05 data sets.

Corpus        STR (%)   MS (%)   RFB (%)   RFB+MS (%)   CRF+MS (%)
MSRA          85.2      85.6     85.4      86.3         85.9
PKU           85.4      86.8     85.8      87.4         87.2
AS            78.3      78.8     78.8      80.4         79.0
CityU         77.4      78.4     77.9      80.1         79.2
Table 16. F1-measure of the unigram word segmentation model using the RFB+MS frequency approximation scheme with Laplace smoothing for the individual word list, tested on the four SIGHAN Bakeoff'05 data sets.

Corpus   Number of shared words between individual word list and CL   F1-measure (%)
MSRA     38,992                                                       91.3
PKU      32,439                                                       88.3
AS       42,374                                                       88.0
CityU    30,299                                                       85.9
Table 17. F1-measure of the unigram word segmentation model using the RFB+MS frequency approximation scheme with the original word frequencies as smoothing for the individual word list, tested on the four SIGHAN Bakeoff'05 data sets.

Corpus     F1-measure (%)
MSRA       92.3
PKU        90.3
AS         89.8
CityU      87.4
If the frequencies of the missing words are taken directly from the SIGHAN training sets instead of Laplace smoothing, the performance of word segmentation, as given in Table 17, is improved; it is even better than the performance obtained using the individual word lists with word frequencies entirely derived from the SIGHAN training sets (Table 14), by 0.9%, 1.9%, 1.5% and 1.3% on the four data sets. This observation adds further evidence to the claim that the proposed scheme RFB+MS is effective for Chinese word frequency approximation. It also suggests that the scheme may easily be adapted to specific data collections.
6. CONCLUSION AND FUTURE WORK

In this article, a hybrid scheme is proposed which jointly uses corpora of different types (raw corpora, MM-segmented corpora and manually segmented corpora) to approximate word frequencies in Chinese. It has been developed taking into account the characteristics of the Chinese language. The parameters of the proposed scheme have been determined so as to produce a general framework for Chinese word frequency approximation; it can therefore be applied immediately to other tasks without additional adjustment. Statistical evaluations have been carried out from several perspectives. They show that the proposed scheme outperforms other Chinese word frequency approximation methods. The model has also been applied to a Chinese word segmentation task to compare its performance with other Chinese word frequency approximation schemes. The results indicate that the proposed scheme is the most effective one among the approaches under consideration. We are now compiling an authoritative Chinese word list based on this word frequency approximation scheme; by making it publicly available in the future, we hope to provide a resource which may be very useful for other Chinese NLP researchers.
ACKNOWLEDGEMENTS
This work is supported by the National Science Foundation of China under Grant No. 60873174, the National 863 High-Tech Project of China under Grant No. 2007AA01Z148 and the China–Germany (Tsinghua–Hamburg University) CINACS Program.
REFERENCES
Carpenter, B. (2005). Scaling high-order character language models to gigabytes. ACL Software Workshop. Ann Arbor, USA.
Chen, A., Zhou, Y., Zhang, A., & Sun, G. (2005). Unigram language model for Chinese word segmentation. In Chu-Ren Huang & G. A. Levow (Eds), Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing (pp. 138–141). Jeju Island, Korea: Association for Computational Linguistics.
Chen, G. L. (1994). On Chinese Morphology. Shanghai: Xuelin Publisher.
Cheng, K. S., Young, G. H., & Wong, K. F. (1999). A study on word-based and integral-bit Chinese text compression algorithms. Journal of the American Society for Information Science, 50(3), 218–228.
Dai, X. L. (1992). Chinese Morphology and its Interface with the Syntax. PhD dissertation, Ohio State University.
Emerson, T. (2005). The Second International Chinese Word Segmentation Bakeoff. In Chu-Ren Huang & G. A. Levow (Eds), Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing. Jeju Island, Korea: Association for Computational Linguistics.
Gao, J. F., Li, M., Huang, C., & Wu, A. (2005). Chinese word segmentation and named entity recognition: A pragmatic approach. Journal of Computational Linguistics and Chinese Language Processing, 31(4), 531–574.
Goldwater, S., Griffiths, T. L., & Johnson, M. (2006). Contextual dependencies in unsupervised word segmentation. In C. Cardie & P. Isabelle (Eds), Proceedings of the International Conference of COLING-ACL 2006 (pp. 673–680). Sydney: Association for Computational Linguistics.
Holland, J. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press.
Kit, C., Xu, Z., & Webster, J. J. (2004). Integrating N-gram model and case-based learning for Chinese word segmentation. Journal of Chinese Language and Computing, 14(3), 213–219.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In C. E. Brodley & A. Pohoreckyj Danyluk (Eds), Proceedings of the 18th International Conference on Machine Learning (ICML 2001) (pp. 282–289). Williamstown, MA: Morgan Kaufmann Publishers.
Levow, G. A. (2006). The Third International Chinese Word Segmentation Bakeoff. In H. T. Ng & O. O. Y. Kwong (Eds), Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing (pp. 108–117). Sydney: Association for Computational Linguistics.
Liang, N. Y. (1987). CDWS: A word segmentation system for written Chinese texts. Journal of Chinese Information Processing, 1(2), 44–52 (in Chinese).
Liu, E. S. (1973). Frequency Dictionary of Chinese Words. The Hague: Mouton and Co. N.V. Publishers.
Liu, Y., & Liang, N. Y. (1986). Counting word frequencies of contemporary Chinese – An engineering of Chinese processing. Journal of Chinese Information Processing, 0(1), 17–25.
Low, J. K., Ng, H. T., & Guo, W. (2005). A maximum entropy approach to Chinese word segmentation. Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing (pp. 161–164). Jeju Island, Korea: Association for Computational Linguistics.
Mao, J., Cheng, G., He, Y., & Xing, Z. (2007). A trigram statistical language model algorithm for Chinese word segmentation. In F. P. Preparata & Q. Fang (Eds), Proceedings of the International Conference of FAW 2007 (pp. 271–280). Lanzhou, China. Berlin, Heidelberg: Springer Verlag.
Palmer, D. D. (1997). A trainable rule-based algorithm for word segmentation. In P. R. Cohen & W. Wahlster (Eds), Proceedings of the International Conference of ACL 1997 (pp. 321–328). Madrid: Association for Computational Linguistics.
Peng, F., Feng, F., & McCallum, A. (2004). Chinese segmentation and new word detection using conditional random fields. In L. Lemnitzer, D. Meurers & E. Hinrichs (Eds), Proceedings of the International Conference of COLING 2004 (pp. 562–568). Geneva, Switzerland: Association for Computational Linguistics.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Sproat, R., Shih, C., Gale, W., & Chang, N. (1996). A stochastic finite-state word-segmentation algorithm for Chinese. Journal of Computational Linguistics and Chinese Language Processing, 22(3), 377–404.
Sproat, R., & Emerson, T. (2003). The First International Chinese Word Segmentation Bakeoff. In Q. Ma & F. Xia (Eds), Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing (pp. 133–143). Sapporo, Japan: Association for Computational Linguistics.
Sun, M. S., & T'sou, B. K. Y. (1995). Ambiguity resolution in Chinese word segmentation. In B. K. T'sou & T. B. Y. Lai (Eds), Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation (pp. 121–126). Hong Kong, China: City University of Hong Kong.
Sun, M. S., Shen, D. Y., & T'sou, B. K. Y. (1998). Chinese word segmentation without using lexicon and hand-crafted training data. In C. Boitet & P. Whitelock (Eds), Proceedings of the 36th ACL and 17th COLING (pp. 1265–1271). Montreal, Canada: Association for Computational Linguistics.
Sun, M. S., Zhang, Z. C., T'sou, B. K. Y., & Lu, H. (2006). Word frequency approximation for Chinese without using manually annotated corpus. In A. F. Gelbukh (Ed.), Proceedings of the 7th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing) (pp. 105–116). Mexico City, Mexico. Berlin, Heidelberg: Springer Verlag.
Tang, T. C. (1992). Chinese Morphology and Syntax. Taipei: Taiwan Student Publisher.
Teahan, W. J., Wen, Y., McNab, R., & Witten, I. H. (2000). A compression-based algorithm for Chinese word segmentation. Journal of Computational Linguistics and Chinese Language Processing, 26(3), 375–393.
Xue, N. W. (2003). Chinese word segmentation as character tagging. Journal of Computational Linguistics and Chinese Language Processing, 8(1), 29–48.
Zhu, D. X. (1982). Lectures on Grammar. Beijing: The Commercial Press.