
Journal of Quantitative Linguistics
Publication details, including instructions for authors and subscription information: http://www.informaworld.com/smpp/title~content=t716100702

Chinese Word Frequency Approximation Based on Multitype Corpora

Wei Qiao (Tsinghua University, China), Maosong Sun (Tsinghua University, China) and Wolfgang Menzel (Hamburg University, Germany)
Online publication date: 14 May 2010

To cite this article: Qiao, Wei, Sun, Maosong and Menzel, Wolfgang (2010). Chinese Word Frequency Approximation Based on Multitype Corpora. Journal of Quantitative Linguistics, 17(2), 142–166.
DOI: 10.1080/092961710033213
URL: http://dx.doi.org/10.1080/092961710033213


Journal of Quantitative Linguistics, 2010, Volume 17, Number 2, pp. 142–166
DOI: 10.1080/092961710033213

Chinese Word Frequency Approximation Based on Multitype Corpora*

Wei Qiao¹, Maosong Sun¹ and Wolfgang Menzel²
¹Tsinghua University, China; ²Hamburg University, Germany

ABSTRACT

Due to the nature of Chinese, a perfect word-segmented Chinese corpus that is ideal for the task of word frequency estimation may never exist. Therefore, a reliable estimation of Chinese word frequencies remains a challenge. Currently, three types of corpora can be considered for this purpose: raw corpora, automatically word-segmented corpora, and manually word-segmented corpora. As each type has its own advantages and drawbacks, none of them is sufficient alone. In this article, we propose a hybrid scheme which utilizes existing corpora of different types for word frequency approximation. Experiments have been performed from statistical and application-oriented perspectives. We demonstrate that, compared with other schemes, the proposed scheme is the most effective one and leads to better word frequency approximation results.

1. INTRODUCTION

Chinese word frequencies play important roles in many natural language processing related applications. For example, they are important for making a national standard of Chinese word segmentation and for obtaining a Chinese word list which serves as a foundation for Chinese computing. Nowadays, there are many manually segmented Chinese corpora which have been produced by different organizations and institutions. However, since different segmentation standards have been applied, the results are not easy to integrate. This is a serious obstacle for the development of Chinese information processing systems.

*Address correspondence to: Wei Qiao, Department of Computer Science and Technology, National Lab on Information Science and Technology, Tsinghua University, Beijing, 100084, China. E-mail: qiaow04@mails.tsinghua.edu.cn


Only if reliable word frequency information becomes available can the coverage rate of the word list, and thus the consistency of manually segmented corpora, be guaranteed. In 1992 a national standard was established which defines the Chinese word as "combined tightly and stably, commonly used in daily life". But this definition obviously leaves room for interpretation and is hard to implement in practical applications. The lack of a practical standard for Chinese words leads to inconsistency problems in corpus construction. To solve the problem, we attempt to construct a Chinese word list which could support the establishment of a national standard. Since word frequency information is a prerequisite for constructing such a word list, Chinese word frequency estimation is a vital task in the process.

While the estimation of word frequencies is easy for English, it is difficult for Chinese, because, unlike English, plain Chinese texts do not contain delimiters to indicate word boundaries. We therefore cannot obtain word frequencies as usual by simply counting word token occurrences in the raw data.

To properly estimate word frequencies, it would be ideal to have a "fully" correct word-segmented Chinese corpus (Liu, 1973). However, we face two fundamental difficulties. The first one is caused by a serious inconsistency problem among manually segmented corpora, even when the same segmentation standard is adopted for annotation. Due to the characteristics of Chinese word-formation (Dai, 1992; Chen, 1994), it is very difficult to construct a "fully" correct corpus even by manual segmentation, although the definition of "word" seems to be clear from a linguistic perspective (Zhu, 1982; Tang, 1992). For example, the Chinese character sequence "猪肉" (pork) can either be considered as a compound, "pork", or as a phrase consisting of two single-character words "猪" (pig) and "肉" (meat). Thus, the word frequency of "猪肉" (pork) could be pretty high if it is treated in the former way, and could also be zero if treated in the latter way.

The second difficulty arises from the frequency distribution of words. In order to obtain a statistically reliable word frequency estimation, due to the overwhelming majority of extremely rare words, even for a medium-sized Chinese word list a balanced corpus with several hundred million characters, rather than several million characters, is required. But manually constructing a segmented corpus of that size is almost impossible: it would be both labour-intensive and time-consuming.

On the one hand, a "fully" correct segmented corpus is not feasible. On the other hand, even the "imperfect" manually segmented corpus can be useful for word frequency estimation. Besides the manually segmented corpus, we could further exploit three other types of information source.

The first type is the "perfect" automatically segmented corpus: a "perfect" word segmenter is used to segment the corpus automatically, leading to a "perfect" automatically segmented corpus. Then word frequencies can be estimated based on it. Clearly, it would be ideal if a very powerful word segmenter is available (Liu & Liang, 1986). Unfortunately, the state-of-the-art Chinese word segmenters are still not satisfactory in performance. In the First International Chinese Word Segmentation Bakeoff (Sproat & Emerson, 2003) organized by SIGHAN, the highest F-scores for word segmentation in the open test on four small-scale corpora were 95.9%, 95.6%, 90.4% and 91.2%, respectively. In the Second and Third SIGHAN International Chinese Word Segmentation Bakeoffs (Emerson, 2005; Levow, 2006), the situation has remained unchanged in nature, despite the minor increase in word segmentation performance. A side-effect of such systems is that they try to solve segmentation ambiguities and recognize unknown words in context, producing a lot of unexpected inconsistencies in segmentation, which are obviously not favoured by the task here.

The second type is the MM-segmented corpus: an MM-segmented corpus is produced by the simplest method for Chinese word segmentation, "maximal matching" (MM). Liu and Liang (1986) first used MM to handle large-scale texts. According to the direction of sentence scanning, MM can be further subcategorized as forward MM (FMM) and backward MM (BMM). Experiments showed that MM is both effective and efficient (Liang, 1987). Sun and T'sou (1995) distinguished four cases in which FMM and BMM were both considered, providing very strong evidence for supporting MM-based schemes as reasonable estimations of word frequencies. Another advantage of MM-based schemes is their high consistency in word segmentation. The weak point of MM is that segmentation errors inevitably exist and out-of-vocabulary words will lead to a severe drop in performance.
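To make the MM idea concrete, here is a minimal sketch (ours) of forward and backward maximal matching against a word list; the maximum word length and the single-character fallback are our choices, not details taken from the cited systems:

```python
def fmm_segment(text, wordlist, max_len=6):
    """Forward maximal matching: at each position take the longest
    word-list entry that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + L] in wordlist or L == 1:
                words.append(text[i:i + L])
                i += L
                break
    return words

def bmm_segment(text, wordlist, max_len=6):
    """Backward maximal matching: the same idea, scanning from the right."""
    words, j = [], len(text)
    while j > 0:
        for L in range(min(max_len, j), 0, -1):
            if text[j - L:j] in wordlist or L == 1:
                words.append(text[j - L:j])
                j -= L
                break
    return list(reversed(words))
```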

The third type is the raw corpus: here, the frequencies of all strings of characters (string frequency) are used as an approximation of word frequencies (notice that we use the term "approximation" here) (Sun et al., 1998). Such an approximation has the advantage that it can be derived directly from arbitrary textual data. Obviously, string frequencies are always larger than the desired word frequencies given a corpus. Therefore, this scheme may seriously overestimate word frequencies for some words (in particular for monosyllabic words), but it has also two attractive properties: it does not need any kind of word segmentation component, and the raw data can be easily obtained in arbitrary amounts.
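A sketch of the string-frequency idea in its simplest form (ours; a realistic implementation over a gigacharacter corpus would use a multi-pattern matcher such as an Aho-Corasick automaton rather than repeated scans):

```python
def string_frequencies(raw_text, wordlist):
    """Approximate word frequencies by plain substring counts in unsegmented text.
    str.count counts non-overlapping occurrences, which is adequate for a sketch."""
    return {w: raw_text.count(w) for w in wordlist}
```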

According to the analysis above, for the task of word frequency estimation a "perfect" word-segmented corpus would be ideal but does not exist, neither manually nor automatically segmented. What we have is a variety of imperfect ones as well as unsegmented data. Each type of corpus comes with its own advantages and drawbacks, so none of them alone can fit the task of word frequency estimation. Therefore, we have to consider a hybrid strategy which tries to utilize all the imperfect data sources available so far, ranging from manually segmented and MM-segmented ones to raw corpora, and to combine them by means of word frequency approximation, instead of word frequency estimation.

The rest of this article is organized as follows: Section 2 proposes the construction procedure of the new approximation scheme. Section 3 introduces the data sets we used for parameter adjustment and evaluation. Parameter adjustment is described in Section 4. Section 5 presents the experiments to evaluate the performance of the proposed scheme. In Section 6 we conclude our work.

2. THE APPROXIMATION SCHEME

In this section we propose a hybrid scheme using corpora of different types. The whole process consists of the following three steps: first, we combine the approximation results derived from raw and MM-segmented corpora; second, manually segmented corpora are used for word frequency approximation; and finally, the results derived from the previous two steps are used to accomplish the final approximation. We will introduce the three steps one by one.

2.1 Combining the approximation results derived from raw and MM-segmented corpora

Suppose we have a predefined word list denoted as WL and a raw corpus denoted as C. By using an FMM and a BMM segmenter to segment C, we can construct two MM-segmented corpora. Thus, the following three statistics for each word w_i in WL can be obtained from C:

f_FMM(w_i, C): word frequency of w_i obtained from the FMM-segmented C.
f_BMM(w_i, C): word frequency of w_i obtained from the BMM-segmented C.
f_RAW(w_i, C): string frequency of w_i obtained directly from C.

The work of Sun et al. (2006) showed that in the framework of MM, the average of f_FMM(w_i, C) and f_BMM(w_i, C) gives the best approximation of word frequencies for one- to four-character words, f_BMM(w_i, C) is the best for five-character words, and f_RAW(w_i, C) the best for words with more than five characters. Here we simply follow this claim.

Using F_RFB(w_i, C) to represent the result of word frequency approximation by jointly considering f_RAW(w_i, C), f_FMM(w_i, C) and f_BMM(w_i, C), we thus have:

For words with one to four characters:

$$F_{RFB}(w_i, C) = \frac{1}{2}\left[f_{FMM}(w_i, C) + f_{BMM}(w_i, C)\right] \quad (1)$$

For words with five characters:

$$F_{RFB}(w_i, C) = f_{BMM}(w_i, C) \quad (2)$$

For words with more than five characters:

$$F_{RFB}(w_i, C) = f_{RAW}(w_i, C) \quad (3)$$

This word frequency approximation scheme is called RFB.
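Assuming the three statistics above have been collected into dicts keyed by word, the rule of Equations (1)–(3) amounts to the following sketch (ours):

```python
def f_rfb(word, f_fmm, f_bmm, f_raw):
    """RFB approximation: average of FMM/BMM counts for 1-4 character words,
    BMM count for 5-character words, raw string frequency otherwise (Eqs. 1-3)."""
    n = len(word)
    if n <= 4:
        return 0.5 * (f_fmm.get(word, 0) + f_bmm.get(word, 0))
    elif n == 5:
        return f_bmm.get(word, 0)
    else:
        return f_raw.get(word, 0)
```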

2.2 Combining the approximation results derived from manually segmented corpora

Here we define MC as a corpora set which contains N different manually segmented corpora, denoted as MC_j (j = 1, 2, ..., N). By simply counting word token occurrences, we can obtain the corpus-specific word frequencies for every word w_i in WL, denoted as f_MS(w_i, MC_j). The word frequency for w_i derived from the manually segmented corpora, denoted F_MS(w_i, MC), can be calculated using the following equation:

$$F_{MS}(w_i, MC) = \sum_{j=1}^{N} f_{MS}(w_i, MC_j) \quad (4)$$

This word frequency approximation scheme is called MS.

2.3 Combining F_RFB(w_i, C) and F_MS(w_i, MC)

So far, two different word frequency approximation results have been obtained: F_RFB(w_i, C) and F_MS(w_i, MC). In order to appropriately combine the two results, two problems should be considered.

One is related to the fact that these two statistics are derived from corpora of different size: the size of C, denoted as N_C, is generally much larger than that of MC, denoted as N_MC. We thus introduce a parameter a to balance the size difference. As an initial trial, we simply take the ratio of C to MC as the value of a (a > 1). We will show how the value of a can be automatically adjusted by means of a genetic algorithm (GA) in Section 4.

In this article, we expect to integrate the manually segmented corpora (MC) and the raw corpus (C) into a "new" corpus with a size of 2N_MC. The size of C is reduced to N_C/a; as a consequence, the size of MC changes to N'_MC:

$$N'_{MC} = 2N_{MC} - N_C / a \quad (5)$$

Accordingly, F_RFB(w_i, C) turns into F'_RFB(w_i, C):

$$F'_{RFB}(w_i, C) = F_{RFB}(w_i, C) / a \quad (6)$$

And F_MS(w_i, MC) should be modified as:

$$F'_{MS}(w_i, MC) = F_{MS}(w_i, MC) \times \frac{2N_{MC} - N_C / a}{N_{MC}} \quad (7)$$

The second problem concerns the relationship between word length and the precision of the approximation results for Chinese word frequencies: the shorter the word, the less reliable the approximation results on a raw corpus will be. Taking this observation into account, we introduce a weighting factor b. The shorter the word is, the more weight should be given to F_MS(w_i, MC). For that purpose we divide the Chinese words into four classes: one-character words, two-character words, three-character words and words having four or more characters. To each class an individual b is assigned:

$$b = \begin{cases} b_1 & \text{for one-character words,} \\ b_2 & \text{for two-character words,} \\ b_3 & \text{for three-character words,} \\ b_4 = 0 & \text{otherwise.} \end{cases}$$

According to the analysis above, these parameters have to satisfy the global constraint $b_1 \geq b_2 \geq b_3 > b_4$.

To adapt the word frequency approximation to the new weighting factors, Equation (6) shall be further changed to:

$$F''_{RFB}(w_i, C) = F'_{RFB}(w_i, C) \times \frac{1}{1+b} = F_{RFB}(w_i, C) \times \frac{1}{a(1+b)} \quad (8)$$

Similarly, Equation (7) has to be changed to:

$$F''_{MS}(w_i, MC) = F_{MS}(w_i, MC) \times \frac{2N_{MC} - N_C / (a(1+b))}{N_{MC}} \quad (9)$$

Based on Equations (8) and (9), we obtain our final hybrid scheme, denoted as F_RFB+MS(w_i, C+MC):

$$F_{RFB+MS}(w_i, C+MC) = F''_{MS}(w_i, MC) + F''_{RFB}(w_i, C) = F_{MS}(w_i, MC) \times \frac{2N_{MC} - N_C / (a(1+b))}{N_{MC}} + F_{RFB}(w_i, C) \times \frac{1}{a(1+b)} \quad (10)$$

This word frequency approximation scheme is called RFB+MS.
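As a sketch of how the final combination might be computed in practice (our own function and variable names, with F_RFB and F_MS assumed to be precomputed dicts):

```python
def f_rfb_ms(word, F_rfb, F_ms, a, b_by_len, N_C, N_MC):
    """Hybrid approximation (Eq. 10): a length-dependent weight b shifts mass
    towards the manually segmented corpora for short words."""
    b = b_by_len.get(min(len(word), 4), 0.0)   # b4 = 0 for words of 4+ characters
    ms_part = F_ms.get(word, 0) * (2 * N_MC - N_C / (a * (1 + b))) / N_MC
    rfb_part = F_rfb.get(word, 0) / (a * (1 + b))
    return ms_part + rfb_part
```

With the final MM-based parameter values of Table 1, a would be 16.4 and b_by_len would be {1: 1.0, 2: 0.5, 3: 0.3}.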

3. DATASETS

In this section we introduce the datasets we use for parameter adjustment and performance evaluation.

(1) Two manually word-segmented corpora. The first one is HUAYU, a balanced Chinese corpus consisting of 1,040,190 words with 1,763,762 characters, developed by Tsinghua University and Beijing Language and Culture University. The second one is the BEIDA newswire corpus consisting of 7,286,870 words with 13,030,237 characters, developed by Peking University. The manually word-segmented corpora have in total 8,327,060 words with 14,793,999 characters.

(2) The gold-standard corpus. We use a manually word-segmented balanced corpus developed by the National Language Commission of China, denoted YUWEI, which contains 25,000,309 words with 51,311,659 characters. As the YUWEI corpus is a noted authority and relatively large in size, we take it as the gold standard for experimental comparisons in this article. A word list is derived from this corpus and the corresponding word frequencies are obtained. We delete all words with a frequency of less than four from this original word list to form our standard word list, which is denoted YWL and contains 99,660 entries.

(3) A raw corpus. We use a very large raw corpus, denoted RC, which contains 1,019,986,721 characters. The string frequency of every word in YWL is obtained from RC.

(4) MM-segmented corpora. Using YWL, we segment the raw corpus RC with an FMM segmenter and a BMM segmenter separately, resulting in two MM-segmented corpora. We denote them RC_FMM and RC_BMM respectively.

In total, we have therefore two manually segmented corpora of moderate size, HUAYU and BEIDA, where the combination HUAYU+BEIDA is denoted as HB, one very large raw corpus (RC), two MM-segmented corpora (RC_FMM and RC_BMM), and a gold-standard corpus (YUWEI).

4. PARAMETER ADJUSTMENT

In this section a genetic algorithm (Holland, 1975) is used to adjust the parameters a and b (b_1, b_2 and b_3) that have been introduced in RFB+MS.

The specific GA used here has been developed by us. It is based on non-overlapping populations and optional elitism. For each generation, the algorithm creates an entirely new population of individuals. The mutation probability is variable, with its minimum set to 0.03. The crossover probability is 0.75 in our implementation. The gold-standard corpus, YUWEI, has been adopted for parameter adjustment. Training and test sets are constructed by randomly dividing YUWEI into two parts: YUWEI_1 and YUWEI_2, with 28,536,843 characters and 22,774,816 characters respectively. YUWEI_1 is used as a training set while YUWEI_2 is used for the open test.

We derived two original word lists from YUWEI_1 and YUWEI_2 and calculated the corresponding word frequencies for them. Two derived word lists, denoted YWL_1 and YWL_2, are obtained by deleting all the words with a frequency of less than 4 from the original word lists. YWL_1 has 83,101 entries and YWL_2 has 76,514 entries. Here, YUWEI_1 is used to adjust the parameters.

A rank number is assigned to every one of the 83,101 entries in YWL_1 after it has been sorted according to descending word frequencies. This ranking is denoted R_YW1. Similarly, we can also obtain a rank sequence for all the entries of YWL_1 in terms of F_RFB+MS(w_i, RC+HB), denoted R_RFB+MS(YWL_1).

To define a fitness function for the genetic algorithm, we propose two metrics which measure the performance of the word frequency approximation scheme.

The first metric describes the closeness between the gold-standard rank R_YW1 and that derived from our new scheme. The closer the two ranks are, the better the scheme is. This choice is motivated by the observation that words in a rank list do not strictly follow the same order even for balanced corpora of similar size. Rank closeness allows the rank differences to vary within a certain range.

We decided to use the Spearman Coefficient of Rank Correlation (SCRC) to measure the closeness. Suppose there are two rank sequences R1 and R2 for a word list denoted as WL. The value of SCRC for R1 and R2 is calculated by:

$$SCRC_{WL}(R_1, R_2) \approx 1 - \frac{6 \sum_{i=1}^{N} d_{w_i}^2}{N(N^2 - 1)},$$

where d_{w_i} is the difference between the two rank numbers of w_i with respect to R1 and R2, and N is the length of WL.

The SCRC value, ranging from 0 to 1, describes the similarity between two rank sequences. Although the rank order may differ substantially, we argue that the general tendency will be properly reflected.
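A minimal sketch of the SCRC computation, assuming both rankings are given as dicts mapping every word of WL to its rank number:

```python
def scrc(rank1, rank2):
    """Spearman coefficient of rank correlation over a shared word list:
    1 - 6 * sum(d_i^2) / (N * (N^2 - 1)), d_i = rank difference for word i."""
    n = len(rank1)
    d2 = sum((rank1[w] - rank2[w]) ** 2 for w in rank1)
    return 1 - 6 * d2 / (n * (n * n - 1))
```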

With respect to the first metric, the optimization problem can be formalized as:

$$\arg\max_{a,\, b} \; SCRC_{YWL\_1}\left(R_{YW1},\, R_{RFB+MS}(YWL\_1)\right).$$

We denote the GA fitness function defined by the first metric as FIT1. In order to demonstrate that SCRC is an effective measurement, two commonly-used English corpora (for which accurate word frequency estimates can be easily obtained) have been chosen to calculate the SCRC. One is the LOB corpus with 53,823 word types; the other is the Brown corpus with 55,734 word types. Overall, 27,807 words which are common to the two corpora have been extracted. The SCRC value between the two rank sequences derived from Brown and LOB is 0.72. If we delete the words with a frequency of less than four (we do this also in our experiments for Chinese), 11,357 common words remain and the SCRC value rises to 0.79. This shows that SCRC is an effective measurement of the closeness between two rank sequences.

The second metric is the coverage rate of the top N high-frequency words in a rank sequence R on a corpus C, denoted as Coverage(R, N, C) and given by:

$$Coverage(R, N, C) = \frac{\#\,\text{characters in } C \text{ covered by the top } N \text{ high-frequency words in } R}{\text{total } \#\,\text{characters in } C}.$$

With respect to the second metric, the optimization problem is given by:

$$\arg\max_{a,\, b} \; Coverage\left(R_{RFB+MS}(YWL\_1),\, N,\, YUWEI\_1\right).$$

For the experiments described in this article, the parameter N is set to 50,000, since in YUWEI_2 the 50,000 most frequent words already have a coverage of 98%. We denote the GA fitness function defined by the second metric as FIT2.

Table 1. Parameter values determined by GA (using the MM segmenter).

                              a      b1    b2    b3    SCRC     Coverage rate
Parameter values for FIT1     15.8   0.9   0.6   0.3   0.748    96.3%
Parameter values for FIT2     17.0   1.2   0.4   0.2   0.743    96.6%
Final parameter values        16.4   1.0   0.5   0.3   0.746    96.5%

Using FIT1 and FIT2 as fitness functions, two groups of optimal parameter values are obtained after the GA has run for 150 generations with a population size of 30.
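The GA component is standard; as an illustration only, here is a compact sketch (ours, not the authors' implementation) of a real-coded GA with the stated settings: non-overlapping generations, elitism, a population of 30 evolved for 150 generations, crossover probability 0.75 and a per-gene mutation probability of 0.03. The fitness argument stands in for FIT1 or FIT2 evaluated on YUWEI_1 and takes a candidate parameter vector (a, b1, b2, b3); the parameter bounds are arbitrary here.

```python
import random

def run_ga(fitness, n_params=4, pop_size=30, generations=150,
           p_cross=0.75, p_mut=0.03, bounds=(0.0, 20.0)):
    """Simple real-coded GA: non-overlapping generations, elitism,
    tournament selection, uniform crossover and Gaussian mutation."""
    lo, hi = bounds
    pop = [[random.uniform(lo, hi) for _ in range(n_params)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        new_pop = [pop[0][:]]                                # elitism: keep the best
        while len(new_pop) < pop_size:
            p1 = max(random.sample(pop, 3), key=fitness)     # tournament selection
            p2 = max(random.sample(pop, 3), key=fitness)
            if random.random() < p_cross:                    # uniform crossover
                child = [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]
            else:
                child = p1[:]
            child = [min(hi, max(lo, g + random.gauss(0, 0.5)))  # Gaussian mutation
                     if random.random() < p_mut else g for g in child]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)
```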

The optimized parameter values are shown in Table 1, together with the corresponding SCRC values and coverage rates. The results show that using the first group of parameters we can achieve a higher SCRC value at the cost of a lower coverage rate, while with the second group a lower SCRC value is obtained with a higher coverage rate. Here we make a trade-off and take the average values of the two groups as the final parameter values, which are listed in the last row of Table 1.

We replace the MM segmenters with a widely-used Chinese word segmenter, ICTCLAS 1.0,¹ which exploits sophisticated mechanisms to deal with segmentation ambiguities and unknown words, to segment C, in order to check whether the parameter values are sensitive to changes in word segmentation. We have:

f_ICTCLAS(w_i, C): word frequency of w_i obtained from the ICTCLAS 1.0-segmented C.

For words with one to five characters the scheme RFB thus changes to:

$$F_{RFB}(w_i, C) = f_{ICTCLAS}(w_i, C).$$

And for words with more than five characters, we still use:

$$F_{RFB}(w_i, C) = f_{RAW}(w_i, C).$$

¹ ICTCLAS 1.0: http://www.nlp.org.cn

Parameters are re-adjusted using the genetic algorithm. The parameter values determined by the GA, together with the corresponding SCRC values and coverage rates, are given in Table 2.

Compared with Table 1, we can see that the parameters do not change a lot when different word segmenters are used. We also note that using a "high-quality" word segmenter does not lead to a higher SCRC value or a better coverage rate. This confirms our previous claim that an MM segmenter is effective for this task.

By now, the parameter values (a and b) have been determined. We will apply this fixed scheme in a series of experiments to evaluate its utility and effectiveness.

5. EXPERIMENTS

In this section, the open test set YUWEI_2 is taken as a kind of gold-standard corpus, on which we compare our proposed scheme RFB+MS with the other three schemes:

(1) RFB: the scheme using a raw corpus and MM-segmented corpora.
(2) MS: the scheme using only manually segmented corpora.
(3) CRF+MS: the scheme using a high-quality CRF-based Chinese word segmenter, trained on manually segmented corpora, to segment the raw corpus.

To train the word segmenter, we first concatenated the automatically and manually segmented corpora to form a new segmented corpus and counted word frequencies on it. Specifically, in this article a word segmenter based on conditional random fields (CRF) (Lafferty et al., 2001; Peng et al., 2004) is adopted. It has been reported as the best one in the SIGHAN word segmentation competition (Levow, 2006).

Table 2. Parameter values determined by GA (using the ICTCLAS 1.0 word segmenter).

                              a      b1    b2    b3    SCRC     Coverage rate
Parameter values for FIT1     16.2   1.0   0.5   0.2   0.738    95.8%
Parameter values for FIT2     16.8   1.2   0.5   0.2   0.724    96.9%
Final parameter values        16.5   1.1   0.5   0.2   0.734    96.5%

The implementation of the CRF-based word segmenter uses the CRF++ version 0.50 package provided by Taku Kudo.² The window size is set to five and the four-tag set "S" (single-character word), "L" (left boundary of a word), "M" (non-boundary), "R" (right boundary of a word) is used to distinguish the position of a character within a word. The basic feature templates adopted by Low et al. (2005) are used, which we restate here to make the article self-contained:

(a) C_n, n = -2, -1, 0, 1, 2
(b) C_n C_{n+1}, n = -2, -1, 0, 1
(c) C_{-1} C_1
(d) Pu(C_0)
(e) T(C_{-2}) T(C_{-1}) T(C_0) T(C_1) T(C_2)

where C refers to a Chinese character, C_n refers to the character n positions to the right (left) of the current character, and Pu(C_0) checks whether C_0 is a punctuation mark. Four character types are further defined in Low et al. (2005);³ T(C_n) denotes the type the character C_n belongs to. The CRF model is trained on a portion of the manually segmented corpora HUAYU and BEIDA, since they are too large to be used entirely. Sentences are randomly selected from these corpora to compose a training set with a size of 15 Mb, which is the utmost size up to which the CRF training can still be performed on a computer with 4 Gb of memory. Performance comparisons are conducted from five different perspectives.

² http://chasen.org/taku/software/CRF++/
³ The four types are Numbers, Dates (the Chinese characters for "day", "month" and "year", respectively), English letters and Others.
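For concreteness, here is a minimal sketch (ours) of how templates (a)-(e) can be instantiated for a single character position. The punctuation set and the type test T(.) are simplified stand-ins, not the exact definitions used with CRF++:

```python
def char_type(c):
    """Simplified T(.): Numbers, Dates (the day/month/year characters),
    English letters, Others."""
    if c.isdigit():
        return 'NUM'
    if c in '日月年':
        return 'DATE'
    if c.isascii() and c.isalpha():
        return 'LATIN'
    return 'OTHER'

def crf_features(sent, i):
    """Instantiate feature templates (a)-(e) for the character at position i."""
    pad = ['<B>', '<B>'] + list(sent) + ['<E>', '<E>']
    C = lambda n: pad[i + 2 + n]                    # C(0) is the current character
    feats = ['C%d=%s' % (n, C(n)) for n in (-2, -1, 0, 1, 2)]                      # (a)
    feats += ['C%d%d=%s%s' % (n, n + 1, C(n), C(n + 1)) for n in (-2, -1, 0, 1)]   # (b)
    feats.append('C-1C1=%s%s' % (C(-1), C(1)))                                     # (c)
    feats.append('Pu=%d' % (C(0) in '，。、！？；：'))                               # (d)
    feats.append('T=' + ''.join(char_type(C(n)) for n in (-2, -1, 0, 1, 2)))       # (e)
    return feats
```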

5.1 Perspective 1: The Spearman Coefficient of Rank Correlation

In this subsection, the SCRC value is used to measure the performance of the word frequency approximation schemes. From YUWEI_2 we obtain a rank sequence for the 76,514 entries of YWL_2, denoted R_YW2, using the same procedure as for obtaining R_YW1 and R_RFB+MS(YWL_1), which was described in Section 4. Another four rank sequences, denoted R_MS(YWL_2), R_RFB(YWL_2), R_RFB+MS(YWL_2) and R_CRF+MS(YWL_2), are generated using F_MS(w_i, RC+HB), F_RFB(w_i, RC+HB), F_RFB+MS(w_i, RC+HB) and F_CRF+MS(w_i, RC+HB) correspondingly.

Then we calculate the SCRC value between R_YW2 and each of the four rank sequences. Table 3 shows the results.

The SCRC value of the rank sequence R_RFB+MS(YWL_2) is the highest, indicating a closer correspondence to the rank sequence R_YW2 than any of the sequences R_MS(YWL_2), R_RFB(YWL_2) and R_CRF+MS(YWL_2).

The experiment has been repeated on the subsets of YWL_2 with word frequencies ≥ 10 and ≥ 200. Table 4 shows that, no matter whether we exclude very low-frequency words or consider only high-frequency words, the proposed scheme outperforms the other three schemes.

Table 5 summarizes the improvement on different subsets of YWL_2. We note that although manually segmented corpora are usually more accurate, they suffer from the drawbacks of small size and inconsistency in their segmentation standards. This leads to a worse performance.

Table 3. Comparison of the SCRC values among the four schemes over YWL_2.

                (R_YW2, R_MS)   (R_YW2, R_RFB)   (R_YW2, R_RFB+MS)   (R_YW2, R_CRF+MS)
SCRC_YWL_2      0.66            0.73             0.77                0.73

Table 4. SCRC values over YWL_2 with word frequency ≥ 10 and ≥ 200.

                                      (R_YW2, R_MS)   (R_YW2, R_RFB)   (R_YW2, R_RFB+MS)   (R_YW2, R_CRF+MS)
SCRC_YWL_2 (word frequency ≥ 10)      0.67            0.75             0.79                0.76
SCRC_YWL_2 (word frequency ≥ 200)     0.71            0.79             0.83                0.80

Table 5. Improvement of SCRC values over different parts of YWL_2.

                               Number of words   RFB+MS vs. MS   RFB+MS vs. RFB   RFB+MS vs. CRF+MS
Words with frequency ≥ 4       76,514            0.11            0.04             0.04
Words with frequency ≥ 10      47,270            0.12            0.04             0.03
Words with frequency ≥ 200     5,725             0.12            0.04             0.03

Anotherpointwhichdeservesnotingisthat,althoughtheperformanceofCRFþMSisslightlybetterthanRFB,CRFtrainingisverytime-consumingcomparedwiththefastmaximummatchingsegmentation.Sincetherawcorpusweusedusuallyhashugesize,theefficiencyofMMsegmentationisappreciable.Thisalsocanbeseenasanadvantageoftheproposedscheme.

Downloaded By: [Canadian Research Knowledge Network] At: 00:50 5 October 20105.2Perspective2:Ranksequencedeviation

Inthissubsection,weperformedthepreviousexperimentfromanotherperspective.WedefinetheranksequencedeviationswithPrespecttotworanksequencesRYW2andRscheme,sRschemeforshort,as:ijRschemeðwiÞÀRYW2ðwiÞj(ioverasubsetofYWL_2).Here,Rschemeisaranksequencederivedfromoneofthefourschemes:RFBþMS,MS,RFB,CRFþMS.WethencalculatesMS,sRFB,sRFBþMSandsCRFþMS.

Thevaluesof(sRFBþMS7sMS)/sMS,(sRFBþMS7sRFB)/sRFBand(sRFBþMS7sCRFþMS)/sCRFþMSdescribethevaryingrateofthesvalueoftheproposedschemecomparedtotheothertwoschemesrespectively.TheyarelistedinTable6wherewecanseethat,theproposedschemereceivesthebestresultsforonetothree-characterwordsinYWL_2,butforwordsoffourcharactersormore,itshowslowerperformance.
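A minimal sketch of the deviation measure and of the relative changes reported below, with rank sequences again represented as dicts from word to rank (our representation):

```python
def rank_deviation(rank_scheme, rank_gold, words):
    """sigma_Rscheme: summed absolute rank difference against the gold ranking,
    restricted to a chosen subset of YWL_2 (e.g. words of a given length)."""
    return sum(abs(rank_scheme[w] - rank_gold[w]) for w in words)

def relative_change(sigma_new, sigma_other):
    """(sigma_RFB+MS - sigma_other) / sigma_other, as reported in Table 6."""
    return (sigma_new - sigma_other) / sigma_other
```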

In order to see more detail, words of the same length are further divided into three groups, i.e. high-, medium- and low-frequency words. Figure 1 shows the coverage rate of the N most frequent words of YUWEI_2, which is used to determine the thresholds for high-, medium- and low-frequency words.

Table 6. Comparison of rank sequence deviations with respect to word length.

Word length                         1          2          3          4+
(σ_RFB+MS − σ_MS)/σ_MS              −23.5%     −20.8%     −15.9%     10.2%
(σ_RFB+MS − σ_RFB)/σ_RFB            −18.5%     −16.2%     −9.1%      11.5%
(σ_RFB+MS − σ_CRF+MS)/σ_CRF+MS      −17.2%     −16.5%     −8.6%      10.9%

Fig. 1. The coverage rate of the top N frequent words of YUWEI_2.

In terms of this coverage rate curve, we choose the point HM to distinguish high- and medium-frequency words and the point ML to distinguish medium- and low-frequency words. High-frequency words are the 6384 most frequent words (1~HM), with word frequency > 174 and a coverage of 80% of YUWEI_2. Medium-frequency words range between ranks 6385 and 31,794 (HM~ML), with a word frequency > 18; they raise the coverage to 95%. Low-frequency words are the remaining words (ML~76,514), with a word frequency > 3. Tables 7, 8 and 9 present the results of rank sequence deviation computed on high-, medium- and low-frequency words respectively.

From Tables 7, 8 and 9 we can see that in most of the cases (10 out of 12) our scheme obtains the best result. But for low-frequency words, especially one-character words and words of four or more characters, the results get worse. Low-frequency words, in particular, contribute most to the performance degradation on the words of four characters or more (see Table 6). This can be partly explained by the gold-standard corpus we used. Although YUWEI is relatively large and balanced, its size is still far from what we need to reliably estimate word frequencies for low-frequency words.

Here we give some examples of words which are reasonably ranked higher up in R_RFB+MS(YWL_2) compared to R_YW2: "friend", "ferry", "sunny", "Shenzhen Stock Exchange", "Motorola Inc.", "The State Department Information Bureau", etc.


Table 7. Comparison of rank sequence deviations for high-frequency words in YWL_2 with respect to word length.

                                    1-character   2-character   3-character   4+ character
(σ_RFB+MS − σ_MS)/σ_MS              −45.0%        −38.6%        −67.8%        7.0%
(σ_RFB+MS − σ_RFB)/σ_RFB            −34.8%        −32.9%        −58.7%        −82.0%
(σ_RFB+MS − σ_CRF+MS)/σ_CRF+MS      −32.5%        −31.0%        −56.3%        −76.8%
Is RFB+MS the best scheme?          ✓             ✓             ✓             ✓

Table 8. Comparison of rank sequence deviations for medium-frequency words in YWL_2 with respect to word length.

                                    1-character   2-character   3-character   4+ character
(σ_RFB+MS − σ_MS)/σ_MS              −32.4%        −15.8%        −8.3%         −14.2%
(σ_RFB+MS − σ_RFB)/σ_RFB            −19.2%        −8.2%         −10.1%        −11.2%
(σ_RFB+MS − σ_CRF+MS)/σ_CRF+MS      −32.5%        −7.0%         −10.0%        −10.8%
Is RFB+MS the best scheme?          ✓             ✓             ✓             ✓

Table 9. Comparison of rank sequence deviations for low-frequency words in YWL_2 with respect to word length.

                                    1-character   2-character   3-character   4+ character
(σ_RFB+MS − σ_MS)/σ_MS              25.8%         −25.3%        −19.2%        36.8%
(σ_RFB+MS − σ_RFB)/σ_RFB            −4.2%         −12.4%        −5.7%         12.9%
(σ_RFB+MS − σ_CRF+MS)/σ_CRF+MS      −3.5%         −12.0%        −5.3%         8.2%
Is RFB+MS the best scheme?          ✗             ✓             ✓             ✗

Table 10. The coverage rate of the top 50,000 high-frequency words over YUWEI_2.

Scheme          MS      RFB     RFB+MS   CRF+MS
Coverage rate   94.0%   95.0%   96.9%    94.8%

5.3 Perspective 3: The coverage rate

To measure the coverage rate over YUWEI_2, the top 50,000 high-frequency words have been selected from the word lists provided by the four schemes MS, RFB, RFB+MS and CRF+MS. From Table 10 we can see that the coverage rate of the proposed scheme increases by 2.9%, 1.9% and 2.1% compared to the schemes MS, RFB and CRF+MS respectively.

5.4 Perspective 4: Sample analysis

Taking R_YW2 as the gold-standard rank sequence, we compared R_RFB+MS(YWL_2) against R_MS(YWL_2). There are 51,052 words in R_RFB+MS(YWL_2) whose ranks are adjusted closer to the standard sequence R_YW2 than their ranks in R_MS(YWL_2). We call these words positive samples. Meanwhile, there are 25,447 words whose ranks are adjusted further from R_YW2. We call them negative samples. There still are 15 words that have the same rank in R_MS(YWL_2) and R_RFB+MS(YWL_2). Tables 11 and 12 show the distribution of positive samples and negative samples for the different frequency regions (high, medium and low).

Here we give some positive samples, such as "biologic technology", "knowledge economy", "information thruway" and "greenhouse effect". These words are comparatively new, compared with the basic vocabulary of Chinese, and have a high frequency nowadays. When using our scheme, they are properly adjusted. We also give some negative samples, such as "Winston Churchill", "Central Red Army" and "maidservant". These words are used less often today, but are still ranked high by our scheme, due to the fact that they were frequently used historically, as reflected in RC, a large-scale raw corpus which covers the linguistic phenomena of that time span more intensively than HUAYU, BEIDA and YUWEI.

Table 11. The distribution of positive samples for different word frequency regions.

Word frequency region      Total words   Number of positive samples   Proportion of the region
High-frequency words       6384          5024                         78.7%
Medium-frequency words     25,410        17,914                       70.5%
Low-frequency words        44,720        28,129                       62.9%

Table 12. The distribution of negative samples for different word frequency regions.

Word frequency region      Total words   Number of negative samples   Proportion of the region
High-frequency words       6384          1360                         21.3%
Medium-frequency words     25,410        7496                         29.5%
Low-frequency words        44,720        16,591                       37.1%

5.5 Perspective 5: Performance evaluation on the Chinese word segmentation task

Chinese word segmentation is the fundamental technology for many NLP-related applications. In the past decades it has drawn a large body of research within the Chinese language processing community. A variety of methods have been exploited, ranging from rule-based (Palmer, 1997; Cheng et al., 1999) to statistics-based (Sproat et al., 1996; Sun et al., 1998; Teahan et al., 2000), from word-based to character-based (Xue, 2003), from supervised learning (Peng et al., 2004; Low et al., 2005) to unsupervised learning (Goldwater et al., 2006), and also including hybrid approaches (Gao et al., 2005).

In order to verify the performance of the proposed Chinese word frequency approximation scheme from an application point of view, it is applied to the task of Chinese word segmentation.

N-gram models, which can be traced back to Claude Shannon's work in information theory (Shannon, 1948), are widely used in statistical natural language processing, including Chinese word segmentation (Kit, 2004; Carpenter, 2005; Mao et al., 2007). An N-gram model predicts the current state x_i based on the former n states x_{i-1}, x_{i-2}, ..., x_{i-n}. In probability terms, this is nothing but P(x_i | x_{i-1}, x_{i-2}, ..., x_{i-n}). When used for language modelling, independence assumptions are made so that each word depends only on the last n words. The unigram model is a special case of the N-gram model where N is set to 1. It has also been used for Chinese word segmentation tasks (Chen, 2005). Suppose there is a sentence S and a segmentation of S which can be represented as n words: w_1, w_2, ..., w_n. When using the unigram model, the sentence S under this segmentation has the probability:

$$prob(S) = \prod_{i=1}^{n} prob(w_i),$$

where prob(w_i) is the probability that the word w_i appears in running text.

Given a training corpus, the value of prob(w_i) is usually estimated by the ratio of the word frequency of w_i to the total number of words in the training corpus. A dynamic programming algorithm is used to find the optimal segmentation which maximizes prob(S). Thus, the segmentation performance depends only on the accuracy of the word frequency estimation derived from the training corpus: the more accurate the word frequency approximation, the better the segmentation results that can be obtained from a unigram model. In this article, a unigram model is adopted for Chinese word segmentation to test the performance of the word frequency approximation schemes. In contrast to the perspectives discussed in the previous subsections, this is an evaluation from an application point of view. Notice that here we do not attempt to improve the performance of Chinese word segmentation but just use this task as a measurement of the quality of the frequency approximation.
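To make the procedure concrete, here is a minimal sketch (ours) of unigram segmentation by dynamic programming over a dictionary of word probabilities; out-of-vocabulary single characters receive a fixed log-probability penalty so that every sentence remains segmentable, and the maximum word length bounds the inner loop:

```python
import math

def unigram_segment(sent, word_prob, max_len=6, oov_logp=-20.0):
    """Find the segmentation maximizing prod(prob(w_i)) with a Viterbi-style DP.
    best[i] holds the best log-probability of segmenting sent[:i]."""
    n = len(sent)
    best = [0.0] + [float('-inf')] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for L in range(1, min(max_len, i) + 1):
            w = sent[i - L:i]
            if w in word_prob:
                logp = math.log(word_prob[w])
            else:
                logp = oov_logp if L == 1 else float('-inf')
            if best[i - L] + logp > best[i]:
                best[i] = best[i - L] + logp
                back[i] = i - L
    words, i = [], n
    while i > 0:                      # backtrack the best path
        words.append(sent[back[i]:i])
        i = back[i]
    return list(reversed(words))
```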

In the following experiments the F1-measure is used to evaluate the performance of Chinese word segmentation. It is defined as

$$F_1\text{-measure} = \frac{2\, P \cdot R}{P + R},$$

where P denotes precision and R denotes recall.
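The corresponding scoring can be sketched as follows; this mirrors the usual span-based Bakeoff-style scoring (precision and recall over correctly recovered words), not the official scoring script itself:

```python
def to_spans(words):
    """Turn a word sequence into (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf1(gold_words, pred_words):
    """Precision, recall and F1 over correctly recovered word spans."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```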

In order to achieve a comprehensive evaluation, all four datasets of the SIGHAN Bakeoff'05 (Emerson, 2005) are used. They have been compiled by Microsoft Research Asia (MSRA), Peking University (PKU), Academia Sinica (AS) and City University of Hong Kong (CityU). A summary of the corpora is shown in Table 13. The standard Bakeoff scoring program is used to calculate the F1-measure.

Each dataset contains a training set and a corresponding test set. A unigram model is trained on each of the four training sets respectively. Table 14 shows the results of word segmentation.

In addition to the four schemes MS, RFB, RFB+MS and CRF+MS, another scheme, denoted as STR, is also included in the comparison. It uses as approximation the string frequencies derived from the raw corpus. Therefore, in the following experiments five schemes are checked: STR, MS, RFB, RFB+MS and CRF+MS.

Taking YWL as the original word list, we obtain word frequency estimates from these five schemes respectively. For a specific scheme, some words in YWL may have zero frequency. In order to ensure a fair comparison, the words in YWL without a zero frequency estimate from any of the five schemes are extracted to form a new word list, denoted CL. CL contains 62,639 entries. While the earlier four perspectives use YUWEI as the gold-standard corpus, here YUWEI is treated differently: it is incorporated into the set of manually segmented corpora to make the training data for frequency approximation as large as possible. As a consequence, the word frequency approximation given by the scheme MS changes from F_MS(w_i, HB) to F_MS(w_i, HB+YUWEI), as defined in Equation (4), and the final word frequency approximation is given by F_RFB+MS(w_i, RC+(HB+YUWEI)), as defined in Equation (10).

Table 15 demonstrates the performance of the unigram word segmentation models using the five frequency approximation schemes for CL on the SIGHAN Bakeoff'05 datasets.

From Table 15 we can see that the proposed scheme RFB+MS outperforms all the other word frequency approximation schemes.

Table 13. The statistics of the SIGHAN Bakeoff'05 datasets.

Corpus                       MSRA     PKU      AS        CityU
Size of training set (Kb)    12,257   6926     –         6085
Size of test set (Kb)        368      331      –         133
Size of word list            74,608   51,113   141,338   69,085

Table 14. F1-measure of the unigram word segmentation model using the individual word list and word frequencies from the SIGHAN training set, tested on the four SIGHAN Bakeoff'05 datasets.

Corpus   MSRA   PKU    AS     CityU
F1 (%)   91.4   88.4   88.3   86.1

Unfortunately, this result is not directly comparable to the results shown in Table 14, since our proposed scheme uses the same word list on all four datasets, whereas the results of Table 14 are obtained with different individual word lists extracted from the corresponding training sets.

In order to compensate for the influence of the individual word lists and to facilitate a fair comparison, we further construct a unigram word segmentation model which uses the individual word list from each SIGHAN training set, with the word frequencies replaced by the approximation results obtained from the proposed scheme RFB+MS. A simple smoothing method, Laplace smoothing, is used to estimate the probability of words which do not appear in CL. Table 16 shows the number of shared words between the individual SIGHAN word lists and CL, as well as the performance of the resulting unigram word segmentation models on the SIGHAN Bakeoff'05 datasets.
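As an illustration, a small sketch (ours) of the add-one step over the CL counts; any word of an individual SIGHAN word list that is missing from CL falls back to the uniform add-one mass:

```python
def laplace_prob(counts, vocab_size):
    """Add-one (Laplace) smoothed unigram probabilities.
    Unseen words receive 1 / (total + vocab_size)."""
    total = sum(counts.values())
    def prob(word):
        return (counts.get(word, 0) + 1) / (total + vocab_size)
    return prob
```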

The results indicate that such a strategy only decreases the word segmentation performance by 0.1~0.3 percent. We also note that the proportion of the shared words is fairly low, so that roughly half of the words receive their frequency estimate by Laplace smoothing. If the frequencies of the missing words are taken from the SIGHAN training sets directly, instead of Laplace smoothing, the performance of word segmentation, as given in Table 17, is improved, and is even better than the performance obtained using the individual word list with word frequencies entirely derived from the SIGHAN training set (Table 14), by 0.9%, 1.9%, 1.5% and 1.3% on the four datasets. This observation adds further evidence to the claim that the proposed scheme RFB+MS is effective for Chinese word frequency approximation. It also suggests that the scheme may easily be adapted to specific data collections.

Table 15. F1-measure of the unigram word segmentation models using the five frequency approximation schemes for CL, tested on the four SIGHAN Bakeoff'05 datasets.

Corpus   STR (%)   MS (%)   RFB (%)   RFB+MS (%)   CRF+MS (%)
MSRA     85.2      85.6     85.4      86.3         85.9
PKU      85.4      86.8     85.8      87.4         87.2
AS       78.3      78.8     78.8      80.4         79.0
CityU    77.4      78.4     77.9      80.1         79.2

Table 16. F1-measure of the unigram word segmentation model using the RFB+MS frequency approximation scheme with Laplace smoothing for the individual word list, tested on the four SIGHAN Bakeoff'05 datasets.

Corpus                                                        MSRA     PKU      AS       CityU
Number of shared words between individual word list and CL    38,992   32,439   42,374   30,299
F1-measure (%)                                                 91.3     88.3     88.0     85.9

Table 17. F1-measure of the unigram word segmentation model using the RFB+MS frequency approximation scheme with original word frequencies as smoothing for the individual word list, tested on the four SIGHAN Bakeoff'05 datasets.

Corpus           MSRA   PKU    AS     CityU
F1-measure (%)   92.3   90.3   89.8   87.4

6. CONCLUSION AND FUTURE WORK

In this article, a hybrid scheme is proposed which jointly uses corpora of different types (raw corpora, MM-segmented corpora and manually segmented corpora) to approximate the word frequencies of Chinese. It has been developed taking into account the characteristics of the Chinese language. The parameters of the proposed scheme have been determined so as to produce a general framework for Chinese word frequency approximation. It can, therefore, be immediately applied to other tasks without additional adjustment. Statistical evaluations have been carried out from several perspectives. They show that the proposed scheme outperforms other Chinese word frequency approximation methods. The model has also been applied to a Chinese word segmentation task to compare its performance with other Chinese word frequency approximation schemes. The results indicate that the proposed scheme is the most effective one among the approaches under consideration. We are now compiling an authoritative Chinese word list based on this word frequency approximation scheme. By making it publicly available in the future, we hope to provide a resource which might be very useful for other Chinese NLP researchers.

ACKNOWLEDGEMENTS

This work is supported by the National Science Foundation of China under Grant No. 60873174, the National 863 High-Tech Project of China under Grant No. 2007AA01Z148 and the China–Germany (Tsinghua–Hamburg University) CINACS Program.

REFERENCES

Carpenter, B. (2005). Scaling high-order character language models to gigabytes. ACL Software Workshop. Ann Arbor, USA.
Chen, A., Zhou, Y., Zhang, A., & Sun, G. (2005). Unigram language model for Chinese word segmentation. In Chu-Ren Huang & G. A. Levow (Eds), Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing (pp. 138–141). Jeju Island, Korea: Association for Computational Linguistics.
Chen, G. L. (1994). On Chinese Morphology. Shanghai: Xuelin Publisher.
Cheng, K. S., Young, G. H., & Wong, K. F. (1999). A study on word-based and integral-bit Chinese text compression algorithm. Journal of the American Society for Information Science, 50(3), 218–228.
Dai, X. L. (1992). Chinese Morphology and its Interface with the Syntax. PhD dissertation, Ohio State University.
Emerson, T. (2005). The Second International Chinese Word Segmentation Bakeoff. In Chu-Ren Huang & G. A. Levow (Eds), Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing. Jeju Island, Korea: Association for Computational Linguistics.
Gao, J. F., Li, M., Huang, C., & Wu, A. (2005). Chinese word segmentation and named entity recognition: A pragmatic approach. Journal of Computational Linguistics and Chinese Language Processing, 31(4), 531–574.
Goldwater, S., Griffiths, T. L., & Johnson, M. (2006). Contextual dependencies in unsupervised word segmentation. In C. Cardie & P. Isabelle (Eds), Proceedings of the International Conference of COLING-ACL 2006 (pp. 673–680). Sydney: Association for Computational Linguistics.
Holland, J. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press.
Kit, C., Xu, Z., & Webster, J. J. (2004). Integrating N-gram model and case-based learning for Chinese word segmentation. Journal of Chinese Language and Computing, 14(3), 213–219.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In C. E. Brodley & A. Pohoreckyj Danyluk (Eds), Proceedings of the 18th International Conference on Machine Learning (ICML 2001) (pp. 282–289). Williamstown, MA: Morgan Kaufmann Publishers.
Levow, G. A. (2006). The Third International Chinese Word Segmentation Bakeoff. In H. T. Ng & O. O. Y. Kwong (Eds), Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing (pp. 108–117). Sydney: Association for Computational Linguistics.
Liang, N. Y. (1987). CDWS: A word segmentation system for written Chinese texts. Journal of Chinese Information Processing, 1(2), 44–52 (in Chinese).
Liu, E. S. (1973). Frequency Dictionary of Chinese Words. The Hague: Mouton and Co. N.V. Publishers.
Liu, Y., & Liang, N. Y. (1986). Counting word frequencies of contemporary Chinese – An engineering of Chinese processing. Journal of Chinese Information Processing, 0(1), 17–25.
Low, J. K., Ng, H. T., & Guo, W. (2005). A maximum entropy approach to Chinese word segmentation. In Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing (pp. 161–164). Jeju Island, Korea: Association for Computational Linguistics.
Mao, J., Cheng, G., He, Y., & Xing, Z. (2007). A trigram statistical language model algorithm for Chinese word segmentation. In F. P. Preparata & Q. Fang (Eds), Proceedings of the International Conference of FAW 2007 (pp. 271–280). Lanzhou, China. Berlin, Heidelberg: Springer Verlag.
Palmer, D. D. (1997). A trainable rule-based algorithm for word segmentation. In P. R. Cohen & W. Wahlster (Eds), Proceedings of the International Conference of ACL 1997 (pp. 321–328). Madrid: Association for Computational Linguistics.
Peng, F., Feng, F., & McCallum, A. (2004). Chinese segmentation and new word detection using conditional random fields. In L. Lemnitzer, D. Meurers & E. Hinrichs (Eds), Proceedings of the International Conference of COLING 2004 (pp. 562–568). Geneva, Switzerland: Association for Computational Linguistics.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Sproat, R., Shih, C., Gale, W., & Chang, N. (1996). A stochastic finite-state word-segmentation algorithm for Chinese. Journal of Computational Linguistics and Chinese Language Processing, 22(3), 377–404.
Sproat, R., & Emerson, T. (2003). The First International Chinese Word Segmentation Bakeoff. In Q. Ma & F. Xia (Eds), Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing (pp. 133–143). Sapporo, Japan: Association for Computational Linguistics.
Sun, M. S., & T'sou, B. K. Y. (1995). Ambiguity resolution in Chinese word segmentation. In B. K. T'sou & T. B. Y. Lai (Eds), Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation (pp. 121–126). Hong Kong, China: City University of Hong Kong.
Sun, M. S., Shen, D. Y., & T'sou, B. K. Y. (1998). Chinese word segmentation without using lexicon and hand-crafted training data. In C. Boitet & P. Whitelock (Eds), Proceedings of the 36th ACL and 17th COLING (pp. 1265–1271). Montreal, Canada: Association for Computational Linguistics.
Sun, M. S., Zhang, Z. C., T'sou, B. K. Y., & Lu, H. (2006). Word frequency approximation for Chinese without using manually annotated corpus. In A. F. Gelbukh (Ed.), Proceedings of the 7th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing) (pp. 105–116). Mexico City, Mexico. Berlin, Heidelberg: Springer Verlag.
Tang, T. C. (1992). Chinese Morphology and Syntax. Taipei: Taiwan Student Publisher.
Teahan, W. J., Wen, Y., McNab, R., & Witten, I. H. (2000). A compression-based algorithm for Chinese word segmentation. Journal of Computational Linguistics and Chinese Language Processing, 26(3), 375–393.
Xue, N. W. (2003). Chinese word segmentation as character tagging. Journal of Computational Linguistics and Chinese Language Processing, 8(1), 29–48.
Zhu, D. X. (1982). Lectures on Grammar. Beijing: The Commercial Press.
