Journal of Quantitative Linguistics
2010, Volume 17, Number 2, pp. 142–166
DOI: 10.1080/092961710033213

Chinese Word Frequency Approximation Based on Multitype Corpora*

Wei Qiao¹, Maosong Sun¹ and Wolfgang Menzel²
¹Tsinghua University, China; ²Hamburg University, Germany

ABSTRACT
Due to the nature of Chinese, a perfect word-segmented Chinese corpus that is ideal for the task of word frequency estimation may never exist. Therefore, a reliable estimation of Chinese word frequencies remains a challenge. Currently, three types of corpora can be considered for this purpose: raw corpora, automatically word-segmented corpora, and manually word-segmented corpora. As each type has its own advantages and drawbacks, none of them is sufficient alone. In this article, we propose a hybrid scheme which utilizes existing corpora of different types for word frequency approximation. Experiments have been performed from statistical and application-oriented perspectives. We demonstrate that, compared with other schemes, the proposed scheme is the most effective one and leads to better word frequency approximation results.

*Address correspondence to: Wei Qiao, Department of Computer Science and Technology, National Lab on Information Science and Technology, Tsinghua University, Beijing, 100084, China. E-mail: qiaow04@mails.tsinghua.edu.cn

1. INTRODUCTION

Chinese word frequencies play important roles in many applications related to natural language processing. For example, they are important for establishing a national standard for Chinese word segmentation and for obtaining a Chinese word list which serves as a foundation for Chinese computing. Nowadays, there are many manually segmented Chinese corpora which have been produced by different organizations and institutions. However, since different segmentation standards have been applied, the results are not easy to integrate. This is a serious obstacle for the development of
Chinese information processing systems. Only if reliable word frequency information becomes available can the coverage rate of the word list, and thus the consistency of manually segmented corpora, be guaranteed. In 1992 a national standard was established which defines the Chinese word as "combined tightly and stably, commonly used in daily life". But this definition obviously leaves room for interpretation and is hard to implement in practical applications. The lack of a practical standard for Chinese words leads to inconsistency problems in corpus construction. To solve the problem, we attempt to construct a Chinese word list which could support the establishment of a national standard. Since word frequency information is a prerequisite for constructing such a word list, Chinese word frequency estimation is a vital task in the process.

While the estimation of word frequencies is easy for English, it is difficult for Chinese because, unlike English, plain Chinese texts do not contain delimiters to indicate word boundaries. We therefore cannot obtain word frequencies as usual by simply counting word token occurrences in the raw data.

To properly estimate word frequencies, it would be ideal to have a "fully" correct word-segmented Chinese corpus (Liu, 1973). However, we face two fundamental difficulties. The first one is caused by a serious inconsistency problem among manually segmented corpora, even when the same segmentation standard is adopted for annotation. Due to the characteristics of Chinese word-formation (Dai, 1992; Chen, 1994), it is very difficult to construct a "fully" correct corpus even by manual segmentation, although the definition of "word" seems to be clear from a linguistic perspective (Zhu, 1982; Tang, 1992). For example, the Chinese character sequence "猪肉" (pork) can either be considered as a compound, "pork", or as a phrase consisting of two single-character words "猪" (pig) and "肉" (meat). Thus, the word frequency of "猪肉" (pork) could be quite high if it is treated in the former way, but could also be zero if it is treated in the latter way. The second difficulty arises from the frequency distribution of words. Because of the overwhelming majority of extremely rare words, obtaining a statistically reliable word frequency estimation, even for a medium-sized Chinese word list, requires a balanced corpus of several hundred million characters rather than several million characters. But manually constructing a segmented corpus of that size is almost impossible: it would be both labour-intensive and time-consuming.
On the one hand, a "fully" correct segmented corpus is not feasible. On the other hand, even an "imperfect" manually segmented corpus can be useful for word frequency estimation. Besides the manually segmented corpus, we could further exploit three other types of information source.

The first type is the "perfect" automatically segmented corpus: a "perfect" word segmenter is used to segment the corpus automatically, leading to a "perfect" automatically segmented corpus, and word frequencies can then be estimated on it. Clearly, this would be ideal if a very powerful word segmenter were available (Liu & Liang, 1986). Unfortunately, state-of-the-art Chinese word segmenters are still not satisfactory in performance. In the First International Chinese Word Segmentation Bakeoff (Sproat & Emerson, 2003) organized by SIGHAN, the highest F-scores for word segmentation in the open test on four small-scale corpora were 95.9%, 95.6%, 90.4% and 91.2%, respectively. In the Second and Third SIGHAN International Chinese Word Segmentation Bakeoffs (Emerson, 2005; Levow, 2006) the situation remained essentially unchanged, despite minor improvements in segmentation performance. A side-effect of such systems is that they try to resolve segmentation ambiguities and recognize unknown words in context, producing many unexpected inconsistencies in segmentation, which are obviously not favoured by the task at hand.

The second type is the MM-segmented corpus: an MM-segmented corpus is produced by the simplest method for Chinese word segmentation, "maximal matching" (MM). Liu and Liang (1986) first used MM to handle large-scale texts. According to the direction of sentence scanning, MM can be further subcategorized into forward MM (FMM) and backward MM (BMM). Experiments showed that MM is both effective and efficient (Liang, 1987). Sun and T'sou (1995) distinguished four cases in which FMM and BMM were both considered, providing very strong evidence that MM-based schemes yield reasonable estimations of word frequencies. Another advantage of MM-based schemes is their high consistency in word segmentation. The weak point of MM is that segmentation errors inevitably exist, and out-of-vocabulary words lead to a severe drop in performance.
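To make the maximal-matching procedure concrete, here is a minimal dictionary-based sketch of FMM and BMM segmentation in Python. It illustrates the general technique only and is not the implementation used in the experiments; the word list, the maximum word length of six characters and the function names are our assumptions.

```python
def fmm_segment(text, wordlist, max_len=6):
    """Forward maximal matching: scan left to right, always taking the
    longest dictionary word that starts at the current position."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in wordlist:
                words.append(candidate)
                i += length
                break
    return words

def bmm_segment(text, wordlist, max_len=6):
    """Backward maximal matching: scan right to left, always taking the
    longest dictionary word that ends at the current position."""
    words, j = [], len(text)
    while j > 0:
        for length in range(min(max_len, j), 0, -1):
            candidate = text[j - length:j]
            if length == 1 or candidate in wordlist:
                words.append(candidate)
                j -= length
                break
    return list(reversed(words))

# Word frequencies are then obtained by counting the output tokens,
# e.g. collections.Counter(fmm_segment(text, wordlist)).
```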
The third type is the raw corpus: here, the frequencies of all strings of characters (string frequencies) are used as an approximation of word frequencies (notice that we deliberately use the term "approximation" here) (Sun et al., 1998). Such an approximation has the advantage that it can be derived directly from arbitrary textual data. Obviously, string frequencies are always at least as large as the desired word frequencies in a given corpus. Therefore, this scheme may seriously overestimate the frequencies of some words (in particular monosyllabic words), but it also has two attractive properties: it does not need any kind of word segmentation component, and the raw data can easily be obtained in arbitrary amounts.

According to the analysis above, for the task of word frequency estimation a "perfect" word-segmented corpus would be ideal but does not exist, whether manually or automatically segmented. What we have is a variety of imperfect corpora as well as unsegmented data. Each type of corpus comes with its own advantages and drawbacks, so none of them alone can fit the task of word frequency estimation. Therefore, we have to consider a hybrid strategy which tries to utilize all the imperfect data sources available so far, ranging from manually segmented and MM-segmented corpora to raw corpora, and to combine them by means of word frequency approximation instead of word frequency estimation.

The rest of this article is organized as follows: Section 2 proposes the construction procedure of the new approximation scheme. Section 3 introduces the data sets we used for parameter adjustment and evaluation. Parameter adjustment is described in Section 4. Section 5 presents the experiments to evaluate the performance of the proposed scheme. In Section 6 we conclude our work.
2. THE APPROXIMATION SCHEME

In this section we propose a hybrid scheme using corpora of different types. The whole process consists of the following three steps: first, we combine the approximation results derived from raw and MM-segmented corpora; second, manually segmented corpora are used for word frequency approximation; and finally, the results derived from the previous two steps are combined to accomplish the final approximation. We will introduce the three steps one by one.

2.1 Combining the approximation results derived from raw and MM-segmented corpora

Suppose we have a predefined word list denoted as WL and a raw corpus denoted as C. By using an FMM and a BMM segmenter to segment C, we can construct two MM-segmented corpora. Thus, the following three statistics for each word w_i in WL can be obtained from C:

f_FMM(w_i, C): the word frequency of w_i obtained from the FMM-segmented C.
f_BMM(w_i, C): the word frequency of w_i obtained from the BMM-segmented C.
f_RAW(w_i, C): the string frequency of w_i obtained directly from C.

The work of Sun et al. (2006) showed that, in the framework of MM, the average of f_FMM(w_i, C) and f_BMM(w_i, C) gives the best approximation of word frequencies for one- to four-character words, f_BMM(w_i, C) is the best for five-character words, and f_RAW(w_i, C) is the best for words with more than five characters. Here we simply follow this claim.
Using F_RFB(w_i, C) to represent the result of word frequency approximation by jointly considering f_RAW(w_i, C), f_FMM(w_i, C) and f_BMM(w_i, C), we thus have:

For words with one to four characters:

$$F_{RFB}(w_i, C) = \frac{1}{2}\left[ f_{FMM}(w_i, C) + f_{BMM}(w_i, C) \right]. \quad (1)$$

For words with five characters:

$$F_{RFB}(w_i, C) = f_{BMM}(w_i, C). \quad (2)$$

For words with more than five characters:

$$F_{RFB}(w_i, C) = f_{RAW}(w_i, C). \quad (3)$$

This word frequency approximation scheme is called RFB.
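As an illustration of Equations (1)–(3), the RFB combination can be sketched as follows, assuming the three frequency counts have already been collected into dictionaries keyed by word; the function and variable names are illustrative, not the authors' code.

```python
def rfb_frequency(word, f_fmm, f_bmm, f_raw):
    """Word frequency approximation F_RFB(w, C), Equations (1)-(3):
    average of the FMM/BMM counts for 1-4 character words, the BMM
    count for 5-character words, and the raw string count otherwise."""
    if len(word) <= 4:
        return 0.5 * (f_fmm.get(word, 0) + f_bmm.get(word, 0))   # Eq. (1)
    elif len(word) == 5:
        return f_bmm.get(word, 0)                                 # Eq. (2)
    else:
        return f_raw.get(word, 0)                                 # Eq. (3)
```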
2.2 Combining the approximation results derived from manually segmented corpora

Here we define MC as a corpus set which contains N different manually segmented corpora, denoted as MC_j (j = 1, 2, ..., N). By simply counting word token occurrences, we can obtain the corpus-specific word frequency for every word w_i in WL, denoted as f_MS(w_i, MC_j). The word frequency for w_i derived from the manually segmented corpora, denoted F_MS(w_i, MC), can be calculated using the following equation:

$$F_{MS}(w_i, MC) = \sum_{j=1}^{N} f_{MS}(w_i, MC_j). \quad (4)$$

This word frequency approximation scheme is called MS.
2.3 Combining F_RFB(w_i, C) and F_MS(w_i, MC)

So far, two different word frequency approximation results have been obtained: F_RFB(w_i, C) and F_MS(w_i, MC). In order to combine the two results appropriately, two problems should be considered.

One is related to the fact that these two statistics are derived from corpora of different sizes: the size of C, denoted as N_C, is generally much larger than that of MC, denoted as N_MC. We thus introduce a parameter a to balance the size difference. As an initial trial, we simply take the ratio of the size of C to that of MC as the value of a (a > 1). We will show how the value of a can be automatically adjusted by means of a genetic algorithm (GA) in Section 4.
In this article, we expect to integrate the manually segmented corpora (MC) and the raw corpus (C) into a "new" corpus with a size of 2N_MC. The size of C is reduced to N_C/a; as a consequence, the size of MC changes to N'_MC:

$$N'_{MC} = 2N_{MC} - N_C/a. \quad (5)$$

Accordingly, F_RFB(w_i, C) turns into F'_RFB(w_i, C):

$$F'_{RFB}(w_i, C) = F_{RFB}(w_i, C)/a. \quad (6)$$

And F_MS(w_i, MC) should be modified as:

$$F'_{MS}(w_i, MC) = F_{MS}(w_i, MC) \times \frac{2N_{MC} - N_C/a}{N_{MC}}. \quad (7)$$
The second problem concerns the relationship between word length and the precision of the approximation results for Chinese word frequencies: the shorter the word, the less reliable the approximation results on a raw corpus will be. Taking this observation into account, we introduce a weighting factor b. The shorter the word is, the more weight should be given to F_MS(w_i, MC). For that purpose we divide the Chinese words into four classes: one-character words, two-character words, three-character words, and words having four or more characters. To each class an individual b is assigned:

$$b = \begin{cases} b_1 & \text{for one-character words,} \\ b_2 & \text{for two-character words,} \\ b_3 & \text{for three-character words,} \\ b_4 = 0 & \text{otherwise.} \end{cases}$$

According to the analysis above, these parameters have to satisfy the global constraint $b_1 \geq b_2 \geq b_3 > b_4$.
To adapt the word frequency approximation to the new weighting factors, Equation (6) shall be further changed to:

$$F''_{RFB}(w_i, C) = F'_{RFB}(w_i, C) \times \frac{1}{1+b} = F_{RFB}(w_i, C) \times \frac{1}{a(1+b)}. \quad (8)$$

Similarly, Equation (7) has to be changed to:

$$F''_{MS}(w_i, MC) = F_{MS}(w_i, MC) \times \frac{2N_{MC} - \frac{N_C}{a(1+b)}}{N_{MC}}. \quad (9)$$

Based on Equations (8) and (9), we obtain our final hybrid scheme, denoted as F_RFB+MS(w_i, C+MC):

$$F_{RFB+MS}(w_i, C+MC) = F''_{MS}(w_i, MC) + F''_{RFB}(w_i, C) = F_{MS}(w_i, MC) \times \frac{2N_{MC} - \frac{N_C}{a(1+b)}}{N_{MC}} + F_{RFB}(w_i, C) \times \frac{1}{a(1+b)}. \quad (10)$$

This word frequency approximation scheme is called RFB+MS.
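The sketch below puts Equations (8)–(10) together. It assumes that the RFB frequencies and the summed manual-corpus frequencies of Equation (4) are available as dictionaries, and that the size parameter a and the length-dependent weights b have already been fixed (for example by the GA of Section 4); all names are illustrative.

```python
def rfb_ms_frequency(word, F_rfb, F_ms, a, b_by_length, n_c, n_mc):
    """Hybrid approximation F_RFB+MS(w, C+MC), Equation (10)."""
    # Length-dependent weight b (b_4 = 0 for words of 4 or more characters).
    b = b_by_length.get(min(len(word), 4), 0.0)
    # Equation (8): scaled contribution of the raw/MM-based estimate.
    rfb_part = F_rfb.get(word, 0) / (a * (1.0 + b))
    # Equation (9): re-scaled contribution of the manually segmented corpora.
    ms_part = F_ms.get(word, 0) * (2 * n_mc - n_c / (a * (1.0 + b))) / n_mc
    # Equation (10): final hybrid approximation.
    return ms_part + rfb_part

# Example weights respecting b1 >= b2 >= b3 > b4 = 0, e.g. the final GA
# values of Table 1: b_by_length = {1: 1.0, 2: 0.5, 3: 0.3, 4: 0.0}.
```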
3. DATA SETS
In this section we introduce the data sets we use for parameter adjustment and performance evaluation.

(1) Two manually word-segmented corpora. The first one is HUAYU, a balanced Chinese corpus consisting of 1,040,190 words with 1,763,762 characters, developed by Tsinghua University and Beijing Language and Culture University. The second one is the BEIDA newswire corpus consisting of 7,286,870 words with 13,030,237 characters, developed by Peking University. Together, the manually word-segmented corpora comprise 8,327,060 words with 14,793,999 characters.

(2) The gold-standard corpus. We use a manually word-segmented balanced corpus developed by the National Language Commission of China, denoted YUWEI, which contains 25,000,309 words with 51,311,659 characters. As the YUWEI corpus is a noted authority and relatively large in size, we take it as the gold standard for experimental comparisons in this article. A word list is derived from this corpus and the corresponding word frequencies are obtained. We delete all words with a frequency of less than four from this original word list to form our standard word list, which is denoted YWL and contains 99,660 entries.

(3) A raw corpus. We use a very large raw corpus, denoted RC, which contains 1,019,986,721 characters. The string frequency of every word in YWL is obtained from RC.

(4) MM-segmented corpora. Using YWL, we segment the raw corpus RC with an FMM segmenter and a BMM segmenter separately, resulting in two MM-segmented corpora. We denote them RC_FMM and RC_BMM, respectively.

In total, we therefore have two manually segmented corpora of moderate size, HUAYU and BEIDA, whose combination HUAYU+BEIDA is denoted as HB, one very large raw corpus (RC), two MM-segmented corpora (RC_FMM and RC_BMM), and a gold-standard corpus (YUWEI).
4. PARAMETER ADJUSTMENT

In this section a genetic algorithm (Holland, 1975) is used to adjust the parameters a and b (b1, b2 and b3) that have been introduced in RFB+MS.
The specific GA used here has been developed by us. It is based on non-overlapping populations and optional elitism. For each generation, the algorithm creates an entirely new population of individuals. The mutation probability is variable, with a minimum value of 0.03. The crossover probability is 0.75 in our implementation. The gold-standard corpus, YUWEI, has been adopted for parameter adjustment. Training and test sets are constructed by randomly dividing YUWEI into two parts, YUWEI_1 and YUWEI_2, with 28,536,843 characters and 22,774,816 characters respectively. YUWEI_1 is used as a training set while YUWEI_2 is used for the open test.

We derived two original word lists from YUWEI_1 and YUWEI_2 and calculated the corresponding word frequencies for them. Two derived word lists, denoted YWL_1 and YWL_2, are obtained by deleting all the words with a frequency of less than 4 from the original word lists. YWL_1 has 83,101 entries and YWL_2 has 76,514 entries. Here, YUWEI_1 is used to adjust the parameters.

A rank number is assigned to each of the 83,101 entries in YWL_1 after the list has been sorted by descending word frequency. This ranking is denoted RYW1. Similarly, we can also obtain a rank sequence for all the entries of YWL_1 in terms of F_RFB+MS(w_i, RC+HB), denoted R_RFB+MS(YWL_1).

To define a fitness function for the genetic algorithm, we propose two metrics which measure the performance of the word frequency approximation scheme.

The first metric describes the closeness between the gold-standard rank RYW1 and that derived from our new scheme. The closer the two ranks are, the better the scheme is. This choice is motivated by the observation that words in a rank list do not strictly follow the same order even for balanced corpora of similar size. Rank closeness allows the rank differences to vary within a certain range.
We decided to use the Spearman Coefficient of Rank Correlation (SCRC) to measure the closeness. Suppose there are two rank sequences R1 and R2 for a word list denoted as WL. The value of SCRC for R1 and R2 is calculated by:

$$SCRC_{WL}(R_1, R_2) = 1 - \frac{6 \sum_{i=1}^{N} d_{w_i}^2}{N(N^2 - 1)},$$
where d_{w_i} is the difference between the two rank numbers of w_i with respect to R1 and R2, and N is the length of WL.

The SCRC value, ranging from 0 to 1, describes the similarity between two rank sequences. Although the rank order may differ substantially, we argue that the general tendency will be properly reflected.
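For illustration, the SCRC computation can be sketched as below, assuming two rank assignments over the same word list are given as dictionaries mapping each word to its integer rank; this is not the authors' code.

```python
def scrc(rank1, rank2):
    """Spearman Coefficient of Rank Correlation between two rank
    assignments (dicts mapping each word to its integer rank)."""
    words = list(rank1)          # both dicts cover the same word list
    n = len(words)
    d_squared = sum((rank1[w] - rank2[w]) ** 2 for w in words)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```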
With respect to the first metric, the optimization problem can be formalized as:

$$\arg\max_{a,\,b} \; SCRC_{YWL\_1}\big(R_{YW1},\, R_{RFB+MS}(YWL\_1)\big).$$

We denote the GA fitness function defined by the first metric as FIT1. In order to demonstrate that SCRC is an effective measurement, two commonly used English corpora (for which accurate word frequency estimates can easily be obtained) have been chosen to calculate the SCRC. One is the LOB corpus with 53,823 word types; the other is the Brown corpus with 55,734 word types. Overall, 27,807 words which are common to the two corpora have been extracted. The SCRC value between the two rank sequences derived from Brown and LOB is 0.72. If we delete the words with a frequency of less than four (as we also do in our experiments for Chinese), 11,357 common words remain and the SCRC value rises to 0.79. This shows that SCRC is an effective measurement of the closeness between two rank sequences.
The second metric is the coverage rate of the top N high-frequency words in a rank sequence R with respect to a corpus C, denoted Coverage(R, N, C) and given by:

$$Coverage(R, N, C) = \frac{\#\ \text{characters in}\ C\ \text{covered by the top}\ N\ \text{high-frequency words in}\ R}{\text{total}\ \#\ \text{characters in}\ C}.$$

With respect to the second metric, the optimization problem is given by:

$$\arg\max_{a,\,b} \; Coverage\big(R_{RFB+MS}(YWL\_1),\, N,\, YUWEI\_1\big).$$
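A minimal sketch of the coverage-rate metric, assuming the corpus is available as a list of word tokens and the ranking as a list of words ordered by decreasing frequency; the names are illustrative.

```python
def coverage_rate(ranked_words, corpus_tokens, n):
    """Coverage(R, N, C): fraction of the characters in corpus C that
    belong to tokens of the top-N high-frequency words in ranking R."""
    top_n = set(ranked_words[:n])
    covered = sum(len(tok) for tok in corpus_tokens if tok in top_n)
    total = sum(len(tok) for tok in corpus_tokens)
    return covered / total
```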
For the experiments described in this article, the parameter N is set to 50,000, since in YUWEI_2 the 50,000 most frequent words already have a coverage of 98%. We denote the GA fitness function defined by the second metric as FIT2.

Table 1. Parameter values determined by GA (using MM segmenter).

                              a       b1     b2     b3     SCRC     Coverage rate
Parameter values for FIT1     15.8    0.9    0.6    0.3    0.748    96.3%
Parameter values for FIT2     17.0    1.2    0.4    0.2    0.743    96.6%
Final parameter values        16.4    1.0    0.5    0.3    0.746    96.5%

Using FIT1 and FIT2 as the fitness functions, two groups of optimal parameter values are obtained after the GA has run for 150 generations with a population size of 30.
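To make the parameter search concrete, here is a compact sketch of a generational GA over candidate (a, b1, b2, b3) vectors. It only mirrors the settings reported above (population 30, 150 generations, crossover probability 0.75, minimum mutation probability 0.03); the fitness function, the search ranges and the way the constraint b1 >= b2 >= b3 is enforced are our assumptions, not the authors' implementation.

```python
import random

POP_SIZE, GENERATIONS = 30, 150
P_CROSSOVER, P_MUTATION_MIN = 0.75, 0.03
RANGES = {"a": (1.0, 50.0), "b1": (0.0, 2.0), "b2": (0.0, 2.0), "b3": (0.0, 2.0)}

def repair(ind):
    # Enforce the global constraint b1 >= b2 >= b3 (> b4 = 0).
    ind["b1"], ind["b2"], ind["b3"] = sorted(
        (ind["b1"], ind["b2"], ind["b3"]), reverse=True)
    return ind

def random_individual():
    return repair({k: random.uniform(*r) for k, r in RANGES.items()})

def crossover(p1, p2):
    if random.random() < P_CROSSOVER:
        return {k: random.choice((p1[k], p2[k])) for k in RANGES}
    return dict(p1)

def mutate(ind, p_mut):
    for k, (lo, hi) in RANGES.items():
        if random.random() < p_mut:
            ind[k] = random.uniform(lo, hi)
    return repair(ind)

def run_ga(fitness, p_mut=P_MUTATION_MIN):
    """fitness(ind) would compute SCRC (FIT1) or the coverage rate (FIT2)
    of the RFB+MS approximation under the candidate parameters."""
    population = [random_individual() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = sorted(population, key=fitness, reverse=True)
        elite = scored[0]                    # optional elitism
        parents = scored[:POP_SIZE // 2]
        population = [elite] + [
            mutate(crossover(*random.sample(parents, 2)), p_mut)
            for _ in range(POP_SIZE - 1)
        ]
    return max(population, key=fitness)
```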
The optimized parameter values are shown in Table 1, together with the corresponding SCRC values and coverage rates. The results show that using the first group of parameters we can achieve a higher SCRC value at the cost of a lower coverage rate, while with the second group a lower SCRC value is obtained with a higher coverage rate. Here we make a trade-off and take the averages of the two groups as the final parameter values, which are listed in the last row of Table 1.

To check whether the parameter values are sensitive to changes in word segmentation, we replace the MM segmenters with a widely used Chinese word segmenter, ICTCLAS 1.0,1 which exploits sophisticated mechanisms to deal with segmentation ambiguities and unknown words, to segment C. We have:
f_ICTCLAS(w_i, C): the word frequency of w_i obtained from the ICTCLAS 1.0-segmented C.

For words with one to five characters the scheme RFB thus changes to:

$$F_{RFB}(w_i, C) = f_{ICTCLAS}(w_i, C).$$

And for words with more than five characters, we still use:

$$F_{RFB}(w_i, C) = f_{RAW}(w_i, C).$$

1 ICTCLAS 1.0: http://www.nlp.org.cn
Parameters are re-adjusted using the genetic algorithm. The parameter values determined by the GA, together with the corresponding SCRC values and coverage rates, are given in Table 2.

Compared with Table 1, we can see that the parameters do not change much when a different word segmenter is used. We also note that using a "high-quality" word segmenter does not lead to a higher SCRC value or a better coverage rate. This confirms our previous claim that an MM segmenter is effective for this task.

By now, the parameter values (a and b) have been determined. We will apply this fixed scheme in a series of experiments to evaluate its utility and effectiveness.
5. EXPERIMENTS

In this section, the open test set YUWEI_2 is taken as a kind of gold-standard corpus, on which we compare our proposed scheme RFB+MS with three other schemes:

(1) RFB: the scheme using a raw corpus and MM-segmented corpora.
(2) MS: the scheme using only manually segmented corpora.
(3) CRF+MS: the scheme using a high-quality CRF-based Chinese word segmenter, trained on manually segmented corpora, to segment the raw corpus.

To train the word segmenter, we first concatenated the automatically and manually segmented corpora to form a new segmented corpus and counted word frequencies on it. Specifically, in this article a word segmenter based on conditional random fields (CRF) (Lafferty et al., 2001; Peng et al., 2004) is adopted. It has been reported as the best one in the SIGHAN word segmentation competition (Levow, 2006).
Table 2. Parameter values determined by GA (using the ICTCLAS 1.0 word segmenter).

                              a       b1     b2     b3     SCRC     Coverage rate
Parameter values for FIT1     16.2    1.0    0.5    0.2    0.738    95.8%
Parameter values for FIT2     16.8    1.2    0.5    0.2    0.724    96.9%
Final parameter values        16.5    1.1    0.5    0.2    0.734    96.5%
The implementation of the CRF-based word segmenter uses the CRF++ version 0.50 package provided by Taku Kudo.2 The window size is set to five, and the four-tag set "S" (single-character word), "L" (left boundary of a word), "M" (non-boundary) and "R" (right boundary of a word) is used to indicate the position of a character within a word. The basic feature templates adopted by Low et al. (2005) are used, which we restate here to make the article self-contained:

(a) C_n, n = -2, -1, 0, 1, 2
(b) C_n C_{n+1}, n = -2, -1, 0, 1
(c) C_{-1} C_1
(d) Pu(C_0)
(e) T(C_{-2}) T(C_{-1}) T(C_0) T(C_1) T(C_2)

where C refers to a Chinese character, C_n refers to the character n positions to the right (left) of the current character, and Pu(C_0) checks whether C_0 is a punctuation mark. Four character types are further defined in Low et al. (2005);3 T(C_n) denotes the type the character C_n belongs to.
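As an illustration of features (a)–(e), the sketch below extracts them for a single character position. It assumes simple helper predicates for punctuation and for the four character types, and it produces plain feature strings rather than an actual CRF++ template file; all names are our assumptions.

```python
def char_type(c):
    """The four types of Low et al. (2005): numbers, date characters
    (day/month/year), English letters, and everything else."""
    if c.isdigit():
        return "NUM"
    if c in "日月年":
        return "DATE"
    if c.isascii() and c.isalpha():
        return "LATIN"
    return "OTHER"

def features(sent, i, punct="，。、！？；："):
    """Features (a)-(e) for the character at position i of sent,
    using a window of two characters on each side."""
    pad = ["<B>", "<B>"] + list(sent) + ["<E>", "<E>"]
    j = i + 2                                  # index into the padded sentence
    window = pad[j - 2:j + 3]                  # C_-2 .. C_2
    feats = [f"C{n}={c}" for n, c in zip(range(-2, 3), window)]                   # (a)
    feats += [f"C{n}C{n+1}={window[n+2]}{window[n+3]}" for n in range(-2, 2)]     # (b)
    feats.append(f"C-1C1={window[1]}{window[3]}")                                 # (c)
    feats.append(f"Pu(C0)={sent[i] in punct}")                                    # (d)
    feats.append("T=" + "".join(char_type(c) for c in window))                    # (e)
    return feats
```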
The CRF model is trained on a portion of the manually segmented corpora HUAYU and BEIDA, since they are too large to be used in their entirety. Sentences are randomly selected from these corpora to compose a training set with a size of 15 Mb, which is the largest size for which CRF training can still be performed on a computer with 4 Gb of memory. Performance comparisons are conducted from five different perspectives.

5.1 Perspective 1: The Spearman Coefficient of Rank Correlation

In this subsection, the SCRC value is used to measure the performance of the word frequency approximation schemes. From YUWEI_2 we obtain a rank sequence for the 76,514 entries of YWL_2, denoted RYW2, using the same procedure as for obtaining RYW1 and R_RFB+MS(YWL_1) described in Section 4. Another four rank sequences, denoted R_MS(YWL_2), R_RFB(YWL_2), R_RFB+MS(YWL_2) and R_CRF+MS(YWL_2), are generated using F_MS(w_i, RC+HB), F_RFB(w_i, RC+HB), F_RFB+MS(w_i, RC+HB) and F_CRF+MS(w_i, RC+HB) correspondingly.
2 http://chasen.org/taku/software/CRF++/.
3 The four types are Numbers, Dates (the Chinese characters for "day", "month" and "year", respectively), English letters and Others.
Then we calculate the SCRC value between RYW2 and each of the four rank sequences. Table 3 shows the results.

The SCRC value of the rank sequence R_RFB+MS(YWL_2) is the highest, indicating a closer correspondence to the rank sequence RYW2 than any of the sequences R_MS(YWL_2), R_RFB(YWL_2) and R_CRF+MS(YWL_2).

The experiment has been repeated on the subsets of YWL_2 with word frequencies ≥ 10 and ≥ 200. Table 4 shows that, no matter whether we exclude very low-frequency words or consider only high-frequency words, the proposed scheme outperforms the other three schemes.

Table 5 summarizes the improvements on the different subsets of YWL_2.
Table 3. Comparison of the SCRC values among the four schemes over YWL_2.

                              SCRC_YWL_2
(RYW2, R_MS(YWL_2))           0.66
(RYW2, R_RFB(YWL_2))          0.73
(RYW2, R_RFB+MS(YWL_2))       0.77
(RYW2, R_CRF+MS(YWL_2))       0.73

Table 4. SCRC values over YWL_2 with word frequency ≥ 10 and ≥ 200.

                              SCRC_YWL_2 (frequency ≥ 10)    SCRC_YWL_2 (frequency ≥ 200)
(RYW2, R_MS(YWL_2))           0.67                           0.71
(RYW2, R_RFB(YWL_2))          0.75                           0.79
(RYW2, R_RFB+MS(YWL_2))       0.79                           0.83
(RYW2, R_CRF+MS(YWL_2))       0.76                           0.80

Table 5. Improvement of SCRC values over different parts of YWL_2.

                      Words with frequency ≥ 4    Words with frequency ≥ 10    Words with frequency ≥ 200
Number of words       76,514                      47,270                       5,725
RFB+MS vs. MS         0.11                        0.12                         0.12
RFB+MS vs. RFB        0.04                        0.04                         0.04
RFB+MS vs. CRF+MS     0.04                        0.03                         0.03
We note that although manually segmented corpora are usually more accurate, they suffer from the drawbacks of small size and inconsistent segmentation standards, which leads to worse performance.

Another point worth noting is that, although the performance of CRF+MS is slightly better than that of RFB, CRF training is very time-consuming compared with the fast maximal matching segmentation. Since the raw corpora we use are usually of huge size, the efficiency of MM segmentation is appreciable. This can also be seen as an advantage of the proposed scheme.

5.2 Perspective 2: Rank sequence deviation

In this subsection, we examine the previous experiment from another perspective. We define the rank sequence deviation with respect to the two rank sequences RYW2 and R_scheme, denoted σ_scheme for short, as:

$$\sigma_{scheme} = \sum_{i} \big| R_{scheme}(w_i) - R_{YW2}(w_i) \big|,$$

where i ranges over a subset of YWL_2. Here, R_scheme is a rank sequence derived from one of the four schemes RFB+MS, MS, RFB and CRF+MS. We then calculate σ_MS, σ_RFB, σ_RFB+MS and σ_CRF+MS.

The values of (σ_RFB+MS − σ_MS)/σ_MS, (σ_RFB+MS − σ_RFB)/σ_RFB and (σ_RFB+MS − σ_CRF+MS)/σ_CRF+MS describe the relative change of the σ value of the proposed scheme compared to the other three schemes, respectively. They are listed in Table 6, where we can see that the proposed scheme achieves the best results for one- to three-character words in YWL_2, but shows lower performance for words of four or more characters.

In order to see more detail, words of the same length are further divided into three groups, i.e. high-, medium- and low-frequency words. Figure 1 shows the coverage rate of the N most frequent words of YUWEI_2, which is used to determine the thresholds for high-, medium- and low-frequency words.

In terms of this coverage rate curve, we choose the point HM to separate high- from medium-frequency words and the point ML to separate medium- from low-frequency words.
Table 6. Comparison of rank sequence deviations with respect to word length.

Word length                          1          2          3          4+
(σ_RFB+MS − σ_MS) / σ_MS             −23.5%     −20.8%     −15.9%     10.2%
(σ_RFB+MS − σ_RFB) / σ_RFB           −18.5%     −16.2%     −9.1%      11.5%
(σ_RFB+MS − σ_CRF+MS) / σ_CRF+MS     −17.2%     −16.5%     −8.6%      10.9%

Fig. 1. The coverage rate of the top N most frequent words of YUWEI_2.
High-frequency words are the 6384 most frequent words (ranks 1–HM), with word frequency > 174 and a coverage of 80% of YUWEI_2. Medium-frequency words range between ranks 6385 and 31,794 (HM–ML), with word frequency > 18; they raise the coverage to 95%. Low-frequency words are the remaining words (ML–76,514), with word frequency > 3. Tables 7, 8 and 9 present the rank sequence deviations for high-, medium- and low-frequency words respectively.

From Tables 7, 8 and 9 we can see that in most of the cases (10 out of 12) our scheme achieves the best result. But for low-frequency words, especially one-character words and words of four or more characters, the results get worse. Low-frequency words, in particular, contribute most to the performance degradation on words of four or more characters (see Table 6). This can be partly explained by the gold-standard corpus we used: although YUWEI is relatively large and balanced, its size is still far from what would be needed to reliably estimate the frequencies of low-frequency words.

Here we give some examples of words which are reasonably ranked higher in R_RFB+MS(YWL_2) than in RYW2: the words glossed as "friend", "ferry", "sunny", "Shenzhen Stock Exchange", "Motorola Inc." and "The State Department Information Bureau", etc.
Table 7. Comparison of rank sequence deviations for high-frequency words in YWL_2 with respect to word length.

                                     1-character   2-character   3-character   4+-character
(σ_RFB+MS − σ_MS) / σ_MS             −45.0%        −38.6%        −67.8%        −7.0%
(σ_RFB+MS − σ_RFB) / σ_RFB           −34.8%        −32.9%        −58.7%        −82.0%
(σ_RFB+MS − σ_CRF+MS) / σ_CRF+MS     −32.5%        −31.0%        −56.3%        −76.8%
Is RFB+MS the best scheme?           yes           yes           yes           yes

Table 8. Comparison of rank sequence deviations for medium-frequency words in YWL_2 with respect to word length.

                                     1-character   2-character   3-character   4+-character
(σ_RFB+MS − σ_MS) / σ_MS             −32.4%        −15.8%        −8.3%         −14.2%
(σ_RFB+MS − σ_RFB) / σ_RFB           −19.2%        −8.2%         −10.1%        −11.2%
(σ_RFB+MS − σ_CRF+MS) / σ_CRF+MS     −32.5%        −7.0%         −10.0%        −10.8%
Is RFB+MS the best scheme?           yes           yes           yes           yes

Table 9. Comparison of rank sequence deviations for low-frequency words in YWL_2 with respect to word length.

                                     1-character   2-character   3-character   4+-character
(σ_RFB+MS − σ_MS) / σ_MS             25.8%         −25.3%        −19.2%        36.8%
(σ_RFB+MS − σ_RFB) / σ_RFB           −4.2%         −12.4%        −5.7%         12.9%
(σ_RFB+MS − σ_CRF+MS) / σ_CRF+MS     −3.5%         −12.0%        −5.3%         8.2%
Is RFB+MS the best scheme?           no            yes           yes           no
Table 10. The coverage rate of the top 50,000 high-frequency words over YUWEI_2.

Scheme     Coverage rate
MS         94.0%
RFB        95.0%
RFB+MS     96.9%
CRF+MS     94.8%
5.3 Perspective 3: The coverage rate

To measure the coverage rate over YUWEI_2, the top 50,000 high-frequency words have been selected from the word lists provided by the four schemes MS, RFB, RFB+MS and CRF+MS. From Table 10 we can see that the coverage rate of the proposed scheme increases by 2.9%, 1.9% and 2.1% compared to the schemes MS, RFB and CRF+MS, respectively.

5.4 Perspective 4: Sample analysis

Taking RYW2 as the gold-standard rank sequence, we compared R_RFB+MS(YWL_2) against R_MS(YWL_2). There are 51,052 words in R_RFB+MS(YWL_2) whose ranks are adjusted closer to the standard sequence RYW2 than their ranks in R_MS(YWL_2); we call these words positive samples. Meanwhile, there are 25,447 words whose ranks are adjusted away from RYW2; we call them negative samples. There remain 15 words that have the same rank in R_MS(YWL_2) and R_RFB+MS(YWL_2). Tables 11 and 12 show the distributions of positive and negative samples over the different frequency regions (high, medium and low).

Here we give some positive samples, such as the words glossed as "biologic technology", "knowledge economy", "information thruway" and "greenhouse effect". These words are comparatively new with respect to the basic vocabulary of Chinese and have a high frequency nowadays; when our scheme is used, their ranks are properly adjusted. We also give some negative samples, such as "Winston Churchill", "Central Red Army" and "maidservant". These words are used less often today, but are still ranked high by our scheme, due to the fact that they were frequently used historically, as reflected in RC, a large-scale raw corpus which covers the linguistic phenomena of that time span more intensively than HUAYU, BEIDA and YUWEI.
Table 11. The distribution of positive samples over the different word frequency regions.

Word frequency region        High-frequency words   Medium-frequency words   Low-frequency words
Total words                  6384                   25,410                   44,720
Number of positive samples   5024                   17,914                   28,129
Proportion of the region     78.7%                  70.5%                    62.9%

Table 12. The distribution of negative samples over the different word frequency regions.

Word frequency region        High-frequency words   Medium-frequency words   Low-frequency words
Total words                  6384                   25,410                   44,720
Number of negative samples   1360                   7496                     16,591
Proportion of the region     21.3%                  29.5%                    37.1%
5.5 Perspective 5: Performance evaluation on the Chinese word segmentation task

Chinese word segmentation is a fundamental technology for many NLP-related applications. In the past decades it has drawn a large body of research within the Chinese language processing community. A variety of methods have been exploited, ranging from rule-based (Palmer, 1997; Cheng et al., 1999) to statistics-based (Sproat et al., 1996; Sun et al., 1998; Teahan et al., 2000), from word-based to character-based (Xue, 2003), from supervised (Peng et al., 2004; Low et al., 2005) to unsupervised learning (Goldwater et al., 2006), and also including hybrid approaches (Gao et al., 2005).

In order to verify the performance of the proposed Chinese word frequency approximation scheme from an application point of view, it is applied to the task of Chinese word segmentation.

N-gram models, which can be traced back to Claude Shannon's work in information theory (Shannon, 1948), are widely used in statistical natural language processing, including Chinese word segmentation (Kit, 2004; Carpenter, 2005; Mao et al., 2007). An N-gram model predicts the current state x_i based on the former n states x_{i−1}, x_{i−2}, ..., x_{i−n}; in probability terms, this is nothing but P(x_i | x_{i−1}, x_{i−2}, ..., x_{i−n}). When used for language modelling, independence assumptions are made so that each word depends only on the last n words. The unigram model is the special case of the N-gram model where N is set to 1.
It has also been used for Chinese word segmentation tasks (Chen, 2005). Suppose there is a sentence S and a segmentation of S which can be represented as n words: w_1, w_2, ..., w_n. Under the unigram model, the sentence S with this segmentation has the probability

$$prob(S) = \prod_{i=1}^{n} prob(w_i),$$

where prob(w_i) is the probability that the word w_i appears in running text.

Given a training corpus, the value of prob(w_i) is usually estimated by the ratio of the word frequency of w_i to the total number of words in the training corpus. A dynamic programming algorithm is used to find the optimal segmentation, i.e. the one that maximizes prob(S). Thus, the segmentation performance depends only on the accuracy of the word frequency estimates derived from the training corpus: the more accurate the word frequency approximation, the better the segmentation results that can be obtained from a unigram model. In this article, a unigram model is adopted for Chinese word segmentation to test the performance of the word frequency approximation schemes. In contrast to the perspectives discussed in the previous subsections, this is an evaluation from an application point of view. Notice that we do not attempt here to improve the performance of Chinese word segmentation but just use this task as a measurement of the quality of the frequency approximation.
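For illustration, a unigram segmenter of this kind can be sketched as below. It assumes a dictionary of word probabilities (derived from whichever frequency approximation scheme is being tested) and a maximum word length; the Viterbi-style recurrence and all names are ours, not the authors' implementation.

```python
import math

def unigram_segment(sentence, word_prob, max_len=6, oov_logprob=-20.0):
    """Return the segmentation maximizing prod_i prob(w_i) under a
    unigram model, via dynamic programming over character positions."""
    n = len(sentence)
    best = [(-math.inf, 0)] * (n + 1)   # (best log-probability, backpointer)
    best[0] = (0.0, 0)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = sentence[i:j]
            lp = math.log(word_prob[word]) if word in word_prob else oov_logprob
            score = best[i][0] + lp
            if score > best[j][0]:
                best[j] = (score, i)
    # Recover the word boundaries from the backpointers.
    words, j = [], n
    while j > 0:
        i = best[j][1]
        words.append(sentence[i:j])
        j = i
    return list(reversed(words))
```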
In the following experiments the F1-measure is used to evaluate the performance of Chinese word segmentation. It is defined as

$$F_1 = \frac{2 P \cdot R}{P + R},$$

where P denotes precision and R denotes recall.
In order to achieve a comprehensive evaluation, all four data sets of the SIGHAN Bakeoff'05 (Emerson, 2005) are used. They have been compiled by Microsoft Research Asia (MSRA), Peking University (PKU), Academia Sinica (AS) and the City University of Hong Kong (CityU). A summary of the corpora is shown in Table 13. The standard Bakeoff scoring program is used to calculate the F1-measure.

Each data set contains a training set and a corresponding test set. A unigram model is trained on each of the four training sets respectively. Table 14 shows the results of word segmentation.
In addition to the four schemes MS, RFB, RFB+MS and CRF+MS, another scheme, denoted STR, is also included in the comparison; it uses as approximation the string frequencies derived from the raw corpus. Therefore, in the following experiments five schemes are compared: STR, MS, RFB, RFB+MS and CRF+MS.

Taking YWL as the original word list, we obtain word frequency estimates from these five schemes respectively. For a specific scheme, some words in YWL may have zero frequency. In order to ensure a fair comparison, the words in YWL that do not receive a zero frequency estimate from any of the five schemes are extracted to form a new word list, denoted CL, which contains 62,639 entries. While the earlier four perspectives use YUWEI as the gold-standard corpus, here YUWEI is treated differently: it is incorporated into the set of manually segmented corpora to make the training data for frequency approximation as large as possible. As a consequence, the word frequency approximation given by the scheme MS changes from F_MS(w_i, HB) to F_MS(w_i, HB+YUWEI), as defined in Equation (4), and the final word frequency approximation is given by F_RFB+MS(w_i, RC+(HB+YUWEI)), as defined in Equation (10).

Table 15 shows the performance of the unigram word segmentation models using the five frequency approximation schemes for CL on the SIGHAN Bakeoff'05 data sets. From Table 15 we can see that the proposed scheme RFB+MS outperforms all the other word frequency approximation schemes.
Table 13. The statistics of the SIGHAN Bakeoff'05 data sets.

Corpus                        MSRA       PKU       AS         CityU
Size of training set (Kb)     12,257     692       6,988      6085
Size of test set (Kb)         368        331       313        3
Size of word list             74,608     51,113    141,338    69,085
Table 14. F1-measure of the unigram word segmentation model using the individual word list and word frequencies from the SIGHAN training set, tested on the four SIGHAN Bakeoff'05 data sets.

Corpus    F1 (%)
MSRA      91.4
PKU       88.4
AS        88.3
CityU     86.1
Unfortunately, this result is not directly comparable to the results shown in Table 14, since our proposed scheme uses the same word list on all four data sets, whereas the results of Table 14 are obtained with different individual word lists extracted from the corresponding training sets.

In order to compensate for the influence of the individual word lists and to facilitate a fair comparison, we further run a unigram word segmentation model which uses the individual word list from each SIGHAN training set, with the word frequencies replaced by the approximation results obtained from the proposed scheme RFB+MS. A simple smoothing method, Laplace smoothing, is used to estimate the probability of words which do not appear in CL. Table 16 shows the number of shared words between the individual SIGHAN word lists and CL, as well as the performance of the resulting unigram word segmentation models on the SIGHAN Bakeoff'05 data sets.

The results indicate that such a strategy only decreases the word segmentation performance by 0.1–0.3 percentage points. We also note that the proportion of shared words is fairly low, so that roughly half of the words receive their frequency estimate by Laplace smoothing.
Table 15. F1-measure of the unigram word segmentation models using the five frequency approximation schemes for CL, tested on the four SIGHAN Bakeoff'05 data sets.

Corpus        STR (%)   MS (%)   RFB (%)   RFB+MS (%)   CRF+MS (%)
MSRA          85.2      85.6     85.4      86.3         85.9
PKU           85.4      86.8     85.8      87.4         87.2
AS            78.3      78.8     78.8      80.4         79.0
CityU         77.4      78.4     77.9      80.1         79.2
Table 16. F1-measure of the unigram word segmentation model using the RFB+MS frequency approximation scheme with Laplace smoothing for the individual word list, tested on the four SIGHAN Bakeoff'05 data sets.

Corpus   Number of shared words between individual word list and CL   F1-measure (%)
MSRA     38,992                                                       91.3
PKU      32,439                                                       88.3
AS       42,374                                                       88.0
CityU    30,299                                                       85.9
Table 17. F1-measure of the unigram word segmentation model using the RFB+MS frequency approximation scheme with the original word frequencies as smoothing for the individual word list, tested on the four SIGHAN Bakeoff'05 data sets.

Corpus     F1-measure (%)
MSRA       92.3
PKU        90.3
AS         89.8
CityU      87.4
If the frequencies of the missing words are taken directly from the SIGHAN training sets instead of Laplace smoothing, the performance of word segmentation, as given in Table 17, is improved; it is even better than the performance obtained using the individual word lists with word frequencies entirely derived from the SIGHAN training sets (Table 14), by 0.9%, 1.9%, 1.5% and 1.3% on the four data sets. This observation adds further evidence to the claim that the proposed scheme RFB+MS is effective for Chinese word frequency approximation. It also suggests that the scheme may easily be adapted to specific data collections.
6. CONCLUSION AND FUTURE WORK

In this article, a hybrid scheme is proposed which jointly uses corpora of different types (raw corpora, MM-segmented corpora and manually segmented corpora) to approximate word frequencies in Chinese. It has been developed taking into account the characteristics of the Chinese language. The parameters of the proposed scheme have been determined so as to produce a general framework for Chinese word frequency approximation; it can therefore be applied immediately to other tasks without additional adjustment. Statistical evaluations have been carried out from several perspectives. They show that the proposed scheme outperforms other Chinese word frequency approximation methods. The model has also been applied to a Chinese word segmentation task to compare its performance with other Chinese word frequency approximation schemes. The results indicate that the proposed scheme is the most effective one among the approaches under consideration. We are now compiling an authoritative Chinese word list based on this word frequency approximation scheme; by making it publicly available in the future, we hope to provide a resource which may be very useful for other Chinese NLP researchers.
ACKNOWLEDGEMENTS
This work is supported by the National Science Foundation of China under Grant No. 60873174, the National 863 High-Tech Project of China under Grant No. 2007AA01Z148 and the China–Germany (Tsinghua–Hamburg University) CINACS Program.
REFERENCES
Carpenter, B. (2005). Scaling high-order character language models to gigabytes. ACL Software Workshop. Ann Arbor, USA.
Chen, A., Zhou, Y., Zhang, A., & Sun, G. (2005). Unigram language model for Chinese word segmentation. In Chu-Ren Huang & G. A. Levow (Eds), Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing (pp. 138–141). Jeju Island, Korea: Association for Computational Linguistics.
Chen, G. L. (1994). On Chinese Morphology. Shanghai: Xuelin Publisher.
Cheng, K. S., Young, G. H., & Wong, K. F. (1999). A study on word-based and integral-bit Chinese text compression algorithms. Journal of the American Society for Information Science, 50(3), 218–228.
Dai, X. L. (1992). Chinese Morphology and its Interface with the Syntax. PhD dissertation, Ohio State University.
Emerson, T. (2005). The Second International Chinese Word Segmentation Bakeoff. In Chu-Ren Huang & G. A. Levow (Eds), Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing. Jeju Island, Korea: Association for Computational Linguistics.
Gao, J. F., Li, M., Huang, C., & Wu, A. (2005). Chinese word segmentation and named entity recognition: A pragmatic approach. Journal of Computational Linguistics and Chinese Language Processing, 31(4), 531–574.
Goldwater, S., Griffiths, T. L., & Johnson, M. (2006). Contextual dependencies in unsupervised word segmentation. In C. Cardie & P. Isabelle (Eds), Proceedings of the International Conference of COLING-ACL 2006 (pp. 673–680). Sydney: Association for Computational Linguistics.
Holland, J. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press.
Kit, C., Xu, Z., & Webster, J. J. (2004). Integrating N-gram model and case-based learning for Chinese word segmentation. Journal of Chinese Language and Computing, 14(3), 213–219.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In C. E. Brodley & A. Pohoreckyj Danyluk (Eds), Proceedings of the 18th International Conference on Machine Learning (ICML 2001) (pp. 282–289). Williamstown, MA: Morgan Kaufmann Publishers.
Levow, G. A. (2006). The Third International Chinese Word Segmentation Bakeoff. In H. T. Ng & O. O. Y. Kwong (Eds), Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing (pp. 108–117). Sydney: Association for Computational Linguistics.
Liang, N. Y. (1987). CDWS: A word segmentation system for written Chinese texts. Journal of Chinese Information Processing, 1(2), 44–52 (in Chinese).
Liu, E. S. (1973). Frequency Dictionary of Chinese Words. The Hague: Mouton and Co. N.V. Publishers.
Liu, Y., & Liang, N. Y. (1986). Counting word frequencies of contemporary Chinese – An engineering of Chinese processing. Journal of Chinese Information Processing, 0(1), 17–25.
Low, J. K., Ng, H. T., & Guo, W. (2005). A maximum entropy approach to Chinese word segmentation. Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing (pp. 161–164). Jeju Island, Korea: Association for Computational Linguistics.
Mao, J., Cheng, G., He, Y., & Xing, Z. (2007). A trigram statistical language model algorithm for Chinese word segmentation. In F. P. Preparata & Q. Fang (Eds), Proceedings of the International Conference of FAW 2007 (pp. 271–280). Lanzhou, China. Berlin, Heidelberg: Springer Verlag.
Palmer, D. D. (1997). A trainable rule-based algorithm for word segmentation. In P. R. Cohen & W. Wahlster (Eds), Proceedings of the International Conference of ACL 1997 (pp. 321–328). Madrid: Association for Computational Linguistics.
Peng, F., Feng, F., & McCallum, A. (2004). Chinese segmentation and new word detection using conditional random fields. In L. Lemnitzer, D. Meurers & E. Hinrichs (Eds), Proceedings of the International Conference of COLING 2004 (pp. 562–568). Geneva, Switzerland: Association for Computational Linguistics.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Sproat, R., Shih, C., Gale, W., & Chang, N. (1996). A stochastic finite-state word-segmentation algorithm for Chinese. Journal of Computational Linguistics and Chinese Language Processing, 22(3), 377–404.
Sproat, R., & Emerson, T. (2003). The First International Chinese Word Segmentation Bakeoff. In Q. Ma & F. Xia (Eds), Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing (pp. 133–143). Sapporo, Japan: Association for Computational Linguistics.
Sun, M. S., & T'sou, B. K. Y. (1995). Ambiguity resolution in Chinese word segmentation. In B. K. T'sou & T. B. Y. Lai (Eds), Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation (pp. 121–126). Hong Kong, China: City University of Hong Kong.
Sun, M. S., Shen, D. Y., & T'sou, B. K. Y. (1998). Chinese word segmentation without using lexicon and hand-crafted training data. In C. Boitet & P. Whitelock (Eds), Proceedings of the 36th ACL and 17th COLING (pp. 1265–1271). Montreal, Canada: Association for Computational Linguistics.
Sun, M. S., Zhang, Z. C., T'sou, B. K. Y., & Lu, H. (2006). Word frequency approximation for Chinese without using manually annotated corpus. In A. F. Gelbukh (Ed.), Proceedings of the 7th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing) (pp. 105–116). Mexico City, Mexico. Berlin, Heidelberg: Springer Verlag.
Tang, T. C. (1992). Chinese Morphology and Syntax. Taipei: Taiwan Student Publisher.
Teahan, W. J., Wen, Y., McNab, R., & Witten, I. H. (2000). A compression-based algorithm for Chinese word segmentation. Journal of Computational Linguistics and Chinese Language Processing, 26(3), 375–393.
Xue, N. W. (2003). Chinese word segmentation as character tagging. Journal of Computational Linguistics and Chinese Language Processing, 8(1), 29–48.
Zhu, D. X. (1982). Lectures on Grammar. Beijing: The Commercial Press.