搜索
您的当前位置:首页正文

BLAT--The BLAST-Like Alignment Tool

来源:小奈知识网
Downloaded from genome.cshlp.org on November 2, 2010 - Published by Cold Spring Harbor Laboratory Press

W. James Kent

Genome Res. 2002 12: 656-664

Access the most recent version at doi:10.1101/gr.229202

BLAT−−The BLAST-Like Alignment Tool

ReferencesEmail alerting

service

Article cited in:

http://genome.cshlp.org/content/12/4/656.full.html#related-urls

Receive free email alerts when new articles cite this article - sign up in the box at thetop right corner of the article orclick here

To subscribe to Genome Research go to: http://genome.cshlp.org/subscriptions

Cold Spring Harbor Laboratory Press

Downloaded from genome.cshlp.org on November 2, 2010 - Published by Cold Spring Harbor Laboratory Press

Resource

BLAT—TheBLAST-LikeAlignmentTool

W.JamesKent

DepartmentofBiologyandCenterforMolecularBiologyofRNA,UniversityofCalifornia,SantaCruz,SantaCruz,California95064,USA

AnalyzingvertebrategenomesrequiresrapidmRNA/DNAandcross-speciesproteinalignments.Anewtool,BLAT,ismoreaccurateand500timesfasterthanpopularexistingtoolsformRNA/DNAalignmentsand50timesfasterforproteinalignmentsatsensitivitysettingstypicallyusedwhencomparingvertebratesequences.BLAT’sspeedstemsfromanindexofallnonoverlappingK-mersinthegenome.ThisindexfitsinsidetheRAMofinexpensivecomputers,andneedonlybecomputedonceforeachgenomeassembly.BLAThasseveralmajorstages.Itusestheindextofindregionsinthegenomelikelytobehomologoustothequerysequence.Itperformsanalignmentbetweenhomologousregions.Itstitchestogetherthesealignedregions(oftenexons)intolargeralignments(typicallygenes).Finally,BLATrevisitssmallinternalexonspossiblymissedatthefirststageandadjustslargegapboundariesthathavecanonicalsplicesiteswherefeasible.ThispaperdescribeshowBLATwasoptimized.EffectsonspeedandsensitivityareexploredforvariousK-mersizes,mismatchschemes,andnumberofrequiredindexmatches.BLATiscomparedwithotheralignmentprogramsonvarioustestsetsandthenusedinseveralgenome-wideapplications.http://genome.ucsc.eduhostsaweb-basedBLATserverforthehumangenome.

Somemightwonderwhyintheyear2002theworldneedsanothersequencealignmenttool.Thelocalalignmentprob-lembetweentwoshortsequenceswassolvedbytheSmith-Watermanalgorithmin1980(SmithandWaterman1981).TheFASTA(PearsonandLipman1988)andtheBLASTfamilyofalignmentprogramsincludingNCBIBLAST(Altschuletal.1990,1997),MegaBLAST(Zhangetal.2000),andWU-BLAST(Altschuletal.1990;GishandStates1993;StatesandGish1994)provideflexibleandfastalignmentsinvolvinglargese-quencedatabases,andareavailablefreeonmanywebsites.Sim4(Floreaetal.1998)doesafinejobofcDNAalignment.TheSAMprogram(Karplusetal.1998)andPSI-BLAST(Altschuletal.1997)slowlybutsurelyfindremotehomologs.Gotoh’smanyalgorithmsrobustlydealwithgaps(Gotoh1990,2000).SSAHA(Ningetal.2001)mapssequencereadstothegenomewithblazingefficiency.

Intheprocessofassemblingandannotatingthehumangenome,Iwasfacedwithtwoverylarge-scalealignmentprob-lems:aligningthreemillionESTsandaligning13millionmousewhole-genomerandomreadsagainstthehumange-nome.Thesealignmentsneededtobedoneinlessthantwoweeks’timeonamoderate-sized(90CPU)Linuxclusterinordertohavetimetoprocessanupdatedgenomeeverymonthortwo.ToachievethisIdevelopedavery-high-speedmRNA/DNAandtranslatedproteinalignmentalgorithm.

ThenewalgorithmiscalledBLAT,whichisshortfor“BLAST-likealignmenttool.”BLATissimilarinmanywaystoBLAST.Theprogramrapidlyscansforrelativelyshortmatches(hits),andextendstheseintohigh-scoringpairs(HSPs).How-ever,BLATdiffersfromBLASTinsomesignificantways.WhereBLASTbuildsanindexofthequerysequenceandthenscanslinearlythroughthedatabase,BLATbuildsanindexofthedatabaseandthenscanslinearlythroughthequeryse-quence.WhereBLASTtriggersanextensionwhenoneortwohitsoccurinproximitytoeachother,BLATcantriggerexten-E-MAILkent@biology.ucsc.eduArticleandpublicationareathttp://www.genome.org/cgi/doi/10.1101/gr.229202.ArticlepublishedonlinebeforeMarch2002.

sionsonanynumberofperfectornear-perfecthits.WhereBLASTreturnseachareaofhomologybetweentwosequencesasseparatealignments,BLATstitchesthemtogetherintoalargeralignment.BLAThasspecialcodetohandleintronsinRNA/DNAalignments.Therefore,whereasBLASTdeliversalistofexonssortedbyexonsize,withalignmentsextendingslightlybeyondtheedgeofeachexon,BLATeffectively“un-splices”mRNAontothegenome—givingasinglealignmentthatuseseachbaseofthemRNAonlyonce,andwhichcor-rectlypositionssplicesites.

BLATisavailableinseveralforms.Sincebuildinganin-dexofthewholegenomeisarelativelyslowprocedure,aBLATserverisavailablewhichbuildstheindexandkeepsitinmemory.ABLATclientcanthenquerytheindexthroughtheserver.Theclient/serverversionisespeciallysuitableforin-teractiveapplications,andisavailableviaawebinterfaceathttp://genome.ucsc.edu.Astand-aloneBLATisalsoavailable,whichismoresuitableforbatchrunsononeormoreCPUs.Boththeclient/serverandthestand-alonecandocompari-sonsatthenucleotide,protein,ortranslatednucleotidelevel.

RESULTS

BLATiscurrentlyusedinthreemajorapplicationsinconjunc-tionwithhttp://genome.ucsc.edu.BLATisusedtoproducethehumanESTandmRNAalignments.ThehumanESTalign-mentscompared1.75ן109basesin3.73ן106ESTsagainst2.88ן109basesofhumanDNAandtook220CPUhoursonaLinuxfarmof800MhZPentiumIIIs.BLATwasusedintranslatedmodetoaligna2.5ןcoverageunassembledwhole-genomeshotgunofthemouseversusthemaskedhu-mangenome.Thisinvolved7.51ן109basesin1.33ן107readsandtook16,300CPUhours.Theclient/serverversionofBLATisusedtopoweruntranslatedandtranslatedinteractivesearchesonhttp://genome.ucsc.edu.ResearchersallovertheworlduseBLATtoperformthousandsofinteractivesequencesearchesperday.Thenucleotideserverhassustainedover500,000searchrequestsperdayfromprogram-drivenqueries.Wedoaskthoseresearcherswhoaredoingmorethanafew

656GenomeResearch

www.genome.org

12:656–664©2002byColdSpringHarborLaboratoryPressISSN1088-9051/01$5.00;www.genome.org

Downloaded from genome.cshlp.org on November 2, 2010 - Published by Cold Spring Harbor Laboratory Press

thousandprogram-drivenqueriestoobtainacopyofBLATtouseontheirownservers.Thenucleotideserverisnotaseffi-cientasthestand-aloneprogram,sincetosavememoryitdoesnotkeepthegenomeinmemory,onlytheindex.Theindexusesapproximately1gigabyteonunmaskedDNAinuntranslatedmode,andapproximately2.5gigabytesonmaskedDNAintranslatedmode.Thetranslatedmodeserverbydefaultislesssensitivethanthedefaultstand-aloneset-tings.Itrequiresthreeperfectaminoacid4-merstotriggeranalignment.Theuntranslatedserverusuallyrespondstoa1000-basecDNAqueryinlessthanasecond.Thetranslatedserverusuallyrespondstoa400-aminoacidproteinqueryin<5sec.

EvaluatingmRNA/DNAAlignments

AsatestofBLAT,Iremapped713mRNAscorrespondingtogenesthattheSangerCentrehasannotatedonchromosome22(Dunhametal.1999)backtochromosome22withBLATandwithSim4(Florea1998).WhenBLATproducedmultiplealignmentsforanmRNA,onlythehighestscoringalignmentwaskept.In99.99%oftheannotatedbases,theBLATalign-mentagreedwiththeSangerannotations.Therewere107basesin10geneswheretherewasdisagreement.Infiveofthe10genes,thedisagreementwasonlyintheplacementofnon-standardsplicesites.Intwocases,BLATdidnotfindsmall(<32-base)initialexons.Inonecase,anexonofsixbaseswaspresentandaligningfully,butinadifferentplacethanan-notated(whereitalsoalignedfully,butwithbetterflankingsplicesites).Inonecase,BLATpositionedanintrontocon-formwiththeconsensussequenceonthewrongstrand.Thatis,thegapcorrespondingtotheintronwaspositionedtohaveCT/ACratherthanGT/AGends.Thefinalcasewasa38-basesequencethatBLATwasunabletoplacebecausethemiddlecontainedsomedegeneratesequence.TheBLATalignmentsweredoneatthedefaultsettingsandtook26sec.

TheSim4alignmentsofthesamedatatook17,468sec(almost5h).TheyagreedwiththeSangerannotationsin99.66%ofthebases.ThereweredisagreementsbetweentheSim4alignmentsandtheSangerannotationsfromvariouscausesin52ofthegenes.Mostofthesedisagreementsweresmall.

EvaluatingMouse/HumanTranslatedAlignments

ThoughthetranslatedmodesofBLATarerelativelynew,theyarequickandeffective.ThetranslatedmodeofBLATwasin-spiredbytheExofishresearchatGenoscope(RoestCrolliusetal.2000).ExofishshowedthataTBLASTXrunusinganiden-titymatrix(wherematcheswereweighted+15andmis-matchesמ12forallaminoacids)andawordsizeof5wasquiteeffectiveinaligningcodingregionsconservedbetweenHomosapiensandTetraodonnigroviridis.Forhumanandmouseithasbeenshownthatgaplessalignmentsareinmanywayspreferabletogappedalignmentsfordetectingcodingregions(Wieheetal.2001).Table1showsthetimingsofBLATandWU-TBLASTXrunonamodest-sizeddatasetatgaplessExofish-likesettings.BLATrunsmuchfaster,makingitfea-sibletocomparevertebrategenomesquicklyenoughtokeepupwiththevastoutputoftoday’ssequencingcenters.

PankajAgarwalprovidedaWU-TBLASTXalignmentof13millionmousegenomicreadsversushumanchromosome22rununderagaplesssettingthatshouldtheoreticallybesome-whatmoresensitivethanthematrixusedfortheExofishset-

BLAT—TheBLAST-LikeAlignmentTool

Table1.TimingofBLATvs.WU-TBLASTXonaDataSetof1000MouseReadsandaRepeatMaskedHumanChromosome22MethodKNMatrixTimeWU-TBLASTX51+15/מ122736sWU-TBLASTX51BLOSUM622714sBLAT51+2/מ161sBLAT42+2/מ137sThefirstWU-TBLASTXrunwasperformedusingthesettingsusedinExofish.ThesecondWU-TBLASTXrunwasperformedusingthesettingsB=9000V=9000hspmax=4topcomboN=1W=5E=0.01Z=3000000000nogapsfilter=xnu+seg.TheKcolumnindicatesthesizeoftheperfectlymatchinghitthatservesasaseedforanalignment.TheNcolumnindicateshowmanyhitsinagapless100-aminoacidwindowwererequiredtotriggerade-tailedalignment.TheMatrixcolumndescribesthematch/mismatchscoresorthesubstitutionscorematrixused.tingsbecauseoftheuseoftheBLOSUM62matrix(P.Agarwal,pers.comm.).Table2showsacomparisonbetweenthisalign-mentandatranslatedBLATalignmentdoneattheindicatedsetting.Theresultswerequitecomparableinsensitivity.

OtherUsageInformation

BLATcanalsobeusedintranslatedmodetoalignproteinsormRNAfromonespeciesagainstgenomicDNAofanotherspe-cies.IntranslatedmRNA/translatedDNAmode,BLAThastoalignonlyonestrandofthequerysequence,speedingitupbyafactoroftwo.Inthismodeitalsobecomesmoretolerantofintron-inducedgaps.BLATcandoprotein–proteinalign-mentsaswell,butitisnotlikelytobethetoolofchoiceforthese.TheproteindatabasesarestillsmallenoughthatBLASTPcanhandlethemeasily,andBLASTPismoresensitivethanBLAT.

BLATcanhandleverylongdatabasesequenceseffi-ciently.Itismoreefficientatshortquerysequencesthanlongquerysequences.Itisnotrecommendedforquerysequenceslongerthan200,000bases.ItisnotnecessarytomasktheDNAforuntranslatedBLATsearches.Translatedsearchesgen-

Table2.SensitivityofWU-TBLASTXandBLATAppliedto13MillionMouseShotgunReadsandHumanChromosome22%RefSeq%RefSeqMethod%Chr22basesEnrichmentexonsWU-TBLASTX2.67%81.7%31x84.5%BLAT2.89%80.8%28x86.7%The“%Chr22”columnshowsthepercentageofchromosome22coveredbythealignments(genomicdensity).ThenextcolumnisthepercentageofbasesinsideofhumanRefSeqcodingsequencescoveredbythealignments(RefSeqcodingdensity).“Enrichment”istheratiooftheRefSeqcodingdensitycomparedtogenomicdensity.Higherlevelsofenrichmentindicatemorespecificityatthebaselevel.ThelastcolumnshowsthepercentageofRefSeqcodingexonswhereanypartoftheexoniscoveredbyanalign-ment.WU-TBLASTXwasrunwiththeparametersdescribedinthesecondrowofTable1.BLATwasrunwithusingapair-of-4-mersseedandascorecutoffof30.Atthesesettings,BLATtouchesslightlymoreexons,butWU-TBLASTXcoversslightlymorebasesinexons.Genomewww.genome.org

Research

657

Downloaded from genome.cshlp.org on November 2, 2010 - Published by Cold Spring Harbor Laboratory Press

Kent

erallyproducemuchquicker,cleanerresultsifthesequenceismaskedforrepeatsandlowcomplexitysequence.ItisconvenienttointroduceatermthatcountsthenumberofnonoverlappingK-mersinthehomologousregion:

T=floor͑HրK͒

(2)

METHODS

Algorithm

AllfastalignmentprogramsthatIamawareofbreakthealignmentproblemintotwoparts.Initiallyina“searchstage,”theprogramdetectsregionsofthetwosequenceswhicharelikelytobehomologous.Theprogramtheninan“alignmentstage”examinestheseregionsinmoredetailandproducesalignmentsfortheregionswhichareindeedho-mologousaccordingtosomecriteria.Thegoalofthesearchstageistodetectthevastmajorityofhomologousregionswhilereducingtheamountofsequencethatispassedtothealignmentstage.TheprobabilitythatatleastonenonoverlappingK-merinthehomologousregionmatchesperfectlywiththecorrespondingK-merinthequeryis:

P=1−͑1−p1͒T=1−͑1−MK͒T(3)

ThenumberofnonoverlappingK-mersthatareexpectedtomatchbychance,assumingthatalllettersareequallylikelytooccuris:

F=͑Q−K+1͒*͑GրK͒*͑1րA͒K(4)

SearchingWithSinglePerfectMatchesAsimpleandreasonablyeffectivesearchstageistolookforsubsequencesofacertainsize,k,whicharesharedbythequerysequenceandthedatabase.Inmanypracticalimple-mentationsofthissearch,everyK-merinthequeryiscom-paredagainstallnonoverlappingK-mersinthedatabase.Let’sexaminethenumberofhomologousregionsthataremissed,andthenumberofnonhomologousregionsthatarepassedtothealignmentstageusingthesecriteria.First,we’llneedsomedefinitions:K:TheK-mersize.Typicallythisis8–16fornucleotidecomparisonsand3–7foraminoacidcomparisons.

M:Thematchratiobetweenhomologousareas.Thiswouldbetypicallyabout98%forcDNA/genomicalignmentswithinthesamespecies,about89%forproteinalignmentsbetweenhumanandmouse.

H:Thesizeofahomologousarea.Forahumanexonthisistypically50–200bases.

G:Thesizeofthedatabase—3billionbasesforthehu-mangenome.

Q:Thesizeofthequerysequence.

A:Thealphabetsize;20foraminoacids,4fornucleo-tides.

Assumingthateachletterisindependentofthepreviousletter,theprobabilitythataspecificK-merinahomologousregionofthedatabasematchesperfectlythecorrespondingK-merinthequeryissimply:

p1=MK(1)

Tables3and4showPandFvaluesforvariouslevelsofse-quenceidentityandK-mersizes.ForESTalignmentswemightwantthesearchphasetofindatleast99%ofsequencesthathave5%orlesssequencingnoise.LookingatTable3,toachievethislevelofsensitivityusingthissimplesearchmethod,wewouldneedtochooseaKof14orless.AKof14resultsin399regionspassedontothealignmentphasebychancealone.AnysmallerKwouldpasssignificantlymore.Mouseandhumansequencesaverage89%identityattheaminoacidlevel(MakalowskiandBoguski1998).LookingatTable4,tocompareatranslatedmousereadandfindatleast99%ofthesequencesatthislevelofidentitywewouldneedaKof5orless,whichwouldresultin62,625sequencespassedontothealignmentstage.Dependingonthecostofthealign-mentstage,thesesimplesearchcriteriamayormaynotbesuitable.Comparingmouseandhumancodingsequencesatthenucleotidelevel,wherethereisonaverage86%baseiden-tity(MakalowskiandBoguski1998),requiresustoreduceourKto7tofindatleast99%ofthesequences.Thisresultsin13,078,962regionspassedtothealignmentstage,whichwouldprobablynotbepractical.

SearchingWithSingleAlmostPerfectMatches

WhatifinsteadofrequiringperfectmatcheswithaK-mertotriggeranalignment,weallowalmostperfectmatches,thatis,hitswhereonelettermaymismatch?TheprobabilitythatanonoverlappingK-merinahomologousregionofthedata-basematchesalmostperfectlythecorrespondingK-merinthequeryis:

p1=K*MK−1*͑1−M͒+MK(5)

Table3.SensitivityandSpecificityofSinglePerfectNucleotideK-merMatchesasaSearchCriterion780.9150.9530.9780.9920.9981.0001.0001.0001.00082.9e+0690.8330.8970.9450.9750.9910.9981.0001.0001.0009635783100.7260.8150.8880.9420.9760.9930.9991.0001.00010143051110.6070.7110.8080.8880.9460.9810.9950.9991.0001132512120.4860.5950.7070.8110.8970.9560.9870.9981.000127451130.3730.4780.5940.7140.8240.9120.9680.9941.000131719140.3140.4150.5320.6590.7820.8860.9570.9910.99914399A.81%83%85%87%89%91%93%95%97%B.KF0.9740.9880.9960.9991.0001.0001.0001.0001.00071.3e+07(A)ColumnsareforKsizesof7–14.Rowsrepresentvariouspercentageidentitiesbetweenthehomologoussequences.Thetableentriesshowthefractionofhomologiesdetectedascalculatedfromequation3assumingahomologousregionof100bases.ThelargerthevalueofK,thefewerhomologiesaredetected.(B)Krepresentsthesizeoftheperfectmatch.Fshowshowmanyperfectmatchesofthissizeexpectedtooccurbychanceaccordingtoequation4inagenomeof3billionbasesusingaqueryof500bases.658GenomeResearch

www.genome.org

Downloaded from genome.cshlp.org on November 2, 2010 - Published by Cold Spring Harbor Laboratory Press

BLAT—TheBLAST-LikeAlignmentTool

Table4.SensitivityandSpecificityofSinglePerfectAminoAcidK-merMatchesasaSearchCriterionKA.71%73%75%77%79%81%83%85%87%89%91%93%B.KF30.9920.9960.9980.9990.9991.0001.0001.0001.0001.0001.0001.00034.2e+0740.9040.9310.9520.9690.9810.9890.9940.9970.9991.0001.0001.00041.6e+0650.6970.7520.8030.8500.8900.9240.9500.9700.9840.9930.9970.99956262560.4960.5600.6250.6890.7520.8100.8620.9060.9420.9680.9850.9956260970.3170.3740.4360.5030.5740.6460.7180.7870.8500.9030.9450.9757112humanhomologiesatthenucleotidelevelusingthistech-nique.AKsizeof12detectsover99%ofthemousehomolo-gies,andrequireschecking275,671alignments.Attheaminoacidlevel,aKsizeof8hasthedesiredsensitivityandrequirescheckingonly374alignments.

SearchingWithMultiplePerfectMatches

Anotheralternativesearchmethodistorequiremultipleper-fectmatchesthatareconstrainedtobeneareachother.Con-siderasituationwheretheKsizeis10andtherearetwohits—onestartingatposition10inthequeryand1010inthedatabase,andanotherstartingatposition30inthequeryand1030inthedatabase.Thesetwohitscouldeasilybepartofaregionofhomologyextendingfrompositions10–39inthequeryand1010–1039inthedatabase.Ifwesubtractthequerycoordinatefromthedatabasecoordinate,wegeta“diagonal”coordinate.ConsiderthesearchcriteriathattheremustbeNperfectmatches,eachnofurtherthanWlettersfromeachotherinthetargetcoordinate,andhavethesamediagonalcoordinate(Fig.1).ForN=1,theprobabilitythatanonover-lappingK-merinahomologousregionofthedatabasematchesperfectlythecorrespondingK-merinthequeryissimplyasbefore:

p1=MK(8)

(A)ColumnsareforKsizesof3–7.Rowsrepresentvariousper-centageidentitiesbetweenthehomologoussequences.Thetableentriesshowthefractionofhomologiesdetectedascalculatedfromequation3assumingahomologousregionof33aminoacids.(B)Krepresentsthesizeoftheperfectmatch.Fshowshowmanyperfectmatchesofthissizeareexpectedtooccurbychanceaccordingtoequation4inatranslatedgenomeof3billionbasesusingaqueryof167aminoacids(correspondingto500bases).Theprobabilitythatthereareexactlynmatcheswithinthehomologousregionis

Pn=p1n*͑1−p1͒T−n*T!ր͑n!*͑T−n͒!͒

(9)

Aswithasingleperfecthit,theprobabilitythatanynonover-lappingK-merinthehomologousregionmatchesalmostper-fectlywiththecorrespondingK-merinthequeryis:

P=1−͑1−p1͒

TAndtheprobabilitythatthereareNormorematchesisthesum:

P=PN+PN+1+…+PT(10)

(6)

WhereasthenumberofK-merswhichmatchalmostperfectlybychanceare:

F=͑Q−K+1͒*͑GրK͒*͑K*͑1րA͒K−1*͑1−͑1րA͒͒+͑1րA͒K͒

(7)

ThenumberofsetsofNperfectmatchesthatoccurbychanceisalittlecomplextocalculate.ForN=1itiseasy:

F1=͑Q−K+1͒*͑GրK͒*͑1րA͒K(11)

Tables5and6showPandFforvariouslevelsofsequenceidentityandK-mersizes.ForthepurposesofESTalignments,aKof22orlesswouldpassthroughover99%ofthetrulyhomologousregionswhileonaveragepassinglessthanonechancematchthroughtothealigner.Withareasonablyfastalignmentstage,itwouldbefeasibletolookformouse/

TheprobabilityofasecondmatchoccuringwithinWlettersafterthefirstis

S=1−͑1−͑1րA͒K͒WրK(12)

becausethesecondmatchcanoccurwithanyoftheW/KnonoverlappingK-mersinthedatabasewithinWlettersafterthefirstmatch.Wecanextendthisreasoningtoconsiderthe

Table5.SensitivityandSpecificityofSingleNear-Perfect(OneMismatchAllowed)NucleotideK-merMatchesasaSearchCriterion12A.81%83%85%87%89%91%93%95%97%B.KF0.9450.9750.9910.9971.0001.0001.0001.0001.00012275671130.8800.9360.9710.9900.9971.0001.0001.0001.0001368775140.8310.9040.9540.9830.9950.9991.0001.0001.0001417163150.7210.8200.9000.9540.9840.9960.9991.0001.000154284160.6570.7700.8650.9350.9760.9940.9991.0001.000161070170.5260.6490.7670.8670.9390.9790.9961.0001.00017267180.4650.5910.7190.8330.9200.9710.9940.9991.0001867190.4080.5350.6690.7960.8970.9620.9910.9991.0001917200.3560.4800.6190.7570.8720.9500.9880.9991.000204.2210.2550.3610.4900.6340.7750.8900.9630.9941.000211.0220.2180.3180.4450.5910.7410.8690.9540.9921.000220.3(A)ColumnsareforKsizesof12–22.Rowsrepresentvariouspercentageidentitiesbetweenthehomologoussequences.Thetableentriesshowthefractionofhomologiesdetectedascalculatedbyequation6assumingahomologousregionof100bases.(B)Krepresentsthesizeofthenear-perfectmatch.Fshowshowmanyperfectmatchesofthissizeexpectedtooccurbychanceaccordingtoequation7inagenomeof3billionbasesusingaqueryof500bases.GenomeResearch

www.genome.org

659

Downloaded from genome.cshlp.org on November 2, 2010 - Published by Cold Spring Harbor Laboratory Press

Kent

Bothsingleimperfectmatchesandmultipleperfectmatcheshaveasig-456789nificantadvantageoversingleper-fectmatches.Theydrasticallyre-A.71%1.0000.9920.9460.8230.7250.515ducethenumberofalignments

73%1.0000.9950.9650.8670.7850.586whichmustbecheckedtoachievea75%1.0000.9980.9780.9050.8400.657givenlevelofsensitivity,asshown77%1.0000.9990.9870.9350.8860.727inTables9and10.Themultiple-79%1.0000.9990.9930.9590.9240.791perfectmatchcriteriacanbemodi-81%1.0001.0000.9970.9760.9520.849fiedtoallowsmallinsertionsand83%1.0001.0000.9990.9870.9730.897deletionswithinthehomologous85%1.0001.0000.9990.9940.9860.936areabyallowingmatchestobe87%1.0001.0001.0000.9970.9940.964clumpediftheyareneareachother

89%1.0001.0001.0000.9990.9980.982ratherthanidenticalonthediago-91%1.0001.0001.0001.0000.9990.993nalcoordinate.Thisimprovesreal-93%1.0001.0001.0001.0001.0000.998worldsensitivityattheexpenseofincreasingthenumberofalign-B.K456789mentsthatmustbedone.AllowingF1.2E+086.0E+063000781498574937asingleinsertionordeletionin-creasesthealignmentsbyafactor

(A)ColumnsareforKsizesof4–9.Rowsrepresentvariouspercentageidentitiesbetweentheofthree,whereasallowingtwohomologoussequences.Thetableentriesshowthefractionofhomologiesdetected.(B)Krepre-increasesthealignmentsbyafac-sentsthesizeofthenear-perfectmatch.Fshowshowmanyperfectmatchesofthissizeexpectedtoroffive.Ingeneral,twoperfecttooccurbychanceinatranslatedgenomeof3billionbasesusingaqueryof167aminoacids.matcheswiththeappropriateKsizegivespecificityforagivenlevelofsensitivitysimilartothatgivenby

chancethattheNthmatchiswithinWlettersafterthethreeormoreperfectmatches.Thenear-perfectmatchcrite-(Nמ1)thmatch,whichgivesthemoregeneralrelationshiprionoverallissimilartothetwoperfectmatchcriteria.The

near-perfectcriterioncannotaccommodateinsertionsorde-(13)FN=S*FN−1letions,butithassuperiorperformanceonfindingsmallre-gionsofhomology(Table11).Forfindingcodingexonsinwhichcanbesolvedas

mouse/humanalignments,whicheverstrategyisused,greater

N−1specificityisseenattheaminoacidratherthanthenucleotide(14)FN=F1*S

level.

whereFNrepresentsthenumberofchancematchesofNK-Sincesingle-baseinsertionsordeletionsarerelatively

merseachseparatedbynomorethanWfromthepreviouscommonartifactsofthesequencingprocess,nucleotideBLATmatch.usesthetwoperfect11-mermatchcriteriabydefault.Table12

Tables7and8showthesensitivityandspecificityforNshowsactualalignmenttimesfornucleotideBLATonacol-valuesof2and3andvariousvaluesofotherparameterslectionofESTsatvarioussettings.Forproteinmatches,thewhichapproximatecDNAormouse/humanalignments.defaultcriterionisasingleperfect5forthestand-alonepro-gram.ThisisbecausetheextensionphaseofproteinBLATis

extremelyquickinthestand-aloneprogram,sothefalseposi-tivesgeneratedbythisapproachhaverelativelylittlecost.Theclient/serverproteinBLATusesthreeperfect4-mersbydefaultbecauseintheclient/serverversion,aportionofthegenomemustbeloadedfromdiskforeachfalsepositive,arelativelytime-consumingoperation.Asaresult,theclient/serverpro-teinBLATissomewhatlesssensitivethanthestand-alonever-sion.

Table6.SensitivityandSpecificityofSingleNear-Perfect(OneMismatchAllowed)AminoAcidK-merMatchesasaSearchCriterionSelectingInitialMatchCriteria

ClumpingHitsandIdentifyingHomologousRegions

Toimplementthematchcriteria,BLATbuildsupanindexofnonoverlappingK-mersandtheirpositionsinthedatabase.BLATexcludesK-mersthatoccurtoooftenfromtheindex,aswellasK-merscontainingambiguitycodesandoptionallyK-mersthatareinlowercaseratherthanuppercase.BLATthenlooksupeachoverlappingK-merofthequerysequenceintheindex.Inthisway,BLATbuildsalistof“hits”wherethequeryandthetargetmatch.Eachhitcontainsadatabasepositionandaqueryposition.Thefollowingalgorithmisusedtoef-ficientlyclumptogethermultiplehits.Thehitlistissplitintobucketsof64keach,basedonthedatabaseposition.Eachbucketissortedonthediagonal(databaseminusqueryposi-tions).Hitsthatarewithinthegaplimitarebundledtogetherintoproto-clumps.Hitswithinproto-clumpsarethensortedalongthedatabasecoordinateandputintorealclumpsiftheyarewithinthewindowlimitonthedatabasecoordinate.Toavoidmissedclumpsnearthe64kbucketboundary,un-clumpedhitsandclumpsthatarewithinthewindowlimitare

Figure1Apairofhitsandtwootherhits.Thehitsa,b,c,anddareallKletterslong.HitsdandbhavethesamediagonalcoordinateandarewithinWlettersofeachother.Thereforetheywouldmatchthe“twoperfectK-mer”searchcriteria.

660GenomeResearch

www.genome.org

Downloaded from genome.cshlp.org on November 2, 2010 - Published by Cold Spring Harbor Laboratory Press

BLAT—TheBLAST-LikeAlignmentTool

Table7.SensitivityandSpecificityofMultiple(2and3)PerfectNucleotideK-merMatchesasaSearchCriterion2,82,90.5080.6380.7620.8660.9400.9800.9961.0001.0002,9272,100.3480.4750.6150.7520.8680.9470.9860.9981.0002,101.42,110.2200.3260.4600.6110.7610.8840.9620.9931.0002,110.12,120.1290.2080.3180.4610.6250.7870.9120.9790.9992,120.03,80.3890.5290.6760.8090.9100.9690.9930.9991.0003,80.13,90.2210.3390.4870.6490.8010.9140.9760.9971.0003,90.03,100.1120.1930.3130.4700.6480.8150.9330.9870.9993,100.03,110.0510.0990.1800.3050.4760.6730.8510.9610.9973,110.03,120.0210.0450.0930.1770.3140.5050.7220.9020.9873,120.0A.81%83%85%87%89%91%93%95%97%B.N,KF0.6810.7900.8790.9420.9780.9940.9991.0001.0002,8524(A)ColumnsareforNsizesof2and3andKsizesof8–12.Rowsrepresentvariouspercentageidentitiesbetweenthehomologoussequences.Thetableentriesshowthefractionofhomologiesdetectedascalculatedbyequation10.(B)NandKrepresentthenumberandsizeofthenear-perfectmatches,respectively.Fshowshowmanyperfectclusteredmatchesexpectedtooccurbychanceaccordingtoequation14inatranslatedgenomeof3billionbasesusingaqueryof167aminoacids.tossedintothenextbucketforadditionalclumpingopportu-nities.ThesortingalgorithmmSort,whichisrelatedtoqSort,isused.ThebucketingtendstokeepNrelativelysmall.

Clumpswithlessthantheminimumnumberofhitsarediscarded,andtherestareusedtodefineregionsofthedata-basewhicharehomologoustothequerysequence.Clumpswhicharewithin300basesor100aminoacidsinthedatabasearemergedtogether.Fivehundredadditionalbasesareaddedoneachsidetoformthefinalhomologousregion.

ofsensitivity,thenear-perfectmatchcriterionruns15ןmoreslowlythanthemultiple-perfectmatchcriterioninBLAT(Table13).Thenear-perfectmatchcriterionseemsbestsuitedforprogramsthathashthequerysequenceratherthanthedatabase.Aquerysequenceissufficientlysmallthateachpos-siblenearlymatchingK-mercouldbehashed,andthereforetheindexwouldnothavetobescannedrepeatedly.

AlignmentStage

Thealignmentstageperformsadetailedalignmentbetweenthequerysequenceandthehomologousregions.Forhistori-calreasons,thealignmentstagefornucleotideandproteinalignmentsisquitedifferent.Bothhavelimitations,andaregoodcandidatesforfutureBLATupgrades.Ontheotherhand,botharequiteusefulintheirpresentformforsequenceswhicharenottoodivergent.

SearchingforNearPerfectMatches

BLAThasanoptiontoallowonemismatchinahit.ThisisimplementedbyscanningtheindexrepeatedlyforeachK-merinthequery.EverypossibleK-merthatmatchesinallbutoneposition,aswellastheK-merthatmatchesateverypo-sition,islookedup.Inall,K*(Aמ1)+1lookupsarerequired.Foranamino-acidsearchwithK=8,thisamountsto153lookups.Becauseastraightindexof8-merswouldrequire208indexpositionsorabout100billionbytes,itisnecessarytoswitchtoahashingschemeratherthananindexingscheme,furthercuttingefficiency.Asaconsequence,foragivenlevel

NucleotideAlignments

ThenucleotidealignmentstageisbasedonacDNAalignmentprogramfirstusedintheIntronerator(http://www.cse.

Table8.SensitivityandSpecificityofMultiple(2and3)PerfectAminoAcidK-merMatchesasaSearchCriterion2,32,40.6430.7120.7760.8330.8820.9220.9520.9730.9870.9950.9981.0002,42452,50.2970.3630.4360.5140.5960.6780.7570.8290.8890.9360.9690.9882,50.42,60.1260.1670.2180.2800.3530.4350.5260.6220.7190.8090.8860.9442,60.02,70.0440.0630.0890.1230.1690.2260.2980.3850.4850.5960.7120.8232,70.03,30.9450.9650.9780.9870.9930.9970.9990.9991.0001.0001.0001.0003,37083,40.6430.7120.7760.8330.8820.9220.9520.9730.9870.9950.9981.0003,40.03,50.2970.3630.4360.5140.5960.6780.7570.8290.8890.9360.9690.9883,50.03,60.1260.1670.2180.2800.3530.4350.5260.6220.7190.8090.8860.9443,60.03,70.0440.0630.0890.1230.1690.2260.2980.3850.4850.5960.7120.8233,70.0A.71%73%75%77%79%81%83%85%87%89%91%93%B.N,KF0.9450.9650.9780.9870.9930.9970.9990.9991.0001.0001.0001.0002,3171875(A)ColumnsareforNsizesof2and3andKsizesof3–7.Rowsrepresentvariouspercentageidentitiesbetweenthehomologoussequences.Thetableentriesshowthefractionofhomologiesdetected.(B)NandKrepresentsthenumberandsizeoftheperfectmatches,respectively.Fshowshowmanyperfectclusteredmatchesexpectedtooccurbychanceinatranslatedgenomeof3billionbasesusingaqueryof167aminoacids.GenomeResearch

www.genome.org

661

Downloaded from genome.cshlp.org on November 2, 2010 - Published by Cold Spring Harbor Laboratory Press

Kent

Table9.ESTAlignmentChoicesKF1Perfect14399Near-Perfect220.32Perfect110.13Perfect90.0MaximumKsizesandnumberofchancematchespassedtothealignmentstagewhensearchingfor100basesof95%homologywithatleasta99%chanceofdetectingthehomology.ThesevaluesreflectourtargetsforESTalignments.ucsc.edu/∼kent/intronerator)(KentandZahler2000).Theal-gorithmstartsbygeneratingahitlistbetweenthequeryandthehomologousregionofthedatabase.Becausethehomolo-gousregionismuchsmallerthanthedatabaseasawhole,thealgorithmlooksforrelativelysmall,perfecthits.IfaK-merinthequerymatchesmultipleK-mersintheregionofhomol-ogy,theK-merisextendedbyonerepeatedlyuntilthematchisuniqueortheK-merexceedsacertainsize.Thehitsarethenextendedasfaraspossibleallowingnomismatches,andover-lappinghitsaremerged.Theextendedhitsthatfolloweachotherinbothqueryanddatabasecoordinatesarethenlinkedtogetherintoanalignment.Iftherearegapsinthealignmentonboththequeryanddatabaseside,thealgorithmrecursestofillinthesegaps.Becausethegapsaresmallerthantheorigi-nalqueryanddatabasesequences,asmallerkcanbeusedingeneratingthehitlist.Thiscontinuesuntileithertherecur-sionfindsnoadditionalhits,orthegapisfivebasesorless.Atthispoint,extensionsthroughNs,extensionsthatallowoneortwomismatchesiffollowedbymultiplematches,andfi-nallyextensionsthatallowoneortwoinsertionsordeletions(indels)followedbymultiplematchesarepursued.FormRNAalignments,itisoftenthecasethatthereareseveralequiva-lent-scoringplacementsforalargegapinthequerysequence.Generallysuchgapscorrespondtoanintron.SuchgapsareslidaroundtofindtheirbestmatchtotheGT/AGconsensussequenceforintronends.

ThenucleotidealignmentstrategyworkswellformRNAalignmentsandthetypeofalignmentsneededforgenomicassembly.Inthesecases,thesequenceidentityistypically95%orbetter.Thestrategystartstobreakdownwhenbase

Table10.Mouse/HumanAlignmentChoicesKFFTranslated1Perfect-DNA713,078,9621Perfect-AA562,625187,875Near-perfect-DNA12275,671Near-perfect-AA87492,2472PerfectDNA6237,9832PerfectAA42457343PerfectDNA5109,7073PerfectAA37082,123Assuming86%baseidentityand89%aminoacididentity,thistableshowsthemaximumKsizesandnumberofchancematchespassedtothealignmentstagewhensearchingforregionsof100bases(or33aminoacids)withatleasta99%chanceofdetectingthehomology.Thesevaluesreflectourtargetsforhuman/mousealignments.FortranslatedDNAsequences,theFvalueismulti-pliedbysixtoreflectthreereadingframesonbothstrandsofthequery.Evenwiththismultiplication,thespecificityforagivensensitivityisseveralordersofmagnitudegreaterintheaminoacidratherthanthenucleotidedomain.662Genomewww.genome.org

Research

identityisbelow90%,andisthereforenotsuitableformostcross-speciesalignments.

ProteinAlignments

Theproteinalignmentstrategyissimpler.Thehitsfromthesearchstagearekeptandextendedintomaximallyscoringungappedalignments(HSPs)usingascorefunctionwhereamatchisworth2andamismatchcosts1.AgraphisbuiltwithHSPsasnodes.IfHSPAstartsbeforeHSPBinbothqueryanddatabasecoordinates,anedgeisplacedfromAtoB.TheedgeisweightedbythescoreofBminusagappenaltybasedonthedistancebetweenAandB.InthecasewhereAandBoverlap,a“crossover”pointisselectedwhichmaximizesthesumofthescoresofAuptothecrossoverandBstartingatthecross-over,andthedifferencebetweenthefullscoresandthescoresjustuptothecrossoverissubtractedfromtheedgescore.Adynamicprogramthenextractsthemaximal-scoringalign-mentbytraversingthisgraph.TheHSPsinthemaximal-scoringalignmentareremoved,andifanyHSPsareleftthedynamicprogramisrunagain.

Themajorlimitationofthisproteinalignmentstrategyisthatifthereisanindel,partofthealignmentwillbelostunlessthesearchstagemanagestofindbothsidesofthein-del.Forthetranslatedmouseversustranslatedhumange-nomejob,whichwasthemajormotivationforproteinBLAT,thislimitationisnotasseriousasitwouldbewhensearchingformoredistanthomologs.Indeedinthetranslatedmouse/translatedhumancase,thislimitonindelsisactuallyusefulinsomewaysasitreducestheamountofpseudogeneswhicharefoundbyBLATmorethanitreducestheamountofgenesfound.Evenso,inthefuturewehopetoreplacethissimplisticextensionphasewithabanded(onlysmallgapsallowed)Smith-Watermanalgorithm(Chaoetal.1992).

StitchingandFillingIn

Itisoftenthecasethatthealignmentofageneisscatteredacrossmultiplehomologousregionsfoundinthesearchphase.ThesealignmentsarestitchedtogetherusingaminorvariationofthealgorithmusedtostitchtogetherproteinHSPs.ForDNAalignmentsatthisstage,thegappenaltyisequaltoaconstantplusthelogofthesizeofthegap.FormRNA/genomicalignments,ifafterstitchingtherearegapsleftbetweenaligningblocksinboththedatabaseandquerysequence,thenucleotidealignmentalgorithmiscalledonthegaptoattempttofillitin.ThisgivesBLATachancetofindsmallinternalexonsthatarefurtherawaythan500basesfromotherexons,andwhicharetoosmalltobefoundbythesearchstage.

SincethesorttimeisO(NlogN),thatis,proportionaltoNtimeslogN,whereNisthenumberofhitstobesorted,andthedynamicprogramtimeisO(N2)whereNisthenumberofHSPs,anadditionalstepisnecessarytomakeBLATefficientonlongerquerysequences.Untranslatednucleotidequerieslongerthan5000basesandtranslatedquerieslongerthan1500basesarebrokenintosubqueriesthathaveapproxi-mately250basesofoverlap.Eachsubqueryisalignedasabove,andtheresultingalignmentsarestitchedtogether.Currentlythissubdividingandstitchingisonlyavailableforthestand-aloneBLAT,nottheclient/serverversion.

DISCUSSION

Asshownabove,BLATisaveryeffectivetoolfordoingnucleotidealignmentsbetweenmRNAandgenomicDNAtakenfromthesamespecies.ItismoreaccurateandordersofmagnitudefasterthanSim4.Sim4inturnismoreaccurateandordersofmagnitudefasterthanotherpublishedtoolssuchasest_genome(Mott1997;Floreaetal.1998).AlthoughthealignmentstrategyBLATusesfornucleotidealignmentsbecomeslesseffectivebelow90%sequenceidentity,iteffi-

Downloaded from genome.cshlp.org on November 2, 2010 - Published by Cold Spring Harbor Laboratory Press

BLAT—TheBLAST-LikeAlignmentTool

Table11.HomologySize(InNucleotides)vs.SensitivityforFourSearchCriteriaasAppliedtoMouse/HumanComparisonsAssuming86%NucleotideIdentityand89%AminoAcidIdentity20ABCD0.6250.0000.4830.000300.8050.3940.7330.783400.8810.6870.8620.783500.9270.8510.9290.953600.9620.9320.9630.953700.9770.9320.9650.953800.9860.9700.9810.990900.9930.9870.9900.9901000.9950.9950.9950.9981100.9970.9980.9970.9981200.9990.9990.9991.0001300.9990.9990.9991.000Themediansizeofahumanexonis120bases(InternationalHumanGenomeSequencingConsortium2001).A,perfect5baseaminoacidmatch;B,twoperfect4-baseaminoacidmatcheswithin100aminoacidsandondiagonal;C,near-perfect12nucleotidematch;D,near-perfect8aminoacidmatch.Table12.AlignmentTimesinSecondsof10,000ESTs(AverageSize380Bases)AgainstHumanGenomicSequenceUsingVariousKSizesandNSizesK101011111212N2323232؂1063.93.22.42.33.93.72؂10735.621.48.16.57.06.42؂108680.1348.792.461.839.933.8The2ן106genomicsequenceisctg12414,whichis2,034,363baseslongandwastakenfromtheDecember2000UCSChumangenomeassembly(http://genome.ucsc.edu).The2ן107ge-nomicsequenceisctg15424andis20,341,418baseslong.The2ן108columnischromosome4andis200,175,155baseslong.Thetwomajorcomponentsoftherun-timearethetimeittakestobinandsorttheK-merhits(clumpingisalmostinstantaneousaftersorting),andthetimeittakestoextendtheclumpsintoalignments.Thebin/sorttimedependsonthenumberofhits,whichisproportionalto4מK.Thebin/sorttimeissomewherebetweenO(n)andO(nlogn).Theextendtimeislinearwithrespecttothenumberofclumps.ciently“unsplices”mRNA,andaccommodatesthelevelofsequencedivergenceintroducedbysequencingerror.BLATisabletounspliceallthehumanmRNAinGenBank,includingtheESTs,inlessthanadayona100-CPUcomputercluster.Sincethehuman,mouse,andotherlargegenomeprojectsareupdatingsequencesatarapidrate,andGenBankcontinuestogrowatarapidrate,rapidalignmentisneededtokeepge-nomeannotationsinsynchronywithimprovinggenomeas-semblies.

BLATworkingintranslatedmodeiscapableofrapidlyaligningdataacrossvertebratespecieswithoutsignificantcompromise.WhileTBLASTXcanbeconfiguredtobemoresensitivethanBLAT,atsettingscommonlyusedformammal–mammalcomparisons,BLATrunsapproximately50timesfaster.EvenusingBLAT,analignmentofpublicmousewhole-genomeshotgundatatook12daysonour100-CPUcluster.Itwouldbedifficulttokeepthemouse–humanhomologyin-formationuptodatewithaslowertool.

High-speedalignmentprogramshavetwomajorstages—asearchstagethatusesaheuristictoidentifyregionslikelytobehomologous,andanalignmentstagethatdoesdetailedalignmentsofthepreviouslydefinedhomologousregions.Togetadequatespeedwhenoperatingatthescaleofwholegenomes,thesearchstageiscrucial.Anindexofsomesortiskeytoanefficientsearchstage.BLATindexestheda-tabaseratherthanthequerysequence.Thismorethanany-

Table13.TheRelativeSensitivityandSpecificityofBLATatVariousSettingsWithandWithoutaBest-MatchFilterAllnonhumanmRNAalignmentsK548OnlythebestalignmentforeachK548N121mRNAN12INearperfectnonoyesNearperfectnonoyesCPUdays8.696.0695.25CPUdays8.696.0695.25%genome0.90%0.83%0.86%%Genome0.70%0.66%0.68%%RefSeq69%68%69%%RefSeq67%65%66%Enrichment77x81x80xEnrichment95x99x97xThehumangenome(August2001freezeUCSCassemblyhttp://genome.ucsc.edu)wasalignedagainstacollectionof86,000nonhumanmRNAsequencestotaling123,000,000basestakenfromGenbank.TheCPUdaysweremeasuredon800MhZPentiums.ThepercentagegenomecolumnshowswhatpercentofbasesinthehumangenomearepartofgaplessalignmentswiththemRNAs.ThepercentageRefSeqshowsthepercentageofcodingbasesinhumanRefSeq-definedgenes(fromthedatabaseathttp://genome.ucsc.edu)whicharepartofagaplessalignment.EnrichmentistheratiooftheRefSeqcodingdensityinsideofthealignmentscomparedtocodingdensityinthegenomeasawhole.Higherlevelsofenrichmentindicategreaterspecificity.AlsoshownarethesamestatisticswhenonlythesinglebestplacethatanonhumanmRNAalignswasconsidered.GenomeResearch

www.genome.org

663

Downloaded from genome.cshlp.org on November 2, 2010 - Published by Cold Spring Harbor Laboratory Press

Kent

thingisresponsiblefortherelativelyhighspeedofBLATcom-paredtoSim4orTBLASTX.Ratherthanhavingtolinearlyscanthroughadatabaseofgigabasesofsequencelookingforindexmatches,BLATonlyhastoscanthrougharelativelyshortquerysequence.TheprogramSSAHAindexesthedata-baseinamannerverysimilartothatofBLAT,andisanex-tremelyeffectivetoolforaligninggenomicregionsfromthesameorganismagainsteachother.CurrentlySSAHAdoesnotimplementunsplicinglogic,andalwaysusesasingleperfectmatchasaseed.

Thechallengetoindexingthedatabaseistwofold:thesizeoftheindexandthetimeittakestogeneratetheindex.Fortunately,computerswithseveralgigabytesofRAMhavebecomeaffordableandcommonplaceinthelastfewyears,sothesizeoftheindexisnottheproblemitoncewas.Bymak-ingtheindexrelativelyefficient(onlyfourbytesperindexentryvs.eightinthepublishedversionofSSAHA),andonlyindexingnonoverlappingwords,BLATisabletoindexthehumangenomeatthenucleotidelevelin0.9gigabytesandtoindexaRepeatMaskermasked,translatedhumangenomein2.5gigabytes.Theindexdoestakesometimetogenerate:30minforatranslatedindexofthehumangenome.Fortunatelytheindexisnotgeneratedoften.Inbatchmode,BLATgen-eratestheindexonceatthestartofprocessingabatchoftypicallyhundredsofthousandsofquerysequences.Ininter-activeclient/servermode,thegenomeonlyhastobegener-atedonceforeachgenomeassembly.Thetypicalusersimplypastesinasequencetoawebform,andquicklyreceivesanalignmentinresponse.Howanindexisusedisalsoimportanttothespeedofanalignmentalgorithm.Triggeringanalignmentforeachmatchtotheindexisnotalwaystheoptimalstrategy.Forasensitivesearch,itisdesirabletoindexrelativelysmallK-mers.How-ever,smallK-merswillreturnmanyfalsepositives,potentiallycreatingabottleneckinthealignmentstageifthealignmentstageiscomputationallyexpensive.RequiringmultiplenearbymatchesorusinglongerK-mersbuttoleratingamis-matchassearchcriteriabothhavemuchgreaterspecificityforagivenlevelofsensitivitythanthecriterionofasingleperfectmatch.BLATimplementsaveryquickalgorithmforfindingmultiplenearbyperfectmatches,whichallowsthesearchstagetobespecificenoughthatthegenomeitselfcanbekeptondiskandonlytheindexkeptinRAMinmemoryintheclient/servermode.

TheBLATsoftwareinsourceandexecutableformisavail-ablewithoutchargefornonprofit,academic,andpersonaluses.Nonexclusivecommerciallicensesareavailablefromtheauthor.Thesoftwarecanbedownloadedfromthesourceandexecutablelinksathttp://www.soe.ucsc.edu/∼kent.

ACKNOWLEDGMENTS

MywarmestthankstotheGeneCatsgroup,AlanZahlerattheUniversityofCaliforniaSantaCruz,andtoHeidi,Mira,andTisaathomefortheirencouragement,advice,andentertain-mentduringthiswork.ThankstotheInternationalHumanGenomeProject,theMouseSequencingConsortium,theMammalianGeneCollection,andothergenomeandmRNAsequenceprojectsforgeneratingsomuchsequencedata.ThankstoHHMIandNHGRIforfundingtheUCSCcomputerclusterswherethesoftwarecouldbeappliedatafull-genomescale.ThankstoTomPringle,HeidiBrumbaugh,WebbMiller,

664Genomewww.genome.org

Research

AlanZahler,andDavidHausslerforathoroughreadingofthemanuscriptandhelpfulcomments.

Thepublicationcostsofthisarticleweredefrayedinpartbypaymentofpagecharges.Thisarticlemustthereforebeherebymarked“advertisement”inaccordancewith18USCsection1734solelytoindicatethisfact.

REFERENCES

Altschul,S.F.,Gish,W.,Miller,W.,Myers,E.W.,andLipman,D.J.

1990.Basiclocalalignmentsearchtool.J.Mol.Biol.215:403–410.

Altschul,S.F.,Madden,T.L.,Schaffer,A.A.,Zhang,J.,Zhang,Z.,

Miller,W.,andLipman,D.J.1997.GappedBLASTandPSI-BLAST:Anewgenerationofproteindatabasesearchprograms.NucleicAcidsRes.25:3389–3402.

Chao,K.M.,Pearson,W.R.,andMiller,W.1992.Aligningtwo

sequenceswithinaspecifieddiagonalband.Comput.Appl.Biosci.8:481–487.

Dunham,I.,Shimizu,N.,Roe,B.A.,Chissoe,S.,Hunt,A.R.,Collins,

J.E.,Bruskiewich,R.,Beare,D.M.,Clamp,M.,Smink,L.J.,etal.1999.TheDNAsequenceofhumanchromosome22.Nature402:489–495.

Florea,L.,Hartzell,G.,Zhang,Z.,Rubin,G.M.,andMiller,W.1998.

AcomputerprogramforaligningacDNAsequencewithagenomicDNAsequence.GenomeRes.8:967–974.

Gish,W.andStates,D.J.1993.Identificationofproteincoding

regionsbydatabasesimilaritysearch.Nat.Genet.3:266–272.Gotoh,O.1990.Optimalsequencealignmentallowingforlong

gaps.Bull.Math.Biol.52:359–373.

Gotoh,O.2000.Homology-basedgenestructureprediction:

Simplifiedmatchingalgorithmusingatranslatedcodon(tron)andimprovedaccuracybyallowingforlonggaps.Bioinformatics16:190–202.

TheInternationalHumanGenomeSequencingConsortium.2001.

Initialsequencingandanalysisofthehumangenome.Nature409:860–921.

Karplus,K.,Barrett,C.,andHughey,R.1998.HiddenMarkov

modelsfordetectingremoteproteinhomologies.Bioinformatics14:846–856.

Kent,W.J.andZahler,A.M.2000.TheIntronerator:Exploring

intronsandalternativesplicinginC.elegans.NucleicAcidsRes.28:91–93.

Makalowski,W.andBoguski,M.S.1998.Evolutionaryparametersof

thetranscribedmammaliangenome:Ananalysisof2,820

orthologousrodentandhumansequences.Proc.Natl.Acad.Sci.95:9407–9412.

Mott,R.1997.EST_GENOME:AprogramtoalignsplicedDNA

sequencestounsplicedgenomicDNA.Comput.Appl.Biosci.13:477–478.

Ning,Z.,Cox,A.J.,andMullikin,J.C.2001.SSAHA:Afastsearch

methodforlargeDNAdatabases.GenomeRes.11:1725–1729.Pearson,W.R.andLipman,D.J.1988.Improvedtoolsforbiological

sequencecomparison.Proc.Natl.Acad.Sci.85:2444–2448.RoestCrollius,H.,Jaillon,O.,Bernot,A.,Dasilva,C.,Bouneau,L.,

Fischer,C.,Fizames,C.,Wincker,P.,Brottier,P.,Quetier,F.,etal.2000.Estimateofhumangenenumberprovidedbygenome-wideanalysisusingTetraodonnigroviridisDNAsequence.Nat.Genet.25:235–238.

Smith,T.F.andWaterman,M.S.1981.Identificationofcommon

molecularsubsequences.J.Mol.Biol.147:195–197.

States,D.J.andGish,W.1994.Combineduseofsequencesimilarity

andcodonbiasforcodingregionidentification.J.Comput.Biol.1:39–50.

Wiehe,T.,Gebauer-Jung,S.,Mitchell-Olds,T.,andGuigo,R.2001.

SGP-1:Predictionandvalidationofhomologousgenesbasedonsequencealignments.GenomeRes.11:1574–1583.

Zhang,Z.,Schwartz,S.,Wagner,L.,andMiller,W.2000.Agreedy

algorithmforaligningDNAsequences.J.Comput.Biol.7:203–214.

ReceivedDecember19,2001;acceptedinrevisedformJanuary25,2002.

因篇幅问题不能全部显示,请点此查看更多更全内容

Top