The NASA Astrophysics Data System: Data Holdings
C. Grant, A. Accomazzi, G. Eichhorn, M. J. Kurtz, and S. S. Murray
Harvard-Smithsonian Center for Astrophysics, Cambridge, MA 02138
Received / Accepted
Abstract. Since its inception in 1993, the ADS Abstract Service has become an indispensable research tool for astronomers and astrophysicists worldwide. In those seven years, much effort has been directed toward improving both the quantity and the quality of references in the database. From the original database of approximately 160,000 astronomy abstracts, our dataset has grown almost tenfold to approximately 1.5 million references covering astronomy, astrophysics, planetary sciences, physics, optics, and engineering. We collect and standardize data from approximately 200 journals and present the resulting information in a uniform, coherent manner. With the cooperation of journal publishers worldwide, we have been able to place scans of full journal articles on-line back to the first volumes of many astronomical journals, and we are able to link to current versions of articles, abstracts, and datasets for essentially all of the current astronomy literature. The trend toward electronic publishing in the field, the use of electronic submission of abstracts for journal articles and conference proceedings, and the increasingly prominent use of the World Wide Web to disseminate information have enabled the ADS to build a database unparalleled in other disciplines. The ADS can be accessed at http://adswww.harvard.edu
Key words: methods: data analysis – astronomical bibliography – astronomical sociology
Send offprint requests to: C. Grant
Correspondence to: C. Grant
due to a combination of factors: growth in professional society membership, an increase in papers by multiple authors, the launching of new spacecraft, and increased competition for jobs and PIs in the field (since candidate evaluation is partially based on publication history). As the number of papers in the field grows, so does the need for tools which astronomers can use to locate that fraction of papers which pertain to their specific interests.
The ADS Abstract Service is one of several bibliographic services which provide this function for astronomy, but due to the broad scope of our coverage and the simplicity of access to our data, astronomers now rely extensively on the ADS. Other bibliographic services not only link to us, but some have built their bibliographic search capabilities on top of the ADS system. The International Society for Optical Engineering (SPIE) and the NASA Technical Report Service (NTRS) are two such services.
The evolution of the Astrophysics Data System (ADS) has been largely data-driven. Our search tools and indexing routines have been modified to maximize speed and efficiency based on the content of our dataset. As new types of data (such as electronic versions of articles) became available, the Abstract Service quickly incorporated them. The organization and standardization of the database content is the very core upon which the Abstract Service has been built.
This paper contains a description of the ADS Abstract Service from a “data” point of view, specifically descriptions of our holdings and of the processes by which we ingest new data into the system. Details are provided on the organization of the databases (section 2), the description of the data in the databases (section 3), the creation of bibliographic records (section 4), the procedures for updating the database (section 5), and on the scanned articles in the Astronomy database (section 6). We discuss the interaction between the ADS and the journal publishers (section 7) and analyze some of the numbers corresponding to the datasets (section 8). In conjunction with three other ADS papers in this volume, this paper is intended to
2C.Grantetal.:TheNASAAstrophysicsDataSystem:DataHoldings
offer details on the entire Abstract Service with the hopes that astronomers will have a better understanding of the reference data upon which they rely for their research. In addition, we hope that researchers in other disciplines may be able to benefit from some of the details described herein.
As is often the case with descriptions of active Internet resources, what follows is a description of the present situation with the ADS Abstract Service. New features are always being added, some of which necessitate changes in our current procedures. Furthermore, with the growth of electronic publishing, some of our core ideas about bibliographic tools and requirements must be reconsidered in order to be able to take full advantage of new publishing technologies for a new millennium.
2. The Databases
The ADS Abstract Service was originally conceived in the mid-1980s as a way to provide on-line access to bibliographies of astronomers which were previously available only through expensive librarian search services or through the A&A Abstracts series (Schmadel 1979, Schmadel 1982, Schmadel 19), published by the Astronomisches Rechen-Institut in Heidelberg. While the ideas behind the Abstract Service search engine were being developed (see Kurtz et al. 2000, hereafter OVERVIEW), concurrent efforts were underway to acquire a reliable data source on which to build the server. In order to best develop the logistics of the search engine it was necessary to have access to real literature data from the past and present, and to set up a mechanism for acquiring data in the future.
An electronic publishing meeting in the spring of 1991 brought together a number of organizations whose ultimate cooperation would be necessary to make the system a reality (see OVERVIEW for details). NASA’s Scientific and Technical Information Program (STI) offered to provide abstracts to the ADS. STI’s abstracts were a rewritten version of the original abstracts, categorized and keyworded by professional editors. They not only abstracted the astronomical literature, but many other scientific disciplines as well. With STI agreeable to providing the past and present literature, and the journals committed to providing the future literature, the data behind the system fell into place. The termination of the journal abstracting by the STI project several years later was unfortunate, but did not cause the collapse of the ADS Abstract Service because of the commitment of the journal publishers to distribute their information freely.
The STI abstracting covered approximately the period from 1975 to 1995. With the STI data alone, we estimated the completeness of the Astronomy database to be better than 90% for the core astronomical journals. Fortunately, with the additional data supplied by the journals, by SIMBAD (Set of Identifications, Measurements, and Bibliographies for Astronomical Data, Egret & Wenger 1988) at the CDS (Centre de Données Astronomiques de Strasbourg), and by performing Optical Character Recognition (OCR) on the scanned tables of contents (see section 6 below), we are now closer to 99% complete for that period. In the period since then we are 100% complete for those journals which provide us with data, and significantly less complete for those which do not (e.g. many observatory publications and non-U.S. journals). The data prior to 1975 are also significantly incomplete, although we are currently working to improve the completeness of the early data, primarily through scanning the tables of contents for journal volumes as they are placed on-line. We are 100% complete for any journal volume which we have scanned and put on-line, since we verify that we have all bibliographic entries during the procedure of putting scans on-line.
Since the STI data were divided into categories, it was easy to create additional databases with non-astronomical data which were still of interest to astronomers. The creation of an Instrumentation database has enabled us to provide a database for literature related to astronomical instrumentation, of particular interest to those scientists building astronomical telescopes and satellite instruments. We were fortunate to get the cooperation of the SPIE very quickly after releasing the Instrumentation database. SPIE has become our major source of abstracts for the Instrumentation database now that STI no longer supplies us with data.
Our Physics and Geophysics database, the third database to go on-line, is intended for scientists working in physics-related fields. We add authors and titles from all of the physics journals of the American Institute of Physics (AIP), the Institute of Physics (IOP), and the American Physical Society (APS), as well as many physics journals from publishers such as Elsevier and Academic Press (AP).
The fourth database in the system, the Preprint database, contains a subset of the Los Alamos National Laboratory’s (LANL) Preprint Archive (Los Alamos National Laboratory 1991). Our database includes the LANL astro-ph preprints which are retrieved from LANL and indexed nightly through an automated procedure. That dataset includes preprints from astronomical journals submitted directly by authors.
3. Description of the Data
The original set of data from STI contained several basic fields of data (author, title, keywords, and abstracts) to be indexed and made available for searching. All records were keyed on STI’s accession number, a nine-character code consisting of a letter prefix (A or N) followed by a two-digit publication year, followed by a hyphen and a five-digit identifier (e.g. A95-12345). Data were stored in files named by accession number.
With the inclusion of data from other sources, primarily the journal publishers and SIMBAD, we extended STI’s concept of the accession number to handle other abstracts as well. Since the ADS may receive the same abstract from multiple sources, we originally adopted a system of using a different prefix letter with the remainder of the accession number being the same to describe abstracts received from different sources. Thus, the same abstract for the above accession number from STI would be listed as J95-12345 from the journal publisher and S95-12345 from SIMBAD. This allowed the indexing routines to consider only one instance of the record when indexing. Recently, limitations in the format of accession numbers and the desire to index data from multiple sources (rather than just STI’s version) have prompted us to move to a data storage system based entirely on the bibliographic code.
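
The prefix scheme described above can be illustrated with a short Python sketch. It is not taken from the ADS software: the function names are hypothetical, and the pattern (one capital letter, two digits, a hyphen, five digits) is inferred only from the example A95-12345.

```python
import re

# Pattern inferred from the single example "A95-12345" (an assumption).
ACCESSION = re.compile(r"^([A-Z])(\d{2})-(\d{5})$")


def parse_accession(number):
    """Split an accession number into (prefix, two-digit year, identifier)."""
    match = ACCESSION.match(number)
    if match is None:
        raise ValueError(f"not an accession number: {number!r}")
    return match.groups()


def with_prefix(number, prefix):
    """Return the same accession number under another source prefix."""
    _, year, ident = parse_accession(number)
    return f"{prefix}{year}-{ident}"


print(with_prefix("A95-12345", "J"))   # J95-12345 (journal publisher)
print(with_prefix("A95-12345", "S"))   # S95-12345 (SIMBAD)
```

Because only the prefix differs, all versions of one abstract share the same year and identifier, which is what lets an indexer treat them as a single record.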
3.1. Bibliographic Codes
The concept of a unique bibliographic code used to identify an article was originally conceived by SIMBAD and NED (NASA’s Extragalactic Database, Helou & Madore 1988). The original specification is detailed in Schmitz et al. 1995. In the years since, the ADS has adopted and expanded their definition to be able to describe references outside of the scope of those projects.
The bibliographic code is a 19-character string composed of several fields which usually enables a user to identify the full reference from that string. It is defined as follows:
YYYYJJJJJVVVVMPPPPA
where the fields are defined in Table 1.
The journal field is left-justified and the volume and page fields are right-justified. Blank spaces and leading zeroes are replaced by periods. For articles with page numbers greater than 9999, the M field contains the first digit of the page number. The A field contains a colon (“:”) if there is no author listed.
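
The justification and padding rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the ADS implementation: the function name and its parameters are hypothetical, and cases the text does not cover (for instance a page number above 9999 combined with a letter qualifier) are not handled.

```python
def make_bibcode(year, journal, volume, page, author_initial="", qualifier="."):
    """Assemble a 19-character code laid out as YYYYJJJJJVVVVMPPPPA."""
    page = str(page)
    if len(page) > 4:
        # Page numbers above 9999: the first digit moves into the M column.
        qualifier, page = page[0], page[1:]
    code = (
        f"{int(year):04d}"
        + str(journal).ljust(5, ".")    # journal abbreviation, left-justified
        + str(volume).rjust(4, ".")     # volume, right-justified
        + qualifier                     # M: qualifier for publication
        + page.rjust(4, ".")            # page, right-justified
        + (author_initial or ":")       # ":" when no author is listed
    )
    if len(code) != 19:
        raise ValueError(f"fields do not fit the 19-character layout: {code!r}")
    return code


print(make_bibcode(1999, "PASP", 111, 438, "F"))   # 1999PASP..111..438F
print(make_bibcode(1996, "A&AS", 115, 1, "S"))     # 1996A&AS..115....1S
```

Both printed codes reproduce examples quoted in this paper, which is a useful check that the padding follows the stated rules.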
Creating bibliographic codes for the astronomical journals is uncontroversial. Each journal typically has a commonly-used abbreviation, and the volume and page are easily assigned (e.g. 1999PASP..111..438F). Each volume tends to have individual page numbering, and in those cases where more than one article appears on a page (such as errata), a “Q”, “R”, “S”, etc. is used as the qualifier for publication to make bibliographic codes unique. When page numbering is not continuous across issue numbers (such as Sky & Telescope), the issue number is represented by a lower case letter as the qualifier for publication (e.g. “a” for issue 1). This is because there may be multiple articles in a volume starting on the same page number.
Creating bibliographic codes for the “grey” literature such as conference proceedings and technical reports is a more difficult task. The expansion into these additional types of data included in the ADS required us to modify the original prototype bibliographic code definition in order to present identifiers which are easily recognizable to the user. The prototype definition of the bibliographic code suggested using a single letter in the second place of the volume field to identify non-standard references (catalogs, PhD theses, reports, preprints, etc.) and using the third and fourth place of that field to unduplicate and report volume numbers (e.g. 1981CRJS..R.3...14W). Since we felt that this created codes unidentifiable to the typical user and since NED and SIMBAD did not feel that users needed to be able to identify books directly from their bibliographic codes, the ADS adopted different rules for creating codes to identify the grey literature.
It is straightforward to create bibliographic codes for conference proceedings which are part of a series. For example, the IAU Symposia Series (IAUS) contains volume numbers and therefore fits the journal model for bibliographic codes. Other conference proceedings, books, colloquia, and reports in the ADS typically contain a four letter word in the volume field such as “conf”, “proc”, “book”, “coll”, or “rept”. When this is the case with a bibliographic code, the journal field typically consists of the first letter from important words in the title. This can give the user the ability to identify a conference proceeding at a glance (e.g. “ioda.book” for “Information and On-Line Data in Astronomy”). We will often leave the fifth place of the journal field as a dot for “readability” (e.g. 1995ioda.book..175M). For most proceedings which are also published as part of a series (e.g. ASP Conference Series, IAU Colloquia, AIP Conference Series), we include in the system two bibliographic codes, one as described above and one which contains the series name and the volume (see section 5.1). We do this so that users can see, for example, that a paper published in one of the “Astronomical Data Analysis Software and Systems” series is clearly labelled as “adass” whereas a typical user might not remember which volume of ASPC contained those ADASS papers. This makes the bibliographic codes more readable to the user.
With the STI data, the details were often unclear as to whether an article was from a conference proceeding, a meeting, a colloquium, etc. We assigned those codes as best we could, making no significant distinction between them. For conference abstracts submitted by the editors of a proceedings prior to publication, we often do not have page numbers. In this case, we use a counter in lieu of a page number and use an “E” (for “Electronic”) in the fourteenth column, the qualifier for publication. If these conference abstracts are then published, their bibliographic codes are replaced by a bibliographic code complete with page number. If the conference abstracts are published only on-line, they retain their electronic bibliographic code with its E and counter number.
Table 1. Bibliographic Code Definition (e.g. 1996A&AS..115....1S)
Field   Definition                                    Example
YYYY    Publication Year                              1997
JJJJJ   Journal Abbreviation                          ApJ, A&A, MNRAS, etc.
VVVV    Volume Number                                 480
M       Qualifier for Publication                     L (for Letter), P (for Pink Page);
                                                      Q, R, S, etc. for unduplicating;
                                                      a, b, c, etc. for issue number
PPPP    Page Number                                   129
A       First Letter of the First Author’s Surname    N
all keywords in the system and equates those from different keyword systems which are similar (Lee et al. 1999).
The current non-searchable fields in the ADS databases include the journal field, author affiliation, category, abstract copyright, and abstract origin. Although we may decide to create an index and search interface for some of these entities (such as category), others will continue to remain unsearchable since searching them is not useful to the typical user. In particular, author affiliations would be useful to search; however, this information is inconsistently formatted, so it is virtually impossible to collect all variations of a given institution for indexing coherently. Furthermore, we have the author affiliations for only about half of the entries in the Astronomy database, so we have decided to keep this field non-searchable. For researchers wishing to analyze affiliations on a large scale, we can provide this information on a collaborative basis.
3.3. Data Sources
The ADS currently receives abstracts or table of contents (ToC) references from almost two hundred journal sources. Tables 2, 3, and 4 list these journals, along with their bibliographic code abbreviation, source, frequency with which we receive the data, what data are received, and any links we can create to the data. ToC references typically contain only author and title, although sometimes keywords are included as well. The data are contributed via email, ftp, or retrieved from web sites around the world at a frequency ranging from once a week to approximately once a year. The term “often” used in the frequency column implies that we get them more frequently than once a month, but not necessarily on a regular basis. The term “occasionally” is used for those journals which submit data to us infrequently.
Updates to the Astronomy and Instrumentation databases occur approximately every two weeks, or more often if logistically possible, in order to keep the database current. Recent enhancements to the indexing software have enabled us to perform instantaneous updates, triggered by an email containing new data (see ARCHITECTURE). Updates to the Physics database occur approximately once every two months. As stated earlier, the Preprint database is updated nightly.
3.4. Data Formats
The ADS is able to benefit from certain standards which are adhered to in the writing and submission practices of astronomical literature. The journals share common abbreviations and text formatting routines which are used by the astronomers as well. The use of TeX (Knuth 1984) and LaTeX (Lamport 1986), and their extension to BibTeX (Lamport 1986) and AASTeX (American Astronomical Society 1999) results in common formats among some of our data sources. This enables the reuse of parsing routines to convert these formats to our standard format. Other variations of TeX used by journal publishers also allow us to use common parsing routines, which greatly facilitates data loading.
TeX is a public domain typesetting program designed especially for math and science. It is a markup system, which means that formatting commands are interspersed with the text in the TeX input file. In addition to commands for formatting ordinary text, TeX includes many special symbols and commands with which you can format mathematical formulae with both ease and precision. Because of its extraordinary capabilities, TeX has become the leading typesetting system for science, mathematics, and engineering. It was developed by Donald Knuth at Stanford University.
LaTeX is a simplified document preparation system built on TeX. Because LaTeX is available for just about any type of computer and because LaTeX files are ASCII, scientists are able to send their papers electronically to colleagues around the world in the form of LaTeX input. This is also true for other variants of TeX, although the astronomical publishing community has largely centered their publishing standards on LaTeX or one of the software packages based on LaTeX, such as BibTeX or AASTeX. BibTeX is a program and file format designed by Oren Patashnik and Leslie Lamport in 1985 for the LaTeX document preparation system, and AASTeX is a LaTeX-based package that can be used to mark up manuscripts specifically for American Astronomical Society (AAS) journals.
Similar to the widespread acceptance of TeX and its variants, the extensive use of SGML (Standard Generalized Markup Language, Goldfarb & Rubinsky 1991) by the members of the publishing community has given us the ability to standardize many of our parsing routines. All data gleaned off the World Wide Web share features due to the use of HTML (HyperText Markup Language, Powell & Whitworth 1998), an example of SGML. Furthermore, the trend towards using XML (Extensible Markup Language, Harold 1999) to describe text documents will enable us to share standard document attributes with other members of the astronomical community. XML is a subset of SGML which is intended to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.
The ADS parsing routines benefit from these standards in several ways: we can reuse routines designed around these systems; we are able to preserve original text representations of entities such as embedded accents so these entities are displayed correctly in the user’s browser; and we are able to capture value-added features such as electronic URLs and email addresses for use elsewhere in our system.
Table 2. The ADS Astronomy Database
See accompanying text file ADS
a Letter codes describing what data are available
b Astronomische Gesellschaft
c University of Chicago Press
d American Institute of Physics
e Overseas Publishers Association
f American Geophysical Union
g Central Bureau for Astronomical Telegrams
h Academic Press
i Universidad Nacional Autónoma de México
j Astronomisches Rechen-Institut
Table 3. The ADS Instrumentation Database
See accompanying text file ADS
a Letter codes describing what data are available
b Optical Society of America
c The International Society for Optical Engineering (SPIE)
d Institute of Physics
Table 4. The ADS Physics Database
See accompanying text file ADS
a Letter codes describing what data are available
In order to facilitate data exchange between different parts of the ADS, we make use of a tagged format similar to the “Refer” format (Jacobsen 1996). Refer is a preprocessor for the word processors nroff and troff which finds and formats references. While our tagged formats share some common fields (%A, %T, %J, %D), the Refer format is not specific enough to be used for our purposes. Items such as objects, URLs and copyright notices are beyond the scope of the Refer syntax. Details on our tagged format are provided in Table 5. Reading and writing routines for this format are shared by loading and indexing routines, and a number of our data sources submit abstracts to us in this format.
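
A reader for a Refer-like tagged format can be sketched as follows. The exact file layout is not specified in this paper, so the sketch makes several assumptions that should be read as such: each field starts on a line beginning with a two-character tag such as %T, lines without a tag continue the previous field, blank lines separate records, and the four fields marked as required in Table 5 are taken to be %R, %T, %A, and %D. The function names are hypothetical, and the sample record content is invented.

```python
REQUIRED_TAGS = ("%R", "%T", "%A", "%D")   # assumed reading of Table 5


def parse_tagged(text):
    """Parse tagged text into a list of {tag: value} dictionaries."""
    records, current, last_tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the record in progress, if any.
            if current:
                records.append(current)
            current, last_tag = {}, None
        elif line.startswith("%") and len(line) >= 2:
            last_tag = line[:2]
            current[last_tag] = line[2:].strip()
        elif last_tag is not None:
            # Continuation line: append to the field being read.
            current[last_tag] += " " + line.strip()
    if current:
        records.append(current)
    return records


def missing_required(record):
    """List the required tags that are absent or empty in a record."""
    return [tag for tag in REQUIRED_TAGS if not record.get(tag)]


sample = """%R 1999PASP..111..438F
%T An invented title that runs
   over two lines
%A Author, A.; Other, B.
"""
record = parse_tagged(sample)[0]
print(record["%T"])
print(missing_required(record))
```

Keeping the reader independent of any one data source is what allows the same routine to be shared by loading and indexing code, as described above.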
4. Creating the Bibliographic Records
One of the basic principles in the parsing and formatting of the bibliographic data incorporated into the ADS database over the years has been to preserve as much of the original information as possible and delay any syntactic or semantic interpretation of the data until a later stage. From the implementation point of view, this means that bibliographic records provided to the ADS by publishers or other data sources typically are saved as files which are tagged with their origin, entry date, and any other ancillary information relevant to their contents (e.g. if the fields in the record contain data which was transliterated or converted to ASCII).
For instance, the records provided to the ADS by the University of Chicago Press (the publisher of several major U.S. astronomical journals) are SGML documents which contain a unique manuscript identifier assigned to the paper during the electronic publishing process. This identifier is saved in the file created by the ADS system for this bibliographic entry.
Because data about a particular bibliographic entry may be provided to the ADS by different sources and at different times, we adopted a multi-step procedure in the creation and management of bibliographic records:
1) Tokenization: Parsing input data into a memory-resident data structure using procedures which are format- and source-specific.
Table 5. Tagged Format Definitions
Tag   Definition                                  Required
%R    Bibliographic Code                          required
%T    Title                                       required
%A    Author List                                 required
%D    Publication Date                            required
%B    Abstract Text
%C    Abstract Copyright
%E    URL for Electronic Data Table
%F    Author Affiliation
%G    Origin
%H    Email
%J    Journal Name, Volume, and Page Range
%K    Keywords
%L    Last Page of Article
%O    Object Name
%Q    Category
%U    URL for Electronic Document
%V    Language
%W    Database (AST, PHY, INST)
%X    Comment
%Y    Identifiers
%Z    References
The assessment of what constitutes a multiple word last name as opposed to a middle name is non-trivial since some names, such as Davis, can be a first name (Davis Hartman), a middle name (A. G. Davis Philip), a last name (Robert Davis), or some combination (Davis S. Davis). Another example is how to determine when the name “Van” is a first name (Van Nguyen), a middle name (W. Van Dyke Dixon), or part of a last name (J. van Allen). Handling all of these cases correctly requires not only familiarity with naming conventions worldwide, but an intimate familiarity with the names of astronomers who publish in the field. We are continually amassing the latter as we incorporate increasing amounts of data into the system, and as we get feedback from our users.
– Spell checking: Since many of the historical records entered in the ADS have been generated by typesetting tables of contents, typographical errors can often be flagged in an automated way using spell-checking software. We have developed a PERL software driver for the international ispell program, a UNIX utility, which can be used as a spell-checking filter on all input to be considered textual information. A custom dictionary containing terms specific to astronomy and space sciences is used to increase the recognition capabilities of the software module. Any corrections suggested by the spell-checker module are reviewed by a human before the data are actually updated.
– Language recognition: Extending the capability of the spell-checker, we have implemented a software module which attempts to guess the language of an input text buffer based on the percentage of words that it can recognize in one of several languages: English, German, French, Spanish, or Italian. This module is used to flag records to be entered in our database in a language other than English. Knowledge of the language of an abstract allows us to create accurate synonyms for those words (see ARCHITECTURE).
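
The language recognition idea in the last item, scoring a text by the fraction of its words found in each language's word list, can be sketched as below. This is a simplified stand-in and not the ADS module, which is built on the ispell driver: the function name is hypothetical and the tiny word lists are invented for illustration; a real system would load full dictionaries.

```python
import re


def guess_language(text, wordlists):
    """Return (language, fraction of recognized words) for the best match."""
    words = re.findall(r"[^\W\d_]+", text.lower())   # runs of letters only
    if not words:
        return None, 0.0
    best, best_fraction = None, 0.0
    for language, vocabulary in wordlists.items():
        fraction = sum(word in vocabulary for word in words) / len(words)
        if fraction > best_fraction:
            best, best_fraction = language, fraction
    return best, best_fraction


# Toy vocabularies, for illustration only.
WORDLISTS = {
    "English": {"the", "of", "and", "in", "stars", "formation"},
    "German": {"der", "die", "und", "von", "sterne", "entstehung"},
}

language, fraction = guess_language("Die Entstehung der Sterne", WORDLISTS)
print(language, fraction)
```

A record whose best match is not English could then be flagged for separate handling, in the spirit of the procedure described above.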
4.2. Identification
We call identification the activity of mapping the tokens extracted from the parsing of a bibliographic record into a unique identifier. The ADS adopted the use of bibliographic codes as the identifier for bibliographic entries shortly after its inception, in order to facilitate communication between the ADS and SIMBAD. The advantage of using bibliographic codes as unique identifiers is that they can most often be created in a straightforward way from the information given in the list of references published in the astronomical literature, namely the publication year, journal name, volume, and page numbers, and first author’s name (see section 3.1 for details).
4.3. Instantiation
“Instantiation” of a bibliographic entry consists of the creation of a record for it in the ADS database. The ADS must handle receipt of the same data from multiple sources. We have created a hierarchy of data sources so that we always know the preferred data source. A reference for which we have received records from STI, the journal publisher, SIMBAD, and NED, for example, must be in the system only once with the best information from each source preserved. When we load a reference into the system, we check whether a text file already exists for that reference. If there is no text file, it is a new reference and a text file is created. If there already is a text file, we append the new information to the current text file, creating a “merged” text file. This merged text file lists every instance of every field that we have received.
4.4. Extraction
By “extraction” of a bibliographic entry we mean the procedure used to create a unique representation of the bibliography from the available records. This is essentially an activity of data fusion and unification, which removes redundancies in the bibliographic records obtained by the ADS and properly labels fields by their characteristics. The extraction algorithm has been designed with our prior experience as to the quality of the data to select the best fields from each data source, to cross-correlate the fields as necessary, and to create a “canonical” text file which contains a unique instance of each field. Since the latter is created through software, only one version of the text file must be maintained; when the merged text file is appended, the canonical text file is automatically recreated.
The extraction routine selects the best pieces of information from each source and combines them into one reference which is more complete than the individual references. For example, author lists received from STI were often truncated after five or ten authors. Whenever we have a longer author list from another source, that author list is used instead. This not only recaptures missing authors, it also provides full author names instead of author initials whenever possible. In addition, our journal sources sometimes omit the last page number of the reference, but SIMBAD usually includes it, so we are able to preserve this information in our canonical text file.
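
The relationship between the merged record and the canonical record can be illustrated with a minimal Python sketch. It is not the ADS extraction algorithm: the source ranking in PRIORITY is an assumed ordering (the paper states that a hierarchy exists but does not list it here), the field names are hypothetical, and only two of the rules mentioned above are modelled (first preferred source wins; the longest author list wins).

```python
PRIORITY = ["JOURNAL", "SIMBAD", "STI"]   # assumed ranking, most preferred first


def extract(merged, priority=PRIORITY):
    """Build one canonical record from a list of (source, fields) pairs."""
    rank = {source: position for position, source in enumerate(priority)}
    ordered = sorted(merged, key=lambda item: rank.get(item[0], len(rank)))
    canonical = {}
    for _source, fields in ordered:
        for name, value in fields.items():
            if value in (None, "", []):
                continue                       # nothing usable in this field
            canonical.setdefault(name, value)  # most preferred source wins
    # Special rule: a longer author list replaces a truncated one.
    author_lists = [f["authors"] for _s, f in ordered if f.get("authors")]
    if author_lists:
        canonical["authors"] = max(author_lists, key=len)
    return canonical


# The merged record keeps every instance of every field received.
merged = [
    ("STI", {"title": "Title as rewritten", "authors": ["A, B.", "C, D."]}),
    ("JOURNAL", {"title": "Title as published",
                 "authors": ["A, Bea", "C, Dan", "E, Fay"]}),
    ("SIMBAD", {"last_page": "80"}),
]
print(extract(merged))
```

Because the canonical record is derived entirely from the merged one, it can be regenerated whenever a new source record is appended, which mirrors the design choice described in the text.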
Some fields need to be labelled by their characteristics so that they are properly indexed and displayed. The keywords, for example, need to be attributed to a specific keyword system. The system designation allows for multiple keyword sets to be displayed (e.g. NASA/STI Keywords and AAS Keywords) and will be used in the keyword synonym table currently under development (Lee et al. 1999).
We also attempt to cross-correlate authors with their affiliations wherever possible. This is necessary for records where the preferred author field is from one source and the affiliations are from another source. We attempt to assign the proper affiliation based on the last name and do not assume that the author order is accurate since we are aware of ordering discrepancies in some of the STI records.
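
Matching affiliations by last name rather than by position can be sketched as follows. The sketch assumes names are written as “Surname, Given”, which is a simplification (the difficulty of identifying surnames in general is discussed earlier in this section); the function names and the sample data are invented for illustration.

```python
def surname(name):
    """Normalized last name, assuming the form 'Surname, Given'."""
    return name.split(",")[0].strip().lower()


def attach_affiliations(preferred_authors, other_source):
    """Pair each preferred author with an affiliation from another source.

    other_source is a list of (author name, affiliation) pairs whose
    order is not trusted; matching is done on the last name only.
    """
    by_surname = {}
    for name, affiliation in other_source:
        by_surname.setdefault(surname(name), []).append(affiliation)
    result = []
    for name in preferred_authors:
        candidates = by_surname.get(surname(name), [])
        # Accept only an unambiguous match; otherwise leave it empty.
        result.append((name, candidates[0] if len(candidates) == 1 else None))
    return result


authors = ["Alpha, Ann", "Beta, Bob"]
other = [("Beta, B.", "Institute B"), ("Alpha, A.", "Institute A")]
print(attach_affiliations(authors, other))
```

Leaving ambiguous matches empty (for example two authors sharing a surname) is a conservative choice made for this sketch, not a rule stated in the paper.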
Through these four steps in the procedure of creating and managing bibliographic records, we are able to take advantage of receiving the same reference from multiple sources. We standardize the various records and present to the user a combination of the most reliable fields from each data source in one succinct text file.
5. Updating the Database
The software to update bibliographic records in the database consists of a series of PERL scripts, typically one per data source, which read in the data, perform any special processing particular to that data source, and write out the data to text files. The loading routines perform three fundamental tasks: 1) they add new bibliographic codes to the current master list of bibliographic codes in the system; 2) they create and organize the text files containing the reference data; and 3) they maintain the lists of bibliographic codes used to indicate what items are available for a given reference.
5.1. The Master List
The master list is a table containing bibliographic codes together with their publication dates (YYYYMM) and entry dates into the system (YYYYMMDD). There is one master list per database with one line per reference. The most important aspect of the master list is that it retains information about “alternative” bibliographic codes and matches them to their corresponding preferred bibliographic code. An alternative bibliographic code is usually a reference which we receive from another source (primarily SIMBAD or NED) which has been assigned a different bibliographic code from the one used by the ADS. Sometimes this is due to the different rules used to build bibliographic codes for non-standard publications (see section 3.1), but often it is just an incorrect year, volume, page, or author initial in one of the databases (SIMBAD or NED or the ADS). In either case, the ADS must keep the alternative bibliographic code in the system so that it can be found when referenced by the other source (e.g. when SIMBAD sends back a list of their codes related to an object). The ADS matches the alternative bibliographic code to our corresponding one and replaces any instances of the alternative code when referenced by the other data source. Alternative bibliographic codes in the master list are prepended with an identification letter (S for SIMBAD, N for NED, J for Journal) so that their origin is retained.
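
The role of the master list as a translation table between alternative and preferred codes can be sketched with a small in-memory structure. The on-disk layout of the real master list is not described here, so the class below, its method names, and the example codes (built around a fictitious journal abbreviation “Jrnl”) are all hypothetical.

```python
ORIGINS = {"S": "SIMBAD", "N": "NED", "J": "Journal"}


class MasterList:
    """Preferred bibliographic codes plus the alternatives that map to them."""

    def __init__(self):
        self.entries = {}        # preferred code -> (YYYYMM, YYYYMMDD)
        self.alternatives = {}   # alternative code -> (origin letter, preferred code)

    def add(self, bibcode, publication_date, entry_date):
        self.entries[bibcode] = (publication_date, entry_date)

    def add_alternative(self, origin, alternative, preferred):
        if origin not in ORIGINS:
            raise ValueError(f"unknown origin letter: {origin!r}")
        if preferred not in self.entries:
            raise KeyError(f"preferred code not in master list: {preferred!r}")
        self.alternatives[alternative] = (origin, preferred)

    def resolve(self, bibcode):
        """Return the preferred code for any known code, else None."""
        if bibcode in self.entries:
            return bibcode
        if bibcode in self.alternatives:
            return self.alternatives[bibcode][1]
        return None


master = MasterList()
master.add("1999Jrnl...12..345Z", "199903", "19990415")
# Same paper, but the other source took a different name as the surname.
master.add_alternative("S", "1999Jrnl...12..345W", "1999Jrnl...12..345Z")
print(master.resolve("1999Jrnl...12..345W"))
```

Resolving every incoming code through one table means that downstream steps only ever need to deal with preferred codes.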
While we make every effort to propagate corrections back to our data sources, sometimes there is simply a valid discrepancy. For example, alternative bibliographic codes are often different from the ADS bibliographic code due to ambiguous differences such as which name is the surname of a Chinese author. Since Americans tend to invert Chinese names one way (Zheng, Wei) and Europeans another (Wei, Zheng), this results in two different, but equally valid codes. Similarly, discrepancies in journal names such as BAAS (for the published abstracts in the Bulletin of the American Astronomical Society) and AAS (for the equivalent abstract with meeting and session number, but no volume or page number) need different codes to refer to the same paper. Russian and Chinese translation journals (Astronomicheskii Zhurnal vs. Soviet Astronomy and Acta Astronomica Sinica vs. Chinese Astronomy and Astrophysics) share the same problem. These papers appear once in the foreign journal and once in the translation journal (usually with different page numbers), but are actually the same paper which should be in the system only once. The ADS must therefore maintain multiple bibliographic codes for the same article since each journal has its own abbreviation, and queries for either one must be able to be recognized. The master list is the source of this correlation and enables the indexing procedures and search engine to recognize alternative bibliographic codes.
5.2. The Text Files
Text files in the ADS are stored in a directory tree by bibliographic code. The top level of directories is divided into directories with four-digit names by publication year (characters 1 through 4 of the bibliographic code). The next level contains directories with five-character names according to journal (characters 5 through 9), and the text files are named by full bibliographic code under these journal directories. Thus, a sample pathname is 1998/MNRAS/1998MNRAS.295...75E. Alternative bibliographic codes do not have a text file named by that code, since the translation to the equivalent preferred bibliographic code is done prior to accessing the text file.
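
The directory layout described above amounts to slicing the bibliographic code, as the following Python sketch shows. The function name is hypothetical. Whether journal directories for abbreviations shorter than five characters keep their padding periods is not stated in the text; the sketch simply uses characters 5 through 9 as written.

```python
from pathlib import PurePosixPath


def text_file_path(bibcode):
    """Relative path year/journal/bibcode for a 19-character code."""
    if len(bibcode) != 19:
        raise ValueError(f"expected a 19-character code, got {bibcode!r}")
    year = bibcode[0:4]       # characters 1 through 4
    journal = bibcode[4:9]    # characters 5 through 9
    return PurePosixPath(year, journal, bibcode)


print(text_file_path("1998MNRAS.295...75E"))   # 1998/MNRAS/1998MNRAS.295...75E
```

Since the path is a pure function of the code, no separate index is needed to locate the text file for a preferred bibliographic code.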
A sample text file is given in the appendices. Appendix B shows the full bibliographic entry, including all records as received from STI, MNRAS, and SIMBAD. It contains XML-tagged fields from each source, showing all instances of every field. Appendix C shows the extracted canonical version of the bibliographic entry which contains only selected information from the merged text file. This latter version is displayed to the user through the user interface (see SEARCH).
5.3. The Codes Files
The third basic function of the loading procedures is to modify and maintain the listings for available items. The ADS displays the availability of resources or information related to bibliographic entries as letter codes in the results list of queries and as more descriptive hyperlinks in the page displaying the full information available for a bibliographic entry. A full listing of the available item codes and their meaning is given in SEARCH.
The loading routines maintain lists of bibliographic codes for each letter code in the system which are converted to URLs by the indexing routines (see ARCHITECTURE). Bibliographic codes are appended to the lists either during the loading process or as post-processing work depending on the availability of the resource. When electronic availability of data coincides with our receipt of the data, the bibliographic codes can be appended to the lists by the loading procedures. When we receive the data prior to electronic availability, post-processing routines must be run to update the bibliographic code lists after we are notified that we may activate the links.
6. The Articles
The ADS is able to scan and provide free access to past issues of the astronomical journals because of the willing collaboration of the journal publishers. The primary reason that the journal publishers have agreed to allow the scanning of their old volumes is that the loss of individual subscriptions does not pose a threat to their livelihood. Unlike journals in many other disciplines, most astronomy journals are able to pay for their publications through the cost of page charges to astronomers who write the articles and through library subscriptions which are unlikely to be cancelled in spite of free access to older volumes through the ADS. The journal publishers continue to charge for access to the current volumes, which is paid for by most institutional libraries. This arrangement places astronomers in a fortunate position for electronic accessibility of astronomy articles.
The original electronic publishing plans for the astronomical community called for STELAR (STudy of Electronic Literature for Astronomical Research, van Steenberg 1992, van Steenberg et al. 1992, Warnock et al. 1992, Warnock et al. 1993) to handle the scanning and dissemination of the full journal articles. However, when the STELAR project was terminated in 1993, the ADS assumed responsibility for providing scanned full journal articles to the astronomical community. The first test journal to be scanned was the ApJ Letters, which was scanned in January, 1995 at 300 dots per inch (dpi). It should be noted that those scans were intended to be 600 dpi and we will soon rescan them at the higher 600 dpi resolution. Complications in the journal publishing format (plates at the end of some volumes and in the middle of others) were noted and detailed instructions provided to the scanning company so that the resulting scans would be named properly by page or plate number.

All of the scans since the original test batch have been scanned at 600 dpi using a high speed scanner and generating a 1 bit/pixel monochrome image for each page. The files created are then automatically processed in order to de-skew and center the text in each page, resize images to a standard U.S. Letter size (8.5 x 11 inches), and add a copyright notice at the bottom of each page. For each original scanned page, two separate image files of different resolutions are generated and stored on disk. The availability of different resolutions allows users the flexibility of downloading either high or medium quality documents, depending on the speed of their internet connection. The image formats and compression used were chosen based on the available compression algorithms and browser capabilities. The high resolution files currently used are 600 dpi, 1 bit/pixel TIFF (Tagged Image File Format) files, compressed using the CCITT Group 4 facsimile encoding algorithm. The medium resolution files are 200 dpi, 1 bit/pixel TIFF files, also with CCITT Group 4 facsimile compression.
Conversion to printing formats (PDF, PCL, and Postscript) is done on demand, as requested by the user. Similarly, conversion from the TIFF files to a low resolution GIF (Graphic Interchange Format) file (75, 100, or 150 dpi, depending on user preferences) for viewing on the computer screen is done on demand, then cached so that the most frequently accessed pages do not need to be created every time. A procedure run nightly deletes the GIF files with the oldest access time stamp so that the total size of the disk cache is kept under a pre-defined limit. The current 10 GBytes of cache size in use at the SAO Article Server causes only files which have not been accessed for about a month to be deleted. Like the full-screen GIF images, the ADS also caches thumbnail images of the article pages, which provide users with the capability of viewing the entire article at a glance.
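The nightly cache clean-up described above can be sketched as follows. This is only an illustrative sketch, not the ADS implementation: the function name evict_oldest, the "*.gif" file pattern, and the directory layout are assumptions made for the example.

```python
import os
from pathlib import Path


def evict_oldest(cache_dir, limit_bytes):
    """Delete cached GIF files, oldest access time first, until the
    total cache size is at or below limit_bytes.

    Hypothetical helper: the name, the '*.gif' pattern and the flat
    size accounting are assumptions, not taken from the ADS code.
    Returns the list of deleted paths in deletion order.
    """
    entries = []
    for path in Path(cache_dir).rglob("*.gif"):
        info = path.stat()
        entries.append((info.st_atime, info.st_size, path))

    total = sum(size for _, size, _ in entries)
    deleted = []
    # Walk the files from least to most recently accessed.
    for _, size, path in sorted(entries, key=lambda entry: entry[0]):
        if total <= limit_bytes:
            break
        os.remove(path)
        total -= size
        deleted.append(path)
    return deleted
```

With a limit of 10 GBytes, a call such as evict_oldest("/path/to/cache", 10 * 1024**3) would remove only the files that have gone longest without being read, which is the behavior the text attributes to the nightly procedure.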
The ADS uses Optical Character Recognition (OCR) software to gain additional data from TIFF files of article scans. The OCR software is not yet adequate for accurate reproduction of the scanned pages. Greek symbols, equations, charts, and tables do not translate accurately enough to remain true to the original printed page. For this reason, we have chosen not to display to the user anything rendered by the OCR software in an unsupervised fashion. However, we are still able to take advantage of the OCR software for several purposes.
First, we are able to identify and extract the abstract paragraph(s) for use when we do not have the abstract from another source. In these cases, the OCR'd text is indexed so that it is searchable and the extracted image of the abstract paragraph is displayed in lieu of an ASCII version of the abstract. Extracting the abstract from the scanned pages is somewhat tedious, as it requires establishing different sets of parameters for each journal, as well as for different fonts used over the years by the same journal. The OCR software can be taught how to determine where the abstract ends, but it does not work for every article due to oddities such as author lists which extend beyond the first page of an article, and articles which are in a different format from others in the same volume (e.g. no keywords or multiple columns). The ADS currently contains approximately 25,000 of these abstract images and more will be added as we continue to scan the historical literature.
We are also currently using the OCR software to render electronic versions of the entire scanned articles for indexing purposes. We will not use this for display to the users, but hope to be able to index it to provide the possibility of full text searching at some future date. We estimate that the indexing of our almost one million scanned pages with our current hardware and software will take approximately two years of dedicated CPU time.
The last benefit that we gain from the OCR software is the conversion of the reference list at the end of articles. We use parsed reference lists from the scanned articles to build citation and reference lists for display through the C and R links of the available items. Since reference lists are typically in one of several standard formats, we parse each reference for author, journal, volume and page number for most journal articles, and conference name, author, and page number for many conference proceedings. This enables us to build bibliographic code lists for references contained in that article (R links) and invert these lists to build bibliographic code lists of articles which cite this paper (C links). We are able to use this process to identify and therefore add commonly-cited articles which are currently missing from the ADS. This is usually data prior to 1975 or astronomy-related articles published in non-astronomy journals.
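The inversion of reference lists (R links) into citation lists (C links) can be illustrated with a short sketch. It assumes the references have already been parsed into bibliographic codes; the function name and the bibliographic codes in the example are hypothetical placeholders, not real ADS records.

```python
from collections import defaultdict


def invert_references(reference_lists):
    """Build citation lists (C links) from reference lists (R links).

    reference_lists maps the bibliographic code of a citing article
    to the bibliographic codes it references. The result maps each
    referenced code to the sorted list of articles that cite it.
    Illustrative sketch only; not the ADS implementation.
    """
    citations = defaultdict(set)
    for citing, referenced in reference_lists.items():
        for cited in referenced:
            citations[cited].add(citing)
    return {cited: sorted(citing) for cited, citing in citations.items()}


# Hypothetical bibliographic codes, for illustration only.
r_links = {
    "1998XXXX..100....1A": ["1975XXXX...10....5B", "1980XXXX...20....7C"],
    "1999XXXX..110....3D": ["1975XXXX...10....5B"],
}
c_links = invert_references(r_links)
```

Any referenced code that appears in the inverted lists but not in the database is a candidate for the commonly-cited missing articles mentioned above.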
The Article Service currently contains 250 GBytes of scans, which consists of 1,128,955 article pages comprising 138,7 articles. These numbers increase on a regular basis, both as we add more articles from the older literature and as we scan new journals.

7. ADS/Journal Interaction
A description of the data in the ADS would be incomplete without a discussion of the interaction between the ADS and the electronic journals. The data available on-line from the journal publishers is an extension of the data in the ADS and vice versa. This interaction is greatly facilitated by the acceptance of the bibliographic code by many journal publishers as a means for accessing their on-line articles.
Access to articles currently on-line at the journal sites through the ADS comprises a significant percent of the on-line journal access (see OVERVIEW). The best model for interaction between the ADS and a journal publisher is the University of Chicago Press (hereafter UCP), publisher of ApJ, ApJL, ApJS, AJ, and PASP. When a new volume appears on-line at UCP, the ADS is notified by email and an SGML header file for each of those articles is simultaneously transferred to our site. The data are parsed and loaded into the system and appropriate links are created. However, prior to this, the UCP has made use of the ADS to build their electronic version through the use of our bibliographic code reference resolver.
Our bibliographic code reference resolver (Accomazzi et al. 1999) was developed to provide the capability to automatically parse, identify, and verify citations appearing in astronomical literature. By verifying the existence of a reference through the ADS, journals and conference proceedings editors are able to publish documents containing hyperlinks pointing to stable, unique URLs. An increasing number of journals are linking to the ADS in their reference sections, providing users with the ability to read referenced articles with the click of a mouse button.
During the copy editing phase, UCP editors query the ADS reference resolver and determine if each reference exactly matches a bibliographic code in the ADS. If there is a match, a link to the ADS is established for this entry in their reference section. If there is not a match, one of several scenarios takes place. First, if it is a valid reference not yet included in the ADS (most often the case for "fringe" articles, those peripherally associated with astronomy), our reference resolver captures the information necessary to add it to our database during the next update. Second, if it is a valid reference unable to be parsed by the resolver (sometimes the case for conference proceedings or PhD theses), no action is taken and no link is listed in the reference section. Third, if there is an error in the reference as determined by the reference resolver, the UCP editors may ask for a correction or clarification from the authors.
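The four outcomes described above can be summarized as a small decision procedure. The sketch below is a simplification under stated assumptions: the names Outcome and triage_reference are hypothetical, and the rule used to tell a probable error from a missing record (an unmatched reference to a journal whose ADS coverage is known to be complete) is one reading of how the resolver's journal-by-journal knowledge might be applied, not a description of the actual resolver.

```python
from enum import Enum


class Outcome(Enum):
    LINK = "link to the ADS record"
    QUEUE_FOR_ADDITION = "valid reference, add at the next update"
    NO_ACTION = "could not be parsed, no link listed"
    QUERY_AUTHOR = "probable error, ask the authors"


def triage_reference(bibcode, journal, known_bibcodes, complete_journals):
    """Classify one reference from a manuscript.

    bibcode is None when the reference could not be parsed.
    complete_journals is the (assumed) set of journals whose coverage
    in the database is considered complete, so that an unmatched
    reference to one of them most likely contains an error.
    """
    if bibcode is None:
        return Outcome.NO_ACTION
    if bibcode in known_bibcodes:
        return Outcome.LINK
    if journal in complete_journals:
        return Outcome.QUERY_AUTHOR
    return Outcome.QUEUE_FOR_ADDITION
```

For example, an unmatched reference to a journal outside the complete set would be queued for addition, while an unmatched reference to a fully covered journal would be sent back to the authors.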
The last option demonstrates the power of the reference resolver, which has been taught on a journal-by-journal basis how complete the coverage of that journal is in the ADS. Before the implementation of the reference resolver, UCP was able to match 72% of references in ApJ articles (E. Owens, private communication). Early results from the use of the reference resolver show that we are now able to match conference proceedings, so this number should become somewhat larger. It is unlikely that we will ever match more than 90% of references in an article due to references such as "private communication", "in press", and preprints, as well as author errors (see section 8). Our own reference resolving of OCR'd reference lists shows that we can match approximately 86
The ADS provides multiple ways for authors and journal publishers to link to the ADS (see SEARCH). We make every effort to facilitate individuals and organizations linking to us. This is easily done for simple searches such as the verification of a bibliographic code or an author search for a single spelling. However, given the complexity of the system, these automated searches can quickly become complicated. Details for conference proceedings editors or journal publishers who are interested in establishing or improving links to the ADS are available upon request. In particular, those who have individual TeX macros incorporated in their references can use our bibliographic code resolver to facilitate linking to the ADS.
8. Discussion and Summary
As of this writing (12/1999), there are 524,304 references in the Astronomy database, 523,498 references in the Instrumentation database, 443,858 references in the Physics database, and 3467 references in the Preprint database, for a total of almost 1.5 million references in the system. Astronomers currently write approximately 18,000 journal articles annually, and possibly that many additional conference proceedings papers per year. More than half of the journal papers appear in peer-reviewed journals. These numbers are more than double what they were in 1975, in spite of an increase in the number of words per page in most of the major journals (Abt 1995), and an increase in number of pages per article (Schulman et al. 1997). At the current rate of publication, astronomers could be writing 25,000 journal papers per year by 2001 and an additional 20,000 conference proceedings papers.

Figure 1 shows the total number of papers for each year in the Astronomy database since 1975, divided into refereed journal papers, non-refereed journal papers, and conferences (including reports and theses). There are three features worth noting. First, the increase in total references in 1980 is due to the inclusion of Helen Knudsen's Monthly Astronomy and Astrophysics Index, a rich source of data for both journals and conference proceedings which began coverage in late 1979 and continued until 1995. Second, the recent increase in conferences included in the Astronomy database (starting around 1996) is due to the inclusion of conference proceedings table of contents provided by collaborating librarians and typed in by our contractors. Last, the decrease in numbers for 1999 is due to coverage for that year not yet being complete in the ADS.
The growth rate of the Instrumentation and Physics databases is difficult to estimate, primarily because we do not have datasets which are as complete as astronomy. In any case, the organization and maintenance of this large volume of data is clearly important to every research astronomer. Fortunately, the ADS was designed to be able to handle this large quantity of data and to be able to grow with new kinds of data. New available item links have been added for new types of data as they became available (e.g. the links to complete book entries at the Library of Congress) and future datasets (e.g. from future space missions) should be able to be added in the same fashion.
As with any dataset of this magnitude, there is some fraction of references in the system which are incorrect. This is unavoidable given the large number of data sources, errors in indices and tables of contents as originally published, and human error. In addition, many authors do not give full attention to verifying all references in a paper, resulting in the introduction of errors in many places. In a systematic study of more than 1000 references contained in a single issue of the Astrophysical Journal, Abt (1992) found that more than 12% of those contained errors. This number should be significantly reduced with the integration of the ADS reference resolver in the electronic publishing process. However, any mistakes in the ADS can and will get propagated, so steps are being taken by us to maximize accuracy of our entries.

Fig. 1. Histogram showing the number of refereed journal papers, non-refereed journal papers, and conferences (including reports and theses) for each year in the Astronomy database since 1975. (Axes: year; number of references.)
Locating and identifying correlations between multiple bibliographic codes which describe the same article is a time-consuming and sometimes subjective task, as many pairs of bibliographic codes need to be verified by manually looking up papers in the library. We use the Abstract Service itself for gross matching of bibliographic codes, submitting a search with author and title, and considering any resulting matches with a score of 1.0 as a potential match. These matches are only potential matches which require verification, since authors can submit the same paper to more than one publication source (e.g. BAAS and a refereed journal), and since errata published with the same title and author list will perfectly match the original paper.
When a volume or year is mismatched, it is usually obvious which of a pair of matched bibliographic codes is correct, but if a page number is off, the decision as to which code is correct cannot always be automated. We also need to consider matches with very high scores less than 1.0, since these are the matches where an author name may be incorrect. The correction of errors of this sort is ongoing work which is carried out as often as time and resources permit.
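The triage of candidate duplicates by match score can be sketched as below. The scores are assumed to come from an author-and-title search of the Abstract Service; the function name and, in particular, the 0.9 cut-off for "very high" scores are assumptions made for illustration, since the text does not give a threshold. Both groups still require manual verification.

```python
def triage_matches(matches, near_threshold=0.9):
    """Split candidate duplicate pairs by match score.

    matches is a list of (bibcode_a, bibcode_b, score) tuples.
    Pairs scoring 1.0 are potential duplicates (or errata, or the
    same paper in two publications). Pairs scoring below 1.0 but at
    or above near_threshold may differ only by an incorrect author
    name. near_threshold is an assumed value, not taken from the ADS.
    """
    exact, near = [], []
    for bibcode_a, bibcode_b, score in matches:
        if score >= 1.0:
            exact.append((bibcode_a, bibcode_b))
        elif score >= near_threshold:
            near.append((bibcode_a, bibcode_b))
    return exact, near
```

Keeping the two groups separate reflects the different kinds of follow-up described above: exact matches usually differ in volume, year, or page, while near matches point to author-name errors.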
The evolution of the Internet and the World Wide Web, along with the explosion of astronomical services on the Web, has enabled the ADS to provide access to our databases in an open and uniform environment. We have been able to hyperlink both to our own resources and to other on-line resources such as the journal bibliographies (Boyce & Biemesderfer 1996). As part of the international collaboration Urania (Universal Research Archive of Networked Information in Astronomy, Boyce 1998), the ADS enables a fully functioning distributed digital library of astronomical information which provides power and utility previously unavailable to the researcher.
Perhaps the largest factor which has contributed to the success of the ADS is the willing cooperation of the AAS, CDS, and all the journal publishers. The ADS has largely become the means for linking together smaller pieces of a bigger picture, making an elaborate digital library for astronomers a reality. We currently collaborate with over fifty groups in creating and maintaining cross-links among data centers. These additional collaborations with individuals and institutions worldwide allow us to provide many value-added features to the system such as object information, author email addresses, mail order forms for articles, citations, article scans, and more. A listing of these collaborations is provided in Table 6. Any omissions from this table are purely unintentional, as the ADS values all of our colleagues and the users benefit not only from the major collaborators but the minor ones as well, as these are often more difficult for users to learn about independently. Most of the abbreviations are listed in Tables 2, 3, and 4.
The successful coordination of data exchanges with each of our collaborators and the efforts which went into establishing them in the first place have been key to the success of the ADS. Establishing links to and from the journal publishers, changing these links due to revisions at publisher websites, and tracking and fixing broken links are all considered routine data maintenance for the system. Since it is necessary for us to maintain connectivity to external sites, routine checks of sample links are performed on a regular basis to verify that the links are still active.

Usage statistics for the Abstract Service (see OVERVIEW) indicate that astronomers and librarians at scientific institutions are eager to take advantage of the information that the ADS provides. The widespread acceptance of the ADS by the astronomical community is changing how astronomers do research, placing extensive bibliographic information at their fingertips. This enables researchers to increase their productivity and to improve the quality of their work.
A number of improvements to the data in the ADS are planned for the near future. As always, we will continue our efforts to increase the completeness of coverage, particularly for the data prior to 1975. We have collected most of the major journals back to the first issue for scanning and adding to the Astronomy database. In addition, we are scanning and OCR'ing tables of contents for conference proceedings to improve our coverage in that area. We are currently OCR'ing full journal articles to provide full text searching and to improve the completeness of our reference and citation coverage. Finally, as the ADS becomes commonplace for all astronomers, valuable feedback from our users about missing papers, errors in the database, and suggested improvements to the system serves to guide the future of the ADS and to ensure that the ADS continues to evolve into a more valuable research tool for the scientific community.

9. Acknowledgments
The other ADS Team members: Markus Demleitner, Elizabeth Bohlen, and Donna Thompson contribute much on a daily basis. Funding for this project has been provided by NASA under NASA Grant NCC5-1.

References
Abt, H.A. 1994, PASP, 106, 1015
Abt, H.A. 1995, ApJ, 455, 407
Accomazzi, A., Eichhorn, G., Grant, C.S., Kurtz, M.J., & Murray, S.S. 2000, (this issue)
Accomazzi, A., Eichhorn, G., Kurtz, M.J., Grant, C.S., & Murray, S.S. 1998, ASP Conf. Ser. 172: Astronomical Data Analysis Software and Systems VIII, 291
American Astronomical Society Manuscript Preparation, 1999, http://www.journals.uchicago.edu/AAS/AASTeX
Boyce, Peter B. 1998, ASP Conf. Ser. 153: Library and Information Services in Astronomy III, 107
Boyce, P.B. & Biemesderfer, C. 1996, ASP Conf. Ser. 101: Astronomical Data Analysis Software and Systems V, 7
Egret, D. & Wenger, M. 1988, ESO Conf. #28: Astronomy from Large Databases, eds F. Murtagh and A. Heck, 323
Eichhorn, G., Kurtz, M.J., Accomazzi, A., Grant, C.S., & Murray, S.S. 2000, (this issue)
Goldfarb, Charles F. & Rubinsky, Yuri, 1991, The SGML Handbook (Clarendon Press)
Harold, Elliotte Rusty, 1998, XML: Extensible Markup Language (IDG Books Worldwide)
Helou, G. & Madore, B. 1988, ESO Conf. #28: Astronomy from Large Databases, eds F. Murtagh and A. Heck, 335
International Organization for Standardization 1987, Information and Processing - 8-bit Single-byte Coded Graphic Character Sets (Geneva)
Jacobsen, Dana, 1996, http://www.ecst.csuchico.edu/~jacobsd/bib/formats/refer.html
Knuth, Donald E. 1984, The TeXbook (Addison-Wesley Publishing Co.)
Kurtz, M.J., Eichhorn, G., Accomazzi, A., Grant, C.S., Murray, S.S., & Watson, J.M. 2000, (this issue)
Lamport, Leslie 1986, LaTeX: A Document Preparation System (Addison-Wesley Publishing Co.)
Lee, J., Dubin, D.S., Kurtz, M.J. 1999, ASP Conf. Ser. 172: Astronomical Data Analysis Software and Systems VIII, 287
Table 6. Collaborators

Collaborator: Nature of Collaboration
A.G. Davis Philip: Scanning of Conference Proceedings
Academic Press (AP): Scanning of Icarus
American Astronomical Society (AAS): Citations, Scanning of AJ, ApJ, ApJL, ApJS, AASPB, BAAS
American Institute of Physics: Scanning of SvAL
Andre Heck: StarHeads (Author Home Pages)
Annual Reviews, Inc.: Scanning of ARA&A
Astronomical Data Center (ADC): D links to data
Astronomical Institute of Czechoslovakia: Scanning of BAICz
Astronomical Institute of the Slovak Academy of Sciences: Scanning of CoSka
Astronomical Society of Australia: Scanning of PASA
Astronomical Society of India: Scanning of BASI
Astronomical Society of Japan: Scanning of PASJ
Astronomical Society of the Pacific (ASP): Scanning of PASP and Conference Proceedings
Astronomische Gesellschaft: Scanning of RvMA
Astronomische Nachrichten: Scanning of AN
Baltic Astronomy: Scanning of BaltA
British Astronomical Association: Scanning of JBAA
Cambridge University Press: M links to order forms, Scanning
Central Bureau for Astronomical Telegrams (CBAT): Object searches
Chris Benn: Astropersons.lis (Author Email)
EDP Sciences: Scanning of A&AS
Elsevier Publishers: E links to articles
General Catalogue of Photometric Data (GCPD): D links to data
Institute for Scientific Information (ISI): Citations
International Society for Optical Engineering (SPIE): M links to order forms
Korean Astronomical Society: Scanning of JKAS
Kluwer Publishers: M links to order forms, Scanning of SoPh
Library of Congress (LOC): Z39.50 interface, L links to data
Los Alamos National Laboratory (LANL): Preprint Archive
Lunar and Planetary Science Institute (LPI): Scanning, Object searches
Meteoritical Society: Scanning of M&PS
NED: N links to objects, Object searches
The Observatory: Scanning
Royal Astronomical Society: Scanning of MNRAS
SIMBAD: S links to objects, D links to data, Object searches
Springer Verlag: Scanning of A&A, ZA
Universidad Nacional Autonoma de Mexico (UNAM): Scanning of RMxAA, RMxAC
University of Chicago Press (UCP): Reference Resolving
Warnock, A., Gass, J., Brotzman, L.E., van Steenberg, M.E., Kovalsky, D., & Giovane, F. 1992, Newsletter of the American Astronomical Society, 62, 10
Warnock, A., van Steenberg, M.E., Brotzman, L.E., Gass, J.E., Kovalsky, D., & Giovane, F. 1993, ASP Conf. Ser. 52: Astronomical Data Analysis Software and Systems II, 137
Appendix A:

Version 1.0 of the XML DTD describing text files in the ADS Abstract Service.

Document Type Definition for the ADS bibliographic records

Syntax policy
=============
- The element names are in uppercase in order to help the reading.
- The attribute names are preferably in lowercase.
- The attribute values are allowed to be of type CDATA to allow more flexibility for additional values; however, attributes typically may only assume one of a well-defined set of values.
- Cross-referencing among elements such as AU, AF, and EM is accomplished through the use of attributes of type IDREFS (for AU) and ID (for AF and EM).
TITLE?, AUTHORS?,
AFFILIATIONS?, EMAILS?, FOOTNOTES?, BIBCODE, MSTRING, MONOGRAPH?, SERIES?, PAGE?, LPAGE?,
COPYRIGHT?, PUBDATE,
CATEGORIES*, COMMENTS*, ANOTE?, BIBTYPE?,
IDENTIFIERS?, ORIGINS,
OBJECTS*, KEYWORDS*, ABSTRACT*)>
origin CDATA #REQUIRED
lang CDATA #IMPLIED>
CREATOR, CDATE, EDATE)>
AF IDREFS #IMPLIED
EM IDREFS #IMPLIED
FN IDREFS #IMPLIED>
-->FNAME?,
addresses with the individual author records.
This is the only exception of attributes in
Typically this is created by parsing the text in the MSTRING element. Example:
SERVOL?,
-->
VOLUME?,MNOTE?,EDITORS?,EDITION?,
Formatting is as for AUTHORS and AU elements
COPYRIGHT (#PCDATA)>
FNAME?, LNAME, SUFF?)>
assigned. STI/RECON has always assigned a category for each entry in their system, but otherwise there is little else in our database. The attributes origin and system offerings. The "system" attribute characterizes
the system used to express the identifier -->
system NMTOKEN #IMPLIED>
Conference Series). Note that there may be several SERIES elements, since some a series. -->
sources (e.g. STI, which keyed abstracts in most cases). Therefore we allow several ABSTRACT elements within each record, each with a separate origin or language.
The attribute type is used to keep track of how the abstract data was generated. For instance, abstract text generated by our OCR software will have:
origin="ADS" type="OCR" lang="en" -->
typeNMTOKEN#IMPLIEDlangCDATA#IMPLIED
>
>>
preformatted text. -->
Examples:
origin NMTOKEN #IMPLIED
system NMTOKEN #REQUIRED>

Appendix B: