asaProgrammingEnvironment
ShulyWintner
DepartmentofComputerScience,UniversityofHaifa,31905Haifa,Israel
shuly@cs.haifa.ac.il
Abstract.Finite-statetechnologyisconsideredthepreferredmodelforrepresentingthephonologyandmorphologyofnaturallanguages.Theattractivenessofthistechnologyfornaturallanguageprocessingstemsfromfoursources:modularityofthedesign,duetotheclosureproper-tiesofregularlanguagesandrelations;thecompactrepresentationthatisachievedthroughminimization;efficiency,whichisaresultoflinearrecognitiontimewithfinite-statedevices;andreversibility,resultingfromthedeclarativenatureofsuchdevices.
However,whenwide-coveragegrammarsareconsidered,finite-statetechnologydoesnotscaleupwell,andthebenefitsofthistechnologycanbeovershadowedbythelimitationsitimposesasaprogrammingen-vironmentforlanguageprocessing.Thispaperfocusesonseveralaspectsoflarge-scalegrammardevelopment.Usingareal-worldbenchmark,wecompareafinite-stateimplementationwithanequivalentJavaprogramwithrespecttoeaseofdevelopment,modularity,maintainabilityofthecodeandspaceandtimeefficiency.Weidentifytwomainproblems,abstractionandincrementaldevelopment,whicharecurrentlynotad-dressedsufficientlywellbyfinite-statetechnology,andwhichwebelieveshouldbethefocusoffutureresearchanddevelopment.
1Introduction
Finite-statetechnology(FST)denotestheuseoffinite-statedevices,suchasau-tomataandtransducers,innaturallanguageprocessing(NLP).Sincetheearlyworkswhichdemonstratedtheapplicabilityofthistechnologytolinguisticrepre-sentation[1,2,3],FSTisconsideredadequatefordescribingthephonologicalandmorphologicalprocessesoftheworld’slanguages[4,5].Evennon-concatenativeprocessessuchascircumfixation,root-and-patternmorphologyorreduplication,wereshowntobeinprincipleimplementableinFST[6,7].
TheutilityofFSTforNLPwasemphasizedbytheimplementationofseveraltoolboxeswhichprovideextendedregularexpressionlanguagesandcompilerswhichconvertexpressionstofinite-stateautomataandtransducers.Thesein-cludeINTEX[8];FSM[9],whichisaunix-basedsetofprogramsformanipu-latingautomataandtransducers;FSAUtilities[10],whichisafreelyavailable,Prologimplementedsystem;andXFST[5],whichisacommercialpackageas-sumedtobethemostsuitableforlinguisticapplicationsbyprovidingthemostexpressivelanguage.
A.Gelbukh(Ed.):CICLing2007,LNCS4394,pp.97–106,2007.cSpringer-VerlagBerlinHeidelberg2007
98S.Wintner
ThebenefitsofFSTforNLPstemfromseveralpropertiesoffinite-statedevices:
Truerepresentation:FollowingthepioneeringworkofJohnson[1],itisnow
clearthatthekindofphonologicalandmorphologicalrulesthatarecommoninlinguistictheoriescanbedirectlyimplementedasfinite-staterelations.TheimplementationoflinguisticallymotivatedrulesinFSTisthereforestraightforwardanddirect[11].
Modularity:Theclosurepropertiesofregularlanguagesandrelationsprovide
variousmeansforcombiningregularexpressions,supportingavarietyofoperationsonthelanguagestheseexpressionsdenote.Forexample,closureunderunionfacilitatesaseparatedevelopmentoftwogrammarfragmentswhichcanthenbedirectlycombinedinasingleoperation.Themostusefuloperationsunderwhichtransductionsareclosedisprobablycomposition,whichisthecentralvehicleforimplementingreplacerules[3,11].
Compactness:Finite-stateautomatacanbeminimized,guaranteeingthatfor
agivenlanguage,anautomatonwithaminimalnumberofstatescanal-waysbegenerated.Toolboxescanapplyminimizationeitherexplicitlyorimplicitlytoimprovestoragerequirements.
Efficiency:Whenanautomatonisdeterministic,recognitionisoptimallyef-ficient(linearinthelengthofthestringtoberecognized).Automatacanalwaysbedeterminized,andtoolboxescantakeadvantageofthistoimprovetimeefficiency.
Reversibility:Finite-stateautomataandtransducersareinherentlydeclara-tive:itistheapplicationprogramwhicheitherimplementsrecognitionorgeneration.Inparticular,transducerscanbeusedtomapstringsfromtheupperlanguagetothelowerlanguageorviceversawithnochangesintheunderlyingfinite-statedevice.Thesebenefitsencouragedthedevelopmentofseverallarge-scalemorphologicalgrammarsforavarietyoflanguages,includingsomewithcomplexmorphologysuchasGerman,French,Finnish,Turkish,ArabicandHebrew.
Themainclaimofthispaper,however,isthatfinite-statetechnologyisstillinferiortoitsalternativeswhenthedevelopmentoflarge-scalegrammarsiscon-cerned.Thisclaimissupportedbyarealisticexperimentdefiningasophisticatedmorphologicaltask,bothusingFST(section2)andwithadirectimplementa-tioninJavaofthesamegrammar(section3).Wecomparethetwoapproachesinsection4alongseveralaxes.Theconclusion(section5)istheidentificationoftwomainAchillesHeelsincontemporarytechnology:thelackofabstrac-tionmechanismsandthecomputationalburdenofincrementalchanges.Webelievethatthesetwoissuesshouldbethefocusoffutureresearchinfinite-statetechnology.
2AMotivatingExample
Inordertoevaluatethescalabilityoffinite-statetechnologyweconsider,asabenchmark,alarge-scaletask:accountingforthemorphologicalandortho-
Finite-StateTechnologyasaProgrammingEnvironment99
graphicphenomenaofHebrew,anaturallanguagewithnon-trivialmorphology.Clearly,languageswithsimplemorphology(e.g.,English)donotbenefitfromFSTapproaches,simplybecauseitissoinexpensivetogenerateandstorealltheinflectedforms.ItisonlywhenrelativelycomplicatedmorphologicalprocessesareinvolvedthatthebenefitsofFSTbecomeapparent,andHebrewischosenhereonlyasaparticularexample;theobservationsreportedinsection4arevalidingeneral,forallsimilartasks.
Hebrew,likeotherSemiticlanguages,hasarichandcomplexmorphology.Themajorwordformationmachineryisroot-and-pattern,whererootsaresequencesofthree(typically)ormoreconsonantsandpatternsaresequencesofvowelsand,sometimes,alsoconsonants,with“slots”intowhichtheroot’sconsonantsareinserted.Aftertherootcombineswiththepattern,somemorpho-phonologicalalterationstakeplace,whichmaybenon-trivial.Thecombinationofarootwithapatternproducesalexeme,whichcanthenbeinflectedinvariousforms.Inflectionalmorphologyishighlyproductiveandconsistsmostlyofsuffixes,butsometimesofprefixesorcircumfixes.Themorphologicalproblemsareamplifiedbyissuesoforthography.ThestandardHebrewscriptleavesmostofthevowelsunspecified.Furthermore,manyparticles,includingprepositions,conjunctionsandthedefinitearticle,attachtothewordswhichimmediatelyfollowthem.Asaresult,surfaceformsarehighlyambiguous.
Thefinite-stategrammarwhichweusedasabenchmarkhereisHAMSAH[12],anXFSTimplementationofHebrewmorphology.Thegrammarisobtainedbycomposingalarge-scalelexiconofHebrew(over20,000entries)withalargesetofrules,implementingmostlymorphologicalandorthographicprocessesinthelanguage.Asthelexiconisdevelopedindependently[13]andisrepresentedinXML,itmustbeconvertedtoXFSTbeforeitcanbeincorporatedinthegrammar.ThisisdonebyasetofPerlscriptswhichhadtobespecificallywrittenforthispurpose.Inotherwords,thesystemitselfisnotpurelyfinite-state,andwemaintainthatfewlarge-scalesystemsformorphologicalanalysiscanbepurelyfinite-state,assuchsystemsmustinteractwithindependentlydevelopedcomponentssuchaslexicons,annotationtools,userinterfacesetc.Aspecializedsetofrulesimplementsthemorphologicalprocesseswhichapplytoeachmajorpartofspeech.Forexample,figure1depictsasomewhatsimplifiedversionoftherulewhichaccountsforthewtsuffixofHebrewnouns.Thisrulemakesextensiveuseofcomposition(denotedby‘.o.’)andreplacerules(‘->’and‘<-’).Theeffectofthisruleisdual:onthesurfacelevel,itaccountsforalterationsintheconcatenationofthesuffixwiththestem(e.g.,iihbecomesih,wtchangestowiandafinalhortareelided);onthelexicallevel,itchangesthespecificationofnumberfromsingulartoplural.
Theruleshouldbereadfromthecenteroutwards.Thevariablenounde-notesthesetofalllexicalitemswhosepartofspeechisnoun;bydefault,thesenounsaresingularmasculine.InXFST,asetofwordsisidentifiedwithwiththeidentitytransductionwhichrelateseachwordinthesetwithitself.Thefirstcompositionontopofthenountransductionselectsonlythosenounswhosepluralattributeislexicallyspecifiedaswt(othernounsmaybelexicallyspeci-
100S.Wintner
fiedforadifferentpluralsuffix).Ofthose,onlytheoneswhosenumberattributeissingularareselected.Then,thevaluesingularinthelexical(upper)stringisreplacedbypluralinthecontextofimmediatelyfollowingtheattributenumber.Inthesurface(lower)language,meanwhile,asetofcompositionsop-eratorstakescareofthenecessaryorthographicchanges,andfinally,thepluralsuffixwtisconcatenatedtotheendofthesurfacestring.
definepluralWTNoun[[
[plural<-singular||number_].o.$[numbersingular].o.$[pluralwt].o.noun
.o.[iih->ih||_.#.].o.[it->i||_.#.].o.[wt->wi||_.#.].o.[[h|t]->0||_.#.]][0.x.[wt]]];
Fig.1.XFSTaccountofpluralnouns
Thisruleisagoodexampleofhowasinglephenomenonisfactoredoutandaccountedforindependentlyofotherphenomena:therulereferstolexi-calinformation,suchas‘number’or‘plural’,butcompletelyignoresirrelevantinformationsuchas,say,gender.However,italsohintsathowinformationismanipulatedbyregularexpressions.Sincefinite-statenetworkshavenomemory,saveforthestate,allinformationisencodedbystringswhicharemanipulatedbytherules.Thus,asimpleoperationsuchaschangingthevalueofthenumberfeaturefromsingulartopluralmustbecarriedoutbythesamereplaceruleswhichaccountforthechangestothesurfaceform.Furthermore,thereisnowaytostructuresuchinformation,asiscommoninprogramminglanguages;andthereisnowaytoencapsulateit.
3AnAlternativeImplementation
Were-implementedtheHAMSAHgrammardirectlyasaJavaprogram.Themethodweusedwasanalysisbygeneration:wefirstgeneratedalltheinflectedformsinducedbythelexiconandstoretheminadatabase;then,analysisissimplyadatabaselookup.Itiscommontothinkthatforlanguageswithrichmorphologysuchamethodisimpractical.Whilethismayhavebeenthecaseinthepast,contemporarycomputerscanefficientlystoreandretrievemillionsofinflectedforms.Ofcourse,thismethodwouldbreakinthefaceofaninfi-nitelexicon(whichcaneasilyberepresentedwithFST),butformostpracticalpurposesitissafetoassumethatnaturallanguagelexiconsarefinite.
Finite-StateTechnologyasaProgrammingEnvironment101
Toseparatelinguisticknowledgefromprocessingcodeasmuchaspossible,ourJavaimplementationusesadatabaseofrules,whicharesimplestringtransduc-tionsintendedtoaccountforsimple(mostlymorphemeboundary)morphologi-calandorthographicalterations.Whengeneratinginflectedforms,theprogramidentifiescertainconditions(e.g.,apluralsuffixwtistobeattachedtoanoun).Itthenlooksupthisconditionintheruledatabaseandretrievestheactiontoapply,dependingonthesuffixoftheinputstring.Anexampleoftheruledatabase,withalterationspertainingtothesuffixwt(cf.figure1),isdepictedinfigure2.Formostmorphologicalprocesses,solutionssuchasthiscanaccuratelystandforlinguisticrulesoftheformdepictedinfigure1.
Wheninputendsin:iihitwth,tdefaultReplaceitby:ihiwiThenadd:wtwtwtwtwtFig.2.Directaccountofpluralnouns
Notethatrulessuchastheonedepictedinfigure2aregenerationrules,
andmustnotbeconfusedwiththekindofad-hocrulesusedatruntimefor,e.g.,stemming.Theyfullyreflectthelinguisticknowledgeencodedinfinite-statereplacerules.Granted,theexampleruleissimplistic,andmorecomplexphe-nomenarequiremorecomplicatedrepresentation,butsincemostofmorphologytakesplacealongmorphemeboundaries,thisisareasonablerepresentation.Themorphologicalanalyzerwasobtainedbydirectlyimplementingtherulesandapplyingthemtothelexicon.Thenumberofinflectedforms(beforeat-tachingprefixes)is473,880(over300,000ofthoseareinflectednouns,andcloseto150,000areinflectedverbforms).Inadditiontoinflectedforms,theanalyzeralsoallowsasmanyas174differentsequencesofprefixparticlestobeattachedtowords;separationofprefixesfrominflectedformsisdoneatanalysistime.Thedirectimplementationisequivalenttothefinite-stategrammar:thiswasverifiedbyexhaustivelygeneratingalltheinflectedformswitheachofthesystemsandanalyzingthemwiththeothersystem.
4ComparisonandEvaluation
HavingdescribedtheXFSTbenchmarkgrammaranditsdirectJavaimplemen-tation,wenowcomparethetwoapproachesalongseveralaxes.Itisimportanttoemphasizethatwedonotwishtocomparethetwosystems,butratherthemethodology.Inparticular,wechoseXFSTasitisoneofthemostefficient,andcertainlythemostexpressive,FSTtoolboxavailable.ArecentcomparisonofXFSTwiththeFSAUtilitiespackage[14]showsthatthelattersimplycannothandlegrammarsofthescaleofHAMSAH.Thefollowingisalistofissuesinwhichfinite-statetechnologyturnedouttobeproblematiccomparedwiththealternative;inthenextsectionwefocusonissuesthatwebelieveshouldbegiven
102S.Wintner
moreattentioninfutureresearchonFST.Allexperimentsweredoneonadual2GHzprocessorLinuxmachinewith2.5Gbofmemory.
Truthfulness.OneoftheassetsofFSTisthatitallowsforaveryaccurateimplementationoflinguisticrules.However,agoodorganizationofthesoftwarecanprovideaclearseparationbetweenlinguisticknowledgeandprocessinginanyprogrammingenvironment,sothatlinguisticrulescanbeexpressedconciselyanddeclaratively,asexemplifiedinfigure2.
Reversibility.AclearadvantageofFSTisthatgrammarsarefullyreversible.However,withtheanalysisbygenerationparadigmthesameholdsalsoforadirectimplementation:thegeneratorisdirectlyimplemented,andtheanalyzerisimplementedassearchinthedatabaseofgeneratedforms.
Expressivity.Here,thedisadvantagesoffinite-statetechnologyasaprogram-mingenvironmentareclear.Programmingwithfinite-statetechnologyisverydifferentfromprogramminginordinarylanguages,mainlyduetothehighlyconstrainedexpressivepowerofregularrelations(programmerssometimesfeelthattheyareworkingwiththeirhandstiedbehindtheirbacks).WhileFSTcantheoreticallyaccountfornon-concatenativeprocesses,existingtoolboxespro-videapartial,andsometimesoverlycomplicated,solutionsforsuchproblems.Sometimesatrans-regularoperationiscalledfor,andmanyothertimestheconstrainedexpressivityofregularrelationsistoolimiting.
Portability.XFSTisaproprietarypackagewiththreeversionsavailableforthreecommonoperatingsystems.Otherfinite-statetoolboxesarefreer;FSAisopensource,butaswenotedearlieritsimplycannotcopewithgrammarsthesizeofHAMSAH.FSMisavailableforavarietyofUnixoperatingsystems,asabi-naryonly,whereasINTEXisdistributedasaWindowsexecutable.Incontrast,aJavaimplementationcanbedeliveredtouserswithallkindsof(contempo-rary)operatingsystemsandhardware,andisoptimallyportable.Thepracticalportabilitylimitationsdirectlyhampertheutilizationoffinite-statetechnologyinpractical,commercialsystems.
Abstraction.Large-scalemorphologicalgrammarstendtobeextremelynon-modular.Eachsurfacestringisassociated,duringitsprocessing,withalexicalcounterpartwhichdescribesitsstructure.Thelexicalstringisconstantlyre-writtenbytherules,asinfigure1.Duetotheinherentsequentialityofstrings,alltheinformationwhichisassociatedwithsurfacestringsisencodedsequentially.Inparticular,addingapieceofinformation(e.g.,addingthefeaturegendertoanexistinggrammarwhichdidnotspecifythisproperty)requiresachangeinalltheruleswhichaccountforthisinformation;thereisnowaytoabstractawayfromtheactualimplementationofthisinformation,andthegrammardevelopermustbeconsistentwithrespecttowherethisinformationisspecified(i.e.,whetheritprecedesorfollowsinformationonnumber).
Sinceinformationcannotbeencapsulatedandthelanguageprovidesnoab-stractionmechanisms,collaborativedevelopmentoffinite-stategrammarsisdif-ficult.Allgrammardevelopersmustbeawareofhowinformationisrepresented
Finite-StateTechnologyasaProgrammingEnvironment103
atalltimes.Furthermore,sincetheonlydatatypeisstrings,debuggingbecomesproblematic:veryfewerrorscanbedetectedatcompiletime.
Incontrast,adirectJavaimplementationbenefitsfromalltheadvantagesofdevelopinginanobject-orientedenvironment.Forexample,themoduleswhichinflectnounsandadjectivesinheritfromthesamemodule,accountingforallnominals,whichinturninheritsfromageneralmoduleofinflectionrules.Collaborativedevelopment.Adifferentfacetofmodularityhastodowiththequalificationsofthegrammardevelopers.InordertotakeadvantageofthefullpowerofXFST,grammardevelopersmustbesimultaneouslytrainedlinguistsandexperiencedprogrammers.Withadirectimplementation,atrueinterdisci-plinarycollaborationisenabledwherealinguistcanbeinchargeofcharacteriz-ingthelinguisticphenomena(andbuildingtheruledatabase)andaprogrammercanberesponsibleonlyfortheactualimplementation.
Maintenance.Aby-productofthenon-modularityofFSTgrammarsisthatmaintainingthemisdifficultandexpensive.Itishardtofindasinglepersonwhoisknowledgeableinallaspectsofthedesign,andanychangeinthegrammarispainful.Thismustbeaddedtothepoorcompile-timeperformance,whichagainhampersmaintainability.
Compile-timeefficiency.AmajorobstacleinthedevelopmentofXFSTgram-marsisthespeedofcompilation.Asiswellknown,manyofthefinite-stateoper-atorsresultinhugenetworks:theoretically,compositionofnetworksofmandnstatesyieldsanetworkwithO(m×n)states,andreplacerulesareimplementedusingcomposition.Thisleadstotemporarynetworkswhicharesometimeslargerthantheavailablememory,requiringdiskaccessandtherebyslowingcompila-tiondowndramatically.Whileautomatacanalwaysbeminimized,thisisnotthecasefortransducers[15].
Theoretically,itisveryeasytocomeupwithverysmallregularexpressionswhosecompilationisintractable.Foranyintegern>2,thereexistsann-stateautomatonA,suchthatanyautomatonthatacceptsthecomplementofL(A)needsatleast2n−2states[16].AnexampleofanXFSTexpressionwhosecompilationtimeisexponentialinnis:~[[a|b]*a[a|b]^nb[a|b]*].Inpractice,thecompleteHebrewgrammarisrepresented,inXFST,byanetworkofapproximately2millionstatesand2.2milliontransitions.Compilingtheentirenetworktakesover48minutesandrequires3Gbofmemory.
Compilationtimeisusuallyconsideredanegligiblecriterionforevaluatingsystemperformance.However,whendevelopingalarge-scalesystem,theabilitytomakeminorchangesandquicklyre-makethesystemiscrucial.WithXFST,modificationofevenasinglelexicalentryrequiresatleastanintersectionof(theXFSTrepresentationof)thisentrywiththenetworkrepresentingtheruleswhichapplytoit.Asaconcreteexample,addingasingletwo-characterpropername(whichdoesnotinflect)tothelexiconincreasedthesizeofthenetworkby9statesand10arcs,buttookalmostthreeminutestocompile.Addingatwo-characteradjectiveresultedintheadditionof27statesand30arcs,andtookaboutthesametime.
104S.Wintner
Inthedirectimplementation,modificationofasinglelexicalentryrequiresgenerationofallinflectedformsofthisentry,whichtakesafractionofasecond;thetimeittakestogenerateklexicalentriesisproportionaltokandisinde-pendentofthesizeoftheremainderofthesystem.Theanalysisprogramisnotaltered.
Tosummarizethedifferences,figure3showsthetimeittakestocompileanetworkwhenklexicalentriesaremodified,forthreevaluesofk,correspondingtothenumberofadjectives,adverbsandthesizeoftheentirelexicon.Thistimeiscomparedwiththetimeittakestogenerateallinflectedformsofthesesub-lexiconsintheanalysisbygenerationparadigm.
#items360(adverbs)1,648(adjectives)21,400(all)FST13:4713:5548:12Java0:143:5930:34
Fig.3.Compilation/generationtimes(inminutes)whensomelexicalitemschange
Run-timeefficiency.Whilefinite-stateautomataguaranteelinearrecognition
time,thisisnotthecasewithtransducers,whichcannotalwaysbedeterminized[17].Evenwhenadevicecanbedeterminized,thedeterminizationalgorithmisinefficient(theoretically,thesizeofthedeterministicautomatoncanbeexpo-nentialinthesizeofitsnon-deterministiccounterpart).
Asitturnsout,storingadatabaseofhalfamillioninflectedforms(alongwiththeiranalyses)isinexpensive,andretrievingitemsfromthedatabasecanbedoneveryefficiently.Weexperimentedwithtwoversions:oneusesMySQLasthedatabaseandtheotherloadstheinflectedformsintoahashtable.Inthislatterversion,mostofthetimeisspentonloadingthedatabase,andretrievaltimeisnegligible.
Wecomparedtheperformanceofthetwosystemsonfourtasks,analyzingtextfilesof10,100,1,000and10,000tokens.Theresultsaresummarizedinfigure4,andclearlydemonstratethesuperiorityofthedirectimplementation.Intermsofmemoryrequirements,XFSTrequiresapproximately57Mbofmemory,whereastheJavaimplementationusesnomorethan10Mb.Thisisnotasignificantissuewithcontemporaryhardware.
5Discussion
Wecomparedtheprocessofdevelopingalarge-scalemorphologicalgrammarforHebrewwithfinite-statetechnologywithadirectimplementationofthemor-phologicalrulesinJava.Ourconclusionisthatfinite-statetechnologyremainssuperiortoitsalternativeswithrespecttothetruerepresentationoflinguisticknowledge,andisthereforemoreadequateforsmaller-scalegrammars,especiallythosewhosegoalistodemonstratespecificlinguisticphenomenaratherthanformthebasisoflargepracticalsystems.However,viewedasaprogramming
Finite-StateTechnologyasaProgrammingEnvironment
#Tokens10FST1.25Java+MySQL1.24Java+Hash5.00
100
2.403.045.15
1,00010,00012.97118.718.8444.945.597.64
105
Fig.4.Timeperformanceofbothanalyzers(inseconds)
environment,FSTsuffersfromseverelimitations,themostsignificantofwhicharelackofabstractionanddifficultiesinincrementalprocessing.
Abstractionistheessenceofcomputerscienceandthekeytosoftwarede-velopment.Workingwithregularexpressionsanddevelopingruleswhichusestringsastheonlydatastructuredoesnotleavemuchspaceforsophisticatedabstraction.Severalworksattempttoremedythisproblem.XFSTitselfprovidesalimitedsolution,intheformofflagdiacritics[5].Thesearefeature-valuepairswhichcanbeaddedtotheunderlyingmachinesinordertoaddlimitedmemorytonetworks;asimilarsolution,whichisfullyworked-outmathematically,ispro-videdbyfinite-stateregisteredautomata[7].Theseapproachesaretoolow-leveltoprovidethekindofabstractionthatprogrammershavebecomeusedto.Astepintherightdirectionistheincorporationoffeaturestructuresandunificationintofinite-statetransducers[18],andinparticulartherecentproposaltousetypedfeaturestructuresastheentitiesonwhichsuchtransducersoperate[19].Moreresearchisneededinordertofullydevelopthisdirectionandincorporateitsconsequencesintoafinite-statebasedgrammardevelopmentframework.Theproblemofincrementalgrammardevelopment,exemplifiedinfigure3,canalsoberemediedbyincorporatingsomerecenttheoreticalresults,inpartic-ularinincrementalconstructionoflexicons[20,21],intoanexistingframework.Ordinaryprogramminglanguagesbenefitfromdecadesofresearchandinnova-tionincompilationtheoryandoptimization.Inorderforfinite-statetechnologytobecomeaviableprogrammingenvironmentfornaturallanguagemorphologyapplications,moreresearchisneededalongthelinessuggestedhere.
Acknowledgments.ThisworkwasfundedbytheIsraeliMinistryofScienceandTechnology,undertheauspicesoftheKnowledgeCenterforProcessingHe-brew.TheresearchwassupportedbyagrantfromtheIsraelInternetAssociation.IamverygratefultoShlomoYonaforimplementingtheXFSTgrammarandtoDaliaBojanforimplementingtheJavasystem.KemalOflazerprovidedusefulcommentsonanearlierversionofthispaper.IalsowishtothankYaelCohen-Sygal,AlonItai,NuritMelnikandShiraSchwartzfortheirhelp.Theviewsex-pressedinthispaperaswellasallremainingerrorsare,ofcourse,myown.
References
1.Johnson,C.D.:FormalAspectsofPhonologicalDescription.Mouton,TheHague(1972)
2.Koskenniemi,K.:Two-LevelMorphology:aGeneralComputationalModelforWord-FormRecognitionandProduction.TheDepartmentofGeneralLinguistics,UniversityofHelsinki(1983)
106S.Wintner
3.Kaplan,R.M.,Kay,M.:Regularmodelsofphonologicalrulesystems.Computa-tionalLinguistics20(1994)331–378
4.Roche,E.,Schabes,Y.,eds.:Finite-StateLanguageProcessing.Language,SpeechandCommunication.MITPress,Cambridge,MA(1997)
5.Beesley,K.R.,Karttunen,L.:Finite-StateMorphology:XeroxToolsandTech-niques.CSLI,Stanford(2003)
6.Beesley,K.R.:Arabicmorphologyusingonlyfinite-stateoperations.InRosner,M.,ed.:ProceedingsoftheWorkshoponComputationalApproachestoSemiticlanguages,Montreal,Quebec,COLING-ACL’98(1998)50–577.Cohen-Sygal,Y.,Wintner,S.:Finite-stateregisteredautomatafornon-concatenativemorphology.ComputationalLinguistics32(2006)49–828.Silberztein,M.:Dictionnaires´electroniquesetanalyseautomatiquedetextes:lesyst`emeINTEX.Masson,Paris(1993)
9.Mohri,M.,Pereira,F.,Riley,M.:Thedesignprinciplesofaweightedfinite-statetransducerlibrary.TheoreticalComputerScience231(2000)17–32
10.vanNoord,G.,Gerdemann,D.:Anextendibleregularexpressioncompilerfor
finite-stateapproachesinnaturallanguageprocessing.InBoldt,O.,J¨urgensen,H.,eds.:AutomataImplementation.Number2214inLectureNotesinComputerScience.Springer(2001)
11.Karttunen,L.:Thereplaceoperator.In:ProceedingsoftheAnnualMeetingof
theAssociationforComputationalLinguistics.(1995)16–23
12.Yona,S.,Wintner,S.:Afinite-statemorphologicalgrammarofHebrew.Natural
LanguageEngineering(Forthcoming)
13.Itai,A.,Wintner,S.,Yona,S.:AcomputationallexiconofcontemporaryHebrew.
In:ProceedingsofThefifthinternationalconferenceonLanguageResourcesandEvaluation(LREC-2006),Genoa,Italy(2006)
14.Cohen-Sygal,Y.,Wintner,S.:XFST2FSA:Comparingtwofinite-statetoolboxes.
In:ProceedingsoftheACL-2005WorkshoponSoftware,AnnArbor,MI(2005)15.Mohri,M.:Minimizationalgorithmsforsequentialtransducers.TheoreticalCom-puterScience234(2000)177–201
16.Holzer,M.,Kutrib,M.:Statecomplexityofbasicoperationsonnondeterministic
finiteautomata.In:ImplementationandApplicationofAutomata(CIAA’02).(2002)151–160
17.Mohri,M.:Finite-statetransducersinlanguageandspeechprocessing.Computa-tionalLinguistics23(1997)269–312
18.Zajac,R.:Featurestructures,unificationandfinite-statetransducers.In:
FSMNLP’98:TheInternationalWorkshoponFinite-stateMethodsinNaturalLan-guageProcessing,Ankara,Turkey(1998)
19.Amtrup,J.W.:Morphologyinmachinetranslationsystems:Efficientintegration
offinitestatetransducersandfeaturestructuredescriptions.MachineTranslation18(2003)217–238
20.Daciuk,J.,Mihov,S.,Watson,B.W.,Watson,R.E.:Incrementalconstructionof
minimalacyclicfinite-stateautomata.ComputationalLinguistics26(2000)3–1621.Carrasco,R.C.,Forcada,M.L.:Incrementalconstructionandmaintenanceofmin-imalfinite-stateautomata.ComputationalLinguistics28(2002)207–216
因篇幅问题不能全部显示,请点此查看更多更全内容