您的当前位置:首页正文

Finite-state technology as a programming environment

来源:爱站旅游
导读Finite-state technology as a programming environment
Finite-StateTechnology

asaProgrammingEnvironment

ShulyWintner

DepartmentofComputerScience,UniversityofHaifa,31905Haifa,Israel

shuly@cs.haifa.ac.il

Abstract.Finite-statetechnologyisconsideredthepreferredmodelforrepresentingthephonologyandmorphologyofnaturallanguages.Theattractivenessofthistechnologyfornaturallanguageprocessingstemsfromfoursources:modularityofthedesign,duetotheclosureproper-tiesofregularlanguagesandrelations;thecompactrepresentationthatisachievedthroughminimization;efficiency,whichisaresultoflinearrecognitiontimewithfinite-statedevices;andreversibility,resultingfromthedeclarativenatureofsuchdevices.

However,whenwide-coveragegrammarsareconsidered,finite-statetechnologydoesnotscaleupwell,andthebenefitsofthistechnologycanbeovershadowedbythelimitationsitimposesasaprogrammingen-vironmentforlanguageprocessing.Thispaperfocusesonseveralaspectsoflarge-scalegrammardevelopment.Usingareal-worldbenchmark,wecompareafinite-stateimplementationwithanequivalentJavaprogramwithrespecttoeaseofdevelopment,modularity,maintainabilityofthecodeandspaceandtimeefficiency.Weidentifytwomainproblems,abstractionandincrementaldevelopment,whicharecurrentlynotad-dressedsufficientlywellbyfinite-statetechnology,andwhichwebelieveshouldbethefocusoffutureresearchanddevelopment.

1Introduction

Finite-statetechnology(FST)denotestheuseoffinite-statedevices,suchasau-tomataandtransducers,innaturallanguageprocessing(NLP).Sincetheearlyworkswhichdemonstratedtheapplicabilityofthistechnologytolinguisticrepre-sentation[1,2,3],FSTisconsideredadequatefordescribingthephonologicalandmorphologicalprocessesoftheworld’slanguages[4,5].Evennon-concatenativeprocessessuchascircumfixation,root-and-patternmorphologyorreduplication,wereshowntobeinprincipleimplementableinFST[6,7].

TheutilityofFSTforNLPwasemphasizedbytheimplementationofseveraltoolboxeswhichprovideextendedregularexpressionlanguagesandcompilerswhichconvertexpressionstofinite-stateautomataandtransducers.Thesein-cludeINTEX[8];FSM[9],whichisaunix-basedsetofprogramsformanipu-latingautomataandtransducers;FSAUtilities[10],whichisafreelyavailable,Prologimplementedsystem;andXFST[5],whichisacommercialpackageas-sumedtobethemostsuitableforlinguisticapplicationsbyprovidingthemostexpressivelanguage.

A.Gelbukh(Ed.):CICLing2007,LNCS4394,pp.97–106,2007.cSpringer-VerlagBerlinHeidelberg2007󰀁

98S.Wintner

ThebenefitsofFSTforNLPstemfromseveralpropertiesoffinite-statedevices:

Truerepresentation:FollowingthepioneeringworkofJohnson[1],itisnow

clearthatthekindofphonologicalandmorphologicalrulesthatarecommoninlinguistictheoriescanbedirectlyimplementedasfinite-staterelations.TheimplementationoflinguisticallymotivatedrulesinFSTisthereforestraightforwardanddirect[11].

Modularity:Theclosurepropertiesofregularlanguagesandrelationsprovide

variousmeansforcombiningregularexpressions,supportingavarietyofoperationsonthelanguagestheseexpressionsdenote.Forexample,closureunderunionfacilitatesaseparatedevelopmentoftwogrammarfragmentswhichcanthenbedirectlycombinedinasingleoperation.Themostusefuloperationsunderwhichtransductionsareclosedisprobablycomposition,whichisthecentralvehicleforimplementingreplacerules[3,11].

Compactness:Finite-stateautomatacanbeminimized,guaranteeingthatfor

agivenlanguage,anautomatonwithaminimalnumberofstatescanal-waysbegenerated.Toolboxescanapplyminimizationeitherexplicitlyorimplicitlytoimprovestoragerequirements.

Efficiency:Whenanautomatonisdeterministic,recognitionisoptimallyef-ficient(linearinthelengthofthestringtoberecognized).Automatacanalwaysbedeterminized,andtoolboxescantakeadvantageofthistoimprovetimeefficiency.

Reversibility:Finite-stateautomataandtransducersareinherentlydeclara-tive:itistheapplicationprogramwhicheitherimplementsrecognitionorgeneration.Inparticular,transducerscanbeusedtomapstringsfromtheupperlanguagetothelowerlanguageorviceversawithnochangesintheunderlyingfinite-statedevice.Thesebenefitsencouragedthedevelopmentofseverallarge-scalemorphologicalgrammarsforavarietyoflanguages,includingsomewithcomplexmorphologysuchasGerman,French,Finnish,Turkish,ArabicandHebrew.

Themainclaimofthispaper,however,isthatfinite-statetechnologyisstillinferiortoitsalternativeswhenthedevelopmentoflarge-scalegrammarsiscon-cerned.Thisclaimissupportedbyarealisticexperimentdefiningasophisticatedmorphologicaltask,bothusingFST(section2)andwithadirectimplementa-tioninJavaofthesamegrammar(section3).Wecomparethetwoapproachesinsection4alongseveralaxes.Theconclusion(section5)istheidentificationoftwomainAchillesHeelsincontemporarytechnology:thelackofabstrac-tionmechanismsandthecomputationalburdenofincrementalchanges.Webelievethatthesetwoissuesshouldbethefocusoffutureresearchinfinite-statetechnology.

2AMotivatingExample

Inordertoevaluatethescalabilityoffinite-statetechnologyweconsider,asabenchmark,alarge-scaletask:accountingforthemorphologicalandortho-

Finite-StateTechnologyasaProgrammingEnvironment99

graphicphenomenaofHebrew,anaturallanguagewithnon-trivialmorphology.Clearly,languageswithsimplemorphology(e.g.,English)donotbenefitfromFSTapproaches,simplybecauseitissoinexpensivetogenerateandstorealltheinflectedforms.ItisonlywhenrelativelycomplicatedmorphologicalprocessesareinvolvedthatthebenefitsofFSTbecomeapparent,andHebrewischosenhereonlyasaparticularexample;theobservationsreportedinsection4arevalidingeneral,forallsimilartasks.

Hebrew,likeotherSemiticlanguages,hasarichandcomplexmorphology.Themajorwordformationmachineryisroot-and-pattern,whererootsaresequencesofthree(typically)ormoreconsonantsandpatternsaresequencesofvowelsand,sometimes,alsoconsonants,with“slots”intowhichtheroot’sconsonantsareinserted.Aftertherootcombineswiththepattern,somemorpho-phonologicalalterationstakeplace,whichmaybenon-trivial.Thecombinationofarootwithapatternproducesalexeme,whichcanthenbeinflectedinvariousforms.Inflectionalmorphologyishighlyproductiveandconsistsmostlyofsuffixes,butsometimesofprefixesorcircumfixes.Themorphologicalproblemsareamplifiedbyissuesoforthography.ThestandardHebrewscriptleavesmostofthevowelsunspecified.Furthermore,manyparticles,includingprepositions,conjunctionsandthedefinitearticle,attachtothewordswhichimmediatelyfollowthem.Asaresult,surfaceformsarehighlyambiguous.

Thefinite-stategrammarwhichweusedasabenchmarkhereisHAMSAH[12],anXFSTimplementationofHebrewmorphology.Thegrammarisobtainedbycomposingalarge-scalelexiconofHebrew(over20,000entries)withalargesetofrules,implementingmostlymorphologicalandorthographicprocessesinthelanguage.Asthelexiconisdevelopedindependently[13]andisrepresentedinXML,itmustbeconvertedtoXFSTbeforeitcanbeincorporatedinthegrammar.ThisisdonebyasetofPerlscriptswhichhadtobespecificallywrittenforthispurpose.Inotherwords,thesystemitselfisnotpurelyfinite-state,andwemaintainthatfewlarge-scalesystemsformorphologicalanalysiscanbepurelyfinite-state,assuchsystemsmustinteractwithindependentlydevelopedcomponentssuchaslexicons,annotationtools,userinterfacesetc.Aspecializedsetofrulesimplementsthemorphologicalprocesseswhichapplytoeachmajorpartofspeech.Forexample,figure1depictsasomewhatsimplifiedversionoftherulewhichaccountsforthewtsuffixofHebrewnouns.Thisrulemakesextensiveuseofcomposition(denotedby‘.o.’)andreplacerules(‘->’and‘<-’).Theeffectofthisruleisdual:onthesurfacelevel,itaccountsforalterationsintheconcatenationofthesuffixwiththestem(e.g.,iihbecomesih,wtchangestowiandafinalhortareelided);onthelexicallevel,itchangesthespecificationofnumberfromsingulartoplural.

Theruleshouldbereadfromthecenteroutwards.Thevariablenounde-notesthesetofalllexicalitemswhosepartofspeechisnoun;bydefault,thesenounsaresingularmasculine.InXFST,asetofwordsisidentifiedwithwiththeidentitytransductionwhichrelateseachwordinthesetwithitself.Thefirstcompositionontopofthenountransductionselectsonlythosenounswhosepluralattributeislexicallyspecifiedaswt(othernounsmaybelexicallyspeci-

100S.Wintner

fiedforadifferentpluralsuffix).Ofthose,onlytheoneswhosenumberattributeissingularareselected.Then,thevaluesingularinthelexical(upper)stringisreplacedbypluralinthecontextofimmediatelyfollowingtheattributenumber.Inthesurface(lower)language,meanwhile,asetofcompositionsop-eratorstakescareofthenecessaryorthographicchanges,andfinally,thepluralsuffixwtisconcatenatedtotheendofthesurfacestring.

definepluralWTNoun[[

[plural<-singular||number_].o.$[numbersingular].o.$[pluralwt].o.noun

.o.[iih->ih||_.#.].o.[it->i||_.#.].o.[wt->wi||_.#.].o.[[h|t]->0||_.#.]][0.x.[wt]]];

Fig.1.XFSTaccountofpluralnouns

Thisruleisagoodexampleofhowasinglephenomenonisfactoredoutandaccountedforindependentlyofotherphenomena:therulereferstolexi-calinformation,suchas‘number’or‘plural’,butcompletelyignoresirrelevantinformationsuchas,say,gender.However,italsohintsathowinformationismanipulatedbyregularexpressions.Sincefinite-statenetworkshavenomemory,saveforthestate,allinformationisencodedbystringswhicharemanipulatedbytherules.Thus,asimpleoperationsuchaschangingthevalueofthenumberfeaturefromsingulartopluralmustbecarriedoutbythesamereplaceruleswhichaccountforthechangestothesurfaceform.Furthermore,thereisnowaytostructuresuchinformation,asiscommoninprogramminglanguages;andthereisnowaytoencapsulateit.

3AnAlternativeImplementation

Were-implementedtheHAMSAHgrammardirectlyasaJavaprogram.Themethodweusedwasanalysisbygeneration:wefirstgeneratedalltheinflectedformsinducedbythelexiconandstoretheminadatabase;then,analysisissimplyadatabaselookup.Itiscommontothinkthatforlanguageswithrichmorphologysuchamethodisimpractical.Whilethismayhavebeenthecaseinthepast,contemporarycomputerscanefficientlystoreandretrievemillionsofinflectedforms.Ofcourse,thismethodwouldbreakinthefaceofaninfi-nitelexicon(whichcaneasilyberepresentedwithFST),butformostpracticalpurposesitissafetoassumethatnaturallanguagelexiconsarefinite.

Finite-StateTechnologyasaProgrammingEnvironment101

Toseparatelinguisticknowledgefromprocessingcodeasmuchaspossible,ourJavaimplementationusesadatabaseofrules,whicharesimplestringtransduc-tionsintendedtoaccountforsimple(mostlymorphemeboundary)morphologi-calandorthographicalterations.Whengeneratinginflectedforms,theprogramidentifiescertainconditions(e.g.,apluralsuffixwtistobeattachedtoanoun).Itthenlooksupthisconditionintheruledatabaseandretrievestheactiontoapply,dependingonthesuffixoftheinputstring.Anexampleoftheruledatabase,withalterationspertainingtothesuffixwt(cf.figure1),isdepictedinfigure2.Formostmorphologicalprocesses,solutionssuchasthiscanaccuratelystandforlinguisticrulesoftheformdepictedinfigure1.

Wheninputendsin:iihitwth,tdefaultReplaceitby:ihiwi󰁯Thenadd:wtwtwtwtwtFig.2.Directaccountofpluralnouns

Notethatrulessuchastheonedepictedinfigure2aregenerationrules,

andmustnotbeconfusedwiththekindofad-hocrulesusedatruntimefor,e.g.,stemming.Theyfullyreflectthelinguisticknowledgeencodedinfinite-statereplacerules.Granted,theexampleruleissimplistic,andmorecomplexphe-nomenarequiremorecomplicatedrepresentation,butsincemostofmorphologytakesplacealongmorphemeboundaries,thisisareasonablerepresentation.Themorphologicalanalyzerwasobtainedbydirectlyimplementingtherulesandapplyingthemtothelexicon.Thenumberofinflectedforms(beforeat-tachingprefixes)is473,880(over300,000ofthoseareinflectednouns,andcloseto150,000areinflectedverbforms).Inadditiontoinflectedforms,theanalyzeralsoallowsasmanyas174differentsequencesofprefixparticlestobeattachedtowords;separationofprefixesfrominflectedformsisdoneatanalysistime.Thedirectimplementationisequivalenttothefinite-stategrammar:thiswasverifiedbyexhaustivelygeneratingalltheinflectedformswitheachofthesystemsandanalyzingthemwiththeothersystem.

4ComparisonandEvaluation

HavingdescribedtheXFSTbenchmarkgrammaranditsdirectJavaimplemen-tation,wenowcomparethetwoapproachesalongseveralaxes.Itisimportanttoemphasizethatwedonotwishtocomparethetwosystems,butratherthemethodology.Inparticular,wechoseXFSTasitisoneofthemostefficient,andcertainlythemostexpressive,FSTtoolboxavailable.ArecentcomparisonofXFSTwiththeFSAUtilitiespackage[14]showsthatthelattersimplycannothandlegrammarsofthescaleofHAMSAH.Thefollowingisalistofissuesinwhichfinite-statetechnologyturnedouttobeproblematiccomparedwiththealternative;inthenextsectionwefocusonissuesthatwebelieveshouldbegiven

102S.Wintner

moreattentioninfutureresearchonFST.Allexperimentsweredoneonadual2GHzprocessorLinuxmachinewith2.5Gbofmemory.

Truthfulness.OneoftheassetsofFSTisthatitallowsforaveryaccurateimplementationoflinguisticrules.However,agoodorganizationofthesoftwarecanprovideaclearseparationbetweenlinguisticknowledgeandprocessinginanyprogrammingenvironment,sothatlinguisticrulescanbeexpressedconciselyanddeclaratively,asexemplifiedinfigure2.

Reversibility.AclearadvantageofFSTisthatgrammarsarefullyreversible.However,withtheanalysisbygenerationparadigmthesameholdsalsoforadirectimplementation:thegeneratorisdirectlyimplemented,andtheanalyzerisimplementedassearchinthedatabaseofgeneratedforms.

Expressivity.Here,thedisadvantagesoffinite-statetechnologyasaprogram-mingenvironmentareclear.Programmingwithfinite-statetechnologyisverydifferentfromprogramminginordinarylanguages,mainlyduetothehighlyconstrainedexpressivepowerofregularrelations(programmerssometimesfeelthattheyareworkingwiththeirhandstiedbehindtheirbacks).WhileFSTcantheoreticallyaccountfornon-concatenativeprocesses,existingtoolboxespro-videapartial,andsometimesoverlycomplicated,solutionsforsuchproblems.Sometimesatrans-regularoperationiscalledfor,andmanyothertimestheconstrainedexpressivityofregularrelationsistoolimiting.

Portability.XFSTisaproprietarypackagewiththreeversionsavailableforthreecommonoperatingsystems.Otherfinite-statetoolboxesarefreer;FSAisopensource,butaswenotedearlieritsimplycannotcopewithgrammarsthesizeofHAMSAH.FSMisavailableforavarietyofUnixoperatingsystems,asabi-naryonly,whereasINTEXisdistributedasaWindowsexecutable.Incontrast,aJavaimplementationcanbedeliveredtouserswithallkindsof(contempo-rary)operatingsystemsandhardware,andisoptimallyportable.Thepracticalportabilitylimitationsdirectlyhampertheutilizationoffinite-statetechnologyinpractical,commercialsystems.

Abstraction.Large-scalemorphologicalgrammarstendtobeextremelynon-modular.Eachsurfacestringisassociated,duringitsprocessing,withalexicalcounterpartwhichdescribesitsstructure.Thelexicalstringisconstantlyre-writtenbytherules,asinfigure1.Duetotheinherentsequentialityofstrings,alltheinformationwhichisassociatedwithsurfacestringsisencodedsequentially.Inparticular,addingapieceofinformation(e.g.,addingthefeaturegendertoanexistinggrammarwhichdidnotspecifythisproperty)requiresachangeinalltheruleswhichaccountforthisinformation;thereisnowaytoabstractawayfromtheactualimplementationofthisinformation,andthegrammardevelopermustbeconsistentwithrespecttowherethisinformationisspecified(i.e.,whetheritprecedesorfollowsinformationonnumber).

Sinceinformationcannotbeencapsulatedandthelanguageprovidesnoab-stractionmechanisms,collaborativedevelopmentoffinite-stategrammarsisdif-ficult.Allgrammardevelopersmustbeawareofhowinformationisrepresented

Finite-StateTechnologyasaProgrammingEnvironment103

atalltimes.Furthermore,sincetheonlydatatypeisstrings,debuggingbecomesproblematic:veryfewerrorscanbedetectedatcompiletime.

Incontrast,adirectJavaimplementationbenefitsfromalltheadvantagesofdevelopinginanobject-orientedenvironment.Forexample,themoduleswhichinflectnounsandadjectivesinheritfromthesamemodule,accountingforallnominals,whichinturninheritsfromageneralmoduleofinflectionrules.Collaborativedevelopment.Adifferentfacetofmodularityhastodowiththequalificationsofthegrammardevelopers.InordertotakeadvantageofthefullpowerofXFST,grammardevelopersmustbesimultaneouslytrainedlinguistsandexperiencedprogrammers.Withadirectimplementation,atrueinterdisci-plinarycollaborationisenabledwherealinguistcanbeinchargeofcharacteriz-ingthelinguisticphenomena(andbuildingtheruledatabase)andaprogrammercanberesponsibleonlyfortheactualimplementation.

Maintenance.Aby-productofthenon-modularityofFSTgrammarsisthatmaintainingthemisdifficultandexpensive.Itishardtofindasinglepersonwhoisknowledgeableinallaspectsofthedesign,andanychangeinthegrammarispainful.Thismustbeaddedtothepoorcompile-timeperformance,whichagainhampersmaintainability.

Compile-timeefficiency.AmajorobstacleinthedevelopmentofXFSTgram-marsisthespeedofcompilation.Asiswellknown,manyofthefinite-stateoper-atorsresultinhugenetworks:theoretically,compositionofnetworksofmandnstatesyieldsanetworkwithO(m×n)states,andreplacerulesareimplementedusingcomposition.Thisleadstotemporarynetworkswhicharesometimeslargerthantheavailablememory,requiringdiskaccessandtherebyslowingcompila-tiondowndramatically.Whileautomatacanalwaysbeminimized,thisisnotthecasefortransducers[15].

Theoretically,itisveryeasytocomeupwithverysmallregularexpressionswhosecompilationisintractable.Foranyintegern>2,thereexistsann-stateautomatonA,suchthatanyautomatonthatacceptsthecomplementofL(A)needsatleast2n−2states[16].AnexampleofanXFSTexpressionwhosecompilationtimeisexponentialinnis:~[[a|b]*a[a|b]^nb[a|b]*].Inpractice,thecompleteHebrewgrammarisrepresented,inXFST,byanetworkofapproximately2millionstatesand2.2milliontransitions.Compilingtheentirenetworktakesover48minutesandrequires3Gbofmemory.

Compilationtimeisusuallyconsideredanegligiblecriterionforevaluatingsystemperformance.However,whendevelopingalarge-scalesystem,theabilitytomakeminorchangesandquicklyre-makethesystemiscrucial.WithXFST,modificationofevenasinglelexicalentryrequiresatleastanintersectionof(theXFSTrepresentationof)thisentrywiththenetworkrepresentingtheruleswhichapplytoit.Asaconcreteexample,addingasingletwo-characterpropername(whichdoesnotinflect)tothelexiconincreasedthesizeofthenetworkby9statesand10arcs,buttookalmostthreeminutestocompile.Addingatwo-characteradjectiveresultedintheadditionof27statesand30arcs,andtookaboutthesametime.

104S.Wintner

Inthedirectimplementation,modificationofasinglelexicalentryrequiresgenerationofallinflectedformsofthisentry,whichtakesafractionofasecond;thetimeittakestogenerateklexicalentriesisproportionaltokandisinde-pendentofthesizeoftheremainderofthesystem.Theanalysisprogramisnotaltered.

Tosummarizethedifferences,figure3showsthetimeittakestocompileanetworkwhenklexicalentriesaremodified,forthreevaluesofk,correspondingtothenumberofadjectives,adverbsandthesizeoftheentirelexicon.Thistimeiscomparedwiththetimeittakestogenerateallinflectedformsofthesesub-lexiconsintheanalysisbygenerationparadigm.

#items360(adverbs)1,648(adjectives)21,400(all)FST13:4713:5548:12Java0:143:5930:34

Fig.3.Compilation/generationtimes(inminutes)whensomelexicalitemschange

Run-timeefficiency.Whilefinite-stateautomataguaranteelinearrecognition

time,thisisnotthecasewithtransducers,whichcannotalwaysbedeterminized[17].Evenwhenadevicecanbedeterminized,thedeterminizationalgorithmisinefficient(theoretically,thesizeofthedeterministicautomatoncanbeexpo-nentialinthesizeofitsnon-deterministiccounterpart).

Asitturnsout,storingadatabaseofhalfamillioninflectedforms(alongwiththeiranalyses)isinexpensive,andretrievingitemsfromthedatabasecanbedoneveryefficiently.Weexperimentedwithtwoversions:oneusesMySQLasthedatabaseandtheotherloadstheinflectedformsintoahashtable.Inthislatterversion,mostofthetimeisspentonloadingthedatabase,andretrievaltimeisnegligible.

Wecomparedtheperformanceofthetwosystemsonfourtasks,analyzingtextfilesof10,100,1,000and10,000tokens.Theresultsaresummarizedinfigure4,andclearlydemonstratethesuperiorityofthedirectimplementation.Intermsofmemoryrequirements,XFSTrequiresapproximately57Mbofmemory,whereastheJavaimplementationusesnomorethan10Mb.Thisisnotasignificantissuewithcontemporaryhardware.

5Discussion

Wecomparedtheprocessofdevelopingalarge-scalemorphologicalgrammarforHebrewwithfinite-statetechnologywithadirectimplementationofthemor-phologicalrulesinJava.Ourconclusionisthatfinite-statetechnologyremainssuperiortoitsalternativeswithrespecttothetruerepresentationoflinguisticknowledge,andisthereforemoreadequateforsmaller-scalegrammars,especiallythosewhosegoalistodemonstratespecificlinguisticphenomenaratherthanformthebasisoflargepracticalsystems.However,viewedasaprogramming

Finite-StateTechnologyasaProgrammingEnvironment

#Tokens10FST1.25Java+MySQL1.24Java+Hash5.00

100

2.403.045.15

1,00010,00012.97118.718.8444.945.597.64

105

Fig.4.Timeperformanceofbothanalyzers(inseconds)

environment,FSTsuffersfromseverelimitations,themostsignificantofwhicharelackofabstractionanddifficultiesinincrementalprocessing.

Abstractionistheessenceofcomputerscienceandthekeytosoftwarede-velopment.Workingwithregularexpressionsanddevelopingruleswhichusestringsastheonlydatastructuredoesnotleavemuchspaceforsophisticatedabstraction.Severalworksattempttoremedythisproblem.XFSTitselfprovidesalimitedsolution,intheformofflagdiacritics[5].Thesearefeature-valuepairswhichcanbeaddedtotheunderlyingmachinesinordertoaddlimitedmemorytonetworks;asimilarsolution,whichisfullyworked-outmathematically,ispro-videdbyfinite-stateregisteredautomata[7].Theseapproachesaretoolow-leveltoprovidethekindofabstractionthatprogrammershavebecomeusedto.Astepintherightdirectionistheincorporationoffeaturestructuresandunificationintofinite-statetransducers[18],andinparticulartherecentproposaltousetypedfeaturestructuresastheentitiesonwhichsuchtransducersoperate[19].Moreresearchisneededinordertofullydevelopthisdirectionandincorporateitsconsequencesintoafinite-statebasedgrammardevelopmentframework.Theproblemofincrementalgrammardevelopment,exemplifiedinfigure3,canalsoberemediedbyincorporatingsomerecenttheoreticalresults,inpartic-ularinincrementalconstructionoflexicons[20,21],intoanexistingframework.Ordinaryprogramminglanguagesbenefitfromdecadesofresearchandinnova-tionincompilationtheoryandoptimization.Inorderforfinite-statetechnologytobecomeaviableprogrammingenvironmentfornaturallanguagemorphologyapplications,moreresearchisneededalongthelinessuggestedhere.

Acknowledgments.ThisworkwasfundedbytheIsraeliMinistryofScienceandTechnology,undertheauspicesoftheKnowledgeCenterforProcessingHe-brew.TheresearchwassupportedbyagrantfromtheIsraelInternetAssociation.IamverygratefultoShlomoYonaforimplementingtheXFSTgrammarandtoDaliaBojanforimplementingtheJavasystem.KemalOflazerprovidedusefulcommentsonanearlierversionofthispaper.IalsowishtothankYaelCohen-Sygal,AlonItai,NuritMelnikandShiraSchwartzfortheirhelp.Theviewsex-pressedinthispaperaswellasallremainingerrorsare,ofcourse,myown.

References

1.Johnson,C.D.:FormalAspectsofPhonologicalDescription.Mouton,TheHague(1972)

2.Koskenniemi,K.:Two-LevelMorphology:aGeneralComputationalModelforWord-FormRecognitionandProduction.TheDepartmentofGeneralLinguistics,UniversityofHelsinki(1983)

106S.Wintner

3.Kaplan,R.M.,Kay,M.:Regularmodelsofphonologicalrulesystems.Computa-tionalLinguistics20(1994)331–378

4.Roche,E.,Schabes,Y.,eds.:Finite-StateLanguageProcessing.Language,SpeechandCommunication.MITPress,Cambridge,MA(1997)

5.Beesley,K.R.,Karttunen,L.:Finite-StateMorphology:XeroxToolsandTech-niques.CSLI,Stanford(2003)

6.Beesley,K.R.:Arabicmorphologyusingonlyfinite-stateoperations.InRosner,M.,ed.:ProceedingsoftheWorkshoponComputationalApproachestoSemiticlanguages,Montreal,Quebec,COLING-ACL’98(1998)50–577.Cohen-Sygal,Y.,Wintner,S.:Finite-stateregisteredautomatafornon-concatenativemorphology.ComputationalLinguistics32(2006)49–828.Silberztein,M.:Dictionnaires´electroniquesetanalyseautomatiquedetextes:lesyst`emeINTEX.Masson,Paris(1993)

9.Mohri,M.,Pereira,F.,Riley,M.:Thedesignprinciplesofaweightedfinite-statetransducerlibrary.TheoreticalComputerScience231(2000)17–32

10.vanNoord,G.,Gerdemann,D.:Anextendibleregularexpressioncompilerfor

finite-stateapproachesinnaturallanguageprocessing.InBoldt,O.,J¨urgensen,H.,eds.:AutomataImplementation.Number2214inLectureNotesinComputerScience.Springer(2001)

11.Karttunen,L.:Thereplaceoperator.In:ProceedingsoftheAnnualMeetingof

theAssociationforComputationalLinguistics.(1995)16–23

12.Yona,S.,Wintner,S.:Afinite-statemorphologicalgrammarofHebrew.Natural

LanguageEngineering(Forthcoming)

13.Itai,A.,Wintner,S.,Yona,S.:AcomputationallexiconofcontemporaryHebrew.

In:ProceedingsofThefifthinternationalconferenceonLanguageResourcesandEvaluation(LREC-2006),Genoa,Italy(2006)

14.Cohen-Sygal,Y.,Wintner,S.:XFST2FSA:Comparingtwofinite-statetoolboxes.

In:ProceedingsoftheACL-2005WorkshoponSoftware,AnnArbor,MI(2005)15.Mohri,M.:Minimizationalgorithmsforsequentialtransducers.TheoreticalCom-puterScience234(2000)177–201

16.Holzer,M.,Kutrib,M.:Statecomplexityofbasicoperationsonnondeterministic

finiteautomata.In:ImplementationandApplicationofAutomata(CIAA’02).(2002)151–160

17.Mohri,M.:Finite-statetransducersinlanguageandspeechprocessing.Computa-tionalLinguistics23(1997)269–312

18.Zajac,R.:Featurestructures,unificationandfinite-statetransducers.In:

FSMNLP’98:TheInternationalWorkshoponFinite-stateMethodsinNaturalLan-guageProcessing,Ankara,Turkey(1998)

19.Amtrup,J.W.:Morphologyinmachinetranslationsystems:Efficientintegration

offinitestatetransducersandfeaturestructuredescriptions.MachineTranslation18(2003)217–238

20.Daciuk,J.,Mihov,S.,Watson,B.W.,Watson,R.E.:Incrementalconstructionof

minimalacyclicfinite-stateautomata.ComputationalLinguistics26(2000)3–1621.Carrasco,R.C.,Forcada,M.L.:Incrementalconstructionandmaintenanceofmin-imalfinite-stateautomata.ComputationalLinguistics28(2002)207–216

因篇幅问题不能全部显示,请点此查看更多更全内容