package Lingua::Stem::En; =head1 NAME Lingua::Stem::En - Porter's stemming algorithm for 'generic' English =head1 SYNOPSIS use Lingua::Stem::En; my $stems = Lingua::Stem::En::stem({ -words => $word_list_reference, -locale => 'en', -exceptions => $exceptions_hash, }); =head1 DESCRIPTION This routine applies the Porter Stemming Algorithm to its parameters, returning the stemmed words. It is derived from the C program "stemmer.c" as found in freewais and elsewhere, which contains these notes: Purpose: Implementation of the Porter stemming algorithm documented in: Porter, M.F., "An Algorithm For Suffix Stripping," Program 14 (3), July 1980, pp. 130-137. Provenance: Written by B. Frakes and C. Cox, 1986. I have re-interpreted areas that use Frakes and Cox's "WordSize" function. My version may misbehave on short words starting with "y", but I can't think of any examples. The step numbers correspond to Frakes and Cox, and are probably in Porter's article (which I've not seen). Porter's algorithm still has rough spots (e.g current/currency, -ings words), which I've not attempted to cure, although I have added support for the British -ise suffix. =head1 CHANGES 1999.06.15 - Changed to '.pm' module, moved into Lingua::Stem namespace, optionalized the export of the 'stem' routine into the caller's namespace, added named parameters 1999.06.24 - Switch core implementation of the Porter stemmer to the one written by Jim Richardson 2000.08.25 - 2.11 Added stemming cache 2000.09.14 - 2.12 Fixed *major* :( implementation error of Porter's algorithm Error was entirely my fault - I completely forgot to include rule sets 2,3, and 4 starting with Lingua::Stem 0.30. -- Jerilyn Franz 2003.09.28 - 2.13 Corrected documentation error pointed out by Simon Cozens. 2005.11.20 - 2.14 Changed rule declarations to conform to Perl style convention for 'private' subroutines. Changed Exporter invokation to more portable 'require' vice 'use'. 2006.02.14 - 2.15 Added ability to pass word list by 'handle' for in-place stemming. 2009.07.27 - 2.16 Documentation Fix 2020.06.20 - 2.30 Version renumber for module consistency. =cut ####################################################################### # Initialization ####################################################################### use strict; use warnings; require Exporter; use Carp; use vars qw (@ISA @EXPORT_OK @EXPORT %EXPORT_TAGS $VERSION); BEGIN { $VERSION = "2.30"; @ISA = qw (Exporter); @EXPORT = (); @EXPORT_OK = qw (stem clear_stem_cache stem_caching); %EXPORT_TAGS = (); } my $Stem_Caching = 0; my $Stem_Cache = {}; my %Stem_Cache2 = (); # #V Porter.pm V2.11 25 Aug 2000 stemming cache # Porter.pm V2.1 21 Jun 1999 with '&$sub if defined' not 'eval ""' # Porter.pm V2.0 25 Nov 1994 (for Perl 5.000) # porter.pl V1.0 10 Aug 1994 (for Perl 4.036) # Jim Richardson, University of Sydney # jimr@maths.usyd.edu.au or http://www.maths.usyd.edu.au:8000/jimr.html # Find a canonical stem for a word, assumed to consist entirely of # lower-case letters. The approach is from # # M. F. Porter, An algorithm for suffix stripping, Program (Automated # Library and Information Systems) 14 (3) 130-7, July 1980. # # This algorithm is used by WAIS: for example, see freeWAIS-0.3 at # # http://kudzu.cnidr.org/cnidr_projects/cnidr_projects.html # Some additional rules are used here, mainly to allow for British spellings # like -ise. They are marked ** in the code. # Initialization required before using subroutine stem: # We count syllables slightly differently from Porter: we say the syllable # count increases on each occurrence in the word of an adjacent pair # # [aeiouy][^aeiou] # # This avoids any need to define vowels and consonants, or confusion over # 'y'. It also works slightly better: our definition gives two syllables # in 'yttrium', while Porter's gives only one because the initial 'y' is # taken to be a consonant. But it is not quite obvious: for example, # consider 'mayfly' where, when working backwards (see below), the 'yf' # matches the above pattern, even though it is the 'ay' which in Porter's # terms increments the syllable count. # # We wish to match the above in context, working backwards from the end of # the word: the appropriate regular expression is my $syl = '[aeiou]*[^aeiou][^aeiouy]*[aeiouy]'; # (This works because [^aeiouy] is a subset of [^aeiou].) If we want two # syllables ("m>1" in Porter's terminology) we can just match $syl$syl. # For step 1b we need to be able to detect the presence of a vowel: here # we revert to Porter's definition that a vowel is [aeiou], or y preceded # by a consonant. (If the . below is a vowel, then the . is the desired # vowel; if the . is a consonant the y is the desired vowel.) my $hasvow = '[^aeiouy]*([aeiou]|y.)'; =head1 METHODS =cut ####################################################################### =over 4 =item stem({ -words => \@words, -locale => 'en', -exceptions => \%exceptions }); Stems a list of passed words using the rules of US English. Returns an anonymous array reference to the stemmed words. Example: my @words = ( 'wordy', 'another' ); my $stemmed_words = Lingua::Stem::En::stem({ -words => \@words, -locale => 'en', -exceptions => \%exceptions, }); If the first element of @words is a list reference, then the stemming is performed 'in place' on that list (modifying the passed list directly instead of copying it to a new array). This is only useful if you do not need to keep the original list. If you B need to keep the original list, use the normal semantic of having 'stem' return a new list instead - that is faster than making your own copy B using the 'in place' semantics since the primary difference between 'in place' and 'by value' stemming is the creation of a copy of the original list. If you B need the original list, then the 'in place' stemming is about 60% faster. Example of 'in place' stemming: my $words = [ 'wordy', 'another' ]; my $stemmed_words = Lingua::Stem::En::stem({ -words => [$words], -locale => 'en', -exceptions => \%exceptions, }); The 'in place' mode returns a reference to the original list with the words stemmed. =back =cut sub stem { return [] if ($#_ == -1); my $parm_ref; if (ref $_[0]) { $parm_ref = shift; } else { $parm_ref = { @_ }; } my $words = []; my $locale = 'en'; my $exceptions = {}; foreach (keys %$parm_ref) { my $key = lc ($_); my $value = $parm_ref->{$key}; if ($key eq '-words') { @$words = @$value; if (ref($words->[0]) eq 'ARRAY'){ $words = $words->[0]; } } elsif ($key eq '-exceptions') { $exceptions = $parm_ref->{$key}; } elsif ($key eq '-locale') { $locale = $parm_ref->{$key}; } else { croak (__PACKAGE__ . "::stem() - Unknown parameter '$key' with value '$parm_ref->{$key}'\n"); } } local( $_ ); foreach (@$words) { # Flatten case $_ = lc $_; # Check against cache of stemmed words if (exists $Stem_Cache2{$_}) { $_ = $Stem_Cache2{$_}; next; } # Check against exceptions list if (exists $exceptions->{$_}) { $_ = $exceptions->{$_}; next; } my $original_word = $_; # Step 0 - remove punctuation s/'s$//; s/^[^a-z]+//; s/[^a-z]+$//; next unless /^[a-z]+$/; # Reverse the word so we can easily apply pattern matching to the end: $_ = reverse $_; # Step 1a: plurals -- sses->ss, ies->i, ss->ss, s->0 m!^s! && ( s!^se(ss|i)!$1! || s!^s([^s])!$1! ); # Step 1b: participles -- SYLeed->SYLee, VOWed->VOW, VOWing->VOW; # but ated->ate etc s!^dee($syl)!ee$1!o || ( s!^(de|gni)($hasvow)!$2!o && ( # at->ate, bl->ble, iz->ize, is->ise s!^(ta|lb|[sz]i)!e$1! || # ** ise as well as ize # CC->C (C consonant other than l, s, z) s!^([^aeioulsz])\1!$1! || # (m=1) CVD->CVDe (C consonant, V vowel, D consonant not w, x, y) s!^([^aeiouwxy][aeiouy][^aeiou]+)$!e$1! ) ); # Step 1c: change y to i: happy->happi, sky->sky s!^y($hasvow)!i$1!o; # Step 2: double and triple suffices (part 1) # Switch on last three letters (fails harmlessly if subroutine undefined) -- # thanks to Ian Phillipps who wrote # CPAN authors/id/IANPX/Stem-0.1.tar.gz # for suggesting the replacement of # eval( '&S2' . unpack( 'a3', $_ ) ); # (where the eval ignores undefined subroutines) by the much faster # eval { &{ 'S2' . substr( $_, 0, 3 ) } }; # But the following is slightly faster still: { no strict 'refs'; my $sub; # Step 3: double and triple suffices, etc (part 2) &$sub if defined &{ $sub = '_S2' . substr( $_, 0, 3 ) }; # Step 3: double and triple suffices, etc (part 2) &$sub if defined &{ $sub = '_S3' . substr( $_, 0, 3 ) }; # Step 4: single suffices on polysyllables &$sub if defined &{ $sub = '_S4' . substr( $_, 0, 2 ) }; } # Step 5a: tidy up final e -- probate->probat, rate->rate; cease->ceas m!^e! && ( s!^e($syl$syl)!$1!o || # Porter's ( m=1 and not *o ) E where o = cvd with d a consonant # not w, x or y: ! m!^e[^aeiouwxy][aeiouy][^aeiou]! && # not *o E s!^e($syl[aeiouy]*[^aeiou]*)$!$1!o # m=1 ); # Step 5b: double l -- controll->control, roll->roll # ** Note correction: Porter has m>1 here ($syl$syl), but it seems m>0 # ($syl) is wanted to strip an l off controll. s!^ll($syl)!l$1!o; $_ = scalar( reverse $_ ); $Stem_Cache2{$original_word} = $_ if $Stem_Caching; } %Stem_Cache2 = () if ($Stem_Caching < 2); return $words; } ############################################################## # Rule set 4 sub _S4la { # SYLSYLal -> SYLSYL s!^la($syl$syl)!$1!o; } sub _S4ec { # SYLSYL[ae]nce -> SYLSYL s!^ecn[ae]($syl$syl)!$1!o; } sub _S4re { # SYLSYLer -> SYLSYL s!^re($syl$syl)!$1!o; } sub _S4ci { # SYLSYLic -> SYLSYL s!^ci($syl$syl)!$1!o; } sub _S4el { # SYLSYL[ai]ble -> SYLSYL s!^elb[ai]($syl$syl)!$1!o; } sub _S4tn { # SYLSYLant -> SYLSYL, SYLSYLe?ment -> SYLSYL, SYLSYLent -> SYLSYL s!^tn(a|e(me?)?)($syl$syl)!$3!o; } sub _S4no { # SYLSYL[st]ion -> SYLSYL[st] s!^noi([st]$syl$syl)!$1!o; } sub _S4uo { # SYLSYLou -> SYLSYL e.g. homologou -> homolog s!^uo($syl$syl)!$1!o; } sub _S4ms { # SYLSYLism -> SYLSYL s!^msi($syl$syl)!$1!o; } sub _S4et { # SYLSYLate -> SYLSYL s!^eta($syl$syl)!$1!o; } sub _S4it { # SYLSYLiti -> SYLSYL s!^iti($syl$syl)!$1!o; } sub _S4su { # SYLSYLous -> SYLSYL s!^suo($syl$syl)!$1!o; } sub _S4ev { # SYLSYLive -> SYLSYL s!^evi($syl$syl)!$1!o; } sub _S4ez { # SYLSYLize -> SYLSYL s!^ezi($syl$syl)!$1!o; } sub _S4es { # SYLSYLise -> SYLSYL ** s!^esi($syl$syl)!$1!o; } ############################################################## # Rule set 2 sub _S2lan { # SYLational -> SYLate, SYLtional -> SYLtion s!^lanoita($syl)!eta$1!o || s!^lanoit($syl)!noit$1!o; } sub _S2icn { # SYLanci -> SYLance, SYLency ->SYLence s!^icn([ae]$syl)!ecn$1!o; } sub _S2res { # SYLiser -> SYLise ** &_S2rez; } sub _S2rez { # SYLizer -> SYLize s!^re(.)i($syl)!e$1i$2!o; } sub _S2ilb { # SYLabli -> SYLable, SYLibli -> SYLible ** (e.g. incredibli) s!^ilb([ai]$syl)!elb$1!o; } sub _S2ill { # SYLalli -> SYLal s!^illa($syl)!la$1!o; } sub _S2ilt { # SYLentli -> SYLent s!^iltne($syl)!tne$1!o } sub _S2ile { # SYLeli -> SYLe s!^ile($syl)!e$1!o; } sub _S2ils { # SYLousli -> SYLous s!^ilsuo($syl)!suo$1!o; } sub _S2noi { # SYLization -> SYLize, SYLisation -> SYLise**, SYLation -> SYLate s!^noita([sz])i($syl)!e$1i$2!o || s!^noita($syl)!eta$1!o; } sub _S2rot { # SYLator -> SYLate s!^rota($syl)!eta$1!o; } sub _S2msi { # SYLalism -> SYLal s!^msila($syl)!la$1!o; } sub _S2sse { # SYLiveness -> SYLive, SYLfulness -> SYLful, SYLousness -> SYLous s!^ssen(evi|luf|suo)($syl)!$1$2!o; } sub _S2iti { # SYLaliti -> SYLal, SYLiviti -> SYLive, SYLbiliti ->SYLble s!^iti(la|lib|vi)($syl)! ( $1 eq 'la' ? 'la' : $1 eq 'lib' ? 'elb' : 'evi' ) . $2 !eo; } ############################################################## # Rule set 3 sub _S3eta { # SYLicate -> SYLic s!^etaci($syl)!ci$1!o; } sub _S3evi { # SYLative -> SYL s!^evita($syl)!$1!o; } sub _S3ezi { # SYLalize -> SYLal s!^ezila($syl)!la$1!o; } sub _S3esi { # SYLalise -> SYLal ** s!^esila($syl)!la$1!o; } sub _S3iti { # SYLiciti -> SYLic s!^itici($syl)!ci$1!o; } sub _S3lac { # SYLical -> SYLic s!^laci($syl)!ci$1!o; } sub _S3luf { # SYLful -> SYL s!^luf($syl)!$1!o; } sub _S3sse { # SYLness -> SYL s!^ssen($syl)!$1!o; } ############################################################## =over 4 =item stem_caching({ -level => 0|1|2 }); Sets the level of stem caching. '0' means 'no caching'. This is the default level. '1' means 'cache per run'. This caches stemming results during a single call to 'stem'. '2' means 'cache indefinitely'. This caches stemming results until either the process exits or the 'clear_stem_cache' method is called. =back =cut sub stem_caching { my $parm_ref; if (ref $_[0]) { $parm_ref = shift; } else { $parm_ref = { @_ }; } my $caching_level = $parm_ref->{-level}; if (defined $caching_level) { if ($caching_level !~ m/^[012]$/) { croak(__PACKAGE__ . "::stem_caching() - Legal values are '0','1' or '2'. '$caching_level' is not a legal value"); } $Stem_Caching = $caching_level; if ($caching_level < 2) { %Stem_Cache2 = (); } } return $Stem_Caching; } ############################################################## =over 4 =item clear_stem_cache; Clears the cache of stemmed words =back =cut sub clear_stem_cache { %Stem_Cache2 = (); } ############################################################## =head1 NOTES This code is almost entirely derived from the Porter 2.1 module written by Jim Richardson. =head1 SEE ALSO Lingua::Stem =head1 AUTHOR Jim Richardson, University of Sydney jimr@maths.usyd.edu.au or http://www.maths.usyd.edu.au:8000/jimr.html Integration in Lingua::Stem by Jerilyn Franz, FreeRun Technologies, =head1 COPYRIGHT Jim Richardson, University of Sydney Jerilyn Franz, FreeRun Technologies This code is freely available under the same terms as Perl. =head1 BUGS =head1 TODO =cut 1;