Thursday, July 31, 2008

Lemmatization and Stemming

A colleague brought up the term lemmatization to me today, and I thought it would make a great blog topic. After realizing that he wasn’t insulting me, I looked further into the meaning of the word.

My first stop was the online dictionary. Lemmatization apparently means “the act or process of lemmatizing” (thanks a lot, online dictionary). Not being happy with that definition, I looked at Wikipedia, the most truthful and valuable source of information on the web (sarcasm intended). It offered a bit more help.

Lemmatization is the process of reducing a word to its lemma, or canonical form. For example, the lemma of “walking” is “walk”, and the lemma of “better” is “good”. Lemmatization takes the context and meaning of the word into account when determining its base form.
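To make that a little more concrete, here is a minimal sketch of the idea in Python. Everything in it is an assumption for illustration: the `LEMMAS` table and the `toy_lemmatize` function are made-up names, and the part-of-speech label stands in for "context". A real lemmatizer relies on a full dictionary and proper language analysis rather than four hand-typed entries.

```python
# Hypothetical hand-written lookup table, for illustration only.
# Keys pair a word with its part of speech, which is our stand-in for context.
LEMMAS = {
    ("walking", "verb"): "walk",
    ("better", "adjective"): "good",
    ("meeting", "verb"): "meet",     # "we are meeting tomorrow"
    ("meeting", "noun"): "meeting",  # "the meeting ran long"
}


def toy_lemmatize(word, part_of_speech):
    """Return the lemma for (word, part_of_speech), or the lowercased word if unknown."""
    key = (word.lower(), part_of_speech)
    return LEMMAS.get(key, word.lower())


if __name__ == "__main__":
    for word, pos in [("walking", "verb"), ("better", "adjective"),
                      ("meeting", "verb"), ("meeting", "noun")]:
        print(word, "as", pos, "->", toy_lemmatize(word, pos))
```

The two “meeting” entries are the interesting part: the same spelling gets a different lemma depending on how the word is used, which is exactly what a context-blind approach cannot do.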

Stemming is the process of reducing a word to its root form without necessarily taking its context or meaning into account. For example, a suffix-stripping stemming algorithm would reduce “walking”, “walks”, and “walker” to the root “walk”, but it would not reduce “ran” to “run”, because there is no suffix to strip. Lemmatization algorithms are generally more accurate because the meaning and context of the word are considered.
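The suffix-stripping behavior described above can be sketched in a few lines of Python. This is a toy, not the Porter stemmer or any other published algorithm: the `SUFFIXES` list and the `naive_stem` function are hypothetical names, and the three-letter minimum is an arbitrary choice to avoid chopping very short words.

```python
# Toy suffix-stripping stemmer, for illustration only.
# The rule list is deliberately tiny and is an assumption of this sketch.
SUFFIXES = ("ing", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule applied, so the word is returned unchanged


if __name__ == "__main__":
    for w in ("walking", "walks", "walker", "ran"):
        print(w, "->", naive_stem(w))
```

With these rules, “walking”, “walks”, and “walker” all come back as “walk”, while “ran” comes back untouched, since the stemmer only looks at word endings and has no idea that “ran” is a form of “run”.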

Why is this important? Using lemmatization when searching your archive during a discovery request gives you more accurate results, because matches are based on meaning and context, which is much more valuable than a straight keyword search.
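Here is a rough sketch of the difference, again in Python. The three-sentence `ARCHIVE`, the `LEMMA_TABLE`, and all of the function names are invented for this example, and the tiny lookup table ignores context entirely, so treat it as a cartoon of what a real archive search would do rather than a description of any actual product.

```python
# Hypothetical mini lemma table; a real system would use a full lemmatizer.
LEMMA_TABLE = {"ran": "run", "running": "run", "runs": "run"}

# Made-up archive contents, for illustration only.
ARCHIVE = [
    "She ran the quarterly audit.",
    "The audit runs every quarter.",
    "Nobody reviewed the report.",
]


def lemma_of(word):
    """Look up a word's lemma, falling back to the word itself."""
    return LEMMA_TABLE.get(word, word)


def tokens(text):
    """Split text into lowercase words with surrounding punctuation removed."""
    return [t.strip(".,!?").lower() for t in text.split()]


def keyword_search(query, documents):
    """Return documents containing the exact query word."""
    q = query.lower()
    return [d for d in documents if q in tokens(d)]


def lemma_search(query, documents):
    """Return documents containing any word that shares the query's lemma."""
    q = lemma_of(query.lower())
    return [d for d in documents if q in {lemma_of(t) for t in tokens(d)}]


if __name__ == "__main__":
    print("keyword:", keyword_search("run", ARCHIVE))
    print("lemma:  ", lemma_search("run", ARCHIVE))
```

Searching for “run”, the straight keyword search finds nothing, because no document contains that exact word. The lemma-based search returns the first two documents, since “ran” and “runs” both map to “run”.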

The next time you are at a major social function please remember this blog. I am sure explaining the meaning of the word lemmatization will not only impress your many friends but also make you the life of the party (or you might be assaulted, it could go either way…). :)

Wikipedia source: http://en.wikipedia.org/wiki/Stemming and http://en.wikipedia.org/wiki/Lemmatisation

