Exhaustive Affix Stripping And A Malay Word Register To Solve Stemming Errors And Ambiguity Problem In Malay Stemmers

Salhana Amad Darwis
Rukaini Abdullah
Norisma Idris

Abstract

Stemmers, or word stemming algorithms, reduce a derivative word to its root word by removing all of its affixes. The complexity of Malay Language (ML) morphological rules and of the Malay lexicon makes stemming Malay words difficult. There is no fixed method for determining which affix should be removed from a derivative word to produce the correct root word. Furthermore, a derivative word may contain more than one valid root word. Stemming errors still exist in previous Malay Language Stemmers (MLS). Regardless of the approach used, they rely on the first affix matched or the first root word found. Hence, some words were under-stemmed or over-stemmed, while words with several valid root words were not stemmed to the correct root word. This multiple-root-word problem, or ambiguity problem, has never been addressed by previous MLS. To solve the over-stemming and under-stemming errors, we propose an approach that exhaustively strips all matched affixes to ensure that a valid root word is extracted. In addition, we propose the use of a Malay Word Register to address the ambiguity problem of determining the correct root word. We tested the proposed approach with words from newspaper articles, a Malay translation of the Quran, history essays, and words incorrectly stemmed by the previous stemmers. The results show that the stemmer achieves 99.8% accuracy with no stemming errors; the remaining inaccuracy is due to the approach used for the ambiguity problem.
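
As a rough illustration of the two ideas in the abstract, the Python sketch below strips every matching affix instead of stopping at the first match, collects all valid roots, and then consults a register when more than one root is found. It is not the authors' implementation. The affix lists, the tiny lexicon, the register contents, the function names and the longest-root fallback are all assumptions made for this example, and the spelling changes that real Malay prefixes trigger are ignored.

```python
# Toy data: every list, entry and name below is an illustrative assumption.
PREFIXES = ("ber", "di", "ke", "me", "pe", "se", "ter")
SUFFIXES = ("kan", "an", "i")
ROOTS = {"beri", "ikan", "makan", "ajar"}

# Hypothetical register mapping an ambiguous derivative word to its preferred root.
WORD_REGISTER = {"berikan": "beri"}


def candidate_roots(word, roots=ROOTS):
    """Return every root in `roots` reachable by stripping matching affixes."""
    found = set()
    seen = set()
    stack = [word]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in roots:
            found.add(current)
        # Try every matching prefix and suffix, not only the first one.
        for prefix in PREFIXES:
            if current.startswith(prefix) and len(current) > len(prefix):
                stack.append(current[len(prefix):])
        for suffix in SUFFIXES:
            if current.endswith(suffix) and len(current) > len(suffix):
                stack.append(current[:-len(suffix)])
    return found


def stem(word):
    """Pick one root: a unique candidate, else the register entry, else a fallback."""
    candidates = candidate_roots(word)
    if not candidates:
        return word  # no valid root found; leave the word unchanged
    if len(candidates) == 1:
        return next(iter(candidates))
    preferred = WORD_REGISTER.get(word)
    if preferred in candidates:
        return preferred
    # Assumed fallback for this sketch: longest root, ties broken alphabetically.
    return min(candidates, key=lambda root: (-len(root), root))
```

With this toy data, "berikan" yields two candidates, "beri" (with the suffix "-kan") and "ikan" (with the prefix "ber-"), so the register decides between them, whereas "makanan" has the single candidate "makan" and needs no lookup.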

Article Details

How to Cite
Darwis, S. A., Abdullah, R., & Idris, N. (2012). Exhaustive Affix Stripping And A Malay Word Register To Solve Stemming Errors And Ambiguity Problem In Malay Stemmers. Malaysian Journal of Computer Science, 25(4), 196–209. Retrieved from https://ejournal.um.edu.my/index.php/MJCS/article/view/6717
Section
Articles