Optimizing n‑gram Order of an n‑gram Based Language Identification Algorithm for 68 Written Languages

Chew Y Choong, Yoshiki Mikami, CA Marasinghe, ST Nandasara

Abstract


Language identification technology is widely used in machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages, but the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n‑gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how the n‑gram order and a mixed n‑gram model affect the relative performance and accuracy of language identification. The n‑gram based algorithm used in this paper does not depend on n‑gram frequency. Instead, it uses a Boolean method to determine the output of matching target n‑grams to training n‑grams. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. All three properties must be identified because a language can be written in different scripts and encoded with different character encoding schemes. The experimental results show that in one test the algorithm achieved a correct identification rate of up to 99.59% on selected languages. The results also show that the performance of language identification can be improved by using a mixed n‑gram model of bigrams and trigrams. The mixed n‑gram model also consumed less disk space and computing time than a trigram model.
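The Boolean matching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (ngrams, train, identify), the label format, and the scoring rule (the fraction of distinct target n‑grams that also occur in the training set) are assumptions made for illustration. The sketch works on raw bytes so that one label can stand for a language, script and encoding combination, and it mixes bigrams and trigrams by default.

```python
from typing import Dict, Set, Tuple


def ngrams(data: bytes, orders: Tuple[int, ...] = (2, 3)) -> Set[bytes]:
    """Return the set of distinct byte n-grams of the given orders."""
    return {data[i:i + n] for n in orders for i in range(len(data) - n + 1)}


def train(samples: Dict[str, bytes],
          orders: Tuple[int, ...] = (2, 3)) -> Dict[str, Set[bytes]]:
    """Build one n-gram set per label; frequencies are deliberately not kept."""
    return {label: ngrams(text, orders) for label, text in samples.items()}


def identify(target: bytes, models: Dict[str, Set[bytes]],
             orders: Tuple[int, ...] = (2, 3)) -> str:
    """Pick the label whose training set contains the largest share of
    the target's distinct n-grams (hypothetical scoring rule)."""
    grams = ngrams(target, orders)
    if not grams:
        raise ValueError("target text is too short for the chosen n-gram orders")

    def match_rate(label: str) -> float:
        # Boolean test per n-gram: present in the training set or not.
        hits = sum(1 for g in grams if g in models[label])
        return hits / len(grams)

    return max(models, key=match_rate)


# Toy training data; the labels are hypothetical language/script/encoding triples.
samples = {
    "English/Latin/UTF-8":
        "the quick brown fox jumps over the lazy dog and then runs away".encode("utf-8"),
    "German/Latin/UTF-8":
        "der schnelle braune fuchs springt über den faulen hund und läuft dann weg".encode("utf-8"),
}
models = train(samples)
print(identify("the dog runs".encode("utf-8"), models))
```

Passing orders=(3,) gives a pure trigram model, which makes it easy to compare model size and accuracy against the mixed bigram and trigram setting on a given corpus.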

DOI: 10.4038/icter.v2i2.1385

The International Journal on Advances in ICT for Emerging Regions 2009 02 (02): 21-28



International Journal on Advances in ICT for Emerging Regions (ICTer) ISSN: 1800-4156
