Automatic detection of languages
STILUS Lang is a product in the STILUS family that automatically determines the language in which a passage of text is written and then processes the text according to that language.
STILUS Lang can currently distinguish nine languages: Spanish, Catalan, Basque, Galician, English, French, German, Italian and Dutch.
To detect the language, the words of the text are analyzed and looked up in each of the supported languages. For this purpose, there is a word list as well as a table of n-gram frequencies (frequencies of n-letter sequences) for each language.
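As an illustration of what an n-gram frequency table might look like, here is a minimal Python sketch. It is not STILUS Lang's actual implementation: the function name `ngram_frequencies`, the default of n = 3, and the choice to pad each word with spaces so that word boundaries are counted are all assumptions made for the example.

```python
from collections import Counter


def ngram_frequencies(text: str, n: int = 3) -> dict[str, float]:
    """Return the relative frequency of each n-letter sequence in `text`.

    Hypothetical helper: padding words with a space on each side is an
    assumption of this sketch, not something stated in the source.
    """
    counts: Counter[str] = Counter()
    for raw_word in text.lower().split():
        # Keep letters only, so punctuation and digits do not form n-grams.
        word = "".join(ch for ch in raw_word if ch.isalpha())
        if not word:
            continue
        padded = f" {word} "
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


if __name__ == "__main__":
    for gram, freq in sorted(ngram_frequencies("the other theme").items()):
        print(repr(gram), round(freq, 3))
```

A table of this kind, built from a large sample of each language, could then be compared with the table computed from the text under analysis.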
The process is straightforward. The words of the text are extracted one by one and checked against the available lexical resources, and the distribution of n-gram frequencies is calculated. Each time a word is matched, the corresponding language is awarded a score. Once the initial words of the passage have been checked, the language with the highest score is taken to be the language of the text. However, a minimum score is required before a language is assigned to the text.
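The word-lookup part of this procedure can be sketched in Python as follows. Everything in the sketch is hypothetical: the tiny `LEXICONS` word lists, the function name `detect_language`, the limit of 50 initial words and the minimum score of 2 are illustrative assumptions, and the n-gram evidence is left out for brevity. Ties between languages are not handled specially.

```python
from collections import Counter

# Hypothetical toy word lists; a real system would use full lexicons.
LEXICONS = {
    "es": {"el", "la", "de", "que", "y", "en", "los"},
    "en": {"the", "of", "and", "to", "in", "is", "that"},
    "fr": {"le", "la", "de", "et", "les", "des", "est"},
}


def detect_language(text, lexicons=LEXICONS, max_words=50, min_score=2):
    """Return the best-scoring language code, or None if no language
    reaches `min_score` (both thresholds are assumed values)."""
    # Extract words one by one, stripping surrounding punctuation.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w][:max_words]  # only the initial words

    scores = Counter()
    for word in words:
        for lang, lexicon in lexicons.items():
            if word in lexicon:
                scores[lang] += 1  # one point per matched word

    if not scores:
        return None
    best_lang, best_score = scores.most_common(1)[0]
    # Refuse to answer when the evidence is too weak.
    return best_lang if best_score >= min_score else None


if __name__ == "__main__":
    print(detect_language("The name of the game is in the rules"))
    print(detect_language("El perro de la casa y los gatos"))
    print(detect_language("Zebra xylophone"))
```

The minimum score is what keeps very short or out-of-vocabulary passages from being assigned a language on the strength of a single chance match.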