STILUS Core
Automatic processing of texts in different languages
STILUS Core is a complete software library of tools for linguistic processing in Spanish, English, French and Italian: filtering, segmentation and morphosyntactic tagging of texts, superficial syntactic analysis, morphological disambiguation, summary extraction, etc.
Lexical base
Daedalus maintains a high-quality, high-coverage lexical base for Spanish, which is the main STILUS dictionary. Its format is designed to make it easy for a team of linguists to add information. "Object" dictionaries are generated, compiled and optimized from it with specific tools, so that any application can access and query them.
Thanks to the fine-grained morphological characterization of every dictionary entry, the tools built on these resources do not "overrecognize"; that is, they do not accept incorrect combinations of roots and morphemes as correct.
Apart from individual words, this lexical base incorporates more than 27,000 multiword terms that behave as a single unit from the syntactic point of view. For example: "a costa de" ("at the expense of"), "Juan Carlos I", "con respecto a" ("as for"), etc.
In total, the lexical base contains more than 130,000 different Spanish lemmas. Their nominal or verbal inflections, together with the possible derivations with nominal suffixes (e.g.: "pequeñ+ito") or verbal enclitic pronouns (e.g.: "comprándo+se+lo"), yield more than 6 million Spanish words. On top of these, processing affixation with nominal (e.g.: "súper+pequeño") or verbal (e.g.: "sobre+actuar") prefixes makes it possible to recognize more than 15 million words.
Although the breadth of its vocabulary ensures very large coverage, the lexical base may lack some groups of words that are very useful or even indispensable for certain users:
- Technical terms or jargon of a certain professional group (economics, law, medicine, etc.)
- Vocabulary specific to one or several areas of the Spanish-speaking community (e.g.: Spanish from Chile, Spanish from Argentina, etc.)
For this reason, the general lexical base can be extended with thematic dictionaries according to the application or the client. STILUS currently has dictionaries of economics, astronomy, music, bullfighting and law, as well as dictionaries of words common in the Spanish of different areas of the Spanish-speaking community.
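As a rough illustration of how thematic dictionaries could be layered over a general one, here is a minimal sketch. The dictionary contents, the `build_lexicon` helper and the use of `ChainMap` are assumptions made for this example; they do not describe the actual STILUS resources or their format.

```python
from collections import ChainMap

# Hypothetical toy dictionaries (word -> list of categories); invented entries.
general = {"casa": ["noun", "verb"], "banco": ["noun"]}
bullfighting = {"capote": ["noun"]}   # stands in for a thematic dictionary
chile = {"polola": ["noun"]}          # stands in for regional vocabulary


def build_lexicon(general_base, *extensions):
    """Return a lookup view that consults the extensions before the general base."""
    # Later extensions take priority; the general base is the final fallback.
    return ChainMap(*reversed(extensions), general_base)


lexicon = build_lexicon(general, bullfighting, chile)
print("capote" in lexicon, "polola" in lexicon, "capote" in general)
```

With this layout the general base is never modified: each application or client simply chooses which extensions to stack on top of it.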
Specific versions of this lexical base also exist for the other languages of STILUS Core: English, French and Italian.
Filtering of documents
Daedalus integrates document filtering technology (owned by third parties) that allows the recognition of many different electronic formats (including MS Office, HTML, PDF, text, XML, RTF, etc.).
Filtering makes it possible to automatically extract properties stored in the format of the document or to understand its structure. For example, it makes it easier to extract the title, summary, authors, etc. of a document when this information has been encoded in it.
Segmentation of the text
Segmentation is the process of identifying the units that can be analyzed linguistically. This task is not limited to "word" units: it can also include syllabification as well as the recognition of orthotypographic sentences, morphemes and complex lexical units.
The units to be analyzed are identified through regular expressions of varying complexity. These expressions also recognize other textual units that are not strictly words and that are opaque to the type of linguistic processing being applied to the text. Dates, enumerations and email addresses are some of the units that can be identified in this way.
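The regular-expression approach can be sketched as follows. The patterns, the unit names (`EMAIL`, `DATE`, `ENUM`, `WORD`, `PUNCT`) and the `segment` function are illustrative assumptions, far simpler than what a production segmenter would need.

```python
import re

# Illustrative patterns only; they are tried in order, most specific first.
TOKEN_PATTERNS = [
    ("EMAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("DATE", r"\d{1,2}/\d{1,2}/\d{2,4}"),
    ("ENUM", r"\d+[.)](?=\s)"),   # enumeration markers such as "1)" or "2."
    ("WORD", r"\w+"),
    ("PUNCT", r"[^\w\s]"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)


def segment(text):
    """Return (unit_type, text) pairs for every unit found in the text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


for unit in segment("1) Escribe a ana@example.com antes del 12/05/2009."):
    print(unit)
```

Because the alternatives are ordered, the email address and the date are each kept as one opaque unit instead of being split into words and punctuation.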
Other complex units that span more than one textual element, such as phrases, abbreviations or some proper nouns, are recognized because they are included in the resources of the lexical base.
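One common way to recognize such lexicon-listed units is a greedy longest-match lookup over the token sequence. The sketch below assumes that strategy; the `MULTIWORDS` set and the `group_multiwords` function are hypothetical, and the source does not state which matching strategy STILUS actually uses.

```python
# Hypothetical multiword entries, stored as tuples of lower-cased tokens.
MULTIWORDS = {
    ("a", "costa", "de"),
    ("con", "respecto", "a"),
    ("juan", "carlos", "i"),
}
MAX_LEN = max(len(entry) for entry in MULTIWORDS)


def group_multiwords(tokens):
    """Greedily merge the longest run of tokens that is listed in MULTIWORDS."""
    result, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for size in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + size])
            if candidate in MULTIWORDS:
                result.append(" ".join(tokens[i:i + size]))
                i += size
                break
        else:
            # No multiword unit starts here: keep the single token.
            result.append(tokens[i])
            i += 1
    return result


print(group_multiwords("Vive a costa de Juan Carlos I".split()))
```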
Morphosyntactic analysis
As its name suggests, a morphosyntactic (morphological + syntactic) analyzer aims to produce the complete morphosyntactic analysis of a word or a group of words. The analyses returned by the Daedalus analyzers consist of:
- the possible morphosyntactic categories of the word or group of words, coded through a tag with morphological (function of the word) or syntactical (function in the sentence) features of the STILUS tagging
- the lemma/s corresponding to each category of the analysis
- the semantic information of the word according to this analysis (given by the STILUS Sem module of semantics)
- the so-called "canonical form" of the entry, that is, its capitalisation (capital letters/lower case letters) according to the lexical base, independent of the concrete form in which it appears in the text.
A word generally has more than one analysis, due to the intrinsic ambiguity of natural language. For Spanish, the average is 1.9 analyses per word, according to the STILUS tagging.
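To make the shape of such a result concrete, here is a minimal sketch of one way to represent an analysis and to measure ambiguity over a word list. The `Analysis` class, its field names and the tag strings are assumptions for illustration; they are not the STILUS tagset or its API, and the two sample entries are a toy sample.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Analysis:
    tag: str                         # hypothetical tag string, not a STILUS tag
    lemma: str
    canonical_form: str              # capitalisation according to the lexical base
    semantics: Optional[str] = None  # would come from a semantic module


def average_ambiguity(analyses_by_word):
    """Average number of analyses per word in the given mapping."""
    total = sum(len(analyses) for analyses in analyses_by_word.values())
    return total / len(analyses_by_word)


# Toy sample: "casa" with three analyses, "roja" with one.
ANALYSES = {
    "casa": [
        Analysis("NOUN-fem-sg", "casa", "casa"),
        Analysis("VERB-ind-pres-3sg", "casar", "casa"),
        Analysis("VERB-imp-sg", "casar", "casa"),
    ],
    "roja": [Analysis("ADJ-fem-sg", "rojo", "roja")],
}

print(average_ambiguity(ANALYSES))
```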
Daedalus has morphological analyzers for different languages, specifically Spanish, Catalan, Basque and Galician, as well as English, French and Italian.
The analyzer is based on a specific model of morphological processing, built on the model for representing linguistic information defined in the ARIES system. In this model, words are composed of one or more forming elements: a word can consist of one forming element ("farol"), two ("niñ-o") or more ("niñ-it-o"). How a word is decomposed depends on how it is coded in the resources. We call the first (or only) element the root and the second one (if there is one) the ending.
Each forming element carries morphological information of features (or simply information of features), which is used to generate the morphological analysis of the complete word (there can be several). This information generally consists of morphosyntactic features such as gender, number, person, verbal tense, verbal mood, type of pronoun, lemma of the word, etc.
In addition, each forming element carries morphological information of concatenation (or simply information of concatenation), which indicates the elements it can be concatenated with, since not every element can be linked with every other. For example, the forming element "o" with information of features "masculine gender, singular number" is different from the forming element "o" with information of features "1st person singular, present indicative", and that is why they must have different information of concatenation: the first one concatenates with nominal roots and the second with verbal roots of the first conjugation.
If a forming element makes up a complete word by itself, without needing to be concatenated with another, it has no information of concatenation. In this case, the word directly takes the information of features of its only element.
If a word is composed of several forming elements, each one contributes its information of features, and the information of the complete word is obtained by suitably combining them, but only if the concatenation is compatible.
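The two kinds of information can be sketched as a small data structure. Everything named here (`FormingElement`, `kind`, `combines_with`, the class labels) is a hypothetical encoding chosen for the example; the actual representation in the ARIES-based model is not described in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormingElement:
    form: str
    features: tuple                         # information of features: (name, value) pairs
    kind: str                               # hypothetical class label of this element
    combines_with: frozenset = frozenset()  # information of concatenation


def compatible(root, ending):
    """Both elements must accept each other's class."""
    return ending.kind in root.combines_with and root.kind in ending.combines_with


def combine(root, ending):
    """Merge the features of both elements, or return None if they cannot concatenate."""
    if not compatible(root, ending):
        return None
    merged = dict(root.features)
    merged.update(ending.features)
    return merged


NIN_ROOT = FormingElement(
    "niñ", (("lemma", "niño"), ("category", "noun")),
    "nominal_root", frozenset({"nominal_ending"}),
)
O_NOMINAL = FormingElement(
    "o", (("gender", "masculine"), ("number", "singular")),
    "nominal_ending", frozenset({"nominal_root"}),
)
O_VERBAL = FormingElement(
    "o", (("person", "1"), ("number", "singular"),
          ("tense", "present"), ("mood", "indicative")),
    "verbal_ending_1", frozenset({"verbal_root_1"}),
)

print(combine(NIN_ROOT, O_NOMINAL))  # features of "niñ" + nominal "o"
print(combine(NIN_ROOT, O_VERBAL))   # incompatible concatenation
```

The two "o" elements share a surface form but differ in both kinds of information, which is what prevents the nominal root from combining with the verbal ending.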
The morphological analyzer needs a list of all the forming elements with their information of concatenation and their information of features. The process of morphological analysis consists of:
First, checking whether the word is itself a complete forming element with no information of concatenation. If so, one or several analyses are generated from the information of features of that element.
Then, scanning the word (from beginning to end or the other way round), dividing it into two pieces and verifying, first, whether both are valid forming elements (that is, whether they are in the resources) and, second, if they are, whether their information of concatenation is compatible. If it is, the information of features of both is combined and as many morphological analyses as necessary are generated.
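The two steps above can be sketched as follows for words of at most two forming elements. The `ELEMENTS` table, its class labels and the `analyze` function are assumptions invented for this example, with a deliberately tiny toy lexicon.

```python
# Hypothetical resources: form -> list of (kind, accepted kinds, features).
# An empty "accepted kinds" set means "no information of concatenation".
ELEMENTS = {
    "farol": [("word", frozenset(),
               {"lemma": "farol", "category": "noun",
                "gender": "masculine", "number": "singular"})],
    "niñ": [("nominal_root", frozenset({"nominal_ending"}),
             {"lemma": "niño", "category": "noun"})],
    "cant": [("verbal_root_1", frozenset({"verbal_ending_1"}),
              {"lemma": "cantar", "category": "verb"})],
    "o": [("nominal_ending", frozenset({"nominal_root"}),
           {"gender": "masculine", "number": "singular"}),
          ("verbal_ending_1", frozenset({"verbal_root_1"}),
           {"person": "1", "number": "singular",
            "tense": "present", "mood": "indicative"})],
}


def analyze(word):
    """Return every analysis licensed by the toy resources."""
    analyses = []
    # Step 1: the word is a complete forming element by itself.
    for _kind, accepts, features in ELEMENTS.get(word, []):
        if not accepts:
            analyses.append(dict(features))
    # Step 2: try every split into root + ending, scanning left to right.
    for cut in range(1, len(word)):
        for r_kind, r_accepts, r_feats in ELEMENTS.get(word[:cut], []):
            for e_kind, e_accepts, e_feats in ELEMENTS.get(word[cut:], []):
                if e_kind in r_accepts and r_kind in e_accepts:
                    analyses.append({**r_feats, **e_feats})
    return analyses


for word in ("farol", "niño", "canto", "niñx"):
    print(word, analyze(word))
```

Words with three or more forming elements, such as "niñ-it-o", would need the split to be applied recursively; that extension is left out of this sketch.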
Morphological disambiguation
Instead of returning all the possible analyses of a word, STILUS Core can apply a disambiguation process that filters out the analyses that are invalid in the context in which the word appears, which generally leaves a single analysis.
For example, "casa" has three analyses: the feminine noun "casa", the verb "casar" in the 3rd person singular of the present indicative and the verb "casar" in the imperative singular. In a linguistic context such as "la casa roja", the verb analyses make no sense, so the remaining analysis, as a noun, is the only valid one.
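The "la casa roja" example can be reproduced with a single context rule. The rule itself, the coarse tag names and the `disambiguate` function are assumptions for illustration, not actual STILUS rules; for simplicity the toy input also treats "la" as an unambiguous determiner.

```python
def disambiguate(tagged):
    """Apply one illustrative rule: right after a determiner, discard verb readings."""
    result = []
    previous_tags = None
    for word, tags in tagged:
        kept = tags
        if previous_tags == ["DET"]:
            filtered = [tag for tag in tags if not tag.startswith("VERB")]
            if filtered:  # never discard every analysis of a word
                kept = filtered
        result.append((word, kept))
        previous_tags = kept
    return result


sentence = [
    ("la", ["DET"]),
    ("casa", ["NOUN", "VERB-ind-pres-3sg", "VERB-imp-sg"]),
    ("roja", ["ADJ"]),
]
print(disambiguate(sentence))
```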
The STILUS Core disambiguator is currently rule-based, and the possibility of adding statistical techniques to increase its precision is being considered.
Superficial syntactic analysis
Apart from the morphosyntactic analysis, STILUS Core can perform a superficial syntactic analysis of the text. It aims to detect groups of words that share the same function in the sentence. In this way, nominal, verbal, prepositional or adverbial syntagms can be detected, as well as their possible function in each sentence.
The sentence can then be analyzed semantically in terms of groups that are more abstract than the individual words.
For example, in the sentence "el hijo de Juan está mirando las manzanas que me trajiste" ("Juan's son is looking at the apples you brought me"), the STILUS superficial syntactic analysis would group "el hijo de Juan" ("Juan's son") as a nominal syntagm and possible subject, "está mirando" ("is looking at") as a verbal syntagm and "las manzanas que me trajiste" ("the apples you brought me") as a nominal syntagm and possible direct object. In this way, the semantic structure of the sentence would be "hijo" ("son") + "mirar" ("to look at") + "manzana" ("apple"), which better illustrates its meaning.
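A common way to approximate this kind of grouping is to match patterns over the sequence of part-of-speech tags. The sketch below assumes that technique; the one-letter tags, the two patterns, the lemmas and the `chunk` function are invented for this single sentence and say nothing about how STILUS implements it.

```python
import re

# One-letter tags (assumed): D determiner, N noun, P preposition, A auxiliary,
# V verb, R relative pronoun, O object pronoun.
SENTENCE = [
    ("el", "D", "el"), ("hijo", "N", "hijo"), ("de", "P", "de"),
    ("Juan", "N", "Juan"), ("está", "A", "estar"), ("mirando", "V", "mirar"),
    ("las", "D", "el"), ("manzanas", "N", "manzana"), ("que", "R", "que"),
    ("me", "O", "yo"), ("trajiste", "V", "traer"),
]

# NP: optional determiner, noun, optional "de"-style complements, optional
# relative clause. VP: optional auxiliaries followed by a verb.
CHUNK_RE = re.compile(r"(?P<NP>D?N(?:PD?N)*(?:RO?V)?)|(?P<VP>A*V)")


def chunk(tokens):
    """Return (chunk_type, text, head_lemma) triples for (form, tag, lemma) tokens."""
    tag_string = "".join(tag for _, tag, _ in tokens)
    chunks = []
    for m in CHUNK_RE.finditer(tag_string):
        span = tokens[m.start():m.end()]
        if m.lastgroup == "NP":
            head = next(lemma for _, tag, lemma in span if tag == "N")  # first noun
        else:
            head = [lemma for _, tag, lemma in span if tag == "V"][-1]  # last verb
        chunks.append((m.lastgroup, " ".join(form for form, _, _ in span), head))
    return chunks


for item in chunk(SENTENCE):
    print(item)
```

Reading off the head lemma of each group gives the "hijo" + "mirar" + "manzana" structure described above.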
Summary extraction
STILUS Core integrates a module that extracts a summary from an analyzed text. The quality of the automatically generated summaries can be tuned by adjusting the parameters of the system according to the needs of each case.
If you wish to know more about the extraction process, you can find detailed information on the page about information extraction.
Daedalus has advanced resources and products in linguistic technology, information retrieval and information extraction. If you wish to read a general description of linguistic technology and of the specific capabilities of Daedalus in the linguistic processing of texts, download our white paper on Linguistic Technology; it also mentions products and applications that incorporate linguistic technology in other fields.
