K-Site Fuzzy
K-Site Fuzzy is a complete fuzzy search system for databases, developed by Daedalus. Unlike STILUS Fuzzy, which offers word-by-word alternatives for the user's query, K-Site Fuzzy suggests complete entries stored in the database.
K-Site Fuzzy can run as a standalone program, queried as a service through a socket on the machine, or as a software library that can be easily integrated into any application to complement its functionality.
Characteristics
Internally, K-Site Fuzzy relies on very efficient data storage structures, because the response time of the system is directly proportional to the time needed to access the index information. The indexes are stored in a data structure called a "trie" (from the word "reTRIEval", pronounced like "tree"): essentially a tree for storing strings, in which each letter occupies a node.
In a trie, lookups are very fast because their cost depends on the length of the string being searched, not on the number of entries stored. Prefix searches are also especially efficient, because words that share initial characters share the same path of nodes.
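As an illustration of the structure described above, here is a minimal trie sketch. It is not the K-Site Fuzzy implementation: the class and method names (`Trie`, `insert`, `contains`, `with_prefix`) are chosen for this example only.

```python
class TrieNode:
    """One node per character; children are keyed by the next character."""

    def __init__(self):
        self.children = {}
        self.is_entry = False  # True when the path from the root spells a stored string


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_entry = True

    def _walk(self, text):
        """Follow text from the root; return the last node, or None if the path breaks."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        # Cost depends on len(word), not on how many strings are stored.
        node = self._walk(word)
        return node is not None and node.is_entry

    def with_prefix(self, prefix):
        """Return every stored string that starts with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        found = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_entry:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)


trie = Trie()
for word in ["presidente", "presidenta", "presa", "altavoz"]:
    trie.insert(word)

print(trie.contains("presa"))      # True
print(trie.contains("pres"))       # False: a shared prefix, not a stored entry
print(trie.with_prefix("presid"))  # ['presidenta', 'presidente']
```

The three words beginning with "pres" share the same first four nodes, which is why prefix queries only need to walk the prefix once before collecting the subtree.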
The system uses the STILUS Fuzzy functions, which implement fuzzy search of individual terms (words). This class takes as input a string with the term to search for and returns a list of suggestions ("similar" terms), each with a penalty value (the distance of the suggestion from the original term). The penalty ranges from 0.0 (the original term is correct) up to a maximum that depends on the configured maximum number of edit operations (each operation adds a penalty of 10.0).
STILUS Fuzzy handles the following linguistic phenomena:
- Character omission errors (presidente -> *pesidente)
- Character transposition errors (construcción -> *cosntrucción)
- Character substitution errors (música -> *múzica)
- Character addition errors (altavoz -> *altiavoz)
- Word concatenation errors (San Sebastián -> *Sansebastián)
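The four character-level errors above correspond to the operations counted by a Damerau-Levenshtein style edit distance. The following sketch is an assumption about how such a penalty could be computed, not the actual STILUS Fuzzy algorithm. It uses the optimal string alignment variant, scans a small hypothetical vocabulary instead of a trie, and does not cover word concatenation errors. Only the 10.0 penalty per operation comes from the text.

```python
PENALTY_PER_OPERATION = 10.0  # stated in the text above


def edit_operations(a, b):
    """Optimal string alignment distance between a and b.

    Insertions, deletions, substitutions and transpositions of two
    adjacent characters each count as one operation.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete a character from a
                d[i][j - 1] + 1,         # insert a character into a
                d[i - 1][j - 1] + cost,  # substitute (or keep) a character
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap adjacent characters
    return d[-1][-1]


def suggest(term, vocabulary, max_operations=2):
    """Return (suggestion, penalty) pairs, lowest penalty first."""
    results = []
    for word in vocabulary:
        ops = edit_operations(term, word)
        if ops <= max_operations:
            results.append((word, ops * PENALTY_PER_OPERATION))
    return sorted(results, key=lambda pair: (pair[1], pair[0]))


# Hypothetical vocabulary built from the examples in the list above.
vocabulary = ["presidente", "construcción", "música", "altavoz"]
for typed in ["pesidente", "cosntrucción", "múzica", "altiavoz", "altavoz"]:
    print(typed, suggest(typed, vocabulary))
```

Each misspelled example is one operation away from its correct form, so it receives a penalty of 10.0, while a correctly spelled term receives 0.0.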
K-Site Fuzzy implements fuzzy search of complete entries. It takes as input a string with the query and returns a list of the database entries that are most similar to the original query.
The linguistic phenomena handled at this level are:
- Word segmentation errors (Fuenlabrada -> *Fuen Labrada)
- Removal of stop words
- Substitution of number aliases (siglo 21 -> siglo veintiuno)
- Substitution of abbreviations (Hermanos -> Hnos)
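A rough sketch of how a query could be analysed for these phenomena follows. The stop-word list, the alias table and the function names are all hypothetical, since the real resources and interfaces are not given here.

```python
# Hypothetical tables: the real stop-word and alias lists are not given in the text.
STOP_WORDS = {"de", "la", "el", "y"}
ALIASES = {
    "21": "veintiuno",
    "hnos": "hermanos",
    "fco.": "francisco",
}
# A term may have an alias or be one, so the table is also read in reverse.
REVERSE_ALIASES = {full: short for short, full in ALIASES.items()}


def analyse_query(query):
    """Split a query into lowercase terms and annotate each one."""
    analysed = []
    for term in query.lower().split():
        alternatives = []
        if term in ALIASES:
            alternatives.append(ALIASES[term])
        if term in REVERSE_ALIASES:
            alternatives.append(REVERSE_ALIASES[term])
        analysed.append({
            "term": term,
            "is_stop_word": term in STOP_WORDS,
            "alias_alternatives": alternatives,
        })
    return analysed


def concatenations(terms, start, max_terms=2):
    """The term at position start, alone and joined with the terms that follow it.

    Joining adjacent terms lets a wrongly split word ("fuen labrada")
    be looked up as a single word ("fuenlabrada").
    """
    return [
        "".join(terms[start:start + n])
        for n in range(1, max_terms + 1)
        if start + n <= len(terms)
    ]


for item in analyse_query("Siglo 21"):
    print(item)
print(concatenations(["fuen", "labrada"], 0))  # ['fuen', 'fuenlabrada']
```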
The search process is as follows:
- Segmentation of the query into terms
- Retrieval of suggestions for each term (through STILUS Fuzzy):
  - For the term alone, or concatenated with the terms that follow it
  - If the term has an alias or is an alias ("Fco." -> "Francisco"), its counterpart is added as a suggestion
- Sorting of the suggestions for each term by penalty (the "linguistic distance" of each suggestion from the original term), from lowest to highest, and selection of the top suggestions
- Search for complete entries that contain all the terms:
  - If the term is not a stop word, the term or one of its alternatives must appear in the entry
  - If it is a stop word, the term or one of its alternatives may or may not appear in the entry
  Therefore, complete entries are selected with an AND over all the terms that are not stop words, treating each term as an OR of itself and all its alternatives.
- Partial search for complete entries: if no complete entry is found and the query has more than one term, entries that do not contain all the terms are searched
- Sorting of the entries by penalty, from lowest to highest, and selection of the first "NumberOfSuggestions" entries (configurable through parameters)
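The selection rule in the steps above (AND across the terms that are not stop words, OR within the alternatives of each term), together with the partial-search fallback, might be sketched as follows. The data layout and function names are assumptions made for this example. The text does not say how many terms a partial match must contain, so this sketch accepts entries with at least one required term.

```python
def entry_matches(entry_words, query_terms):
    """AND over non-stop-word terms; each term is an OR of itself and its alternatives."""
    for term in query_terms:
        if term["is_stop_word"]:
            continue  # optional: may or may not appear in the entry
        candidates = {term["term"], *term["alternatives"]}
        if not candidates & entry_words:
            return False
    return True


def search_entries(entries, query_terms):
    """Full search first; fall back to partial search if nothing matches."""
    indexed = [(entry, set(entry.lower().split())) for entry in entries]
    full = [entry for entry, words in indexed if entry_matches(words, query_terms)]
    if full or len(query_terms) <= 1:
        return full
    # Partial search (assumed rule): at least one required term must appear.
    required = [t for t in query_terms if not t["is_stop_word"]]
    return [
        entry for entry, words in indexed
        if any(entry_matches(words, [t]) for t in required)
    ]


# Hypothetical database entries and an already analysed query.
entries = ["Hermanos García", "Calle de San Sebastián", "Siglo veintiuno editores"]
query = [
    {"term": "hnos", "alternatives": ["hermanos"], "is_stop_word": False},
    {"term": "de", "alternatives": [], "is_stop_word": True},
    {"term": "garcía", "alternatives": ["garcia"], "is_stop_word": False},
]
print(search_entries(entries, query))  # ['Hermanos García']
```

The stop word "de" does not prevent the match even though the entry does not contain it, while "hnos" matches through its alternative "hermanos".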
The penalty ("linguistic distance") of each entry is calculated as the sum, over all the terms, of the lowest penalty of each individual term, plus a weighting by several factors (definable through parameters), such as the penalty for an alias, the frequency of the word and/or entry, or a factor that depends on the length of the entry.
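The exact weighting formula is not given, so the following is only a sketch of the general shape of the calculation. Apart from "NumberOfSuggestions", every parameter name and default value here is hypothetical, and the frequency factor is left out.

```python
from dataclasses import dataclass


@dataclass
class ScoringParams:
    # Only NumberOfSuggestions is named in the text; the other names and
    # default values are hypothetical.
    number_of_suggestions: int = 3
    alias_penalty: float = 5.0
    length_factor: float = 0.5


def entry_penalty(entry, term_matches, params):
    """Penalty of one entry.

    term_matches holds, for each query term, a non-empty list of
    (penalty, via_alias) pairs: one per alternative found in the entry.
    """
    total = 0.0
    for matches in term_matches:
        penalty, via_alias = min(matches)  # lowest penalty for this term
        total += penalty
        if via_alias:
            total += params.alias_penalty
    # Assumed length factor: each entry word beyond the matched terms adds a little.
    extra_words = max(0, len(entry.split()) - len(term_matches))
    return total + params.length_factor * extra_words


def rank(candidates, params):
    """Sort (entry, term_matches) pairs by penalty and keep the best ones."""
    scored = [(entry_penalty(entry, m, params), entry) for entry, m in candidates]
    scored.sort()
    return scored[:params.number_of_suggestions]


params = ScoringParams()
candidates = [
    ("Hermanos García", [[(0.0, True)], [(10.0, False), (20.0, False)]]),
    ("Hermanos García e Hijos", [[(0.0, True)], [(0.0, False)]]),
]
for penalty, entry in rank(candidates, params):
    print(penalty, entry)
# 6.0 Hermanos García e Hijos
# 15.0 Hermanos García
```

With these assumed weights, an exact match in a longer entry still ranks ahead of a shorter entry that needed one edit operation.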
If you would like to know more about this product, don't hesitate to contact us. You can also try the demo available on our demo website, Showroom.
