K-SITE INDEX
K-Site Index is a software component developed by Daedalus that provides advanced information retrieval functions: indexing functions, which build index structures over the documents, and retrieval (search) functions, which look up information through the content of those indexes.
Both sets of functions use the advanced linguistic text-processing technology developed by Daedalus. It processes the words of a document according to their content and linguistic meaning rather than treating them as mere sequences of characters, as most information retrieval products on the market do. As a result, searches are very precise, and users get their answers in fewer queries and less time.
For audio and/or video content, speech recognition technology is used so that the content can then be processed as text. The text produced by automatic speech-to-text conversion contains a large number of errors, which are handled by the Daedalus linguistic technology, making the automatic transcription more useful. In addition, any metadata accompanying the content (for example, in an MPEG-7 video) is also exploited.
This linguistic technology is based on the STILUS technology: powerful dictionary resources, lexical knowledge bases, composition rules and so on, constantly updated and improved by the Daedalus team of linguists.
Characteristics
Indexing comprises filtering, segmentation, morphological content analysis and code extraction. The morphological categories to be included as keywords of a document can be selected (for example, only named entities and verbs), which substantially improves the indexes.
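As a rough illustration of category-based keyword selection, the Python sketch below keeps only the tokens whose morphological category belongs to a chosen set. The tagged tokens, the category labels (NAMED_ENTITY, VERB and so on) and the select_keywords function are hypothetical; the actual K-Site Index analyser and its interface are not described here.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical output of a morphological analyser: (lemma, category) pairs.
TaggedToken = Tuple[str, str]


def select_keywords(tokens: Iterable[TaggedToken], keep: Set[str]) -> List[str]:
    """Return lemmas whose category is in `keep`, without duplicates, in order."""
    seen = set()
    keywords = []
    for lemma, category in tokens:
        if category in keep and lemma not in seen:
            seen.add(lemma)
            keywords.append(lemma)
    return keywords


# Made-up analysis of a short sentence, for illustration only.
tagged = [
    ("Daedalus", "NAMED_ENTITY"),
    ("desarrollar", "VERB"),
    ("el", "DETERMINER"),
    ("componente", "NOUN"),
    ("indexar", "VERB"),
    ("Daedalus", "NAMED_ENTITY"),
]

keywords = select_keywords(tagged, keep={"NAMED_ENTITY", "VERB"})
print(keywords)
```

Restricting the index to a few categories keeps it smaller and leaves out function words that rarely help a search.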
The advanced filtering capabilities included in this version of the product allow documents in the following input formats to be processed:
- All Microsoft Office formats (in all versions):
  - Word
  - Excel
  - Access
  - PowerPoint
- PDF files (not protected)
- PostScript files
- HTML files
- RTF files
- Text files
The retrieval functions included in this version of the product make it possible to find documents using literal terms (of one or several words) or lemmas:
- Searching by individual terms is the simplest mode and finds a word within the indexed documents, whereas phrase search, expressed by enclosing the phrase in quotation marks ("phrase to be searched"), detects the presence of a phrase or sequence of words.
- Literal searches may include wildcard characters ("*" and "?") and character sets ([a-z]) to truncate a term on the left, the right or both sides.
For example, "prob*" finds documents with words beginning with "prob", and "p[aeiou]pa" matches the words "papa", "pepa", "pipa", "popa" and "pupa".
- Search by lemmas relies on the morphosyntactic analysis system used when indexing documents, which builds an index of the lemmas of the words in each document.
This makes it possible to find words by their base form, treating, for example, "andaré" and "andando" as the same action, "andar", so that searches do not depend on the particular word forms that appear in the text.
A lemma search is expressed using parentheses. For example, "(juez)" returns all documents containing "juez", "jueza", "jueces" or "juezas".
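The wildcard and lemma search modes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny form-to-lemma table, the sample documents and the function names are invented, and wildcard matching is delegated to Python's standard fnmatch module, which happens to support the same "*", "?" and "[...]" notation. It does not reflect how K-Site Index is implemented.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

# Toy form-to-lemma table (assumption); a real system uses full dictionaries.
LEMMAS = {
    "juez": "juez", "jueza": "juez", "jueces": "juez", "juezas": "juez",
    "andaré": "andar", "andando": "andar", "andar": "andar",
}

# Made-up documents, keyed by document id.
DOCS = {
    1: "las juezas hablaron del problema",
    2: "andando con papa",
    3: "los jueces probaron la pipa",
}


def build_indexes(docs):
    """Build literal and lemma inverted indexes: term -> set of doc ids."""
    literal, lemma = defaultdict(set), defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            literal[word].add(doc_id)
            lemma[LEMMAS.get(word, word)].add(doc_id)
    return literal, lemma


def search_literal(index, pattern):
    """Match a literal term that may contain *, ? or [...] wildcards."""
    hits = set()
    for term, doc_ids in index.items():
        if fnmatchcase(term, pattern):
            hits |= doc_ids
    return hits


def search_lemma(index, query):
    """Resolve a query written as (lemma) against the lemma index."""
    return set(index.get(query.strip("()"), set()))


literal_index, lemma_index = build_indexes(DOCS)
print(sorted(search_literal(literal_index, "prob*")))       # [1, 3]
print(sorted(search_literal(literal_index, "p[aeiou]pa")))  # [2, 3]
print(sorted(search_lemma(lemma_index, "(juez)")))          # [1, 3]
```

Keeping a separate lemma index means that a query such as "(juez)" is resolved with a single lookup instead of expanding the query into every inflected form.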
The search modes described above can be combined using a syntax similar to that of the Google search engine, through the following operators:
- The + (AND) operator indicates that the condition must be met (the word must be present in the documents found)
- The – (NOT) operator indicates that the documents found must not contain the term in question; it can only be used together with the AND operator
- If neither of the previous operators is given, the term is optional (OR): documents are returned whether or not they contain it
Queries are expressed as follows:
+compulsory_condition … –excluding_condition … optional_condition …
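A minimal Python sketch of how such a query could be split into its three groups of conditions is shown below. The Query class and the parse_query and matches functions are hypothetical names, and the matching rule (all compulsory terms present, no excluded term present, and, when there are no compulsory terms, at least one optional term present) is an assumption about the semantics, not a description of the product's behaviour.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Query:
    required: List[str] = field(default_factory=list)
    excluded: List[str] = field(default_factory=list)
    optional: List[str] = field(default_factory=list)


def parse_query(text: str) -> Query:
    """Split a query into +required, -excluded and optional terms."""
    query = Query()
    for token in text.split():
        if token.startswith("+") and len(token) > 1:
            query.required.append(token[1:])
        elif token[0] in "-\u2013" and len(token) > 1:  # hyphen or en dash
            query.excluded.append(token[1:])
        else:
            query.optional.append(token)
    # NOT may only appear together with at least one AND term.
    if query.excluded and not query.required:
        raise ValueError("an excluded term needs at least one +term")
    return query


def matches(query: Query, words: Set[str]) -> bool:
    """Assumed semantics: all required, no excluded, else any optional."""
    if any(term in words for term in query.excluded):
        return False
    if query.required:
        return all(term in words for term in query.required)
    return any(term in words for term in query.optional)


q = parse_query("+juez \u2013tribunal sentencia")
print(q)
print(matches(q, {"juez", "sentencia"}))  # True
print(matches(q, {"juez", "tribunal"}))   # False
```

In this sketch optional terms do not affect whether a document matches once a compulsory term is present; they would typically only influence its ranking.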
Search results are sorted by relevance, according to an automatically calculated factor that indicates the importance of a document with respect to the query. Relevance is shown as a number: a percentage relative to the most relevant document, which scores 100%.
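The percentage scale can be illustrated with a short Python sketch. The raw scores and the to_percentages function are made up for the example; how K-Site Index actually computes the underlying relevance factor is not specified here.

```python
from typing import Dict


def to_percentages(scores: Dict[str, float]) -> Dict[str, int]:
    """Scale raw scores so that the best document gets 100."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {doc: 0 for doc in scores}
    return {doc: round(100 * score / top) for doc, score in scores.items()}


# Made-up raw relevance factors for three documents.
raw = {"doc_a": 4.0, "doc_b": 3.0, "doc_c": 1.0}
ranked = sorted(to_percentages(raw).items(), key=lambda item: item[1], reverse=True)
print(ranked)  # [('doc_a', 100), ('doc_b', 75), ('doc_c', 25)]
```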
Technically, K-Site Index consists of a set of software components developed in C/C++ that run in Unix/Linux as well as Microsoft Windows environments. These components are accompanied by the set of linguistic resources needed to provide the functionality described.
As previously mentioned, the component can be accessed and used in several ways, from an indexing server that answers indexing and search requests to a programming interface with low-level access to the implemented services. On Windows platforms, the product can be delivered in any of the following forms: ActiveX or COM object, DLL or static library, .NET library or any other possibility.
For data storage, K-Site Index uses the MySQL database, which is available for Unix/Linux as well as Windows.
All the technology involved in the product is owned by Daedalus, which greatly increases the possibilities for adaptation and integration according to end users' specific needs.
If you would like to know more about this product, do not hesitate to contact us. On our demo website, the video search application, DALI (Digital Audio Library Indexing), uses this library to provide the required search functionality.
