Automatic Named Entity Recognition
Information Extraction is the field of Natural Language Processing that aims to automatically extract structured knowledge, usually dependent on context, from unstructured natural-language text, in order to improve its exploitation and reuse. Named Entity Recognition (NER), also known as entity identification or entity extraction, is usually the first step of the extraction process. As its name suggests, it consists of detecting the elements of a text that refer to named entities and classifying them into predefined categories, such as names of people, organizations and places, numerical or time expressions, etc. This activity is also called semantic tagging.
Detection is difficult because the same entity can appear in different forms: for example, "Antonio Banderas" => "Banderas", "A. Banderas", "José Antonio Domínguez Banderas", etc.; "Banco Santander Central Hispano" => "Banco Santander", "Santander", "BSCH", etc.
Moreover, once entities are detected, classifying them raises a problem of ambiguity, either between different categories or within the same category: for example, "Sevilla" can be the city, the football team, etc.
The most widely adopted approach is knowledge-based, that is, it uses dictionaries and rules, usually developed manually, to carry out detection and classification. Basically, the rules apply regular patterns to the dictionary entries in order to generate the different variants in which an entity can appear, for example:
- (F)irst name (L)ast name => First name / Last name / F. Last name / First name L. / F. L.
Fernando Alonso => Fernando / Alonso / F. Alonso / Fernando A. / F. A.
- (A)aaa (of|the)* (B)bbb (of|the)* (C)ccc (of|the)* (D)ddd => ABCD
Organisation of the Petroleum Exporting Countries => OPEC
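The two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual rule engine: the function names person_variants and acronym are hypothetical, the name rule is assumed to receive exactly one first name and one last name, and the connector list is limited to "of" and "the" as in the pattern.

```python
# Connector words skipped when building an acronym
# (assumed list, mirroring the "(of|the)" part of the pattern).
CONNECTORS = {"of", "the"}


def person_variants(full_name: str) -> list[str]:
    """Generate the variants of a 'First Last' name listed in the first rule.

    Assumes exactly two tokens; compound names would need extra rules.
    """
    first, last = full_name.split()
    f_initial = first[0] + "."
    l_initial = last[0] + "."
    return [
        first,
        last,
        f"{f_initial} {last}",
        f"{first} {l_initial}",
        f"{f_initial} {l_initial}",
    ]


def acronym(name: str) -> str:
    """Build an acronym from the initials of the non-connector words."""
    return "".join(
        word[0].upper()
        for word in name.split()
        if word.lower() not in CONNECTORS
    )


print(person_variants("Fernando Alonso"))
print(acronym("Organisation of the Petroleum Exporting Countries"))
```

In a dictionary-based system, each generated variant would typically be stored as an alias pointing back to the canonical dictionary entry, so that any of the forms found in a text resolves to the same entity.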
In addition, our technology supports advanced recognition of unknown entities, that is, expressions not found in the dictionaries that could nevertheless be named entities, which the system proposes as suggestions: for example, "D. Aaaaa Bbbbb de Ccccc" can be the name of a person, "Bank Ddddd" an organization, "walk Eeeee" a place, etc.
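As a rough illustration of how such suggestions could be produced, the sketch below uses regular expressions built around trigger words ("D." for a person, "Bank" for an organization). The pattern list, the labels and the function name suggest_entities are assumptions made for this example and do not describe the actual system.

```python
import re

# A capitalised word, including common accented Latin letters.
CAP = r"[A-ZÁÉÍÓÚÑ][a-záéíóúñ]+"

# Hypothetical trigger patterns; a real system would hold many more.
PATTERNS = [
    # "D." followed by capitalised words, optionally joined by "de".
    ("PERSON", re.compile(rf"\bD\. {CAP}(?: (?:de )?{CAP})*")),
    # "Bank" followed by one capitalised word.
    ("ORGANIZATION", re.compile(rf"\bBank {CAP}")),
]


def suggest_entities(text: str) -> list[tuple[str, str]]:
    """Return (matched text, suggested category) pairs for unknown entities."""
    suggestions = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            suggestions.append((match.group(0), label))
    return suggestions


print(suggest_entities("Yesterday D. Pedro Ruiz de Alba visited Bank Norte."))
```

The matches are only candidates: because the expressions are not in any dictionary, they would normally be reviewed before being accepted as entities.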
The main disadvantage of this approach is the high cost of developing and maintaining the required resources, which are also highly dependent on the domain and the language. That is why other approaches are based on machine learning: they use collections of manually tagged texts as training data to generate these resources automatically and to build detection and classification models.
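To give an idea of what manually tagged training data can look like, the sketch below converts annotated entity spans into one label per token. It assumes the widely used BIO scheme (B = beginning of an entity, I = inside, O = outside), which is a common convention and not necessarily the format used here; the function name to_bio and the example sentence are hypothetical.

```python
def to_bio(tokens, spans):
    """Turn annotated spans into per-token BIO labels.

    spans: list of (start_token, end_token_exclusive, category) tuples,
    assumed not to overlap.
    """
    labels = ["O"] * len(tokens)
    for start, end, category in spans:
        labels[start] = f"B-{category}"
        for i in range(start + 1, end):
            labels[i] = f"I-{category}"
    return list(zip(tokens, labels))


tokens = "Fernando Alonso lives in Sevilla".split()
spans = [(0, 2, "PERSON"), (4, 5, "LOCATION")]

for token, label in to_bio(tokens, spans):
    print(token, label)
```

A learning algorithm would then be trained on many such token and label pairs to predict the labels of unseen text.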
On our demo website, Showroom, you can try a Named Entity Recognition process.