New automatic phonetic and phonological transcriber for Spanish developed by Daedalus

The phonetic and phonological transcriber for Spanish developed by Daedalus is now available in our showroom page.

Tools of this type currently have several applications:

  • First of all, they constitute the first module of all voice synthesis systems. Indeed, current TTS (Text to Speech) systems necessarily include a first component that converts the text into its phonetic representation. Through other components, the phonetic transcriptions are replaced by acoustic material consisting of the physical realization of the sounds. Speech recognition systems work in a similar way, but in the opposite direction.
  • Transcribers are also an essential tool in philology. First, for training purposes, i.e. for philology students who must learn to transcribe texts correctly. Second, as helpful tools in the professional field: philologists do have to transcribe “by ear” to account for the phenomena not covered by the standard description of the phonological and phonetic levels of the language, but these tools can also serve as the starting point for the ‘standard’ initial transcription, which is subsequently revised and corrected. In this sense, transcribers are especially useful in the context of Dialectology and its related branches, as well as in Historical Phonology. Current transcribers tend to adjust to a standard pronunciation, as stated above, and they are conceived only from a synchronic point of view; however, they are often configurable through different variables that allow different levels of transcription. In this regard, our transcriber enables a certain (admittedly small) synchronic parameterization, whether phonetic, phonological or etymological.
  • Transcribers are also particularly useful for learning foreign languages, because they provide the notation (with a higher or lower level of detail) of the actual pronunciation of texts. In fact, dictionaries aimed at foreign learners frequently include a phonetic transcription (usually encoded in a very simplified IPA) of each word.

Our phonetic and phonological transcriber offers the phonetic transcription of text in Spanish in various well-known phonetic alphabets:

IPA: International Phonetic Alphabet.

RFE: Alphabet of the Revista de Filología Española, the phonetic alphabet used in Spain in the field of philology until recently. Currently it competes in this area with the IPA.

Computing-Oriented Phonetic Alphabets: Phonetic alphabets based on ASCII characters, allowing the further processing of text using computer programs.

  • DEFSFE: Alphabet proposed by Antonio Ríos, from the Universidad Autónoma de Barcelona (see alphabet).
  • SAMPA (Speech Assessment Methods Phonetic Alphabet): Phonetic alphabet based on the IPA. We also take into account the SAMPROSA (SAM Prosodic Alphabet) specifications.
  • SALA (SpeechDat across Latin America): SAMPA adaptation for the transcription of Latin American Spanish.
  • WORLDBET: Alphabet based on IPA with additional symbols.
  • VIA: Alphabet of ViaVoice, commercial application for voice recognition.
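
As an illustration of how a computing-oriented alphabet relates to the IPA, the sketch below converts an IPA transcription of Spanish into SAMPA with a longest-match symbol table. The table is a partial, hand-picked illustration covering only a few distinctively Spanish phonemes; it is our assumption for the example, not the transcriber's actual conversion table:

```python
# Sketch: convert an IPA transcription of Spanish into SAMPA.
# Partial symbol table for illustration only, not an official table.
SPANISH_IPA_TO_SAMPA = {
    "tʃ": "tS",  # as in "chico"
    "ɲ": "J",    # as in "año"
    "ʎ": "L",    # as in "calle" (non-yeísta pronunciation)
    "θ": "T",    # as in "cinco" (Peninsular Spanish)
    "ʝ": "jj",   # as in "mayo"
    "r": "rr",   # trill, as in "perro"
    "ɾ": "r",    # tap, as in "pero"
}

def ipa_to_sampa(ipa: str) -> str:
    """Longest-match, left-to-right replacement of IPA symbols by SAMPA ones.
    Symbols not in the table (plain vowels, p/t/k...) pass through unchanged."""
    out, i = "", 0
    while i < len(ipa):
        for width in (2, 1):  # try two-character symbols (e.g. "tʃ") first
            chunk = ipa[i:i + width]
            if chunk in SPANISH_IPA_TO_SAMPA:
                out += SPANISH_IPA_TO_SAMPA[chunk]
                i += width
                break
        else:
            out += ipa[i]
            i += 1
    return out
```

For instance, `ipa_to_sampa("kaʎe")` yields the ASCII string `"kaLe"`, which downstream computer programs can process without Unicode phonetic symbols.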


Our tool also offers a phonological transcription of text, which is of little use for automated TTS systems such as the ones mentioned above, but certainly useful in the field of philology, especially in the context of the classical philological currents known as functionalism and structuralism, although on a purely didactic level.

In terms of logical architecture, the transcriber consists of the following modules:

  • Preprocessing module: embeds different sub-modules that preprocess the input text eliminating unnecessary characters, identifying breaks, expanding numbers, etc., thus generating a standard text.
  • Syllabification module: applies an algorithm that breaks down the terms of the text into syllables.
  • Accent module: places the phonetic accents on the terms of the text.
  • Phonetic transcriber module: performs the actual phonetic transcription by applying a set of rules that identify the correct allophone depending on the phonetic and pragmatic context (the latter selected through the options of the interface), as well as on certain phonetic neutralizations.
  • Phonological transcriber module: performs the phonological transcription.
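
As an illustration of the kind of algorithm the syllabification module applies, here is a deliberately simplified sketch based on onset maximization. It ignores several diphthong and spelling subtleties (accented weak vowels, 'gü', three-consonant onsets such as in loanwords), and all names are ours, not the product's:

```python
import re

# Valid two-consonant onsets in Spanish (obstruent + liquid).
ONSETS = {"pr", "br", "tr", "dr", "cr", "gr", "fr",
          "pl", "bl", "cl", "gl", "fl"}
VOWELS = set("aeiouáéíóúü")
WEAK = {"i", "u", "ü"}  # weak vowels form diphthongs with an adjacent vowel

def syllabify(word: str) -> list[str]:
    """Split a Spanish word into syllables (simplified sketch)."""
    # Tokenize, keeping the digraphs ch/ll/rr/qu as single consonant units.
    units = re.findall(r"ch|ll|rr|qu|.", word.lower())
    vowel_pos = [i for i, u in enumerate(units) if u in VOWELS]
    syllables, start = [], 0
    for a, b in zip(vowel_pos, vowel_pos[1:]):
        cluster = units[a + 1:b]  # consonants between two vowel nuclei
        if not cluster:
            if units[a] in WEAK or units[b] in WEAK:
                continue  # diphthong: both vowels stay in one syllable
            cut = a + 1   # hiatus of two strong vowels: split between them
        elif len(cluster) == 1:
            cut = a + 1   # a single consonant starts the next syllable
        elif "".join(cluster[-2:]) in ONSETS:
            cut = b - 2   # a valid onset cluster starts the next syllable
        else:
            cut = b - 1   # otherwise split the cluster: coda + onset
        syllables.append("".join(units[start:cut]))
        start = cut
    syllables.append("".join(units[start:]))
    return syllables
```

For example, `syllabify("hablar")` keeps the valid onset "bl" together and returns `["ha", "blar"]`.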

Our phonetic transcriber also offers different transcription options:

  • Preprocessing: whether or not to expand abbreviations, symbols, numbers and Roman numerals.
  • Syllabification: whether or not to syllabify terms, phonetically or phonologically.
  • Transcription:
    • Vocalism: whether to mark the nasal allophones of vowels.
    • Consonants: whether to transcribe with or without the linguistic phenomenon known as yeísmo; whether to apply the phonetic neutralization of /B/, /D/ and /G/.

Other options: mark synalephas, place accents on vowels (for the RFE, basically).
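
The yeísmo option can be pictured as a switch on one of the transcription rules. The following sketch applies a handful of illustrative grapheme-to-phoneme rules; it is a tiny, hypothetical subset written for this post, not the product's rule set, and it ignores stress, allophonic detail and the tap/trill distinction:

```python
import re

def to_phonemes(word: str, yeismo: bool = True) -> str:
    """Toy grapheme-to-phoneme conversion for Spanish with a yeísmo switch.
    With yeísmo, the digraph "ll" merges with /ʝ/; without it, it stays /ʎ/."""
    rules = [
        (r"ch", "tʃ"),
        (r"ll", "ʝ" if yeismo else "ʎ"),
        (r"ñ", "ɲ"),
        (r"qu(?=[ei])", "k"),   # "queso" -> /keso/
        (r"c(?=[ei])", "θ"),    # Peninsular distinción: "cena" -> /θena/
        (r"c", "k"),
        (r"z", "θ"),
        (r"j", "x"),
        (r"g(?=[ei])", "x"),
        (r"v", "b"),            # /b/ and /v/ are merged in Spanish
        (r"h", ""),             # "h" is silent
    ]
    out = word.lower()
    for pattern, replacement in rules:
        out = re.sub(pattern, replacement, out)
    return out
```

So `to_phonemes("calle")` gives "kaʝe", while `to_phonemes("calle", yeismo=False)` preserves the lateral and gives "kaʎe".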

Schema.org semantic markup: the (very) secret weapon of your online marketing

Now you can benefit from a tool with the potential to increase your CTR by 30% and to improve your organic ranking in search engines… and that your competitors are not using. If you are interested, keep reading.

In a previous post we discussed how every organization with online presence will need to make apparent the meaning of their web content, as search engines are evolving to a more semantic approach. In this new scenario, semantic markup technology can help to make content more relevant to search engines and more attractive to users.

The schema.org markup enables website owners to add to their pages HTML code that allows search engines to identify specific elements of those pages and, in some cases, present them in search results in the form of rich snippets.

The schema.org standard, a result of the collaboration between Google, Bing and Yahoo, provides a set of vocabularies used for the markup of structured data in HTML documents, so that they can be understood by search engines, aggregators and social media. With the support of the industry’s leaders, schema.org represents the semantic markup’s coming of age.

Improve your inbound marketing with semantic tagging

Online marketing is evolving and, along with traditional paid media (advertising) and owned ones (website, blog), the new earned media (organic search, social conversations) are critical. More than an outbound, interruption-based marketing, now it is all about inbound marketing, where the key is to be findable and to appear in those conversations in which users talk about their needs and the products they use.

This requires not only creating and publishing optimized content about those topics, but also promoting it and making it more findable in search engines and shareable through social media. Semantically tagging our content can help us in many ways:

  • The markup allows increasing the relevance of our content for certain search queries. Optimizing content and tagging it explicitly with specific entities makes it easier for all kinds of search engines to identify its meaning, so that it will appear in the results of more queries related to those entities. This does not mean that schema.org automatically provides a better ranking in the results pages; however, some analyses have shown a certain degree of correlation. A study by Searchmetrics found that pages incorporating schema.org rank better by an average of four positions compared to web pages that do not integrate it. Although (according to the spokesmen of the search engines) this is not a causal effect, there appears to be some indirect relationship between semantic markup and a better ranking.
  • Tagging pages with metadata that identify them as information about products, movies, applications, recipes, etc. makes them more likely to appear in the vertical areas of general search engines and in specialized engines.
  • When semantically tagged content is shown in the search results, it appears in the form of rich snippets or similar formats, which include specific data, access to multimedia elements and even the possibility of browsing information and refining the search. This increases the visibility, appeal and “clickability” of that result, which translates into more visits and social sharing opportunities for that content. In some cases an increase in clicks from organic search of up to 30% has been reported. Above all, semantic markup allows optimizing the CTR of a link.
  • In the same vein, when content is shared in social media or picked up by an automated aggregation tool, the links generated algorithmically by these systems are more appealing and informative, and can increase traffic to that content.
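
To make the benefits above concrete, here is a sketch of what such markup can look like: a small Python helper that emits a schema.org `Product` description as an embeddable JSON-LD `<script>` block. The product data is invented for illustration; real pages would use their own values and typically more properties:

```python
import json

def product_jsonld(name: str, price: str, currency: str,
                   rating: float, votes: int) -> str:
    """Build a schema.org Product snippet as an embeddable JSON-LD script tag."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": price,
            "priceCurrency": currency,
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating,
            "ratingCount": votes,
        },
    }
    return ('<script type="application/ld+json">\n'
            + json.dumps(data, indent=2)
            + "\n</script>")

# Hypothetical product page markup.
snippet = product_jsonld("Acme Phone X", "199.00", "EUR", 4.4, 127)
```

The resulting block is pasted into the page's HTML; the price and rating properties are what allows a search engine to render the result as a rich snippet with stars and price.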

However, very few sites are currently using schema.org

Despite its numerous benefits, the adoption of semantic markup is still very low. According to the report by Searchmetrics, only 0.3% of the analyzed domains incorporated schema.org integrations. This is all the more striking when compared with its potential impact: according to the same study, Google enriches search results with information derived from schema.org markup in more than 36% of queries.

Additionally, in only 34.4% of queries did the search engine return results with neither schema.org integrations nor any other structured data involved. It is clear that schema.org is more popular among search engine results than among webmasters.

Schema.org markup tools

This low penetration might be due to the fact that detailed semantic markup is still mainly a manual process. There are several tools that can help with the markup work:

Google Structured Data Markup Helper

The communities of the most popular content management systems have even developed plug-ins for this task. But these tools generate the markup code only once the element to which it refers has been identified (more or less manually).

The following Google tools serve to validate the result of a markup and to get an idea about the volume of structured data that the search engine can see in our pages.

To ease the use of structured data markup and rich snippets, at Daedalus we are developing semantic publishing technologies to automatically tag content incorporating information about all kinds of elements of meaning that appear in it: people, organizations, brands, dates, topics, concepts…

In particular, our product Textalytics - Meaning as a Service includes a specific semantic publishing API featuring schema.org markup, as you can check using this demonstrator.

It is very likely that your competition is not using schema.org. It is time to act: integrate it immediately and make the most of its enormous advantages.

If you need more information, don’t hesitate to contact us.


Semantic markup, rich snippets and schema.org: expose the meaning of your content

As search engines are evolving to a more semantic approach, every organization with online presence will need to make apparent the meaning of their web content. Semantic markup technology can help to make content more relevant to search engines and more attractive to users.

Search engines: from keywords to entities

Search engines are evolving. Providing users with a series of results that contain a certain string of characters (e.g. “barcelona”) is no longer enough. The objective now is to provide them with information related to a certain “thing” or meaning (the city of Barcelona, Barcelona F. C.) or to respond to the user’s intent (organizing a trip to Barcelona). In the future, search engines must be able to offer precise answers to specific questions (e.g. how many inhabitants does Barcelona’s province have?) without the user having to navigate through a results page.

This transition towards “things, not strings” and entities rather than keywords collides with search engines’ difficulty in interpreting the meaning of online content: HTML is a language designed to describe how a web page should be presented, not to express its meaning. Even in pages whose aim is to provide a set of structured data —typically residing in a company’s internal databases— the nature of HTML hides that data from the search, social and aggregation ecosystem.

Marketers and, in general, everyone who wants their online content to be spread and findable need to enrich that content with metadata that specifies to search engines and other applications what it means, not just what it says. In other words, they need technologies that enable them to semantically tag their content.

Semantic markup and rich snippets

The major search engines have been experimenting with semantic tagging and structured data applications for years (in fact, a few days ago marked the fifth anniversary of the presentation of Google’s rich snippets). Essentially, with these technologies the owners of web sites can add to their pages an HTML markup that enables search engines to identify specific elements of those pages and, in some cases, present them in search results.

Over the years, it has been possible to tag HTML content with different syntaxes (microformats, HTML5 microdata, RDFa) and various vocabularies to provide information about products, people, events… This information has usually resulted in rich snippets and similar formats, which are more informative than a simple blue link with a more or less representative text, and more appealing to users.

The advantages for both search engines and content generators are obvious. For search engines, providing their users with rich and more relevant search results is a step forward in their objective of facilitating the access to information.

For content generators it is a chance to appear among the search results, stand out within the results page and get more visits and social shares (we will analyze this in detail in a future post).

However, in order for this technology to gain acceptance, the different agents involved are required to agree upon the vocabulary that is going to be used to identify each type of entity and its properties: a universal language for semantic tagging is essential.

Schema.org: the lingua franca of semantic markup

Schema.org, the result of the collaboration between Google, Bing and Yahoo (later joined by Yandex), provides a set of vocabularies used for the markup of structured data in HTML documents, so that they can be understood by search engines, aggregators and social media. With the support of the industry’s leaders, schema.org can represent the semantic markup’s coming of age.

Schema.org allows identifying people, organizations, places, products, reviews, works (books, movies, recipes…), and its vocabulary is continuously expanding. In addition, it supports different syntaxes: microdata, microformats, RDFa and, recently, JSON-LD. With the creation of a common markup scheme, the major search engines aim at improving the understanding of the information contained in web pages and its representation in the results pages. The schema.org vocabulary makes it possible not only to describe elements, but also to disambiguate them and associate them with their meaning. Schema.org’s sameAs property permits associating a particular instance of a “thing” (person, organization, brand…) that appears on a web page with a reference URI that unambiguously indicates the identity of the element, for example, a page from Wikipedia, Freebase or an official web site.
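
The sameAs mechanism can be sketched in the same spirit: the snippet below (with invented example data) links a person mentioned on a page to a reference URI that identifies that person unambiguously:

```python
import json

def person_jsonld(name: str, same_as: list[str]) -> str:
    """Emit schema.org Person markup whose sameAs property disambiguates
    the person by pointing at reference URIs (e.g. a Wikipedia page)."""
    data = {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "sameAs": same_as,
    }
    return ('<script type="application/ld+json">\n'
            + json.dumps(data, indent=2)
            + "\n</script>")

# Hypothetical example: the same pattern would disambiguate an organization
# or a brand by changing @type and the reference URIs.
markup = person_jsonld(
    "Miguel de Cervantes",
    ["https://en.wikipedia.org/wiki/Miguel_de_Cervantes"],
)
```

With this in place, a consumer of the page no longer has to guess which “thing” the name refers to: the sameAs URIs settle the identity.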

Both search engines and all kinds of social media and aggregators depend more and more on this type of semantic references within the web pages (most notably Google, since the launch of its Knowledge Graph and the Hummingbird update of its algorithm).

Google itself has announced that it will continue supporting other vocabularies and syntaxes for the markup of structured data, but that it will favor the use of schema.org. This strong support by the largest players will prompt more and more content providers to adopt schema.org, which in turn will become the reference vocabulary for the expression of structured data.

The case of the media industry: rNews

Probably one of the sectors in which the need for semantic markup is most urgent is that of the mass media. Online media need to make their content more findable and relatable and to improve the targeting of the contextual ads that constitute their main revenue stream. For that purpose, the IPTC (a consortium of leading agencies, media companies and providers in that sector) has developed rNews.

rNews is a standard that defines the use of semantic markup to annotate HTML documents with news-specific metadata, both structural (title, medium, author, date) and content-related (people, organizations, concepts, locations).

Since schema.org and rNews were born almost at the same time, and in order to avoid the proliferation of standards, schema.org included support for rNews virtually from the beginning for news tagging. Currently, leading media such as the New York Times are tagging all their articles using the rNews vocabulary on schema.org.

The semantic tagging of content offers enormous possibilities in the field of SEO and marketing in general, but it is not exempt from difficulties: how, for example, should the thousands —or millions— of existing pages of a media outlet be tagged?

We will try to cover these issues in upcoming posts.

Meanwhile, if you wish to discover how semantic technologies enable you to produce and publish more valuable content, faster and at lower cost, don’t miss this webinar by Daedalus (in Spanish).


Text Analytics market 2014: Seth Grimes interviews Daedalus’ CEO

Seth Grimes is one of the leading industry analysts covering the text analytics and semantic technology market. During the past month he published a series of interviews with relevant figures in this industry, material to be included in his forthcoming report Text Analytics 2014: User Perspectives on Solutions and Providers, which will be published before summer (for more info, stay tuned to this blog).

Our CEO, José Carlos González, was one of the selected executives. In the interview, Seth and José Carlos discuss recent changes in the industry, customer cases, features requested by the market, etc.

This is the beginning of the interview:

Text Analytics 2014: Q&A with José Carlos González, Daedalus

How has the market for text technologies, and text-analytics-reliant solutions, changed in the past year? Any surprises?

Over the past year, there has been a lot of buzz around text analytics. We have seen a sharp increase in interest around the topic, along with some caution and distrust from people in markets where a combination of sampling and manual processing has been the rule until now.

We have perceived two (not surprising) phenomena:

  • The blossoming of companies addressing specific vertical markets by incorporating basic text processing capabilities. Most of the time, text analytics functionality is achieved through the integration of general-purpose open source tools, simple pattern matching or purely statistical solutions. Such solutions can be built rapidly from large resources (corpora) available for free, which has lowered entry barriers for newcomers at the cost of poor adaptation to the task and low accuracy.
  • Providers have strengthened their efforts to create or educate markets. For instance, non-negligible investments have been made to make the technology easily integrable and demonstrable. However, the accuracy of text analytics tools depends to some extent on the typology of the text (language, genre, source) and on the purpose and interpretation of the client. General-purpose and do-it-yourself approaches may end up disappointing user expectations due to wrong parametrization or goals outside the scope of particular tools.


Interested? Read the rest of the interview (featuring customer cases, our “Meaning as a Service” product Textalytics and upcoming functionalities of our offering) on Seth Grimes’ blog.


Mining of useful information in social media: Daedalus at Big Data Week 2014

In the past few days we took part in Big Data Week 2014 in Madrid. Big Data Week is a network of events that take place in different cities of the world and is one of the most important global platforms focused on the social, political and technological impact of Big Data.

Big Data Week 2014 Madrid

These events bring together a global community of data scientists, technology providers and business users, and provide an open and self-organized environment to educate, inform, and inspire in the field of exploitation of massive data. In this year’s edition in Madrid, the Francisco de Vitoria University assumed through the CEIEC the role of City Partner and led the event.

Earthquakes, Buying Signals and… #WTF

With the title “Earthquakes, Buying Signals and… #WTF: Mining of Useful Information in Social Media”, our presentation illustrated how to use semantic technologies to automatically extract valuable information in social media scenarios, where Volume, Variety and Velocity requirements are extreme.

The presentation began by putting social media analysis in a context of unstructured content explosion and Big Data, and introducing semantic processing technologies.  Then, we presented some application scenarios we are developing in our R&D and commercial projects.

These applications basically focus on the areas of Voice of the Customer (VoC) / Customer Insights and Voice of the Citizen:

  • Customer journey and buying signals
  • Brand personality and perception maps
  • Corporate reputation
  • Smart cities and citizen sensor
  • Early detection and monitoring of emergencies

Finally, our service Textalytics “Meaning as a Service” was introduced as the easiest and most productive way to introduce semantic processing into any application, and thus extract useful information from social media and other unstructured content. (Remember that Textalytics can be used for free to process up to 500,000 words/month.)

In addition, in the event’s exhibition area we presented some demos focused on the above mentioned applications.

Here are the slides of the presentation (Spanish).



The Stilus workshop, one of those that sparked the most curiosity among the attendees at Lenguando

Translated by Luca De Filippis

Last weekend, Casa del Lector in Madrid was the venue for Lenguando: the first national meeting on language and technology. The pioneering initiative, successfully brought to reality by our colleagues at Molino de Ideas, Cálamo & Cran and Xosé Castro, was sponsored, among others, by Daedalus‘s Stilus.

The spirit of the conference was to bring together in the same space translators, proofreaders, philologists and other communication and language professionals, with an emphasis on the technological revolution of the sector, among other issues.

The talks about the advances in language technology and the simultaneous workshops on their practical application were the most anticipated. In particular, according to the organizers, the workshop given in the main auditorium by Concepción Polo (the author of this post) on behalf of the Stilus team attracted some of the greatest interest from the attendees.

Corpus Linguistics applied to proofreading

With the intention of presenting innovative and, above all, practical content, in the workshop we considered the possible applications of Corpus Linguistics (CL) in the specific area of professional automatic proofreading. The first aspect that aroused interest was the disclosure (for many) of the new lemmatized and morphological search features finally offered by the academic corpora Nuevo Diccionario Histórico (CDH) and Corpus del Español del Siglo XXI (CORPES XXI). Another key content was the brief comparison between the capabilities of these new corpora of the Spanish Royal Academy and those of the less known, although magnificent and veteran, Corpus del Español by Mark Davies.

After presenting the theory, some reflections followed: how and for what purpose a professional can apply Corpus Linguistics in the decision-making process of proofreading and, also, how to automate proofreading patterns with Word macros, for example.

In the last part of the workshop we explained how an intelligent automatic proofreader is able to address contextual issues that remain outside the autonomous user’s reach. It was time to examine and understand the pseudo-C++ code on which Stilus’ linguistic rules are based. The surprise among the participants without experience in Natural Language Processing lay both in the potential of this technology and in the mere fact of being able to interpret rules that handled formal, morphological, syntagmatic and even semantic elements.

Presentation of Stilus Macro

Indeed, the availability of tagged corpora allows carrying out empirical research on syntactic and lexical phenomena of a language on a previously unimaginable scale, and its application to computational linguistics is highly beneficial. Still, the examination of corpora shows that there are thousands of incorrect sequences of words that can be detected without morphosyntactic support, and this is precisely the purpose of Stilus Macro: an add-on —still in development— that we presented at the end of the workshop, capable of running at high speed more than 230,000 context-independent patterns for spelling, grammar and style proofreading within Word; an essentially simple task, but an unfeasible one for a human.
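
The idea of context-independent patterns can be sketched very simply: a table of known-bad word sequences and their replacements, applied by plain string matching with no morphosyntactic analysis at all. The two patterns below are invented examples for illustration; the real add-on runs a far larger table inside Word:

```python
# Hypothetical context-independent proofreading patterns (wrong -> right).
PATTERNS = {
    "a grosso modo": "grosso modo",
    "de acuerdo a": "de acuerdo con",
}

def proofread(text: str) -> str:
    """Apply every pattern by plain substring replacement; no parsing needed."""
    for wrong, right in PATTERNS.items():
        text = text.replace(wrong, right)
    return text
```

Because each pattern is incorrect in any context, the whole table can be applied blindly and at high speed, which is what makes the approach practical at the scale of hundreds of thousands of patterns.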



For more information, access the full presentation.



NextGen Mobile Content Analytics, a big data analytics solution for mobile video games

We’ve recently started working on the project NextGen Mobile Content Analytics, which aims at researching, designing and developing a Mobile Business Intelligence solution for developers and publishers of video games on mobile platforms. It is focused on providing analysis and customized recommendations based on the player’s gaming experience.

Currently, the business tools aimed at analyzing the activity of users in mobile games (such as Flurry, for example) limit their functionality to the collection of raw data, which must be interpreted by a human analyst to determine which actions should be performed in the analyzed scenarios. Some scenarios also require an immediate reaction, for example when a player shows signs of giving up the game soon, or when decisions have to be made about varying a digital product’s price.

The project’s goal is to create an intelligent system to identify behavioral patterns and use them to categorize players in real time according to different gaming, social or economic criteria (e.g. more active players, players who never buy, players who are about to leave the game, etc.). This will allow performing customized actions over users according to business goals, which is essential in a model where revenues depend on the interaction and evolution of the player in the game.

We bring our know-how in the collection, analysis and visualization of massive data (big data). The solution’s architecture consists of a data warehouse containing a log of the game’s events, with advanced data analytics and reporting modules to exploit the stored information. Furthermore, an SDK will be developed to be integrated into the game’s code in order to record the player’s actions in the data warehouse. These will be analyzed by automatic classification algorithms (player profiles and related actions to take). Then, through clustering and visualization, potential problems will be identified, making it possible to take concrete decisions for further software updates.
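
A minimal sketch of the kind of real-time categorization described above. The thresholds, segment names and player fields are our assumptions for illustration; the project itself relies on learned classification algorithms rather than fixed rules:

```python
from dataclasses import dataclass

@dataclass
class PlayerStats:
    sessions_last_week: int
    total_spend_eur: float
    days_since_last_session: int

def categorize(p: PlayerStats) -> str:
    """Assign a player to a business-relevant segment with simple rules."""
    if p.days_since_last_session >= 7:
        return "likely-to-churn"     # candidate for a re-engagement action
    if p.total_spend_eur == 0.0:
        return "never-buys"          # candidate for a first-purchase offer
    if p.sessions_last_week >= 10:
        return "highly-active"       # candidate for loyalty rewards
    return "regular"
```

Each segment would then be mapped to a customized action (a notification, an offer, a price change) in line with the business goals mentioned above.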

The NextGen Mobile Content Analytics project (TSI-100600-2013-198) is funded by the Spanish Ministry of Industry, Energy and Tourism in the framework of the Strategic Plan for Telecommunication and Information Society, National Plan for Scientific Research, Development and Technologic Innovation 2013-2016. We have been developing the project in cooperation with Digital Legends, a leading Spanish company well-known internationally for developing high-quality video games for mobile platforms.


We will publish more information as we move forward in the project and get further results. In any case, if you have any question, do not hesitate to contact us.

[Translation by Luca de Filippis]


TrendMiner, semantic analysis of real-time streams

In early December Daedalus was in Luxembourg to join the TrendMiner project. TrendMiner is an EU R&D project devoted to the study of technologies to process and analyze large-scale, real-time media streams in multilingual and multimodal domains. The analysis of this data stream involves clustering, summarization, entity recognition and, in general, text mining approaches. During the last two years, TrendMiner partners have been working on German, English and Italian in the financial and political domains. Daedalus has joined the TrendMiner team to adapt Textalytics semantic technology to deal with the financial and health domains in Spanish and English. Among the initial partners it is possible to find companies such as Ontotext, Eurokleis, Internet Memory Research SAS or Sora, and research groups from the DFKI, the University of Sheffield and the University of Southampton. The newcomer partners for the next year are the Research Institute for Linguistics of the Hungarian Academy of Sciences, the Institute of Computer Science of the Polish Academy of Sciences, the LaBDA team at Universidad Carlos III de Madrid and, of course, Daedalus.

These partners will work on the extension of TrendMiner real time stream semantic analysis to:

  • new languages, such as Polish, Hungarian and Spanish, and
  • new domains, such as:
  1. health: processing biomedical literature in order to extract knowledge from it, and
  2. psychological states: detecting the affective state of somebody posting a (short) text in social media, which helps in deciding the level of trust one can associate with such posts.

Daedalus participates in Sentiment Analysis Symposium

The upcoming Sentiment Analysis Symposium, to be held next March in New York, will provide an unmatched opportunity to network and learn about the latest sentiment analysis technologies and how they can be implemented to create real business impact.

Sentiment Analysis Symposium 2014

At Daedalus, we recognize the quality and value of this conference, and we decided to sponsor the event with our brand Textalytics (Meaning as a Service).

Read more (and get a 20% discount on your Sentiment Analysis Symposium registration) at Textalytics’ Blog.


Analysis: Modeling Air Pollution in the city of Santander (Spain)

We have published a new study entitled “Modeling Air Pollution in the City of Santander (Spain)”, carried out in the context of the project Ciudad2020. In this new document, in a similar way to our study on noise pollution, we have focused on presenting the full analysis of a real application of air pollution modeling in the city of Santander (Spain), which had already been summarily described in our whitepaper on pollution predictive modeling techniques in the sustainable city.

One of the objectives of Ciudad2020, as far as pollution is concerned, is to install across the city a wide network of low-cost sensors (in contrast to the current model, made up of a few very expensive and accurate measuring stations). However, at present this low-cost sensor network has not been deployed in any city yet, and checking the validity of the model requires data about various pollutants in an urban center.

The data used in this analysis are historical data provided by the Environmental Research Centre (CIMA). This entity is an autonomous body of the Government of Cantabria, created by law in 1991 and headed by the Ministry of Environment. Its activity is centered on performing physico-chemical analyses of the state of the environment and on managing sustainability through environmental information, participation, education and environmental volunteering.

The data set consists of measurements taken every 15 minutes between 1/1/2011 and 31/1/2013 by 4 automatic measuring stations of the Air Quality Control and Monitoring Network of Cantabria, which are located in the surroundings of Santander. The values associated with pollutants are the following: PM10 (suspended particles of size less than 10 microns), SO2 (sulphur dioxide), NO and NO2 (nitrogen oxides), CO (carbon monoxide), O3 (ozone), BEN (benzene), TOL (toluene) and XIL (xylene). In addition, those stations that have a meteorological tower measure the following meteorological parameters: DD (wind direction), VV (wind speed), TMP (temperature), HR (relative humidity), PRB (atmospheric pressure), RS (solar radiation) and LL (precipitation level).

As described in the document, the first step in any modeling study is the analysis of the data, performed variable by variable and station by station. At a minimum, this requires the basic statistics by season (mean and standard deviation, median, mode), the distribution of values (histogram) at both global and monthly level, and the hourly distribution. The moving average is also analyzed: a statistical feature, applicable to trend analysis, that smooths the fluctuations typical of instantaneous measurements and captures the trend over a given period.
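As a minimal sketch of the smoothing step described above, the following pure-Python function computes a simple moving average over a series of readings. The PM10 values are hypothetical, chosen only to illustrate how a 1-hour window (4 samples at 15-minute intervals) damps an instantaneous spike:

```python
def moving_average(values, window):
    """Smooth instantaneous measurements with a simple moving average."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

# Hypothetical PM10 readings (µg/m³) taken every 15 minutes:
pm10 = [18.0, 20.0, 35.0, 22.0, 19.0, 21.0]
print(moving_average(pm10, 4))  # → [23.75, 24.0, 24.25]
```

Note how the isolated 35.0 reading barely moves the averaged series, which is exactly why the moving average is useful for capturing trends rather than instantaneous fluctuations.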


The next step is to analyze how each variable depends on the others, in order to select the set of variables that best explains the behavior of the output variable. For that purpose, correlation analysis has been employed: a statistical tool that measures and describes the degree or intensity of association between two variables. In particular, Pearson’s correlation coefficient has been used, which measures the linear relationship between two quantitative random variables X and Y.
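Pearson’s coefficient can be sketched in a few lines of plain Python. The NO and NO2 series below are invented for illustration (the two oxides often share a traffic-related source, so a strong positive correlation is plausible, but these numbers are not from the study’s data):

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative values only, not measurements from the Santander stations:
no  = [10.0, 14.0, 22.0, 30.0, 41.0]
no2 = [21.0, 25.0, 30.0, 38.0, 47.0]
print(round(pearson(no, no2), 3))  # → 0.998
```

A value near +1 or -1 signals a strong linear relationship, which is what makes the coefficient useful for selecting input variables for the models.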

Dependency analyses have been carried out at the same moment in time, at moments in the past, with differenced values (the difference between the concentration level registered for a pollutant at a given moment and its level 30 minutes earlier, aimed at detecting trends over time regardless of absolute values), and with the moving average of the pollutant over different time intervals.
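The differencing described above can be sketched as follows. With 15-minute sampling, a lag of 2 positions corresponds to the 30-minute window mentioned in the study; the SO2 readings are hypothetical:

```python
def differenced(series, steps=2):
    """Difference between each reading and the one `steps` positions earlier.
    With 15-minute sampling, steps=2 means "30 minutes before"."""
    return [series[i] - series[i - steps] for i in range(steps, len(series))]

# Hypothetical SO2 readings every 15 minutes:
so2 = [8.0, 9.0, 12.0, 15.0, 14.0]
print(differenced(so2))  # → [4.0, 6.0, 2.0]
```

The differenced series keeps only the change over each 30-minute span, so two stations with very different absolute levels but the same rising trend would produce similar differenced values.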

The next step is to evaluate a series of modeling algorithms, with supervised learning (prediction, classification) or unsupervised learning (clustering), to draw conclusions about the behavior of the pollution variables. The prediction analysis has focused on the center of Santander, with 1-hour, 2-hour, 4-hour, 8-hour and 24-hour prediction horizons. Models for each pollution variable at all those horizons have then been trained and evaluated. Different machine learning algorithms have been trained for each case (each variable–prediction horizon combination): M5P, IBk, Multilayer Perceptron, linear regression, Regression by Discretization, RepTree, Bagging with RepTree, etc. The assessment is performed by comparing the mean absolute error of the different prediction methods.
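The evaluation step, comparing predictors by mean absolute error, can be illustrated with a toy example. The two baselines below (persistence and a 3-hour mean) are simple stand-ins, not the M5P, IBk or RepTree models actually trained in the study, and the NO2 series is invented:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy series of hourly NO2 readings; the last 3 values are the evaluation set.
history = [30.0, 32.0, 31.0, 35.0, 34.0, 36.0]
actual = history[3:]

# Two baseline predictors (stand-ins for the trained models):
persistence = history[2:5]                              # "same as previous hour"
mean3 = [sum(history[i - 3:i]) / 3 for i in range(3, 6)]  # 3-hour mean

for name, preds in [("persistence", persistence), ("3-hour mean", mean3)]:
    print(name, round(mean_absolute_error(actual, preds), 2))
# → persistence 2.33
# → 3-hour mean 2.67
```

In the study this comparison is performed per variable and per prediction horizon: for each combination, the algorithm with the lowest mean absolute error is the preferred model.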


For example, when studying the 8-hour prediction, it can be noticed that the hour of the day becomes more important, since citizens behave cyclically and probably what happens at 7 a.m. (e.g. people go to work) relates to what happens at 3 p.m. (e.g. people come back from work).

The last step of the data mining process, according to the CRISP-DM methodology, would be the implementation in an environmental management system for obtaining real-time predictions of the different pollutant values. This implementation logically has to consider the results and conclusions obtained in the analysis and modeling processes when setting up the deployment and prioritizing possible investments.

The most important thing to emphasize is that the analysis illustrates and details the steps to follow in an environmental pollution modeling project using data mining, although, logically, the concrete analysis and conclusions apply, in general, only to the city of Santander. You can access the complete study, more information and demos on our website. If you have any questions or comments, please do not hesitate to contact us; we will be happy to assist you.

[Translation by Luca de Filippis]