As a rule, analysis software requires much higher data quality than operational software. A typing error in a customer name in an invoice is not enough to break the transaction in most systems. But when it comes to analyzing the data, much more precision is required. In fact, the issue of data quality was identified by The BI Survey 9 as the most common problem in business intelligence projects.
So what is to be made of the idea of business intelligence and unstructured data? The term “unstructured data” is not particularly precise, but in the end it means data that is difficult to deal with, usually because it has no data model. This certainly does not sound like a very promising area for analysis. Unfortunately, a lot of the information available to companies is in unstructured data, although it is hard to say exactly how much.
In fact, business intelligence tools cannot analyze unstructured data directly. Any project of this type has two distinct stages. In the first stage, specialized software analyzes the unstructured data, reduces it and produces a data model that a BI system can deal with. In the second stage, the data is analyzed using business intelligence tools. Text-based unstructured data is by far the most common form, but there are many others. In this article, we will only discuss text analysis.
Text analysis tools
Text analysis tools process texts and add metadata for analysis. The metadata consists of semantic tags attached to the documents. The resulting data is often stored in search-engine-style tables; obviously, there is a large overlap between search engine technology and BI for unstructured data. The analysis software defines clusters, which are sets of data with the same semantic tags. This process removes the “noise” that natural language texts inevitably contain. The clusters are then treated as the objects of analysis. So the final analysis carried out by the end users does not actually look at the raw data itself. Instead, it looks at these cluster objects, which are abstract and modular enough to be used for high-level analysis.
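To make the idea of a cluster concrete, the following minimal Python sketch groups documents that carry exactly the same set of semantic tags. The document ids, the tags and the function name build_clusters are hypothetical and chosen only for illustration; real products may define clusters more loosely, for example by tag overlap.

```python
from collections import defaultdict

# Hypothetical output of the tagging step: (document id, semantic tags).
tagged_docs = [
    ("doc1", {"engine", "noise"}),
    ("doc2", {"brakes"}),
    ("doc3", {"noise", "engine"}),
    ("doc4", {"engine"}),
]

def build_clusters(docs):
    """Group document ids by their exact set of semantic tags."""
    clusters = defaultdict(list)
    for doc_id, tags in docs:
        # frozenset makes the tag set hashable and order-independent.
        clusters[frozenset(tags)].append(doc_id)
    return dict(clusters)

clusters = build_clusters(tagged_docs)
for tags, members in clusters.items():
    print(sorted(tags), members)
```

The end user would then work with the keys of this mapping (the clusters) and their sizes rather than with the raw documents.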
Not all companies that want to analyze unstructured data need a lot of complexity in the front-end. One typical application of unstructured data analysis is finding out whether consumers have a positive or negative attitude towards a brand. This can be achieved to a certain extent by analyzing comments on public Web forums. There is a lot of complex technology involved in this process, but the final results are quite simple to present: essentially a thumbs-up (or thumbs-down) for the brand in question.
There are various ways to enrich the data model to make it more susceptible to typical OLAP technology. In some cases, the clusters are arranged hierarchically, so users can drill down from general terms to terms that are more specific. For example, a car company might analyze comments from on-line forums about the opinions car owners have. A high level cluster in this case might be engines, and from here, the user could drill down to engine problems, or performance, or other concepts. It is also common to add a time line to the data. In this case, the analysis is often a simple cluster frequency analysis over time. From the point of view of OLAP analysis, this is not very complicated, but even something this simple can be extremely valuable to a company. Other examples of where the same type of analysis could be useful include the analysis of letters from customers, or call center data.
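A cluster frequency analysis over time with a simple drill-down could be sketched as follows. The hierarchy, the months and the function name frequency_over_time are assumptions made for this example, not part of any particular product.

```python
from collections import Counter

# Hypothetical two-level hierarchy: specific cluster -> general cluster.
PARENT = {
    "engine problems": "engines",
    "engine performance": "engines",
    "brake wear": "brakes",
}

# Hypothetical observations: (month, cluster assigned to a forum comment).
observations = [
    ("2024-01", "engine problems"),
    ("2024-01", "engine performance"),
    ("2024-01", "brake wear"),
    ("2024-02", "engine problems"),
    ("2024-02", "engine problems"),
]

def frequency_over_time(obs, roll_up=False):
    """Count comments per (month, cluster); optionally roll up to the parent."""
    counts = Counter()
    for month, cluster in obs:
        key = PARENT.get(cluster, cluster) if roll_up else cluster
        counts[(month, key)] += 1
    return counts

top_level = frequency_over_time(observations, roll_up=True)  # general view
detail = frequency_over_time(observations)                   # drill-down view

for (month, cluster), n in sorted(top_level.items()):
    print(month, cluster, n)
```

In OLAP terms, the month and the cluster level act as two dimensions and the comment count is the only measure.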
More than simple string matching
Unstructured data analysis solutions start by using a natural language processing engine to derive the clusters. Measuring keyword density is the most important method in this type of analysis. The first step to measuring keyword density is to apply a so-called stop word list to a text. Stop words are words that do not carry enough meaning to help establish a context. A typical English stop word list would begin “a, about, after, again, against, all…” The more commonly a word occurs in a generic text, the less interesting it is for analysis of keywords in a specific text. Usually the stop words are not removed from the text, just replaced by a single symbol to maintain the text structure.
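The stop word step can be illustrated with a short Python sketch. The stop word list is deliberately tiny, the placeholder symbol "_" is an assumption, and keyword density is defined here as a keyword's share of all tokens, which is only one of several possible definitions.

```python
import re
from collections import Counter

# Short illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"a", "about", "after", "again", "against", "all",
              "the", "is", "and", "of"}
PLACEHOLDER = "_"  # assumed symbol that stands in for a stop word

def mask_stop_words(text):
    """Tokenize the text and replace stop words with the placeholder."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [PLACEHOLDER if t in STOP_WORDS else t for t in tokens]

def keyword_density(tokens):
    """Return each keyword's share of all tokens, placeholders included."""
    counts = Counter(t for t in tokens if t != PLACEHOLDER)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

tokens = mask_stop_words("The engine is loud and the engine stalls")
density = keyword_density(tokens)
print(tokens)
print(density)
```

Because the stop words are masked rather than deleted, the positions of the remaining keywords relative to one another are preserved.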
Text analysis engines also use word root analysis to group words by ignoring inflections, so “engine” and “engines” can be treated as the same keyword. Synonym word lists, which usually depend on the application, are used in the same way to reduce the total number of terms. Synonyms are a typical example of the linguistic noise that makes texts difficult for computers to analyze.
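As a rough illustration, the sketch below reduces words to a root and then maps synonyms onto one preferred term. The suffix rule is intentionally naive (it only strips a plural "s") and stands in for a proper stemming or lemmatization algorithm; the synonym list is a hypothetical, application-specific example.

```python
# Hypothetical application-specific synonym list: variant -> preferred term.
SYNONYMS = {"motor": "engine", "automobile": "car"}

def naive_root(word):
    """Strip a plural 's'; a stand-in for a real stemming algorithm."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(tokens):
    """Lowercase, reduce to a root, then apply the synonym list."""
    roots = [naive_root(t.lower()) for t in tokens]
    return [SYNONYMS.get(r, r) for r in roots]

normalized = normalize(["Engine", "engines", "motors", "car", "automobiles"])
print(normalized)
```

Five different surface forms collapse into two keywords here, which is the reduction in the number of terms that the paragraph above describes.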
Once the texts have been cleaned up, the system can carry out a statistical analysis of the resulting set of interesting words. This is often done using a word list as a basis, the content of which depends on the business. For example, a company could look for mentions of its own name and try to determine whether it is associated with positive or negative terms. To do this, either the document needs to be divided into segments in some way, or the system needs some statistical measure of the nearness of terms to one another in the text.
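One simple way to measure nearness is a fixed window of tokens around each mention of the company name. The sketch below counts positive and negative terms inside that window; the word lists, the brand name "acme", the window size and the function name brand_sentiment are all assumptions for illustration, and commercial systems use considerably more sophisticated measures.

```python
# Hypothetical word lists; in practice these depend on the business.
POSITIVE = {"reliable", "great", "smooth"}
NEGATIVE = {"noisy", "broken", "unreliable"}

def brand_sentiment(tokens, brand, window=3):
    """Score = positive minus negative terms within `window` tokens of the brand."""
    score = 0
    for i, tok in enumerate(tokens):
        if tok != brand:
            continue
        start = max(0, i - window)
        # Tokens before and after the mention, excluding the mention itself.
        nearby = tokens[start:i] + tokens[i + 1:i + 1 + window]
        score += sum(1 for t in nearby if t in POSITIVE)
        score -= sum(1 for t in nearby if t in NEGATIVE)
    return score

tokens = "my acme car is reliable but the dealer was unreliable".split()
print(brand_sentiment(tokens, "acme"))             # narrow window
print(brand_sentiment(tokens, "acme", window=10))  # wide window
```

The choice of window matters: with the narrow window only "reliable" is attributed to the brand, while the wide window also picks up "unreliable", which in this sentence actually refers to the dealer.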
Vendors of text analysis systems tend to adopt one of two approaches: a linguistic approach or a statistical approach. It is difficult to make a general statement on which is better, and companies need to carry out a proof of concept to see which makes more sense in their specific situation. In many cases, the two approaches complement each other. Smaller specialized companies are often strongly oriented towards one approach or the other.
Regardless of the approach the vendor takes, analysis systems for unstructured data all require a training phase. When the project begins, there simply is not enough example data in the system to carry out reliable analyses. Companies need to develop custom dictionaries to get the best results. This means that the projects inevitably take months to get working. And as with any IT project, long deployment times tend to result in changing requirements and disagreements on the goals. Unfortunately, there is no simple solution to this problem. However, one possible approach is to use knowledge already available to the customer to structure the dictionary. In other words, instead of starting the project with a completely blank slate, companies can bring in a clear list of concepts and key words they would like to analyze.
Another issue that is related to the amount of predefined information that flows into the system is the balance of flexibility and ease of use for the end user. Just as in business intelligence, the most flexible system is not always the best, because many users require some guidance when navigating complex data.
The solutions we have seen are often produced as a cooperation between two or more companies, with a BI specialist providing the front-end and an unstructured data specialist processing the raw data to feed the system. The more specialized, linguistics-oriented systems typically have simple native front-ends that do not offer as many functions. Vendors who provide statistical analysis are often accustomed to producing data models for BI front-ends, not least because this type of technology is older, more generally applicable and better established in the market.