‘Big Data’ is a popular subject nowadays. Here is what Wikipedia says on the subject:
“Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.”
“Big data can be described by the following characteristics:
Volume – The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
Variety – The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.
Velocity – In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
Variability – Inconsistency of the data set can hamper processes to handle and manage it.
Veracity – The quality of captured data can vary greatly, affecting accurate analysis.”
What is shocking in the above is the fact that dimensionality is not mentioned. In other words, it appears that size, and size alone, is what makes big data big and nasty. Linear thinking once again.
Imagine two cases.
Case 1: analysis of 1 billion e-mail messages or tweets. E-mails and tweets have a small number of features, or attributes – typically a few tens. The corresponding data array is 1 billion rows by a few tens of columns. Once the raw text is included, such a data set can easily reach the petabyte range. Sounds big indeed.
Case 2: systemic analysis of all listed companies – there are approximately 45,000 listed companies today – based on the last 10 quarterly financial statements of each one. A typical balance sheet contains around 100 entries. This means that the corresponding data array is 10 rows by 4.5 million columns (45,000 companies times 100 entries). The data size in a case like this is a few hundred megabytes, which certainly does not qualify as big data.
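A quick back-of-envelope sketch makes the contrast concrete. The figures below are assumptions for illustration: 8-byte numeric cells, 30 columns for the e-mail case (the raw text of each message would of course be far larger, which is what pushes case 1 toward the petabyte range).

```python
# Hypothetical back-of-envelope comparison of the two cases.
# Assumes 8-byte numeric cells; raw e-mail/tweet text would be much larger.
cases = {
    "e-mails/tweets":   (1_000_000_000, 30),         # 1 billion rows, a few tens of columns
    "listed companies": (10, 45_000 * 100),          # 10 quarters, 4.5 million attributes
}

for name, (rows, cols) in cases.items():
    size_gb = rows * cols * 8 / 1e9                  # bytes -> gigabytes
    print(f"{name}: {rows:,} rows x {cols:,} cols -> ~{size_gb:,.2f} GB")
```

Even with the e-mail array hundreds of times larger on disk, it is the 4.5-million-column array that is the hard one to analyze.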
The two data sets differ in size by three, four or more orders of magnitude. And yet, case 1 is trivial: it takes more time to read the data into memory than to process it (e.g. to determine rules or patterns). Case 2 – the listed-companies example is a real one – requires approximately 32,000 cores and over 160 hours of computation to identify the trillions of interactions between the various companies. So, which data is bigger? The issue is not how many measurements (data samples) there are but how many attributes. It is the number of data attributes – the dimensionality – that makes a problem nasty to tackle.
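Where do the trillions of interactions come from? Even restricting attention to pairwise interactions between attributes, the number of candidate pairs grows quadratically with the attribute count. A minimal sketch, using the 4.5 million attributes assumed above:

```python
# Candidate pairwise interactions grow quadratically with dimensionality.
from math import comb

attributes = 45_000 * 100            # columns in the listed-companies array
pairs = comb(attributes, 2)          # unordered pairs of attributes
print(f"{attributes:,} attributes -> {pairs:,} candidate pairwise interactions")
# On the order of 10^13, i.e. trillions, consistent with the workload above.
```

One billion rows with 30 attributes, by contrast, yield only comb(30, 2) = 435 attribute pairs, no matter how many rows are added.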
One last point. The definition of data veracity speaks of ‘accurate analysis’. But data and the corresponding analyses must be relevant, not accurate. When high complexity kicks in – and the Wikipedia definition indeed speaks of complexity – there is no such thing as accuracy or precision. The Principle of Incompatibility, formulated by L. Zadeh at UC Berkeley, states that high complexity is incompatible with high precision. In other words, when complexity is high, ‘precise statements lose relevance’ and ‘relevant statements lose precision’.
Precise statement that is irrelevant: the probability of default of a given corporation in the next 3 years is 0.025%.
Relevant statement that is not precise: there is a high probability that it may rain tomorrow.
So, next time someone tries to impress you with petabytes or exabytes, ask a simple question: how many dimensions?
Wisdom starts by calling things with their right names.