
What Makes Data Big?


‘Big Data’ is a popular subject nowadays. Here is what Wikipedia says about it:

“Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.”


“Big data can be described by the following characteristics:

Volume – the quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

Variety – the type and nature of the data. This helps people who analyze it to effectively use the resulting insight.

Velocity – in this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

Variability – inconsistency of the data set can hamper processes to handle and manage it.

Veracity – the quality of captured data can vary greatly, affecting accurate analysis.”

What is shocking in the above is the fact that dimensionality is not mentioned. In other words, it appears that size, and size alone, is what makes big data big and nasty. Linear thinking once again.

Imagine two cases.

Case 1: analysis of 1 billion e-mail messages or tweets. E-mails and tweets have a small number of features, or attributes – typically a few tens. The corresponding data array is 1 billion rows by a few tens of columns, and such a data set can easily be in the petabyte range. Sounds big indeed.

Case 2: systemic analysis of all listed companies – there are approximately 45,000 listed companies today – based on the last 10 quarterly financial statements of each one. A typical balance sheet contains around 100 entries, so the corresponding data array is 10 rows by 4.5 million columns. The data size in a case like this is a few hundred megabytes, which certainly does not qualify as big data.
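The back-of-the-envelope arithmetic behind the two cases can be checked in a few lines. The per-item sizes below are illustrative assumptions (an average raw message of roughly 1 MB once text, headers and attachments are included; 8-byte numeric entries for the financial data), not measurements:

```python
# Rough storage estimates for the two cases described above.
# Per-item sizes are illustrative assumptions, not measurements.

MB = 10**6
PB = 10**15

# Case 1: 1 billion messages. If an average message (text, headers,
# attachments) occupies ~1 MB, the raw corpus sits in the petabyte range.
case1_bytes = 1_000_000_000 * 1 * MB

# Case 2: 10 quarterly statements for 45,000 companies, ~100 numeric
# entries per statement, stored as 8-byte floats.
rows, cols = 10, 45_000 * 100          # a 10 x 4.5 million array
case2_bytes = rows * cols * 8

print(f"Case 1: {case1_bytes / PB:.1f} PB of raw messages")
print(f"Case 2: {case2_bytes / MB:.0f} MB numeric array")
```

Under these assumptions the first data set is roughly a petabyte while the second is about 360 MB – several orders of magnitude apart, exactly as the text notes.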

The two data sets differ in size by three, four or more orders of magnitude. And yet, Case 1 is trivial: it takes more time to read the data into memory than to process it (e.g. to extract rules or patterns). Case 2 – the example of the listed companies is a real one – requires approximately 32,000 cores and over 160 hours of computation to identify the trillions of interactions between the various companies. So, which data is bigger? The issue is not how many measurements (data samples) there are, but how many attributes. It is the number of data attributes (dimensionality) that makes a problem nasty to tackle.
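The "trillions of interactions" figure follows directly from the column counts. If an analysis examines pairwise interactions between attributes, the work grows with the square of the number of columns, not the number of rows. A minimal sketch, using the illustrative column counts from the two cases:

```python
from math import comb

# Pairwise attribute interactions grow quadratically with dimensionality.
case1_cols = 50                 # a few tens of attributes per message
case2_cols = 45_000 * 100       # 4.5 million balance-sheet entries

pairs1 = comb(case1_cols, 2)    # unordered attribute pairs, Case 1
pairs2 = comb(case2_cols, 2)    # unordered attribute pairs, Case 2

print(f"Case 1: {pairs1:,} pairwise interactions")
print(f"Case 2: {pairs2:,} pairwise interactions")
```

Case 1 yields about a thousand attribute pairs; Case 2 yields roughly ten trillion – which is why the small-in-bytes data set is the one that needs tens of thousands of cores.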

One last point. The definition of data veracity speaks of ‘accurate analysis’. But data and the corresponding analyses must be relevant, not accurate. When high complexity kicks in – the Wikipedia definition indeed speaks of complexity – there is no such thing as accuracy or precision. The Principle of Incompatibility, formulated by Lotfi Zadeh at UC Berkeley, states that high complexity is incompatible with high precision. In other words, when complexity is high, ‘precise statements lose relevance’ and ‘relevant statements lose precision’.


A precise statement that is irrelevant: the probability of default of a given corporation in the next 3 years is 0.025%.

A relevant statement that is not precise: there is a high probability of rain tomorrow.

So, next time someone tries to impress you with petabytes or exabytes, ask a simple question: how many dimensions?

Wisdom starts by calling things with their right names.

Established originally in 2005 in the USA, Ontonix is a technology company headquartered in Como, Italy. The unusual technology and solutions developed by Ontonix focus on countering what most threatens safety, advanced products, critical infrastructures and IT network security: the rapid growth of complexity. In 2007 the company was named a Gartner Cool Vendor. What makes Ontonix different from all the companies and research centers that claim to manage complexity is that we have a complexity metric. This means that we MEASURE complexity. We detect anomalies in complex defense systems without using Machine Learning, for one very good reason: our clients don’t have the luxury of the multiple examples of failures needed to teach software to recognize them. We identify anomalies without having seen them before. Sometimes, you must get it right the first and only time!
