Big Data technologies

What are Big Data technologies?

The paradigms of analyzing data from empirical measurements or scientific experiments have recently changed thanks to the emergence of so-called "Big Data" technologies. These technologies offer excellent (horizontal) scalability for the analysis of increasingly large amounts of data. Scale is obtained by distributing the data across multiple processing machines:

  1. each machine is equipped with large-capacity disks;
  2. each machine has the task of analyzing only a portion of the data.

The results of these partial computations are then combined to obtain the final result.

A Big Data "cluster" is made of ordinary Personal Computers (servers), from a minimum of 4 up to entire data centers with tens of thousands of machines, which, through the use of appropriate software, automatically coordinate to split the data, process the parts separately and in parallel, and then combine the partial results to obtain the final result.
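The split/process/combine pattern described above can be illustrated with a minimal sketch in Python, where worker processes stand in for the machines of a cluster; the function names `process_partition` and `combine` are illustrative, not part of any real framework.

```python
from collections import Counter
from multiprocessing import Pool

def process_partition(lines):
    """Analyze one partition of the data (here: count word occurrences)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def combine(partial_results):
    """Combine the partial results into the final result."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total

if __name__ == "__main__":
    data = ["big data big cluster", "data center data"] * 2
    # Split the data into 4 partitions, one per worker process
    partitions = [data[i::4] for i in range(4)]
    with Pool(4) as pool:
        partials = pool.map(process_partition, partitions)
    print(combine(partials)["data"])  # prints 6
```

In a real cluster the partitions live on the disks of different machines and the framework (not the programmer) handles distribution and fault tolerance; the structure of the computation, however, is the same.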

Differently from architectures such as HPC (High Performance Computing), the bottleneck is constituted by the storage capacity offered by the PC/server and not by the processing capability of its CPU. This makes Big Data data centers profoundly different from data centers designed for HPC workloads.

Big Data technologies are nowadays used for the analysis of economic, social, statistical and other source data. They are the pillar of the algorithms for indexing web pages and information used by large ICT companies such as Google, Yahoo!, Facebook and Twitter. The technology is transversal and can be applied to the analysis of any type of data.

It is clear that Big Data is a phenomenon characterized by the expansion of data, which can be distinguished along four dimensions:

  1. Volume: managing and processing large amounts of data
  2. Velocity: managing, reading and processing these data rapidly (reading can be a bottleneck)
  3. Variety: handling raw data that may be structured or unstructured
  4. Veracity: validating the correctness of these large amounts of data

The Volume, Velocity and Variety of digital data are growing at an exponential rate and will continue to do so for the foreseeable future. Unstructured raw data must be converted into structured data in order to extract the knowledge needed to achieve new opportunities for innovation and economic value creation.

In order to unlock the potential of Big Data, firms need to overcome a significant number of technological challenges, including: managing diverse sources of unstructured data with no common schema (text, voice, social media data, clickstream data, etc.), real-time analytics, suitable visualisation techniques, and so on. However, while the technology issues can be challenging, the more difficult issues related to Big Data involve new approaches to ICT management and new technical-scientific competences to develop (i.e. the so-called data scientists). It is not enough for data scientists to understand Big Data technology architectures; they also need to know how the business works and makes money. Only in this way can a company fully exploit the potential and value-creation opportunities offered by Big Data.

The challenge introduced by Big Data technologies is to find the best trade-off between performance and the ability to handle this huge amount of data. Indeed, it is the data that drives the accuracy of the results, not the algorithm used. The main components of the Big Data lifecycle are the data themselves (including how to collect/capture them), their management, their processing and, finally, the way the results are presented to users.
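The lifecycle stages just listed (collect/capture, manage, process, present) can be sketched as a toy pipeline; all function names and the sample records are hypothetical, chosen only to make each stage concrete.

```python
def collect():
    """Capture raw records (hard-coded sample; in practice: sensors, logs, web data)."""
    return ["  42 ", "17", "not-a-number", "8  "]

def manage(raw):
    """Store/clean stage: keep only records that can be validated (veracity)."""
    return [r.strip() for r in raw if r.strip().isdigit()]

def process(clean):
    """Processing stage: compute an aggregate over the validated data."""
    values = [int(r) for r in clean]
    return sum(values) / len(values)

def present(result):
    """Presentation stage: format the result for users."""
    return f"Average value: {result:.1f}"

print(present(process(manage(collect()))))  # prints "Average value: 22.3"
```

Note how the invalid record is discarded during the management stage rather than during processing: validating veracity early keeps the later, more expensive stages simple.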