Controversy about Big Data

Big Data is a new and emerging topic that has generated a great deal of controversy:

  • There is no need to distinguish Big Data analytics from data analytics, as data will continue growing, and it will never be small again.
  • Big Data may be hype to sell Hadoop-based computing systems. Data management vendors seem eager to sell systems built on Hadoop, yet Hadoop is not always the best tool, and MapReduce is not always the best programming platform, for example for medium-sized companies.
  • In real-time analytics, data may be changing. In that case, what matters is not the size of the data but its recency.
  • Claims to accuracy are misleading. As Taleb explains in his book ‘Antifragile’, when the number of variables grows, the number of spurious correlations also grows. For example, Leinweber showed that the S&P 500 stock index was correlated with butter production in Bangladesh, among other strange correlations.
  • Bigger data are not always better data. It depends on whether the data are noisy and whether they are representative of what we are looking for. For example, Twitter users are sometimes assumed to be representative of the global population, which is not always the case.
  • Ethical concerns about accessibility. The main issue is whether it is ethical to analyze people without their knowledge.
  • Limited access to Big Data creates new digital divides. There may be a divide between people or organizations that are able to analyze Big Data and those that are not. Organizations with access to Big Data will also be able to extract knowledge that others cannot, which may create a division between Big Data rich and Big Data poor organizations.
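The point about spurious correlations can be checked with a small simulation. The sketch below is illustrative only: it assumes purely random Gaussian variables, and the names `pearson` and `count_spurious`, the sample size of 30, and the 0.4 threshold are all hypothetical choices, not taken from Taleb or Leinweber. It counts how many unrelated variables appear correlated with a random target by chance alone.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(num_variables, num_samples=30, threshold=0.4, seed=0):
    """Count random variables whose |correlation| with a random target
    reaches the threshold, even though no real relationship exists."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(num_samples)]
    count = 0
    for _ in range(num_variables):
        candidate = [rng.gauss(0, 1) for _ in range(num_samples)]
        if abs(pearson(target, candidate)) >= threshold:
            count += 1
    return count


if __name__ == "__main__":
    for k in (10, 100, 1000):
        print(k, "variables ->", count_spurious(k), "chance correlations")
```

Because every variable here is noise, any correlation found is fake by construction. With the seed held fixed, the variables tested in a smaller run are a prefix of those tested in a larger run, so the count can only stay the same or rise as more variables are added.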

References

Wei Fan and Albert Bifet. Mining Big Data: Current Status, and Forecast to the Future. SIGKDD Explorations 14(2): 1-5 (2012).

Big Data Mining

The term “Big Data” appeared for the first time in 1998, in a Silicon Graphics (SGI) slide deck by John Mashey titled “Big Data and the Next Wave of InfraStress”. Big Data mining was relevant from the beginning: the first book mentioning “Big Data” is a data mining book by Weiss and Indurkhya, which also appeared in 1998. The first academic paper with the words “Big Data” in the title appeared a bit later, in 2000, and was written by Diebold.

The term “Big Data” originates from the fact that we are creating a huge amount of data every day. In his invited talk at the KDD BigMine’12 Workshop, Usama Fayyad presented striking figures about internet usage, among them the following: Google receives more than 1 billion queries per day, Twitter more than 250 million tweets per day, Facebook more than 800 million updates per day, and YouTube more than 4 billion views per day. The data produced nowadays is estimated to be on the order of zettabytes, and it is growing by around 40% every year.
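To put the 40% figure in perspective, compound growth at that rate doubles the data volume roughly every two years. A minimal sketch of the arithmetic follows, assuming the rate stays constant (a simplification); the function names are hypothetical.

```python
import math


def doubling_time_years(annual_growth_rate):
    """Years needed for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)


def projected_volume(initial, annual_growth_rate, years):
    """Volume after the given number of years of compound growth."""
    return initial * (1 + annual_growth_rate) ** years


if __name__ == "__main__":
    rate = 0.40  # the "around 40% every year" estimate quoted above
    print(f"Doubling time: {doubling_time_years(rate):.2f} years")
    print(f"Growth factor after 10 years: {projected_volume(1.0, rate, 10):.1f}x")
```

Under this assumption, ten years of growth multiplies the volume by about 29.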

A large new source of data will be mobile devices, and big companies such as Google, Apple, Facebook, Yahoo, and Twitter are starting to look carefully at this data to find useful patterns that improve user experience. Alex ‘Sandy’ Pentland, at his ‘Human Dynamics Laboratory’ at MIT, is doing research on finding patterns in mobile data about what users do, not what they say they do.

We need new algorithms and new tools to deal with all of this data. Doug Laney was the first to mention the 3 V’s of Big Data management:

  • Volume: there is more data than ever before, and its size continues to increase, but the percentage of data that our tools can process does not
  • Variety: there are many different types of data, such as text, sensor data, audio, video, graph, and more
  • Velocity: data arrives continuously as streams, and we are interested in obtaining useful information from it in real time
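The Velocity point implies that each item is processed once, as it arrives, without storing the whole stream. As an illustration only, the sketch below keeps a running mean and variance using Welford's online algorithm. This technique is chosen here as an example and is not named in the source; the class name `RunningStats` is hypothetical, and the readings are assumed to be numeric.

```python
class RunningStats:
    """Single-pass mean and sample variance over a stream of numbers."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new reading into the statistics in constant time and memory."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance of the values seen so far (0.0 with fewer than two)."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for reading in (2, 4, 4, 4, 5, 5, 7, 9):  # stands in for an unbounded stream
        stats.update(reading)
        print(f"n={stats.count} mean={stats.mean:.3f} var={stats.variance:.3f}")
```

The statistics are available after every update, which is what makes the approach usable in real time.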

Nowadays, there are two more V’s:

  • Variability: there are changes in the structure of the data and how users want to interpret that data
  • Value: business value that gives organizations a competitive advantage, thanks to the ability to make decisions based on answering questions that were previously considered beyond reach

Gartner summarized this in its 2012 definition of Big Data as high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

There are many applications of Big Data, for example the following:

  • Business: customer personalization, churn detection
  • Technology: reducing processing time from hours to seconds
  • Health: mining the DNA of each person to discover, monitor, and improve health aspects of everyone
  • Smart cities: cities focused on sustainable economic development and high quality of life, with wise management of natural resources

These applications will allow people to have better services and better customer experiences, and also to be healthier, as personal data will make it possible to prevent and detect illness much earlier than before.

References

Wei Fan and Albert Bifet. Mining Big Data: Current Status, and Forecast to the Future. SIGKDD Explorations 14(2): 1-5 (2012).