Big Data Mining

The term “Big Data” appeared for the first time in 1998, in a Silicon Graphics (SGI) slide deck by John Mashey titled “Big Data and the Next Wave of InfraStress”. Big Data mining was relevant from the beginning: the first book mentioning “Big Data” is a data mining book by Weiss and Indurkhya, also published in 1998. The first academic paper with the words “Big Data” in the title appeared a bit later, in 2000, and was written by Diebold.

The term “Big Data” arose because we are creating a huge amount of data every day. In his invited talk at the KDD BigMine’12 Workshop, Usama Fayyad presented striking figures about internet usage, among them the following: Google receives more than 1 billion queries per day, Twitter more than 250 million tweets per day, Facebook more than 800 million updates per day, and YouTube more than 4 billion views per day. The data produced nowadays is estimated to be on the order of zettabytes, and it is growing by around 40% every year.
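To get a feel for what roughly 40% yearly growth means, the short sketch below works out the compound-growth arithmetic. It assumes the rate stays constant from year to year, which is a simplification rather than something the figures above guarantee, and the function names are illustrative only.

```python
import math


def growth_factor(annual_rate: float, years: float) -> float:
    """Compound growth multiplier after `years` at a fixed annual rate."""
    return (1.0 + annual_rate) ** years


def doubling_time(annual_rate: float) -> float:
    """Years needed for the volume to double at a fixed annual rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


# Assumption: the ~40% yearly growth quoted in the text holds constant.
rate = 0.40

print(f"doubling time: {doubling_time(rate):.2f} years")
print(f"growth over 5 years: x{growth_factor(rate, 5):.2f}")
```

Under that assumption, the volume of data doubles roughly every two years and grows more than fivefold in five years.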

A large new source of data is going to be mobile devices, and big companies such as Google, Apple, Facebook, Yahoo and Twitter are starting to look carefully at this data to find useful patterns that improve the user experience. Alex ‘Sandy’ Pentland, at his ‘Human Dynamics Laboratory’ at MIT, is doing research on finding patterns in mobile data about what users do, not what they say they do.

We need new algorithms and new tools to deal with all of this data. Doug Laney was the first to mention the 3 V’s of Big Data management:

  • Volume: there is more data than ever before and its size keeps growing, but the percentage of data that our tools can process does not
  • Variety: there are many different types of data, such as text, sensor data, audio, video, graph, and more
  • Velocity: data is arriving continuously as streams of data, and we are interested in obtaining useful information from it in real time

Nowadays, there are two more V’s:

  • Variability: there are changes in the structure of the data and how users want to interpret that data
  • Value: business value that gives organizations a competitive advantage, thanks to the ability to make decisions based on answering questions that were previously considered beyond reach

Gartner summarized this in its 2012 definition of Big Data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

There are many applications of Big Data, for example the following:

  • Business: customer personalization, churn detection
  • Technology: reducing process time from hours to seconds
  • Health: mining the DNA of each person to discover, monitor and improve health aspects of everyone
  • Smart cities: cities focused on sustainable economic development and high quality of life, with wise management of natural resources

These applications will allow people to have better services and better customer experiences, and also to be healthier, as personal data will make it possible to prevent and detect illness much earlier than before.

References

Wei Fan, Albert Bifet. Mining Big Data: Current Status, and Forecast to the Future. SIGKDD Explorations 14(2): 1-5 (2012).