Big Data Mining Future Challenges

There are many important future challenges in Big Data management and analytics that arise from the nature of the data: large, diverse, and evolving. These are some of the challenges that researchers and practitioners will have to deal with in the years to come:

  • Analytics Architecture. It is not yet clear how an optimal architecture for an analytics system should be constructed to deal with historic data and real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz. The Lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are that it is robust and fault tolerant, scalable, general, and extensible, allows ad hoc queries, requires minimal maintenance, and is debuggable.
  • Evaluation. It is important to achieve statistically significant results and not be fooled by randomness. As Efron explains in his book on large-scale inference, it is easy to go wrong with huge data sets and thousands of questions to answer at once. It will also be important to avoid the trap of focusing only on error or speed, as Kiri Wagstaff discusses in her paper “Machine Learning that Matters”.
  • Distributed mining. Many data mining techniques are not trivial to parallelize. To obtain distributed versions of some methods, a lot of research with practical and theoretical analysis is needed to provide new methods.
  • Time evolving data. Data may evolve over time, so it is important that Big Data mining techniques are able to adapt and, in some cases, to detect change first. For example, the data stream mining field has very powerful techniques for this task.
  • Compression. When dealing with Big Data, the amount of space needed to store it is very relevant. There are two main approaches: compression, where we do not lose anything, and sampling, where we choose the data that is most representative. Using compression, we may spend more time and use less space, so we can consider it a transformation from time to space. Using sampling, we lose information, but the gains in space may be orders of magnitude. For example, Feldman et al. use coresets to reduce the complexity of Big Data problems. Coresets are small sets that provably approximate the original data for a given problem. Using merge-reduce, the small sets can then be used for solving hard machine learning problems in parallel.
  • Visualization. A main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks to tell and show stories will be needed, such as the photographs, infographics and essays in the beautiful book “The Human Face of Big Data”.
  • Hidden Big Data. Large quantities of useful data are getting lost since new data is largely untagged, file-based and unstructured. The 2012 IDC study on Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would be useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.
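
The three-layer decomposition of the Lambda architecture mentioned in the first challenge can be illustrated with a toy example. The sketch below is not Nathan Marz's implementation: Hadoop and Storm are replaced by in-memory stand-ins, the only supported query is a per-key count, and the class and method names (LambdaCounter, ingest, run_batch, query) are hypothetical names chosen for this illustration.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture answering: how many times was each key seen?"""

    def __init__(self):
        self.master = []             # append-only master dataset (historic data)
        self.batch_view = Counter()  # serving layer: view precomputed by the batch layer
        self.speed_view = Counter()  # speed layer: events not yet covered by the batch view

    def ingest(self, key):
        # New data is stored in the master dataset and is also reflected
        # immediately in the incremental real-time view.
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute the view from scratch over all historic data.
        self.batch_view = Counter(self.master)
        # The batch view now covers everything, so the real-time view is reset.
        self.speed_view.clear()

    def query(self, key):
        # A query merges the batch view with the real-time view.
        return self.batch_view[key] + self.speed_view[key]


system = LambdaCounter()
for key in ["a", "b", "a"]:
    system.ingest(key)
system.run_batch()      # "a" and "b" are now served from the batch view
system.ingest("a")      # arrives after the batch run, served from the speed view
print(system.query("a"))
```

Recomputing the batch view from the full master dataset is what makes this design tolerant to errors in the incremental logic: whatever the speed layer gets wrong is overwritten at the next batch run.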

Big Data Mining Tools

The Big Data phenomenon is intrinsically related to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter, and LinkedIn benefit from and contribute to open source projects. Big Data infrastructure deals with Hadoop and other related software, such as:

  • Apache Hadoop: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called Hadoop Distributed Filesystem (HDFS). Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel. This mapping step is then followed by a step of reduce tasks. These reduce tasks use the output of the maps to obtain the final result of the job.
  • Apache Hadoop related projects: Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper, Apache Cassandra, Cascading, Scribe and many others.
  • Apache S4: platform for processing continuous data streams. S4 is designed specifically for managing data streams. S4 apps are designed by combining streams and processing elements in real time.
  • Storm: software for streaming data-intensive distributed applications, similar to S4, and developed by Nathan Marz at Twitter.
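
The map and reduce steps described for Hadoop above can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not the Hadoop API: the function names are hypothetical, the word-count job is an example chosen here, and the explicit grouping step between map and reduce (which a MapReduce framework performs itself) is written out by hand.

```python
from collections import defaultdict
from itertools import chain


def map_task(split):
    """Map: emit a (word, 1) pair for every word in one independent input subset."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def group_by_key(pairs):
    """Group the map output by key so each reduce task sees all values of one key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine all values emitted for one key into the final result."""
    return key, sum(values)


def map_reduce(splits):
    # On a cluster the map tasks run in parallel, one per split;
    # here they simply run one after another.
    mapped = chain.from_iterable(map_task(split) for split in splits)
    return dict(reduce_task(key, values) for key, values in group_by_key(mapped).items())


splits = [["big data mining", "data streams"], ["mining big graphs"]]
counts = map_reduce(splits)
print(counts)
```

Because each map task only reads its own split and each reduce task only reads the values of its own key, both steps can be distributed across nodes without coordination inside a step.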

In Big Data Mining, there are many open source initiatives. The most popular are the following:

  • Apache Mahout: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.
  • R: open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand beginning in 1993 and is used for statistical analysis of very large data sets.
  • MOA: stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm. SAMOA is an upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.
  • Vowpal Wabbit: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets. It can exceed the throughput of any single machine network interface when doing linear learning, via parallel learning.

More specific to Big Graph mining, we find the following open source tools:

  • Pegasus: big graph mining system built on top of MapReduce. It allows finding patterns and anomalies in massive real-world graphs.
  • GraphLab: high-level graph-parallel system built without using MapReduce. GraphLab computes over dependent records which are stored as vertices in a large distributed data-graph. Algorithms in GraphLab are expressed as vertex-programs which are executed in parallel on each vertex and can interact with neighboring vertices.
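
To make the idea of a vertex-program concrete, the following sketch expresses PageRank as a function that is run once per vertex and only reads the state of neighboring vertices. It is an illustration of the programming style, not the GraphLab API: the function names, the synchronous iteration scheme, and the choice of PageRank as the example algorithm are assumptions made here.

```python
def pagerank_vertex_program(vertex, ranks, in_neighbors, out_degree, damping=0.85):
    """Compute the new rank of one vertex from the ranks of its in-neighbors."""
    gathered = sum(ranks[u] / out_degree[u] for u in in_neighbors[vertex])
    return (1 - damping) / len(ranks) + damping * gathered


def run_pagerank(edges, iterations=50):
    vertices = sorted({v for edge in edges for v in edge})
    in_neighbors = {v: [] for v in vertices}
    out_degree = {v: 0 for v in vertices}
    for src, dst in edges:
        in_neighbors[dst].append(src)
        out_degree[src] += 1

    ranks = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        # Each call depends only on the previous ranks, so a graph-parallel
        # system could execute the program on all vertices at the same time.
        ranks = {
            v: pagerank_vertex_program(v, ranks, in_neighbors, out_degree)
            for v in vertices
        }
    return ranks


# Assumption: every vertex has at least one outgoing edge, so no rank is lost.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = run_pagerank(edges)
for vertex, rank in sorted(ranks.items()):
    print(vertex, round(rank, 3))
```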

Global Pulse: “Big Data for development”

To show the usefulness of Big Data mining, we would like to mention the work that Global Pulse is doing using Big Data to improve life in developing countries. Global Pulse is a United Nations initiative, launched in 2009, that functions as an innovative lab and is based on mining Big Data for developing countries. They pursue a strategy that consists of 1) researching innovative methods and techniques for analyzing real-time digital data to detect early emerging vulnerabilities; 2) assembling a free and open source technology toolkit for analyzing real-time data and sharing hypotheses; and 3) establishing an integrated, global network of Pulse Labs to pilot the approach at country level.

Global Pulse describes the main opportunities Big Data offers to developing countries in their white paper “Big Data for Development: Challenges & Opportunities”:

  • Early warning: develop a fast response in times of crisis by detecting anomalies in the usage of digital media
  • Real-time awareness: design programs and policies with a more fine-grained representation of reality
  • Real-time feedback: check which policies and programs fail by monitoring them in real time, and use this feedback to make the needed changes

The Big Data mining revolution is not restricted to the industrialized world, as mobile phones are spreading in developing countries as well. It is estimated that there are over five billion mobile phones, and that 80% are located in developing countries.

Controversy about Big Data

Big Data is a new and emerging hot topic that has generated a great deal of controversy:

  • There is no need to distinguish Big Data analytics from data analytics, as data will continue growing, and it will never be small again.
  • Big Data may be hype to sell Hadoop-based computing systems. Hadoop is not always the best tool. It seems that data management system sellers try to sell systems based on Hadoop, and MapReduce may not always be the best programming platform, for example for medium-size companies.
  • In real-time analytics, data may be changing. In that case, what is important is not the size of the data but its recency. Claims to accuracy are misleading. As Taleb explains in his new book ‘AntiFragile’, when the number of variables grows, the number of fake correlations also grows. For example, Leinweber showed that the S&P 500 stock index was correlated with butter production in Bangladesh, and other strange correlations.
  • Bigger data are not always better data. It depends on whether the data is noisy and whether it is representative of what we are looking for. For example, Twitter users are sometimes assumed to be representative of the global population, when this is not always the case.
  • Ethical concerns about accessibility. The main issue is whether it is ethical that people can be analyzed without knowing it.
  • Limited access to Big Data creates new digital divides. There may be a digital divide between people or organizations that are able to analyze Big Data and those that are not. Also, organizations with access to Big Data will be able to extract knowledge that others without access will not. We may create a division between Big Data rich and Big Data poor organizations.
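
The point about fake correlations in the list above can be checked with a small simulation. The sketch below is an illustration written for this text, not an analysis from Taleb or Leinweber: it draws a random target series and a growing number of unrelated random series, and reports the strongest correlation found by chance. The sample size of 20, the Gaussian noise, and the function names are assumptions.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def strongest_spurious_correlation(n_variables, n_samples=20, seed=0):
    """Largest |correlation| between a random target and n unrelated random variables."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    return max(
        abs(pearson(target, [rng.gauss(0, 1) for _ in range(n_samples)]))
        for _ in range(n_variables)
    )


# All series are pure noise, so any correlation that shows up is fake.
for n_variables in (10, 100, 1000):
    print(n_variables, round(strongest_spurious_correlation(n_variables), 2))
```

With a fixed seed, the variables examined in a smaller run are a prefix of those examined in a larger run, so the strongest chance correlation can only stay the same or grow as more variables are added.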

References

Wei Fan, Albert Bifet. Mining Big Data: Current Status, and Forecast to the Future. SIGKDD Explorations 14(2): 1-5 (2012).

Big Data Mining

The term “Big Data” appeared for the first time in 1998 in a Silicon Graphics (SGI) slide deck by John Mashey with the title “Big Data and the Next Wave of InfraStress”. Big Data mining was very relevant from the beginning, as the first book mentioning “Big Data” is a data mining book by Weiss and Indurkhya that also appeared in 1998. However, the first academic paper with the words “Big Data” in the title appeared a bit later, in 2000, in a paper by Diebold.

The origin of the term “Big Data” is due to the fact that we are creating a huge amount of data every day. Usama Fayyad, in his invited talk at the KDD BigMine’12 Workshop, presented amazing numbers about internet usage, among them the following: Google has more than 1 billion queries per day, Twitter has more than 250 million tweets per day, Facebook has more than 800 million updates per day, and YouTube has more than 4 billion views per day. The data produced nowadays is estimated to be on the order of zettabytes, and it is growing around 40% every year.
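
To get a feeling for what a growth of around 40% every year implies, the short calculation below derives the doubling time and the ten-year growth factor. It is a back-of-the-envelope sketch that assumes the rate stays constant and compounds annually, which the figures above do not guarantee.

```python
import math

annual_growth = 0.40  # the roughly 40% yearly growth quoted above

# Solve (1 + g)^t = 2 for t to obtain the doubling time in years.
doubling_time = math.log(2) / math.log(1 + annual_growth)

# Growth factor after ten years of compounding at the same rate.
ten_year_factor = (1 + annual_growth) ** 10

print(f"doubling time: {doubling_time:.2f} years")
print(f"growth over ten years: {ten_year_factor:.1f}x")
```

Under these assumptions the volume of data doubles roughly every two years and grows by a factor of almost thirty in a decade.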

A new large source of data is going to be generated from mobile devices, and big companies such as Google, Apple, Facebook, Yahoo, and Twitter are starting to look carefully at this data to find useful patterns to improve user experience. Alex ‘Sandy’ Pentland, in his ‘Human Dynamics Laboratory’ at MIT, is doing research on finding patterns in mobile data about what users do, not what they say they do.

We need new algorithms and new tools to deal with all of this data. Doug Laney was the first to mention the 3 V’s of Big Data management:

  • Volume: there is more data than ever before, and its size continues to increase, but the percentage of data that our tools can process does not
  • Variety: there are many different types of data, such as text, sensor data, audio, video, graph, and more
  • Velocity: data is arriving continuously as streams of data, and we are interested in obtaining useful information from it in real time

Nowadays, there are two more V’s:

  • Variability: there are changes in the structure of the data and how users want to interpret that data
  • Value: business value that gives organizations a competitive advantage, due to the ability to make decisions based on answering questions that were previously considered beyond reach

Gartner summarizes this in their definition of Big Data in 2012 as high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

There are many applications of Big Data, for example the following:

  • Business: customer personalization, churn detection
  • Technology: reducing process time from hours to seconds
  • Health: mining the DNA of each person to discover, monitor and improve health aspects of everyone
  • Smart cities: cities focused on sustainable economic development and high quality of life, with wise management of natural resources

These applications will allow people to have better services and better customer experiences, and also to be healthier, as personal data will make it possible to prevent and detect illness much earlier than before.
