I was happy to be invited to give a keynote talk at BASNA 2014, the 5th International Workshop on Business Applications of Social Network Analysis, that was co-located with the 2014 IEEE International Conference on Data Mining (ICDM 2014) in Shenzhen, China.
The talk was on Real-Time Big Data Stream Analytics, covering new techniques in Big Data mining that can adapt to changes using only a small amount of time and memory. As an example, I discussed a social network application of data stream mining to compute user influence probabilities. I also presented the MOA software framework, and the SAMOA distributed streaming software that runs on top of Storm, Samza and S4. Here are the slides:
Gianmarco de Francisci Morales presented our tutorial “Big Data Stream Mining” this week at IEEE Big Data 2014 in Washington DC.
This tutorial was a gentle introduction to mining big data streams. The first part introduced data stream learners for classification, regression, clustering, and frequent pattern mining. The second part discussed data stream mining on distributed engines such as Storm, S4, and Samza.
- Fundamentals and Stream Mining Algorithms
- Stream mining setting
- Concept drift
- Classification and Regression
- Frequent Pattern mining
- Distributed Big Data Stream Mining
- Distributed Stream Processing Engines
Slides available at: https://sites.google.com/site/iotminingtutorial/
Extreme classification, where one needs to deal with multi-class and multi-label problems involving a very large number of categories, has opened up a new research frontier in machine learning. Many challenging applications, such as photo and video annotation and web page categorization, can benefit from being formulated as supervised learning tasks with millions, or even billions, of categories. Extreme classification can also give a fresh perspective on core learning problems such as ranking and recommendation by reformulating them as multi-class/label tasks where each item to be ranked or recommended is a separate category.
4th edition of the Large Scale Hierarchical Text Classification (LSHTC) Challenge.
The LSHTC Challenge was a hierarchical text classification competition, using very large datasets. We were happy to be involved in the winning team with Antti Puurula and Jesse Read.
Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories and Wikipedia are two examples of such hierarchies. Along with their widespread use comes the need for automated classification of new documents to the categories in the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue, despite the vastness of available data: as more documents become available, more classes are also added to the hierarchy, and there is a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses challenges and opportunities for new learning methods.
The challenge concerned multi-label classification based on the Wikipedia dataset. The hierarchy is a graph that can have cycles. The number of categories is roughly 325,000 and the number of documents is 2,400,000. A document can appear in multiple classes.
Data is being generated in real time in increasing quantities, and the distribution generating this data may be changing and evolving. In a paper presented at ECML-PKDD 2013, titled “Pitfalls in benchmarking data stream classification and how to avoid them”, we show that classifying data streams has an important temporal component that is currently not considered in the evaluation of data-stream classifiers. We show how a very simple classifier that exploits this temporal component, the no-change classifier that simply predicts the last class seen, can outperform current state-of-the-art classifiers on some real-world datasets. We propose to evaluate data streams taking this temporal component into account, using a new evaluation measure that provides a more accurate gauge of classifier performance.
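To make the no-change baseline concrete, here is a minimal Python sketch (my own illustration, not the paper's code) of its prequential accuracy: for each arriving instance it predicts the previously seen class label, then observes the true label.

```python
def no_change_accuracy(labels):
    """Prequential accuracy of the no-change classifier on a label stream:
    predict the last class seen, then reveal the true label."""
    correct = 0
    last = None
    for y in labels:
        if y == last:   # the prediction is simply the previous label
            correct += 1
        last = y        # the true label is revealed after predicting
    return correct / len(labels)

# A stream with strong temporal dependence: long runs of the same class.
stream = ["a"] * 50 + ["b"] * 50
print(no_change_accuracy(stream))  # 0.98
```

On streams with long runs of the same class, this trivial baseline is hard to beat, which is exactly why evaluations that ignore temporal dependence can be misleading.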
A list of books and resources that are available online for learning Data Science:
- McKinsey Big data: The next frontier for innovation, competition, and productivity. Website
- O’Reilly Big Data Now: 2012 Edition. Website
- IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. Website
- Pentaho Real-Time Big Data Analytics: Emerging Architecture. Website
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Website
- Data Stream Mining: A Practical Approach. Website, Download
- Introduction to Information Retrieval: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Website
- Mining Massive Data Sets: Anand Rajaraman and Jeff Ullman, and Jure Leskovec. Website
Online Courses (suggested by Tim Osterbuhr)
- Free Berkeley course on big data analysis using the Twitter API. Website
- Extensive free data science course (good step-by-step approach). Website
- Coursera course to get a good foundation of algorithms. Website
Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Evolving data streams are contributing to the growth of data created over the last few years: every two days we now create as much data as we did from the dawn of time up until 2003. Evolving data stream methods are becoming a low-cost, green methodology for real-time online prediction and analysis.
Nowadays, the quantity of data created every two days is estimated to be 5 exabytes. Moreover, 2007 was estimated to be the first year in which it was no longer possible to store all the data we are producing. This massive amount of data opens new challenging discovery tasks. Real-time data stream analytics is needed to manage the data currently generated, at an ever increasing rate, by applications such as sensor networks, measurements in network monitoring and traffic management, log records or click-streams in web exploring, manufacturing processes, call detail records, email, blogging, Twitter posts and others. In fact, all data generated can be considered as streaming data, or as a snapshot of streaming data, since it is obtained from an interval of time. In the data stream model, data arrive at high speed, and algorithms that process them must do so under very strict constraints of space and time. Consequently, data streams pose several challenges for data mining algorithm design. First, algorithms must make use of limited resources (time and memory). Second, they must deal with data whose nature or distribution changes over time.
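As an illustration of the limited-memory constraint, here is a sketch of reservoir sampling (Vitter's Algorithm R, a classic stream technique, not tied to any specific system above): it maintains a uniform random sample of a stream of unknown length in one pass with O(k) memory.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform sample of up to k items from a stream of unknown
    length, using O(k) memory and a single pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)           # fill the reservoir first
        else:
            j = rng.randint(0, i)            # item i survives with prob k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10)
print(len(sample))  # 10
```

The same "bounded memory, one look at each item" discipline underlies most data stream mining algorithms.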
Invited Talk: 100 Years of Alan Turing and 20 years of SLAIS, Slovenia, 2012
The Big Data phenomenon is intrinsically related to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter and LinkedIn benefit from and contribute to open source projects. Big Data infrastructure is built around Hadoop and related software such as:
- Apache Hadoop: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS). Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel. This mapping step is then followed by a step of reduce tasks, which use the output of the maps to obtain the final result of the job.
- Apache Hadoop related projects: Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper, Apache Cassandra, Cascading, Scribe and many others.
- Apache S4: platform for processing continuous data streams. S4 is designed specifically for managing data streams. S4 apps are built by combining streams and processing elements that operate in real time.
- Storm: software for streaming data-intensive distributed applications, similar to S4, and developed by Nathan Marz at Twitter.
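The map/shuffle/reduce flow described for Hadoop above can be sketched in-process with plain Python (a toy word-count illustration of the model, not Hadoop's actual API):

```python
from collections import defaultdict
from itertools import chain

def map_task(line):
    """Map: emit a (key, value) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def reduce_task(key, values):
    """Reduce: combine all values emitted for one key."""
    return (key, sum(values))

def mapreduce(lines):
    # Map phase: each split (here, a line) is processed independently.
    mapped = chain.from_iterable(map_task(line) for line in lines)
    # Shuffle: group intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: one reduce task per key produces the final result.
    return dict(reduce_task(k, v) for k, v in groups.items())

print(mapreduce(["big data", "data stream data"]))  # {'big': 1, 'data': 3, 'stream': 1}
```

Because the map tasks share no state, Hadoop can run them in parallel across a cluster; the shuffle and reduce steps then combine their outputs.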
In Big Data Mining, there are many open source initiatives. The most popular are the following:
- Apache Mahout: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.
- R: open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand beginning in 1993 and is used for statistical analysis of very large data sets.
- MOA: stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent itemset mining and frequent graph mining. It started as a project of the Machine Learning group at the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm. SAMOA is a new upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.
- Vowpal Wabbit: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets. It can exceed the throughput of any single machine network interface when doing linear learning, via parallel learning.
More specific to Big Graph mining we found the following open source tools:
- Pegasus: big graph mining system built on top of MapReduce. It makes it possible to find patterns and anomalies in massive real-world graphs.
- GraphLab: high-level graph-parallel system built without using MapReduce. GraphLab computes over dependent records which are stored as vertices in a large distributed data-graph. Algorithms in GraphLab are expressed as vertex-programs which are executed in parallel on each vertex and can interact with neighboring vertices.
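The vertex-program idea behind GraphLab can be sketched in plain Python (a simplified single-machine illustration, not GraphLab's actual API), here with a PageRank-style update in which each vertex recomputes its value from its in-neighbors:

```python
def pagerank(in_neighbors, out_degree, damping=0.85, iters=20):
    """Run a PageRank-style vertex-program: each vertex repeatedly
    updates its rank from the ranks of its in-neighbors."""
    n = len(in_neighbors)
    rank = {v: 1.0 / n for v in in_neighbors}
    for _ in range(iters):
        new_rank = {}
        for v, nbrs in in_neighbors.items():  # one "vertex-program" per vertex
            incoming = sum(rank[u] / out_degree[u] for u in nbrs)
            new_rank[v] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Tiny 3-vertex cycle: a -> b -> c -> a (keys map a vertex to its in-neighbors).
graph = {"a": ["c"], "b": ["a"], "c": ["b"]}
degrees = {"a": 1, "b": 1, "c": 1}
ranks = pagerank(graph, degrees)
print(round(sum(ranks.values()), 6))  # 1.0
```

In GraphLab the same per-vertex update runs in parallel over a distributed data-graph, with the engine handling scheduling and neighbor communication.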
To show the usefulness of Big Data mining, we would like to mention the work that Global Pulse is doing using Big Data to improve life in developing countries. Global Pulse is a United Nations initiative, launched in 2009, that functions as an innovation lab based on mining Big Data for developing countries. It pursues a strategy that consists of 1) researching innovative methods and techniques for analyzing real-time digital data to detect early emerging vulnerabilities; 2) assembling a free and open source technology toolkit for analyzing real-time data and sharing hypotheses; and 3) establishing an integrated, global network of Pulse Labs to pilot the approach at the country level.
Global Pulse describes the main opportunities Big Data offers to developing countries in its white paper “Big Data for Development: Challenges & Opportunities”:
- Early warning: develop fast response in time of crisis, detecting anomalies in the usage of digital media
- Real-time awareness: design programs and policies with a more fine-grained representation of reality
- Real-time feedback: check which policies and programs fail by monitoring them in real time, and use this feedback to make the needed changes
The Big Data mining revolution is not restricted to the industrialized world, as mobiles are spreading in developing countries as well. It is estimated that there are over five billion mobile phones, and that 80% are located in developing countries.