Keynote Talk at Business Applications of Social Network Analysis (BASNA) 2014

I was happy to be invited to give a keynote talk at BASNA 2014, the 5th International Workshop on Business Applications of Social Network Analysis, that was co-located with the 2014 IEEE International Conference on Data Mining (ICDM 2014) in Shenzhen, China.

The talk was on Real-Time Big Data Stream Analytics: new techniques in Big Data mining that use only a small amount of time and memory and are able to adapt to changes in the data. As an example, I discussed a social network application of data stream mining: computing user influence probabilities. I also presented the MOA software framework, and the SAMOA distributed streaming software that runs on top of Storm, Samza and S4. Here are the slides:

Big Data Stream Mining Tutorial at IEEE Big Data 2014

Gianmarco de Francisci Morales presented our tutorial “Big Data Stream Mining” this week at IEEE Big Data 2014 in Washington, DC.

This tutorial was a gentle introduction to mining big data streams. The first part introduced data stream learners for classification, regression, clustering, and frequent pattern mining. The second part discussed data stream mining on distributed engines such as Storm, S4, and Samza.


  1. Fundamentals and Stream Mining Algorithms
    • Stream mining setting
    • Concept drift
    • Classification and Regression
    • Clustering
    • Frequent Pattern mining
  2. Distributed Big Data Stream Mining
    • Distributed Stream Processing Engines
    • Classification
    • Regression

Slides available here:

Extreme Classification: Classify Wikipedia documents into one of 325,056 categories

Extreme classification, where one needs to deal with multi-class and multi-label problems involving a very large number of categories, has opened up a new research frontier in machine learning. Many challenging applications, such as photo and video annotation and web page categorization, can benefit from being formulated as supervised learning tasks with millions, or even billions, of categories. Extreme classification can also give a fresh perspective on core learning problems such as ranking and recommendation by reformulating them as multi-class/label tasks where each item to be ranked or recommended is a separate category.

4th edition of the Large Scale Hierarchical Text Classification (LSHTC) Challenge.

The LSHTC Challenge was a hierarchical text classification competition using very large datasets. We were happy to be part of the winning team with Antti Puurula and Jesse Read.


Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories and Wikipedia are two examples of such hierarchies. Along with their widespread use comes the need for automated classification of new documents to the categories in the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue, despite the vastness of available data: as more documents become available, more classes are also added to the hierarchy, and there is a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses challenges and opportunities for new learning methods.

The challenge concerned multi-label classification based on the Wikipedia dataset. The hierarchy is a graph that can have cycles. There are roughly 325,000 categories and 2,400,000 documents, and a document can appear in multiple classes.
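Working at this scale requires specialized methods, but the basic multi-label setting is easy to illustrate. The sketch below is not the winning team's approach; it is a toy 1-nearest-neighbour classifier that copies the label set of the most similar training document, using Jaccard similarity over bags of words (all documents and labels are made up):

```python
# Toy multi-label text classification: 1-NN that copies the label set
# of the most similar training document, measured by Jaccard similarity
# over sets of words.

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def predict(train, doc):
    """train: list of (text, label_set); returns labels of nearest doc."""
    best = max(train, key=lambda ex: jaccard(tokens(ex[0]), tokens(doc)))
    return best[1]

train = [
    ("storm clusters process data streams", {"computing", "streaming"}),
    ("rain storm hits the coast tonight", {"weather"}),
]
print(predict(train, "streams of data processed by clusters"))
```

A document naturally receives several labels at once here, which is exactly what distinguishes multi-label from multi-class classification.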

Evolving Data Stream Classification and the Illusion of Progress

Data is being generated in real time in increasing quantities, and the distribution generating this data may be changing and evolving. In a paper presented at ECML-PKDD 2013 titled “Pitfalls in benchmarking data stream classification and how to avoid them”, we show that classifying data streams has an important temporal component that we are currently not considering in the evaluation of data-stream classifiers. In this paper we show how a very simple classifier that exploits this temporal component, the no-change classifier that predicts using only the last class seen, can outperform current state-of-the-art classifiers on some real-world datasets. We propose to evaluate data stream classifiers taking this temporal component into account, using a new evaluation measure that provides a more accurate gauge of classifier performance.
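The no-change baseline and its prequential (test-then-train) evaluation fit in a few lines. The stream below is synthetic, chosen to mimic the strong temporal dependence of many real streams:

```python
# The "no-change" baseline: predict the last class label seen, evaluated
# prequentially (test on each arriving example, then "train" by
# remembering its label).

def prequential_accuracy_no_change(stream):
    correct, total, last = 0, 0, None
    for label in stream:
        if last is not None:
            correct += (label == last)   # test: predict the last label
            total += 1
        last = label                     # train: remember the new label
    return correct / total

# A synthetic stream whose labels persist over time
stream = ["up"] * 50 + ["down"] * 50 + ["up"] * 50
print(prequential_accuracy_no_change(stream))  # errs only at the 2 change points
```

On a stream like this, the baseline misclassifies only at the change points, which is why ignoring temporal dependence in evaluation can create an illusion of progress.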

Resources to learn Big Data Analytics

A list of books and resources that are available online for learning Data Science:


  • McKinsey Big data: The next frontier for innovation, competition, and productivity. Website
  • O’Reilly Big Data Now: 2012 Edition. Website
  • IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. Website
  • Pentaho Real-Time Big Data Analytics: Emerging Architecture. Website


  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Website
  • Data Stream Mining: A Practical Approach. Website, Download
  • Introduction to Information Retrieval: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Website
  • Mining Massive Data Sets: Anand Rajaraman, Jeff Ullman, and Jure Leskovec. Website


Online Courses (suggested by Tim Osterbuhr)

  • Free Berkeley course on big data analysis using the Twitter API. Website
  • Extensive free data science course (good step-by-step approach). Website
  • Coursera course to get a good foundation of algorithms. Website

Mining Big Data in Real Time

Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Evolving data streams are contributing to the growth of data created over the last few years: we now create as much data every two days as we created from the dawn of time up until 2003. Evolving data stream methods are becoming a low-cost, green methodology for real-time online prediction and analysis.

Nowadays, the quantity of data that is created every two days is estimated to be 5 exabytes. Moreover, it has been estimated that 2007 was the first year in which it was not possible to store all the data that we were producing. This massive amount of data opens new challenging discovery tasks. Real-time data stream analytics is needed to manage the data currently generated, at an ever-increasing rate, by applications such as sensor networks, measurements in network monitoring and traffic management, log records or click-streams in web exploration, manufacturing processes, call detail records, email, blogging, Twitter posts, and others. In fact, all data generated can be considered as streaming data, or as a snapshot of streaming data, since it is obtained from an interval of time. In the data stream model, data arrive at high speed, and algorithms that process them must do so under very strict constraints of space and time. Consequently, data streams pose several challenges for data mining algorithm design. First, algorithms must make use of limited resources (time and memory). Second, they must deal with data whose nature or distribution changes over time.
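To make these resource constraints concrete, here is a classic one-pass algorithm that fits the data stream model: reservoir sampling maintains a uniform random sample of a fixed size k using O(k) memory, however long the stream is:

```python
import random

# Reservoir sampling (Algorithm R): one pass over the stream, O(k)
# memory, and every stream item ends up in the sample with equal
# probability k/n.

def reservoir_sample(stream, k, seed=42):
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 5))
```

The sample can be updated as each item arrives and then discarded, which is exactly the single-pass, bounded-memory discipline that stream mining algorithms must follow.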

Invited Talk: 100 Years of Alan Turing and 20 years of SLAIS, Slovenia, 2012

Big Data Mining SIGKDD Explorations

SigKDD Explorations

For the Big Data Mining SIGKDD Explorations Dec 2012, we selected four contributions that together show very significant state-of-the-art research in Big Data Mining, and that provide a broad overview of the field and a forecast to the future.

  • Scaling Big Data Mining Infrastructure: The Twitter Experience by Jimmy Lin and Dmitriy Ryaboy (Twitter, Inc.). This paper presents insights about Big Data mining infrastructures, and the experience of doing analytics at Twitter. It shows that due to the current state of the data mining tools, it is not straightforward to perform analytics. Most of the time is consumed in preparatory work to the application of data mining methods, and turning preliminary models into robust solutions.
  • Mining Heterogeneous Information Networks: A Structural Analysis Approach by Yizhou Sun (Northeastern University) and Jiawei Han (University of Illinois at Urbana-Champaign). This paper shows that mining heterogeneous information networks is a new and promising research frontier in Big Data mining research. It considers interconnected, multi-typed data, including the typical relational database data, as heterogeneous information networks. These semi-structured heterogeneous information network models leverage the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data.
  • Big Graph Mining: Algorithms and Discoveries by U Kang and Christos Faloutsos (Carnegie Mellon University). This paper presents an overview of mining big graphs, focusing on the use of the Pegasus tool, and shows some findings in the Web graph and the Twitter social network. The paper gives inspirational future research directions for big graph mining.
  • Mining Large Streams of User Data for Personalized Recommendations by Xavier Amatriain (Netflix). This paper presents some lessons learned with the Netflix Prize, and discusses the recommender and personalization techniques used in Netflix. It discusses recent important problems and future research directions. Section 4 contains an interesting discussion about whether we need more data or better models to improve our learning methodology.

Big Data Mining Future Challenges


There are many important future challenges in Big Data management and analytics that arise from the nature of the data: large, diverse, and evolving. These are some of the challenges that researchers and practitioners will have to deal with in the years to come:

  • Analytics Architecture. It is not clear yet how an optimal architecture of an analytics system should be constructed to deal with historic data and with real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general, extensible, allows ad hoc queries, minimal maintenance, and debuggable.
  • Evaluation. It is important to achieve significant statistical results, and not be fooled by randomness. As Efron explains in his book on Large Scale Inference, it is easy to go wrong with huge data sets and thousands of questions to answer at once. It will also be important to avoid the trap of focusing only on error or speed, as Kiri Wagstaff discusses in her paper “Machine Learning that Matters”.
  • Distributed mining. Many data mining techniques are not trivial to parallelize. To have distributed versions of some methods, a lot of research is needed, with practical and theoretical analysis, to provide new methods.
  • Time evolving data. Data may be evolving over time, so it is important that the Big Data mining techniques should be able to adapt and in some cases to detect change first. For example, the data stream mining field has very powerful techniques for this task.
  • Compression: When dealing with Big Data, the quantity of space needed to store it is very relevant. There are two main approaches: compression, where we do not lose anything, and sampling, where we choose the data that is most representative. Using compression, we may take more time and less space, so we can consider it a transformation from time to space. Using sampling, we lose information, but the gains in space may be of orders of magnitude. For example, Feldman et al. use coresets to reduce the complexity of Big Data problems. Coresets are small sets that provably approximate the original data for a given problem. Using merge-reduce, the small sets can then be used for solving hard machine learning problems in parallel.
  • Visualization. A main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks to tell and show stories will be needed, as for example the photographs, infographics and essays in the beautiful book “The Human Face of Big Data”.
  • Hidden Big Data. Large quantities of useful data are getting lost, since much new data is untagged, file-based, and unstructured. The 2012 IDC study on Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would be useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.
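The Lambda Architecture mentioned in the list above can be mimicked in a toy, in-process sketch (a stand-in in plain Python, not Hadoop or Storm): a batch layer that recomputes totals from the immutable master dataset, a speed layer that counts events not yet absorbed by a batch run, and a serving layer that merges both views at query time:

```python
from collections import Counter

# Toy Lambda architecture: batch layer + speed layer + serving layer.
master_dataset = []          # immutable, append-only log of all events
batch_view = Counter()       # precomputed periodically by the batch layer
speed_view = Counter()       # incremental, covers only recent events

def ingest(event):
    master_dataset.append(event)
    speed_view[event] += 1                # speed layer: update in real time

def run_batch_layer():
    global batch_view
    batch_view = Counter(master_dataset)  # recompute from scratch
    speed_view.clear()                    # recent events now in the batch view

def query(event):
    # serving layer: merge the batch view with the real-time view
    return batch_view[event] + speed_view[event]

for e in ["click", "click", "view"]:
    ingest(e)
run_batch_layer()
ingest("click")
print(query("click"))  # 3: two from the batch view, one from the speed view
```

The design choice is the one described above: the batch layer gives robustness (everything can be recomputed from the master dataset), while the speed layer keeps query answers current between batch runs.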

Big Data Mining Tools


The Big Data phenomenon is intrinsically related to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter, and LinkedIn benefit from and contribute to open source projects. Big Data infrastructure deals with Hadoop and other related software such as:

  • Apache Hadoop: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS). Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel. This mapping step is then followed by a step of reduce tasks, which use the output of the maps to obtain the final result of the job.
  • Apache Hadoop related projects: Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper, Apache Cassandra, Cascading, Scribe and many others.
  • Apache S4: platform for processing continuous data streams. S4 is designed specifically for managing data streams. S4 apps are designed combining streams and processing elements in real time.
  • Storm: software for streaming data-intensive distributed applications, similar to S4, and developed by Nathan Marz at Twitter.
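The MapReduce flow described for Hadoop above can be mimicked in-process. The word-count sketch below is a stand-in in plain Python, not the Hadoop API: map tasks emit (key, value) pairs from independent input splits, a shuffle groups values by key, and reduce tasks aggregate each group:

```python
from collections import defaultdict
from itertools import chain

# In-process sketch of the MapReduce flow: map -> shuffle -> reduce.

def map_task(line):
    # In Hadoop, map tasks run in parallel, one per input split.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, so each reducer sees one key's values.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    return key, sum(values)

splits = ["big data streams", "big data mining"]
mapped = chain.from_iterable(map_task(s) for s in splits)
result = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'streams': 1, 'mining': 1}
```

Because each map task only reads its own split and each reduce task only reads its own key group, both phases parallelize across a cluster with no coordination beyond the shuffle.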

In Big Data Mining, there are many open source initiatives. The most popular are the following:

  • Apache Mahout: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering, and frequent pattern mining.
  • R: open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand beginning in 1993 and is used for statistical analysis of very large data sets.
  • MOA: stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent itemset mining, and frequent graph mining. It started as a project of the Machine Learning group at the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android, and Storm. SAMOA is a new upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.
  • Vowpal Wabbit: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets. It can exceed the throughput of any single machine network interface when doing linear learning, via parallel learning.
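The kind of fast linear learning that Vowpal Wabbit performs can be illustrated with a minimal sketch (this is not the VW implementation): online logistic regression trained by stochastic gradient descent, one sparse example at a time:

```python
import math

# Online logistic regression with SGD over sparse feature dicts:
# one pass, one update per example, memory proportional to the
# number of features seen.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(weights, features, label, lr=0.5):
    """One online step; label in {0, 1}; features: {name: value}."""
    z = sum(weights.get(f, 0.0) * v for f, v in features.items())
    error = sigmoid(z) - label
    for f, v in features.items():
        weights[f] = weights.get(f, 0.0) - lr * error * v

weights = {}
data = [({"spam_word": 1.0}, 1), ({"ham_word": 1.0}, 0)] * 200
for features, label in data:
    sgd_update(weights, features, label)

print(sigmoid(weights["spam_word"]) > 0.9)  # confidently predicts class 1
```

Each example is processed once and discarded, which is what lets this style of learner stream through terafeature datasets without ever holding them in memory.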

More specific to Big Graph mining we found the following open source tools:

  • Pegasus: big graph mining system built on top of MapReduce. It allows finding patterns and anomalies in massive real-world graphs.
  • GraphLab: high-level graph-parallel system built without using MapReduce. GraphLab computes over dependent records which are stored as vertices in a large distributed data-graph. Algorithms in GraphLab are expressed as vertex-programs which are executed in parallel on each vertex and can interact with neighboring vertices.
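The vertex-program style can be illustrated with a toy PageRank (plain Python, not the GraphLab API): each vertex repeatedly recomputes its rank from the values of its in-neighbours:

```python
# Toy PageRank in vertex-program style: the body of the inner loop is
# the per-vertex program, which reads only the ranks of neighbouring
# vertices.

def pagerank(out_edges, iterations=30, d=0.85):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(iterations):
        new = {}
        for v in out_edges:               # the "vertex program" for v
            incoming = sum(rank[u] / len(out_edges[u])
                           for u in out_edges if v in out_edges[u])
            new[v] = (1 - d) / n + d * incoming
        rank = new
    return rank

graph = {"a": {"b"}, "b": {"c"}, "c": {"a", "b"}}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "b" receives the most incoming weight
```

Because each vertex program touches only its neighbours, systems like GraphLab can run many of them in parallel over a distributed data-graph.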

Global Pulse: “Big Data for development”

UN Global Pulse

To show the usefulness of Big Data mining, we would like to mention the work that Global Pulse is doing using Big Data to improve life in developing countries. Global Pulse is a United Nations initiative, launched in 2009, that functions as an innovation lab and is based on mining Big Data for developing countries. They pursue a strategy that consists of 1) researching innovative methods and techniques for analyzing real-time digital data to detect early emerging vulnerabilities; 2) assembling a free and open source technology toolkit for analyzing real-time data and sharing hypotheses; and 3) establishing an integrated, global network of Pulse Labs to pilot the approach at country level.

Global Pulse describes the main opportunities Big Data offers to developing countries in their white paper “Big Data for Development: Challenges & Opportunities”:

  • Early warning: develop fast responses in times of crisis, detecting anomalies in the usage of digital media
  • Real-time awareness: design programs and policies with a more fine-grained representation of reality
  • Real-time feedback: detect which policies and programs are failing by monitoring them in real time, and use this feedback to make the needed changes

The Big Data mining revolution is not restricted to the industrialized world, as mobiles are spreading in developing countries as well. It is estimated that there are over five billion mobile phones, and that 80% are located in developing countries.