Invited Talk at Asian Conference on Machine Learning (ACML) 2016

I was very happy to give an invited talk at the 8th Asian Conference on Machine Learning (ACML 2016) in Hamilton, New Zealand.

The talk was on Massive Online Analytics for the Internet of Things (IoT). The challenge of deriving insights from the IoT has been recognized as one of the most exciting and important opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of data mining research as the number of applications requiring such processing increases. Dealing with the evolution of such data streams over time, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I presented an overview of data stream mining and introduced MOA as the most popular open source tool for data stream mining.

IoT Big Data Stream Mining Tutorial at KDD 2016

This week we presented our tutorial “IoT Big Data Stream Mining” at KDD 2016 in San Francisco.

This tutorial was a gentle introduction to mining IoT big data streams. The first part introduced data stream learners for classification, regression, clustering, and frequent pattern mining. The second part dealt with scalability issues inherent in IoT applications and discussed how to mine data streams on distributed engines such as Spark, Flink, Storm, and Samza.

Outline:

  • 1. IoT Fundamentals and Stream Mining Algorithms
    •  IoT Stream mining setting
    •  Concept drift
    •  Classification and Regression
    •  Clustering
    •  Frequent Pattern mining
    •  Concept Evolution
    •  Limited Labeled Learning
  • 2. IoT Distributed Big Data Stream Mining
    •  Distributed Stream Processing Engines
    •  Classification
    •  Regression
    •  Open Source Tools
    •  Applications

Slides available at: https://sites.google.com/site/iotminingtutorial/

Keynote Talk at Business Applications of Social Network Analysis (BASNA) 2014

I was happy to give a keynote talk at BASNA 2014, the 5th International Workshop on Business Applications of Social Network Analysis, which was co-located with the 2014 IEEE International Conference on Data Mining (ICDM 2014) in Shenzhen, China.

The talk was on Real-Time Big Data Stream Analytics, covering new techniques in Big Data mining that can adapt to changes using only a small amount of time and memory. As an example, I discussed a social network application of data stream mining that computes user influence probabilities. I also presented the MOA software framework and SAMOA, the distributed streaming software that runs on top of Storm, Samza and S4. Here are the slides:

Big Data Stream Mining Tutorial at IEEE Big Data 2014

This week Gianmarco de Francisci Morales presented our tutorial “Big Data Stream Mining” at IEEE Big Data 2014 in Washington DC.

This tutorial was a gentle introduction to mining big data streams. The first part introduced data stream learners for classification, regression, clustering, and frequent pattern mining. The second part discussed data stream mining on distributed engines such as Storm, S4, and Samza.

Outline:

  1. Fundamentals and Stream Mining Algorithms
    • Stream mining setting
    • Concept drift
    • Classification and Regression
    • Clustering
    • Frequent Pattern mining
  2. Distributed Big Data Stream Mining
    • Distributed Stream Processing Engines
    • Classification
    • Regression

Slides available at: https://sites.google.com/site/iotminingtutorial/

Extreme Classification: Classify Wikipedia documents into one of 325,056 categories

Extreme classification, where one needs to deal with multi-class and multi-label problems involving a very large number of categories, has opened up a new research frontier in machine learning. Many challenging applications, such as photo and video annotation and web page categorization, can benefit from being formulated as supervised learning tasks with millions, or even billions, of categories. Extreme classification can also give a fresh perspective on core learning problems such as ranking and recommendation by reformulating them as multi-class/label tasks where each item to be ranked or recommended is a separate category.

4th edition of the Large Scale Hierarchical Text Classification (LSHTC) Challenge.

The LSHTC Challenge was a hierarchical text classification competition, using very large datasets. We were happy to be involved in the winning team with Antti Puurula and Jesse Read.

http://arxiv.org/abs/1405.0546

Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories and Wikipedia are two examples of such hierarchies. Along with their widespread use comes the need for automated classification of new documents to the categories in the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue, despite the vastness of available data: as more documents become available, more classes are also added to the hierarchy, and there is a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses challenges and opportunities for new learning methods.

The challenge concerned multi-label classification based on the Wikipedia dataset. The hierarchy is a graph that can have cycles. There are roughly 325,000 categories and 2,400,000 documents, and a document can appear in multiple classes.

https://www.kaggle.com/c/lshtc

Evolving Data Stream Classification and the Illusion of Progress

Data is being generated in real time in increasing quantities, and the distribution generating this data may be changing and evolving. In a paper presented at ECML-PKDD 2013 titled “Pitfalls in benchmarking data stream classification and how to avoid them”, we show that classifying data streams has an important temporal component, which is currently not considered in the evaluation of data stream classifiers. In this paper we show how a very simple classifier that takes this temporal component into account, the no-change classifier that always predicts the last class it has seen, can outperform current state-of-the-art classifiers on some real-world datasets. We propose to evaluate data stream classifiers with this temporal component in mind, using a new evaluation measure that provides a more accurate gauge of classifier performance.
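
To make the idea concrete, here is a minimal sketch of a test-then-train evaluation on a toy label stream, comparing the no-change classifier with a majority-class baseline. The function names, the toy stream, and the normalized score at the end are illustrative assumptions; in particular, the score is only one plausible way to measure a classifier against the no-change baseline and is not claimed to be the exact measure proposed in the paper.

```python
from collections import Counter


def no_change_accuracy(labels):
    """Accuracy of always predicting the previously seen label."""
    hits = sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
    return hits / (len(labels) - 1)


def majority_class_accuracy(labels):
    """Accuracy of predicting the most frequent label seen so far.

    The first label is only used for training, so both baselines
    are scored on the same len(labels) - 1 instances.
    """
    counts = Counter(labels[:1])
    hits = 0
    for y in labels[1:]:
        if counts.most_common(1)[0][0] == y:  # test first...
            hits += 1
        counts[y] += 1                        # ...then train
    return hits / (len(labels) - 1)


def gain_over_no_change(accuracy, baseline):
    """Hypothetical normalized score: 0 means no better than no-change,
    1 means perfect. Assumes baseline < 1."""
    return (accuracy - baseline) / (1.0 - baseline)


# Toy stream with strong temporal dependence: labels arrive in runs.
stream = ["a"] * 5 + ["b"] * 5 + ["a"] * 5 + ["b"] * 5

nc = no_change_accuracy(stream)
mc = majority_class_accuracy(stream)
print(f"no-change accuracy:      {nc:.3f}")
print(f"majority-class accuracy: {mc:.3f}")
print(f"majority vs. no-change:  {gain_over_no_change(mc, nc):.3f}")
```

On a stream like this, where labels arrive in runs, the no-change baseline is hard to beat, so a classifier's raw accuracy alone says little about whether it has learned anything.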

Resources to learn Big Data Analytics

A list of books and resources that are available online for learning Data Science:

Industry

  • McKinsey Big data: The next frontier for innovation, competition, and productivity. Website
  • O’Reilly Big Data Now: 2012 Edition. Website
  • IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. Website
  • Pentaho Real-Time Big Data Analytics: Emerging Architecture. Website

Academia

  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Website
  • Data Stream Mining: A Practical Approach. Website, Download
  • Introduction to Information Retrieval: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Website
  • Mining Massive Data Sets: Anand Rajaraman, Jeff Ullman, and Jure Leskovec. Website

 

Online Courses (suggested by Tim Osterbuhr)

  • Free Berkeley course on big data analysis using the Twitter API. Website
  • Extensive free data science course (good step-by-step approach). Website
  • Coursera course to get a good foundation of algorithms. Website

Mining Big Data in Real Time

Analyzing streaming data in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear and to detect new trends that help improve their performance. Evolving data streams are contributing to the growth of data created over the last few years. Every two days we now create as much data as we created from the dawn of time up until 2003. Evolving data stream methods are becoming a low-cost, green methodology for real-time online prediction and analysis.

Nowadays, the quantity of data created every two days is estimated to be 5 exabytes. Moreover, it was estimated that 2007 was the first year in which it was not possible to store all the data being produced. This massive amount of data opens up new and challenging discovery tasks. Real-time data stream analytics are needed to manage the data currently generated, at an ever-increasing rate, by applications such as sensor networks, measurements in network monitoring and traffic management, log records or click-streams in web browsing, manufacturing processes, call detail records, email, blogging, Twitter posts and others. In fact, all data generated can be considered as streaming data or as a snapshot of streaming data, since it is obtained from an interval of time. In the data stream model, data arrive at high speed, and algorithms that process them must do so under very strict constraints of space and time. Consequently, data streams pose several challenges for data mining algorithm design. First, algorithms must make use of limited resources (time and memory). Second, they must deal with data whose nature or distribution changes over time.
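
The two constraints above can be illustrated with a small sketch: a one-pass summary that keeps only a running mean and a fixed-size window of recent values, and flags a possible change when the two disagree. The class name, window size, and threshold rule are assumptions made for illustration; this is a deliberately naive check, not the change detector used in MOA or any other specific tool.

```python
from collections import deque


class StreamSummary:
    """One-pass summary of a numeric stream using bounded memory."""

    def __init__(self, window_size=50):
        self.n = 0                                # items seen so far
        self.mean = 0.0                           # mean of the whole stream
        self.window = deque(maxlen=window_size)   # most recent items only

    def update(self, x):
        # Incremental mean: each item is processed once, then discarded
        # (apart from the copy kept in the bounded window).
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.window.append(x)

    def recent_mean(self):
        return sum(self.window) / len(self.window)

    def drifted(self, threshold):
        # Naive change check: recent data disagrees with the long-run mean.
        return abs(self.recent_mean() - self.mean) > threshold


summary = StreamSummary(window_size=50)

for _ in range(200):          # stable phase
    summary.update(0.0)
print("change flagged after stable phase:", summary.drifted(threshold=1.0))

for _ in range(100):          # the distribution shifts
    summary.update(10.0)
print("change flagged after shift:", summary.drifted(threshold=1.0))
```

Memory use stays fixed no matter how long the stream runs, which is the point of the exercise; a practical detector would also need a principled way to choose the window size and threshold.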


Invited Talk: 100 Years of Alan Turing and 20 years of SLAIS, Slovenia, 2012

Big Data Mining SIGKDD Explorations

For the Big Data Mining SIGKDD Explorations Dec 2012, we selected four contributions that together show very significant state-of-the-art research in Big Data Mining, and that provide a broad overview of the field and a forecast to the future.

  • Scaling Big Data Mining Infrastructure: The Twitter Experience by Jimmy Lin and Dmitriy Ryaboy (Twitter, Inc.). This paper presents insights about Big Data mining infrastructures, and the experience of doing analytics at Twitter. It shows that due to the current state of the data mining tools, it is not straightforward to perform analytics. Most of the time is consumed in preparatory work to the application of data mining methods, and turning preliminary models into robust solutions.
  • Mining Heterogeneous Information Networks: A Structural Analysis Approach by Yizhou Sun (Northeastern University) and Jiawei Han (University of Illinois at Urbana-Champaign). This paper shows that mining heterogeneous information networks is a new and promising research frontier in Big Data mining research. It considers interconnected, multi-typed data, including the typical relational database data, as heterogeneous information networks. These semi-structured heterogeneous information network models leverage the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data.
  • Big Graph Mining: Algorithms and discoveries by U Kang and Christos Faloutsos (Carnegie Mellon University). This paper presents an overview of mining big graphs, focusing on the use of the Pegasus tool, showing some findings in the Web Graph and Twitter social networks. The paper gives inspirational future research directions for big graph mining.
  • Mining Large Streams of User Data for Personalized Recommendations by Xavier Amatriain (Netflix). This paper presents some lessons learned from the Netflix Prize, and discusses the recommender and personalization techniques used at Netflix. It discusses recent important problems and future research directions. Section 4 contains an interesting discussion of whether we need more data or better models to improve our learning methodology.