Machine learning for streaming data: state of the art, challenges, and opportunities

Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., Velocity and Volume. Given the current industry needs, there are many challenges to be addressed before existing methods can be efficiently applied to real-world problems.

In this work, we focus on elucidating the connections among the current state of the art in related fields and on clarifying open challenges in both academia and industry. We treat with special care topics that were not thoroughly investigated in past position and survey papers.

This work aims to evoke discussion and elucidate the current research opportunities, highlighting the relationship of different subareas and suggesting courses of action when possible.

Heitor Murilo Gomes, Jesse Read, Albert Bifet, Jean Paul Barddal, João Gama: Machine learning for streaming data: state of the art, challenges, and opportunities. SIGKDD Explorations 21(2): 6-22 (2019)

Paper at Research Gate

Streaming Random Patches

The Streaming Random Patches (SRP) algorithm is a new ensemble method specially adapted to stream classification which combines random subspaces and online bagging. We provide theoretical insights and empirical results illustrating different aspects of SRP. In particular, we explain how the widely adopted incremental Hoeffding trees are not, in fact, unstable learners, unlike their batch counterparts, and how this fact significantly influences ensemble methods design and performance. We compare SRP against state-of-the-art ensemble variants for streaming data on a multitude of datasets. The results show that SRP produces high predictive performance for both real and synthetic datasets. In addition, we analyze the diversity over time and the average tree depth, which provides insights into the differences between local subspace randomization (as in random forest) and global subspace randomization (as in random subspaces).
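The combination of random subspaces and online bagging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SRP implementation from the paper: the class names `RandomPatchesSketch` and `CentroidLearner` are hypothetical, the base learner is a toy nearest-centroid model standing in for a Hoeffding tree, the Poisson parameter `lam=1.0` is the classic online bagging choice rather than a value taken from the paper, and any further components of the actual algorithm are omitted.

```python
import math
import random
from collections import defaultdict


def poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class CentroidLearner:
    """Toy incremental base learner (hypothetical stand-in for a Hoeffding tree)."""

    def __init__(self):
        self.sums = {}                     # per-class weighted feature sums
        self.counts = defaultdict(float)   # per-class total weight

    def learn_one(self, x, y, weight=1.0):
        if y not in self.sums:
            self.sums[y] = [0.0] * len(x)
        for i, v in enumerate(x):
            self.sums[y][i] += weight * v
        self.counts[y] += weight

    def predict_one(self, x):
        best, best_dist = None, float("inf")
        for y, s in self.sums.items():
            n = self.counts[y]
            dist = sum((v - si / n) ** 2 for v, si in zip(x, s))
            if dist < best_dist:
                best, best_dist = y, dist
        return best


class RandomPatchesSketch:
    """Ensemble that mixes random subspaces with online bagging (assumed design)."""

    def __init__(self, n_features, n_learners=10, subspace_size=2, lam=1.0, seed=42):
        self.rng = random.Random(seed)
        self.lam = lam
        # Global subspace randomization: each learner keeps one fixed feature subset.
        self.subspaces = [
            sorted(self.rng.sample(range(n_features), subspace_size))
            for _ in range(n_learners)
        ]
        self.learners = [CentroidLearner() for _ in range(n_learners)]

    def learn_one(self, x, y):
        for feats, learner in zip(self.subspaces, self.learners):
            # Online bagging: weight the instance by a Poisson draw per learner.
            k = poisson(self.lam, self.rng)
            if k > 0:
                learner.learn_one([x[i] for i in feats], y, weight=k)

    def predict_one(self, x):
        # Majority vote over the learners that have seen at least one instance.
        votes = defaultdict(int)
        for feats, learner in zip(self.subspaces, self.learners):
            pred = learner.predict_one([x[i] for i in feats])
            if pred is not None:
                votes[pred] += 1
        return max(votes, key=votes.get) if votes else None
```

Each instance is seen once: every ensemble member draws its own Poisson weight and projects the instance onto its own feature subset before updating, so the members differ both in the instances they emphasize and in the features they observe.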

Heitor Murilo Gomes, Jesse Read, Albert Bifet: Streaming Random Patches for Evolving Data Stream Classification. ICDM 2019: 240-249

Paper at Research Gate

Learning Fast and Slow: A Unified Batch/Stream Framework

We propose a unified approach similar to the model that Daniel Kahneman, Nobel laureate in Economics, uses in his best-selling book “Thinking, Fast and Slow” to describe the mechanisms behind human decision-making. The central thesis of the book is a dichotomy between two modes of thought: System 1 is fast, instinctive, and emotional; System 2 is slower, more deliberative, and more logical.

In this paper, we present FAST AND SLOW LEARNING (FSL), a novel unified framework that sheds light on the symbiosis between batch and stream learning. FSL works by employing Fast (stream) and Slow (batch) Learners, emulating the mechanisms used by humans to make decisions.

Jacob Montiel, Albert Bifet, Viktor Losing, Jesse Read, Talel Abdessalem: Learning Fast and Slow: A Unified Batch/Stream Framework. BigData 2018: 1065-1072

Paper at Research Gate

Jacob Montiel PhD Thesis: “Fast and slow machine learning”

 

Machine Learning for Data Streams: with Practical Examples in MOA

Today many information sources—including sensor networks, financial markets, social networks, and healthcare monitoring—are so-called data streams, arriving sequentially and at high speed. Analysis must take place in real time, with partial data and without the capacity to store the entire data set. This book presents algorithms and techniques used in data stream mining and real-time analytics. Taking a hands-on approach, the book demonstrates the techniques using MOA (Massive Online Analysis), a popular, freely available open-source software framework, allowing readers to try out the techniques after reading the explanations.

The book first offers a brief introduction to the topic, covering big data mining, basic methodologies for mining data streams, and a simple example of MOA. More detailed discussions follow, with chapters on sketching techniques, change, classification, ensemble methods, regression, clustering, and frequent pattern mining. Most of these chapters include exercises, an MOA-based lab session, or both. Finally, the book discusses the MOA software, covering the MOA graphical user interface, the command line, use of its API, and the development of new methods within MOA. The book will be an essential reference for readers who want to use data stream mining as a tool, researchers in innovation or data stream mining, and programmers who want to create new algorithms for MOA.

  • Series: Adaptive Computation and Machine Learning series
  • Hardcover: 288 pages
  • Publisher: The MIT Press (March 2, 2018)
  • Language: English
  • ISBN-10: 0262037793
  • ISBN-13: 978-0262037792

Letters to a Young PhD Student

In “Letters to a Young Poet”, Rainer Maria Rilke gave a young poet some advice on how a poet should feel, love, and seek truth in poetry:

“Nobody can advise you and help you. Nobody. There is only one way—Go into yourself.”

I can recommend only two things to young PhD students:

  • Always do more than your advisor asks of you
  • Focus on an important research question

I can suggest one book for becoming more effective in research and in life:

The Seven Habits of Highly Effective People by Stephen Covey

 

Classifier Concept Drift Detection and the Illusion of Progress

In this paper, we discuss the surprising result that non-change detectors can outperform change detectors when used in a classification streaming evaluation. This may be due to temporal dependence in the data, and we argue that evaluation of change detectors should not be done using only classifiers. We hope that this paper will open several directions for future research.

In an experiment with an adaptive Hoeffding tree (HT) on the electricity and covertype datasets, the best performance was obtained with the No-Change Detector. This detector signals a change every 60 instances; it is a no-change detector in the sense that it does not actually detect change in the stream. Surprisingly, classifiers using this no-change detector obtain better results than those using the standard change detectors.
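The no-change detector described above is simple enough to sketch directly. The following is a minimal illustration, assuming a hypothetical class name `PeriodicNoChangeDetector` and a hypothetical `update` method; only the period of 60 instances comes from the text, and the interface is not taken from any particular library.

```python
class PeriodicNoChangeDetector:
    """Signals a change every `period` instances, ignoring the input entirely."""

    def __init__(self, period=60):
        if period <= 0:
            raise ValueError("period must be a positive integer")
        self.period = period
        self.n_seen = 0

    def update(self, value):
        """Consume one observation; return True when a change is signaled.

        `value` (for example the 0/1 error of a classifier) is deliberately
        unused: the detector never inspects the stream.
        """
        self.n_seen += 1
        return self.n_seen % self.period == 0


detector = PeriodicNoChangeDetector(period=60)
# Positions (1-based) at which a change is signaled in a stream of 180 instances.
signals = [i for i in range(1, 181) if detector.update(0)]
```

Because the signal depends only on the instance count, any gain it brings to an adaptive classifier cannot come from detecting change; it comes from periodically refreshing the model.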


Albert Bifet: Classifier Concept Drift Detection and the Illusion of Progress. ICAISC (2) 2017: 715-725

Paper at ResearchGate

VHT: Vertical Hoeffding Tree

In this paper we present the Vertical Hoeffding Tree (VHT), the first distributed streaming algorithm for learning decision trees. It features a novel way of distributing decision trees via vertical parallelism. The algorithm is implemented on top of Apache SAMOA, a platform for mining big data streams, and is thus able to run on real-world clusters. Our experiments on the accuracy and throughput of VHT show its ability to scale while attaining superior performance compared to sequential decision trees.

Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo: VHT: Vertical Hoeffding Tree. BigData 2016: 915-922

Paper at Research Gate

Apache SAMOA

Keynote Talk at Business Applications of Social Network Analysis (BASNA) 2014

I was happy to be invited to give a keynote talk at BASNA 2014, the 5th International Workshop on Business Applications of Social Network Analysis, which was co-located with the 2014 IEEE International Conference on Data Mining (ICDM 2014) in Shenzhen, China.

The talk was on Real-Time Big Data Stream Analytics, covering new techniques in big data mining that can adapt to changes while using only a small amount of time and memory. As an example, I discussed a social network application of data stream mining to compute user influence probabilities. I also presented the MOA software framework and SAMOA, the distributed streaming software that runs on top of Storm, Samza, and S4. Here are the slides:

Big Data Stream Mining Tutorial at IEEE Big Data 2014

Gianmarco de Francisci Morales presented our tutorial “Big Data Stream Mining” this week at IEEE Big Data 2014 in Washington DC.

This tutorial was a gentle introduction to mining big data streams. The first part introduced data stream learners for classification, regression, clustering, and frequent pattern mining. The second part discussed data stream mining on distributed engines such as Storm, S4, and Samza.

Outline:

  1. Fundamentals and Stream Mining Algorithms
    • Stream mining setting
    • Concept drift
    • Classification and Regression
    • Clustering
    • Frequent Pattern mining
  2. Distributed Big Data Stream Mining
    • Distributed Stream Processing Engines
    • Classification
    • Regression

Slides are available at: https://sites.google.com/site/iotminingtutorial/