Blog

Challenges of Machine Learning for Data Streams in the Banking Industry

I was pleased to be invited to give a keynote talk at the Ninth International Conference on Big Data Analytics (December 15-18, 2021) at the Indian Institute of Information Technology Allahabad (IIITA), Prayagraj, India, on our recent work with Mariam Barry, Raja Chiky, Jacob Montiel and Vinh-Thuy Tran.

Here is the paper:

https://link.springer.com/chapter/10.1007/978-3-030-93620-4_9

Banking Information Systems continuously generate large quantities of data as inter-connected streams (transactions, event logs, time series, metrics, graphs, processes, etc.). Such data streams need to be processed online to support critical business applications such as real-time fraud detection, network security attack prevention or predictive maintenance on information system infrastructure. Many algorithms have been proposed for data stream learning; however, most of them do not deal with the important challenges and constraints imposed by real-world applications, in particular when we need to train models incrementally from heterogeneous data and deploy them within complex big data architectures. Based on banking applications and lessons learned in production environments of BNP Paribas (a major international banking group and leader in the Eurozone), we identified the most important current challenges for mining IT data streams. Our goal is to highlight the key challenges faced by data scientists and data engineers within complex industry settings when building or deploying models for real-world streaming applications. We provide future research directions on Stream Learning that will accelerate the adoption of online learning models for solving real-world problems, thereby bridging the gap between research and industry communities. Finally, we provide some recommendations to tackle some of these challenges.

December 18, 2021
Deloitte AI Institute – AI Quotes/Facts Advent Calendar

I was honoured that Björn Bringmann, Deloitte AI Institute Managing Director, used one of my quotes for the Advent Calendar of the Deloitte AI Institute:

https://www.linkedin.com/feed/update/urn:li:activity:6874613775496966144/

December 11, 2021
WHITE PAPER: Aotearoa New Zealand Artificial Intelligence A Strategic Approach

This whitepaper discusses current AI capabilities in Aotearoa New Zealand and offers recommendations for establishing Aotearoa New Zealand as a research centre of excellence and trust in AI.

Whitepaper

Artificial Intelligence Researchers Association

Artificial Intelligence (AI) is profoundly changing how we live and work. The cumulative impact of AI is likely to be comparable to other transformative technologies such as electricity or the internet. As a result, it is imperative that we take a strategic approach to realising the potential benefits offered by AI and to protecting people against the potential risks.

In this whitepaper, we discuss current AI capabilities in Aotearoa New Zealand and offer recommendations for establishing Aotearoa New Zealand as a research centre of excellence and trust in AI. Our discussion is informed and guided by the framework for developing a national AI strategy set out by the World Economic Forum.

It is important to invest in AI imbued with characteristics and values important for Aotearoa New Zealand such as sustainability, fairness, equality, data sovereignty, Te Tiriti obligations, multiculturalism, intergenerational thinking, people and whānau first, and holistic thinking. Otherwise, we risk being relegated to users of overseas technologies developed by countries with different values.

Our vision is that by 2030, Aotearoa New Zealand will have a community of cutting-edge companies producing and exporting AI technologies, supported by a strong network of researchers involved in high-level fundamental and applied research.

December 8, 2021

RIVER

River is a Python library for online machine learning. It is the result of a merger between creme and scikit-multiflow. River’s ambition is to be the go-to library for doing machine learning on streaming data.

Paper

Github

As a quick example, we’ll train a logistic regression to classify the website phishing dataset. Here’s a look at the first observation in the dataset.

>>> from pprint import pprint
>>> from river import datasets

>>> dataset = datasets.Phishing()

>>> for x, y in dataset:
...     pprint(x)
...     print(y)
...     break
{'age_of_domain': 1,
 'anchor_from_other_domain': 0.0,
 'empty_server_form_handler': 0.0,
 'https': 0.0,
 'ip_in_url': 1,
 'is_popular': 0.5,
 'long_url': 1.0,
 'popup_window': 0.0,
 'request_from_other_domain': 0.0}
True

Now let’s run the model on the dataset in a streaming fashion. We sequentially interleave predictions and model updates. Meanwhile, we update a performance metric to see how well the model is doing.

>>> from river import compose
>>> from river import linear_model
>>> from river import metrics
>>> from river import preprocessing

>>> model = compose.Pipeline(
...     preprocessing.StandardScaler(),
...     linear_model.LogisticRegression()
... )

>>> metric = metrics.Accuracy()

>>> for x, y in dataset:
...     y_pred = model.predict_one(x)      # make a prediction
...     metric.update(y, y_pred)           # update the metric in place
...     model.learn_one(x, y)              # make the model learn in place

>>> metric
Accuracy: 89.20%

January 18, 2021

Machine learning for streaming data: state of the art, challenges, and opportunities

Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., Velocity and Volume. Given the current industry needs, there are many challenges to be addressed before existing methods can be efficiently applied to real-world problems.

In this work, we focus on elucidating the connections among the current state of the art in related fields and on clarifying open challenges in both academia and industry. We treat with special care topics that were not thoroughly investigated in past position and survey papers.

This work aims to evoke discussion and elucidate the current research opportunities, highlighting the relationship of different subareas and suggesting courses of action when possible.

Heitor Murilo Gomes, Jesse Read, Albert Bifet, Jean Paul Barddal, João Gama: Machine learning for streaming data: state of the art, challenges, and opportunities. SIGKDD Explorations 21(2): 6-22 (2019)

Paper at Research Gate

December 4, 2019
Streaming Random Patches

The Streaming Random Patches (SRP) algorithm is a new ensemble method specially adapted to stream classification which combines random subspaces and online bagging. We provide theoretical insights and empirical results illustrating different aspects of SRP. In particular, we explain how the widely adopted incremental Hoeffding trees are not, in fact, unstable learners, unlike their batch counterparts, and how this fact significantly influences ensemble methods design and performance. We compare SRP against state-of-the-art ensemble variants for streaming data on a multitude of datasets. The results show that SRP produces a high predictive performance for both real and synthetic datasets. Besides, we analyze the diversity over time and the average tree depth, which provides insights on the differences between local subspace randomization (as in random forest) and global subspace randomization (as in random subspaces).
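
As a rough illustration of the two ingredients named above (random subspaces and online bagging), here is a minimal pure-Python sketch. It is not the authors' implementation: the nearest-centroid base learner stands in for a Hoeffding tree, drift detection is left out, and the class names, `subspace_size=0.6` and `lam=6` are illustrative assumptions.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class NearestCentroid:
    """Toy incremental learner (assumption: stands in for a Hoeffding tree)."""

    def __init__(self):
        self.sums = {}     # label -> per-feature weighted sums
        self.weights = {}  # label -> total weight seen

    def learn_one(self, x, y, w=1):
        sums = self.sums.setdefault(y, {})
        for name, value in x.items():
            sums[name] = sums.get(name, 0.0) + w * value
        self.weights[y] = self.weights.get(y, 0) + w

    def predict_one(self, x):
        best, best_dist = None, math.inf
        for y, sums in self.sums.items():
            dist = sum((x[f] - sums[f] / self.weights[y]) ** 2 for f in x)
            if dist < best_dist:
                best, best_dist = y, dist
        return best


class RandomPatchesSketch:
    """Hypothetical ensemble: fixed feature subsets plus Poisson instance weights."""

    def __init__(self, features, n_models=10, subspace_size=0.6, lam=6, seed=42):
        self.rng = random.Random(seed)
        self.lam = lam
        k = max(1, round(subspace_size * len(features)))
        # Global subspace randomization: each member keeps one fixed feature subset.
        self.members = [
            (self.rng.sample(features, k), NearestCentroid()) for _ in range(n_models)
        ]

    def learn_one(self, x, y):
        for subset, model in self.members:
            w = poisson(self.lam, self.rng)  # online bagging: random instance weight
            if w > 0:
                model.learn_one({f: x[f] for f in subset}, y, w)

    def predict_one(self, x):
        votes = {}
        for subset, model in self.members:
            y = model.predict_one({f: x[f] for f in subset})
            if y is not None:
                votes[y] = votes.get(y, 0) + 1
        return max(votes, key=votes.get) if votes else None


# Test-then-train on a synthetic two-class stream.
rng = random.Random(0)
features = ["a", "b", "c", "d", "e"]
ensemble = RandomPatchesSketch(features)
correct = total = 0
for _ in range(500):
    y = rng.random() < 0.5
    centre = 2.0 if y else -2.0
    x = {f: rng.gauss(centre, 1.0) for f in features}
    correct += ensemble.predict_one(x) == y  # test first ...
    ensemble.learn_one(x, y)                 # ... then train
    total += 1
accuracy = correct / total
print(f"accuracy: {accuracy:.3f}")
```

Each member trains on its own "patch" of the stream: a fixed subset of the features and a random reweighting of the instances. A library implementation would also track drift per member and reset members when a change is detected.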

Heitor Murilo Gomes, Jesse Read, Albert Bifet: Streaming Random Patches for Evolving Data Stream Classification. ICDM 2019: 240-249

Paper at Research Gate

November 4, 2019
Learning Fast and Slow: A Unified Batch/Stream Framework

We propose a unified approach similar to the model proposed by Nobel Prize in Economics laureate Daniel Kahneman in his best-selling book “Thinking, Fast and Slow” to describe the mechanisms behind human decision-making. The central thesis of this book is a dichotomy between two modes of thought: System 1 is fast, instinctive and emotional; System 2 is slower, more deliberative, and more logical.

In this paper, we present FAST AND SLOW LEARNING (FSL), a novel unified framework that sheds light on the symbiosis between batch and stream learning. FSL works by employing Fast (stream) and Slow (batch) Learners, emulating the mechanisms used by humans to make decisions.
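
To make the fast/slow split concrete, here is a small pure-Python sketch. It is only one possible reading of the idea: this post does not spell out how FSL combines the two learners, so the selection rule used here (prefer whichever learner was more accurate over a recent window), the toy learners, and all names and parameters (`FastSlowSketch`, `batch_size`, `window`) are assumptions.

```python
import random
from collections import Counter, deque


class FastMajority:
    """Toy stream learner: updated one instance at a time."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class SlowNearestMean:
    """Toy batch learner: refit from scratch on a buffer of (x, y) pairs."""

    def __init__(self):
        self.means = {}

    def fit(self, batch):
        grouped = {}
        for x, y in batch:
            grouped.setdefault(y, []).append(x)
        self.means = {y: sum(xs) / len(xs) for y, xs in grouped.items()}

    def predict_one(self, x):
        if not self.means:
            return None
        return min(self.means, key=lambda y: abs(x - self.means[y]))


class FastSlowSketch:
    """Hypothetical combination of a fast (stream) and a slow (batch) learner."""

    def __init__(self, batch_size=50, window=20):
        self.fast, self.slow = FastMajority(), SlowNearestMean()
        self.buffer = []
        self.batch_size = batch_size
        # Recent hit/miss records drive the choice between the two learners.
        self.hits = {"fast": deque(maxlen=window), "slow": deque(maxlen=window)}

    def predict_one(self, x):
        # Assumed policy: use the learner with more recent hits (ties go to fast).
        name = max(("fast", "slow"), key=lambda n: sum(self.hits[n]))
        learner = self.fast if name == "fast" else self.slow
        return learner.predict_one(x)

    def learn_one(self, x, y):
        # Score both learners on the new instance before they see its label.
        self.hits["fast"].append(self.fast.predict_one(x) == y)
        self.hits["slow"].append(self.slow.predict_one(x) == y)
        self.fast.learn_one(x, y)        # fast path: update immediately
        self.buffer.append((x, y))
        if len(self.buffer) == self.batch_size:
            self.slow.fit(self.buffer)   # slow path: periodic batch refit
            self.buffer = []


rng = random.Random(1)
model = FastSlowSketch()
n = 300
correct = 0
for _ in range(n):
    y = rng.random() < 0.5
    x = rng.gauss(1.5 if y else -1.5, 1.0)
    correct += model.predict_one(x) == y
    model.learn_one(x, y)
accuracy = correct / n
print(f"accuracy: {accuracy:.3f}")
```

The fast learner answers from the first instance, while the slow learner only becomes available after its first batch; from then on the window-based rule hands over to whichever of the two has been performing better.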

Jacob Montiel, Albert Bifet, Viktor Losing, Jesse Read, Talel Abdessalem: Learning Fast and Slow: A Unified Batch/Stream Framework. BigData 2018: 1065-1072

Paper at Research Gate

Jacob Montiel PhD Thesis: “Fast and slow machine learning”

December 4, 2018
Machine Learning for Data Streams: with Practical Examples in MOA
Today many information sources—including sensor networks, financial markets, social networks, and healthcare monitoring—are so-called data streams, arriving sequentially and at high speed. Analysis must take place in real time, with partial data and without the capacity to store the entire data set. This book presents algorithms and techniques used in data stream mining and real-time analytics. Taking a hands-on approach, the book demonstrates the techniques using MOA (Massive Online Analysis), a popular, freely available open-source software framework, allowing readers to try out the techniques after reading the explanations.

The book first offers a brief introduction to the topic, covering big data mining, basic methodologies for mining data streams, and a simple example of MOA. More detailed discussions follow, with chapters on sketching techniques, change, classification, ensemble methods, regression, clustering, and frequent pattern mining. Most of these chapters include exercises, an MOA-based lab session, or both. Finally, the book discusses the MOA software, covering the MOA graphical user interface, the command line, use of its API, and the development of new methods within MOA. The book will be an essential reference for readers who want to use data stream mining as a tool, researchers in innovation or data stream mining, and programmers who want to create new algorithms for MOA.
- Series: Adaptive Computation and Machine Learning series
- Hardcover: 288 pages
- Publisher: The MIT Press (March 2, 2018)
- Language: English
- ISBN-10: 0262037793
- ISBN-13: 978-0262037792
April 4, 2018
Letters to a Young PhD Student
In “Letters to a Young Poet”, Rainer Maria Rilke gave a young poet some advice on how a poet should feel, love, and seek truth in poetry:

“Nobody can advise you and help you. Nobody. There is only one way—Go into yourself.”

I can recommend only two things to young PhD students:
- Always do more than your advisor asks of you
- Focus on an important research question
I can suggest one book to be more effective in research and in life:

The Seven Habits of Highly Effective People by Stephen Covey
April 4, 2018
Classifier Concept Drift Detection and the Illusion of Progress

In this paper, we discuss the surprising result that non-change detectors can outperform change detectors when used in a classification streaming evaluation. This may be due to the temporal dependence in the data, and we argue that evaluation of change detectors should not be done using only classifiers. We hope that this paper will open several directions for future research.

In an experiment with an adaptive Hoeffding tree (HT) on the electricity and covertype datasets, the best performance was due to the No-Change Detector. This detector outputs change every 60 instances; it is a no-change detector in the sense that it is not detecting change in the stream. Surprisingly, the classifiers using this no-change detector get better results than those using the standard change detectors.
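
The no-change detector described above is simple enough to sketch in a few lines of Python. The class name and the `update` interface are assumptions made for illustration; only the behaviour (signal a change every 60 instances, regardless of the data) comes from the text.

```python
class PeriodicChangeSignal:
    """Signals 'change' every `period` instances, ignoring the data entirely."""

    def __init__(self, period=60):
        self.period = period
        self.seen = 0

    def update(self, error):
        # `error` is accepted only to mirror a typical detector interface; it is unused.
        self.seen += 1
        if self.seen == self.period:
            self.seen = 0
            return True
        return False


detector = PeriodicChangeSignal(period=60)
signals = [i for i in range(1, 181) if detector.update(error=0)]
print(signals)  # [60, 120, 180]
```

Plugged into an adaptive classifier, such a detector simply forces a periodic reset or model replacement, which is why it makes a useful baseline when evaluating real change detectors.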

Albert Bifet: Classifier Concept Drift Detection and the Illusion of Progress. ICAISC (2) 2017: 715-725

Paper at ResearchGate

July 4, 2017