VHT: Vertical Hoeffding Tree

In this paper we present the Vertical Hoeffding Tree (VHT), the first distributed streaming algorithm for learning decision trees. It features a novel way of distributing decision trees via vertical parallelism. The algorithm is implemented on top of Apache SAMOA, a platform for mining big data streams, and is thus able to run on real-world clusters. Our experiments on the accuracy and throughput of VHT demonstrate its ability to scale while attaining superior performance compared to sequential decision trees.
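To make the idea of vertical parallelism more concrete, here is a minimal single-process sketch in Python. It assumes that "vertical" means the attributes (columns) are partitioned across workers, each of which keeps sufficient statistics only for its own attributes, and that a split is accepted when the gap between the two best information gains exceeds the standard Hoeffding bound. The names `LocalStatistics`, `should_split` and `hoeffding_bound` are hypothetical and are not part of the Apache SAMOA API; the actual VHT implementation may differ in many details.

```python
import math
from collections import defaultdict


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


class LocalStatistics:
    """Hypothetical worker that only sees a subset of the attributes."""

    def __init__(self, attribute_ids):
        # counts[attribute][attribute value][class label] -> occurrences
        self.counts = {a: defaultdict(lambda: defaultdict(int))
                       for a in attribute_ids}

    def update(self, instance, label):
        for a in self.counts:
            self.counts[a][instance[a]][label] += 1

    def info_gains(self, class_counts):
        """Information gain of every attribute owned by this worker."""
        n = sum(class_counts.values())
        base = entropy(class_counts.values())
        gains = {}
        for a, by_value in self.counts.items():
            remainder = sum(sum(lc.values()) / n * entropy(lc.values())
                            for lc in by_value.values())
            gains[a] = base - remainder
        return gains


def should_split(workers, class_counts, delta=1e-7):
    """Aggregate local gains; return (attribute or None, epsilon).

    Assumes at least two attributes in total across all workers.
    """
    gains = {}
    for w in workers:
        gains.update(w.info_gains(class_counts))
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best_attr, best), (_, second) = ranked[0], ranked[1]
    n = sum(class_counts.values())
    # For information gain, the range R is log2(number of classes).
    eps = hoeffding_bound(math.log2(len(class_counts)), delta, n)
    return (best_attr if best - second > eps else None), eps


# Toy stream: attribute 0 equals the label, attribute 1 is independent of it.
workers = [LocalStatistics([0]), LocalStatistics([1])]
class_counts = defaultdict(int)
for i in range(2000):
    label = i % 2
    instance = {0: label, 1: (i // 2) % 2}
    class_counts[label] += 1
    for w in workers:
        w.update(instance, label)

chosen, eps = should_split(workers, class_counts)
```

In this toy run the aggregator only ever receives per-attribute gains from the workers, never the raw instances, which is the point of the attribute-wise partitioning assumed here.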

Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo: VHT: Vertical Hoeffding Tree. BigData 2016: 915-922

Paper at Research Gate

Apache SAMOA

December 4, 2016
Short Course in Data Stream Mining

Slides of a short course in Data Stream Mining, presenting classification methods, adaptive change detection, clustering and frequent pattern mining.

January 23, 2015
Keynote Talk at Business Applications of Social Network Analysis (BASNA) 2014

I was happy to be invited to give a keynote talk at BASNA 2014, the 5th International Workshop on Business Applications of Social Network Analysis, which was co-located with the 2014 IEEE International Conference on Data Mining (ICDM 2014) in Shenzhen, China.

The talk was on Real-Time Big Data Stream Analytics and covered new techniques in Big Data mining that can adapt to changes using only a small amount of time and memory. As an example, I discussed a social network application of data stream mining: computing user influence probabilities. I also presented the MOA software framework and SAMOA, the distributed streaming software that runs on top of Storm, Samza and S4. Here are the slides:

November 18, 2014
Big Data Stream Mining Tutorial at IEEE Big Data 2014

This week Gianmarco De Francisci Morales presented our tutorial “Big Data Stream Mining” at IEEE Big Data 2014 in Washington DC.

This tutorial was a gentle introduction to mining big data streams. The first part introduced data stream learners for classification, regression, clustering, and frequent pattern mining. The second part discussed data stream mining on distributed engines such as Storm, S4, and Samza.

Outline:
1. Fundamentals and Stream Mining Algorithms
  - Stream mining setting
  - Concept drift
  - Classification and Regression
  - Clustering
  - Frequent Pattern mining
2. Distributed Big Data Stream Mining
  - Distributed Stream Processing Engines
  - Classification
  - Regression
Slides available at: https://sites.google.com/site/iotminingtutorial/

November 2, 2014
Extreme Classification: Classify Wikipedia documents into one of 325,056 categories

Extreme classification, where one needs to deal with multi-class and multi-label problems involving a very large number of categories, has opened up a new research frontier in machine learning. Many challenging applications, such as photo and video annotation and web page categorization, can benefit from being formulated as supervised learning tasks with millions, or even billions, of categories. Extreme classification can also give a fresh perspective on core learning problems such as ranking and recommendation by reformulating them as multi-class/label tasks where each item to be ranked or recommended is a separate category.

4th edition of the Large Scale Hierarchical Text Classification (LSHTC) Challenge.

The LSHTC Challenge was a hierarchical text classification competition, using very large datasets. We were happy to be involved in the winning team with Antti Puurula and Jesse Read.

http://arxiv.org/abs/1405.0546

Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories and Wikipedia are two examples of such hierarchies. Along with their widespread use comes the need for automated classification of new documents to the categories in the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue, despite the vastness of available data: as more documents become available, more classes are also added to the hierarchy, and there is a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses challenges and opportunities for new learning methods.

The challenge concerned multi-label classification based on the Wikipedia dataset. The hierarchy is a graph that can have cycles. The number of categories is roughly 325,000 and the number of documents is 2,400,000. A document can appear in multiple classes.

https://www.kaggle.com/c/lshtc

May 23, 2014
Evolving Data Stream Classification and the Illusion of Progress

Data is being generated in real time in increasing quantities, and the distribution generating this data may be changing and evolving. In a paper presented at ECML-PKDD 2013 titled “Pitfalls in benchmarking data stream classification and how to avoid them”, we show that classifying data streams has an important temporal component, which is currently not considered in the evaluation of data stream classifiers. We show how a very simple classifier that exploits this temporal component, the non-change classifier that predicts using only the last class it has seen, can outperform current state-of-the-art classifiers on some real-world datasets. We propose to evaluate data stream classifiers with this temporal component in mind, using a new evaluation measure that provides a more accurate gauge of classifier performance.
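The non-change classifier described above is simple enough to sketch in a few lines of Python. The example below also computes a kappa-style score that uses the non-change classifier as the baseline, (p0 - pe) / (1 - pe), where p0 is the accuracy of the classifier under evaluation and pe is the accuracy of the non-change classifier. This particular formula is an assumption about the form such a measure can take and is not quoted from the post; the function names and the toy stream are hypothetical.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    hits = sum(1 for p, y in zip(predictions, labels) if p == y)
    return hits / len(labels)


def no_change_predictions(labels, first_guess=None):
    """Predict, for each instance, the true label of the previous instance."""
    return [first_guess] + list(labels[:-1])


def majority_class_predictions(labels):
    """Online baseline: predict the most frequent label seen so far."""
    seen = Counter()
    predictions = []
    for y in labels:
        predictions.append(seen.most_common(1)[0][0] if seen else None)
        seen[y] += 1  # test first, then train on the revealed label
    return predictions


def temporal_kappa(classifier_accuracy, no_change_accuracy):
    """Improvement over the non-change baseline, normalised to at most 1.

    Values at or below 0 mean the classifier is no better than repeating
    the last label. Assumed form of the measure, see the text above.
    """
    if no_change_accuracy == 1.0:
        return float("nan")  # baseline is perfect, the score is undefined
    return (classifier_accuracy - no_change_accuracy) / (1.0 - no_change_accuracy)


# Toy stream with strong temporal dependence: one class, then the other.
labels = [0] * 50 + [1] * 50

no_change_acc = accuracy(no_change_predictions(labels), labels)
majority_acc = accuracy(majority_class_predictions(labels), labels)
score = temporal_kappa(majority_acc, no_change_acc)
```

On a stream like this one, where labels arrive in long runs, repeating the last label is almost always right, so a classifier that ignores the temporal order scores below zero on the measure even if its raw accuracy looks acceptable.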

October 2, 2013
Mining Big Data in Real Time

Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Evolving data streams are contributing to the growth of data created over the last few years. Every two days we now create as much data as we created from the dawn of time up until 2003. Evolving data stream methods are becoming a low-cost, green methodology for real-time online prediction and analysis.

Nowadays, the quantity of data that is created every two days is estimated to be 5 exabytes. Moreover, it was estimated that 2007 was the first year in which it was not possible to store all the data that we are producing. This massive amount of data opens new challenging discovery tasks. Real-time data stream analytics are needed to manage the data currently generated, at an ever increasing rate, by applications such as sensor networks, measurements in network monitoring and traffic management, log records or click-streams in web exploring, manufacturing processes, call detail records, email, blogging, Twitter posts and others. In fact, all data generated can be considered as streaming data or as a snapshot of streaming data, since it is obtained from an interval of time. In the data stream model, data arrive at high speed, and algorithms that process them must do so under very strict constraints of space and time. Consequently, data streams pose several challenges for data mining algorithm design. First, algorithms must make use of limited resources (time and memory). Second, they must deal with data whose nature or distribution changes over time.

Invited Talk: 100 Years of Alan Turing and 20 years of SLAIS, Slovenia, 2012

June 15, 2013