This course presents algorithms for data analysis and data mining, with an emphasis on mining massive datasets. It covers both the practical and the theoretical aspects of data mining. Students will become familiar with the most successful algorithms for classification, clustering, and frequent itemset mining, along with other machine learning and data mining techniques. They will also work on a small project in which they implement some of the algorithms and analyze real-world data.
Prerequisites:
– Foundations of algorithms and data structures
– Knowledge of the Java programming language
– Good programming skills
Lecture Slides
- 1. Introduction to Big Data Slides
- 2. Data Pre-processing and Classifier Evaluation Slides
- 3. Apache Spark Slides. Spark SQL Slides
- 4. Apache Spark ML Lab Slides
- 5. Classification Slides
- 6. Clustering Slides
- 7. Frequent Pattern Mining Slides
- 8. Apache Spark Lab 2 Notebook 1 – Notebook 2
- 9. Link Analysis: PageRank and HITS Slides
- 10. Deep Learning/Multilayer Perceptron Slides Source Code
- 11. Apache Spark ML Lab 3
- New competition website: https://www.kaggle.com/t/cd2e890b195447888adbe7b471303b30
- Slides
- Submission (20/11/2018)