Resources to learn Big Data Analytics

A list of books and resources that are available online for learning Data Science:


  • McKinsey Big data: The next frontier for innovation, competition, and productivity. Website
  • O’Reilly Big Data Now: 2012 Edition. Website
  • IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. Website
  • Pentaho Real-Time Big Data Analytics: Emerging Architecture. Website


  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Website
  • Data Stream Mining: A Practical Approach. Website, Download
  • Introduction to Information Retrieval: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Website
  • Mining of Massive Datasets: Anand Rajaraman, Jeff Ullman, and Jure Leskovec. Website


Online Courses (suggested by Tim Osterbuhr)

  • Free Berkeley course on big data analysis using the Twitter API. Website
  • Extensive free data science course (good step-by-step approach). Website
  • Coursera course to get a good foundation of algorithms. Website

Mining Big Data in Real Time

Analyzing streaming data in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Evolving data streams are contributing to the growth of data created over the last few years. Every two days we now create as much data as was created from the dawn of time up until 2003. Methods for evolving data streams are becoming a low-cost, green methodology for real-time online prediction and analysis.

Nowadays, the quantity of data created every two days is estimated at 5 exabytes. Moreover, it was estimated that 2007 was the first year in which it was not possible to store all the data being produced. This massive amount of data opens up challenging new discovery tasks. Real-time data stream analytics are needed to manage the data currently generated, at an ever-increasing rate, by applications such as sensor networks, measurements in network monitoring and traffic management, log records or click-streams from web browsing, manufacturing processes, call detail records, email, blogging, Twitter posts, and others. In fact, all generated data can be considered either streaming data or a snapshot of streaming data, since it is obtained over an interval of time.

In the data stream model, data arrive at high speed, and the algorithms that process them must do so under very strict constraints of space and time. Consequently, data streams pose several challenges for data mining algorithm design. First, algorithms must make use of limited resources (time and memory). Second, they must deal with data whose nature or distribution changes over time.
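The two constraints above (bounded time and memory, and a distribution that may change) can be illustrated with a minimal Python sketch. It is not a method taken from any of the resources listed here: the class name DriftMonitor, the window size, and the threshold are all hypothetical choices made for illustration. The monitor reads each item once, keeps only a fixed-size window plus a running mean, and raises a flag when the mean of the recent window moves away from the long-run mean.

```python
from collections import deque


class DriftMonitor:
    """Naive one-pass monitor (illustrative only, not a published algorithm).

    Memory is bounded by the window size, and each item costs O(1) time.
    """

    def __init__(self, window_size=50, threshold=1.0):
        self.window = deque(maxlen=window_size)  # most recent items only
        self.window_sum = 0.0                    # sum of items in the window
        self.count = 0                           # items seen so far
        self.mean = 0.0                          # running mean of the whole stream
        self.threshold = threshold               # hypothetical alarm threshold

    def update(self, x):
        """Consume one item; return True if the recent window looks shifted."""
        # Incremental update of the long-run mean (no history is stored).
        self.count += 1
        self.mean += (x - self.mean) / self.count

        # Maintain the window sum; the deque drops its oldest item on append
        # once it is full, so subtract that item first.
        if len(self.window) == self.window.maxlen:
            self.window_sum -= self.window[0]
        self.window.append(x)
        self.window_sum += x

        # Do not raise alarms until the window is full.
        if len(self.window) < self.window.maxlen:
            return False
        recent_mean = self.window_sum / len(self.window)
        return abs(recent_mean - self.mean) > self.threshold


if __name__ == "__main__":
    # Synthetic stream (made-up data): the value shifts from 0 to 5 at item 200.
    stream = [0.0] * 200 + [5.0] * 50
    monitor = DriftMonitor(window_size=50, threshold=1.0)
    alarms = [i for i, x in enumerate(stream) if monitor.update(x)]
    print("first alarm at item:", alarms[0] if alarms else None)
```

Comparing a recent window against a long-run mean is about the simplest possible change check. Practical stream miners use statistically grounded detectors, but the resource profile is the same: a single pass, constant work per item, and memory that does not grow with the stream.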

Invited Talk: 100 Years of Alan Turing and 20 years of SLAIS, Slovenia, 2012