Adjunct Prof. Dr.-Ing. Mahdi Bohlouli

Personal Academic Page

Big Data Management Course

Introduction:

Big data is one of the emerging technologies that now appears in almost every new scientific and industrial work associated with data science and applied scalable analytics. It brings massive scalability to analytics, including machine learning approaches. For example, scalable machine learning libraries such as Apache Mahout have been developed on top of big data platforms using MapReduce and Hadoop. According to the Gartner Hype Cycle of 2016, most emerging technologies of the next 10 years rely on innovations made by big data technologies. [ Big Data seminar ]

This course is organized by Asst. Prof. Dr. Mahdi Bohlouli, Aparup Khatua, Mohammad H Nazeri, and Amir Reza Mohammadi. The teaching assistants who prepare material for the extra hands-on classes are Arya, Navid, Sepideh, and Solmaz. The course deepens students’ knowledge by providing an introduction to big data analysis, MapReduce, and the Hadoop Distributed File System, followed by a deep dive into Apache Spark and its corresponding stack of libraries, such as Spark ML and Spark SQL. The course consists of theoretical and practical (research) parts, hands-on assignments, and a capstone project. The theoretical part covers the first 7 weeks of the course and is followed by an Apache Spark tutorial given by Amir Reza Mohammadi and Mohammad H Nazeri. The practical part may include some implementation exercises.

Piazza Page: [piazza.com/iasbs.ac.ir/winter2020/bd101]

Course Objectives:

This course aims to introduce you to the concepts behind Big Data, the core technologies used in managing large-scale data sets, and a range of technologies for developing solutions to large-scale data analytics problems.

This course is intended for students who want to understand modern large-scale data analytics systems. It covers a wide range of topics and technologies and prepares students to build such systems, use them efficiently, and effectively address challenges in big data management.

Course Outline:

Date      Lecture Topics
Week 1    Introduction to course and logistics
Week 2    Introduction to Big Data Analytics
Week 3    Data models for Big Data
Week 4    Data storage on the batch layer + HDFS
Week 5    Introduction to Hadoop ecosystem + MapReduce
Week 6    Serving layer of lambda architecture
Week 7    Speed layer of lambda architecture
Week 8    Introduction to Apache Spark and core components
Week 9    Deep dive into Spark libraries (Spark SQL, MLlib, Spark Streaming)
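
For orientation, the MapReduce model introduced in Week 5 can be sketched as a word count in plain Python. This is only an illustrative sketch and not part of the course material: it runs on a single machine without Hadoop, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data management"]))
```

In a real cluster the map and reduce functions run in parallel on different machines and the framework performs the grouping step; here all three steps run sequentially in one process.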

Assignments:

Available on Piazza. [https://piazza.com/iasbs.ac.ir/winter2020/bd101/resources]

Prerequisites:

  • Data Structures and Algorithms
  • Database Systems
  • Before commencing this course, you should:
    • have experience with and good knowledge of algorithm design
    • have a solid background in database systems
    • have solid programming skills in Java
    • be familiar with working on a Unix-style operating system
    • have basic knowledge of linear algebra (e.g., vector spaces, matrix multiplication), probability theory and statistics, and graph theory
  • No previous experience is necessary in:
    – MapReduce
    – Parallel and distributed programming

References:

  • Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Nathan Marz with James Warren. Manning
  • Hadoop: The Definitive Guide. Tom White. 4th Edition – O’Reilly
  • Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. 2nd Edition – Cambridge University Press
  • Advanced Analytics with Spark. Josh Wills, Sandy Ryza, Sean Owen, and Uri Laserson. O’Reilly Media

 

Final Exam:

Instructors:

Teaching Assistants:

 
