Introduction:
Big data is an emerging technology that now appears in almost every new scientific and industrial work associated with data science and applied scalable analytics. It provides massive scalability to analytics, including machine learning approaches. Furthermore, scalable machine learning libraries such as Apache Mahout have been developed on top of big data platforms using MapReduce and Hadoop. According to the Gartner Hype Cycle of 2016, most of the emerging technologies of the next 10 years rely on innovations made by big data technologies. [ Big Data seminar ]
This course is organized by Asst. Prof. Dr. Mahdi Bohlouli, Aparup Khatua, Mohammad H Nazeri, and Amir Reza Mohammadi. The teaching assistants who prepare material for the extra hands-on classes are Arya, Navid, Sepideh, and Solmaz. The course deepens students’ knowledge by providing an introduction to big data analysis, MapReduce, and the Hadoop Distributed File System, followed by a deep dive into Apache Spark and its corresponding stack of libraries such as SparkML and SparkSQL. The course consists of a theoretical part, a practical (research) part, hands-on assignments, and a capstone project. The theoretical part covers the first 7 weeks of the course and is followed by an Apache Spark tutorial given by Amir Reza Mohammadi and Mohammad H Nazeri. The practical part may require some implementation exercises.
Piazza Page: [piazza.com/iasbs.ac.ir/winter2020/bd101]
Course Objectives:
This course aims to introduce you to the concepts behind Big Data, the core technologies used in managing large-scale data sets, and a range of technologies for developing solutions to large-scale data analytics problems.
This course is intended for students who want to understand modern large-scale data analytics systems. It covers a wide range of topics and technologies and prepares students to build such systems and to use them efficiently and effectively to address challenges in big data management.
Course Outline:
Date | Lecture Topics |
Week 1 | Introduction to course and logistics |
Week 2 | Introduction to Big Data Analytics |
Week 3 | Data models for Big Data |
Week 4 | Data storage on the batch layer + HDFS |
Week 5 | Introduction to Hadoop ecosystem + MapReduce |
Week 6 | Serving layer of lambda architecture |
Week 7 | Speed layer of lambda architecture |
Week 8 | Introduction to Apache Spark and core components |
Week 9 | Deep dive into Spark libraries (SparkSQL, MLlib, SparkStreaming) |
Assignments:
Available on Piazza. [https://piazza.com/iasbs.ac.ir/winter2020/bd101/resources]
Prerequisites:
- Data Structures and Algorithms
- Database Systems
- Before commencing this course, you should:
– have experience with and good knowledge of algorithm design
– have a solid background in database systems
– have solid programming skills in Java
– be familiar with working on a Unix-style operating system
– have basic knowledge of linear algebra (e.g., vector spaces, matrix multiplication), probability theory and statistics, and graph theory
- No previous experience is necessary in:
– MapReduce
– Parallel and distributed programming
References:

- Hadoop: The Definitive Guide. Tom White. 4th Edition – O’Reilly
- Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. 2nd Edition – Cambridge University Press
- Advanced Analytics with Spark. Josh Wills, Sandy Ryza, Sean Owen, and Uri Laserson. O’Reilly Media
Final Exam:
Instructors:
Teaching Assistants: