Big Data System and Large-Scale Data Analysis  083500M02001H

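Term: 2020-2021 academic year, second (spring) semester | Course type: First-level discipline general course | Instructors: Shimin Chen (陈世敏), Yi Sun (孙翼)
Class time: Wednesday, periods 9, 10, 11
Location: Teaching Building 1, Room 107
Weeks: 1-16
Course code: 083500M02001H Class hours: 50 Credits: 3.00
Teaching assistant: Qi Qi (祁琦)
English title: Big Data System and Large-Scale Data Analysis Convener: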

Teaching Objectives and Requirements

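With the development of the Internet, social networks, cloud computing, the Internet of Things, mobile computing, and large-scale scientific exploration and computational analysis, new data-intensive applications have sprung up rapidly. These applications typically feature huge data volumes, high rates of data acquisition and update, and/or a rich variety of data types, and are collectively called big data applications. In recent years, industry and academia have released many kinds of big data processing platforms for different application scenarios, each with its own characteristics. At the same time, a large number of data modeling and analysis methods have been applied to large-scale data processing. On the one hand, these platforms follow no unified standard and differ in design goals, functionality, and key techniques. This makes learning difficult for beginners, who can easily "miss the forest for the trees" and struggle to form a complete picture. On the other hand, large-scale data processing must pair an appropriate algorithm with a suitable big data platform to meet its functional and performance goals.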
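    This course covers big data processing systematically from two perspectives: big data systems and large-scale data analysis. On the systems side, the course takes traditional relational database systems as its foundation and explains the architecture and working principles of a range of big data storage and computation systems. By comparing each with traditional relational database systems, it analyzes the design philosophy, key techniques, strengths, and weaknesses of each, giving a structured account of the many competing big data systems. On the analysis side, the course covers classic modeling and analysis algorithms that are widely used in practice. After introducing the basic principles of each algorithm, it focuses on implementation and application. The two parts complement each other so that students form a unified understanding of big data processing.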
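    Through this course, students are expected to understand the scientific problems of big data systems and the classic algorithms for big data modeling and analysis, to master the basic design ideas and key techniques, and to form a comprehensive understanding of big data processing. This provides a solid foundation for further research on big data systems or big data analysis, or for choosing and using big data systems to develop big data applications.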

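Prerequisites

Introduction to Databases, Programming, Computer Principles, Data Structures, Linear Algebra, Fundamentals of Probability Theory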

Textbook

Course Content

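Chapter 1: The Development and Applications of Big Data
1.1 Development of computer hardware
1.2 Development of data management systems
1.3 Challenges of big data
1.4 Big data systems
1.5 Large-scale data analysis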

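Chapter 2: Relational Data Management Systems
2.1 Relational data model
2.2 Relational operations
2.3 SQL
2.4 Database system architecture
2.5 Data storage
2.6 Buffer pool
2.7 Index structures
2.8 Implementation of relational operations
2.9 Transaction processing systems
2.10 Data warehouses
2.11 Parallel databases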

Chapter 3: Big Data Storage Systems
3.1 Distributed file systems
3.2 Google File System and HDFS
3.3 Key-value stores (BigTable, HBase, Dynamo, Cassandra, RocksDB)
3.4 Document stores (MongoDB)
3.5 Graph storage systems (Neo4j, JanusGraph, RDF)

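Chapter 4: Big Data Computation Systems
4.1 Overview of distributed computing
4.2 MapReduce cloud computing systems (MapReduce, Hadoop, Dryad)
4.3 MapReduce + SQL (Hive, Pig, SCOPE)
4.4 Graph computation systems (Pregel, GraphLab, PowerGraph)
4.5 In-memory computing (MMDB, Memcached, Redis, Spark)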

Chapter 5: Large-Scale Data Modeling and Analysis
5.1 Clustering and classification (concepts, hierarchical clustering, K-means, SVM, KNN)
5.2 Dimensionality reduction (principal component analysis, singular value decomposition, CUR)
5.3 Recommender systems (content-based recommendation, collaborative filtering, alternating least squares)
5.4 Locality-sensitive hashing
5.5 Large-scale data modeling platforms (Apache Mahout, Spark MLlib, etc.)
5.6 Application example: modeling and analysis of educational big data

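Chapter 6: Streaming Data Analysis and Processing
6.1 Data stream model
6.2 Stream processing systems (Storm)
6.3 Sampling and estimation on data streams
6.4 Filtering and analysis of data streams
6.5 Example stream applications (log streams, sensor data streams, etc.)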

References

[1]	Anand Rajaraman and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2013. (book website: http://i.stanford.edu/~ullman/mmds.html )
[2]	Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3]	Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, Second Edition, 2006.
[4]	Chuck Lam. Hadoop in Action. Manning Publications, First Edition, 2010.
[5]	C. Mohan. Tutorial: An In-Depth Look at Modern Database Systems. VLDB 2013, EDBT 2014.
[6]	Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google file system. SOSP 2003: 29-43.
[7]	Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, Robert Gruber: Bigtable: A Distributed Storage System for Structured Data. OSDI 2006: 205-218.
[8]	Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels: Dynamo: Amazon’s highly available key-value store. SOSP 2007: 205-220.
[9]	Avinash Lakshman, Prashant Malik: Cassandra: a decentralized structured storage system. Operating Systems Review 44(2): 35-40 (2010).
[10]	Brad Fitzpatrick: Distributed Caching with Memcached. Linux Journal (2004). http://www.linuxjournal.com/article/7451.
[11]	Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, Venkateshwaran Venkataramani: Scaling Memcache at Facebook. NSDI 2013: 385-398.
[12]	Redis. http://redis.io/
[13]	MongoDB. http://www.mongodb.org/	
[14]	Neo4j. http://neo4j.com/	
[15]	Jason Baker, Chris Bond, James C. Corbett, J. J. Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh: Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR 2011: 223-234.
[16]	Sudipto Das, Shoji Nishimura, Divyakant Agrawal, Amr El Abbadi: Albatross: Lightweight Elasticity in Shared Storage Databases for the Cloud using Live Data Migration. PVLDB 4(8): 494-505 (2011).
[17]	Sudipto Das, Divyakant Agrawal, Amr El Abbadi: ElasTraS: An elastic, scalable, and self-managing transactional database for the cloud. ACM Trans. Database Syst. 38(1): 5 (2013).
[18]	Hatem A. Mahmoud, Vaibhav Arora, Faisal Nawab, Divyakant Agrawal, Amr El Abbadi: MaaT: Effective and scalable coordination of distributed transactions in the cloud. PVLDB 7(5): 329-340 (2014).
[19]	Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150.
[20]	Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly: Dryad: distributed data-parallel programs from sequential building blocks. EuroSys 2007: 59-72.
[21]	Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy: Hive - a petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005.
[22]	Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur, Namit Jain, Joydeep Sen Sarma, Raghotham Murthy, Hao Liu: Data warehousing and analytics infrastructure at facebook. SIGMOD Conference 2010: 1013-1020.
[23]	Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. SIGMOD Conference 2008: 1099-1110.
[24]	Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, Jingren Zhou: SCOPE: easy and efficient parallel processing of massive data sets. PVLDB 1(2): 1265-1276 (2008).
[25]	Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski: Pregel: a system for large-scale graph processing. SIGMOD Conference 2010: 135-146.
[26]	Giraph. http://giraph.apache.org
[27]	Hama. https://hama.apache.org
[28]	Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein: GraphLab: A New Framework For Parallel Machine Learning. UAI 2010: 340-349.
[29]	Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein: Distributed GraphLab: A Framework for Machine Learning in the Cloud. PVLDB 5(8): 716-727 (2012).
[30]	Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, Carlos Guestrin: PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI 2012: 17-30.
[31]	Aapo Kyrola, Guy E. Blelloch, Carlos Guestrin: GraphChi: Large-Scale Graph Computation on Just a PC. OSDI 2012: 31-46.
[32]	Peter A. Boncz, Stefan Manegold, Martin L. Kersten: Database Architecture Optimized for the New Bottleneck: Memory Access. VLDB 1999: 54-65.
[33]	Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, Marios Skounakis: Weaving Relations for Cache Performance. VLDB 2001: 169-180.
[34]	Jun Rao, Kenneth A. Ross: Cache Conscious Indexing for Decision-Support in Main Memory. VLDB 1999: 78-89.
[35]	Shimin Chen, Phillip B. Gibbons, Todd C. Mowry: Improving Index Performance through Prefetching. SIGMOD Conference 2001: 235-246.
[36]	Shimin Chen, Anastassia Ailamaki, Phillip B. Gibbons, Todd C. Mowry: Improving Hash Join Performance through Prefetching. ICDE 2004: 116-127.
[37]	Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, Ion Stoica: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012: 15-28.
[38]	Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica: Shark: fast data analysis using coarse-grained distributed memory. SIGMOD Conference 2012: 689-692.
[39]	Cloudera Impala. http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
[40]	Storm. https://storm.incubator.apache.org/
[41]	James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson C. Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford: Spanner: Google’s Globally Distributed Database. ACM Trans. Comput. Syst. 31(3): 8 (2013).
[42]	Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, Himani Apte: F1: A Distributed SQL Database That Scales. PVLDB 6(11): 1068-1079 (2013).
[43]	http://static.googleusercontent.com/media/research.google.com/en/us/university/relations/facultysummit2010/storage_architecture_and_challenges.pdf
[44]	Ashish Gupta, Fan Yang, Jason Govig, Adam Kirsch, Kelvin Chan, Kevin Lai, Shuo Wu, Sandeep Govind Dhoot, Abhilash Rajesh Kumar, Ankur Agiwal, Sanjay Bhansali, Mingsheng Hong, Jamie Cameron, Masood Siddiqi, David Jones, Jeff Shute, Andrey Gubarev, Shivakumar Venkataraman, Divyakant Agrawal: Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing. PVLDB 7(12): 1259-1270 (2014).
[45]	Ronald Barber, Peter Bendel, Marco Czech, Oliver Draese, Frederick Ho, Namik Hrle, Stratos Idreos, Min-Soo Kim, Oliver Koeth, Jae-Gil Lee, Tianchao Tim Li, Guy M. Lohman, Konstantinos Morfonios, René Müller, Keshava Murthy, Ippokratis Pandis, Lin Qiao, Vijayshankar Raman, Richard Sidle, Knut Stolze, Sandor Szabo: Business Analytics in (a) Blink. IEEE Data Eng. Bull. 35(1): 9-14 (2012).
[46]	Vijayshankar Raman, Gopi K. Attaluri, Ronald Barber, Naresh Chainani, David Kalmuk, Vincent KulandaiSamy, Jens Leenstra, Sam Lightstone, Shaorong Liu, Guy M. Lohman, Tim Malkemus, René Müller, Ippokratis Pandis, Berni Schiefer, David Sharpe, Richard Sidle, Adam J. Storm, Liping Zhang: DB2 with BLU Acceleration: So Much More than Just a Column Store. PVLDB 6(11): 1080-1091 (2013).
[47]	Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, Mike Zwilling: Hekaton: SQL Server’s memory-optimized OLTP engine. SIGMOD Conference 2013: 1243-1254.
[48]	EMC Greenplum. http://www.emc.com/campaign/global/greenplumdca/index.htm
[49]	HP Vertica. http://www.vertica.com/
[50]	VoltDB. http://voltdb.com/
[51]	The SciDB Development Team: Overview of SciDB: Large Scale Array Storage, Processing and Analysis. SIGMOD Conference 2010.
[52]	Dawei Jiang, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, Sai Wu: epiC: an Extensible and Scalable System for Processing Big Data. PVLDB 7(7): 541-552 (2014).
[53]	Zhao Cao, Shimin Chen, Dongzhe Ma, Jianhua Feng, Min Wang: Efficient and Flexible Index Access in MapReduce. EDBT 2014: 61-72.
[54]	Zhao Cao, Shimin Chen, Feifei Li, Min Wang, Xiaoyang Sean Wang: LogKV: Exploiting Key-Value Stores for Log Processing. CIDR 2013.