DATA3404 Lecture Notes - Lecture 10: MapReduce, Apache Spark, Runtime System


Document Summary

- Joins with MapReduce: not directly possible out of the box; they require some hand-written programming. Ideally this is handled via automated plan optimisation and scheduling, built on top of Hadoop/HDFS so it remains usable with existing jobs and data stores (see the join sketch below).
- Apache Spark: an in-memory framework for distributed, iterative computations. Its core idea is to augment the data flow model with the Resilient Distributed Dataset (RDD), a fault-tolerant, in-memory storage abstraction.
- RDDs are created by parallelizing an existing collection, or by referencing a dataset in an external storage system such as HDFS (see the first sketch after this list).
- RDDs have partitions, based either on the source file's partitioning (such as the blocks of an HDFS file) or created during a transformation (e.g. repartition).
- Transformations create a new dataset from an existing one, e.g. map(func), flatMap(func), mapToPair(func), reduceByKey(func).
- Actions return a value to the driver program after running a computation on the dataset, e.g. count(), first(), collect(), saveAsTextFile(path).
- Most RDD operations take one or more functions as parameters; most of them can be viewed as higher-order functions.
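
The following is a minimal sketch (not from the lecture itself) of the concepts above using Spark's Java API, which is where mapToPair comes from: creating an RDD from a collection, chaining transformations, and triggering an action. The word-count input, the local[*] master, and the commented-out hdfs:// path are illustrative assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

public class RddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // (1) Create an RDD by parallelizing an existing collection.
        List<String> lines = Arrays.asList("to be or not to be", "that is the question");
        JavaRDD<String> fromCollection = sc.parallelize(lines);

        // (2) Alternatively, reference a dataset in external storage such as HDFS
        //     (hypothetical path):
        // JavaRDD<String> fromHdfs = sc.textFile("hdfs://namenode:9000/path/to/file.txt");

        // Transformations (flatMap, mapToPair, reduceByKey) lazily build new RDDs;
        // each takes a function as a parameter, i.e. they are higher-order functions.
        JavaPairRDD<String, Integer> counts = fromCollection
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Action: collect() runs the computation and returns the result to the
        // driver program; saveAsTextFile(path) would instead write it to storage.
        counts.collect().forEach(t -> System.out.println(t._1 + ": " + t._2));

        sc.stop();
    }
}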
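
Since the summary notes that joins need to be programmed by hand in plain MapReduce, the sketch below shows, under the same assumptions, how a join is a single built-in transformation on Spark pair RDDs. The (studentId, name) and (studentId, unit) records are made-up examples.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class JoinSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("join-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical (studentId, name) and (studentId, unit) pair RDDs.
        JavaPairRDD<Integer, String> students = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(1, "Ada"), new Tuple2<>(2, "Alan")));
        JavaPairRDD<Integer, String> enrolments = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(1, "DATA3404"), new Tuple2<>(2, "COMP2017")));

        // join() matches keys and yields (key, (left, right)) pairs; in plain
        // MapReduce the equivalent (e.g. a reduce-side join) must be hand-coded.
        JavaPairRDD<Integer, Tuple2<String, String>> joined = students.join(enrolments);

        // collect() is the action that brings the joined result to the driver.
        joined.collect().forEach(t ->
                System.out.println(t._1 + ": " + t._2._1 + " -> " + t._2._2));

        sc.stop();
    }
}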
