DATA3404 Lecture Notes - Lecture 10: MapReduce, Apache Spark, Runtime System


Document Summary

Joins are not directly possible out of the box with MapReduce; they require some extra programming. Ideally this is handled via automated plan optimisation and scheduling, built on top of Hadoop/HDFS and usable with existing jobs and data stores.

Apache Spark is an in-memory framework for distributed, iterative computations. Its core idea is to augment the data-flow model with the resilient distributed dataset (RDD), a fault-tolerant, in-memory storage abstraction. RDDs are created by parallelizing an existing collection, or by referencing a dataset in an external storage system such as HDFS. RDDs have partitions, based either on the source file's partitioning (such as the blocks of HDFS files) or created during a transformation (e.g. repartition).

A transformation creates a new dataset from an existing one, e.g. map(func), flatMap(func), mapToPair(func), reduceByKey(func). An action returns a value to the driver program after running a computation on the dataset, e.g. count(), first(), collect(), saveAsTextFile(path). Most RDD operations take one or more functions as parameters, so most of them can be viewed as higher-order functions.
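The word-count sketch below illustrates these concepts using Spark's Scala API, where a "pair RDD" is simply an RDD of tuples, so the Java API's mapToPair corresponds to a plain map that emits key-value pairs. The input data and HDFS paths are hypothetical placeholders, not taken from the lecture.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Create an RDD by parallelizing an existing collection ...
    val lines = sc.parallelize(Seq("to be or not to be", "that is the question"))
    // ... or by referencing a dataset in external storage such as HDFS:
    // val lines = sc.textFile("hdfs:///path/to/input")

    // Transformations create a new dataset from an existing one; they are lazy.
    val counts = lines
      .flatMap(line => line.split(" ")) // one element per word
      .map(word => (word, 1))           // key-value pairs (cf. mapToPair in the Java API)
      .reduceByKey(_ + _)               // sum the counts for each key

    // Partitions come from the source (e.g. HDFS blocks) or can be changed explicitly:
    val repartitioned = counts.repartition(4)

    // Actions run the computation and return a value to the driver program.
    println(repartitioned.count())           // number of distinct words
    repartitioned.collect().foreach(println) // bring all results to the driver
    // repartitioned.saveAsTextFile("hdfs:///path/to/output")

    sc.stop()
  }
}
```

Note how flatMap, map, and reduceByKey each take a function as a parameter, which is why such RDD operations can be viewed as higher-order functions.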
