Transformations

Read file from Local FS

For HDFS, specify the path as "hdfs://localhost:9000/lti/sample.txt"

lines = sc.textFile("file:///home/ak/sample.txt", <minPartitions>)   # optional second arg: minimum number of partitions

Create an RDD from a List

sc.parallelize(['b', 'c', 'a'])      # default number of partitions
sc.parallelize(['b', 'c', 'a'], 4)   # explicitly split into 4 partitions
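
To inspect how the elements are split across partitions, glom() returns one list per partition (output shown for illustration):

rdd = sc.parallelize(['b', 'c', 'a'], 2)
rdd.glom().collect()   # one list per partition, e.g. [['b'], ['c', 'a']]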

Flat Map

Applies a function to each element and flattens the results, so one input element can produce zero or more output elements. Commonly used to break lines of text into words

words = textsrc.flatMap(lambda line: line.split())
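
A self-contained sketch with hypothetical data:

src = sc.parallelize(["hello world", "hi"])
src.flatMap(lambda line: line.split()).collect()   # ['hello', 'world', 'hi']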

Flat Map Values

Flattens the values of a key-value pair while keeping the key. The supplied function can only access the value field

x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
x.flatMapValues(lambda value: value).collect()   # [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]

Map

Applies a function to each element, producing exactly one output element per input. Returns a new RDD; commonly used to create (key, value) pairs

wdmap = words.map(lambda word: (word, 1))     # each word becomes a (word, 1) pair
ratings = lines.map(lambda x: x.split()[2])   # extract the third field of each line

Map Values

Applies a function to the value of each key-value pair, keeping the key unchanged. The supplied function can only access the value field

x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
x.mapValues(lambda f: len(f)).collect()   # [('a', 3), ('b', 1)]

Filter

The predicate returns a Boolean for each element; only elements for which it returns True pass through

newRDD = rdd.filter(lambda x: "TMIN" in x[1])   # keep only records whose second field contains "TMIN"

Caching of Data

text_file.cache()       # shorthand for persist(StorageLevel.MEMORY_ONLY)
text_file.persist()     # optionally takes an explicit StorageLevel
text_file.unpersist()   # remove the RDD from the cache

NOTE

  • An action needs to be performed before the data is actually cached (see the sketch below)
  • Cached RDDs are visible under the Storage tab in the Spark Web UI
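
A minimal sketch of the caching flow, assuming an existing SparkContext sc:

from pyspark import StorageLevel

text_file = sc.textFile("file:///home/ak/sample.txt")
text_file.persist(StorageLevel.MEMORY_AND_DISK)   # pick an explicit storage level
text_file.count()                                 # first action materialises the cache
text_file.count()                                 # now served from the cache
text_file.unpersist()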

Number of Partitions

textfile.getNumPartitions()

Partitions

rdd.repartition(4)      # Increase or decrease the number of partitions; always shuffles the data
rdd.coalesce(<num>)     # Reduce the number of partitions; avoids a full shuffle
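
For example, assuming an existing RDD rdd:

rdd4 = rdd.repartition(4)
rdd2 = rdd4.coalesce(2)
print(rdd4.getNumPartitions())   # 4
print(rdd2.getNumPartitions())   # 2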

Actions

counts.collect()   # return all elements of the RDD to the driver
rdd.take(<num>)    # return the first <num> elements

Save the RDD as a text file

counts.saveAsTextFile("/lti/sparkwc")   # writes a directory of part files, not a single file

Count By Value

Counts the occurrences of each unique value. This is an action; it returns a defaultdict of (value, count) pairs to the driver

result = ratings.countByValue()
 
for key, value in result.items():
	print(str(key) + " " + str(value))

Reduce by Key

Performs aggregation of values that share the same key.
Values are combined locally on each partition (map side) before being shuffled and merged at the reducer, which cuts down network traffic

counts = wdmap.reduceByKey(lambda a, b: a + b)
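
A self-contained sketch with hypothetical data:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
pairs.reduceByKey(lambda a, b: a + b).collect()   # e.g. [('a', 2), ('b', 1)]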

Group By Key

Groups values with the same key into an iterable. Unlike reduceByKey, there is no map-side combine: all values are shuffled to the reducer stage, so prefer reduceByKey for aggregations

groupByKey()
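
A self-contained sketch with hypothetical data:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.groupByKey().mapValues(list).collect()   # e.g. [('a', [1, 3]), ('b', [2])]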

Sort By Key

Sorts an RDD of key-value pairs by key

wordCount.map(lambda kv: (kv[1], kv[0])).sortByKey()   # swap to (count, word), then sort by count

Create RDD of just Keys or Values

rdd.keys()     # new RDD of just the keys
rdd.values()   # new RDD of just the values
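
For example, with hypothetical data:

pairs = sc.parallelize([("a", 1), ("b", 2)])
pairs.keys().collect()     # ['a', 'b']
pairs.values().collect()   # [1, 2]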
