
Spark getting started

From here.

Concepts

  1. RDD: RDD stands for Resilient Distributed Dataset, Spark’s primary abstraction: a distributed collection of items.

  2. Actions & Transformations: actions return values to the driver, while transformations lazily return new RDDs (see the sketch after the shell example below).

  3. Start the Spark shell (you can choose Scala or Python):

./bin/pyspark
# Load the README into an RDD first (inside the pyspark shell, run from the Spark directory).
textFile = sc.textFile("README.md")
# Action.
textFile.count()
textFile.first()
# Transformation.
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
textFile.filter(lambda line: "Spark" in line).count()
# RDD actions and transformations can be used for more complex computations.
textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
# One common data flow pattern is MapReduce, as popularized by Hadoop.
# Spark can implement MapReduce flows easily
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
wordCounts.collect()
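
To make the action/transformation distinction described above concrete, here is a minimal sketch you can paste into the same pyspark shell (the sample numbers are made up for illustration):

# Transformations such as map() and filter() are lazy: they only build new RDDs.
nums = sc.parallelize([1, 2, 3, 4, 5])
doubled = nums.map(lambda x: x * 2)        # no computation happens yet
bigOnes = doubled.filter(lambda x: x > 4)  # still no computation
# Actions trigger the actual computation and return values to the driver.
bigOnes.collect()   # [6, 8, 10]
bigOnes.count()     # 3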
  4. Self-Contained Applications

Suppose we wish to write a self-contained application using the Spark API. We will walk through a simple application in Scala (with sbt), Java (with Maven), and Python (note: I prefer Python since it is the quickest to write of the three).

"""SimpleApp.py"""
from pyspark import SparkContext

# Replace "YOUR_SPARK_HOME" with your Spark directory.
logFile = "YOUR_SPARK_HOME/README.md"
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

sc.stop()

Save the file, and execute it with the following command:

./bin/spark-submit --master local[4] SimpleApp.py
# Note: if you use zsh like me, you should use the modified version like
# below from: http://zpjiang.me/2015/10/17/zsh-no-match-found-local-spark/
./bin/spark-submit --master 'local[4]' SimpleApp.py
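
Alternatively, the master URL and application name can be set in code through SparkConf rather than on the spark-submit command line. A minimal sketch of that variant (the file name SimpleAppConf.py is just an example, not from the original post):

"""SimpleAppConf.py: the same letter-count app, configured via SparkConf."""
from pyspark import SparkConf, SparkContext

# Replace "YOUR_SPARK_HOME" with your Spark directory.
logFile = "YOUR_SPARK_HOME/README.md"

# Master and app name come from the code instead of command-line flags.
conf = SparkConf().setAppName("Simple App").setMaster("local[4]")
sc = SparkContext(conf=conf)

logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

sc.stop()

Because the master is set in code, this version can be submitted with plain ./bin/spark-submit SimpleAppConf.py, which also sidesteps the zsh quoting issue mentioned above.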
