RDD: short for Resilient Distributed Dataset. Spark’s primary abstraction is a distributed collection of items called an RDD.
Actions & Transformations: actions return values to the driver, while transformations return pointers to new RDDs and are evaluated lazily.
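For example, here is a minimal sketch in the PySpark shell (where a SparkContext is already available as `sc`; the sample values are just illustrative):

```python
# Create an RDD from an in-memory collection.
nums = sc.parallelize([1, 2, 3, 4])

# Transformation: returns a new RDD immediately; nothing is computed yet.
doubled = nums.map(lambda x: x * 2)

# Action: triggers the computation and returns values to the driver.
print(doubled.collect())  # [2, 4, 6, 8]
```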
Start the Spark Shell (you can choose Scala or Python)
```
./bin/pyspark

# The shell provides a SparkContext as `sc`; build an RDD from the README.
textFile = sc.textFile("README.md")

# Action.
textFile.count()
textFile.first()

# Transformation.
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
textFile.filter(lambda line: "Spark" in line).count()

# RDD actions and transformations can be used for more complex computations.
textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

# One common data flow pattern is MapReduce, as popularized by Hadoop.
# Spark can implement MapReduce flows easily.
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.collect()
```
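The SimpleApp.py example below calls .cache(); as a rough sketch of what caching does, assuming the linesWithSpark RDD from the shell session above:

```python
# Mark the RDD for in-memory caching; partitions are cached the first
# time an action materializes them.
linesWithSpark.cache()

linesWithSpark.count()  # first action: computes the RDD and caches it
linesWithSpark.count()  # later actions reuse the cached data
```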
Self-Contained Applications
Suppose we wish to write a self-contained application using the Spark API. We will walk through a simple application in Scala (with sbt), Java (with Maven), and Python (note: I prefer Python, since it is the quickest to get started with).
```python
"""SimpleApp.py"""
from pyspark import SparkContext

# Replace "YOUR_SPARK_HOME" with your Spark directory.
logFile = "YOUR_SPARK_HOME/README.md"
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

# Count the lines containing 'a' and the lines containing 'b'.
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

sc.stop()
```
Save the file, and execute it with the following command:
```bash
./bin/spark-submit --master local[4] SimpleApp.py

# Note: if you use zsh like me, you should use the quoted version below;
# see http://zpjiang.me/2015/10/17/zsh-no-match-found-local-spark/
./bin/spark-submit --master 'local[4]' SimpleApp.py
```