Analyze your first dataset

You can find this tutorial in the demos folder of your Jupyter notebook environment.

spark_titanic.ipynb

This tutorial walks you through a basic data analysis using Spark in Camber:

Load the Titanic dataset hosted on the Camber Open Stash, which you have access to by default.
Use Spark functionalties to transform and aggregate this dataset.

Load the dataset

First, import camber. Also import the functions module from pyspark.sql, which is needed for the following analysis.

import camber
from pyspark.sql import functions as sf

Create a Spark session hassle free with the camber.spark.connect() method. Camber provisions a Spark cluster to you. For this use case, an XSMALL engine is enough. For more details on engine sizing, read Engine Attributes.

spark = camber.spark.connect(engine_size="XSMALL")

Spark initialized! Remember to stop the spark session when done: spark.stop()

Access the open stash through camber.stash, and use it to load a dataset into a Spark DataFrame.

titanic = camber.stash.public.read_spark("datasets/tutorial/titanic.csv", spark, format="csv", header=True)

With the DataFrame loaded, print its schema:

titanic.printSchema()

root
 |-- PassengerId: string (nullable = true)
 |-- Survived: string (nullable = true)
 |-- Pclass: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- SibSp: string (nullable = true)
 |-- Parch: string (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: string (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

You can also get a sample view of the DataFrame. Disable the truncate option to print the full output for every column (instead of trucating ones that are too long):

titanic.show(10, truncate=False)

+-----------+--------+------+---------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|Name                                               |Sex   |Age |SibSp|Parch|Ticket          |Fare   |Cabin|Embarked|
+-----------+--------+------+---------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+
|1          |0       |3     |Braund, Mr. Owen Harris                            |male  |22  |1    |0    |A/5 21171       |7.25   |null |S       |
|2          |1       |1     |Cumings, Mrs. John Bradley (Florence Briggs Thayer)|female|38  |1    |0    |PC 17599        |71.2833|C85  |C       |
|3          |1       |3     |Heikkinen, Miss Laina                              |female|26  |0    |0    |STON/O2. 3101282|7.925  |null |S       |
|4          |1       |1     |Futrelle, Mrs. Jacques Heath (Lily May Peel)       |female|35  |1    |0    |113803          |53.1   |C123 |S       |
|5          |0       |3     |Allen, Mr. William Henry                           |male  |35  |0    |0    |373450          |8.05   |null |S       |
|6          |0       |3     |Moran, Mr. James                                   |male  |null|0    |0    |330877          |8.4583 |null |Q       |
|7          |0       |1     |McCarthy, Mr. Timothy J                            |male  |54  |0    |0    |17463           |51.8625|E46  |S       |
|8          |0       |3     |Palsson, Master Gosta Leonard                      |male  |2   |3    |1    |349909          |21.075 |null |S       |
|9          |1       |3     |Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  |female|27  |0    |2    |347742          |11.1333|null |S       |
|10         |1       |2     |Nasser, Mrs. Nicholas (Adele Achem)                |female|14  |1    |0    |237736          |30.0708|null |C       |
+-----------+--------+------+---------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 10 rows

Access columns in the DataFrame in one of the following ways. The rest of the tutorial uses all these methods to show what’s possible.

titanic.select("PassengerId", titanic.PassengerId, sf.col("PassengerId")).show(5)

+-----------+-----------+-----------+
|PassengerId|PassengerId|PassengerId|
+-----------+-----------+-----------+
|          1|          1|          1|
|          2|          2|          2|
|          3|          3|          3|
|          4|          4|          4|
|          5|          5|          5|
+-----------+-----------+-----------+
only showing top 5 rows

Analyze the dataset

Find the distinct values of the Embarked column, and then order the output in ascending order.

titanic.select(titanic.Embarked).distinct().orderBy(titanic.Embarked).show()

+--------+
|Embarked|
+--------+
|    null|
|       C|
|       Q|
|       S|
+--------+

Notice how this tutorial uses the DataFrame.show() method. This is because Spark executes lazily. A rough idea is that certain methods create the execution graph, while others force the execution. See Transformations vs Actions.

Filter for all survived passengers:

survivors = titanic.filter(sf.col("Survived") == "1")
survivors.show(5)

+-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex|Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+
|          2|       1|     1|Cumings, Mrs. Joh...|female| 38|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss L...|female| 26|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female| 35|    1|    0|          113803|   53.1| C123|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female| 27|    0|    2|          347742|11.1333| null|       S|
|         10|       1|     2|Nasser, Mrs. Nich...|female| 14|    1|    0|          237736|30.0708| null|       C|
+-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+
only showing top 5 rows

Now count the number of passengers in each Pclass (passenger class).

classes = titanic.groupBy("Pclass").agg(sf.count("*").alias("Pcount")).orderBy("Pclass")
classes.show()

+------+------+
|Pclass|Pcount|
+------+------+
|     1|   216|
|     2|   184|
|     3|   491|
+------+------+

Congratulations! You just ran a simple data analysis in Spark. Remember to call spark.stop() to kill your Spark session.

spark.stop()

Typically, Spark is most appropriate when using large datasets. For an example, try the Plot GAIA all-sky map notebook, which creates a histogram from a terabyte of astronomical data.

Run your first job What is Camber?

Analyze your first dataset

Load the dataset

Analyze the dataset

Read more