Analyze your first dataset
You can find this tutorial in the demos folder of your Jupyter notebook environment:
- spark_titanic.ipynb
This tutorial walks you through a basic data analysis using Spark in Camber:
- Load the Titanic dataset hosted on the Camber Open Stash, which you have access to by default.
- Use Spark functionality to transform and aggregate this dataset.
Load the dataset
First, import camber. Also import the functions module from pyspark.sql, which the following analysis needs.
import camber
from pyspark.sql import functions as sf
spark = camber.spark.connect(engine_size="XSMALL")
Access the open stash through camber.stash, and use it to load the dataset into a Spark DataFrame.
titanic = camber.stash.open_stash.read_spark("datasets/tutorial/titanic.csv", spark, format="csv", header=True)
titanic.printSchema()
You can also get a sample view of the DataFrame. Disable the truncate option to print the full output for every column (instead of truncating values that are too long):
titanic.show(10, truncate=False)
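You can refer to a column by its name, as an attribute of the DataFrame, or with sf.col(). The following call selects the same column in all three ways:
titanic.select("PassengerId", titanic.PassengerId, sf.col("PassengerId")).show(5)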
Analyze the dataset
Find the distinct values of the Embarked column, and sort the output in ascending order.
titanic.select(titanic.Embarked).distinct().orderBy(titanic.Embarked).show()
Notice how this tutorial calls the DataFrame.show() method at each step. This is because Spark executes lazily. Roughly speaking, certain methods (such as select(), distinct(), and orderBy()) only build up the execution graph, while others (such as show()) force Spark to execute it.
Filter for the passengers who survived:
survivors = titanic.filter(sf.col("Survived") == "1")
survivors.show(5)
Now count the number of passengers in each Pclass (passenger class).
classes = titanic.groupBy("Pclass").agg(sf.count("*").alias("Pcount")).orderBy("Pclass")
classes.show()
Call spark.stop() to kill your Spark session.
spark.stop()
Read more
Typically, Spark is most appropriate when using large datasets. For an example, try the Plot GAIA all-sky map notebook, which creates a histogram from a terabyte of astronomical data.