Analyze your first dataset
You can find this tutorial in the demos folder of your Jupyter notebook environment:
- spark_titanic.ipynb
This tutorial walks you through a basic data analysis using Spark in Camber:
- Load the Titanic dataset hosted on the Camber Open Stash, which you have access to by default.
- Use Spark functionality to transform and aggregate this dataset.
Load the dataset
First, import camber. Also import the functions module from pyspark.sql, which is needed for the following analysis.
import camber
from pyspark.sql import functions as sf
Create a Spark session with camber.spark.connect(). Camber provisions a Spark cluster for you.
For this use case, an XSMALL engine is enough.
For more details on engine sizing, read Engine Attributes.
spark = camber.spark.connect(engine_size="XSMALL")
Access the open stash through camber.stash, and use it to load a dataset into a Spark DataFrame.
titanic = camber.stash.public.read_spark("datasets/tutorial/titanic.csv", spark, format="csv", header=True)
You can also get a sample view of the DataFrame. Disable the truncate option to print the full output for every column (instead of truncating ones that are too long):
titanic.show(5, truncate=False)
A column can be referenced by name, as a DataFrame attribute, or with sf.col():
titanic.select("PassengerId", titanic.PassengerId, sf.col("PassengerId")).show(5)
Analyze the dataset
Find the distinct values of the Embarked column, and then sort the output in ascending order.
Notice how this tutorial uses the show() method to display results.
This is because Spark executes lazily.
Roughly, some methods (transformations, such as filter or groupBy) only build the execution graph, while others (actions, such as show or count) force the execution. See Transformations vs Actions.
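The lazy-execution idea can be illustrated with a plain-Python analogy. This sketch is not Spark code: it uses a generator to stand in for a transformation and a sum to stand in for an action, and the names read_rows and log are hypothetical helpers introduced only for illustration.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
log = []  # records every row that is actually read


def read_rows():
    """Yield rows one at a time, noting each read in `log`."""
    for row in [{"Survived": "1"}, {"Survived": "0"}, {"Survived": "1"}]:
        log.append("read")
        yield row


# "Transformation": describes the pipeline, but no row is read yet.
survivors = (row for row in read_rows() if row["Survived"] == "1")
reads_before_action = len(log)

# "Action": consuming the pipeline forces the reads and the filtering.
survivor_count = sum(1 for _ in survivors)
reads_after_action = len(log)
```

Here reads_before_action stays at zero because building the pipeline does no work; the rows are only read once the result is consumed. In Spark, filter and groupBy play the role of the generator expression, and show or count play the role of the final sum.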
Filter for all survived passengers:
survivors = titanic.filter(sf.col("Survived") == "1")
Now count the number of passengers in each Pclass (passenger class).
classes = titanic.groupBy("Pclass").agg(sf.count("*").alias("Pcount")).orderBy("Pclass")
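To check your understanding of what the filter and the grouped count compute, the same logic can be sketched in plain Python on a few rows. The sample rows below are made up for illustration and are not taken from the actual dataset; values are kept as strings to mirror the comparison with "1" used in the filter.

```python
from collections import Counter

# Hypothetical sample rows (not real Titanic data).
sample = [
    {"Pclass": "3", "Survived": "0"},
    {"Pclass": "1", "Survived": "1"},
    {"Pclass": "3", "Survived": "1"},
    {"Pclass": "2", "Survived": "0"},
    {"Pclass": "3", "Survived": "0"},
]

# Mirrors titanic.filter(sf.col("Survived") == "1")
survivors = [row for row in sample if row["Survived"] == "1"]

# Mirrors groupBy("Pclass").agg(count("*").alias("Pcount")).orderBy("Pclass")
pcount = Counter(row["Pclass"] for row in sample)
classes = sorted(pcount.items())  # list of (Pclass, Pcount) pairs
```

Unlike this eager sketch, the Spark versions of survivors and classes are only computed when you call an action on them, for example classes.show().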
When you are done, remember to kill your Spark session:
spark.stop()
Read more
Typically, Spark is most appropriate when using large datasets. For an example, try the Plot GAIA all-sky map notebook, which creates a histogram from a terabyte of astronomical data.