camber.stash

The Stash class offers a quasi-filesystem for interacting with your cloud storage. Each user has a private personal stash, one or more shared team stashes (depending on team memberships in the Camber platform), and a public stash. Users have read/write access to their personal stashes by default; the public stash is read-only.

import camber.stash

prv_stash = camber.stash.private
pub_stash = camber.stash.public

Attributes

When you import camber.stash, you can access three types of stashes:

private
The stash that only you can access.
team[<TEAM>]
The stash shared by the members of <TEAM>. (Not available if you are not in any team.)
public
The stash of datasets that all Camber users can read.

As documented in the next section, all stash types share the same methods for manipulating and moving files.

Each Stash object, when initialized, defaults to the root level, referred to as ~ or the HOME directory. On the private stash, this corresponds to the root of the user’s notebook filesystem. On other stashes, it is object storage presented with the structure of a UNIX filesystem.

Methods

For the following documentation, imagine we have the following directory structure in a given Stash:

    • data/
      • dataset1.csv
      • dataset2.csv
      • images/
        • image1.jpg
        • image2.png
        • image3.gif
    • docs/
      • README.md
      • reference.md
    • src/
      • main.py
      • utils.py
      • tests/
        • test_main.py
        • test_utils.py

    cd
    This is equivalent to cd in shell: it changes the current working directory to the provided path argument.

    import camber
    
    stash = camber.stash.private
    stash
    # prints <PrivateStash(.)>
    
    # set current directory to be ./docs
    stash.cd("docs")
    
    # set current directory back to home (~)
    stash.cd("~")
    
    # chain cd commands to cd into ./data/images
    stash.cd("data").cd("images")
    
    # change directory using an absolute path
    stash.cd("~/src")

    Args

    path: str
    Path to which the user wants to set their current directory.

    Returns

    self
    Original Stash with the current directory changed.
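    Because cd returns the stash itself, calls can be chained, as in stash.cd("data").cd("images"). A minimal sketch of this return-self pattern, using a hypothetical ToyStash class rather than the real implementation:

```python
from posixpath import join, normpath

class ToyStash:
    """Toy illustration of cd returning self to enable chaining."""
    def __init__(self):
        self.cwd = "~"

    def cd(self, path):
        if path.startswith("~"):
            self.cwd = path                            # absolute path from home
        else:
            self.cwd = normpath(join(self.cwd, path))  # relative path
        return self                                    # returning self enables chaining

stash = ToyStash()
stash.cd("data").cd("images")
print(stash.cwd)  # ~/data/images
```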

    ls

    This is equivalent to ls in shell. It lists everything at the provided path, which defaults to the current directory. Since cd changes the current directory, it affects what ls shows, just like on a real filesystem. Directories in the results are denoted with a trailing / at the end of their names. Basic wildcards * and ? are also supported in this method, and a max_results parameter caps the number of entries returned.

    import camber
    stash = camber.stash.private
    
    # list everything in the data directory
    stash.ls("data")
    # prints:
    # [
    #   'dataset1.csv',
    #   'dataset2.csv',
    #   'images/',
    # ]
    
    # using wildcards
    stash.ls("data/images/*.png")
    # prints:
    # ['image2.png']
    
    # combined use with `cd`
    stash.cd("docs").ls()
    # prints:
    # [
    #   'README.md',
    #   'reference.md',
    # ]

    Args

    path: str
    Path to which the user wants to list files/directories.
    Defaults to the current directory (`.`).
    max_results: int
    Maximum number of results to return.
    Defaults to 100.
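    The * and ? wildcards follow standard shell globbing rules: * matches any run of characters, ? matches exactly one. Python’s stdlib fnmatch module implements the same rules, so it can be used to preview which names a pattern would select (the real Stash backend’s matching may differ in edge cases; this is only an illustration):

```python
from fnmatch import fnmatch

names = ["image1.jpg", "image2.png", "image3.gif"]

# '*' matches any run of characters
print([n for n in names if fnmatch(n, "*.png")])       # ['image2.png']

# '?' matches exactly one character
print([n for n in names if fnmatch(n, "image?.jpg")])  # ['image1.jpg']
```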

    rm

    This is equivalent to rm in shell. It removes the file or directory at the designated path and also supports the basic wildcards * and ?. The recursive flag must be set to True if the target is a directory or if wildcards are used; otherwise the call fails.

    import camber
    stash = camber.stash.private
    
    # remove file in the docs directory
    stash.rm("docs/README.md")
    
    # remove the 'tests' directory in 'src', must use `recursive` or the op fails
    stash.rm("src/tests", recursive=True)
    
    # remove all 'png' files in data/images, `recursive` must be set to True
    stash.rm("data/images/*.png", recursive=True)

    Args

    path: str
    Path of the file/directory the user wants to remove
    recursive: bool
    Whether to delete recursively, required for deleting directories
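    The recursive guard mirrors the rm vs. rm -r distinction in a shell. A sketch of how such a check might look (hypothetical, not the actual implementation):

```python
def check_rm(path, is_dir, recursive=False):
    """Refuse directory or wildcard deletions unless recursive=True."""
    has_wildcard = any(c in path for c in "*?")
    if (is_dir or has_wildcard) and not recursive:
        raise ValueError("recursive=True is required for directories or wildcards")
    return True

check_rm("docs/README.md", is_dir=False)            # plain file: allowed
check_rm("src/tests", is_dir=True, recursive=True)  # directory: needs recursive
```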

    cp

    This is equivalent to cp in shell. It copies a file or directory from src_path to dest_path. If the source and destination are in different stashes, you must supply a dest_stash object. If the source is a directory, the recursive flag must be set to True. You can also use the >>/<< shortcuts to copy a stash’s current directory recursively into another stash; the arrows always point to the receiving stash. Basic wildcards * and ? are supported in this method.

    import camber
    private = camber.stash.private
    mygroup = camber.stash.team["mygroup"]
    
    private
    # prints <PrivateStash(.)>
    mygroup
    # prints <TeamStash[mygroup](.)>
    
    # copies ./docs/README.md to ./README.md in the private stash
    private.cp(src_path="docs/README.md", dest_path=".")
    
    # copies everything in ./docs to ./docs-deprecated recursively
    private.cp(src_path="docs", dest_path="docs-deprecated", recursive=True)
    
    # copies the ./datasets directory from team stash to ./datasets in private stash
    mygroup.cp(
    		dest_stash=private,
    		src_path="datasets",
    		dest_path="datasets",
    		recursive=True
    )
    
    # syntactic sugar, the logic above could also be written as
    mygroup.cd("~/datasets") >> private.cd("~/datasets")

    Args

    dest_stash: Stash
    Stash where the file/directory will be copied to, defaults to self
    src_path: str
    Path of the source file/directory to be copied
    dest_path: str
    Destination path where the file/directory will be copied
    recursive: bool
    Whether to copy recursively, required for copying directories
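    The >>/<< arrow shortcuts are plain Python operator overloading: `a >> b` invokes `a.__rshift__(b)`, so both arrows can delegate to cp with the arrow always pointing at the receiving stash. A toy sketch of the idea (hypothetical, not the real implementation):

```python
class ToyStash:
    """Toy stash that records copies to demonstrate arrow direction."""
    def __init__(self, name):
        self.name = name
        self.copied = []  # record of (src, dest) pairs for the demo

    def cp(self, dest_stash, src_path=".", dest_path=".", recursive=False):
        dest_stash.copied.append((self.name, dest_stash.name))
        return dest_stash

    def __rshift__(self, other):  # self >> other: copy self -> other
        return self.cp(dest_stash=other, recursive=True)

    def __lshift__(self, other):  # self << other: copy other -> self
        return other.cp(dest_stash=self, recursive=True)

team, private = ToyStash("team"), ToyStash("private")
team >> private        # team's cwd copied into private
private << team        # same direction, written the other way round
print(private.copied)  # [('team', 'private'), ('team', 'private')]
```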

    read_spark

    This allows users to read data from a Stash into a Spark DataFrame. The user needs to supply the path to the data in the stash, a live Spark session, the format of the data, and any necessary Spark format options. For more information on the available formats and their corresponding options, please refer to the Spark documentation.

    import camber
    
    spark = camber.spark.connect(worker_size="XSMALL")
    # wait for spark to start up
    
    # read csv data into pyspark DataFrame
    private = camber.stash.private
    df = private.read_spark(
    		path="data/*.csv",
    		spark_session=spark,
    		format="csv",
    		header=True,
    		inferSchema=True
    )

    Args

    path: str
    Path to data file/directory
    spark_session: pyspark.sql.SparkSession
    Live Spark session
    format: str
    The format of the data, defaults to "csv"
    **fmt_opts
    Format options for Spark

    Returns

    pyspark.sql.DataFrame
    DataFrame containing the data from the stash

    write_spark

    This method allows users to write data from a Spark DataFrame into a Stash. The user needs to specify the DataFrame to be written, the destination path in the stash, the data format, and any necessary Spark format options. If single_file is set to True, the data will be written as a single file; otherwise, it will be written as multiple part files. If overwrite_existing is set to True, existing files at the destination path will be overwritten. For more details on the available formats and their corresponding options, please refer to the Spark documentation.

    # ... follow up from `read_spark` example
    # apply processing logic to `df`
    
    # write output to `json` form instead
    private.write_spark(
    		df=df,
    		path="output/",
    		format="json",
    		single_file=True,
    		overwrite_existing=True
    )
    # outputs:
    # output/*.json

    Args

    df: pyspark.sql.DataFrame
    DataFrame to be written
    path: str
    Destination path in the Stash
    format: str
    The format of the data, defaults to "csv"
    single_file: bool
    If set to True, data will be written as a single file
    overwrite_existing: bool
    If set to True, existing files at the destination will be overwritten
    **fmt_opts
    Format options for Spark