camber.stash
The Stash class offers a quasi-filesystem to interact with your cloud storage.
Each user has a private personal stash, one or more shared team stashes depending on team memberships in the Camber platform, and a public stash.
Users have read/write access to their personal stashes by default.
The public stash is read-only.
import camber.stash
prv_stash = camber.stash.private
pub_stash = camber.stash.public
Attributes
When you import camber.stash, you can access three types of stashes:
- private - The stash that only you can access.
- team[<TEAM>] - The stash shared by the members of <TEAM>. (Not available if you are not in a team.)
- public - The stash of datasets that all Camber users can read.
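All three attributes return Stash objects with the same interface. A brief sketch of accessing each (the team name "research" is an illustrative assumption):
import camber.stash
private = camber.stash.private        # read/write, visible only to you
team = camber.stash.team["research"]  # shared with members of the "research" team
public = camber.stash.public          # read-only datasets for all Camber users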
As documented in the next section, all stashes share the same methods for manipulating and moving files.
Each Stash object, when initialized, defaults to the root level, referred to as the ~ or HOME directory.
On the private stash, this corresponds to the root level of a user’s notebook filesystem.
On other stashes, it is object storage that maps onto the structure of a UNIX filesystem.
Methods
For the following documentation, imagine we have the following directory structure in a given Stash:
- data/
  - dataset1.csv
  - dataset2.csv
  - images/
    - image1.jpg
    - image2.png
    - image3.gif
- docs/
  - README.md
  - reference.md
- src/
  - main.py
  - utils.py
  - tests/
    - test_main.py
    - test_utils.py
cd
This is equivalent to cd in shell.
Changes the current working directory to the provided path argument.
import camber
stash = camber.stash.private
stash
# prints <PrivateStash(.)>
# set the current directory to ./docs
stash.cd("docs")
# go back to the home (root) directory
stash.cd("~")
# chain cd calls to move into ./data/images
stash.cd("data").cd("images")
# cd using an absolute path
stash.cd("~/src")
Args
path : str - Path to which the user wants to set their current directory.
Returns
self - The original Stash with its current directory changed.
ls
This is equivalent to ls in shell.
Lists everything in the provided path, which defaults to the current directory.
Since cd changes the current directory, it affects what ls shows, just as on a filesystem.
Directories in the results are denoted with a trailing / at the end of their names.
Note that the basic wildcards * and ? are also supported in this method.
A max_results parameter caps the number of results ls returns.
import camber
stash = camber.stash.private
# list everything in the data directory
stash.ls("data")
# prints:
# [
# 'dataset1.csv',
# 'dataset2.csv',
# 'images/',
# ]
# using wildcards
stash.ls("data/images/*.png")
# prints:
# ['image2.png']
# combined use with `cd`
stash.cd("docs").ls()
# prints:
# [
# 'README.md',
# 'reference.md',
# ]
Args
path : str - Path whose files/directories to list. Defaults to the current directory, or ".".
max_results : int - Maximum number of results to return. Defaults to 100.
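Combined with wildcards, max_results keeps large listings manageable. A brief sketch (which entries survive the truncation is an assumption, not documented behavior):
# return at most 2 entries from the data directory
stash.ls("data", max_results=2)
# prints a list of at most two entries, e.g.:
# [
#   'dataset1.csv',
#   'dataset2.csv',
# ]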
rm
This is equivalent to rm in shell.
Removes either a file or a directory at the designated path.
This method also supports the basic wildcards * and ?.
The recursive flag must be set to True if the target is a directory or wildcards are used; otherwise the call fails.
import camber
stash = camber.stash.private
# remove file in the docs directory
stash.rm("docs/README.md")
# remove the 'tests' directory in 'src', must use `recursive` or the op fails
stash.rm("src/tests", recursive=True)
# remove all 'png' files in data/images, `recursive` must be set to True
stash.rm("data/images/*.png", recursive=True)
Args
path : str - Path of the file/directory the user wants to remove
recursive : bool - Whether to delete recursively; required for deleting directories
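Since removing a directory without recursive=True fails, you may want to guard such calls. A hedged sketch (the concrete exception class raised is not documented, so we catch Exception broadly):
import camber
stash = camber.stash.private
try:
    # fails: 'src/tests' is a directory and recursive is not set
    stash.rm("src/tests")
except Exception as err:  # exact exception type is an assumption
    print(f"rm failed: {err}")
# succeeds
stash.rm("src/tests", recursive=True)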
cp
This is equivalent to cp in shell.
Copies either a file or a directory from src_path to dest_path.
If the source and destination are in different stashes, you must supply a dest_stash object.
If the source is a directory, the recursive flag must be set to True.
You can also use the >> / << shortcuts to copy each stash’s current directory recursively to another stash; the arrows always point to the receiving stash.
Note that the basic wildcards * and ? are supported in this method.
import camber
private = camber.stash.private
mygroup = camber.stash.team["mygroup"]
private
# prints <PrivateStash(.)>
mygroup
# prints <TeamStash[mygroup](.)>
# copies ./docs/README.md to ./README.md in the private stash
private.cp(src_path="docs/README.md", dest_path=".")
# copies everything in ./docs to ./docs-deprecated recursively
private.cp(src_path="docs", dest_path="docs-deprecated", recursive=True)
# copies the ./datasets directory from team stash to ./datasets in private stash
mygroup.cp(
    dest_stash=private,
    src_path="datasets",
    dest_path="datasets",
    recursive=True,
)
# syntactic sugar, the logic above could also be written as
mygroup.cd("~/datasets") >> private.cd("~/datasets")
Args
dest_stash : Stash - Stash to which the file/directory will be copied; defaults to self
src_path : str - Path of the source file/directory to be copied
dest_path : str - Destination path to which the file/directory will be copied
recursive : bool - Whether to copy recursively; required for copying directories
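The << shortcut mirrors >>, with the arrows still pointing at the receiving stash. A brief sketch, assuming << is the exact mirror of the >> example above:
# same copy as the >> example, written from the receiving side:
# mygroup's ~/datasets is copied recursively into private's ~/datasets
private.cd("~/datasets") << mygroup.cd("~/datasets")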
read_spark
This allows users to read data from a Stash into a Spark DataFrame.
The user needs to supply the path to the data in the stash, a live Spark session, the format of the data, and any necessary Spark format options.
For more information on the available formats and their corresponding options, please refer to the Spark documentation.
import camber
spark = camber.spark.connect(worker_size="XSMALL")
# wait for spark to start up
# read csv data into pyspark DataFrame
private = camber.stash.private
df = private.read_spark(
    path="data/*.csv",
    spark_session=spark,
    format="csv",
    header=True,
    inferSchema=True,
)
Args:
path : str - Path to the data file/directory
spark_session : pyspark.sql.SparkSession - Live Spark session
format : str - The format of the data; defaults to "csv"
**fmt_opts - Format options for Spark
Returns:
pyspark.sql.DataFrame - DataFrame containing the data from the stash
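Other Spark formats follow the same pattern; only the format name and its options change. A sketch reading Parquet (the "events/" directory is a hypothetical example, not part of the tree above):
# read a directory of parquet part files into a DataFrame
df_events = private.read_spark(
    path="events/",
    spark_session=spark,
    format="parquet",
)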
write_spark
This method allows users to write data from a Spark DataFrame into a Stash.
The user needs to specify the DataFrame to be written, the destination path in the stash, the data format, and any necessary Spark format options.
If single_file is set to True, the data will be written as a single file; otherwise, it will be written as multiple part files.
If overwrite_existing is set to True, existing files at the destination path will be overwritten.
For more details on the available formats and their corresponding options, please refer to the Spark documentation.
# ... following up from the `read_spark` example
# apply processing logic to `df`
# write the output in `json` form instead
private.write_spark(
    df=df,
    path="output/",
    format="json",
    single_file=True,
    overwrite_existing=True,
)
# outputs:
# output/*.json
Args:
df : pyspark.sql.DataFrame - DataFrame to be written
path : str - Destination path in the Stash
format : str - The format of the data; defaults to "csv"
single_file : bool - If set to True, data will be written as a single file
overwrite_existing : bool - If set to True, existing files at the destination will be overwritten
**fmt_opts - Format options for Spark
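The same read/write pair works against a team stash; a hedged round-trip sketch (the team name "mygroup" and the paths are illustrative assumptions):
import camber
spark = camber.spark.connect(worker_size="XSMALL")
mygroup = camber.stash.team["mygroup"]
# read shared CSV data, then write it back as a single json file
df = mygroup.read_spark(
    path="datasets/*.csv",
    spark_session=spark,
    format="csv",
    header=True,
)
mygroup.write_spark(
    df=df,
    path="exports/",
    format="json",
    single_file=True,
    overwrite_existing=True,
)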