camber.stash

The Stash class offers a quasi-filesystem for interacting with your cloud storage. Each user has a private personal stash, one or more shared team stashes (depending on team memberships in the Camber platform), and a public stash. Users have read/write access to their personal stashes by default; the public stash is read-only.

import camber.stash

prv_stash = camber.stash.private
pub_stash = camber.stash.public

Attributes

When you import camber.stash, you can access three types of stashes:

private
The stash that only you can access.
team[<TEAM>]
The stash shared by the members of <TEAM>. (Not available if you are not in any team.)
public
The stash of datasets that all Camber users can read.

As documented in the next section, all stash types share the same methods for manipulating and moving files.

Each Stash object, when initialized, defaults to the root level, referred to as ~ or the HOME directory. On the private stash, this corresponds to the root of the user’s notebook filesystem. On other stashes, it is object storage presented with the structure of a UNIX filesystem.

Methods

For the following documentation, imagine we have the following directory structure in a given Stash:

    • data/
      • dataset1.csv
      • dataset2.csv
      • images/
        • image1.jpg
        • image2.png
        • image3.gif
    • docs/
      • README.md
      • reference.md
    • src/
      • main.py
      • utils.py
      • tests/
        • test_main.py
        • test_utils.py

    cd
    This is equivalent to cd in shell: it changes the current working directory to the provided path argument.

    import camber
    
    stash = camber.stash.private
    stash
    # prints <PrivateStash(.)>
    
    # set current directory to be ./docs
    stash.cd("docs")
    
    # set current directory back to home (~)
    stash.cd("~")
    
    # chain cd commands to cd into ./data/images
    stash.cd("data").cd("images")
    
    # change directory using an absolute path
    stash.cd("~/src")

    Args

    path: str
    Path to which the user wants to set their current directory.

    Returns

    self
    Original Stash with the current directory changed.
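    Because cd returns the stash itself, calls can be chained, as in stash.cd("data").cd("images"). A minimal sketch of this return-self pattern, using a hypothetical ToyStash class rather than the real implementation:

```python
from posixpath import join, normpath

class ToyStash:
    """Toy illustration of cd returning self to enable chaining."""
    def __init__(self):
        self.cwd = "~"

    def cd(self, path):
        if path.startswith("~"):
            self.cwd = path                            # absolute path from home
        else:
            self.cwd = normpath(join(self.cwd, path))  # relative path
        return self                                    # returning self enables chaining

stash = ToyStash()
stash.cd("data").cd("images")
print(stash.cwd)  # ~/data/images
```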

    ls

    This is equivalent to ls in shell. It lists everything at the provided path, which defaults to the current directory. Since cd changes the current directory, it affects what ls shows, just like on a real filesystem. Directories in the results are denoted with a trailing / at the end of their names. Basic wildcards * and ? are also supported in this method, and a max_results parameter caps the number of entries returned.

    import camber
    stash = camber.stash.private
    
    # list everything in the data directory
    stash.ls("data")
    # prints:
    # [
    #   'dataset1.csv',
    #   'dataset2.csv',
    #   'images/',
    # ]
    
    # using wildcards
    stash.ls("data/images/*.png")
    # prints:
    # ['image2.png']
    
    # combined use with `cd`
    stash.cd("docs").ls()
    # prints:
    # [
    #   'README.md',
    #   'reference.md',
    # ]

    Args

    path: str
    Path to which the user wants to list files/directories.
    Defaults to the current directory (`.`).
    max_results: int
    Maximum number of results to return.
    Defaults to 100.
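    The * and ? wildcards follow standard shell globbing rules: * matches any run of characters, ? matches exactly one. Python’s stdlib fnmatch module implements the same rules, so it can be used to preview which names a pattern would select (the real Stash backend’s matching may differ in edge cases; this is only an illustration):

```python
from fnmatch import fnmatch

names = ["image1.jpg", "image2.png", "image3.gif"]

# '*' matches any run of characters
print([n for n in names if fnmatch(n, "*.png")])       # ['image2.png']

# '?' matches exactly one character
print([n for n in names if fnmatch(n, "image?.jpg")])  # ['image1.jpg']
```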

    rm

    This is equivalent to rm in shell. It removes the file or directory at the designated path and also supports the basic wildcards * and ?. The recursive flag must be set to True if the target is a directory or if wildcards are used; otherwise the call fails.

    import camber
    stash = camber.stash.private
    
    # remove file in the docs directory
    stash.rm("docs/README.md")
    
    # remove the 'tests' directory in 'src', must use `recursive` or the op fails
    stash.rm("src/tests", recursive=True)
    
    # remove all 'png' files in data/images, `recursive` must be set to True
    stash.rm("data/images/*.png", recursive=True)

    Args

    path: str
    Path of the file/directory the user wants to remove
    recursive: bool
    Whether to delete recursively, required for deleting directories
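    The recursive guard mirrors the rm vs. rm -r distinction in a shell. A sketch of how such a check might look (hypothetical, not the actual implementation):

```python
def check_rm(path, is_dir, recursive=False):
    """Refuse directory or wildcard deletions unless recursive=True."""
    has_wildcard = any(c in path for c in "*?")
    if (is_dir or has_wildcard) and not recursive:
        raise ValueError("recursive=True is required for directories or wildcards")
    return True

check_rm("docs/README.md", is_dir=False)            # plain file: allowed
check_rm("src/tests", is_dir=True, recursive=True)  # directory: needs recursive
```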

    cp

    This is equivalent to cp in shell. It copies a file or directory from src_path to dest_path. If the source and destination are in different stashes, you must supply a dest_stash object. If the source is a directory, the recursive flag must be set to True. You can also use the >>/<< shortcuts to copy a stash’s current directory recursively into another stash; the arrows always point to the receiving stash. Basic wildcards * and ? are supported in this method.

    import camber
    private = camber.stash.private
    mygroup = camber.stash.team["mygroup"]
    
    private
    # prints <PrivateStash(.)>
    mygroup
    # prints <TeamStash[mygroup](.)>
    
    # copies ./docs/README.md to ./README.md in the private stash
    private.cp(src_path="docs/README.md", dest_path=".")
    
    # copies everything in ./docs to ./docs-deprecated recursively
    private.cp(src_path="docs", dest_path="docs-deprecated", recursive=True)
    
    # copies the ./datasets directory from team stash to ./datasets in private stash
    mygroup.cp(
    		dest_stash=private,
    		src_path="datasets",
    		dest_path="datasets",
    		recursive=True
    )
    
    # syntactic sugar, the logic above could also be written as
    mygroup.cd("~/datasets") >> private.cd("~/datasets")

    Args

    dest_stash: Stash
    Stash where the file/directory will be copied to, defaults to self
    src_path: str
    Path of the source file/directory to be copied
    dest_path: str
    Destination path where the file/directory will be copied
    recursive: bool
    Whether to copy recursively, required for copying directories
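    The >>/<< arrow shortcuts are plain Python operator overloading: `a >> b` invokes `a.__rshift__(b)`, so both arrows can delegate to cp with the arrow always pointing at the receiving stash. A toy sketch of the idea (hypothetical, not the real implementation):

```python
class ToyStash:
    """Toy stash that records copies to demonstrate arrow direction."""
    def __init__(self, name):
        self.name = name
        self.copied = []  # record of (src, dest) pairs for the demo

    def cp(self, dest_stash, src_path=".", dest_path=".", recursive=False):
        dest_stash.copied.append((self.name, dest_stash.name))
        return dest_stash

    def __rshift__(self, other):  # self >> other: copy self -> other
        return self.cp(dest_stash=other, recursive=True)

    def __lshift__(self, other):  # self << other: copy other -> self
        return other.cp(dest_stash=self, recursive=True)

team, private = ToyStash("team"), ToyStash("private")
team >> private        # team's cwd copied into private
private << team        # same direction, written the other way round
print(private.copied)  # [('team', 'private'), ('team', 'private')]
```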

    read_spark

    This allows users to read data from a Stash into a Spark DataFrame. The user needs to supply the path to the data in the stash, a live Spark session, the format of the data, and any necessary Spark format options. For more information on the available formats and their corresponding options, please refer to the Spark documentation.

    import camber
    
    spark = camber.spark.connect(worker_size="XSMALL")
    # wait for spark to start up
    
    # read csv data into pyspark DataFrame
    private = camber.stash.private
    df = private.read_spark(
    		path="data/*.csv",
    		spark_session=spark,
    		format="csv",
    		header=True,
    		inferSchema=True
    )

    Args

    path: str
    Path to data file/directory
    spark_session: pyspark.sql.SparkSession
    Live Spark session
    format: str
    The format of the data, defaults to "csv"
    **fmt_opts
    Format options for Spark

    Returns

    pyspark.sql.DataFrame
    DataFrame containing the data from the stash

    write_spark

    This method allows users to write data from a Spark DataFrame into a Stash. The user needs to specify the DataFrame to be written, the destination path in the stash, the data format, and any necessary Spark format options. If single_file is set to True, the data will be written as a single file; otherwise, it will be written as multiple part files. If overwrite_existing is set to True, existing files at the destination path will be overwritten. For more details on the available formats and their corresponding options, please refer to the Spark documentation.

    # ... follow up from `read_spark` example
    # apply processing logic to `df`
    
    # write output to `json` form instead
    private.write_spark(
    		df=df,
    		path="output/",
    		format="json",
    		single_file=True,
    		overwrite_existing=True
    )
    # outputs:
    # output/*.json

    Args

    df: pyspark.sql.DataFrame
    DataFrame to be written
    path: str
    Destination path in the Stash
    format: str
    The format of the data, defaults to "csv"
    single_file: bool
    If set to True, data will be written as a single file
    overwrite_existing: bool
    If set to True, existing files at the destination will be overwritten
    **fmt_opts
    Format options for Spark