Skip to content

Latest commit

 

History

History
127 lines (99 loc) · 5.21 KB

README.md

File metadata and controls

127 lines (99 loc) · 5.21 KB

Turing AI Cloud Quick Start

Workflow Overview

Workflow

The above picture illustrates the submission and debug workflows of TACC job.

Creating a TACC account

Before using tcloud SDK, please make sure that you have applied for a TACC account and submitted your public key to TACC. You may generate SSH public key according to the steps. To apply for a TACC account, please visit our website .

Installing tcloud SDK

  • Download tcloud SDK
    Download the latest tcloud SDK from tags.
  • Install tcloud SDK
    Place setup.sh and tcloud in the same directory, and run setup.sh.

Submitting Your First TACC Job

CLI Tool Initialization

  • First, you need to configure your TACC credentials. You can do this by running the tcloud config command:
    $ tcloud config [-u/--username] MYUSERNAME
    $ tcloud config [-f/--file] MYPRIVATEFILEPATH
    
  • Then, run tcloud init command to obtain the latest cluster hardware information from TACC cluster.
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    tacc*        up   infinite      5  alloc 10-0-7-[18-19],10-0-8-[18-19]
    tacc*        up   infinite     19   idle 10-0-2-[18-19],10-0-3-[10-13]
    

Download Sample Job

You can use this link to download our example code.

Submit a Job

Each job requires a main.py with tuxiv.conf

After tcloud is configured correctly, you can try to submit your first job.

  1. Go to the example folder in your terminal.
  2. Run tcloud submit command.
    ~/Dow/quickstart-master/example/helloworld ❯ tcloud submit
    Start parsing tuxiv.conf...
    building file list ...
    8 files to consider
    helloworld/
    helloworld/run.sh
            151 100%    0.00kB/s    0:00:00 (xfer#1, to-check=5/8)
    helloworld/configurations/
    helloworld/configurations/citynet.sh
              12 100%   11.72kB/s    0:00:00 (xfer#2, to-check=2/8)
    helloworld/configurations/conda.yaml
            107 100%  104.49kB/s    0:00:00 (xfer#3, to-check=1/8)
    helloworld/configurations/run.slurm
            278 100%  271.48kB/s    0:00:00 (xfer#4, to-check=0/8)
    
    sent 429 bytes  received 144 bytes  382.00 bytes/sec
    total size is 1071  speedup is 1.87
    Submitted batch job 2000
    Job helloworld submitted.
    

Retrive Your Job Status and Output

In this section, we provide two methods to monitor the job log.

After training, you can use tcloud ls [filepath] to find the output files

  • cat

    You can configure your log path in the tuxiv.conf. The default path is slurm_log/slurm-jobid.out.

    tcloud cat slurm_log/slurm-jobid.out
    

    In the helloworld example, the tuxiv.conf file specifies the log path as slurm_log/hello.log

  • download

    You can use tcloud download [filepath].

    Note that you can only read and download files in USERDIR, and the files in WORKDIR may be removed after the job is finished.

    tcloud download slurm_log/slurm-jobid.out
    

Manage your environment

tcloud uses Conda to manage your dependencies. All dependencies will be installed through conda. Please specify the required conda channel to meet the installation requirements. In tcloud, we offer two ways of environment management:

  1. One-off Environment. A new environment with different dependencies will be created every time you submit a task to TACC. If you do not specify an environment name and your dependencies configuration does not change between two consecutive submissions in tuxiv.conf, we will reuse the previous environment to save time. This is the default behavior.
    environment:
      # name:       # do not specify environment name
      dependencies:
          - pytorch=1.6.0
          - torchvision=0.7.0
      channels: pytorch
  2. Persistent Environment. You can create a dedicated environment for each project. It needs to set a different environment name in tuxiv.conf for each project. When you change your dependencies configuration with an exist environment, tcloud will update this environment in stead of creating a new one. Learn how to do this in tuxiv.conf documentation environment part.
    environment:
      name: torch-env   # dedicated environment name
      dependencies:
          - pytorch=1.6.0
          - torchvision=0.7.0
      channels: pytorch

Demo video

The following videos will help you use tcloud CLI to begin your TACC journey: demo video.

Examples

Basic examples are provided under the example folder. These examples include: HelloWorld, TensorFlow, PyTorch and MXNet.

FAQ

FAQ