
Running CaffeOnSpark on EC2
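
The commands in this guide assume a few environment variables that you must set on your local machine before launching. A minimal sketch (all values below are placeholders; point them at your own key pair, PEM file, Spark installation, and CaffeOnSpark checkout):

export EC2_KEY=my-key-pair                     # name of your EC2 key pair (placeholder)
export EC2_PEM_FILE=~/.ssh/my-key-pair.pem     # matching private key file (placeholder)
export SPARK_HOME=~/spark-1.6.0                # local Spark 1.6 install; provides the ec2/spark-ec2 script
export CAFFE_ON_SPARK=~/CaffeOnSpark           # local CaffeOnSpark checkout; provides scripts/ec2-cloud-config.txt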

  1. Set up your EC2 key pair, and use the spark-ec2 script from Apache Spark as shown below to launch a Spark cluster with 2 slaves on g2.2xlarge (1 GPU, 8 vCPUs) or g2.8xlarge (4 GPUs, 32 vCPUs) instances, using a CaffeOnSpark AMI. You can check your request status in the EC2 console.
export AMI_IMAGE=ami-198b396a
export SPARK_WORKER_INSTANCES=2 
${SPARK_HOME}/ec2/spark-ec2 --key-pair=${EC2_KEY} --identity-file=${EC2_PEM_FILE} \
			    --region=eu-west-1 --zone=eu-west-1c \
			    --ebs-vol-size=50 \
			    --instance-type=g2.2xlarge \
			    --master-instance-type=m4.xlarge \
			    --ami=${AMI_IMAGE}  -s ${SPARK_WORKER_INSTANCES} \
			    --spot-price 0.5 \
			    --copy-aws-credentials \
			    --hadoop-major-version=yarn --spark-version 1.6.0 \
			    --no-ganglia \
			    --user-data ${CAFFE_ON_SPARK}/scripts/ec2-cloud-config.txt \
			    launch CaffeOnSparkDemo

You should see the following line, which contains the host name of your Spark master.

Spark standalone cluster started at http://ec2-52-49-81-151.eu-west-1.compute.amazonaws.com:8080
Done!
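
Before moving on, you can optionally confirm that both workers registered with the master. One quick check, assuming the standalone master's default web UI port (8080) and its JSON status endpoint; inspect the workers array in the response:

curl -s http://<SPARK_MASTER_HOST>:8080/json/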
  2. SSH onto the Spark master
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${EC2_PEM_FILE} root@<SPARK_MASTER_HOST>
  3. Train a DNN model, and test it using the MNIST dataset located at ${CAFFE_ON_SPARK}/data
source ~/.bashrc
export PATH=${PATH}:${HADOOP_HOME}/bin:${SPARK_HOME}/bin
export SPARK_WORKER_INSTANCES=2 
# 8 cores in g2.2xlarge, 32 cores in g2.8xlarge
export CORES_PER_WORKER=8
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
pushd ${CAFFE_ON_SPARK}/data

hadoop fs -rm -r -f /mnist_lenet.model 
spark-submit --master spark://$(hostname):7077 \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -conf lenet_memory_solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model /mnist_lenet.model
hadoop fs -ls /mnist_lenet*
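
If you want to inspect the trained model outside HDFS, you can copy it to the local filesystem with a standard HDFS command (nothing CaffeOnSpark-specific here):

hadoop fs -get /mnist_lenet.model ./mnist_lenet.model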

hadoop fs -rm -r -f /lenet_features_result
spark-submit --master spark://$(hostname):7077 \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -features accuracy,loss \
        -label label \
        -conf lenet_memory_solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model /mnist_lenet.model \
        -output /lenet_features_result
hadoop fs -cat /lenet_features_result/*
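
The part files typically hold one JSON record per test sample (a SampleID plus the requested features). Assuming that default JSON layout, you can also load the result as a DataFrame by piping a one-liner into spark-shell; a sketch:

# sqlContext is predefined in the Spark 1.6 shell; the bare path resolves
# against HDFS only if the cluster's Hadoop configuration is picked up
echo 'sqlContext.read.json("/lenet_features_result").show(5)' | spark-shell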
  4. Destroy the EC2 cluster
${SPARK_HOME}/ec2/spark-ec2 --key-pair=${EC2_KEY} --identity-file=${EC2_PEM_FILE} \
			    --region=eu-west-1 --zone=eu-west-1c \
			    destroy CaffeOnSparkDemo
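
Since the workers were launched at a spot price, it is worth double-checking that no spot requests remain open once the destroy completes. One way to check, assuming the AWS CLI is installed and configured:

aws ec2 describe-spot-instance-requests --region eu-west-1 \
    --filters Name=state,Values=open,active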