This is a collection of scripts for managing ATLAS analysis clusters on Google Compute Engine (GCE). They are a complement to the atlasgce Puppet modules (atlasgce-modules).
The collection consists of shell scripts and Puppet templates that handle two types of tasks: one set is used to control the cluster, and the other is transferred to the cluster machines during startup, where it is executed or applied (the contextualization). Each part is described in detail below.
When a machine is started, it must be configured to become a node in an analysis cluster. This configuration is done during or shortly after the machine boots and is called contextualization. The contextualization can be loosely separated into two parts: bootstrapping, and configuration done by Puppet.
The contextualization consists of several elements, and is slightly different for the manager role, the worker role, and the Cloud Scheduler worker role. The roles are described here.
See What is Puppet?
Bootstrapping is the process of launching software for machine self-configuration during startup. The bootstrapping procedure consists of the following steps (sketched in the example after the list):
- Install Puppet
- Install other required software (e.g. Git)
- Configure and mount attached storage
- Download Puppet modules
- Apply a Puppet manifest for the selected role
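The details live in the role-specific scripts, but a minimal sketch of these steps for a CentOS-style image could look like the following. The package names are assumptions about the image, while the /var/run paths match the remote file names listed in the tables further down.

```bash
#!/bin/bash
# Sketch of the bootstrapping steps; not the actual bootstrap.sh.
yum -y install puppet git                 # 1-2: install Puppet and other software (assumes suitable repositories)
bash /var/run/mount.sh                    # 3: configure and mount attached storage
bash /var/run/modules.sh                  # 4: download the Puppet modules
puppet apply /var/run/node-template.pp    # 5: apply the Puppet manifest for the selected role
```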
For the manager and worker roles the bootstrapping procedure is run by the bootstrap.sh shell script, which is provided to the machine as a startup script. It is stored locally as /var/run/google.startup.script.
For the Cloud Scheduler worker node, cloudscheduler/bootstrap.sh is used for bootstrapping, and it is provided to the machine through the user-data metadata attribute and the context and contexthelper scripts. Note: The bootstrapping procedure for Cloud Scheduler worker nodes requires machine images prepared with cloudscheduler/setup.sh.
The manager role is contextualized with the following files (an illustrative way to attach them by hand follows the table):
| Metadata attribute | Local file | Remote file | Description |
| --- | --- | --- | --- |
| startup-script | bootstrap.sh | /var/run/google.startup.script | Script that runs the bootstrapping procedure |
| mount-script | mount-head.sh | /var/run/mount.sh | Script that configures and mounts extra disk space |
| module-script | modules.sh | /var/run/modules.sh | Script to download Puppet modules |
| node-template | gce_node_head.pp | /var/run/node-template.pp | Puppet manifest containing the machine configuration |
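As an illustration of how these attributes could be attached to an instance by hand, something along the following lines should work with the gcutil tool used elsewhere in this document. The instance name, zone, machine type, and image are the documented defaults; the exact --metadata_from_file flag spelling and repetition are assumptions about the gcutil version in use.

```bash
# Illustrative only; not what start-cluster.sh literally runs.
gcutil addinstance head \
  --zone=europe-west1-b --machine_type=n1-standard-1-d --image=centos-6 \
  --metadata_from_file=startup-script:bootstrap.sh \
  --metadata_from_file=mount-script:mount-head.sh \
  --metadata_from_file=module-script:modules.sh \
  --metadata_from_file=node-template:gce_node_head.pp
```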
The worker role is contextualized with the following files
| Metadata attribute | Local file | Remote file | Description |
| --- | --- | --- | --- |
| startup-script | bootstrap.sh | /var/run/google.startup.script | Script that runs the bootstrapping procedure |
| mount-script | mount-worker.sh | /var/run/mount.sh | Script that configures and mounts extra disk space |
| module-script | modules.sh | /var/run/modules.sh | Script to download Puppet modules |
| node-template | gce_node_worker.pp | /var/run/node-template.pp | Puppet manifest containing the machine configuration |
The Cloud Scheduler worker role requires a machine image prepared with the following files (see cloudscheduler)
| File | Description |
| --- | --- |
| /etc/init.d/context | Run during boot to execute the context-helper script |
| /usr/local/bin/context-helper | Script that downloads [Nimbus](http://www.nimbusproject.org/) context data from the user-data metadata attribute |
and is contextualized with these files
| Local file | Description |
| --- | --- |
| cloudscheduler/bootstrap.sh | Script that runs the bootstrapping procedure |
| embedded mount script | Script that configures and mounts extra disk space |
| embedded node template | Puppet manifest containing the machine configuration |
Clusters consisting of one manager node (head) and one or more worker nodes (node) are controlled by the four cluster control scripts below. Individual test nodes (node, csnode, bare) can be created with the start-test-node.sh script.
Note: Cloud Scheduler worker nodes (csnode) are treated differently from manager (head) and worker (node) nodes, inasmuch as they are started by Cloud Scheduler and not by the cluster control scripts.
Usage: start-cluster.sh [options]
-h Print this text and exit
-n N Use N worker nodes. Default: 4.
-p PROJECT Use GCE project PROJECT. Default: atlasgce.
-z ZONE Add instances to ZONE. Default: europe-west1-b.
-m MACHINE Add instances of type MACHINE. Default: n1-standard-1-d.
-i IMAGE Add instances of image type IMAGE. Default: centos-6.
The start-cluster.sh script creates a head node and N worker nodes. To the head node it attaches gce_node_head.pp and mount-head.sh as metadata, and to each worker node gce_node_worker.pp and mount-worker.sh. The worker nodes are created in parallel.
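The parallelism can be pictured as a simple shell pattern. In the sketch below, create_worker is a hypothetical helper standing in for the actual instance-creation call with the worker metadata attached, and the node names follow an assumed node-&lt;i&gt; scheme.

```bash
# Sketch only; not the actual start-cluster.sh.
for i in $(seq 1 "$N"); do
  create_worker "node-$i" &    # start each instance creation in the background
done
wait                           # continue only when all worker creations have finished
```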
Usage: stop-cluster.sh [options]
-h Print this text and exit
-n N Use N worker nodes. Default: 4.
-p PROJECT Use GCE project PROJECT. Default: atlasgce.
-z ZONE Add instances to ZONE. Default: europe-west1-b.
The stop-cluster.sh script deletes the head node and N worker nodes. N should be set to the number of worker nodes in the cluster.
Usage: update-cluster.sh [options]
-h Print this text and exit
-n N Use N worker nodes. Default: 4.
-p PROJECT Use GCE project PROJECT. Default: atlasgce.
The update-cluster.sh script updates the Puppet module repository (cd /etc/puppet/modules; sudo git pull origin master) and then reapplies the node template attached during startup (sudo puppet apply /var/run/node-template.pp) on each node in the cluster.
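Doing the same thing by hand on a single node amounts to the two commands the script wraps, here run through gcutil ssh; the instance name node-1 is an assumption about the naming scheme.

```bash
gcutil ssh node-1 "cd /etc/puppet/modules && sudo git pull origin master"   # update the Puppet modules
gcutil ssh node-1 "sudo puppet apply /var/run/node-template.pp"             # reapply the node template
```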
Usage: run-cluster-command.sh [options] COMMAND
-h Print this text and exit
-n N Use N worker nodes. Default: 4.
-p PROJECT Use GCE project PROJECT. Default: atlasgce.
-v Verbose output
The run-cluster-command.sh script connects to each node in the cluster and runs COMMAND via gcutil ssh <node> "COMMAND".
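Internally this amounts to a loop over the node names. The sketch below is not the actual script; it assumes a head plus node-&lt;i&gt; naming scheme.

```bash
# Sketch of the per-node loop; COMMAND is the first argument to the script.
COMMAND="$1"
for node in head $(seq -f "node-%g" 1 "$N"); do
  echo "=== $node ==="
  gcutil ssh "$node" "$COMMAND"
done
```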
Usage: start-test-node.sh [options]
-h Print this text and exit
-p PROJECT Use GCE project PROJECT. Default: atlasgce.
-z ZONE Add instances to ZONE. Default: europe-west1-b.
-m MACHINE Add instances of type MACHINE. Default: n1-standard-1-d.
-i IMAGE Add instances of image type IMAGE. Default: centos-6.
-a NAME Name the test instance NAME. Default: test.
-b Bare instance without any contextualization.
-c Instance with Cloud Scheduler contextualization.
The start-test-node.sh script is a helper script for starting a worker node suitable for testing the contextualization procedure. Without any options it creates a worker node that has gone through parts of the bootstrapping procedure (bootstrap.sh and mount-worker.sh are executed on the node), but on which the Puppet contextualization has not taken place.
With the -c option a node is created as if it had been started by Cloud Scheduler: cloudscheduler/bootstrap.sh is sent in the user-data metadata attribute, which, on a machine image prepared for Cloud Scheduler, performs both bootstrapping and Puppet contextualization.
With the -b option a bare node suitable for manually testing the contextualization is created. By uploading bootstrapping scripts and Puppet modules and templates, the whole contextualization procedure can be mimicked, as sketched below.
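One way to mimic it by hand, using only the gcutil ssh form shown above, could look like the following. The module repository URL is a placeholder, "test" is the default instance name, and the role manifest is assumed to have been copied to /var/run/node-template.pp beforehand.

```bash
# Manual contextualization sketch for the default "test" instance.
gcutil ssh test "sudo yum -y install puppet git"
gcutil ssh test "sudo git clone https://example.com/atlasgce-modules.git /etc/puppet/modules"  # placeholder URL
# Copy a role manifest (e.g. gce_node_worker.pp) to /var/run/node-template.pp
# on the node by your preferred means, then apply it:
gcutil ssh test "sudo puppet apply /var/run/node-template.pp"
```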
This part describes how to configure and run an analysis cluster on GCE.
The most important files of the cluster configuration are the gce_node_head.pp and gce_node_worker.pp Puppet manifests, which contain the Puppet configuration for the machines. The configuration is realized with an instance of the gce_node class, which has the following parameters (an illustrative declaration follows the table):
| Parameter | Default | Description |
| --- | --- | --- |
| head | | Address of the manager node of the cluster |
| role | | Role of the machine (head for the manager node, node for the worker nodes) |
| use_cvmfs | true | Flag to indicate use of CernVM-FS |
| condor_pool_password | undef | The Condor pool password (optional) |
| condor_use_gsi | false | Flag to indicate the use of GSI security for Condor. Certificates must be provided through other means. |
| condor_slots | | Number of Condor execution slots per node (≤ #CPUs) |
| use_xrootd | true | Flag to indicate use of XRootD for file transfers and caching |
| xrootd_global_redirector | undef | Global XRootD redirector to access external data. Must be provided if use_xrootd is true |
| use_apf | true | Flag to indicate use of AutoPyFactory to create Panda pilots |
| panda_site | undef | Name of the Panda site as given in AGIS. Must be set if use_apf is true |
| panda_queue | undef | Name of the Panda queue as given in AGIS. Must be set if use_apf is true |
| panda_cloud | undef | Name of the Panda cloud as given in AGIS. Must be set if use_apf is true |
| panda_administrator_email | undef | Email address of the cluster administrator. Must be set if use_apf is true |
| atlas_site | undef | Value to assign to the ATLAS_SITE environment variable (optional) |
| debug | false | Flag to turn on or off debug or trace logging for the services |
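For orientation, a gce_node declaration for a worker node might look roughly like the following Puppet sketch; every value below is a placeholder, not a working site configuration.

```puppet
# Illustrative values only; replace with real head address, site, queue,
# cloud, and redirector names.
class { 'gce_node':
  head                      => 'head',                    # address of the manager node
  role                      => 'node',                    # worker role
  use_cvmfs                 => true,
  condor_slots              => 1,                         # at most the number of CPUs
  use_xrootd                => true,
  xrootd_global_redirector  => 'redirector.example.org',
  use_apf                   => true,
  panda_site                => 'EXAMPLE_SITE',
  panda_queue               => 'EXAMPLE_QUEUE',
  panda_cloud               => 'EXAMPLE',
  panda_administrator_email => 'admin@example.org',
  debug                     => false,
}
```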
These files also specify any mount points created in mount-head.sh and mount-worker.sh. To change the disk configuration of the manager node, both gce_node_head.pp and mount-head.sh must be edited, and correspondingly gce_node_worker.pp and mount-worker.sh for the worker nodes.
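For reference, a mount script along these lines typically does little more than format and mount the extra disk. In the sketch below the device path and mount point are assumptions and must agree with the mount points declared in the corresponding Puppet manifest.

```bash
# Sketch of a mount script; device name and mount point are assumed.
DEVICE=/dev/disk/by-id/google-ephemeral-disk-0   # assumed extra-disk device
MOUNTPOINT=/data                                  # assumed mount point
mkfs.ext4 -F "$DEVICE"        # format the disk (destroys any existing data)
mkdir -p "$MOUNTPOINT"
mount "$DEVICE" "$MOUNTPOINT"
```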
The location of the repository for the atlasgce-modules can be changed in modules.sh. Note: The update-cluster.sh script depends on being able to git pull from the master branch of the remote; if the retrieval method is changed from Git, the update command must be changed accordingly.
Finally, it might be necessary to modify the bootstrapping procedure to account for eventualities not covered by the scripts and manifests above. As a last resort, modifications and additions can be made directly to the bootstrap.sh script.
Once the cluster has been configured, it can be started and stopped with the start-cluster.sh and stop-cluster.sh scripts respectively. These scripts require some parameters, which can be given directly on the command line or configured in the file defaults.sh (a sketch of which follows the examples below), with parameters given on the command line overriding the defaults. For example, this command starts a cluster with 8 nodes, using the machine image my-special-image and default values for the rest of the options:
start-cluster.sh -n 8 -i my-special-image
and correspondingly to stop the cluster
stop-cluster.sh -n 8
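The variable names in defaults.sh are not documented here; a plausible sketch, assuming names that mirror the command-line options and the documented default values, could be:

```bash
# Hypothetical defaults.sh; the actual variable names may differ.
N=4                      # -n: number of worker nodes
PROJECT=atlasgce         # -p: GCE project
ZONE=europe-west1-b      # -z: zone
MACHINE=n1-standard-1-d  # -m: machine type
IMAGE=centos-6           # -i: image
```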
If something goes wrong with the contextualization, such as an error when applying the Puppet configuration or one or more services being configured incorrectly, there are several ways to debug the cluster.
By logging into the manager or one of the worker nodes, log files can be examined. The output from the bootstrap procedure, including the application of the Puppet configuration, can be found in /var/log/startupscript.log. Log files for the different services can be found in /var/log/cvmfs, /var/log/xrootd, /var/log/condor, and /var/log/apf for CernVM-FS, XRootD, Condor, and AutoPyFactory respectively. Note that to log enough information to debug the services it might be necessary to turn on debugging in the gce_node_head.pp and gce_node_worker.pp Puppet manifests.
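A quick way to pull the relevant logs from a single node without logging in interactively is again gcutil ssh; node-1 is an assumed instance name.

```bash
gcutil ssh node-1 "sudo tail -n 100 /var/log/startupscript.log"                           # bootstrap and Puppet output
gcutil ssh node-1 "sudo ls /var/log/cvmfs /var/log/xrootd /var/log/condor /var/log/apf"   # per-service log files
```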
It is possible to use run-cluster-command.sh -v to sequentially collect information about each node in the cluster. For instance, to probe the CernVM-FS repositories on each node, simply run
run-cluster-command.sh -v 'cvmfs_config probe'
and to find the phrase all.manager in the Cluster Management Service (cmsd) log file, just do
run-cluster-command.sh -v 'grep -F "all.manager" /var/log/xrootd/cmsd.log'