QCFractalCompute - Compute Managers and Workers#

QCFractal executes quantum chemical calculations via compute managers deployed to the computational resources where those calculations will run. This document illustrates how to set up and run a compute manager on HPC resources.

Compute manager setup for an HPC cluster#

Note

The instructions in this section are performed on the head/login node of your HPC cluster. By the end, you will have a compute manager running on the head node, launching jobs on your behalf. If your cluster administrators forbid long-running processes on the head node, refer to Execution without interfacing with an HPC scheduler instead.

Install the base environment for the compute manager, using mamba, and activate it:

$ mamba create -n qcfractalcompute -c conda-forge qcfractalcompute
$ mamba activate qcfractalcompute

This creates the conda environment qcfractalcompute, in which the compute manager itself will run.
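
To confirm the installation, the compute manager's command-line entry point should print its usage information:

$ qcfractal-compute-manager --help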

Next, create the manager config as qcfractal-manager-config.yml, using the content below as your starting point. Fill in the components given in <brackets>. For this example, we are assuming a cluster using LSF as the scheduler; for other scheduler types, see Configuration for different HPC schedulers:

# qcfractal-manager-config.yml
---
cluster: <cluster_name>           # descriptive name to present to QCFractal server
loglevel: INFO
logfile: qcfractal-manager.log
update_frequency: 60.0

server:
  fractal_uri: <fractal_url>      # e.g. https://qcarchive.molssi.org
  username: <compute_identity>
  password: <compute_key>
  verify: True

executors:
  cpuqueue:
    type: lsf
    workers_per_node: 4           # max number of workers to spawn per node
    cores_per_worker: 16          # cores per worker
    memory_per_worker: 96         # memory per worker, in GiB
    max_nodes: 2                  # max number of nodes (scheduler jobs) to have queued/running at a time
    walltime: "4:00:00"           # walltime, given in `hours:min:seconds`, alternatively in integer minutes
    project: null
    queue: <queue_name>           # name of queue to launch to
    request_by_nodes: false
    scheduler_options:            # add additional options for submission command
      - "-R fscratch"
    queue_tags:                   # only claim tasks with these tags; '*' means all tags accepted
      - '*'
    environments:
      use_manager_environment: False   # don't use the manager environment for task execution
      conda:
        - <worker_conda_env_name>      # name of conda env used for task execution; see below for example
    worker_init:
      - source <absolute_path>/worker-init.sh   # initialization script for worker; see below for example
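
If you are unsure what to use for <queue_name>, LSF can list the queues your account may submit to; for example, from the head node:

$ bqueues -u $USER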

Create the file worker-init.sh; this is run before worker startup:

#!/bin/bash
# worker-init.sh

# Make sure to run bashrc
source $HOME/.bashrc

# Don't limit stack size
ulimit -s unlimited

# make scratch space
# CUSTOMIZE FOR YOUR CLUSTER
mkdir -p /fscratch/${USER}/${LSB_JOBID}
cd /fscratch/${USER}/${LSB_JOBID}

# Activate qcfractalcompute conda env
conda activate qcfractalcompute

Substitute the absolute path to this file for <absolute_path> in the line - source <absolute_path>/worker-init.sh in qcfractal-manager-config.yml above.
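
Before relying on worker-init.sh inside batch jobs, a quick sanity check can help; bash -n only parses the script without executing it, and the interactive bsub line (with <queue_name> and <absolute_path> as placeholders) runs it on a compute node, where the scratch paths actually exist:

$ bash -n worker-init.sh
$ bsub -Is -q <queue_name> -n 1 bash -l -c "source <absolute_path>/worker-init.sh && which python"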

We will also create a conda environment used by each worker to execute tasks, e.g. an environment suitable for using psi4, with a conda environment file such as worker-env.yml:

# worker-env.yml
---
name: qcfractal-worker-psi4-18.1
channels:
  - conda-forge/label/libint_dev
  - conda-forge
  - defaults
dependencies:
  - python =3.10
  - pip
  - qcengine
  - psi4 =1.8.1
  - dftd3-python
  - gcp-correction
  - geometric
  - scipy

  - pip:
    - basis_set_exchange

Then create the conda environment from it with mamba:

$ mamba env create -f worker-env.yml

Substitute the name of this conda env (qcfractal-worker-psi4-18.1) for <worker_conda_env_name> in qcfractal-manager-config.yml above.
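
Optionally, verify that the worker environment resolves correctly by activating it and asking QCEngine which programs it detects; psi4 should appear in the printed set:

$ mamba activate qcfractal-worker-psi4-18.1
$ python -c "import qcengine; print(qcengine.list_available_programs())"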

Finally, start up the compute manager:

$ qcfractal-compute-manager --config qcfractal-manager-config.yml

The compute manager will read its config file, communicate with the QCFractal server to claim tasks, and submit jobs to the HPC scheduler as needed to execute those tasks in the worker conda environment. If you are connected via SSH, consider running the compute manager under tmux or screen so that it keeps running after you disconnect.
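
A minimal tmux workflow, assuming tmux is available on the head node, looks like the following; detach with Ctrl-b d and the manager keeps running:

$ tmux new -s qcfractal-manager
$ mamba activate qcfractalcompute
$ qcfractal-compute-manager --config qcfractal-manager-config.yml

To check on the manager later, reattach with:

$ tmux attach -t qcfractal-manager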

Configuration for different HPC schedulers#

HPC cluster schedulers vary in behavior, so you will need to adapt your qcfractal-manager-config.yml to the scheduler of the cluster you intend to use. The configuration keys available for each executor type in the executors section are documented below.
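
As a sketch grounded in the SlurmExecutorConfig fields listed below, the LSF executor from the earlier example could be adapted for a Slurm cluster as follows. The file name slurm-executors-example.yml is only illustrative, and <partition_name> and <account_name> are placeholders; merge the resulting section into qcfractal-manager-config.yml in place of the LSF executor:

$ cat > slurm-executors-example.yml <<'EOF'
executors:
  cpuqueue:
    type: slurm
    workers_per_node: 4
    cores_per_worker: 16
    memory_per_worker: 96
    max_nodes: 2
    walltime: "4:00:00"
    partition: <partition_name>
    account: <account_name>
    queue_tags:
      - '*'
    environments:
      use_manager_environment: False
      conda:
        - <worker_conda_env_name>
    worker_init:
      - source <absolute_path>/worker-init.sh
EOF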


pydantic model SlurmExecutorConfig[source]#

Bases: ExecutorConfig

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be parsed to form a valid model.

Validators: walltime_must_be_str  »  walltime
field type: Literal['slurm'] = 'slurm'#
field walltime: str [Required]#
Validated by: walltime_must_be_str
field exclusive: bool = True#
field partition: str | None = None#
field account: str | None = None#
field workers_per_node: int [Required]#
field max_nodes: int [Required]#
field scheduler_options: List[str] = []#
validator walltime_must_be_str  »  walltime[source]#
field queue_tags: List[str] [Required]#
field worker_init: List[str] = []#
field scratch_directory: str | None = None#
field bind_address: str | None = None#
field cores_per_worker: int [Required]#
field memory_per_worker: float [Required]#
field extra_executor_options: Dict[str, Any] = {}#
field environments: PackageEnvironmentSettings = PackageEnvironmentSettings(use_manager_environment=True, conda=[], apptainer=[])#

pydantic model TorqueExecutorConfig[source]#

Bases: ExecutorConfig

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be parsed to form a valid model.

Validators: walltime_must_be_str  »  walltime
field type: Literal['torque'] = 'torque'#
field walltime: str [Required]#
Validated by: walltime_must_be_str
field account: str | None = None#
field queue: str | None = None#
field workers_per_node: int [Required]#
field max_nodes: int [Required]#
field scheduler_options: List[str] = []#
validator walltime_must_be_str  »  walltime[source]#
field queue_tags: List[str] [Required]#
field worker_init: List[str] = []#
field scratch_directory: str | None = None#
field bind_address: str | None = None#
field cores_per_worker: int [Required]#
field memory_per_worker: float [Required]#
field extra_executor_options: Dict[str, Any] = {}#
field environments: PackageEnvironmentSettings = PackageEnvironmentSettings(use_manager_environment=True, conda=[], apptainer=[])#

pydantic model LSFExecutorConfig[source]#

Bases: ExecutorConfig

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be parsed to form a valid model.

Validators: walltime_must_be_str  »  walltime
field type: Literal['lsf'] = 'lsf'#
field walltime: str [Required]#
Validated by: walltime_must_be_str
field project: str | None = None#
field queue: str | None = None#
field workers_per_node: int [Required]#
field max_nodes: int [Required]#
field request_by_nodes: bool = True#
field bsub_redirection: bool = True#
field scheduler_options: List[str] = []#
validator walltime_must_be_str  »  walltime[source]#
field queue_tags: List[str] [Required]#
field worker_init: List[str] = []#
field scratch_directory: str | None = None#
field bind_address: str | None = None#
field cores_per_worker: int [Required]#
field memory_per_worker: float [Required]#
field extra_executor_options: Dict[str, Any] = {}#
field environments: PackageEnvironmentSettings = PackageEnvironmentSettings(use_manager_environment=True, conda=[], apptainer=[])#

Execution without interfacing with an HPC scheduler#

When running with a configuration like the one above, the compute manager must remain alive on the head/login node of the cluster in order to execute tasks. If keeping a long-running process on the head node is undesirable or not permitted, consider using a local executor configuration instead, replacing the executors section in qcfractal-manager-config.yml with, e.g.:

executors:
  local_executor:
    type: local
    max_workers: 4                # max number of workers to spawn
    cores_per_worker: 16          # cores per worker
    memory_per_worker: 96         # memory per worker, in GiB
    queue_tags:
      - '*'
    environments:
      use_manager_environment: False
      conda:
        - <worker_conda_env_name>      # name of conda env used by worker; see example above
    worker_init:
      - source <absolute_path>/worker-init.sh

You will then need to create a submission script suitable for your HPC scheduler that requests the appropriate resources, activates the qcfractalcompute conda environment, and runs qcfractal-compute-manager --config qcfractal-manager-config.yml itself. You can then manually submit jobs using this script as needed to complete tasks available on the QCFractal server.
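
As a sketch of such a submission script for the LSF cluster used in the examples above (the #BSUB values are placeholders to adapt to your site; memory requests are omitted here since their units and form vary between LSF installations):

#!/bin/bash
#BSUB -J qcfractal-compute           # job name
#BSUB -q <queue_name>                # queue to submit to
#BSUB -n 64                          # total cores, e.g. max_workers * cores_per_worker
#BSUB -W 4:00                        # walltime, hours:minutes
#BSUB -o qcfractal-compute.%J.out    # scheduler output file

# Activate the environment containing the compute manager
source $HOME/.bashrc
conda activate qcfractalcompute

# The manager runs inside this batch job; with a 'local' executor, its workers
# run on the allocated resources rather than being submitted as new scheduler jobs.
qcfractal-compute-manager --config qcfractal-manager-config.yml

Submit the script with bsub (e.g. bsub < submit-manager.lsf, where the file name is your choice) whenever there are tasks waiting on the server.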

Using the local executor type is also recommended for running a compute manager on a standalone host, or within a container on e.g. a Kubernetes cluster.


pydantic model LocalExecutorConfig[source]#

Bases: ExecutorConfig

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be parsed to form a valid model.

field type: Literal['local'] = 'local'#
field max_workers: int [Required]#
field queue_tags: List[str] [Required]#
field worker_init: List[str] = []#
field scratch_directory: str | None = None#
field bind_address: str | None = None#
field cores_per_worker: int [Required]#
field memory_per_worker: float [Required]#
field extra_executor_options: Dict[str, Any] = {}#
field environments: PackageEnvironmentSettings = PackageEnvironmentSettings(use_manager_environment=True, conda=[], apptainer=[])#