36 changes: 30 additions & 6 deletions specification.md
@@ -75,7 +75,6 @@
## STATUS: EARLY DRAFT



## Introduction

The purpose of this document is to provide an analysis of the design and
@@ -87,11 +86,36 @@ corresponding implementation of a job management API is a job management
library. A job management library, through its API, is invoked by a
client application.

Traditionally, job management is implemented on supercomputers by Local
Resource Managers (LRMs), such as PBS/Torque, SLURM, etc. To a first
approximation, a job management API is understood as an abstraction layer
on top of various LRMs.

Traditionally, job management is implemented on supercomputers by Local Resource
Managers (LRMs), such as PBS/Torque, SLURM, etc. To a first approximation, a job
management API is understood as an abstraction layer on top of various LRMs.
Consequently, the scope of the present API is informed by functionality commonly
found across LRMs.

The main motivation behind the present job management API is the ubiquity with
which projects meant to simplify the process of doing science on compute
clusters are forced to implement their own solution. The reason is simple. If
any tool needs to be portable across multiple clusters (and it rarely makes
sense to not want such portability), it must access the underlying cluster LRM
in an abstract way; that is, it must use a job management API. The only stable
job management API currently available is
[SAGA](http://radical-cybertools.github.io/) [so, wait a minute, how do we
justify not pushing SAGA forward?].
Collaborator

I think I will need to leave it to others to clarify this. One argument I heard on the call (not one I would make, just repeating it) is that SAGA, or at least the name, comes with certain connotations / baggage, and that would at least partially justify distinguishing an Exaworks job API from it.

BTW: DRMAA also is an existing, stable API -- just not a widely implemented one (anymore). As such, we should also motivate why we do not implement DRMAA instead of inventing our own API.


We aim to provide a minimal API. That is, the API focuses on managing
independent jobs and not much more. Functionality such as expressing and
enforcing job dependencies, providing a uniform view of software environments
deployed on target clusters/resources, or providing an information service
describing characteristics of the target cluster/resource are beyond the scope
of this API. This is motivated, in part, by the fact that such functionality
would push the complexity of the API into unmanageable territory while
simultaneously being better suited for separate components.

We take inspiration from a number of projects, some defunct, with overlapping
scope, such as [Globus
GRAM](https://en.wikipedia.org/wiki/Grid_resource_allocation_manager),
[SAGA](http://radical-cybertools.github.io/), and
[DRMAA](http://www.drmaa.org/).
Collaborator

I would suggest adding DRMAA, but also Flux, here. We could also add POSIX FWIW (qsub etc.): even though POSIX does not define an LRM API, it does define the command-line tools that are widely considered the standard interface in many LRM implementations.


### A Note About Code Samples
