14 changes: 7 additions & 7 deletions docs/00_intro/introduction.rst
Expand Up @@ -4,16 +4,16 @@ Introduction

This is benchmark documentation for a Department of Energy (DOE)
National Nuclear Security Administration (NNSA) Advanced Simulation
and Computing (ASC) **Future Computing Resource (FCR)**.
and Computing (ASC) **Advanced Technology System 6 (ATS-6)**.


Benchmark Overview
==================

Mini Applications and Microbenchmarks are features, components, performance characteristics, or other properties that are important to the Laboratories. Mini Applications are prioritized as Priority 1, or Priority 2.
Mini Applications and Microbenchmarks represent features, components, performance characteristics, or other properties that are important to the Laboratories. Mini Applications are classified as Technical Requirement 1 or Technical Requirement 2.

Priority 1 Mini Applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Technical Requirement 1 Mini Applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::

Expand Down Expand Up @@ -62,8 +62,8 @@ Priority 1 Mini Applications
- Kokkos


Priority 2 Mini Applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Technical Requirement 2 Mini Applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::

Expand Down Expand Up @@ -101,7 +101,7 @@ Priority 2 Mini Applications
- NCCL+CUDA
- NVIDIA NeMo

Please note that half of the RAJA kernels are Priority 1, and the other half are Priority 2. Similarly, 2 of the Laghos problems are Priority 1, and the third is Priority 2.
Please note that half of the RAJA kernels are Technical Requirement 1, and the other half are Technical Requirement 2. Similarly, two of the Laghos problems are Technical Requirement 1, and the third is Technical Requirement 2.


.. _GlobalRunRules:
11 changes: 6 additions & 5 deletions docs/12_laghos/laghos.rst
Expand Up @@ -19,15 +19,16 @@ Problems

The test problems are the Sedov shock (problem 1 in Laghos) in 3D.
The test problems should be run with a conforming mesh.
Linear, quadratic, and cubic orders are of interest with the following priorities:
Linear, quadratic, and cubic orders are of interest and fall into
the Technical Requirements as follows:

Priority 1 problems
^^^^^^^^^^^^^^^^^^^
Technical Requirement 1 problems
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. **3D Linear** This problem uses a kinematic order of 1, and a thermodynamic order of 0 (Q1Q0).
#. **3D Quadratic** This problem uses a kinematic order of 2, and a thermodynamic order of 1 (Q2Q1).

Priority 2 problems
^^^^^^^^^^^^^^^^^^^
Technical Requirement 2 problems
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3. **3D Cubic** This problem uses a kinematic order of 3, and a thermodynamic order of 2 (Q3Q2).


114 changes: 56 additions & 58 deletions docs/13_rajaperf/rajaperf.rst
Expand Up @@ -79,24 +79,24 @@ Problems

The RAJA Performance Suite Benchmark consists of a subset of kernels in the
full Suite that focus on some key computational patterns found in LLNL
applications. The benchmark kernels are partitioned into two priority levels as
described below, along with notable features and RAJA constructs used in each
kernel (in parentheses).
applications. The benchmark kernels are partitioned into two sets of Technical
Requirements as described below, along with notable features and RAJA constructs
used in each kernel (in parentheses).

.. note:: In the RAJA Performance Suite repository, each kernel contains a
detailed reference description near the top of the header file for
the kernel class; i.e., C++ header file named ``<kernel-name>.hpp``.
The reference description is a C-style sequential implementation of
the kernel in a comment section near the top of the file.

The RAJA Performance Suite Benchmark kernels are partitioned into two
priority levels described below.
The RAJA Performance Suite Benchmark kernels are partitioned into two sets of
Technical Requirements described below.


Priority 1 kernels
^^^^^^^^^^^^^^^^^^^
Technical Requirement 1 kernels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*Priority 1* kernels are most important to us. They are located in the
*Technical Requirement 1* kernels are most important to us. They are located in the
``RAJAPerf/src/apps`` sub-directory:

#. **DIFFUSION3DPA** element-wise action of a 3D finite element volume diffusion operator via partial assembly and sum factorization *(nested loops, GPU shared memory, RAJA::launch API)*
Expand All @@ -111,12 +111,10 @@ Priority 1 kernels
#. **VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) *(single loop, data access via indirection array, RAJA::forall API)*


Priority 2 kernels
^^^^^^^^^^^^^^^^^^^
Technical Requirement 2 kernels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*Priority 2* kernels are also important, but less so than the *Priority 1*
kernels listed above. *Priority 2* kernels are listed below and are located in
the ``RAJAPerf/src`` sub-directories noted:
*Technical Requirement 2* kernels are listed below and are located in the ``RAJAPerf/src`` sub-directories noted:

#. **apps/CONVECTION3DPA** element-wise action of a 3D finite element volume convection operator via partial assembly and sum factorization *(nested loops, GPU shared memory, RAJA::launch API)*
#. **apps/DEL_DOT_VEC_2D** divergence of a vector field at a set of points on a mesh *(single loop, data access via indirection array, RAJA::forall API)*
Expand Down Expand Up @@ -383,14 +381,14 @@ The scripts and results discussed here are located in the ``scripts/2026-FCR``
directory there.

.. important:: In the following sections, we present detailed results,
including FOM tables and throughput plots for the Priority 1
kernels described above. For completeness, we also include a
brief summary of results for Priority 2 kernels in less detail.
Data files containing results for all kernels run are included
in this repository.
including FOM tables and throughput plots for the Technical
Requirement 1 kernels described above. For completeness, we also
include a brief summary of results for Technical Requirement 2
kernels in less detail. Data files containing results for all
kernels run are included in this repository.

AMD MI300A throughput results (Priority 1 kernels)
----------------------------------------------------
AMD MI300A throughput results (Technical Requirement 1 kernels)
---------------------------------------------------------------

For the MI300A architecture, we present two sets of throughput results. One is
run in ``SPX mode`` where we use 4 MPI ranks on a node, one for each MI300A APU,
Expand All @@ -400,8 +398,8 @@ APU, and treat each APU as 6 GPUs (one GPU = 1 XCD). In each case, we run
each kernel over a sequence of problem sizes such that the saturation point is
evident on its associated throughput curve.
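
As a rough illustration of the idea of running over a sequence of problem sizes until the saturation point is evident, the following Python sketch generates geometrically spaced sizes and locates the first sample that reaches a chosen fraction of the peak throughput. The function names, the geometric spacing, and the 95% threshold are assumptions made for this example; they are not taken from the RAJA Performance Suite scripts.

```python
def problem_sizes(min_bytes, max_bytes, count):
    """Return `count` geometrically spaced sizes from min_bytes to max_bytes.

    Hypothetical helper: the benchmark scripts may choose sizes differently.
    """
    if count < 2:
        raise ValueError("count must be at least 2")
    if min_bytes <= 0 or max_bytes <= min_bytes:
        raise ValueError("require 0 < min_bytes < max_bytes")
    ratio = (max_bytes / min_bytes) ** (1.0 / (count - 1))
    return [round(min_bytes * ratio ** i) for i in range(count)]


def saturation_index(throughputs, fraction=0.95):
    """Return the index of the first sample at or above `fraction` of the peak.

    The 0.95 default is an illustrative choice, not a documented threshold.
    """
    if not throughputs:
        raise ValueError("throughputs must not be empty")
    threshold = fraction * max(throughputs)
    return next(i for i, t in enumerate(throughputs) if t >= threshold)
```

Given measured throughput values in the same order as the generated sizes, ``saturation_index`` returns the position of the smallest problem that is already close to peak throughput under these assumptions.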

SPX mode (Priority 1)
^^^^^^^^^^^^^^^^^^^^^^
SPX mode (Technical Requirement 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For SPX mode (run with 1 MPI rank per APU on a node), we choose the smallest
problem to use ~100,000 bytes of allocated memory and the largest problem
Expand All @@ -416,7 +414,7 @@ memory and the largest problem to use ~600MB memory, which is over twice as
large as the MALL.

After building the code as described in :ref:`rajaperf_build_mi300a-label`, we
run the ``Priority 1`` kernels in **SPX mode** as follows::
run the ``Technical Requirement 1`` kernels in **SPX mode** as follows::

$ pwd
path/to/RAJAPerf
Expand All @@ -440,34 +438,34 @@ directory specified via the ``--output-dir`` option above. We include
the files generated by the ``process_data.py`` script in this repo in the
directory ``./docs/13_rajaperf/baseline_data/RPBenchmark_MI300A_tier1-SPX``.

.. csv-table:: FOM results for Priority 1 kernels run on MI300A in SPX mode
.. csv-table:: FOM results for Technical Requirement 1 kernels on MI300A in SPX mode
:file: ./baseline_data/RPBenchmark_MI300A_tier1-SPX/FOM/combined_fom.csv
:align: center
:widths: auto
:header-rows: 1

SPX mode (Priority 2)
SPX mode (Technical Requirement 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The process for generating results for the Priority 2 kernels is essentially
the same as for the Priority 1 kernels just described. Note that two of the
kernels ``INDEXLIST_3LOOP`` and ``HALO_PACKING_FUSED`` do not perform any
The process for generating results for the Technical Requirement 2 kernels is
the same as for the Technical Requirement 1 kernels just described. Note that two
of the kernels ``INDEXLIST_3LOOP`` and ``HALO_PACKING_FUSED`` do not perform any
floating point operations. They represent important recurring computational
patterns in our application rather than key numerical kernels.
Thus, the two kernels have zero GFLOP/sec rates, and we consider bandwidth
to be the appropriate metric for them.
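
The choice of metric described above can be sketched as follows. This is a minimal Python example under stated assumptions: the function name, its arguments, and the decimal (1e9) unit convention are hypothetical and are not the Suite's own reporting code.

```python
def figure_of_merit(flops, bytes_moved, seconds):
    """Return (metric_name, value) for one kernel run.

    Hypothetical helper: report GFLOP/sec when the kernel performs
    floating point work, otherwise fall back to bandwidth in GB/sec.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if flops > 0:
        return ("GFLOP/sec", flops / seconds / 1e9)
    # Kernels with no floating point operations are rated by bandwidth.
    return ("GB/sec", bytes_moved / seconds / 1e9)
```

For a kernel with no floating point operations, such as the two named above, a call like ``figure_of_merit(0, bytes_moved, seconds)`` would report a bandwidth value rather than a zero GFLOP/sec rate.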

.. csv-table:: FOM results for Priority 2 kernels run on MI300A in SPX mode
.. csv-table:: FOM results for Technical Requirement 2 kernels on MI300A in SPX mode
:file: ./baseline_data/RPBenchmark_MI300A_tier2-SPX/FOM/combined_fom.csv
:align: center
:widths: auto
:header-rows: 1

The baseline data files for Priority 2 kernels run on the MI300A architecture in
The baseline data files for Technical Requirement 2 kernels on the MI300A architecture in
SPX mode are in this repo in the directory ``./docs/13_rajaperf/baseline_data/RPBenchmark_MI300A_tier2-SPX``.

CPX mode (Priority 1)
^^^^^^^^^^^^^^^^^^^^^^
CPX mode (Technical Requirement 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For CPX mode (run with 6 MPI ranks per APU on a node), we choose the
smallest problem to use ~50,000 bytes of allocated memory and the largest
Expand All @@ -480,8 +478,8 @@ For them, we chose the smallest problem to use ~1.6MB of allocated
memory and the largest problem to use ~200MB memory, which is a little less
than the MALL size.

Similar to the SPX mode description above, we run the ``Priority 1`` kernels in
**CPX mode** as follows::
Similar to the SPX mode description above, we run the ``Technical Requirement 1``
kernels in **CPX mode** as follows::

$ pwd
path/to/RAJAPerf
Expand All @@ -506,34 +504,34 @@ directory specified by via the ``--output-dir`` option above. We include
the files generated by the ``process_data.py`` script in this repo in the
directory ``./docs/13_rajaperf/baseline_data/RPBenchmark_MI300A_tier1-CPX``.

.. csv-table:: FOM results for Priority 1 kernels run on MI300A in CPX mode
.. csv-table:: FOM results for Technical Requirement 1 kernels on MI300A in CPX mode
:file: ./baseline_data/RPBenchmark_MI300A_tier1-CPX/FOM/combined_fom.csv
:align: center
:widths: auto
:header-rows: 1

CPX mode (Priority 2)
^^^^^^^^^^^^^^^^^^^^^^
CPX mode (Technical Requirement 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The process for generating results for the Priority 2 kernels is essentially
the same as for the Priority 1 kernels just described. Note that two of the
The process for generating results for the Technical Requirement 2 kernels is essentially
the same as for the Technical Requirement 1 kernels just described. Note that two of the
kernels ``INDEXLIST_3LOOP`` and ``HALO_PACKING_FUSED`` do not perform any
floating point operations. They represent important recurring computational
patterns in our application rather than key numerical kernels.
Thus, the two kernels have zero GFLOP/sec rates, and we consider bandwidth
to be the appropriate metric for them.

.. csv-table:: FOM results for Priority 2 kernels run on MI300A in CPX mode
.. csv-table:: FOM results for Technical Requirement 2 kernels run on MI300A in CPX mode
:file: ./baseline_data/RPBenchmark_MI300A_tier2-CPX/FOM/combined_fom.csv
:align: center
:widths: auto
:header-rows: 1

The baseline data files for Priority 2 kernels run on this MI300A architecture in
The baseline data files for Technical Requirement 2 kernels run on the MI300A architecture in
CPX mode are in this repo in the directory ``./docs/13_rajaperf/baseline_data/RPBenchmark_MI300A_tier2-CPX``.

AMD MI300A throughput plots (Priority 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AMD MI300A throughput plots (Technical Requirement 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following table contains throughput plots for each kernel run as described
above on the MI300A architecture in SPX mode and CPX mode. Each plot has multiple
Expand All @@ -560,7 +558,7 @@ RAJA execution policies specifically, can have a significant impact on
performance.

+-----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+
| Priority 1 Kernels: MI300A Node Throughput (SPX Mode) | Priority 1 Kernels: MI300A Node Throughput (CPX Mode) |
| Technical Requirement 1 Kernels: MI300A Node Throughput (SPX Mode) | Technical Requirement 1 Kernels: MI300A Node Throughput (CPX Mode) |
+-----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+
| | |
| .. figure:: baseline_data/RPBenchmark_MI300A_tier1-SPX/figures/Apps_DIFFUSION3DPA_flops.png | .. figure:: baseline_data/RPBenchmark_MI300A_tier1-CPX/figures/Apps_DIFFUSION3DPA_flops.png |
Expand Down Expand Up @@ -614,11 +612,11 @@ performance.
+-----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+


NVIDIA H100 throughput results (Priority 1 kernels)
----------------------------------------------------
NVIDIA H100 throughput results (Technical Requirement 1 kernels)
----------------------------------------------------------------

For the H100 architecture, we present throughput results, where we run with
4 MPI ranks on a node -- one for each H100 GPU. We run each ``Priority 1``
4 MPI ranks on a node -- one for each H100 GPU. We run each ``Technical Requirement 1``
kernel over a sequence of problem sizes such that the saturation point is
evident on its associated throughput curve.

Expand All @@ -634,7 +632,7 @@ and the largest problem to use ~300MB memory, which is about 6 times the
L2 cache size.

After building the code as described in :ref:`rajaperf_build_h100-label`, we
run the ``Priority 1`` kernels as follows::
run the ``Technical Requirement 1`` kernels as follows::

$ pwd
path/to/RAJAPerf
Expand All @@ -658,34 +656,34 @@ directory specified by via the ``--output-dir`` option above. We include
the files generated by the ``process_data.py`` script in this repo in the
directory ``./docs/13_rajaperf/baseline_data/RPBenchmark_H100_tier1``.

.. csv-table:: FOM results for Priority 1 kernels run on H100
.. csv-table:: FOM results for Technical Requirement 1 kernels run on H100
:file: ./baseline_data/RPBenchmark_H100_tier1/FOM/combined_fom.csv
:align: center
:widths: auto
:header-rows: 1

H100 (Priority 2)
^^^^^^^^^^^^^^^^^^^^^^
H100 (Technical Requirement 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The process for generating results for the Priority 2 kernels is essentially
the same as for the Priority 1 kernels just described. Note that two of the
The process for generating results for the Technical Requirement 2 kernels is essentially
the same as for the Technical Requirement 1 kernels just described. Note that two of the
kernels ``INDEXLIST_3LOOP`` and ``HALO_PACKING_FUSED`` do not perform any
floating point operations. They represent important recurring computational
patterns in our application rather than key numerical kernels.
Thus, the two kernels have zero GFLOP/sec rates, and we consider bandwidth
to be the appropriate metric for them.

.. csv-table:: FOM results for Priority 2 kernels run on H100
.. csv-table:: FOM results for Technical Requirement 2 kernels run on H100
:file: ./baseline_data/RPBenchmark_H100_tier2/FOM/combined_fom.csv
:align: center
:widths: auto
:header-rows: 1

The baseline data files for Priority 2 kernels run on the H100 architecture
The baseline data files for Technical Requirement 2 kernels run on the H100 architecture
are in this repo in the directory ``./docs/13_rajaperf/baseline_data/RPBenchmark_H100_tier2``.

NVIDIA H100 throughput plots (Priority 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NVIDIA H100 throughput plots (Technical Requirement 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following table contains throughput plots for each kernel run as described
above for the H100 architecture. Each plot has multiple curves where GFLOP/sec
Expand All @@ -710,7 +708,7 @@ These additional curves were included to show how kernel execution choices,
RAJA execution policies specifically, can have a noticeable impact on performance.

+-----------------------------------------------------------------------------------------------------+
| Priority 1 Kernels H100 Node Throughput |
| Technical Requirement 1 Kernels H100 Node Throughput |
+-----------------------------------------------------------------------------------------------------+
| |
| .. figure:: baseline_data/RPBenchmark_H100_tier1/figures/Apps_DIFFUSION3DPA_flops.png |
2 changes: 1 addition & 1 deletion docs/conf.py
Expand Up @@ -6,7 +6,7 @@
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = "FCR Benchmarks"
project = "ATS-6 Benchmarks"
copyright = "Advanced Simulation and Computing"
author = "Tri-labs"

10 changes: 4 additions & 6 deletions docs/index.rst
Expand Up @@ -9,8 +9,8 @@
70-89 :: Microbenchmarks
90-99 :: Appendices

FCR Benchmarks Project. ATTENTION: This page is a work in progress and nothing is considered to be final
========================================================================================================
ATS-6 Benchmarks. ATTENTION: This page is a work in progress and nothing is considered to be final
==========================================================================================================

.. toctree::
:maxdepth: 3
Expand All @@ -22,7 +22,7 @@ FCR Benchmarks Project. ATTENTION: This page is a work in progress and nothing i
.. toctree::
:maxdepth: 3
:numbered:
:caption: Priority 1 Mini-Applications
:caption: Technical Requirements 1

11_kripke/kripke
12_laghos/laghos
Expand All @@ -34,13 +34,12 @@ FCR Benchmarks Project. ATTENTION: This page is a work in progress and nothing i
.. toctree::
:maxdepth: 3
:numbered:
:caption: Priority 2 Mini-Applications
:caption: Technical Requirements 2

10_amg/amg
32_lammpsACE/lammpsACE
40_remhos/remhos
50_miniem/miniem
60_mlperf/mlperf

.. toctree::
:maxdepth: 3
Expand All @@ -50,7 +49,6 @@ FCR Benchmarks Project. ATTENTION: This page is a work in progress and nothing i
70_phloem/phloem
71_omb/omb
72_smb/smb
73_gpcnet/gpcnet
80_ior/ior
81_mdtest/mdtest
82_dlio/dlio