12.1 AMD Radeon (GCN)
On the hardware side, there is the hierarchy (fine to coarse):
- work item (thread)
- wavefront
- work group
- compute unit (CU)
All OpenMP and OpenACC levels are used, i.e.
- OpenMP’s simd and OpenACC’s vector map to work items (threads)
- OpenMP’s threads (“parallel”) and OpenACC’s workers map
to wavefronts
- OpenMP’s teams and OpenACC’s gang use a threadpool with the
size of the number of teams or gangs, respectively.
The sizes used are:
- The number of teams is the specified
num_teams (OpenMP) or
num_gangs (OpenACC), or otherwise the number of CUs. It is capped
at two times the number of CUs.
- The number of wavefronts is 4 for gfx900 and 16 otherwise;
num_threads (OpenMP) and num_workers (OpenACC)
override this if smaller.
- Each wavefront has 102 scalars and 64 vectors.
- The number of work items is always 64 (the wavefront size).
- The hardware permits at most 40 workgroups per CU and
16 wavefronts per workgroup, up to a limit of 40 wavefronts in total per CU.
- 80 scalar registers and 24 vector registers are available in non-kernel
functions (the chosen procedure-calling convention).
- For the kernel itself: as many as register pressure demands (the number of
teams and the number of threads are scaled down if registers are exhausted).
Implementation remarks:
- I/O within OpenMP target regions and OpenACC parallel/kernels regions is
supported via the C library’s
printf functions and the Fortran
print/write statements.
- Reverse offload (i.e.
target regions with
device(ancestor:1)) is processed serially per target region,
such that the next reverse-offload region only executes after the previous
one has returned.
- OpenMP code that has a requires directive with
unified_address or
unified_shared_memory will remove any GCN device from the list of
available devices (“host fallback”).
- The available stack size can be changed using the
GCN_STACK_SIZE
environment variable; the default is 32 kiB per thread.