12.1 AMD Radeon (GCN)
On the hardware side, there is the hierarchy (fine to coarse):
- work item (thread)
- wavefront
- work group
- compute unit (CU)
All OpenMP and OpenACC levels are used, i.e.
- OpenMP’s simd and OpenACC’s vector map to work items (threads)
- OpenMP’s threads (“parallel”) and OpenACC’s workers map
to wavefronts
- OpenMP’s teams and OpenACC’s gang use a threadpool with the
size of the number of teams or gangs, respectively.
The sizes used are:
- The number of teams is the specified
num_teams (OpenMP) or
num_gangs (OpenACC), or otherwise the number of CUs. It is capped
at two times the number of CUs.
- The number of wavefronts is 4 for gfx900 and 16 otherwise;
num_threads (OpenMP) and num_workers (OpenACC)
override this if smaller.
- Each wavefront has 102 scalars and 64 vectors.
- The number of work items is always 64 (the wavefront size).
- The hardware permits at most 40 workgroups per CU and
16 wavefronts per workgroup, up to a limit of 40 wavefronts in total per CU.
- 80 scalar registers and 24 vector registers are available in non-kernel
functions (the chosen procedure-calling convention).
- For the kernel itself: as many as register pressure demands (the number of
teams and the number of threads are scaled down if registers are exhausted).
Implementation remarks:
- I/O within OpenMP target regions and OpenACC parallel/kernels regions is
supported via the C library’s
printf functions and the Fortran
print/write statements.
- Reverse offload (i.e.
target regions with
device(ancestor:1)) is processed serially per target region,
such that the next reverse-offload region only executes after the previous
one has returned.
- OpenMP code that has a requires directive with
unified_address or
unified_shared_memory will remove any GCN device from the list of
available devices (“host fallback”).
- The available stack size can be changed using the
GCN_STACK_SIZE
environment variable; the default is 32 kiB per thread.