The release notes for the ROCm platform.
The hipcc
and hipconfig
Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin
and hipconfig.bin
as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to
hipcc.bin
andhipconfig.bin
. Thehipcc
/hipconfig
soft link will be assimilated to point fromhipcc
/hipconfig
to the respective compiled binaries as the default option.
ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility.
The following is the new filesystem hierarchy:4
/opt/rocm-<ver>
| --bin
| --All externally exposed Binaries
| --libexec
| --<component>
| -- Component specific private non-ISA executables (architecture independent)
| --include
| -- <component>
| --<header files>
| --lib
| --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
(public libraries linked with application)
| --<component> (component specific private library, executable data)
| --<cmake>
| --components
| --<component>.config.cmake
| --share
| --html/<component>/*.html
| --info/<component>/*.[pdf, md, txt]
| --man
| --doc
| --<component>
| --<licenses>
| --<component>
| --<misc files> (arch independent non-executable)
| --samples
Note
ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.
For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.
ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
Note
ROCm will continue supporting backward compatibility until the next major release.
Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include
) with a warning message to include files from the new location (/opt/rocm-xxx/include
) as shown in the example below:
// Code snippet from hip_runtime.h
#pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”.
#include "hip/hip_runtime.h"
The wrapper header files’ backward compatibility deprecation is as follows:
#pragma
message announcing deprecation -- ROCm v5.2 release#pragma
message changed to#warning
-- Future release#warning
changed to#error
-- Future release- Backward compatibility wrappers removed -- Future release
Library files are available in the /opt/rocm-xxx/lib
folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib
) has a soft link to the library at the new location.
Example:
$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component>
folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake
) consist of a soft link to the new CMake config.
Example:
$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
In ROCm v5.4.3, improvements to the compiler address errors with the following signatures:
- "error: unhandled SGPR spill to memory"
- "cannot scavenge register without an emergency spill slot!"
- "error: ran out of registers during register allocation"
Some users may encounter a “Cannot find Symbol” error at runtime when using -save-temps. While most -save-temps use cases work correctly, this error may appear occasionally.
This issue is under investigation, and the known workaround is not to use -save-temps when the error appears.
Library | Version |
---|---|
hipBLAS | 0.53.0 |
hipCUB | 2.13.0 |
hipFFT | 1.0.10 |
hipSOLVER | 1.6.0 |
hipSPARSE | 2.3.3 |
rccl | 2.13.4 |
rocALUTION | 2.1.3 |
rocBLAS | 2.46.0 |
rocFFT | 1.0.20 ⇒ 1.0.21 |
rocPRIM | 2.12.0 |
rocRAND | 2.10.16 |
rocSOLVER | 3.20.0 |
rocSPARSE | 2.4.0 |
rocThrust | 2.17.0 |
rocWMMA | 0.9 |
Tensile | 4.35.0 |
rocFFT 1.0.21 for ROCm 5.4.3
- Removed source directory from rocm_install_targets call to prevent installation of rocfft.h in an unintended location.
The hipcc
and hipconfig
Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin
and hipconfig.bin
as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to
hipcc.bin
andhipconfig.bin
. Thehipcc
/hipconfig
soft link will be assimilated to point fromhipcc
/hipconfig
to the respective compiled binaries as the default option.
The following hipcc options are being deprecated and will be removed in a future release:
- The
--amdgpu-target
option is being deprecated, and user must use the–offload-arch
option to specify the GPU architecture. - The
--amdhsa-code-object-version
option is being deprecated. Users can use the Clang/LLVM option-mllvm -mcode-object-version
to debug issues related to code object versions. - The
--hipcc-func-supp
/--hipcc-no-func-supp
options are being deprecated, as the function calls are already supported in production on AMD GPUs.
Under certain circumstances typified by high register pressure, users may encounter a compiler abort with one of the following error messages:
-
error: unhandled SGPR spill to memory
-
cannot scavenge register without an emergency spill slot!
-
error: ran out of registers during register allocation
This is a known issue and will be fixed in a future release.
Library | Version |
---|---|
hipBLAS | 0.53.0 |
hipCUB | 2.13.0 |
hipFFT | 1.0.10 |
hipSOLVER | 1.6.0 |
hipSPARSE | 2.3.3 |
rccl | 2.13.4 |
rocALUTION | 2.1.3 |
rocBLAS | 2.46.0 |
rocFFT | 1.0.20 |
rocPRIM | 2.12.0 |
rocRAND | 2.10.16 |
rocSOLVER | 3.20.0 |
rocSPARSE | 2.4.0 |
rocThrust | 2.17.0 |
rocWMMA | 0.9 |
Tensile | 4.35.0 |
The ROCm v5.4.1 release consists of the following new HIP API:
The following new HIP API is introduced in the ROCm v5.4.1 release.
Note
This is a pre-official version (beta) release of the new APIs.
hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData);
This swaps the stream capture mode of a thread.
@param [in] mode - Pointer to mode value to swap with the current mode
This parameter returns #hipSuccess
, #hipErrorInvalidValue
.
For more information, refer to the HIP API documentation at /bundle/HIP_API_Guide/page/modules.html.
The hipcc
and hipconfig
Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin
and hipconfig.bin
as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to
hipcc.bin
andhipconfig.bin
. Thehipcc
/hipconfig
soft link will be assimilated to point fromhipcc
/hipconfig
to the respective compiled binaries as the default option.
These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release. AMD Instinct™ MI200 Firmware IFWI Maintenance Update #3
This IFWI release fixes the following issue in AMD Instinct™ MI210/MI250 Accelerators.
After prolonged periods of operation, certain MI200 Instinct™ Accelerators may perform in a degraded way resulting in application failures.
In this package, AMD delivers a new firmware version for MI200 GPU accelerators and a firmware installation tool – AMD FW FLASH 1.2.
GPU | Production Part Number | SKU | IFWI Name |
---|---|---|---|
MI210 | 113-D673XX | D67302 | D6730200V.110 |
MI210 | 113-D673XX | D67301 | D6730100V.073 |
MI250 | 113-D652XX | D65209 | D6520900.073 |
MI250 | 113-D652XX | D65210 | D6521000.073 |
Instructions on how to download and apply MI200 maintenance updates are available at:
Maintenance update #3, combined with ROCm 5.4.1, now provides SRIOV virtualization support for all AMD Instinct™ MI200 devices.
Library | Version |
---|---|
hipBLAS | 0.53.0 |
hipCUB | 2.13.0 |
hipFFT | 1.0.10 |
hipSOLVER | 1.6.0 |
hipSPARSE | 2.3.3 |
rccl | 2.13.4 |
rocALUTION | 2.1.3 |
rocBLAS | 2.46.0 |
rocFFT | 1.0.19 ⇒ 1.0.20 |
rocPRIM | 2.12.0 |
rocRAND | 2.10.16 |
rocSOLVER | 3.20.0 |
rocSPARSE | 2.4.0 |
rocThrust | 2.17.0 |
rocWMMA | 0.9 |
Tensile | 4.35.0 |
rocFFT 1.0.20 for ROCm 5.4.1
- Fixed incorrect results on strided large 1D FFTs where batch size does not equal the stride.
The ROCm v5.4 release consists of the following HIP enhancements:
A new timer function wall_clock64() is supported, which returns wall clock count at a constant frequency on the device.
long long int wall_clock64();
It returns wall clock count at a constant frequency on the device, which can be queried via HIP API with the hipDeviceAttributeWallClockRate attribute of the device in the HIP application code.
Example:
int wallClkRate = 0; //in kilohertz
+HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
Where hipDeviceAttributeWallClockRate is a device attribute.
Note
The wall clock frequency is a per-device attribute.
The GPU_MAX_HW_QUEUES registry defines the maximum number of independent hardware queues allocated per process per device.
The environment variable controls how many independent hardware queues HIP runtime can create per process, per device. If the application allocates more HIP streams than this number, then the HIP runtime reuses the same hardware queues for the new streams in a round-robin manner.
Note
This maximum number does not apply to hardware queues created for CU-masked HIP streams or cooperative queues for HIP Cooperative Groups (there is only one queue per device).
For more details, refer to the HIP Programming Guide.
The following new HIP APIs are available in the ROCm v5.4 release.
Note
This is a pre-official version (beta) release of the new APIs.
hipError_t hipDrvGetErrorName(hipError_t hipError, const char** errorString);
This returns HIP errors in the text string format.
hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString);
This returns text string messages with more details about the error.
For more information, refer to the HIP API Guide.
With ROCm v5.4, a separate GitHub project is created at
https://github.com/ROCm-Developer-Tools/hip-tests
This contains HIP catch2 tests and samples, and new tests will continue to develop.
In future ROCm releases, catch2 tests and samples will be removed from the HIP project.
This release consists of the following OpenMP enhancements:
- Enable new device RTL in libomptarget as default.
- New flag
-fopenmp-target-fast
to imply-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism
. - Support for the collapse clause and non-unit stride in cases where the No-Loop specialized kernel is generated.
- Initial implementation of optimized cross-team sum reduction for float and double type scalars.
- Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.
The hipcc
and hipconfig
Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin
and hipconfig.bin
as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to
hipcc.bin
andhipconfig.bin
. Thehipcc
/hipconfig
soft link will be assimilated to point fromhipcc
/hipconfig
to the respective compiled binaries as the default option.
ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility.
The following is the new filesystem hierarchy:
/opt/rocm-<ver>
| --bin
| --All externally exposed Binaries
| --libexec
| --<component>
| -- Component specific private non-ISA executables (architecture independent)
| --include
| -- <component>
| --<header files>
| --lib
| --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
(public libraries linked with application)
| --<component> (component specific private library, executable data)
| --<cmake>
| --components
| --<component>.config.cmake
| --share
| --html/<component>/*.html
| --info/<component>/*.[pdf, md, txt]
| --man
| --doc
| --<component>
| --<licenses>
| --<component>
| --<misc files> (arch independent non-executable)
| --samples
Note
ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.
For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.
ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
Note
ROCm will continue supporting backward compatibility until the next major release.
Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include
) with a warning message to include files from the new location (/opt/rocm-xxx/include
) as shown in the example below:
// Code snippet from hip_runtime.h
#pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”.
#include "hip/hip_runtime.h"
The wrapper header files’ backward compatibility deprecation is as follows:
#pragma
message announcing deprecation -- ROCm v5.2 release#pragma
message changed to#warning
-- Future release#warning
changed to#error
-- Future release- Backward compatibility wrappers removed -- Future release
Library files are available in the /opt/rocm-xxx/lib
folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib
) has a soft link to the library at the new location.
Example:
$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component>
folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake
) consist of a soft link to the new CMake config.
Example:
$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
The following defects are fixed in this release.
These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release.
The test was incorrectly using the hipDeviceAttributePageableMemoryAccess
device attribute to determine coherent support.
hipHostMalloc()
allocates memory with fine-grained access by default when the environment variable HIP_HOST_COHERENT=1
is used.
For more information, refer to the HIP Programming Guide at
On GFX10 GPUs, kernel execution hangs when it is launched on streams created using hipStreamWithCUMask
.
On GFX10 GPUs, each workgroup processor encompasses two compute units, and the compute units must be enabled as a pair. The hipStreamWithCUMask
API unit test cases are updated to set compute unit mask (cuMask) in pairs for GFX10 GPUs.
The HIP language device IDs are not the same as the GPU IDs reported by the tools. GPU IDs are globally unique and guaranteed to be consistent across APIs and processes.
GPU IDs reported by ROCTracer and ROCProfiler or ROCm Tools are HSA Driver Node ID of that GPU, as it is a unique ID for that device in that particular node.
Library | Version |
---|---|
hipBLAS | 0.52.0 ⇒ 0.53.0 |
hipCUB | 2.12.0 ⇒ 2.13.0 |
hipFFT | 1.0.9 ⇒ 1.0.10 |
hipSOLVER | 1.5.0 ⇒ 1.6.0 |
hipSPARSE | 2.3.1 ⇒ 2.3.3 |
rccl | 2.12.10 ⇒ 2.13.4 |
rocALUTION | 2.1.0 ⇒ 2.1.3 |
rocBLAS | 2.45.0 ⇒ 2.46.0 |
rocFFT | 1.0.18 ⇒ 1.0.19 |
rocPRIM | 2.11.0 ⇒ 2.12.0 |
rocRAND | 2.10.15 ⇒ 2.10.16 |
rocSOLVER | 3.19.0 ⇒ 3.20.0 |
rocSPARSE | 2.2.0 ⇒ 2.4.0 |
rocThrust | 2.16.0 ⇒ 2.17.0 |
rocWMMA | 0.8 ⇒ 0.9 |
Tensile | 4.34.0 ⇒ 4.35.0 |
hipBLAS 0.53.0 for ROCm 5.4.0
- Allow for selection of int8 datatype
- Added support for hipblasXgels and hipblasXgelsStridedBatched operations (with s,d,c,z precisions), only supported with rocBLAS backend
- Added support for hipblasXgelsBatched operations (with s,d,c,z precisions)
hipCUB 2.13.0 for ROCm 5.4.0
- CMake functionality to improve build parallelism of the test suite that splits compilation units by function or by parameters.
- New overload for
BlockAdjacentDifference::SubtractLeftPartialTile
that takes a predecessor item.
- Improved build parallelism of the test suite by splitting up large compilation units for
DeviceRadixSort
,DeviceSegmentedRadixSort
andDeviceSegmentedSort
. - CUB backend references CUB and thrust version 1.17.1.
hipFFT 1.0.10 for ROCm 5.4.0
- Added hipfftExtPlanScaleFactor API to efficiently multiply each output element of a FFT by a given scaling factor. Result scaling must be supported in the backend FFT library.
- When hipFFT is built against the rocFFT backend, rocFFT 1.0.19 or higher is now required.
hipSOLVER 1.6.0 for ROCm 5.4.0
- Added compatibility-only functions
- gesvdaStridedBatched
- hipsolverDnSgesvdaStridedBatched_bufferSize, hipsolverDnDgesvdaStridedBatched_bufferSize, hipsolverDnCgesvdaStridedBatched_bufferSize, hipsolverDnZgesvdaStridedBatched_bufferSize
- hipsolverDnSgesvdaStridedBatched, hipsolverDnDgesvdaStridedBatched, hipsolverDnCgesvdaStridedBatched, hipsolverDnZgesvdaStridedBatched
- gesvdaStridedBatched
hipSPARSE 2.3.3 for ROCm 5.4.0
- Added hipsparseCsr2cscEx2_bufferSize and hipsparseCsr2cscEx2 routines
- HIPSPARSE_ORDER_COLUMN has been renamed to HIPSPARSE_ORDER_COL to match cusparse
RCCL 2.13.4 for ROCm 5.4.0
- Compatibility with NCCL 2.13.4
- Improvements to RCCL when running with hipGraphs
- RCCL_ENABLE_HIPGRAPH environment variable is no longer necessary to enable hipGraph support
- Minor latency improvements
- Resolved potential memory access error due to asynchronous memset
rocALUTION 2.1.3 for ROCm 5.4.0
- Added build support for Navi31 and Navi33
- Added support for non-squared global matrices
- Fixed a memory leak in MatrixMult on HIP backend
- Global structures can now be used with a single process
- Switched GTest death test style to 'threadsafe'
- GlobalVector::GetGhostSize() is deprecated and will be removed
- ParallelManager::GetGlobalSize(), ParallelManager::GetLocalSize(), ParallelManager::SetGlobalSize() and ParallelManager::SetLocalSize() are deprecated and will be removed
- Vector::GetGhostSize() is deprecated and will be removed
- Multigrid::SetOperatorFormat(unsigned int) is deprecated and will be removed, use Multigrid::SetOperatorFormat(unsigned int, int) instead
- RugeStuebenAMG::SetCouplingStrength(ValueType) is deprecated and will be removed, use SetStrengthThreshold(float) instead
rocBLAS 2.46.0 for ROCm 5.4.0
- client smoke test dataset added for quick validation using command rocblas-test --yaml rocblas_smoke.yaml
- Added stream order device memory allocation as a non-default beta option.
- Improved trsm performance for small sizes by using a substitution method technique
- Improved syr2k and her2k performance significantly by using a block-recursive algorithm
- Level 2, Level 1, and Extension functions: argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour.
- Add variable to turn on/off ieee16/ieee32 tests for mixed precision gemm
- Allow hipBLAS to select int8 datatype
- Disallow B == C && ldb != ldc in rocblas_xtrmm_outofplace
- FORTRAN interfaces generalized for FORTRAN compilers other than gfortran
- fix for trsm_strided_batched rocblas-bench performance gathering
- Fix for rocm-smi path in commandrunner.py script to match ROCm 5.2 and above
rocFFT 1.0.19 for ROCm 5.4.0
- Optimized some strided large 1D plans.
- Added rocfft_plan_description_set_scale_factor API to efficiently multiply each output element of a FFT by a given scaling factor.
- Created a rocfft_kernel_cache.db file next to the installed library. SBCC kernels are moved to this file when built with the library, and are runtime-compiled for new GPU architectures.
- Added gfx1100 and gfx1102 to default AMDGPU_TARGETS.
- Moved runtime compilation cache to in-memory by default. A default on-disk cache can encounter contention problems on multi-node clusters with a shared filesystem. rocFFT can still be told to use an on-disk cache by setting the ROCFFT_RTC_CACHE_PATH environment variable.
rocPRIM 2.12.0 for ROCm 5.4.0
device_partition
,device_unique
, anddevice_reduce_by_key
now support problem sizes larger than 2^32 items.
block_sort::sort()
overload for keys and values with a dynamic size. This overload was documented but the implementation is missing. To avoid further confusion the documentation is removed until a decision is made on implementing the function.
- Fixed the compilation failure in
device_merge
if the two key iterators don't match.
rocRAND 2.10.16 for ROCm 5.4.0
- MRG31K3P pseudorandom number generator based on L'Ecuyer and Touzin, 2000, "Fast combined multiple recursive generators with multipliers of the form a = ±2q ±2r".
- LFSR113 pseudorandom number generator based on L'Ecuyer, 1999, "Tables of maximally equidistributed combined LFSR generators".
- SCRAMBLED_SOBOL32 and SCRAMBLED_SOBOL64 quasirandom number generators. The Scrambled Sobol sequences are generated by scrambling the output of a Sobol sequence.
- The
mrg_<distribution>_distribution
structures, which provided numbers based on MRG32K3A, are now replaced bymrg_engine_<distribution>_distribution
, where<distribution>
islog_normal
,normal
,poisson
, oruniform
. These structures provide numbers for MRG31K3P (with template typerocrand_state_mrg31k3p
) and MRG32K3A (with template typerocrand_state_mrg32k3a
).
- Sobol64 now returns 64 bits random numbers, instead of 32 bits random numbers. As a result, the performance of this generator has regressed.
- Fixed a bug that prevented compiling code in C++ mode (with a host compiler) when it included the rocRAND headers on Windows.
rocSOLVER 3.20.0 for ROCm 5.4.0
- Partial SVD for bidiagonal matrices:
- BDSVDX
- Partial SVD for general matrices:
- GESVDX (with batched and strided_batched versions)
- Changed
ROCSOLVER_EMBED_FMT
default toON
for users building directly with CMake. This matches the existing default when building with install.sh or rmake.py.
rocSPARSE 2.4.0 for ROCm 5.4.0
- Added rocsparse_spmv_ex routine
- Added rocsparse_bsrmv_ex_analysis and rocsparse_bsrmv_ex routines
- Added csritilu0 routine
- Added build support for Navi31 and Navi 33
- Optimization to segmented algorithm for COO SpMV by performing analysis
- Improve performance when generating random matrices.
- Fixed bug in ellmv
- Optimized bsr2csr routine
- Fixed integer overflow bugs
rocThrust 2.17.0 for ROCm 5.4.0
- Updated to match upstream Thrust 1.17.0
rocWMMA 0.9 for ROCm 5.4.0
- Added gemm driver APIs for flow control builtins
- Added benchmark logging systems
- Restructured tests to follow naming convention. Added macros for test generation
- Changed CMake to accomodate the modified test infrastructure
- Fine tuned the multi-block kernels with and without lds
- Adjusted Maximum Vector Width to dWordx4 Width
- Updated Efficiencies to display as whole number percentages
- Updated throughput from GFlops/s to TFlops/s
- Reset the ad-hoc tests to use smaller sizes
- Modified the output validation to use CPU-based implementation against rocWMMA
- Modified the extended vector test to return error codes for memory allocation failures
Tensile 4.35.0 for ROCm 5.4.0
- Async DMA support for Transpose Data Layout (ThreadSeparateGlobalReadA/B)
- Option to output library logic in dictionary format
- No solution found error message for benchmarking client
- Exact K check for StoreCInUnrollExact
- Support for CGEMM + MIArchVgpr
- client-path parameter for using prebuilt client
- CleanUpBuildFiles global parameter
- Debug flag for printing library logic index of winning solution
- NumWarmups global parameter for benchmarking
- Windows support for benchmarking client
- DirectToVgpr support for CGEMM
- TensileLibLogicToYaml for creating tuning configs from library logic solutions
- Put beta code and store separately if StoreCInUnroll = x4 store
- Improved performance for StoreCInUnroll + b128 store
- Re-enable HardwareMonitor for gfx90a
- Decision trees use MLFeatures instead of Properties
- Reject DirectToVgpr + MatrixInstBM/BN > 1
- Fix benchmark timings when using warmups and/or validation
- Fix mismatch issue with DirectToVgprB + VectorWidth > 1
- Fix mismatch issue with DirectToLds + NumLoadsCoalesced > 1 + TailLoop
- Fix incorrect reject condition for DirectToVgpr
- Fix reject condition for DirectToVgpr + MIWaveTile < VectorWidth
- Fix incorrect instruction generation with StoreCInUnroll
There was a known issue with rocTHRUST and rocPRIM libraries supporting iterator and types in ROCm v5.3.x releases.
thrust::merge
no longer correctly supports different iterator types forkeys_input1
andkeys_input2
.rocprim::device_merge
no longer correctly supports using different types forkeys_input1
andkeys_input2
.
This issue is resolved with the following fixes to compilation failures:
- rocPRIM: in device_merge if the two key iterators do not match.
- rocTHRUST: in thrust::merge if the two key iterators do not match.
Library | Version |
---|---|
hipBLAS | 0.52.0 |
hipCUB | 2.12.0 |
hipFFT | 1.0.9 |
hipSOLVER | 1.5.0 |
hipSPARSE | 2.3.1 |
rccl | 2.12.10 |
rocALUTION | 2.1.0 |
rocBLAS | 2.45.0 |
rocFFT | 1.0.18 |
rocPRIM | 2.11.0 |
rocRAND | 2.10.15 |
rocSOLVER | 3.19.0 |
rocSPARSE | 2.2.0 |
rocThrust | 2.16.0 |
rocWMMA | 0.8 |
Tensile | 4.34.0 |
The following known issues in ROCm v5.3.2 are fixed in this release.
Peer-to-Peer Direct Memory Access (DMA) mapping errors on Dell systems (R7525 and R750XA) with SLES 15 SP3/SP4 and RHEL 9.0 are fixed in this release.
Previously, running rocminfo resulted in Peer-to-Peer DMA mapping errors.
The RCCL tuning table is updated for supported platforms.
Functional correctness failures in SGEMM (F32 GEMM) routines in rocBLAS for certain problem sizes and ranges are fixed in this release.
This section consists of known issues in this release.
There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of SRIOV-based workloads but does not impact Discrete Device Assignment (DDA) or bare metal.
Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads.
Customers cannot update the Integrated Firmware Image (IFWI) for AMD Instinct™ MI200 accelerators.
An updated firmware maintenance bundle consisting of an installation tool and images specific to AMD Instinct™ MI200 accelerators is under planning and will be available soon.
There is a known known issue with rocThrust and rocPRIM libraries supporting iterator and types in ROCm v5.3.x releases.
-
thrust::merge no longer correctly supports different iterator types for
keys_input1
andkeys_input2
. -
rocprim::device_merge no longer correctly supports using different types for
keys_input1
andkeys_input2
.
This issue is currently under investigation and will be resolved in a future release.
Library | Version |
---|---|
hipBLAS | 0.52.0 |
hipCUB | 2.12.0 |
hipFFT | 1.0.9 |
hipSOLVER | 1.5.0 |
hipSPARSE | 2.3.1 |
rccl | 2.12.10 |
rocALUTION | 2.1.0 |
rocBLAS | 2.45.0 |
rocFFT | 1.0.18 |
rocPRIM | 2.11.0 |
rocRAND | 2.10.15 |
rocSOLVER | 3.19.0 |
rocSPARSE | 2.2.0 |
rocThrust | 2.16.0 |
rocWMMA | 0.8 |
Tensile | 4.34.0 |
The hipcc
and hipconfig
Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin
and hipconfig.bin
as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to
hipcc.bin
andhipconfig.bin
. Thehipcc
/hipconfig
soft link will be assimilated to point fromhipcc
/hipconfig
to the respective compiled binaries as the default option.
ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility.
The following is the new filesystem hierarchy:
/opt/rocm-<ver>
| --bin
| --All externally exposed Binaries
| --libexec
| --<component>
| -- Component specific private non-ISA executables (architecture independent)
| --include
| -- <component>
| --<header files>
| --lib
| --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
(public libraries linked with application)
| --<component> (component specific private library, executable data)
| --<cmake>
| --components
| --<component>.config.cmake
| --share
| --html/<component>/*.html
| --info/<component>/*.[pdf, md, txt]
| --man
| --doc
| --<component>
| --<licenses>
| --<component>
| --<misc files> (arch independent non-executable)
| --samples
Note
ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.
For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.
ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
Note
ROCm will continue supporting backward compatibility until the next major release.
Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include
) with a warning message to include files from the new location (/opt/rocm-xxx/include
) as shown in the example below:
// Code snippet from hip_runtime.h
#pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”.
#include "hip/hip_runtime.h"
The wrapper header files’ backward compatibility deprecation is as follows:
#pragma
message announcing deprecation -- ROCm v5.2 release#pragma
message changed to#warning
-- Future release#warning
changed to#error
-- Future release- Backward compatibility wrappers removed -- Future release
Library files are available in the /opt/rocm-xxx/lib
folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib
) has a soft link to the library at the new location.
Example:
$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component>
folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake
) consist of a soft link to the new CMake config.
Example:
$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
The following defects are fixed in this release.
These defects were identified and documented as known issues in previous ROCm releases and are fixed in the ROCm v5.3 release.
User code did not initialize certain data constructs, leading to a correctness issue. A strict reading of the C++ standard suggests that failing to initialize these data constructs is undefined behavior. However, a special case was added for a specific compiler builtin to handle the uninitialized data in a defined manner.
The compiler fix consists of the following patches:
- A new
noundef
attribute is added. This attribute denotes when a function call argument or return val may never contain uninitialized bits. For more information, see https://reviews.llvm.org/D81678 - The application of this attribute was refined such that it was not added to a specific compiler builtin where the compiler knows that inactive lanes do not impact program execution.
For more information, see https://github.com/RadeonOpenCompute/llvm-project/commit/accf36c58409268ca1f216cdf5ad812ba97ceccd.
This section consists of known issues in this release.
The openmp-extras
package has been split into runtime (openmp-extras-runtime
) and dev (openmp-extras-devel
) packages. This change has broken the upgrade support for the openmp-extras
package in RHEL/SLES.
An available workaround in RHEL is to use the following command for upgrades:
sudo yum upgrade rocm-language-runtime --allowerasing
An available workaround in SLES is to use the following command for upgrades:
zypper update --force-resolution <meta-package>
There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of SRIOV-based workloads, but does not impact Discrete Device Assignment (DDA) or Bare Metal.
Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads.
If IOMMU is enabled in SBIOS and ROCm is installed, the system may report the following failure or errors when running workloads such as bandwidth test, clinfo, and HelloWord.cl and cause a system crash.
- IO PAGE FAULT
- IRQ remapping does not support X2APIC mode
- NMI error
Workaround: To avoid the system crash, add amd_iommu=on iommu=pt
as the kernel bootparam, as indicated in the warning message.
Library | Version |
---|---|
hipBLAS | 0.51.0 ⇒ 0.52.0 |
hipCUB | 2.11.1 ⇒ 2.12.0 |
hipFFT | 1.0.8 ⇒ 1.0.9 |
hipSOLVER | 1.4.0 ⇒ 1.5.0 |
hipSPARSE | 2.2.0 ⇒ 2.3.1 |
rccl | 2.12.10 |
rocALUTION | 2.0.3 ⇒ 2.1.0 |
rocBLAS | 2.44.0 ⇒ 2.45.0 |
rocFFT | 1.0.17 ⇒ 1.0.18 |
rocPRIM | 2.10.14 ⇒ 2.11.0 |
rocRAND | 2.10.14 ⇒ 2.10.15 |
rocSOLVER | 3.18.0 ⇒ 3.19.0 |
rocSPARSE | 2.2.0 |
rocThrust | 2.15.0 ⇒ 2.16.0 |
rocWMMA | 0.7 ⇒ 0.8 |
Tensile | 4.33.0 ⇒ 4.34.0 |
hipBLAS 0.52.0 for ROCm 5.3.0
- Added --cudapath option to install.sh to allow user to specify which cuda build they would like to use.
- Added --installcuda option to install.sh to install cuda via a package manager. Can be used with new --installcudaversion option to specify which version of cuda to install.
- Fixed #includes to support a compiler version.
- Fixed client dependency support in install.sh
hipCUB 2.12.0 for ROCm 5.3.0
- UniqueByKey device algorithm
- SubtractLeft, SubtractLeftPartialTile, SubtractRight, SubtractRightPartialTile overloads in BlockAdjacentDifference.
- The old overloads (FlagHeads, FlagTails, FlagHeadsAndTails) are deprecated.
- DeviceAdjacentDifference algorithm.
- Extended benchmark suite of
DeviceHistogram
,DeviceScan
,DevicePartition
,DeviceReduce
,DeviceSegmentedReduce
,DeviceSegmentedRadixSort
,DeviceRadixSort
,DeviceSpmv
,DeviceMergeSort
,DeviceSegmentedSort
- Obsolated type traits defined in util_type.hpp. Use the standard library equivalents instead.
- CUB backend references CUB and thrust version 1.16.0.
- DeviceRadixSort's num_items parameter's type is now templated instead of being an int.
- If an integral type with a size at most 4 bytes is passed (i.e. an int), the former logic applies.
- Otherwise the algorithm uses a larger indexing type that makes it possible to sort input data over 2**32 elements.
- Improved build parallelism of the test suite by splitting up large compilation units
hipFFT 1.0.9 for ROCm 5.3.0
- Clean up build warnings.
- GNUInstall Dir enhancements.
- Requires gtest 1.11.
hipSOLVER 1.5.0 for ROCm 5.3.0
- Added functions
- syevj
- hipsolverSsyevj_bufferSize, hipsolverDsyevj_bufferSize, hipsolverCheevj_bufferSize, hipsolverZheevj_bufferSize
- hipsolverSsyevj, hipsolverDsyevj, hipsolverCheevj, hipsolverZheevj
- syevjBatched
- hipsolverSsyevjBatched_bufferSize, hipsolverDsyevjBatched_bufferSize, hipsolverCheevjBatched_bufferSize, hipsolverZheevjBatched_bufferSize
- hipsolverSsyevjBatched, hipsolverDsyevjBatched, hipsolverCheevjBatched, hipsolverZheevjBatched
- sygvj
- hipsolverSsygvj_bufferSize, hipsolverDsygvj_bufferSize, hipsolverChegvj_bufferSize, hipsolverZhegvj_bufferSize
- hipsolverSsygvj, hipsolverDsygvj, hipsolverChegvj, hipsolverZhegvj
- syevj
- Added compatibility-only functions
- syevdx/heevdx
- hipsolverDnSsyevdx_bufferSize, hipsolverDnDsyevdx_bufferSize, hipsolverDnCheevdx_bufferSize, hipsolverDnZheevdx_bufferSize
- hipsolverDnSsyevdx, hipsolverDnDsyevdx, hipsolverDnCheevdx, hipsolverDnZheevdx
- sygvdx/hegvdx
- hipsolverDnSsygvdx_bufferSize, hipsolverDnDsygvdx_bufferSize, hipsolverDnChegvdx_bufferSize, hipsolverDnZhegvdx_bufferSize
- hipsolverDnSsygvdx, hipsolverDnDsygvdx, hipsolverDnChegvdx, hipsolverDnZhegvdx
- syevdx/heevdx
- Added --mem_query option to hipsolver-bench, which will print the amount of device memory workspace required by the function.
- The rocSOLVER backend will now set
info
to zero if rocSOLVER does not referenceinfo
. (Applies to orgbr/ungbr, orgqr/ungqr, orgtr/ungtr, ormqr/unmqr, ormtr/unmtr, gebrd, geqrf, getrs, potrs, and sytrd/hetrd). - gesvdj will no longer require extra workspace to transpose
V
whenjobz
isHIPSOLVER_EIG_MODE_VECTOR
andecon
is 1.
- Fixed Fortran return value declarations within hipsolver_module.f90
- Fixed gesvdj_bufferSize returning
HIPSOLVER_STATUS_INVALID_VALUE
whenjobz
isHIPSOLVER_EIG_MODE_NOVECTOR
and 1 <=ldv
<n
- Fixed gesvdj returning
HIPSOLVER_STATUS_INVALID_VALUE
whenjobz
isHIPSOLVER_EIG_MODE_VECTOR
,econ
is 1, andm
<n
hipSPARSE 2.3.1 for ROCm 5.3.0
- Add SpMM and SpMM batched for CSC format
rocALUTION 2.1.0 for ROCm 5.3.0
- Benchmarking tool
- Ext+I Interpolation with sparsify strategies added for RS-AMG
- ParallelManager
rocBLAS 2.45.0 for ROCm 5.3.0
- install.sh option --upgrade_tensile_venv_pip to upgrade Pip in Tensile Virtual Environment. The corresponding CMake option is TENSILE_VENV_UPGRADE_PIP.
- install.sh option --relocatable or -r adds rpath and removes ldconf entry on rocBLAS build.
- install.sh option --lazy-library-loading to enable on-demand loading of tensile library files at runtime to speedup rocBLAS initialization.
- Support for RHEL9 and CS9.
- Added Numerical checking routine for symmetric, Hermitian, and triangular matrices, so that they could be checked for any numerical abnormalities such as NaN, Zero, infinity and denormal value.
- trmm_outofplace performance improvements for all sizes and data types using block-recursive algorithm.
- herkx performance improvements for all sizes and data types using block-recursive algorithm.
- syrk/herk performance improvements by utilising optimised syrkx/herkx code.
- symm/hemm performance improvements for all sizes and datatypes using block-recursive algorithm.
- Unifying library logic file names: affects HBH (->HHS_BH), BBH (->BBS_BH), 4xi8BH (->4xi8II_BH). All HPA types are using the new naming convention now.
- Level 3 function argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour.
- Level 1, 2, and 3 function argument checking for enums is now more rigorously matching legacy BLAS so returns rocblas_status_invalid_value if arguments do not match the accepted subset.
- Add quick-return for internal trmm and gemm template functions.
- Moved function block sizes to a shared header file.
- Level 1, 2, and 3 functions use rocblas_stride datatype for offset.
- Modified the matrix and vector memory allocation in our test infrastructure for all Level 1, 2, 3 and BLAS_EX functions.
- Added specific initialization for symmetric, Hermitian, and triangular matrix types in our test infrastructure.
- Added NaN tests to the test infrastructure for the rest of Level 3, BLAS_EX functions.
- Improved logic to #include <filesystem> vs <experimental/filesystem>.
- install.sh -s option to build rocblas as a static library.
- dot function now sets the device results asynchronously for N <= 0
- is_complex helper is now deprecated. Use rocblas_is_complex instead.
- The enum truncate_t and the value truncate is now deprecated and will removed from the ROCm release 6.0. It is replaced by rocblas_truncate_t and rocblas_truncate, respectively. The new enum rocblas_truncate_t and the value rocblas_truncate could be used from this ROCm release for an easy transition.
- install.sh options --hip-clang , --no-hip-clang, --merge-files, --no-merge-files are removed.
rocFFT 1.0.18 for ROCm 5.3.0
- Runtime compilation cache now looks for environment variables XDG_CACHE_HOME (on Linux) and LOCALAPPDATA (on Windows) before falling back to HOME.
- Optimized 2D R2C/C2R to use 2-kernel plans where possible.
- Improved performance of the Bluestein algorithm.
- Optimized sbcc-168 and 100 by using half-lds.
- Fixed occasional failures to parallelize runtime compilation of kernels. Failures would be retried serially and ultimately succeed, but this would take extra time.
- Fixed failures of some R2C 3D transforms that use the unsupported TILE_UNALGNED SBRC kernels. An example is 98^3 R2C out-of-place.
- Fixed bugs in SBRC_ERC type.
rocPRIM 2.11.0 for ROCm 5.3.0
- New functions
subtract_left
andsubtract_right
inblock_adjacent_difference
to apply functions on pairs of adjacent items distributed between threads in a block. - New device level
adjacent_difference
primitives. - Added experimental tooling for automatic kernel configuration tuning for various architectures
- Benchmarks collect and output more detailed system information
- CMake functionality to improve build parallelism of the test suite that splits compilation units by function or by parameters.
- Reverse iterator.
rocRAND 2.10.15 for ROCm 5.3.0
- Increased number of warmup iterations for rocrand_benchmark_generate from 5 to 15 to eliminate corner cases that would generate artificially high benchmark scores.
rocSOLVER 3.19.0 for ROCm 5.3.0
- Partial eigensolver routines for symmetric/hermitian matrices:
- SYEVX (with batched and strided_batched versions)
- HEEVX (with batched and strided_batched versions)
- Generalized symmetric- and hermitian-definite partial eigensolvers:
- SYGVX (with batched and strided_batched versions)
- HEGVX (with batched and strided_batched versions)
- Eigensolver routines for symmetric/hermitian matrices using Jacobi algorithm:
- SYEVJ (with batched and strided_batched versions)
- HEEVJ (with batched and strided_batched versions)
- Generalized symmetric- and hermitian-definite eigensolvers using Jacobi algorithm:
- SYGVJ (with batched and strided_batched versions)
- HEGVJ (with batched and strided_batched versions)
- Added --profile_kernels option to rocsolver-bench, which will include kernel calls in the profile log (if profile logging is enabled with --profile).
- Changed rocsolver-bench result labels
cpu_time
andgpu_time
tocpu_time_us
andgpu_time_us
, respectively.
- Removed dependency on cblas from the rocsolver test and benchmark clients.
- Fixed incorrect SYGS2/HEGS2, SYGST/HEGST, SYGV/HEGV, and SYGVD/HEGVD results for batch counts larger than 32.
- Fixed STEIN memory access fault when nev is 0.
- Fixed incorrect STEBZ results for close eigenvalues when range = index.
- Fixed git unsafe repository error when building with
./install.sh -cd
as a non-root user.
rocThrust 2.16.0 for ROCm 5.3.0
- rocThrust functionality dependent on device malloc works is functional as ROCm 5.2 reneabled device malloc. Device launched
thrust::sort
andthrust::sort_by_key
are available for use.
rocWMMA 0.8 for ROCm 5.3.0
Tensile 4.34.0 for ROCm 5.3.0
- Lazy loading of solution libraries and code object files
- Support for dictionary style logic files
- Support for decision tree based logic files using dictionary format
- DecisionTreeLibrary for solution selection
- DirectToLDS support for HGEMM
- DirectToVgpr support for SGEMM
- Grid based distance metric for solution selection
- Support for gfx11xx
- Support for DirectToVgprA/B + TLU=False
- ForkParameters Groups as a way of specifying solution parameters
- Support for a new Tensile yaml config format
- TensileClientConfig for generating Tensile client config files
- Options for TensileCreateLibrary to build client and create client config file
- Solution generation is now cached and is not repeated if solution parameters are unchanged
- Default MACInstruction to FMA
- Accept StaggerUStride=0 as valid
- Reject invalid data types for UnrollLoopEfficiencyEnable
- Fix invalid code generation issues related to DirectToVgpr
- Return hipErrorNotFound if no modules are loaded
- Fix performance drop for NN ZGEMM with 96x64 macro tile
- Fix memory violation for general batched kernels when alpha/beta/K = 0
Support for Ubuntu 18.04 ends in this release. Future releases of ROCm will not provide prebuilt packages for Ubuntu 18.04. HIP and Other Runtimes
-
A bug was discovered in the HIP graph capture implementation in the ROCm v5.2.0 release. If the same kernel is called twice (with different argument values) in a graph capture, the implementation only kept the argument values for the second kernel call.
-
A bug was introduced in the hiprtc implementation in the ROCm v5.2.0 release. This bug caused the
hiprtcGetLoweredName
call to fail for named expressions with whitespace in it.
Example:
The named expression my_sqrt<complex<double>>
passed but my_sqrt<complex<double >>
failed.
ROCm Libraries
Compatibility with NCCL 2.12.10
-
Packages for test and benchmark executables on all supported OSes using CPack
-
Added custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1
- Additional details provided if Binary File Descriptor library (BFD) is pre-installed.
-
Added experimental support for using multiple ranks per device
-
Requires using a new interface to create communicator (ncclCommInitRankMulti), refer to the interface documentation for details.
-
To avoid potential deadlocks, user might have to set an environment variables increasing the number of hardware queues. For example,
-
export GPU_MAX_HW_QUEUES=16
-
Added support for reusing ports in NET/IB channels
-
Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1
-
When "Call to bind failed: Address already in use" error happens in large-scale AlltoAll(for example, >=64 MI200 nodes), users are suggested to opt-in either one or both of the options to resolve the massive port usage issue
-
Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1
-
- Removed experimental clique-based kernels
No notable changes in this release for development tools, including the compiler, profiler, and debugger Deployment and Management Tools
No notable changes in this release for deployment and management tools. Older ROCm Releases
For release information for older ROCm releases, refer to https://github.com/RadeonOpenCompute/ROCm/blob/master/CHANGELOG.md
Library | Version |
---|---|
hipBLAS | 0.51.0 |
hipCUB | 2.11.1 |
hipFFT | 1.0.8 |
hipSOLVER | 1.4.0 |
hipSPARSE | 2.2.0 |
rccl | 2.11.4 ⇒ 2.12.10 |
rocALUTION | 2.0.3 |
rocBLAS | 2.44.0 |
rocFFT | 1.0.17 |
rocPRIM | 2.10.14 |
rocRAND | 2.10.14 |
rocSOLVER | 3.18.0 |
rocSPARSE | 2.2.0 |
rocThrust | 2.15.0 |
rocWMMA | 0.7 |
Tensile | 4.33.0 |
RCCL 2.12.10 for ROCm 5.2.3
- Compatibility with NCCL 2.12.10
- Packages for test and benchmark executables on all supported OSes using CPack.
- Adding custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1
- Additional details provided if Binary File Descriptor library (BFD) is pre-installed
- Adding support for reusing ports in NET/IB channels
- Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1
- When "Call to bind failed : Address already in use" error happens in large-scale AlltoAll (e.g., >=64 MI200 nodes), users are suggested to opt-in either one or both of the options to resolve the massive port usage issue
- Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1
- Removed experimental clique-based kernels
Library | Version |
---|---|
hipBLAS | 0.51.0 |
hipCUB | 2.11.1 |
hipFFT | 1.0.8 |
hipSOLVER | 1.4.0 |
hipSPARSE | 2.2.0 |
rccl | 2.11.4 |
rocALUTION | 2.0.3 |
rocBLAS | 2.44.0 |
rocFFT | 1.0.17 |
rocPRIM | 2.10.14 |
rocRAND | 2.10.14 |
rocSOLVER | 3.18.0 |
rocSPARSE | 2.2.0 |
rocThrust | 2.15.0 |
rocWMMA | 0.7 |
Tensile | 4.33.0 |
The ROCm v5.2 release consists of the following HIP enhancements:
The HIP Installation Guide is updated to include building HIP tests from source on the AMD and NVIDIA platforms.
For more details, refer to the HIP Installation Guide v5.2.
HIP-Clang now supports device-side malloc. This implementation does not require the use of hipDeviceSetLimit(hipLimitMallocHeapSize,value)
nor respect any setting. The heap is fully dynamic and can grow until the available free memory on the device is consumed.
The test codes at the following link show how to implement applications using malloc and free functions in device kernels:
https://github.com/ROCm-Developer-Tools/HIP/blob/develop/tests/src/deviceLib/hipDeviceMalloc.cpp
The following new HIP APIs are available in the ROCm v5.2 release. Note that this is a pre-official version (beta) release of the new APIs:
The new device management HIP APIs are as follows:
-
Gets a UUID for the device. This API returns a UUID for the device.
hipError_t hipDeviceGetUuid(hipUUID* uuid, hipDevice_t device);
Note
This new API corresponds to the following CUDA API:
CUresult cuDeviceGetUuid(CUuuid* uuid, CUdevice dev);
-
Gets default memory pool of the specified device
hipError_t hipDeviceGetDefaultMemPool(hipMemPool_t* mem_pool, int device);
-
Sets the current memory pool of a device
hipError_t hipDeviceSetMemPool(int device, hipMemPool_t mem_pool);
-
Gets the current memory pool for the specified device
hipError_t hipDeviceGetMemPool(hipMemPool_t* mem_pool, int device);
The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory management are as follows:
-
Allocates memory with stream ordered semantics
hipError_t hipMallocAsync(void** dev_ptr, size_t size, hipStream_t stream);
-
Frees memory with stream ordered semantics
hipError_t hipFreeAsync(void* dev_ptr, hipStream_t stream);
-
Releases freed memory back to the OS
hipError_t hipMemPoolTrimTo(hipMemPool_t mem_pool, size_t min_bytes_to_hold);
-
Sets attributes of a memory pool
hipError_t hipMemPoolSetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value);
-
Gets attributes of a memory pool
hipError_t hipMemPoolGetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value);
-
Controls visibility of the specified pool between devices
hipError_t hipMemPoolSetAccess(hipMemPool_t mem_pool, const hipMemAccessDesc* desc_list, size_t count);
-
Returns the accessibility of a pool from a device
hipError_t hipMemPoolGetAccess(hipMemAccessFlags* flags, hipMemPool_t mem_pool, hipMemLocation* location);
-
Creates a memory pool
hipError_t hipMemPoolCreate(hipMemPool_t* mem_pool, const hipMemPoolProps* pool_props);
-
Destroys the specified memory pool
hipError_t hipMemPoolDestroy(hipMemPool_t mem_pool);
-
Allocates memory from a specified pool with stream ordered semantics
hipError_t hipMallocFromPoolAsync(void** dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream);
-
Exports a memory pool to the requested handle type
hipError_t hipMemPoolExportToShareableHandle( void* shared_handle, hipMemPool_t mem_pool, hipMemAllocationHandleType handle_type, unsigned int flags);
-
Imports a memory pool from a shared handle
hipError_t hipMemPoolImportFromShareableHandle( hipMemPool_t* mem_pool, void* shared_handle, hipMemAllocationHandleType handle_type, unsigned int flags);
-
Exports data to share a memory pool allocation between processes
hipError_t hipMemPoolExportPointer(hipMemPoolPtrExportData* export_data, void* dev_ptr); Import a memory pool allocation from another process.t hipError_t hipMemPoolImportPointer( void** dev_ptr, hipMemPool_t mem_pool, hipMemPoolPtrExportData* export_data);
The new HIP Graph Management APIs are as follows:
-
Enqueues a host function call in a stream
hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData);
-
Swaps the stream capture mode of a thread
hipError_t hipThreadExchangeStreamCaptureMode(hipStreamCaptureMode* mode);
-
Sets a node attribute
hipError_t hipGraphKernelNodeSetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, const hipKernelNodeAttrValue* value);
-
Gets a node attribute
hipError_t hipGraphKernelNodeGetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, hipKernelNodeAttrValue* value);
The new APIs for virtual memory management are as follows:
-
Frees an address range reservation made via hipMemAddressReserve
hipError_t hipMemAddressFree(void* devPtr, size_t size);
-
Reserves an address range
hipError_t hipMemAddressReserve(void** ptr, size_t size, size_t alignment, void* addr, unsigned long long flags);
-
Creates a memory allocation described by the properties and size
hipError_t hipMemCreate(hipMemGenericAllocationHandle_t* handle, size_t size, const hipMemAllocationProp* prop, unsigned long long flags);
-
Exports an allocation to a requested shareable handle type
hipError_t hipMemExportToShareableHandle(void* shareableHandle, hipMemGenericAllocationHandle_t handle, hipMemAllocationHandleType handleType, unsigned long long flags);
-
Gets the access flags set for the given location and ptr
hipError_t hipMemGetAccess(unsigned long long* flags, const hipMemLocation* location, void* ptr);
-
Calculates either the minimal or recommended granularity
hipError_t hipMemGetAllocationGranularity(size_t* granularity, const hipMemAllocationProp* prop, hipMemAllocationGranularity_flags option);
-
Retrieves the property structure of the given handle
hipError_t hipMemGetAllocationPropertiesFromHandle(hipMemAllocationProp* prop, hipMemGenericAllocationHandle_t handle);
-
Imports an allocation from a requested shareable handle type
hipError_t hipMemImportFromShareableHandle(hipMemGenericAllocationHandle_t* handle, void* osHandle, hipMemAllocationHandleType shHandleType);
-
Maps an allocation handle to a reserved virtual address range
hipError_t hipMemMap(void* ptr, size_t size, size_t offset, hipMemGenericAllocationHandle_t handle, unsigned long long flags);
-
Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays
hipError_t hipMemMapArrayAsync(hipArrayMapInfo* mapInfoList, unsigned int count, hipStream_t stream);
-
Release a memory handle representing a memory allocation, that was previously allocated through hipMemCreate
hipError_t hipMemRelease(hipMemGenericAllocationHandle_t handle);
-
Returns the allocation handle of the backing memory allocation given the address
hipError_t hipMemRetainAllocationHandle(hipMemGenericAllocationHandle_t* handle, void* addr);
-
Sets the access flags for each location specified in desc for the given virtual address range
hipError_t hipMemSetAccess(void* ptr, size_t size, const hipMemAccessDesc* desc, size_t count);
-
Unmaps memory allocation of a given address range
hipError_t hipMemUnmap(void* ptr, size_t size);
For more information, refer to the HIP API documentation at https://docs.amd.com/bundle/HIP_API_Guide/page/modules.html
Changes to hipDeviceProp_t
, HIPMEMCPY_3D
, and hipArray
structures (and related HIP APIs) are planned in the next major release. These changes may impact backward compatibility.
Refer to the Release Notes document in subsequent releases for more information. ROCm Math and Communication Libraries
In this release, ROCm Math and Communication Libraries consist of the following enhancements and fixes: New rocWMMA for Matrix Multiplication and Accumulation Operations Acceleration
This release introduces a new ROCm C++ library for accelerating mixed precision matrix multiplication and accumulation (MFMA) operations leveraging specialized GPU matrix cores. rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.
rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed.
For more information, refer to https://docs.amd.com/category/libraries.
The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These are APIs that allow first-party tools to examine the profile and traces for kernels that execute on a device. A tool may register callbacks for data transfer and kernel dispatch entry points. A tool may use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.
Following is an example demonstrating how a tool would use the OMPT target APIs supported. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to follow, and you can run the provided example as indicated below:
cd /opt/rocm/llvm/examples/tools/ompt/veccopy-ompt-target-tracing
make run
The file veccopy-ompt-target-tracing.c
simulates how a tool would initiate device activity tracing. The file callbacks.h
shows the callbacks that may be registered and implemented by the tool.
ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility.
The following is the new filesystem hierarchy:
/opt/rocm-<ver>
| --bin
| --All externally exposed Binaries
| --libexec
| --<component>
| -- Component specific private non-ISA executables (architecture independent)
| --include
| -- <component>
| --<header files>
| --lib
| --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
(public libraries linked with application)
| --<component> (component specific private library, executable data)
| --<cmake>
| --components
| --<component>.config.cmake
| --share
| --html/<component>/*.html
| --info/<component>/*.[pdf, md, txt]
| --man
| --doc
| --<component>
| --<licenses>
| --<component>
| --<misc files> (arch independent non-executable)
| --samples
Note
ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.
For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.
ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
Note
ROCm will continue supporting backward compatibility until the next major release.
Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include
) with a warning message to include files from the new location (/opt/rocm-xxx/include
) as shown in the example below:
// Code snippet from hip_runtime.h
#pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”.
#include "hip/hip_runtime.h"
The wrapper header files’ backward compatibility deprecation is as follows:
#pragma
message announcing deprecation -- ROCm v5.2 release#pragma
message changed to#warning
-- Future release#warning
changed to#error
-- Future release- Backward compatibility wrappers removed -- Future release
Library files are available in the /opt/rocm-xxx/lib
folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib
) has a soft link to the library at the new location.
Example:
$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component>
folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake
) consist of a soft link to the new CMake config.
Example:
$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
In the ROCm v5.2 release, hip-rocclr and hip-base packages (Debian and RPM) are planned for deprecation and will be removed in a future release. hip-runtime-amd and hip-dev(el) will replace these packages respectively. Users of hip-rocclr must install two packages, hip-runtime-amd and hip-dev, to get the same set of packages installed by hip-rocclr previously.
Currently, both package names hip-rocclr (or) hip-runtime-amd and hip-base (or) hip-dev(el) are supported. Deprecation of Integrated HIP Directed Tests
The integrated HIP directed tests, which are currently built by default, are deprecated in this release. The default building and execution support through CMake will be removed in future release.
Fixed Defect | Fix |
---|---|
ROCmInfo does not list gpus | Code fix |
Hang observed while restoring cooperative group samples | Code fix |
ROCM-SMI over SRIOV: Unsupported commands do not return proper error message | Code fix |
This section consists of known issues in this release.
A compiler error occurs when using -O0 flag to compile code for gfx1030 that calls atomicAddNoRet, which is defined in amd_hip_atomic.h. The compiler generates an illegal instruction for gfx1030.
The workaround is not to use the -O0 flag for this case. For higher optimization levels, the compiler does not generate an invalid instruction.
Checkpoint/Restore in Userspace (CRIU) requires 20 MB of VRAM approximately to checkpoint and restore. The CRIU process may freeze if the maximum amount of available VRAM is allocated to checkpoint applications.
To use CRIU to checkpoint and restore your application, limit the amount of VRAM the application uses to ensure at least 20 MB is available.
The compiler may incorrectly compile a program that uses the __shfl_sync(mask, value, srcLane)
function when the "value" parameter to the function is undefined along some path to the function. For most functions, uninitialized inputs cause undefined behavior, but the definition for __shfl_sync
should allow for undefined values.
The workaround is to initialize the parameters to __shfl_sync
.
Note
When the
-Wall
compilation flag is used, the compiler generates a warning indicating the variable is initialized along some path.
Example:
double res = 0.0; // Initialize the input to __shfl_sync.
if (lane == 0) {
res = <some expression>
}
res = __shfl_sync(mask, res, 0);
In recent changes to Clang, insertion of the noundef attribute to all the function arguments has been enabled by default.
In the HIP kernel, variable var in shfl_sync may not be initialized, so LLVM IR treats it as undef.
So, the function argument that is potentially undef (because it is not intialized) has always been assumed to be noundef by LLVM IR (since Clang has inserted noundef attribute). This leads to ambiguous kernel execution.
-
Skip adding
noundef
attribute to functions tagged with convergent attribute. Refer to https://reviews.llvm.org/D124158 for more information. -
Introduce shuffle attribute and add it to
__shfl
like APIs at hip headers. Clang can skip adding noundef attribute, if it finds that argument is tagged with shuffle attribute. Refer to https://reviews.llvm.org/D125378 for more information. -
Introduce clang builtin for
__shfl
to identify it and skip addingnoundef
attribute. -
Introduce
__builtin_freeze
to use on the relevant arguments in library wrappers. The library/header need to insert freezes on the relevant inputs.
There is a known issue with applications that trigger oversubscription. A hardware hang occurs when ROCgdb is used on AMD Instinct™ MI50 and MI100 systems.
This issue is under investigation and will be fixed in a future release.
Library | Version |
---|---|
hipBLAS | 0.50.0 ⇒ 0.51.0 |
hipCUB | 2.11.0 ⇒ 2.11.1 |
hipFFT | 1.0.7 ⇒ 1.0.8 |
hipSOLVER | 1.3.0 ⇒ 1.4.0 |
hipSPARSE | 2.1.0 ⇒ 2.2.0 |
rccl | 2.11.4 |
rocALUTION | 2.0.2 ⇒ 2.0.3 |
rocBLAS | 2.43.0 ⇒ 2.44.0 |
rocFFT | 1.0.16 ⇒ 1.0.17 |
rocPRIM | 2.10.13 ⇒ 2.10.14 |
rocRAND | 2.10.13 ⇒ 2.10.14 |
rocSOLVER | 3.17.0 ⇒ 3.18.0 |
rocSPARSE | 2.1.0 ⇒ 2.2.0 |
rocThrust | 2.14.0 ⇒ 2.15.0 |
rocWMMA | ⇒ 0.7 |
Tensile | 4.32.0 ⇒ 4.33.0 |
hipBLAS 0.51.0 for ROCm 5.2.0
- Packages for test and benchmark executables on all supported OSes using CPack.
- Added File/Folder Reorg Changes with backward compatibility support enabled using ROCM-CMAKE wrapper functions
- Added user-specified initialization option to hipblas-bench
- Fixed version gathering in performance measuring script
hipCUB 2.11.1 for ROCm 5.2.0
- Packages for tests and benchmark executable on all supported OSes using CPack.
hipFFT 1.0.8 for ROCm 5.2.0
- Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.
- Packages for test and benchmark executables on all supported OSes using CPack.
hipSOLVER 1.4.0 for ROCm 5.2.0
- Package generation for test and benchmark executables on all supported OSes using CPack.
- File/Folder Reorg
- Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.
- Fixed the ReadTheDocs documentation generation.
hipSPARSE 2.2.0 for ROCm 5.2.0
- Packages for test and benchmark executables on all supported OSes using CPack.
rocALUTION 2.0.3 for ROCm 5.2.0
- Packages for test and benchmark executables on all supported OSes using CPack.
rocBLAS 2.44.0 for ROCm 5.2.0
- Packages for test and benchmark executables on all supported OSes using CPack.
- Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output vectors of rocBLAS level 1 and 2 functions.
- Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output general matrices of rocBLAS level 2 and 3 functions.
- Added NaN initialization tests to the yaml files of Level 2 rocBLAS batched and strided-batched functions for testing purposes.
- Added memory allocation check to avoid disk swapping during rocblas-test runs by skipping tests.
- Improved performance of non-batched and batched her2 for all sizes and data types.
- Improved performance of non-batched and batched amin for all data types using shuffle reductions.
- Improved performance of non-batched and batched amax for all data types using shuffle reductions.
- Improved performance of trsv for all sizes and data types.
- Modifying gemm_ex for HBH (High-precision F16). The alpha/beta data type remains as F32 without narrowing to F16 and expanding back to F32 in the kernel. This change prevents rounding errors due to alpha/beta conversion in situations where alpha/beta are not exactly represented as an F16.
- Modified non-batched and batched asum, nrm2 functions to use shuffle instruction based reductions.
- For gemm, gemm_ex, gemm_ex2 internal API use rocblas_stride datatype for offset.
- For symm, hemm, syrk, herk, dgmm, geam internal API use rocblas_stride datatype for offset.
- AMD copyright year for all rocBLAS files.
- For gemv (transpose-case), typecasted the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions.
- For function her2 avoid overflow in offset calculation.
- For trsm when alpha == 0 and on host, allow A to be nullptr.
- Fixed memory access issue in trsv.
- Fixed git pre-commit script to update only AMD copyright year.
- Fixed dgmm, geam test functions to set correct stride values.
- For functions ssyr2k and dsyr2k allow trans == rocblas_operation_conjugate_transpose.
- Fixed compilation error for clients-only build.
- Remove Navi12 (gfx1011) from fat binary.
rocFFT 1.0.17 for ROCm 5.2.0
- Packages for test and benchmark executables on all supported OSes using CPack.
- Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.
- Improved reuse of twiddle memory between plans.
- Set a default load/store callback when only one callback type is set via the API for improved performance.
- Introduced a new access pattern of lds (non-linear) and applied it on sbcc kernels len 64 to get performance improvement.
- Fixed plan creation failure in cases where SBCC kernels would need to write to non-unit-stride buffers.
rocPRIM 2.10.14 for ROCm 5.2.0
- Packages for tests and benchmark executable on all supported OSes using CPack.
- Added File/Folder Reorg Changes and Enabled Backward compatibility support using wrapper headers.
rocRAND 2.10.14 for ROCm 5.2.0
- Backward compatibility for deprecated
#include <rocrand.h>
using wrapper header files. - Packages for test and benchmark executables on all supported OSes using CPack.
rocSOLVER 3.18.0 for ROCm 5.2.0
- Partial eigenvalue decomposition routines:
- STEBZ
- STEIN
- Package generation for test and benchmark executables on all supported OSes using CPack.
- Added tests for multi-level logging
- Added tests for rocsolver-bench client
- File/Folder Reorg
- Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.
- Fixed compatibility with libfmt 8.1
rocSPARSE 2.2.0 for ROCm 5.2.0
- batched SpMM for CSR, COO and Blocked ELL formats.
- Packages for test and benchmark executables on all supported OSes using CPack.
- Clients file importers and exporters.
- Clients code size reduction.
- Clients error handling.
- Clients benchmarking for performance tracking.
- Test adjustments due to roundoff errors.
- Fixing API calls compatiblity with rocPRIM.
- none
rocThrust 2.15.0 for ROCm 5.2.0
- Packages for tests and benchmark executable on all supported OSes using CPack.
rocWMMA 0.7 for ROCm 5.2.0
- Added unit tests for DLRM kernels
- Added GEMM sample
- Added DLRM sample
- Added SGEMV sample
- Added unit tests for cooperative wmma load and stores
- Added unit tests for IOBarrier.h
- Added wmma load/ store tests for different matrix types (A, B and Accumulator)
- Added more block sizes 1, 2, 4, 8 to test MmaSyncMultiTest
- Added block sizes 4, 8 to test MmaSynMultiLdsTest
- Added support for wmma load / store layouts with block dimension greater than 64
- Added IOShape structure to define the attributes of mapping and layouts for all wmma matrix types
- Added CI testing for rocWMMA
- Renamed wmma to rocwmma in cmake, header files and documentation
- Renamed library files
- Modified Layout.h to use different matrix offset calculations (base offset, incremental offset and cumulative offset)
- Opaque load/store continue to use incrementatl offsets as they fill the entire block
- Cooperative load/store use cumulative offsets as they fill only small portions for the entire block
- Increased Max split counts to 64 for cooperative load/store
- Moved all the wmma definitions, API headers to rocwmma namespace
- Modified wmma fill unit tests to validate all matrix types (A, B, Accumulator)
Tensile 4.33.0 for ROCm 5.2.0
- TensileUpdateLibrary for updating old library logic files
- Support for TensileRetuneLibrary to use sizes from separate file
- ZGEMM DirectToVgpr/DirectToLds/StoreCInUnroll/MIArchVgpr support
- Tests for denorm correctness
- Option to write different architectures to different TensileLibrary files
- Optimize MessagePackLoadLibraryFile by switching to fread
- DGEMM tail loop optimization for PrefetchAcrossPersistentMode=1/DirectToVgpr
- Alpha/beta datatype remains as F32 for HPA HGEMM
- Force assembly kernels to not flush denorms
- Use hipDeviceAttributePhysicalMultiProcessorCount as multiProcessorCount
- Fix segmentation fault when run i8 datatype with TENSILE_DB=0x80
Library | Version |
---|---|
hipBLAS | 0.50.0 |
hipCUB | 2.11.0 |
hipFFT | 1.0.7 |
hipSOLVER | 1.3.0 |
hipSPARSE | 2.1.0 |
rccl | 2.11.4 |
rocALUTION | 2.0.2 |
rocBLAS | 2.43.0 |
rocFFT | 1.0.16 |
rocPRIM | 2.10.13 |
rocRAND | 2.10.13 |
rocSOLVER | 3.17.0 |
rocSPARSE | 2.1.0 |
rocThrust | 2.14.0 |
Tensile | 4.32.0 |
Library | Version |
---|---|
hipBLAS | 0.50.0 |
hipCUB | 2.11.0 |
hipFFT | 1.0.7 |
hipSOLVER | 1.3.0 |
hipSPARSE | 2.1.0 |
rccl | 2.11.4 |
rocALUTION | 2.0.2 |
rocBLAS | 2.43.0 |
rocFFT | 1.0.16 |
rocPRIM | 2.10.13 |
rocRAND | 2.10.13 |
rocSOLVER | 3.17.0 |
rocSPARSE | 2.1.0 |
rocThrust | 2.14.0 |
Tensile | 4.32.0 |
The ROCm v5.1 release consists of the following HIP enhancements.
The HIP Installation Guide is updated to include installation and building HIP from source on the AMD and NVIDIA platforms.
Refer to the HIP Installation Guide v5.1 for more details.
ROCm v5.1 extends support for HIP Graph.
On ROCm/Linux, to maintain backward compatibility, the hipruntime library (amdhip64) will continue to include hiprtc symbols in future releases. The backward compatible support may be discontinued by removing hiprtc symbols from the hipruntime library (amdhip64) in the next major release.
Changes to the hipDeviceProp_t structure in the next major release may result in backward incompatibility. More details on these changes will be provided in subsequent releases.
The compiler now generates a source-level variable and function argument debug information.
The accuracy is guaranteed if the compiler options -g -O0
are used and apply only to HIP.
This enhancement enables ROCDebugger users to interact with the HIP source-level variables and function arguments.
Note
The newly-suggested compiler -g option must be used instead of the previously-suggested
-ggdb
option. Although the effect of these two options is currently equivalent, this is not guaranteed for the future and might get changed by the upstream LLVM community.
ROCDebugger Machine Interface (MI) extends support to lanes. The following enhancements are made:
-
Added a new -lane-info command, listing the current thread's lanes.
-
The -thread-select command now supports a lane switch to switch to a specific lane of a thread:
-thread-select -l LANE THREAD
-
The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which lane of the thread was selected.
-
The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates which lane is selected, and the latter indicates which lanes explain the stop.
-
MI commands now accept a global --lane option, similar to the global --thread and --frame options.
-
MI varobjs are now lane-aware.
For more information, refer to the ROC Debugger User Guide at https://docs.amd.com.
The clone-inferior command now ensures that the TTY, CMD, ARGS, and AMDGPU PRECISE-MEMORY settings are copied from the original inferior to the new one. All modifications to the environment variables done using the 'set environment' or 'unset environment' commands are also copied to the new inferior.
This release includes support for AMD Radeon™ Pro W6800, in addition to other bug fixes and performance improvements as listed below:
-
MIOpen now supports RDNA GPUs!! (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498)
-
Fixed a correctness issue with ImplicitGemm algorithm
-
Updated the performance data for new kernel versions
-
Improved MIOpen build time by splitting large kernel header files
-
Fixed an issue in reduction kernels for padded tensors
-
Various other bug fixes and performance improvements
For more information, see https://docs.amd.com/bundle/MIOpen_gh-pages/page/releasenotes.html
The new Checkpoint Restore in Userspace (CRIU) functionality is implemented to support AMD GPU and ROCm applications.
CRIU is a userspace tool to Checkpoint and Restore an application.
CRIU lacked the support for checkpoint restore applications that used device files such as a GPU. With this ROCm release, CRIU is enhanced with a new plugin to support AMD GPUs, which includes:
-
Single and Multi GPU systems (Gfx9)
-
Checkpoint / Restore on a different system
-
Checkpoint / Restore inside a docker container
-
PyTorch
-
Tensorflow
-
Using CRIU Image Streamer
For more information, refer to https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu
Note
The CRIU plugin (amdgpu_plugin) is merged upstream with the CRIU repository. The KFD kernel patches are also available upstream with the amd-staging-drm-next branch (public) and the ROCm 5.1 release branch.
Note
This is a Beta release of the Checkpoint and Restore functionality, and some features are not available in this release.
For more information, refer to the following websites:
The following defects are fixed in this release.
The issue with the driver failing to load after ROCm installation is now fixed.
The driver installs successfully, and the server reboots with working rocminfo and clinfo.
Previously, setting a breakpoint in device code by line number before the device code was loaded into the program resulted in ROCgdb incorrectly moving the breakpoint to the first following line that contains host code.
Now, the breakpoint is left pending. When the GPU kernel gets loaded, the breakpoint resolves to a location in the kernel.
Previously, the stale just-written value was presented as a current value.
ROCgdb now invalidates the cached values of registers whose content might differ after being written. For example, registers with read-only bits.
ROCgdb also invalidates all volatile registers when a volatile register is written. For example, writing VCC invalidates the content of STATUS as STATUS.VCCZ may change.
When scheduler-locking is in effect, new wavefronts created by a resumed thread, CPU, or GPU wavefront, are held in the halt state. For example, the "set scheduler-locking" command.
It was possible (although erroneous) for a debugger to load GPU code in memory, send it to the device, start executing a kernel on the device, and dispose of the original code before the kernel had finished execution. If a breakpoint was hit after this point, the debugger failed with an internal error while trying to access the debug information.
This issue is now fixed by ensuring that the debugger keeps a local copy of the original code and debug information.
Issue: Random memory access fault issues are observed while running Math libraries unit tests. This issue is encountered in ROCm v5.0, ROCm v5.0.1, and ROCm v5.0.2.
Note, the faults only occur in the SRIOV environment.
Workaround: Use SDMA to update the page table. The Guest set up steps are as follows:
sudo modprobe amdgpu vm_update_mode=0
To verify, use
Guest:
cat /sys/module/amdgpu/parameters/vm_update_mode 0
Where expectation is 0.
Using CU Masking results in an application freeze or runs exceptionally slowly. This issue is noticed only in the GFX10 suite of products. Note, this issue is observed only in GFX10 suite of products.
This issue is under active investigation at this time.
A defect with Ubuntu images kernel-5.13-30-generic and kernel-5.13-35-generic with Overlay FS results in incorrect reporting of the mount ID.
This issue with Ubuntu causes CRIU checkpointing to fail in Docker containers.
As a workaround, use an older version of the kernel. For example, Ubuntu 5.11.0-46-generic.
Workloads that use the cooperative groups function to ensure all waves can be resident at the same time may fail to restore correctly. This issue is under investigation and will be fixed in a future release.
ROCDebugger is not supported in the SRIOV environment on any GPU.
This is a known issue and will be fixed in a future release.
Random error messages are generated by unsupported functions or commands.
This is a known issue and will be fixed in a future release.
Library | Version |
---|---|
hipBLAS | 0.49.0 ⇒ 0.50.0 |
hipCUB | 2.10.13 ⇒ 2.11.0 |
hipFFT | 1.0.4 ⇒ 1.0.7 |
hipSOLVER | 1.2.0 ⇒ 1.3.0 |
hipSPARSE | 2.0.0 ⇒ 2.1.0 |
rccl | 2.10.3 ⇒ 2.11.4 |
rocALUTION | 2.0.1 ⇒ 2.0.2 |
rocBLAS | 2.42.0 ⇒ 2.43.0 |
rocFFT | 1.0.13 ⇒ 1.0.16 |
rocPRIM | 2.10.12 ⇒ 2.10.13 |
rocRAND | 2.10.12 ⇒ 2.10.13 |
rocSOLVER | 3.16.0 ⇒ 3.17.0 |
rocSPARSE | 2.0.0 ⇒ 2.1.0 |
rocThrust | 2.13.0 ⇒ 2.14.0 |
Tensile | 4.31.0 ⇒ 4.32.0 |
hipBLAS 0.50.0 for ROCm 5.1.0
- Added library version and device information to hipblas-test output
- Added --rocsolver-path command line option to choose path to pre-built rocSOLVER, as absolute or relative path
- Added --cmake_install command line option to update cmake to minimum version if required
- Added cmake-arg parameter to pass in cmake arguments while building
- Added infrastructure to support readthedocs hipBLAS documentation.
- Added hipblasVersionMinor define. hipblaseVersionMinor remains defined for backwards compatibility.
- Doxygen warnings in hipblas.h header file.
- rocblas-path command line option can be specified as either absolute or relative path
- Help message improvements in install.sh and rmake.py
- Updated googletest dependency from 1.10.0 to 1.11.0
hipCUB 2.11.0 for ROCm 5.1.0
- Device segmented sort
- Warp merge sort, WarpMask and thread sort from cub 1.15.0 supported in hipCUB
- Device three way partition
- Device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type.
- This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output).
- And low-res input with high-res output (e.g. float input, double output)
- Block merge sort no longer supports non power of two blocksizes
hipFFT 1.0.7 for ROCm 5.1.0
- Use fft_params struct for accuracy and benchmark clients.
hipSOLVER 1.3.0 for ROCm 5.1.0
- Added functions
- gels
- hipsolverSSgels_bufferSize, hipsolverDDgels_bufferSize, hipsolverCCgels_bufferSize, hipsolverZZgels_bufferSize
- hipsolverSSgels, hipsolverDDgels, hipsolverCCgels, hipsolverZZgels
- gels
- Added library version and device information to hipsolver-test output.
- Added compatibility API with hipsolverDn prefix.
- Added compatibility-only functions
- gesvdj
- hipsolverDnSgesvdj_bufferSize, hipsolverDnDgesvdj_bufferSize, hipsolverDnCgesvdj_bufferSize, hipsolverDnZgesvdj_bufferSize
- hipsolverDnSgesvdj, hipsolverDnDgesvdj, hipsolverDnCgesvdj, hipsolverDnZgesvdj
- gesvdjBatched
- hipsolverDnSgesvdjBatched_bufferSize, hipsolverDnDgesvdjBatched_bufferSize, hipsolverDnCgesvdjBatched_bufferSize, hipsolverDnZgesvdjBatched_bufferSize
- hipsolverDnSgesvdjBatched, hipsolverDnDgesvdjBatched, hipsolverDnCgesvdjBatched, hipsolverDnZgesvdjBatched
- syevj
- hipsolverDnSsyevj_bufferSize, hipsolverDnDsyevj_bufferSize, hipsolverDnCheevj_bufferSize, hipsolverDnZheevj_bufferSize
- hipsolverDnSsyevj, hipsolverDnDsyevj, hipsolverDnCheevj, hipsolverDnZheevj
- syevjBatched
- hipsolverDnSsyevjBatched_bufferSize, hipsolverDnDsyevjBatched_bufferSize, hipsolverDnCheevjBatched_bufferSize, hipsolverDnZheevjBatched_bufferSize
- hipsolverDnSsyevjBatched, hipsolverDnDsyevjBatched, hipsolverDnCheevjBatched, hipsolverDnZheevjBatched
- sygvj
- hipsolverDnSsygvj_bufferSize, hipsolverDnDsygvj_bufferSize, hipsolverDnChegvj_bufferSize, hipsolverDnZhegvj_bufferSize
- hipsolverDnSsygvj, hipsolverDnDsygvj, hipsolverDnChegvj, hipsolverDnZhegvj
- gesvdj
- The rocSOLVER backend now allows hipsolverXXgels and hipsolverXXgesv to be called in-place when B == X.
- The rocSOLVER backend now allows rwork to be passed as a null pointer to hipsolverXgesvd.
- bufferSize functions will now return HIPSOLVER_STATUS_NOT_INITIALIZED instead of HIPSOLVER_STATUS_INVALID_VALUE when both handle and lwork are null.
- Fixed rare memory allocation failure in syevd/heevd and sygvd/hegvd caused by improper workspace array allocation outside of rocSOLVER.
hipSPARSE 2.1.0 for ROCm 5.1.0
- Added gtsv_interleaved_batch and gpsv_interleaved_batch routines
- Add SpGEMM_reuse
- Changed BUILD_CUDA with USE_CUDA in install script and cmake files
- Update googletest to 11.1
- Fixed a bug in SpMM Alg versioning
- none
RCCL 2.11.4 for ROCm 5.1.0
- Compatibility with NCCL 2.11.4
- Managed memory is not currently supported for clique-based kernels
rocALUTION 2.0.2 for ROCm 5.1.0
- Added out-of-place matrix transpose functionality
- Added LocalVector<bool>
rocBLAS 2.43.0 for ROCm 5.1.0
- Option to install script for number of jobs to use for rocBLAS and Tensile compilation (-j, --jobs)
- Option to install script to build clients without using any Fortran (--clients_no_fortran)
- rocblas_client_initialize function, to perform rocBLAS initialize for clients(benchmark/test) and report the execution time.
- Added tests for output of reduction functions when given bad input
- Added user specified initialization (rand_int/trig_float/hpl) for initializing matrices and vectors in rocblas-bench
- Improved performance of trsm with side == left and n == 1
- Improved perforamnce of trsm with side == left and m <= 32 along with side == right and n <= 32
- For syrkx and trmm internal API use rocblas_stride datatype for offset
- For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match
- Test client dependencies updated to GTest 1.11
- non-global false positives reported by cppcheck from file based suppression to inline suppression. File based suppression will only be used for global false positives.
- Help menu messages in install.sh
- For ger function, typecast the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions.
- Modified default initialization from rand_int to hpl for initializing matrices and vectors in rocblas-bench
- For function trmv (non-transposed cases) avoid overflow in offset calculation
- Fixed cppcheck errors/warnings
- Fixed doxygen warnings
rocFFT 1.0.16 for ROCm 5.1.0
- Supported unaligned tile dimension for SBRC_2D kernels.
- Improved (more RAII) test and benchmark infrastructure.
- Enabled runtime compilation of length-2304 FFT kernel during plan creation.
- Optimized more large 1D cases by using L1D_CC plan.
- Optimized 3D 200^3 C2R case.
- Optimized 1D 2^30 double precision on MI200.
- Fixed correctness of some R2C transforms with unusual strides.
- The hipFFT API (header) has been removed from after a long deprecation period. Please use the hipFFT package/repository to obtain the hipFFT API.
rocPRIM 2.10.13 for ROCm 5.1.0
- Fixed radix sort int64_t bug introduced in [2.10.11]
- Future value
- Added device partition_three_way to partition input to three output iterators based on two predicates
- The reduce/scan algorithm precision issues in the tests has been resolved for half types.
- device_segmented_radix_sort unit test failing for HIP on Windows
rocRAND 2.10.13 for ROCm 5.1.0
- Generating a random sequence different sizes now produces the same sequence without gaps
indepent of how many values are generated per call.
- Only in the case of XORWOW, MRG32K3A, PHILOX4X32_10, SOBOL32 and SOBOL64
- This only holds true if the size in each call is a divisor of the distributions
output_width
due to performance - Similarly the output pointer has to be aligned to
output_width * sizeof(output_type)
- hipRAND split into a separate package
- Header file installation location changed to match other libraries.
- Using the
rocrand.h
header file should now use#include <rocrand/rocrand.h>
, rather than#include <rocrand/rocrand.h>
- Using the
- rocRAND still includes hipRAND using a submodule
- The rocRAND package also sets the provides field with hipRAND, so projects which require hipRAND can begin to specify it.
- Fix offset behaviour for XORWOW, MRG32K3A and PHILOX4X32_10 generator, setting offset now
correctly generates the same sequence starting from the offset.
- Only uniform int and float will work as these can be generated with a single call to the generator
- kernel_xorwow unit test is failing for certain GPU architectures.
rocSOLVER 3.17.0 for ROCm 5.1.0
- Optimized non-pivoting and batch cases of the LU factorization
- Fixed missing synchronization in SYTRF with
rocblas_fill_lower
that could potentially result in incorrect pivot values. - Fixed multi-level logging output to file with the
ROCSOLVER_LOG_PATH
,ROCSOLVER_LOG_TRACE_PATH
,ROCSOLVER_LOG_BENCH_PATH
andROCSOLVER_LOG_PROFILE_PATH
environment variables. - Fixed performance regression in the batched LU factorization of tiny matrices
rocSPARSE 2.1.0 for ROCm 5.1.0
- gtsv_interleaved_batch
- gpsv_interleaved_batch
- SpGEMM_reuse
- Allow copying of mat info struct
- Optimization for SDDMM
- Allow unsorted matrices in csrgemm multipass algorithm
- none
rocThrust 2.14.0 for ROCm 5.1.0
- Updated to match upstream Thrust 1.15.0
- async_copy, partition, and stable_sort_by_key unit tests are failing on HIP on Windows.
Tensile 4.32.0 for ROCm 5.1.0
- Better control of parallelism to control memory usage
- Support for multiprocessing on Windows for TensileCreateLibrary
- New JSD metric and metric selection functionality
- Initial changes to support two-tier solution selection
- Optimized runtime of TensileCreateLibraries by reducing max RAM usage
- StoreCInUnroll additional optimizations plus adaptive K support
- DGEMM NN optimizations with PrefetchGlobalRead(PGR)=2 support
- Update Googletest to 1.11.0
- Remove no longer supported benchmarking steps
The following defects are fixed in the ROCm v5.0.2 release.
In ROCm v5.0, when using the “assert()” call in a HIP kernel, the compiler may sometimes fail to emit kernel metadata related to the hostcall facility, which results in incomplete initialization of the hostcall facility in the HIP runtime. This can cause the HIP kernel to crash when it attempts to execute the “assert()” call.
The root cause was an incorrect check in the compiler to determine whether the hostcall facility is required by the kernel. This is fixed in the ROCm v5.0.2 release.
The resolution includes a compiler change, which emits the required metadata by default, unless the compiler can prove that the hostcall facility is not required by the kernel. This ensures that the “assert()” call never fails.
Note: This fix may lead to breakage in some OpenMP offload use cases, which use print inside a target region and result in an abort in device code. The issue will be fixed in a future release. Compatibility Matrix Updates to ROCm Deep Learning Guide
The compatibility matrix in the AMD Deep Learning Guide is updated for ROCm v5.0.2.
Library | Version |
---|---|
hipBLAS | 0.49.0 |
hipCUB | 2.10.13 |
hipFFT | 1.0.4 |
hipSOLVER | 1.2.0 |
hipSPARSE | 2.0.0 |
rccl | 2.10.3 |
rocALUTION | 2.0.1 |
rocBLAS | 2.42.0 |
rocFFT | 1.0.13 |
rocPRIM | 2.10.12 |
rocRAND | 2.10.12 |
rocSOLVER | 3.16.0 |
rocSPARSE | 2.0.0 |
rocThrust | 2.13.0 |
Tensile | 4.31.0 |
In prior ROCm releases, by default, the hipcc/hipconfig Perl scripts were used to identify and set target compiler options, target platform, compiler, and runtime appropriately.
In ROCm v5.0.1, hipcc.bin and hipconfig.bin have been added as the compiled binary implementations of the hipcc and hipconfig. These new binaries are currently a work-in-progress, considered, and marked as experimental. ROCm plans to fully transition to hipcc.bin and hipconfig.bin in the a future ROCm release. The existing hipcc and hipconfig Perl scripts are renamed to hipcc.pl and hipconfig.pl respectively. New top-level hipcc and hipconfig Perl scripts are created, which can switch between the Perl script or the compiled binary based on the environment variable HIPCC_USE_PERL_SCRIPT.
In ROCm 5.0.1, by default, this environment variable is set to use hipcc and hipconfig through the Perl scripts.
Subsequently, Perl scripts will no longer be available in ROCm in a future release.
Library | Version |
---|---|
hipBLAS | 0.49.0 |
hipCUB | 2.10.13 |
hipFFT | 1.0.4 |
hipSOLVER | 1.2.0 |
hipSPARSE | 2.0.0 |
rccl | 2.10.3 |
rocALUTION | 2.0.1 |
rocBLAS | 2.42.0 |
rocFFT | 1.0.13 |
rocPRIM | 2.10.12 |
rocRAND | 2.10.12 |
rocSOLVER | 3.16.0 |
rocSPARSE | 2.0.0 |
rocThrust | 2.13.0 |
Tensile | 4.31.0 |
The ROCm v5.0 release consists of the following HIP enhancements.
The HIP Installation Guide is updated to include building HIP from source on the NVIDIA platform.
Refer to the HIP Installation Guide v5.0 for more details.
Managed memory, including the __managed__
keyword, is now supported in the HIP combined host/device compilation. Through unified memory allocation, managed memory allows data to be shared and accessible to both the CPU and GPU using a single pointer. The allocation is managed by the AMD GPU driver using the Linux Heterogeneous Memory Management (HMM) mechanism. The user can call managed memory API hipMallocManaged to allocate a large chunk of HMM memory, execute kernels on a device, and fetch data between the host and device as needed.
Note
In a HIP application, it is recommended to do a capability check before calling the managed memory APIs. For example,
int managed_memory = 0; HIPCHECK(hipDeviceGetAttribute(&managed_memory, hipDeviceAttributeManagedMemory,p_gpuDevice)); if (!managed_memory ) { printf ("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice); } else { HIPCHECK(hipSetDevice(p_gpuDevice)); HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T))); . . . }
Note
The managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc will fall back to using system memory. Other managed memory API calls will, then, have
Refer to the HIP API documentation for more details on managed memory APIs.
For the application, see
The following new environment variable is added in this release:
Environment Variable | Value | Description |
---|---|---|
HSA_COOP_CU_COUNT | 0 or 1 (default is 0) | Some processors support more CUs than can reliably be used in a cooperative dispatch. Setting the environment variable HSA_COOP_CU_COUNT to 1 will cause ROCr to return the correct CU count for cooperative groups through the HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT attribute of hsa_agent_get_info(). Setting HSA_COOP_CU_COUNT to other values, or leaving it unset, will cause ROCr to return the same CU count for the attributes HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT and HSA_AMD_AGENT_INFO_COMPUTE_UNIT_COUNT. Future ROCm releases will make HSA_COOP_CU_COUNT=1 the default. |
Re-ordering of the enumerated type in hip_runtime_api.h to better match NV. See below for the difference in enumerated types.
ROCm software will be affected if any of the defined enums listed below are used in the code. Applications built with ROCm v5.0 enumerated types will work with a ROCm 4.5.2 driver. However, an undefined behavior error will occur with a ROCm v4.5.2 application that uses these enumerated types with a ROCm 5.0 runtime.
typedef enum hipDeviceAttribute_t {
- hipDeviceAttributeMaxThreadsPerBlock, ///< Maximum number of threads per block.
- hipDeviceAttributeMaxBlockDimX, ///< Maximum x-dimension of a block.
- hipDeviceAttributeMaxBlockDimY, ///< Maximum y-dimension of a block.
- hipDeviceAttributeMaxBlockDimZ, ///< Maximum z-dimension of a block.
- hipDeviceAttributeMaxGridDimX, ///< Maximum x-dimension of a grid.
- hipDeviceAttributeMaxGridDimY, ///< Maximum y-dimension of a grid.
- hipDeviceAttributeMaxGridDimZ, ///< Maximum z-dimension of a grid.
- hipDeviceAttributeMaxSharedMemoryPerBlock, ///< Maximum shared memory available per block in
- ///< bytes.
- hipDeviceAttributeTotalConstantMemory, ///< Constant memory size in bytes.
- hipDeviceAttributeWarpSize, ///< Warp size in threads.
- hipDeviceAttributeMaxRegistersPerBlock, ///< Maximum number of 32-bit registers available to a
- ///< thread block. This number is shared by all thread
- ///< blocks simultaneously resident on a
- ///< multiprocessor.
- hipDeviceAttributeClockRate, ///< Peak clock frequency in kilohertz.
- hipDeviceAttributeMemoryClockRate, ///< Peak memory clock frequency in kilohertz.
- hipDeviceAttributeMemoryBusWidth, ///< Global memory bus width in bits.
- hipDeviceAttributeMultiprocessorCount, ///< Number of multiprocessors on the device.
- hipDeviceAttributeComputeMode, ///< Compute mode that device is currently in.
- hipDeviceAttributeL2CacheSize, ///< Size of L2 cache in bytes. 0 if the device doesn't have L2
- ///< cache.
- hipDeviceAttributeMaxThreadsPerMultiProcessor, ///< Maximum resident threads per
- ///< multiprocessor.
- hipDeviceAttributeComputeCapabilityMajor, ///< Major compute capability version number.
- hipDeviceAttributeComputeCapabilityMinor, ///< Minor compute capability version number.
- hipDeviceAttributeConcurrentKernels, ///< Device can possibly execute multiple kernels
- ///< concurrently.
- hipDeviceAttributePciBusId, ///< PCI Bus ID.
- hipDeviceAttributePciDeviceId, ///< PCI Device ID.
- hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, ///< Maximum Shared Memory Per
- ///< Multiprocessor.
- hipDeviceAttributeIsMultiGpuBoard, ///< Multiple GPU devices.
- hipDeviceAttributeIntegrated, ///< iGPU
- hipDeviceAttributeCooperativeLaunch, ///< Support cooperative launch
- hipDeviceAttributeCooperativeMultiDeviceLaunch, ///< Support cooperative launch on multiple devices
- hipDeviceAttributeMaxTexture1DWidth, ///< Maximum number of elements in 1D images
- hipDeviceAttributeMaxTexture2DWidth, ///< Maximum dimension width of 2D images in image elements
- hipDeviceAttributeMaxTexture2DHeight, ///< Maximum dimension height of 2D images in image elements
- hipDeviceAttributeMaxTexture3DWidth, ///< Maximum dimension width of 3D images in image elements
- hipDeviceAttributeMaxTexture3DHeight, ///< Maximum dimensions height of 3D images in image elements
- hipDeviceAttributeMaxTexture3DDepth, ///< Maximum dimensions depth of 3D images in image elements
+ hipDeviceAttributeCudaCompatibleBegin = 0,
- hipDeviceAttributeHdpMemFlushCntl, ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register
- hipDeviceAttributeHdpRegFlushCntl, ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register
+ hipDeviceAttributeEccEnabled = hipDeviceAttributeCudaCompatibleBegin, ///< Whether ECC support is enabled.
+ hipDeviceAttributeAccessPolicyMaxWindowSize, ///< Cuda only. The maximum size of the window policy in bytes.
+ hipDeviceAttributeAsyncEngineCount, ///< Cuda only. Asynchronous engines number.
+ hipDeviceAttributeCanMapHostMemory, ///< Whether host memory can be mapped into device address space
+ hipDeviceAttributeCanUseHostPointerForRegisteredMem,///< Cuda only. Device can access host registered memory
+ ///< at the same virtual address as the CPU
+ hipDeviceAttributeClockRate, ///< Peak clock frequency in kilohertz.
+ hipDeviceAttributeComputeMode, ///< Compute mode that device is currently in.
+ hipDeviceAttributeComputePreemptionSupported, ///< Cuda only. Device supports Compute Preemption.
+ hipDeviceAttributeConcurrentKernels, ///< Device can possibly execute multiple kernels concurrently.
+ hipDeviceAttributeConcurrentManagedAccess, ///< Device can coherently access managed memory concurrently with the CPU
+ hipDeviceAttributeCooperativeLaunch, ///< Support cooperative launch
+ hipDeviceAttributeCooperativeMultiDeviceLaunch, ///< Support cooperative launch on multiple devices
+ hipDeviceAttributeDeviceOverlap, ///< Cuda only. Device can concurrently copy memory and execute a kernel.
+ ///< Deprecated. Use instead asyncEngineCount.
+ hipDeviceAttributeDirectManagedMemAccessFromHost, ///< Host can directly access managed memory on
+ ///< the device without migration
+ hipDeviceAttributeGlobalL1CacheSupported, ///< Cuda only. Device supports caching globals in L1
+ hipDeviceAttributeHostNativeAtomicSupported, ///< Cuda only. Link between the device and the host supports native atomic operations
+ hipDeviceAttributeIntegrated, ///< Device is integrated GPU
+ hipDeviceAttributeIsMultiGpuBoard, ///< Multiple GPU devices.
+ hipDeviceAttributeKernelExecTimeout, ///< Run time limit for kernels executed on the device
+ hipDeviceAttributeL2CacheSize, ///< Size of L2 cache in bytes. 0 if the device doesn't have L2 cache.
+ hipDeviceAttributeLocalL1CacheSupported, ///< caching locals in L1 is supported
+ hipDeviceAttributeLuid, ///< Cuda only. 8-byte locally unique identifier in 8 bytes. Undefined on TCC and non-Windows platforms
+ hipDeviceAttributeLuidDeviceNodeMask, ///< Cuda only. Luid device node mask. Undefined on TCC and non-Windows platforms
+ hipDeviceAttributeComputeCapabilityMajor, ///< Major compute capability version number.
+ hipDeviceAttributeManagedMemory, ///< Device supports allocating managed memory on this system
+ hipDeviceAttributeMaxBlocksPerMultiProcessor, ///< Cuda only. Max block size per multiprocessor
+ hipDeviceAttributeMaxBlockDimX, ///< Max block size in width.
+ hipDeviceAttributeMaxBlockDimY, ///< Max block size in height.
+ hipDeviceAttributeMaxBlockDimZ, ///< Max block size in depth.
+ hipDeviceAttributeMaxGridDimX, ///< Max grid size in width.
+ hipDeviceAttributeMaxGridDimY, ///< Max grid size in height.
+ hipDeviceAttributeMaxGridDimZ, ///< Max grid size in depth.
+ hipDeviceAttributeMaxSurface1D, ///< Maximum size of 1D surface.
+ hipDeviceAttributeMaxSurface1DLayered, ///< Cuda only. Maximum dimensions of 1D layered surface.
+ hipDeviceAttributeMaxSurface2D, ///< Maximum dimension (width, height) of 2D surface.
+ hipDeviceAttributeMaxSurface2DLayered, ///< Cuda only. Maximum dimensions of 2D layered surface.
+ hipDeviceAttributeMaxSurface3D, ///< Maximum dimension (width, height, depth) of 3D surface.
+ hipDeviceAttributeMaxSurfaceCubemap, ///< Cuda only. Maximum dimensions of Cubemap surface.
+ hipDeviceAttributeMaxSurfaceCubemapLayered, ///< Cuda only. Maximum dimension of Cubemap layered surface.
+ hipDeviceAttributeMaxTexture1DWidth, ///< Maximum size of 1D texture.
+ hipDeviceAttributeMaxTexture1DLayered, ///< Cuda only. Maximum dimensions of 1D layered texture.
+ hipDeviceAttributeMaxTexture1DLinear, ///< Maximum number of elements allocatable in a 1D linear texture.
+ ///< Use cudaDeviceGetTexture1DLinearMaxWidth() instead on Cuda.
+ hipDeviceAttributeMaxTexture1DMipmap, ///< Cuda only. Maximum size of 1D mipmapped texture.
+ hipDeviceAttributeMaxTexture2DWidth, ///< Maximum dimension width of 2D texture.
+ hipDeviceAttributeMaxTexture2DHeight, ///< Maximum dimension hight of 2D texture.
+ hipDeviceAttributeMaxTexture2DGather, ///< Cuda only. Maximum dimensions of 2D texture if gather operations performed.
+ hipDeviceAttributeMaxTexture2DLayered, ///< Cuda only. Maximum dimensions of 2D layered texture.
+ hipDeviceAttributeMaxTexture2DLinear, ///< Cuda only. Maximum dimensions (width, height, pitch) of 2D textures bound to pitched memory.
+ hipDeviceAttributeMaxTexture2DMipmap, ///< Cuda only. Maximum dimensions of 2D mipmapped texture.
+ hipDeviceAttributeMaxTexture3DWidth, ///< Maximum dimension width of 3D texture.
+ hipDeviceAttributeMaxTexture3DHeight, ///< Maximum dimension height of 3D texture.
+ hipDeviceAttributeMaxTexture3DDepth, ///< Maximum dimension depth of 3D texture.
+ hipDeviceAttributeMaxTexture3DAlt, ///< Cuda only. Maximum dimensions of alternate 3D texture.
+ hipDeviceAttributeMaxTextureCubemap, ///< Cuda only. Maximum dimensions of Cubemap texture
+ hipDeviceAttributeMaxTextureCubemapLayered, ///< Cuda only. Maximum dimensions of Cubemap layered texture.
+ hipDeviceAttributeMaxThreadsDim, ///< Maximum dimension of a block
+ hipDeviceAttributeMaxThreadsPerBlock, ///< Maximum number of threads per block.
+ hipDeviceAttributeMaxThreadsPerMultiProcessor, ///< Maximum resident threads per multiprocessor.
+ hipDeviceAttributeMaxPitch, ///< Maximum pitch in bytes allowed by memory copies
+ hipDeviceAttributeMemoryBusWidth, ///< Global memory bus width in bits.
+ hipDeviceAttributeMemoryClockRate, ///< Peak memory clock frequency in kilohertz.
+ hipDeviceAttributeComputeCapabilityMinor, ///< Minor compute capability version number.
+ hipDeviceAttributeMultiGpuBoardGroupID, ///< Cuda only. Unique ID of device group on the same multi-GPU board
+ hipDeviceAttributeMultiprocessorCount, ///< Number of multiprocessors on the device.
+ hipDeviceAttributeName, ///< Device name.
+ hipDeviceAttributePageableMemoryAccess, ///< Device supports coherently accessing pageable memory
+ ///< without calling hipHostRegister on it
+ hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via the host's page tables
+ hipDeviceAttributePciBusId, ///< PCI Bus ID.
+ hipDeviceAttributePciDeviceId, ///< PCI Device ID.
+ hipDeviceAttributePciDomainID, ///< PCI Domain ID.
+ hipDeviceAttributePersistingL2CacheMaxSize, ///< Cuda11 only. Maximum l2 persisting lines capacity in bytes
+ hipDeviceAttributeMaxRegistersPerBlock, ///< 32-bit registers available to a thread block. This number is shared
+ ///< by all thread blocks simultaneously resident on a multiprocessor.
+ hipDeviceAttributeMaxRegistersPerMultiprocessor, ///< 32-bit registers available per block.
+ hipDeviceAttributeReservedSharedMemPerBlock, ///< Cuda11 only. Shared memory reserved by CUDA driver per block.
+ hipDeviceAttributeMaxSharedMemoryPerBlock, ///< Maximum shared memory available per block in bytes.
+ hipDeviceAttributeSharedMemPerBlockOptin, ///< Cuda only. Maximum shared memory per block usable by special opt in.
+ hipDeviceAttributeSharedMemPerMultiprocessor, ///< Cuda only. Shared memory available per multiprocessor.
+ hipDeviceAttributeSingleToDoublePrecisionPerfRatio, ///< Cuda only. Performance ratio of single precision to double precision.
+ hipDeviceAttributeStreamPrioritiesSupported, ///< Cuda only. Whether to support stream priorities.
+ hipDeviceAttributeSurfaceAlignment, ///< Cuda only. Alignment requirement for surfaces
+ hipDeviceAttributeTccDriver, ///< Cuda only. Whether device is a Tesla device using TCC driver
+ hipDeviceAttributeTextureAlignment, ///< Alignment requirement for textures
+ hipDeviceAttributeTexturePitchAlignment, ///< Pitch alignment requirement for 2D texture references bound to pitched memory;
+ hipDeviceAttributeTotalConstantMemory, ///< Constant memory size in bytes.
+ hipDeviceAttributeTotalGlobalMem, ///< Global memory available on devicice.
+ hipDeviceAttributeUnifiedAddressing, ///< Cuda only. An unified address space shared with the host.
+ hipDeviceAttributeUuid, ///< Cuda only. Unique ID in 16 byte.
+ hipDeviceAttributeWarpSize, ///< Warp size in threads.
- hipDeviceAttributeMaxPitch, ///< Maximum pitch in bytes allowed by memory copies
- hipDeviceAttributeTextureAlignment, ///<Alignment requirement for textures
- hipDeviceAttributeTexturePitchAlignment, ///<Pitch alignment requirement for 2D texture references bound to pitched memory;
- hipDeviceAttributeKernelExecTimeout, ///<Run time limit for kernels executed on the device
- hipDeviceAttributeCanMapHostMemory, ///<Device can map host memory into device address space
- hipDeviceAttributeEccEnabled, ///<Device has ECC support enabled
+ hipDeviceAttributeCudaCompatibleEnd = 9999,
+ hipDeviceAttributeAmdSpecificBegin = 10000,
- hipDeviceAttributeCooperativeMultiDeviceUnmatchedFunc, ///< Supports cooperative launch on multiple
- ///devices with unmatched functions
- hipDeviceAttributeCooperativeMultiDeviceUnmatchedGridDim, ///< Supports cooperative launch on multiple
- ///devices with unmatched grid dimensions
- hipDeviceAttributeCooperativeMultiDeviceUnmatchedBlockDim, ///< Supports cooperative launch on multiple
- ///devices with unmatched block dimensions
- hipDeviceAttributeCooperativeMultiDeviceUnmatchedSharedMem, ///< Supports cooperative launch on multiple
- ///devices with unmatched shared memories
- hipDeviceAttributeAsicRevision, ///< Revision of the GPU in this device
- hipDeviceAttributeManagedMemory, ///< Device supports allocating managed memory on this system
- hipDeviceAttributeDirectManagedMemAccessFromHost, ///< Host can directly access managed memory on
- /// the device without migration
- hipDeviceAttributeConcurrentManagedAccess, ///< Device can coherently access managed memory
- /// concurrently with the CPU
- hipDeviceAttributePageableMemoryAccess, ///< Device supports coherently accessing pageable memory
- /// without calling hipHostRegister on it
- hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via
- /// the host's page tables
- hipDeviceAttributeCanUseStreamWaitValue ///< '1' if Device supports hipStreamWaitValue32() and
- ///< hipStreamWaitValue64() , '0' otherwise.
+ hipDeviceAttributeClockInstructionRate = hipDeviceAttributeAmdSpecificBegin, ///< Frequency in khz of the timer used by the device-side "clock*"
+ hipDeviceAttributeArch, ///< Device architecture
+ hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, ///< Maximum Shared Memory PerMultiprocessor.
+ hipDeviceAttributeGcnArch, ///< Device gcn architecture
+ hipDeviceAttributeGcnArchName, ///< Device gcnArch name in 256 bytes
+ hipDeviceAttributeHdpMemFlushCntl, ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register
+ hipDeviceAttributeHdpRegFlushCntl, ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register
+ hipDeviceAttributeCooperativeMultiDeviceUnmatchedFunc, ///< Supports cooperative launch on multiple
+ ///< devices with unmatched functions
+ hipDeviceAttributeCooperativeMultiDeviceUnmatchedGridDim, ///< Supports cooperative launch on multiple
+ ///< devices with unmatched grid dimensions
+ hipDeviceAttributeCooperativeMultiDeviceUnmatchedBlockDim, ///< Supports cooperative launch on multiple
+ ///< devices with unmatched block dimensions
+ hipDeviceAttributeCooperativeMultiDeviceUnmatchedSharedMem, ///< Supports cooperative launch on multiple
+ ///< devices with unmatched shared memories
+ hipDeviceAttributeIsLargeBar, ///< Whether it is LargeBar
+ hipDeviceAttributeAsicRevision, ///< Revision of the GPU in this device
+ hipDeviceAttributeCanUseStreamWaitValue, ///< '1' if Device supports hipStreamWaitValue32() and
+ ///< hipStreamWaitValue64() , '0' otherwise.
+ hipDeviceAttributeAmdSpecificEnd = 19999,
+ hipDeviceAttributeVendorSpecificBegin = 20000,
+ // Extended attributes for vendors
} hipDeviceAttribute_t;
enum hipComputeMode {
The AMDVBFlash tool, used for flashing the VBIOS image to dGPU, does not communicate with the ROM Controller specifically when the driver is present. This is because the driver, as part of its runtime power management feature, puts the dGPU to a sleep state.
As a workaround, users can run amdgpu.runpm=0, which temporarily disables the runtime power management feature from the driver and dynamically changes some power control-related sysfs files.
Users may encounter an issue with the enabled timestamp functionality for monitoring one or multiple counters. ROCProfiler outputs the following four timestamps for each kernel:
-
Dispatch
-
Start
-
End
-
Complete
This defect is related to the Start timestamp functionality, which incorrectly shows an earlier time than the Dispatch timestamp.
To reproduce the issue,
-
Enable timing using the --timestamp on flag.
-
Use the -i option with the input filename that contains the name of the counter(s) to monitor.
-
Run the program.
-
Check the output result file.
BeginNS is lower than DispatchNS, which is incorrect.
The correct order is:
Dispatch < Start < End < Complete
Users cannot use ROCProfiler to measure the time spent on each kernel because of the incorrect timestamp with counter collection enabled.
Users are recommended to collect kernel execution timestamps without monitoring counters, as follows:
-
Enable timing using the --timestamp on flag, and run the application.
-
Rerun the application using the -i option with the input filename that contains the name of the counter(s) to monitor, and save this to a different output file using the -o flag.
-
Check the output result file from step 1.
-
The order of timestamps correctly displays as: DispathNS < BeginNS < EndNS < CompleteNS
-
Users can find the values of the collected counters in the output file generated in step 2.
System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV environment on any GPU. For more information, refer to the Systems Management Interface documentation.
-
The hipFFT.h header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get hipFFT.h in the rocFFT package too.
-
The GlobalPairwiseAMG class is now entirely removed, users should use the PairwiseAMG class instead.
-
The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0, rocsparse_spmm_ex is still present, but deprecated. Signature diff for rocsparse_spmm rocsparse_spmm in 5.0
rocsparse_status rocsparse_spmm(rocsparse_handle handle, rocsparse_operation trans_A, rocsparse_operation trans_B, const void* alpha, const rocsparse_spmat_descr mat_A, const rocsparse_dnmat_descr mat_B, const void* beta, const rocsparse_dnmat_descr mat_C, rocsparse_datatype compute_type, rocsparse_spmm_alg alg, rocsparse_spmm_stage stage, size_t* buffer_size, void* temp_buffer);
rocSPARSE_spmm in 4.0
rocsparse_status rocsparse_spmm(rocsparse_handle handle, rocsparse_operation trans_A, rocsparse_operation trans_B, const void* alpha, const rocsparse_spmat_descr mat_A, const rocsparse_dnmat_descr mat_B, const void* beta, const rocsparse_dnmat_descr mat_C, rocsparse_datatype compute_type, rocsparse_spmm_alg alg, size_t* buffer_size, void* temp_buffer);
In this release, arithmetic operators of HIP complex and vector types are deprecated.
-
As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of
std::complex
types. -
As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native clang vector type associated with the data member of HIP vector types.
During the deprecation, two macros _HIP_ENABLE_COMPLEX_OPERATORS
and _HIP_ENABLE_VECTOR_OPERATORS
are provided to allow users to conditionally enable arithmetic operators of HIP complex or vector types.
Note, the two macros are mutually exclusive and, by default, set to Off.
The arithmetic operators of HIP complex and vector types will be removed in a future release.
Refer to the HIP API Guide for more information.
Support for loading compiler-generated code object version 4 will be deprecated in a future release with no release announcement and replaced with code object 5 as the default version.
The current default is code object version 4.
MIOpenTensile will be deprecated in a future release.
Library | Version |
---|---|
hipBLAS | ⇒ 0.49.0 |
hipCUB | ⇒ 2.10.13 |
hipFFT | ⇒ 1.0.4 |
hipSOLVER | ⇒ 1.2.0 |
hipSPARSE | ⇒ 2.0.0 |
rccl | ⇒ 2.10.3 |
rocALUTION | ⇒ 2.0.1 |
rocBLAS | ⇒ 2.42.0 |
rocFFT | ⇒ 1.0.13 |
rocPRIM | ⇒ 2.10.12 |
rocRAND | ⇒ 2.10.12 |
rocSOLVER | ⇒ 3.16.0 |
rocSPARSE | ⇒ 2.0.0 |
rocThrust | ⇒ 2.13.0 |
Tensile | ⇒ 4.31.0 |
hipBLAS 0.49.0 for ROCm 5.0.0
- Added rocSOLVER functions to hipblas-bench
- Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex
- Added compilation warning for future trmm changes
- Added documentation to hipblas.h
- Added option to forgo pivoting for getrf and getri when ipiv is nullptr
- Added code coverage option
- Fixed use of incorrect 'HIP_PATH' when building from source.
- Fixed windows packaging
- Allowing negative increments in hipblas-bench
- Removed boost dependency
hipCUB 2.10.13 for ROCm 5.0.0
- Added missing includes to hipcub.hpp
- Bfloat16 support to test cases (device_reduce & device_radix_sort)
- Device merge sort
- Block merge sort
- API update to CUB 1.14.0
- The SetupNVCC.cmake automatic target selector select all of the capabalities of all available card for NVIDIA backend.
hipFFT 1.0.4 for ROCm 5.0.0
- Add calls to rocFFT setup/cleanup.
- Cmake fixes for clients and backend support.
- Added support for Windows 10 as a build target.
hipSOLVER 1.2.0 for ROCm 5.0.0
- Added functions
- sytrf
- hipsolverSsytrf_bufferSize, hipsolverDsytrf_bufferSize, hipsolverCsytrf_bufferSize, hipsolverZsytrf_bufferSize
- hipsolverSsytrf, hipsolverDsytrf, hipsolverCsytrf, hipsolverZsytrf
- sytrf
- Fixed use of incorrect
HIP_PATH
when building from source (#40). Thanks @jakub329homola!
hipSPARSE 2.0.0 for ROCm 5.0.0
- Added (conjugate) transpose support for csrmv, hybmv and spmv routines
RCCL 2.10.3 for ROCm 5.0.0
- Compatibility with NCCL 2.10.3
- Managed memory is not currently supported for clique-based kernels
rocALUTION 2.0.1 for ROCm 5.0.0
- Removed deprecated GlobalPairwiseAMG class, please use PairwiseAMG instead.
- Changed to C++ 14 Standard
- Added sanitizer option
- Improved documentation
rocBLAS 2.42.0 for ROCm 5.0.0
- Added rocblas_get_version_string_size convenience function
- Added rocblas_xtrmm_outofplace, an out-of-place version of rocblas_xtrmm
- Added hpl and trig initialization for gemm_ex to rocblas-bench
- Added source code gemm. It can be used as an alternative to Tensile for debugging and development
- Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex
- Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU.
- Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU.
- Instantiate templated rocBLAS functions to reduce size of librocblas.so
- Removed static library dependency on msgpack
- Removed boost dependencies for clients
- Option to install script to build only rocBLAS clients with a pre-built rocBLAS library
- Correctly set output of nrm2_batched_ex and nrm2_strided_batched_ex when given bad input
- Fix for dgmm with side == rocblas_side_left and a negative incx
- Fixed out-of-bounds read for small trsm
- Fixed numerical checking for tbmv_strided_batched
rocFFT 1.0.13 for ROCm 5.0.0
- Improved many plans by removing unnecessary transpose steps.
- Optimized scheme selection for 3D problems.
- Imposed less restrictions on 3D_BLOCK_RC selection. More problems can use 3D_BLOCK_RC and have some performance gain.
- Enabled 3D_RC. Some 3D problems with SBCC-supported z-dim can use less kernels and get benefit.
- Force --length 336 336 56 (dp) use faster 3D_RC to avoid it from being skipped by conservative threshold test.
- Optimized some even-length R2C/C2R cases by doing more operations in-place and combining pre/post processing into Stockham kernels.
- Added radix-17.
- Added new kernel generator for select fused-2D transforms.
- Improved large 1D transform decompositions.
rocPRIM 2.10.12 for ROCm 5.0.0
- Enable bfloat16 tests and reduce threshold for bfloat16
- Fix device scan limit_size feature
- Non-optimized builds no longer trigger local memory limit errors
- Added scan size limit feature
- Added reduce size limit feature
- Added transform size limit feature
- Add block_load_striped and block_store_striped
- Add gather_to_blocked to gather values from other threads into a blocked arrangement
- The block sizes for device merge sorts initial block sort and its merge steps are now separate in its kernel config
- the block sort step supports multiple items per thread
- size_limit for scan, reduce and transform can now be set in the config struct instead of a parameter
- Device_scan and device_segmented_scan:
inclusive_scan
now uses the input-type as accumulator-type,exclusive_scan
uses initial-value-type.- This particularly changes behaviour of small-size input types with large-size output types (e.g.
short
input,int
output). - And low-res input with high-res output (e.g.
float
input,double
output)
- This particularly changes behaviour of small-size input types with large-size output types (e.g.
- Revert old Fiji workaround, because they solved the issue at compiler side
- Update README cmake minimum version number
- Block sort support multiple items per thread
- currently only powers of two block sizes, and items per threads are supported and only for full blocks
- Bumped the minimum required version of CMake to 3.16
- Unit tests may soft hang on MI200 when running in hipMallocManaged mode.
- device_segmented_radix_sort, device_scan unit tests failing for HIP on Windows
- ReduceEmptyInput cause random faulire with bfloat16
rocRAND 2.10.12 for ROCm 5.0.0
- No updates or changes for ROCm 5.0.0.
rocSOLVER 3.16.0 for ROCm 5.0.0
- Symmetric matrix factorizations:
- LASYF
- SYTF2, SYTRF (with batched and strided_batched versions)
- Added
rocsolver_get_version_string_size
to help with version string queries - Added
rocblas_layer_mode_ex
and the ability to print kernel calls in the trace and profile logs - Expanded batched and strided_batched sample programs.
- Improved general performance of LU factorization
- Increased parallelism of specialized kernels when compiling from source, reducing build times on multi-core systems.
- The rocsolver-test client now prints the rocSOLVER version used to run the tests, rather than the version used to build them
- The rocsolver-bench client now prints the rocSOLVER version used in the benchmark
- Added missing stdint.h include to rocsolver.h
rocSPARSE 2.0.0 for ROCm 5.0.0
- csrmv, coomv, ellmv, hybmv for (conjugate) transposed matrices
- csrmv for symmetric matrices
- spmm_ex is now deprecated and will be removed in the next major release
- Optimization for gtsv
rocThrust 2.13.0 for ROCm 5.0.0
- Updated to match upstream Thrust 1.13.0
- Updated to match upstream Thrust 1.14.0
- Added async scan
- Scan algorithms:
inclusive_scan
now uses the input-type as accumulator-type,exclusive_scan
uses initial-value-type.- This particularly changes behaviour of small-size input types with large-size output types (e.g.
short
input,int
output). - And low-res input with high-res output (e.g.
float
input,double
output)
- This particularly changes behaviour of small-size input types with large-size output types (e.g.
Tensile 4.31.0 for ROCm 5.0.0
- DirectToLds support (x2/x4)
- DirectToVgpr support for DGEMM
- Parameter to control number of files kernels are merged into to better parallelize kernel compilation
- FP16 alternate implementation for HPA HGEMM on aldebaran
- Add DGEMM NN custom kernel for HPL on aldebaran
- Update tensile_client executable to std=c++14
- Remove unused old Tensile client code
- Fix hipErrorInvalidHandle during benchmarks
- Fix addrVgpr for atomic GSU
- Fix for Python 3.8: add case for Constant nodeType
- Fix architecture mapping for gfx1011 and gfx1012
- Fix PrintSolutionRejectionReason verbiage in KernelWriter.py
- Fix vgpr alignment problem when enabling flat buffer load