| 2692e278 | 08-Jul-2013 |
Paul Mullowney <paulm@txcorp.com> |
Adding PREPROCESSOR directives to protect ELL and HYB storage formats.
I've added preprocessor directives around all code using the cusparse hybrid (or ellpack) format to only build when CUDA 4.2 or beyond is being used. I've also changed the documentation in a few places to reflect this. In a few places, protections were required for CUDA 5.0 (hyb2csr conversion and in the stream creation in veccusp.cu).
Also adding code to init.c that 1) checks CUDA error codes and 2) sets the device flags so that memory can be registered as page-locked via cudaSetDeviceFlags(cudaDeviceMapHost). This should be valid for all 1.3 devices and later. Moreover, these changes allow multiple MPI threads to work on 1 GPU using CUDA streams in a thread-safe manner.
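The version guards described in this commit might look roughly like the following sketch. It assumes the standard CUDA_VERSION macro from cuda.h (4020 for CUDA 4.2, 5000 for CUDA 5.0); the fallback definition and the two function names are hypothetical and exist only so the fragment compiles without the CUDA toolkit. It is not the code from the commit.

```c
/* Stand-in so the sketch builds without cuda.h; a real build would
   take CUDA_VERSION from the CUDA headers instead. */
#ifndef CUDA_VERSION
#define CUDA_VERSION 4020
#endif

/* Hypothetical helper: reports whether ELL/HYB code paths are compiled in. */
static int hyb_ell_formats_available(void)
{
#if CUDA_VERSION >= 4020
  return 1; /* CUDA 4.2 or later: HYB/ELL code is built */
#else
  return 0; /* older toolkit: HYB/ELL code is compiled out */
#endif
}

/* Hypothetical helper: reports whether the hyb2csr path is compiled in. */
static int hyb2csr_conversion_available(void)
{
#if CUDA_VERSION >= 5000
  return 1; /* CUDA 5.0 or later */
#else
  return 0;
#endif
}
```

With the stand-in value of 4020, the first helper returns 1 and the second returns 0.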
|
| b06137fd | 27-Jun-2013 |
Paul Mullowney <paulm@txcorp.com> |
Removing TXPETSCGPU from veccusp and mpiaijcusparse
In this next step of removing TXPETSCGPU, the host-device and device-host messaging code has been significantly simplified. In particular, all methods VecCUSPCopyToGPU/FromGPU now use a cudaMemcpyAsync with a stream (and a stream synchronize()). This never hurts you. Moreover, it can help you in the case of the multi-GPU SpMV, as this data transfer will overlap with the MatMult kernel. The more significant change comes in VecCUSPCopyToGPUSome and VecCUSPCopyFromGPUSome. In this code, the data transfer now moves the smallest contiguous set of vector data containing ALL the indices in a single asynchronous data transfer. Then, the stream containing the data transfer is synchronized (not the entire device). While this can be wasteful in terms of messaging too much data, it has shown the best scalability performance across a wide range of matrices. Lastly, the simplicity of the code is a significant advantage over the old way of doing the data transfer. Some old code in these methods is "if 0"-ed out for reference and will be cleaned up later. One final optimization in the vector code involves registering the host buffer as page-locked, which is done in VecCUSPAllocateCheck. Then, the buffer must be unregistered at VecDestroy_SeqCUSP. This shows a nice speedup in the data transfer for a parallel MatMult.
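The "smallest contiguous set containing ALL the indices" idea can be sketched in plain C as below. The IndexSpan type and the smallest_covering_span function are hypothetical names chosen for illustration, not PETSc API; the sketch only computes the range, which a real implementation would then hand to one asynchronous copy followed by a stream synchronize.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: a contiguous block of vector entries. */
typedef struct {
  size_t first; /* lowest index that must be transferred */
  size_t count; /* number of contiguous entries starting at first */
} IndexSpan;

/* Returns the smallest contiguous span covering every requested index.
   The indices need not be sorted; n must be at least 1. */
static IndexSpan smallest_covering_span(const size_t *indices, size_t n)
{
  assert(indices != NULL && n > 0);
  size_t lo = indices[0];
  size_t hi = indices[0];
  for (size_t i = 1; i < n; ++i) {
    if (indices[i] < lo) lo = indices[i];
    if (indices[i] > hi) hi = indices[i];
  }
  IndexSpan span = { lo, hi - lo + 1 };
  return span;
}
```

For the indices {7, 3, 12, 5}, the span starts at 3 and covers 10 entries, so 6 entries are moved that were not requested. That is the extra messaging cost the commit message accepts in exchange for a single transfer.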
Also in this commit, I am removing the TXPETSCGPU dependence from the mpiaijcusparse class; it now depends only on CUDA. In order for the same stream to be used in the MatMult and MatMultAdd (necessary for an optimal multi-GPU SpMV), the stream is built in the mpiaijcusparse and then passed in the seqaijcusparse data structure via a new method (MatCUSPARSESetStream). A similar method is added for the CUSPARSE library handle (context), as I think the stream needs to be attached to a particular context to work properly. When running in parallel on multiple GPUs, the references to the handle in the seqaijcusparse are cleared from the mpiaijcusparse classes with the method MatCUSPARSEClearHandle. Then, the mpiaijcusparse class deletes the handle.
One other non-trivial change was made to the seqaijcusparse. The alpha and beta parameters to the SpMV are now device data which is owned by the Mat_SEQAIJCUSPARSEMultStruct structure. This enables slightly better multi-GPU performance as this data does not need to be copied to the GPU at each kernel launch.
Multi-GPU SpMV now works without TXPETSCGPU and the performance is recovered as tested on up to 4 GPUs. Code is valgrind clean and cuda-memcheck clean.
Results of tests have been modified to have 1 less digit of precision. This yields consistent results across different GPUs. Lastly, the parallel test is set to run on a different matrix (shallow_water1) so that the iteration actually converges.
|
| bc3f50f2 | 01-Apr-2013 |
Paul Mullowney <paulm@txcorp.com> |
aijcusparse : fixed MatGetFactor and other small issues in this class
Fixed MatGetFactor_seqaij_cusparse to have a more standard set of function calls (similar to aij/seq/umfpack or superlu) for setting up the factorization. In particular, I replaced the scoping call to MatGetFactor_seqaij_petsc with the sequence MatCreate, MatSetSizes, MatSetType, and MatXXXSetPreallocation. With these changes, all tests that use the aijcusparse class pass in optimized and debug builds. Moreover, all memory leaks have been removed.
Additional small fixes to this class include the removal of unnecessary PETSC_CUDA_EXTERN_C_BEGIN/END and poor use of PETSC_COMM_WORLD in this file. Lastly, a few missing error checks around several PETSc API method calls for symmetry/hermitian tests were added.
|
| b175d8bb | 07-Apr-2013 |
Paul Mullowney <paulm@txcorp.com> |
MatSeqAIJCUSPARSE: white space, style issues, static and extern functions
Conflicts: src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu
[Jed] Touch-up to conform to style guide |