| 90d2215b | 12-Jan-2021 |
Hong Zhang <hongzhang@anl.gov> |
Add the load-balancing kernel for MatMultAdd_SeqSELL and fine tune the heuristic
Kernel7 is significantly slower than kernel9x in the following two cases:
- nrows is too small. Kernel7 uses 2 threads per row (assuming sliceheight=16), so it does not fully utilize the GPU if nrows < 100K.
- maxslicewidth is too big.
Thanks-to: Peng Wang <penwang@nvidia.com>
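The two cases above suggest a selection rule of roughly the following shape. This is only a sketch in plain C, not the code from the commit: the enum, the function name `choose_kernel`, and the width cutoff of 64 are hypothetical; only the 100K row threshold comes from the message.

```c
#include <assert.h>

/* Hypothetical kernel identifiers; the names only mirror the commit message. */
typedef enum { KERNEL_7, KERNEL_9X } sell_kernel_t;

/* From the commit message: kernel7 underutilizes the GPU if nrows < 100K. */
#define MIN_ROWS_FOR_KERNEL7 100000
/* Assumed cutoff for "maxslicewidth is too big"; the message gives no number. */
#define MAX_WIDTH_FOR_KERNEL7 64

/* Pick kernel7 only when neither of the two slow cases applies. */
static sell_kernel_t choose_kernel(int nrows, int maxslicewidth)
{
  assert(nrows >= 0 && maxslicewidth >= 0);
  if (nrows < MIN_ROWS_FOR_KERNEL7) return KERNEL_9X;          /* too few rows */
  if (maxslicewidth > MAX_WIDTH_FOR_KERNEL7) return KERNEL_9X; /* slices too wide */
  return KERNEL_7;
}
```

With these assumed cutoffs, `choose_kernel(50000, 8)` falls back to the kernel9x family because of the row count, while `choose_kernel(200000, 8)` keeps kernel7.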
|
| 07e43b41 | 10-Sep-2020 |
Hong Zhang <hongzhang@anl.gov> |
Further optimization of MatMult_SeqSELLCUDA
- Add more kernels
- Use multiple threads per row for matrices with narrow slices
- Use multiple blocks per slice for matrices with wide slices
- Add three new APIs to return the irregularity ratio, the maximum slice width and the average slice width
Experiments show that column blocking gives much worse performance for wide matrices and that permutation based on slice width has almost no impact on performance.
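To illustrate the three slice statistics named in the list above, here is a minimal sketch in plain C. It is not the API added by the commit. It assumes a sliced-ELLPACK layout in which a hypothetical offset array `sliidx` gives the start of each slice, so that slice i stores `sliidx[i+1] - sliidx[i]` entries, equal to `sliceheight` times its width. It also assumes the irregularity ratio means maximum width divided by average width, which the message does not define.

```c
#include <assert.h>

/* Hypothetical result type for the three statistics. */
typedef struct {
  int    max_width;    /* widest slice, in columns */
  double avg_width;    /* mean slice width */
  double irregularity; /* assumed definition: max_width / avg_width */
} slice_stats_t;

/* sliidx has nslices + 1 entries; every slice holds sliceheight rows. */
static slice_stats_t slice_stats(const int *sliidx, int nslices, int sliceheight)
{
  slice_stats_t s = {0, 0.0, 0.0};
  long total = 0;
  assert(nslices > 0 && sliceheight > 0);
  for (int i = 0; i < nslices; i++) {
    /* Stored entries in slice i divided by rows per slice gives its width. */
    int width = (sliidx[i + 1] - sliidx[i]) / sliceheight;
    if (width > s.max_width) s.max_width = width;
    total += width;
  }
  s.avg_width = (double)total / nslices;
  s.irregularity = s.avg_width > 0.0 ? s.max_width / s.avg_width : 0.0;
  return s;
}
```

A ratio close to 1 would indicate slices of similar width, while a large ratio would indicate a few very wide slices, the situation in which the message uses multiple blocks per slice.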
|