| #
90d2215b
|
| 12-Jan-2021 |
Hong Zhang <hongzhang@anl.gov> |
Add the load-balancing kernel for MatMultAdd_SeqSELL and fine tune the heuristic
Kernel7 is significantly slower than kernel9x for the following two cases:
- nrows is too small. Kernel7 uses 2 threads per row (assuming sliceheight=16), so it does not fully utilize the GPU if nrows < 100K.
- maxslicewidth is too big.
Thanks-to: Peng Wang <penwang@nvidia.com>
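
The two cases named in the commit message can be read as a simple dispatch rule. The following is only a minimal sketch of such a rule, not the code from this commit: the function name `choose_kernel`, the enum values, and `SLICEWIDTH_LIMIT` are hypothetical, and the slice-width cutoff of 64 is a placeholder assumption because the message does not state one. Only the 100K row figure comes from the message.

```c
/* Hypothetical sketch of a kernel-selection heuristic.
 * None of these names are taken from the actual source. */
enum sell_kernel { KERNEL7 = 7, KERNEL9X = 9 };

/* From the commit message: kernel7 underutilizes the GPU if nrows < 100K. */
#define NROWS_LIMIT 100000L

/* ASSUMPTION: placeholder cutoff; the commit message only says
 * "maxslicewidth is too big" without giving a number. */
#define SLICEWIDTH_LIMIT 64L

static enum sell_kernel choose_kernel(long nrows, long maxslicewidth)
{
  /* Case 1: too few rows. With 2 threads per row, kernel7 launches
   * only 2*nrows threads, which may not fill the device. */
  if (nrows < NROWS_LIMIT) return KERNEL9X;

  /* Case 2: some slice is very wide, so the work per slice is uneven
   * and the load-balancing kernel is preferred. */
  if (maxslicewidth > SLICEWIDTH_LIMIT) return KERNEL9X;

  /* Otherwise keep kernel7. */
  return KERNEL7;
}
```

Under these assumptions, a matrix with 50,000 rows would be routed to the load-balancing kernel regardless of slice width, while a matrix with 1,000,000 rows and narrow slices would stay on kernel7.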
|