Do MatAXPY in a single kernel instead of multiple cuBLAS calls
Do MatScale in a single kernel instead of multiple cuBLAS calls
MatDense CUPM