One last detail: what value should I use for nPartitions? I'm multiplying (300,1000)x(1000,500) matrices. I haven't looked at your code in detail, so it would save me time if you could tell me. @tobias elbert
Wow, the result is amazing. See the photos, and let's continue the discussion.
Amazing indeed: managed to extract a whopping 100 MFLOPS out of a machine that is probably north of 50 GFLOPS peak.
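To put those two figures side by side, here is a rough back-of-the-envelope sketch. It assumes the (300,1000)x(1000,500) sizes mentioned earlier in the thread and the usual count of 2*m*k*n floating-point operations for a dense product. The helper names `matmul_flops` and `seconds_at` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations for C = A * B with A (m x k) and B (k x n):
// each of the m*n outputs needs k multiplies and k adds.
constexpr std::uint64_t matmul_flops(std::uint64_t m, std::uint64_t k,
                                     std::uint64_t n) {
    return 2 * m * k * n;
}

// Seconds needed for `flops` operations at a sustained rate in FLOP/s.
inline double seconds_at(std::uint64_t flops, double flops_per_second) {
    assert(flops_per_second > 0.0);
    return static_cast<double>(flops) / flops_per_second;
}
```

With those sizes, `matmul_flops(300, 1000, 500)` is 300,000,000 operations. `seconds_at(300000000, 100e6)` works out to about 3 seconds at 100 MFLOPS, while `seconds_at(300000000, 50e9)` is about 6 milliseconds at a 50 GFLOPS peak, assuming the peak could actually be sustained.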
This question was answered for you, some months ago, on this thread: Matlab is faster because it probably uses an optimized version of the BLAS library for matrix multiplication. You cannot come even close to this speed by coding matrix multiplication as three for loops (in case you didn't get it, I was being ironic in my previous post about the "impressive" speeds of the C# code posted here). To reach this level of speed, you'd have to utilize the SSE processing units, take great care in reorganizing the multiplication code with regard to caching, etc. So you should be very, very knowledgeable about code optimization before even thinking about approaching this sort of task. Alternatively, if you're a C++ wizard, maybe you could come close by employing some template meta-programming magic, as in Eigen or similar libraries. For these reasons, for vector/matrix operations, one should always stick to the vendor-supplied version of the BLAS library.
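To give a feel for the caching point, here is a minimal sketch in C++ (not the C# code from this thread, and nowhere near what a BLAS does internally). It shows the textbook three-loop order next to the same loops reordered so that the inner loop walks memory contiguously. The names `Matrix`, `matmul_ijk` and `matmul_ikj` are hypothetical, and flat row-major storage is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix stored flat: element (r, c) of a matrix with
// `cols` columns lives at index r * cols + c.
using Matrix = std::vector<double>;

// Textbook i-j-k order: the inner loop walks B down a column, so successive
// reads of B are n elements apart in memory.
Matrix matmul_ijk(const Matrix& A, const Matrix& B,
                  std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);
    Matrix C(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
    return C;
}

// i-k-j order: same arithmetic, but the inner loop now walks one row of B
// and one row of C contiguously.
Matrix matmul_ikj(const Matrix& A, const Matrix& B,
                  std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);
    Matrix C(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const double a = A[i * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                C[i * n + j] += a * B[p * n + j];
            }
        }
    }
    return C;
}
```

Both functions return the same product. The reordering is only the first and simplest of the steps mentioned above: blocking for cache, SIMD and threading are what optimized BLAS libraries add on top, which is why calling the library remains the practical choice.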
How about using an OpenCL library?