Thursday, June 24, 2010

[from Josh] CUDAizing progress report

I would say I'm about halfway done with the CUDA conversion. This is my scheme for now, since it seems the simplest given the time we have. I have a single copy of the tau grid on the device and a copy of the particle array on both host and device. Once I initialize the data on the host, I send it to the device and call a single kernel. The kernel executes the force command and integrates a single time step. I only retrieve data from the device every N steps (when the step number modulo N is zero). Between kernel calls I have to synchronize with the device so that the particle array stays consistent. The only type of memory I am currently using is global memory for the two arrays; all other data is passed as kernel parameters.

After I get this first version working I would like to make it work on multiple devices (which requires memory sharing and workload splitting), have groups of particles in the same tau shell use shared memory to share the tau properties of that slice, and process particles in order by slice instead of chronologically. This would have two benefits: shared memory is very fast (it is the on-chip memory that the threads of a block can see and use fastest, and it is very limited in size, so only small usages are applicable; it should not be confused with page-locked, or pinned, memory, which is host memory that speeds up host-device transfers), and doing particles by slice would keep tau from being retrieved multiple times (wasted time).

I think I'll be able to make it to the meeting tomorrow; however, I may spend the time finishing my code, since I'm leaving tomorrow for my cottage and need to finish the code before the 29th. I'll see if I can get my code running on c4 or 5 on a single device before I leave.
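For reference, here is a minimal sketch of the first-version scheme (tau grid and particle array in global memory, one kernel per time step, copy back every N steps). It is only an illustration under assumptions: the `Particle` layout, the placeholder force law, and the names `step_kernel` and `run_steps` are all hypothetical and not taken from the actual code.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical particle layout; the real one is not described in the post.
struct Particle {
    float x, v;   // 1D position and velocity, for illustration only
    int   slice;  // index of the tau slice this particle sits in
};

// One thread per particle: evaluate a force, then integrate one time step.
// The force law (-tau[slice] * x) is a placeholder, not the real physics.
__global__ void step_kernel(Particle* p, const float* tau, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = -tau[p[i].slice] * p[i].x;  // tau is read from global memory
    p[i].v += a * dt;
    p[i].x += p[i].v * dt;
}

// Runs n_steps time steps and copies the particles back to the host every
// `interval` steps. Returns the number of device-to-host copies, or -1 if
// the arguments are invalid or a CUDA call fails (e.g. no device present).
int run_steps(std::vector<Particle>& host, const std::vector<float>& tau,
              int n_steps, int interval, float dt)
{
    const int n = static_cast<int>(host.size());
    if (n == 0 || tau.empty() || interval <= 0) return -1;

    const size_t p_bytes = n * sizeof(Particle);
    const size_t t_bytes = tau.size() * sizeof(float);
    Particle* d_p = nullptr;
    float* d_tau = nullptr;
    if (cudaMalloc(&d_p, p_bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_tau, t_bytes) != cudaSuccess) { cudaFree(d_p); return -1; }

    // Initialize on the host, then send everything to the device once.
    cudaMemcpy(d_p, host.data(), p_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_tau, tau.data(), t_bytes, cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    int copies = 0;
    for (int step = 1; step <= n_steps; ++step) {
        // Launches on the default stream run in order, so step k+1 only
        // starts after step k has finished on the device.
        step_kernel<<<blocks, threads>>>(d_p, d_tau, n, dt);
        if (step % interval == 0) {
            // A blocking cudaMemcpy waits for the preceding kernel.
            cudaMemcpy(host.data(), d_p, p_bytes, cudaMemcpyDeviceToHost);
            ++copies;
        }
    }

    // Wait for any remaining work and pick up launch or runtime errors.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) err = cudaGetLastError();
    cudaFree(d_p);
    cudaFree(d_tau);
    return err == cudaSuccess ? copies : -1;
}
```

Calling `run_steps(particles, tau, 1000, 100, 0.01f)` would, under these assumptions, advance 1000 steps and refresh the host copy 10 times. Because each thread only touches its own particle, no synchronization is needed inside the kernel; the ordering between steps comes from launching on the same stream. On older toolkits `cudaThreadSynchronize` plays the role of `cudaDeviceSynchronize`.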

1 comment:

  1. if you're indeed half-way through then you're super-fast... let's talk tomorrow (Fri) at UTSC,