Astro-HPC Research Team, Summer 2010: [from Pawel] Josh's code

Since Josh has left us a CUDA proto-code, we had our work-party gathering this Sunday. Only Pawel, Jeffrey, Anthony and James were available to enjoy the BBQ'd fish (basa = panga), but there was (i) no flying, because of the Buttonville airport lockdown to please 20 alien leaders, and (ii) no swimming either, because it started raining and we preferred to stay under the umbrella.
The wrong forecast must be due to the lack of CUDA.

Those of us who had looked into the code told us how impossible it was to compile on their computers. Anthony had to leave, and the 3 remaining Magnificent retreated upstairs and started up cudak2 (yes, a clandestine copy of cudak1 does exist).

After an hour or two we figured out how to compile Josh's code:
cudak2[152]:~...C/src/tau/CUDA_pSim$ nvcc CUDA_pSim.cu -I/usr/local/cuda/sdk2.3/C/common/inc/ -L/usr/local/cuda/sdk2.3/C/lib/ -lm -lglut -lcudart -lcutil
and after a few additional hours we figured out what the !@#$ it was doing to make 500 particles that started in a nice disk follow some very strange orbits, essentially going up and falling onto the central star, then scattering at high speed to infinity.

OK, to make a long story short: we corrected the initial conditions and the equations of motion to the point where some of us (me) thought they now represented the correct, Binney-Tremaine-like equations, only written using Josh's reversed phi <--> theta convention, while some of us (Jeffrey) were complaining about the signs in front of the sin(phi_Josh). The code was still doing its weird thing, unless we forced it to simulate a 2-D, flat disk, where it worked uncomfortably slowly, but apparently OK.

In desperation, we allowed Jeffrey to do the nonsense (-:) change of signs and... everything became like Newton intended it: a stable 3-D disk. It turned out Josh's phi and theta were not the spherical-coordinates phi & theta: his phi was actually theta+pi/2, not pi/2-theta! So, given his (non-standard) definition of the meridional axis pointing downward rather than upward, our equations had two out of a dozen signs in front of the acceleration terms wrong --> non-conservation of angular momentum --> trouble on a timescale of a few orbits.
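To see why the two readings of phi_Josh are so easy to confuse, here is a small host-side sketch. It is not taken from Josh's code, and all function names are hypothetical. It only encodes the two candidate definitions from the paragraph above (phi = theta+pi/2 versus phi = pi/2-theta) and shows that they give the same sine but the opposite cosine, so any formula containing an odd power of cos(phi) needs a different sign depending on the convention. Which two terms we actually flipped is not reproduced here.

```cpp
#include <cmath>

// Hypothetical helpers for illustration only; none of these names come from Josh's code.
constexpr double kPi = 3.14159265358979323846;

// Convention A (what the code turned out to use): phi = theta + pi/2.
double phi_from_theta_plus(double theta) { return theta + kPi / 2.0; }

// Convention B (what we had assumed): phi = pi/2 - theta.
double phi_from_theta_minus(double theta) { return kPi / 2.0 - theta; }

// Cylindrical radius r*sin(theta), rewritten in terms of phi.
// Under convention A, sin(theta) = -cos(phi), so a minus sign appears.
double cyl_radius_plus(double r, double phi) { return -r * std::cos(phi); }

// Under convention B, sin(theta) = +cos(phi), so there is no minus sign.
double cyl_radius_minus(double r, double phi) { return r * std::cos(phi); }

// Height z = r*cos(theta) equals r*sin(phi) under both conventions,
// which is part of why the mix-up is hard to spot.
double height(double r, double phi) { return r * std::sin(phi); }
```

Feeding the same theta through both conventions gives identical heights, while the cylindrical radius comes out right only when the formula matches the convention that produced phi.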

We put the corrected code on cudak3 here:
cudak3[9]:~/cuda/projects/tau$

cudak3[10]:~/cuda/projects/tau$ ls

3Ddisplay.h CUDA_pSim.sln movPos.h
a.out CUDA_pSim_vc90.sln movPos.h~
CUDA_pSim.cu CUDA_pSim_vc90.suo particle.h
CUDA_pSim.cu~ CUDA_pSim_vc90.vcproj particle.h~
CUDA_pSim_gold.cpp CUDA_pSim_vc90.vcproj.Josh-PC.Josh.user vc90.pdb
CUDA_pSim_kernel.cu CUDA_pSim.vcproj
CUDA_pSim_kernel.cu~ forces.h

cudak3[11]:~/cuda/projects/tau$
cudak3[11]:~/cuda/projects/tau$ a.out

size of each particle is 24 bytes

Number of particles: 1000
Mass ratio (0 <>
minimum radius from solar mass (0.0 <>
Time step: .01
Npart=1000 u=1.000e-02 R_min=5.000e-01 dt=1.000e-02 h=2.500e-04 M=1.000e+05 G=1.000e-05
creating particle array with 1000 particles...
setting initial conditions for particle array...
creating optical thickness array with 4 slices on device...
optical thickness array created...
freeglut (a.out): Unable to create direct context rendering for window 'Particle Simulation'
This may hurt performance.
cudak3[12]:~/cuda/projects/tau$
* * *
Interestingly, we did get OpenGL output on our remote machine (cudak2) from cudak3. It was very slow, but it worked. Apparently, if you transfer the data back to the host from the GPU device and then plot it with OpenGL (which is not the fastest way to do it), and you have an ssh -X ... connection to a CUDA-capable machine, you can get your graphics served over the internet. That's nice. All the SDK demos fail to do this, since they keep the frame buffer for plotting on the GPU running the calculation, and the OpenGL extensions needed to somehow replicate the data on a remote client are not available, so the demos print error messages and quit.
* * *
Next step: why does Josh's code run only ~500 particles? (With anything more it refused to move the particles: the display refreshed, but the particles didn't evolve in time.)
{Oh! I think I know. Just as I was writing this I think I realized... of course the max number must be 512, and 512 is the limit of threads per block:
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
This is of course the output from cudak2[6]:~...C/bin/linux/release$ deviceQuery on the GTX 280 cards.}
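If the 512 guess is right, the usual way around it is to spread the particles over several blocks instead of cramming them into one. Below is a minimal host-side sketch of that bookkeeping, assuming (this is not confirmed by reading the code) that Josh's kernel is currently launched with a single block. The function names are hypothetical; `global_index` just mimics what `blockIdx.x * blockDim.x + threadIdx.x` would compute on the device.

```cpp
#include <cassert>

// Threads-per-block limit reported by deviceQuery on the GTX 280.
constexpr int kMaxThreadsPerBlock = 512;

// Ceiling division: smallest number of blocks such that
// blocks * threads_per_block >= n_particles.
int blocks_for(int n_particles, int threads_per_block) {
    assert(n_particles >= 0 && threads_per_block > 0);
    return (n_particles + threads_per_block - 1) / threads_per_block;
}

// Host-side stand-in for the device expression
// blockIdx.x * blockDim.x + threadIdx.x.
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Counts the launched threads that would pass an `i < n_particles` guard,
// i.e. the threads that actually own a particle.
int active_threads(int n_particles, int threads_per_block) {
    int active = 0;
    const int blocks = blocks_for(n_particles, threads_per_block);
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < threads_per_block; ++t) {
            if (global_index(b, threads_per_block, t) < n_particles) {
                ++active;
            }
        }
    }
    return active;
}
```

For the 1000-particle run above, `blocks_for(1000, kMaxThreadsPerBlock)` gives 2 blocks; the last block has spare threads, which is why the kernel needs the `i < n_particles` guard. With the grid limit of 65535 blocks in one dimension, this scheme would cover up to 65535 x 512 = 33,553,920 particles before a second grid dimension is needed.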

2 comments:

Unknown, June 27, 2010 at 10:53 PM
[from Jeffrey] Nonsense prevailed!

So it seems by setting the blocksize to its maximum we can include practically as many particles as we want? I wonder how fast it'll be.
Summer Research Team 2010, July 5, 2010 at 9:19 PM
That's great you were able to make use of my code. I'm actually surprised myself that it compiled, since most stuff was done midway through the morning and my concentration was running low. Now that we have some more time it shouldn't be long before things are moving smoothly (particles included). In order for me to make stuff work on CUDA I had to completely redo most of the tau code, which is probably why it didn't work; there's just so much stuff to keep in mind while parallel coding.

Sunday, June 27, 2010
