This is going to be a very short update.

I improved the CUDA code in nearly every way I know, so now is the moment of truth. How much faster is it?

16 times.

This is on k3, comparing the speed of my c code to my CUDA c code under the exact same conditions.

This is slightly slower than I have hoped. Probably I'm still doing something wrong.

Another disappointing thing is the maximum resolution I can get is still just 732 by 732. If I push it to 1k by 1k, I get a segmentation fault that indicates memory shortage, which is confusing since the amount of memory the program needs should not be that much.

The next thing I plan to do is to run it on k4/k5. Sounds like a simple thing to do, but unfortunately k4 & k5 don't have mathgl installed yet. Guess I need to do that first.

# Astro-HPC Research Team, Summer 2010

We learn how to build nVidia GPU-based (super)computers, network and program them for HPC (High Performance Computing) using **CUDA C** language. We apply all this to do some astrophysical simulations related to forming planetary systems.

## Sunday, August 1, 2010

## Friday, July 30, 2010

### [from Anthony] progress cont...

Just a quick comment on the asymptotic behaviour of D...what I thought for k3_real (red line) appeared as it looked was that we need to take care of the sign when doing a square-root...so one root for k=+sqrt(-D) and another for k=-sqrt(-D) which was why k3_real looked like a reflection of k1_real for r>>1. I agree...we need to do more analysis to explain k2 and also k3_real for r<<1.

****

So I tried integrating starting from the left with the following initial settings...

m=2 (yes...m was equal to 2)

dr = 0.003 (small enough to see if there could be kinks or interesting lumps)

r_initial = 0.01

eta(r=r_initial)= 0

v_eta(r=r_initial)=1

The coefficients B(r) and C(r) should be correct as I did the asymptotic predictions analytically and confirmed it numerically. Oh, from what I found, for r>>1, C(r) is actually increasing (rather than decreasing from our other parameters) because of the last term D/c^2, which is directly proportional to r (assuming c ~ v_kep). Nevertheless, the only part I think that has the greatest contribution to the solution of eta is I(r). Doing the asymptotic predictions for I(r) was a problem for me because I was not sure what assumptions I can do to simplify the term so I had to do it numerically.

Before I show the results, I'm going to define some of the symbols I used...

tau_1 ~ perturbed optical depth

c ~ sound speed

v_tau ~ dtau/dr (for unperturbed optical depth)

f ~ unperturbed radiative force

WW ~ shorthand for m*(Omega-Omega_p)

sigma ~ surface density

a ~ 2nd derivative of eta

v ~ 1st derivative of eta

From the list: sigma, c, f, WW, and v_tau have already been calculated prior to this integration.

So with the initial settings I adapt the leapfrog method for the integration...

tau_1[j+1] = tau_1[j]+(1/(c[j]*c[j]))*v_tau[j]*eta[j]*dr;

I[j+1]= I[j]+f[j]*(((log(r[j+1]*sigma[j+1]*f[j+1]/D[j+1])-log(r[j]*sigma[j]*f[j]/D[j]))/dr)+

((2*m*W[j])/(r[j]*WW[j])))*tau_1[j];

a[j]= - B[j]*v[j] - C[j]*eta[j]- I[j];

v[j+1] = v[j] + a[j]*dr;

eta[j+1] = eta[j] + v[j+1]*dr;

So again, the only part I'm suspicious is tau_1 which includes I(r). But here's what I got from this....

I stopped the integration at R=5 because I thought it was meaningless to continue on. This seems to act very similar to the wave for k_3 (look back at my previous post). Now, I thought it was weird that 'nothing' was happening between r=0 to r=1 so I restricted r to this range and I get...

I noticed something was happening between r=0 and r=0.2 so I restricted r further...

Looking at the numerical results, it seems that C(r) was the dominating factor (which I think was the one making these responses) and after r=0.4, I(r) became the dominate one very quickly. Moreover, this also resembles the waves for k_1 and k_2. I also plotted eta between 0.4 to r=0.8-ish...

So if you piece together the last 3 pics you would get the first pic I posted.

That's all I have so far in terms of integrating. If you need to see more pics, let me know and I'll post them.

***

Pawel, I believe you reminded me that this second ODE we been dealing with is actually an integro-differential equation. So the only method that rings to my head to solve this kind is using Laplace transform. With the initial conditions we defined (eta(r_initial)=0; v_eta(r_initial)=1) it would be convenient to use this method where we transform and solve for L(eta) then inverse laplace transform it to get an analytic solution. The first couple of terms are easy to transform but for I(r) I was not sure whether we should treat (1/c^2)(dtau/dr) as a constant and pull them out of the integral. If I did that, it would be easier to solve. So now I'm at simplifying and solving for L(eta) and so far I have this fraction with a 3rd degree polynomial at the denominator. Now I just need to find some way to simplify it and inverse transform it. But before I go further, do you think this method is actually suitable for our case???

Anthony

## Thursday, July 29, 2010

### [From Jeffrey] cuda-hydro

So the code works now...

The resolution for this simulation is 732x732 ( in case you wonder why such strange numbers: 732=3*(256-12) ). Approximately 10000 time steps were taken and it took 2 hours on k3. Total duration in simulation time is approximately 2 crossing time.

This is definitely not the fastest it can get. Computations are currently only done on one block that has a size of 256, so the simulation box is 3x3=9 times too large for the block. I should be able to utilize 9 blocks in parallel to do this simulation, which would in principle give me another factor of 9 in speed. The only reason why this is not done yet is that the memory sharing between blocks is not trivial and I didn't want to involve too much complication before I get it to work. Now that I understand cuda a little more, I already have a plan for the improvement and it shouldn't be too difficult.

The resolution for this simulation is 732x732 ( in case you wonder why such strange numbers: 732=3*(256-12) ). Approximately 10000 time steps were taken and it took 2 hours on k3. Total duration in simulation time is approximately 2 crossing time.

This is definitely not the fastest it can get. Computations are currently only done on one block that has a size of 256, so the simulation box is 3x3=9 times too large for the block. I should be able to utilize 9 blocks in parallel to do this simulation, which would in principle give me another factor of 9 in speed. The only reason why this is not done yet is that the memory sharing between blocks is not trivial and I didn't want to involve too much complication before I get it to work. Now that I understand cuda a little more, I already have a plan for the improvement and it shouldn't be too difficult.

## Tuesday, July 20, 2010

### [from Anthony] Progress report on modal analysis

First of all, I'm now back to Toronto from Santa Cruz and I have to say ISIMA was excellent. I definitely learned that there's a lot of more complicated problems in astro modeling where it involves lots of hydrodynamics (MHD) and there was one group from UC San Diego working on fusion (plasma physics). So, after meeting a bunch of graduate students from all over the world there's definitely a lot more we could learn in astro modeling.

Here's the site for what's currently happening in the program...

***

So while I was at Santa Cruz, I managed to get some work done on my modal analysis with Pawel. Having to solve for the roots of k (wave number) via WKB approximation along with Cardano's method on the third order polynomial; so there are three roots to this polynomial.

This is the first root of k with both real and imaginary parts. Notice the the three lumps on the graph. Those are the Inner Lindblad, corotation, and Outer Lindblad resonances respectively.

This is the second root for k.

This is the third root for k.

By integrating for eta (look back on my previous entry on what eta is...) using the WKB solution exp(ikr) or more specifically exp(i*integral(k_{i} dr)) we get the following three eta's...

This is the first eta using k_1 with both real and imaginary parts. There seems to be a large response between 0.1 to 0.5 and then some response between r=1 to r=1.7-ish.

This is the second eta using k_2. It seems nothing happens after r=0.5.

This is the strangest one. It seems eta acts violently (look at the number on the eta axis) as r grows which I can't seem wrap my head around. Looking back at the roots of k, k3 sort of acts opposite to k1, so there's a sign reversal on the k's which implies that the graph is flipped horizontally but I don't quite understand why there's such large values for eta_3.

## Thursday, July 1, 2010

### [from James] current status of dust-disk sim

Currently there are a couple of versions of simulation based on Josh's original code. On Cudak4, they're somewhere inside the student/.../C/src/ directories (I don't recall the exact path but it's the same folder in which all the other CUDA programs are kept). There are two versions in two separate folders called "tau" and "taunot", respectively.

For some reason, the program which ran just fine on cudak2 and cudak3 Sunday night wouldn't run on cudak4 or cudak5; the first two cudaMemCpy calls (from host to device) worked fine, but all subsequent cudaMemCpy calls, regardless of direction, threw a "wrong direction" error. Equally strange, commenting out everything tau-related made it work again.

The "taunot" folder contains the working simulation in which everything relating to the tau grid has been commented out (which is why, if you re-compile it, nvcc complains several times about nested comments). Also in this version, just for the fun of it we included a force term in the theta'' calculation to simulate a non-rotating bar. As mentioned previously, you can also ssh into cudak4 and run this sim remotely (although when I tried it from home the display was hideously slow, presumably because I have a slow internet connection and a 5-year-old computer).

The "tau" folder contains the non-working sim-in-progress in which the tau-grid is being re-implemented. After much discussion and debate (even some argument) on Tuesday we decided to re-implement tau using the following scheme:

- use two grids: an integer array "dtau_set" storing the number of particles in each grid cell; and a float array "tau_set" storing the value of tau in each cell (this will of course take up more memory but for now it seems that speed is more important)

- when the disk is initially generated, find out which cell each particle belongs to and increment that cell in the integer dtau_set

- start the simulation loop

- kernel 1: send one thread along each "ray" (solid angle) from r_min to r_max, calculating tau for each cell in the float array tau_set based on the data stored in the dtau_set integer array

- kernel 2: integration, one thread per particle: first figure out which cell the particle is in, retrieve the appropriate value of tau, and calculate beta; then calculate the acclerations and update the new positions and velocities; now re-calculate the particle's cell location, and if it has changed cells, decrement the old dtau_set location and increment the new dtau_set location

That's the plan, still in progress. Feel free to work on the code. Clearly this algorithm still isn't super-efficient: there are several calls to global memory, and some values get re-calculated more than once even though they haven't changed (most notably the particle's cell index). However we have managed to limit ourselves to two kernels, and each thread is independent with no need to invoke syncthreads(). The presentation date has come and gone (with no presentation :( ) but for my part I'd still like to see this thing up and running.

For some reason, the program which ran just fine on cudak2 and cudak3 Sunday night wouldn't run on cudak4 or cudak5; the first two cudaMemCpy calls (from host to device) worked fine, but all subsequent cudaMemCpy calls, regardless of direction, threw a "wrong direction" error. Equally strange, commenting out everything tau-related made it work again.

The "taunot" folder contains the working simulation in which everything relating to the tau grid has been commented out (which is why, if you re-compile it, nvcc complains several times about nested comments). Also in this version, just for the fun of it we included a force term in the theta'' calculation to simulate a non-rotating bar. As mentioned previously, you can also ssh into cudak4 and run this sim remotely (although when I tried it from home the display was hideously slow, presumably because I have a slow internet connection and a 5-year-old computer).

The "tau" folder contains the non-working sim-in-progress in which the tau-grid is being re-implemented. After much discussion and debate (even some argument) on Tuesday we decided to re-implement tau using the following scheme:

- use two grids: an integer array "dtau_set" storing the number of particles in each grid cell; and a float array "tau_set" storing the value of tau in each cell (this will of course take up more memory but for now it seems that speed is more important)

- when the disk is initially generated, find out which cell each particle belongs to and increment that cell in the integer dtau_set

- start the simulation loop

- kernel 1: send one thread along each "ray" (solid angle) from r_min to r_max, calculating tau for each cell in the float array tau_set based on the data stored in the dtau_set integer array

- kernel 2: integration, one thread per particle: first figure out which cell the particle is in, retrieve the appropriate value of tau, and calculate beta; then calculate the acclerations and update the new positions and velocities; now re-calculate the particle's cell location, and if it has changed cells, decrement the old dtau_set location and increment the new dtau_set location

That's the plan, still in progress. Feel free to work on the code. Clearly this algorithm still isn't super-efficient: there are several calls to global memory, and some values get re-calculated more than once even though they haven't changed (most notably the particle's cell index). However we have managed to limit ourselves to two kernels, and each thread is independent with no need to invoke syncthreads(). The presentation date has come and gone (with no presentation :( ) but for my part I'd still like to see this thing up and running.

## Tuesday, June 29, 2010

### [from Pawel] Last minute results

Robert, James and Pawel spent a lot of time on Monday working on

(i) making infiniband work concurrently with ethernet, and installing all openMPI

tests - a lot of success with that but tests aren't finished yet. a lot of software setup and configuration needs to be done. it look like we'll get our 20Gbps!

(ii) simulation of particles.

we decided to skip tau-effects and concentrate on benchmarking our naive, barely optimized code. after conquerring the problem of 512 particles-only

in the old code, we can now run up to 30 million particles. around 10pm we even started doing some "galaxy" simulations by adding external forces. we assumed a non-rotating, weakly barred force.

we benchmarked the compute part of the code at roughly 100 GFLOP/s.

this is approximately what we'd expect from a global memory bandwidth-limited problem:

the card can stream 120 GB/s from DDR5 to GPU, that is 30 GFLoats/s.

If 3.3 FLOP are done on every float from RAM, then we are comp/bandwidth balanced. Our leapfrog integration is probably limited by bandwidth, although the GPU heats up to 93C at times, and the fan becomes a hair-dryer.

(i) making infiniband work concurrently with ethernet, and installing all openMPI

tests - a lot of success with that but tests aren't finished yet. a lot of software setup and configuration needs to be done. it look like we'll get our 20Gbps!

(ii) simulation of particles.

we decided to skip tau-effects and concentrate on benchmarking our naive, barely optimized code. after conquerring the problem of 512 particles-only

in the old code, we can now run up to 30 million particles. around 10pm we even started doing some "galaxy" simulations by adding external forces. we assumed a non-rotating, weakly barred force.

we benchmarked the compute part of the code at roughly 100 GFLOP/s.

this is approximately what we'd expect from a global memory bandwidth-limited problem:

the card can stream 120 GB/s from DDR5 to GPU, that is 30 GFLoats/s.

If 3.3 FLOP are done on every float from RAM, then we are comp/bandwidth balanced. Our leapfrog integration is probably limited by bandwidth, although the GPU heats up to 93C at times, and the fan becomes a hair-dryer.

## Sunday, June 27, 2010

### [from Pawel] Josh's code

Since Josh has left us a cuda proto-code. We had our work-party gathering this Sunday. Only Pawel, Jeffrey, Anthony and James were available to enjoy the BBQd fish (basa=panga), but there was (i) no flying because of Buttenville airport lockdown to please 20 alien leaders, and (ii) no swimming either, because it started raining and we preferred to stay under the umbrella.

the wrong forcast must be due to the lack of CUDA.

Those of us who had looked into the code told us about how impossible it is to compile it on their computers. Anthony had to leave and the 3 remaining Magnificent retreated upstairs and started up cudak2 (yes, a clandestine copy of cudak1 does exist).

After an hour or two we figured out how to compile Josh's code:

cudak2[152]:~...C/src/tau/CUDA_pSim$ nvcc CUDA_pSim.cu -I/usr/local/cuda/sdk2.3/C/common/inc/ -L/usr/local/cuda/sdk2.3/C/lib/ -lm -lglut -lcudart -lcutil

and after a few additional hours we figured out what the !@#$ does it do that 500 particles that were stared in a nice disk begin following some very strange orbits essentially going up and falling onto the central star, then scattering at high speed to infinity.

ok, to make the long story short we corrected the initial conditions and the equations of motion to the point where some of us (me) thought they now represented the correct, Binney-Tremaine-like equations, only written using the reversed phi <--> theta convention of Josh, and some of us (Jeffrey) were complaining about the signs in front of the sin(phi_Josh). The code was still doing its weird thing, unless we forced it to simulate a 2-D, flat disk, where it worked uncomfortably slowly, but apprently ok.

In desperation, we allowed Jeffrey to do the nonsense (-:) change of signs and... everything became like Newton intended it: a stable 3-d disk. It turned out Josh's phi and theta were not the spherical-coordinates phi & theta, his phi was actually theta+pi/2 not pi/2-theta! so given his (non-standard) definition of meridional axis pointing downward not upward, our equations had two out of a dozen signs in front of acceleration terms wrong --> non-conservation of angular momentum --> trouble on timescale of a few orbits.

We put the corrected code on cudak3 here:

cudak3[9]:~/cuda/projects/tau$

cudak3[10]:~/cuda/projects/tau$ ls

3Ddisplay.h CUDA_pSim.sln movPos.h

a.out CUDA_pSim_vc90.sln movPos.h~

CUDA_pSim.cu CUDA_pSim_vc90.suo particle.h

CUDA_pSim.cu~ CUDA_pSim_vc90.vcproj particle.h~

CUDA_pSim_gold.cpp CUDA_pSim_vc90.vcproj.Josh-PC.Josh.user vc90.pdb

CUDA_pSim_kernel.cu CUDA_pSim.vcproj

CUDA_pSim_kernel.cu~ forces.h

cudak3[11]:~/cuda/projects/tau$

cudak3[11]:~/cuda/projects/tau$ a.out

size of each particle is 24 bytes

Number of particles: 1000

Mass ratio (0 <>

minimum radius from solar mass (0.0 <>

Time step: .01

Npart=1000 u=1.000e-02 R_min=5.000e-01 dt=1.000e-02 h=2.500e-04 M=1.000e+05 G=1.000e-05

creating particle array with 1000 particles...

setting initial conditions for particle array...

creating optical thickness array with 4 slices on device...

optical thickness array created...

freeglut (a.out): Unable to create direct context rendering for window 'Particle Simulation'

This may hurt performance.

cudak3[12]:~/cuda/projects/tau$

* * *

Interestingly, we did get openGL output on our remote machine (cudak2) from cudak3. It was very slow but it worked. Apparently, if you transfer data back to host from gpu device and then plot it with openGL, which is not the fastest way to do it, and you have the ssh -X ... connection to a cuda-capable machine, you can get you graphics served over internet. that's nice. All SDK demos fail to do it since they keep the frame buffer for plotting on the gpu runnig the calculation, and openGL extensions to somehow replicate the data on a remote client are not available, so demos print error messages and quit.

* * *

Next step: why is Josh's code running only ~500 particles (anything more and it refused to move the particles, the dsplay refreshed but the particles don't evolve in time)?

{Oh! I think I know. just as I was writing this I think I realized.. of course the max number must be 512, and 512 is the limit of threds per block

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum sizes of each dimension of a block: 512 x 512 x 64

this is of course the output from cudak2[6]:~...C/bin/linux/release$ deviceQuery on the gtx280 cards }

the wrong forcast must be due to the lack of CUDA.

Those of us who had looked into the code told us about how impossible it is to compile it on their computers. Anthony had to leave and the 3 remaining Magnificent retreated upstairs and started up cudak2 (yes, a clandestine copy of cudak1 does exist).

After an hour or two we figured out how to compile Josh's code:

cudak2[152]:~...C/src/tau/CUDA_pSim$ nvcc CUDA_pSim.cu -I/usr/local/cuda/sdk2.3/C/common/inc/ -L/usr/local/cuda/sdk2.3/C/lib/ -lm -lglut -lcudart -lcutil

and after a few additional hours we figured out what the !@#$ does it do that 500 particles that were stared in a nice disk begin following some very strange orbits essentially going up and falling onto the central star, then scattering at high speed to infinity.

ok, to make the long story short we corrected the initial conditions and the equations of motion to the point where some of us (me) thought they now represented the correct, Binney-Tremaine-like equations, only written using the reversed phi <--> theta convention of Josh, and some of us (Jeffrey) were complaining about the signs in front of the sin(phi_Josh). The code was still doing its weird thing, unless we forced it to simulate a 2-D, flat disk, where it worked uncomfortably slowly, but apprently ok.

In desperation, we allowed Jeffrey to do the nonsense (-:) change of signs and... everything became like Newton intended it: a stable 3-d disk. It turned out Josh's phi and theta were not the spherical-coordinates phi & theta, his phi was actually theta+pi/2 not pi/2-theta! so given his (non-standard) definition of meridional axis pointing downward not upward, our equations had two out of a dozen signs in front of acceleration terms wrong --> non-conservation of angular momentum --> trouble on timescale of a few orbits.

We put the corrected code on cudak3 here:

cudak3[9]:~/cuda/projects/tau$

cudak3[10]:~/cuda/projects/tau$ ls

3Ddisplay.h CUDA_pSim.sln movPos.h

a.out CUDA_pSim_vc90.sln movPos.h~

CUDA_pSim.cu CUDA_pSim_vc90.suo particle.h

CUDA_pSim.cu~ CUDA_pSim_vc90.vcproj particle.h~

CUDA_pSim_gold.cpp CUDA_pSim_vc90.vcproj.Josh-PC.Josh.user vc90.pdb

CUDA_pSim_kernel.cu CUDA_pSim.vcproj

CUDA_pSim_kernel.cu~ forces.h

cudak3[11]:~/cuda/projects/tau$

cudak3[11]:~/cuda/projects/tau$ a.out

size of each particle is 24 bytes

Number of particles: 1000

Mass ratio (0 <>

minimum radius from solar mass (0.0 <>

Time step: .01

Npart=1000 u=1.000e-02 R_min=5.000e-01 dt=1.000e-02 h=2.500e-04 M=1.000e+05 G=1.000e-05

creating particle array with 1000 particles...

setting initial conditions for particle array...

creating optical thickness array with 4 slices on device...

optical thickness array created...

freeglut (a.out): Unable to create direct context rendering for window 'Particle Simulation'

This may hurt performance.

cudak3[12]:~/cuda/projects/tau$

* * *

Interestingly, we did get openGL output on our remote machine (cudak2) from cudak3. It was very slow but it worked. Apparently, if you transfer data back to host from gpu device and then plot it with openGL, which is not the fastest way to do it, and you have the ssh -X ... connection to a cuda-capable machine, you can get you graphics served over internet. that's nice. All SDK demos fail to do it since they keep the frame buffer for plotting on the gpu runnig the calculation, and openGL extensions to somehow replicate the data on a remote client are not available, so demos print error messages and quit.

* * *

Next step: why is Josh's code running only ~500 particles (anything more and it refused to move the particles, the dsplay refreshed but the particles don't evolve in time)?

{Oh! I think I know. just as I was writing this I think I realized.. of course the max number must be 512, and 512 is the limit of threds per block

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum sizes of each dimension of a block: 512 x 512 x 64

this is of course the output from cudak2[6]:~...C/bin/linux/release$ deviceQuery on the gtx280 cards }

Subscribe to:
Posts (Atom)