Currently there are a couple of versions of simulation based on Josh's original code. On Cudak4, they're somewhere inside the student/.../C/src/ directories (I don't recall the exact path but it's the same folder in which all the other CUDA programs are kept). There are two versions in two separate folders called "tau" and "taunot", respectively.
For some reason, the program which ran just fine on cudak2 and cudak3 Sunday night wouldn't run on cudak4 or cudak5; the first two cudaMemCpy calls (from host to device) worked fine, but all subsequent cudaMemCpy calls, regardless of direction, threw a "wrong direction" error. Equally strange, commenting out everything tau-related made it work again.
The "taunot" folder contains the working simulation in which everything relating to the tau grid has been commented out (which is why, if you re-compile it, nvcc complains several times about nested comments). Also in this version, just for the fun of it we included a force term in the theta'' calculation to simulate a non-rotating bar. As mentioned previously, you can also ssh into cudak4 and run this sim remotely (although when I tried it from home the display was hideously slow, presumably because I have a slow internet connection and a 5-year-old computer).
The "tau" folder contains the non-working sim-in-progress in which the tau-grid is being re-implemented. After much discussion and debate (even some argument) on Tuesday we decided to re-implement tau using the following scheme:
- use two grids: an integer array "dtau_set" storing the number of particles in each grid cell; and a float array "tau_set" storing the value of tau in each cell (this will of course take up more memory but for now it seems that speed is more important)
- when the disk is initially generated, find out which cell each particle belongs to and increment that cell in the integer dtau_set
- start the simulation loop
- kernel 1: send one thread along each "ray" (solid angle) from r_min to r_max, calculating tau for each cell in the float array tau_set based on the data stored in the dtau_set integer array
- kernel 2: integration, one thread per particle: first figure out which cell the particle is in, retrieve the appropriate value of tau, and calculate beta; then calculate the acclerations and update the new positions and velocities; now re-calculate the particle's cell location, and if it has changed cells, decrement the old dtau_set location and increment the new dtau_set location
That's the plan, still in progress. Feel free to work on the code. Clearly this algorithm still isn't super-efficient: there are several calls to global memory, and some values get re-calculated more than once even though they haven't changed (most notably the particle's cell index). However we have managed to limit ourselves to two kernels, and each thread is independent with no need to invoke syncthreads(). The presentation date has come and gone (with no presentation :( ) but for my part I'd still like to see this thing up and running.