Tuesday, June 29, 2010

[from Pawel] Last minute results

Robert, James and Pawel spent a lot of time on Monday working on
(i) making InfiniBand work concurrently with ethernet, and installing all the OpenMPI
tests - a lot of success with that, but the tests aren't finished yet; a lot of software setup and configuration remains. It looks like we'll get our 20 Gbps!

(ii) simulation of particles.

We decided to skip tau-effects and concentrate on benchmarking our naive, barely optimized code. After conquering the 512-particles-only problem
in the old code, we can now run up to 30 million particles. Around 10pm we even started doing some "galaxy" simulations by adding external forces. We assumed a non-rotating, weakly barred force.

We benchmarked the compute part of the code at roughly 100 GFLOP/s.
This is approximately what we'd expect from a problem limited by global memory bandwidth:
the card can stream 120 GB/s from GDDR5 to the GPU, that is 30 Gfloats/s.
If 3.3 FLOP are done on every float fetched from RAM, then we are compute/bandwidth balanced. Our leapfrog integration is probably limited by bandwidth, although the GPU heats up to 93C at times, and the fan becomes a hair-dryer.
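The kind of kernel we mean looks like this (a hypothetical drift step, not our actual code): it does 6 FLOP while moving 48 bytes (12 floats) per particle, i.e. 0.5 FLOP/float, well under the ~3.3 FLOP/float break-even point, so the memory system sets the pace.

__global__ void drift(float4 *pos, const float4 *vel, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = pos[i];      // 16 B read
        float4 v = vel[i];      // 16 B read
        p.x += dt * v.x;        // 3 fused multiply-adds = 6 FLOP
        p.y += dt * v.y;
        p.z += dt * v.z;
        pos[i] = p;             // 16 B write
    }
}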

Sunday, June 27, 2010

[from Pawel] Josh's code

Josh has left us a CUDA proto-code, so we had our work-party gathering this Sunday. Only Pawel, Jeffrey, Anthony and James were available to enjoy the BBQ'd fish (basa = panga), but there was (i) no flying, because of the Buttonville airport lockdown to please 20 alien leaders, and (ii) no swimming either, because it started raining and we preferred to stay under the umbrella.
The wrong forecast must be due to the lack of CUDA.

Those of us who had looked into the code told the rest how impossible it was to compile on their computers. Anthony had to leave, and the 3 remaining Magnificent retreated upstairs and started up cudak2 (yes, a clandestine copy of cudak1 does exist).

After an hour or two we figured out how to compile Josh's code:
cudak2[152]:~...C/src/tau/CUDA_pSim$ nvcc CUDA_pSim.cu -I/usr/local/cuda/sdk2.3/C/common/inc/ -L/usr/local/cuda/sdk2.3/C/lib/ -lm -lglut -lcudart -lcutil
and after a few additional hours we figured out what the !@#$ it does: the 500 particles that were started in a nice disk begin following some very strange orbits, essentially going up and falling onto the central star, then scattering at high speed to infinity.

OK, to make a long story short: we corrected the initial conditions and the equations of motion to the point where some of us (me) thought they now represented the correct, Binney-Tremaine-like equations, only written using Josh's reversed phi <--> theta convention, while some of us (Jeffrey) were complaining about the signs in front of sin(phi_Josh). The code was still doing its weird thing, unless we forced it to simulate a 2-D, flat disk, where it worked uncomfortably slowly, but apparently OK.

In desperation, we allowed Jeffrey to make the nonsense (-:) change of signs and... everything became as Newton intended: a stable 3-D disk. It turned out Josh's phi and theta were not the spherical-coordinates phi & theta; his phi was actually theta+pi/2, not pi/2-theta! So given his (non-standard) definition of the meridional axis pointing downward, not upward, our equations had two out of a dozen signs in front of acceleration terms wrong --> non-conservation of angular momentum --> trouble on a timescale of a few orbits.

We put the corrected code on cudak3 here:
cudak3[9]:~/cuda/projects/tau$

cudak3[10]:~/cuda/projects/tau$ ls

3Ddisplay.h CUDA_pSim.sln movPos.h
a.out CUDA_pSim_vc90.sln movPos.h~
CUDA_pSim.cu CUDA_pSim_vc90.suo particle.h
CUDA_pSim.cu~ CUDA_pSim_vc90.vcproj particle.h~
CUDA_pSim_gold.cpp CUDA_pSim_vc90.vcproj.Josh-PC.Josh.user vc90.pdb
CUDA_pSim_kernel.cu CUDA_pSim.vcproj
CUDA_pSim_kernel.cu~ forces.h

cudak3[11]:~/cuda/projects/tau$
cudak3[11]:~/cuda/projects/tau$ a.out

size of each particle is 24 bytes

Number of particles: 1000
Mass ratio (0 < u < 1): 0.01
minimum radius from solar mass (0.0 < R_min): 0.5
Time step: .01
Npart=1000 u=1.000e-02 R_min=5.000e-01 dt=1.000e-02 h=2.500e-04 M=1.000e+05 G=1.000e-05
creating particle array with 1000 particles...
setting initial conditions for particle array...
creating optical thickness array with 4 slices on device...
optical thickness array created...
freeglut (a.out): Unable to create direct context rendering for window 'Particle Simulation'
This may hurt performance.
cudak3[12]:~/cuda/projects/tau$
* * *
Interestingly, we did get OpenGL output from cudak3 on our remote machine (cudak2). It was very slow, but it worked. Apparently, if you transfer data back from the GPU device to the host and then plot it with OpenGL (not the fastest way to do it), and you have an ssh -X ... connection to a cuda-capable machine, you can get your graphics served over the internet. That's nice. All the SDK demos fail to do this, since they keep the frame buffer for plotting on the GPU running the calculation, and OpenGL extensions to somehow replicate the data on a remote client are not available, so the demos print error messages and quit.
* * *
Next step: why is Josh's code running only ~500 particles? (Anything more and it refuses to move the particles; the display refreshes but the particles don't evolve in time.)
{Oh! I think I know. Just as I was writing this I think I realized... of course the max number must be 512, and 512 is the limit of threads per block:
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
this is of course the output from cudak2[6]:~...C/bin/linux/release$ deviceQuery on the gtx280 cards }
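For the record, the usual way around the 512 limit is to cover the particles with many blocks rather than one; a hypothetical sketch (kernel and struct names made up, not yet wired into Josh's code):

__global__ void moveParticles(Particle *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;               // threads past the end do nothing
    /* ... integrate particle i ... */
}

// host side:
int threads = 512;                            // the per-block limit on the GTX 280
int blocks  = (N + threads - 1) / threads;    // ceiling division covers all N
moveParticles<<<blocks, threads>>>(d_particles, N);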

Saturday, June 26, 2010

[From Robert] Installing the software Part 1

Installing the software proved to be just as difficult to set up as the hardware, if not more so. The specifications on Nvidia's website state that the CUDA software will run on (one of many flavours of linux) Fedora 10.
Because of this, we decided to work with version 10 64-bit (even though the latest release was Fedora 13 at the time). However, due to troubles recognizing our new video cards (GTX 480), we were not able to load the Graphical User Interface (GUI) properly. This prompted us to install Fedora 11 64-bit, but we ran into the main issue of the NVIDIA drivers not installing on our system due to an incompatible kernel. With so much frustration, we decided to take a chance on Fedora 12 64-bit. Sure enough, everything started to work out, and the NVIDIA and CUDA software installs were on their way.
---
First off, to install the NVIDIA drivers, we decided to use Linux x86_64 Display Driver Version 256.25 Beta.
Because the NVIDIA installer requires a non-X11 interface, X needs to be exited first. To do so, a terminal window was opened, and we logged in as root
# su
Then, the OS initialization file called 'inittab' was edited
# cd /etc/
Here, the last line of the file was edited from
id:5:initdefault:
to
id:3:initdefault:
The file was saved and the computer was rebooted. The computer then loaded the OS with a text-based log-in screen, due to the change made in the file. We logged in as root and moved to the location of the NVIDIA driver to run the installation file. In our case:
# cd /Downloads/
# sh NVIDIA-Linux-x86_64-256.25.run
After installing, the inittab file needed to be reverted to 'id:5:initdefault:' so that the GUI would load on boot. Once saved, the computer was restarted.
--
After installing the NVIDIA drivers, the CUDA compiler needed to be installed. In our case:
# cd /Downloads/
# sudo sh cudatoolkit_3.0_linux_64_fedora10.run.sh
This was installed in the default path (/usr/local/cuda)
To make the compiler command (nvcc) work, the .bash_profile and .bashrc files (located at ~/) needed to be edited. Both PATH and LD_LIBRARY_PATH need to point to the CUDA installation:
#
PATH=/usr/local/cuda/bin:$PATH
LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH # on a 32-bit machine, use lib instead of lib64
export PATH LD_LIBRARY_PATH
#
For the .bashrc file, the following needed to be added:
#
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
#

The system then needed to be restarted for the changes to take effect.
--
The only thing left was installing the SDK, which includes sample and benchmark programs and diagnostics for the hardware. In our case:
# cd /Downloads/
# sh gpucomputingsdk_3.0_linux.run.sh
This was then installed in the default path (~/NVIDIA_GPU_Computing_SDK/), although it could have been installed in a non-user path as well.

Once this was installed, make had to be run on the SDK at ~/NVIDIA_GPU_Computing_SDK/C/
On our machine, make did not complete properly, complaining about " cannot find -lglut "
To fix this, we needed to install the glut packages:
# sudo yum install freeglut
and
# sudo yum install freeglut-devel

Once these were installed, make was run again, and it passed, building the SDK examples into ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/
From there, numerous executables are available, notably deviceQuery and bandwidthTest.

# ./bandwidthTest --device=all
[bandwidthTest]
./bandwidthTest Starting...


!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!

Running on...

Device 0: GeForce GTX 480
Device 1: GeForce GTX 480
Device 2: GeForce GTX 480
Quick Mode

Host to Device Bandwidth, 3 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5155.2

Device to Host Bandwidth, 3 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4267.7

Device to Device Bandwidth, 3 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 120670.9


[bandwidthTest] - Test results:
PASSED


# ./deviceQuery --all
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 3 devices supporting CUDA

Device 0: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1609760768 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Device 1: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1610285056 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Device 2: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1610285056 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4246847, CUDA Runtime Version = 3.00, NumDevs = 3, Device = GeForce GTX 480, Device = GeForce GTX 480


PASSED

At this point, the NVIDIA CUDA was installed and working.
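As a final sanity check of the toolchain (a minimal test of our own devising, not an SDK program), something like the following should compile with nvcc and print 2 4 6 8:

#include <cstdio>

__global__ void doubleEach(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2;               // one thread per element
}

int main()
{
    int h[4] = {1, 2, 3, 4}, *d;
    cudaMalloc((void**)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    doubleEach<<<1, 4>>>(d, 4);         // 1 block, 4 threads
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("%d %d %d %d\n", h[0], h[1], h[2], h[3]);
    return 0;
}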
To be continued...

[from Pawel] success with drivers, methinks

well, it seems that some of our previous problems stemmed from our not RTFMing; we had mismatched versions of three things in the nvidia stack, all of which always need to be updated together:
in /home/student/Downloads/ on both cudak4 and cudak5 we now have the newest driver (newer than was available from nvidia download page a few days ago when I reported last)
NVIDIA-Linux-x86_64-256.35.run [version 256.35]
we also have a toolkit cudatoolkit_3.1_linux_64_fedora12.run, and finally
the SDK in gpucomputingsdk_3.1_linux.run
all these things except the last need to be installed by a superuser, like:
# sh (file name with .run extension)

I think you should set up a directory, either NVIDIA_GPU_Computing_SDK/
or, like I do, cuda31 (to keep the cuda toolkit/SDK version visible) **in your own**
home directory. Then you can modify the sdk programs in, say, the
/home/your-name/cuda31/C/src/ directory at will, and use their makefiles, which
put the demos or your own program executables in the /home/your-name/cuda31/C/bin/linux/release/ directory

* * *
It seems that all the sdk demos are working properly on all cards, on both
cudak4 and cudak5 now!

[I can see temperatures in the nvidia-settings utility on k4 now, but not yet on k5;
I'll try rebooting. In any case, I have stress-tested k4 and it works OK, and it no longer gets stuck in the high-temp state.]

I'm curious about the CPU speed. The Fedora 12 utility shows it remaining in the slow/cool state, 1.6(?) GHz, even when running multiple cuda examples. I'd think it should switch to the max ~2.86 GHz. Could that affect any benchmark results?

[from Josh]

I have sent an email with my code in it. It is written in C/C++ and CUDA C. It is almost working; what it really needs is someone who knows more about CUDA to take a look at it. Pawel knows more than any of us on the subject, I'm sure. Anyway, I won't be able to work on it at my camp, so I'm leaving it to you. There are 3 days left and I've completed 99% of the code. All that really needs to be done is to adjust the way some things are written to avoid some memory bugs that won't allow it to run properly (I get through about 5 time steps, then there's an error). I will post what the problems are and the layout of my code to make things easier for you.

1. particle.h defines all the constants and structs (CUDA C only deals with structs, not classes).
2. 3Ddisplay.h contains the visualization, which I have debugged completely.
3. movPos.h contains the calling of the kernels and the memcpy from device to host.
4. CUDA_pSim.cu contains the main function; the setup resides in the runSim function.
5. CUDA_pSim_kernel.cu contains the kernels.

Things that need to be checked over: 3,4,5
I think the problem may stem from how I have defined the tau_set. I compressed a 3D array into a 1D array: since the array has dimensions theta_s x phi_s x r_s, the 1D array runs over [0 .. theta_s*phi_s*r_s - 1], where each vertical slice of the shell starts at the bottom, at (theta=0, phi=-pi/2, r=0). The easiest thing to do here would be for someone who understands CUDA better to implement a 3D array and simplify things greatly.
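A sketch of that flattening (index names are placeholders, not the ones in my code):

/* map (it, ip, ir) in a theta_s x phi_s x r_s grid to one 1-D index;
   runs over [0 .. theta_s*phi_s*r_s - 1], with r varying fastest */
__device__ int tauIndex(int it, int ip, int ir, int phi_s, int r_s)
{
    return (it * phi_s + ip) * r_s + ir;
}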

The second kernel needs to place an individual particle in the corresponding tau cell, and I had calculated the index to be (int(theta/dtheta) % theta_s)*phi_s*r_s + (int(phi/dphi) % phi_s)*r_s + (int(r/dr) % r_s), based on theta, phi, r of the particle, the tau cell spacing dtheta, dphi, dr, and the total number of pieces in each dimension, theta_s, phi_s, r_s.

However, this did not work, since obviously a negative number is achievable, and that caused an error. Also, it would be good if someone could think of a cheap way of reducing theta and phi to a lower angle once they exceed 2pi or pi/2, respectively.
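One cheap way to do that wrapping (a suggestion, not something already in the code; assumes math.h's M_PI), including the pole crossing, where phi reflects and theta rotates by pi:

__device__ void wrapAngles(float *theta, float *phi)
{
    if (*phi >  0.5f * M_PI) { *phi =  M_PI - *phi; *theta += M_PI; }
    if (*phi < -0.5f * M_PI) { *phi = -M_PI - *phi; *theta += M_PI; }
    *theta = fmodf(*theta, 2.0f * M_PI);       // fold into (-2pi, 2pi)
    if (*theta < 0.0f) *theta += 2.0f * M_PI;  // then into [0, 2pi)
}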

If you implement a 3D array then you must modify the kernels so that every loop iteration in movPos calls the first kernel, which zeros the whole tau grid. Then the second kernel must be called, which counts the number of particles in each cell and saves the count in the respective tau cell of your array. Then the final kernel may be called, which integrates the particles by calculating tau and using the beta equation. (Note that successive kernel launches in one stream already run in order, so the three kernels are synchronized with each other; __syncthreads() only synchronizes threads within a single block.)

Everything is well commented, and if worst comes to worst and you just want to display CUDA working, you can try to remove all aspects of tau and simply show an N-body simulation. That would mean the only array is a 1D particle array, and I already know that my integration code works from my previous version.

Also, to compile my code you must specify in linux "-lm -lglut -lglu -lcudart -lcutil". You can most likely just use a makefile from another project that has openGL enabled.

I'm very sorry I won't be able to help out over the next few days. I wanted very much to be at the presentation, and had planned on it until the date changed (after I had already made plans with my family).

I wish you all good luck. I will check my email in the next few hours before I leave, to see if any of you respond.

I will also send contact info so that you can reach me and ask questions about my code if you choose to try to fix it. I think only someone good at CUDA can fix the issue, but I think you all should try.

Josh,

Friday, June 25, 2010

[from Pawel] new driver/SDK on cudak5

we have the 195.36.31 driver (stable, new) on cudak5.
The previous one, 256.25, was a beta version that we needed in order to compile anything
with the gcc-3.4 compiler, but it gradually showed some serious problems,
like not seeing chosen devices and putting them in high-gear mode (high T) permanently.

Josh installed the appropriate toolkit with SDK and it now works ~ok!!
I've tested every SDK demo on all cards and the result is that we no longer have
any problems with multi-GPU!

the problems you'll encounter running the nvidia-settings utility & examples in
/home/student/NVIDIA_GPU_Computing_SDK/C/bin/linux/release

are restricted to the following programs:
====================================================

nvidia-settings does not show:
Thermal Settings
PowerMizer
DFP-1(20 - GeForce GTX 480)

for devices 1 and 2.
To see temperatures you must say
% nvidia-smi -q -a
which prints correct temperatures. I've been able to generate T=90-94C while running
3 instances of the nbody test on 3 cards; that's normal.
=======================================================================================
printout from ./deviceQuery
(..)
Concurrent copy and execution: Yes
Run time limit on kernels: No
(...)

and yet...

[student@cudak5 release]$ ./concurrentKernels
[concurrentKernels] - Starting...

CUDA Device GeForce GTX 480 has 15 Multi-Processors
CUDA Device GeForce GTX 480 is NOT capable of concurrent kernel execution
concurrentKernels.cu(111) : cudaSafeCall() Runtime API error : unspecified launch failure.
[student@cudak5 release]$
======================================================================================

[student@cudak5 release]$ ./fluidsGL
[fluidsGL] - [OpenGL/CUDA simulation]
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL_kernels.cu(44) : cutilCheckMsg() CUTIL CUDA error : cudaMemcpy failed : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$

[student@cudak5 release]$
[student@cudak5 release]$ ./fluidsGL --device=2
[fluidsGL] - [OpenGL/CUDA simulation]
Using device 2: GeForce GTX 480
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL.cu(183) : cudaSafeCall() Runtime API error : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$
[student@cudak5 release]$
[student@cudak5 release]$
[student@cudak5 release]$ ./fluidsGL --device=1
[fluidsGL] - [OpenGL/CUDA simulation]
Using device 1: GeForce GTX 480
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL.cu(183) : cudaSafeCall() Runtime API error : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$
[student@cudak5 release]$


====================================================================

Those who are close to producing working CUDA programs, please do so on cudak5 in the student account. Don't worry (by necessity) about infiniband.
While having no internode comm looks moderately bad, having no CUDA at all will definitely look bad during our 29 Jun meeting.

We have very little time... full-time engagement is expected now through the end of the month.

Thursday, June 24, 2010

[from Josh] CUDAizing progress report

I would say I'm about half way done the CUDA conversion. Anyway, this is my scheme for now, since it seems the simplest given the time we have. I have a single copy of the tau grid on the device and a copy of the particle array on both host and device. Once I initialize the data on the host, I send it to the device and call a single kernel. The kernel executes the force command and integrates a single time step. I only retrieve data from the device at intervals of modulus N. In between kernel calls the threads have to be synced so that the particle array remains consistent. The only type of memory I am currently using is global memory for the two arrays; all other data is passed as params.

After I get this first version working I would like to make it work on multiple devices (which requires memory sharing and work-load splitting), and have groups of particles in the same tau shell use shared memory to hold the properties of that slice, doing particles in order by slice instead of chronologically. This would have two benefits: shared memory is very fast (it is the on-chip memory that thread blocks can see and use fastest; it is very limited in size, so only small usages are applicable), and doing particles by slice would avoid retrieving tau multiple times (wasted time).

Anyway, I think I'll be able to make it to the meeting tomorrow, but I may spend the time finishing my code, since I'm leaving tomorrow for my cottage and need to finish the code before the 29th. I'll see if I can get my code running on a single device on c4 or 5 before I leave.
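A rough sketch of that driver loop (all names are placeholders; stepKernel and Particle stand in for the real ones):

__global__ void stepKernel(Particle *p, float *tau, float dt, int n); // the real kernel

void runSim(Particle *h_part, int nPart, int nCells,
            float dt, int nSteps, int N)
{
    Particle *d_part;  float *d_tau;
    cudaMalloc((void**)&d_part, nPart * sizeof(Particle));
    cudaMalloc((void**)&d_tau,  nCells * sizeof(float));
    cudaMemcpy(d_part, h_part, nPart * sizeof(Particle),
               cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (nPart + threads - 1) / threads;
    for (int step = 0; step < nSteps; step++) {
        // launches in one stream run in order, so each step sees the
        // completed previous step without extra host-side synchronization
        stepKernel<<<blocks, threads>>>(d_part, d_tau, dt, nPart);
        if (step % N == 0)              // retrieve data only occasionally
            cudaMemcpy(h_part, d_part, nPart * sizeof(Particle),
                       cudaMemcpyDeviceToHost);
    }
    cudaFree(d_part);  cudaFree(d_tau);
}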

Wednesday, June 23, 2010

[from Anthony] Brief Update

I should explain what I've been doing. For the past several weeks I've been reading quite a lot of papers on modal analysis, going through their derivations of the modal equation (specifically Goldreich & Tremaine 1979, de Val-Borro et al. 2008, Laughlin & Bodenheimer 1994, ...and more). We mainly stayed with the Goldreich & Tremaine (GT79) derivation, where the modal equation is...
...where...
Phi_1 is the total perturbed potential, phi_1 can be thought of as the perturbed gravitational potential (self-gravity, external potential, etc.), and eta is the perturbed enthalpy.

Now, with radiation pressure, the modal equation becomes...

Sorry if it's too small... I can't seem to make it bigger. Essentially, this equation is a second-order ODE of the form...
where...
Note that the term I is also in terms of eta, so we believe this equation can be thought of as a third-order (homogeneous?) equation. And obviously the coefficients we're dealing with are non-constant, with derivatives all over the place.

So far, I tried to code this type of ODE (assuming the coefficients are constant) via leapfrog, and I got something like this...
...like a decaying free vibration... which is expected. Also remember that equations like these typically have two solutions (if you solve the characteristic equation you get a few roots, real or imaginary).
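For reference, the kind of constant-coefficient stand-in meant here (a hypothetical test, not the actual code) steps y'' + c*y' + k*y = 0 with leapfrog and watches the oscillation decay:

#include <stdio.h>

int main(void)
{
    double y = 1.0, v = 0.0;            /* initial displacement, velocity */
    double c = 0.1, k = 1.0, h = 0.01;  /* damping, stiffness, time step */
    for (int n = 0; n <= 10000; n++) {
        if (n % 500 == 0) printf("%g %g\n", n*h, y);
        double a  = -c*v - k*y;         /* acceleration at start of step */
        double vh = v + 0.5*h*a;        /* half kick */
        y += h*vh;                      /* drift */
        a  = -c*vh - k*y;               /* end-of-step acceleration */
        v  = vh + 0.5*h*a;              /* half kick */
    }
    return 0;
}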

So now, I'm trying to figure out how to code this equation with all the non-constant coefficients included.

Tuesday, June 22, 2010

[from James]

I've discovered that my integrator isn't working properly; I've tried three different formulations of the leapfrog method, with different initial conditions, but with all of them my error keeps growing. Also, the program gradually slows down, even with only 1 particle, and after about 50,000 force calculations it slows to a crawl (and the relative error climbs close to unity).

leapfrog version 1:

R = r + 0.5*h*v # drift for 1/2 step
T = theta + 0.5*h*omega # drift for 1/2 step

dv = R*omega*omega - GM/(R*R) # acceleration in r
dw = -2.0*omega*v/R # acceleration in theta

v = v + h*dv # kick for full step
omega = omega + h*dw # kick for full step

r = R + 0.5*h*v # drift for remaining 1/2 step
theta = T + 0.5*h*omega # drift for remaining 1/2 step

dE = h*(GM*v/(r*r) + v*dv + r*v*omega*omega + r*r*omega*dw)
error = log(abs(dE/E),10)

Version 2 (this one does an initial 1/2 kick before the loop starts):

# incremental acceleration
dv = L*L/(r*r*r) - GM/(r*r)
dw = -2.0*omega*v/r

# drift for full step
r = r + h*v
theta = theta + h*omega

# Kick for full step - note that because of the initial 1/2 kick, this
# will put the velocities 1/2 step ahead of the positions
v = v + h*dv
omega = omega + h*dw

# compute total energy, as a check:
E2 = 0.5*(v*v + r*r*omega*omega) - GM/r
error = log(abs((E-E2)/E),10)


Version 3 (this one modifies the equations to allow full time steps over
the same time interval, but is mathematically equivalent to the other
versions):

dv1 = L*L/(r*r*r) - GM/(r*r) # incremental acceleration in r at t
dw1 = -2.0*omega*v/r

r = r + h*v + 0.5*h*h*dv1
theta = theta + h*omega + 0.5*h*h*dw1

dv2 = L*L/(r*r*r) - GM/(r*r) # incremental acceleration in r at t+dt
dw2 = -2.0*omega*v/r

v = v + 0.5*h*(dv1 + dv2)
omega = omega + 0.5*h*(dw1 + dw2)

E2 = 0.5*(v*v + r*r*omega*omega) - GM/r
error=log(abs((E-E2)/E),10)


You probably noticed that I calculated the error in total energy differently in the first method compared to the second two; the first one calculates dE and then the error as dE/E; the second way is to calculate E2 based on the updated positions and velocities and then the error as (E-E2)/E. Both methods should yield similar errors.
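One possible culprit worth checking: plain leapfrog is only symplectic when the acceleration depends on position alone, and all three versions above feed the velocity-dependent term -2.0*omega*v/r into the kick. A Cartesian formulation avoids velocity-dependent accelerations entirely; a minimal sketch (hypothetical code, not one of the three versions) of the same Kepler test:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double GM = 1.0, h = 1e-3;
    double x = 1.0, y = 0.0, vx = 0.0, vy = 1.0;   /* circular orbit */
    double E0 = 0.5*(vx*vx + vy*vy) - GM/sqrt(x*x + y*y);

    for (long n = 0; n < 100000; n++) {
        double r3 = pow(x*x + y*y, 1.5);
        vx -= 0.5*h*GM*x/r3;  vy -= 0.5*h*GM*y/r3;  /* half kick */
        x  += h*vx;           y  += h*vy;           /* drift */
        r3  = pow(x*x + y*y, 1.5);
        vx -= 0.5*h*GM*x/r3;  vy -= 0.5*h*GM*y/r3;  /* half kick */
    }
    double E = 0.5*(vx*vx + vy*vy) - GM/sqrt(x*x + y*y);
    printf("relative energy error: %g\n", fabs((E - E0)/E0)); /* stays bounded */
    return 0;
}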

[from Pawel] Mail problem

Today I have a strange mail problem - maybe it's some campus upgrade gone wrong. Anyway, what I was going to mention is that we're in our final 1-1.5 weeks of work and that means: switch to panic mode. I expect progress and problem reports on this blog from all of you almost every day! I'd much prefer the former :-)

The St George campus will be closed from Thu, so we'll meet on Fri at UTSC,
and on Sat at my place. We'll of course spend some time with those of you who haven't been to Buttonville, within a 50 km radius from there >->--

The meeting is 29 June, 2pm, in AA bldg rm 401 or something like that (the Vice-Principal Research room on the 4th floor). Those who can attend, say so!

There are many things which don't work right now.
cudak4/5 don't have:
1. fixed IP addresses,
2. a working set of driver + SDK which sees 3 fermi cards properly (I installed driver 195.36.31, which was released days ago, but that doesn't work with the more experimental 256.25 (or some such)
that we had before; I didn't have time to solve that and see if all cards are usable with the new recommended driver).
3. device 2 on cudak5 overheated; it was 78C at idle, instead of the normal 56C.
The reason was that it was in the high-gear state all the time (max clock speed on gpu and mem).
After the driver installation the toolkit failed to install, so I don't yet know if that problem went away.
nvidia-settings showed that 1 of the 3 devices (0,1,2) has a different version of
VBIOS; however, it was dev 1, not 2, so it's a false hint.
4. even though most SDK demos ran correctly on dev 0 and 1, there was a disturbingly long startup time, as opposed to the instantaneous load on old cuda setups. dev 2 was also hanging for a looong time before generating the Cuda Mem Alloc Error or Unspecified Launch Error.
This is some kind of software/driver problem. Robert swapped 2 cards and
dev 2 was always the one having problems (confusingly, it's dev 1 that has problems on cudak4).

Any help in solving these fascinating problems will be appreciated.
Maybe I'll go to UTSC today..

keep up the good work

Pawel

Monday, June 21, 2010

[from Jeffrey] hydrocode

Code works now. Yay.

This is a shock tube example done in 2d. The resolution is 500 by 500, and the Courant number is 0.5.



Now onto cuda-izing.

[from Josh] progress update

Google erased my last post so I'm posting this again in a hurried fashion.
I finished my tau simulation.
I used the same format as the previous xyz simulation that I did,
and spherical coords with the r.. I derived in a previous post,
where for the radial acceleration I use F = Fgrav*(1 - beta*exp(-tau)).
tau is calculated by
tau = gamma*(A_*)*sum[from i=0, in increments of dr, up to i=r](N_i/A(i))
where gamma is a constant defined by
gamma = -log(I/I0) (for an individual particle),
N_i is the number of particles at radius i (in the slice), and
A(i) is the surface area of the section at radius r, defined as
(r^2 + r*dr + 0.5*dr^2)*dtheta*dphi
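In code the sum looks something like this (a sketch with placeholder names, not the actual implementation); N[i] holds the particle count in shell i of one (theta,phi) column:

float opticalDepth(const float *N, int ir, float dr,
                   float dtheta, float dphi, float gam, float Astar)
{
    float sum = 0.0f;
    for (int i = 0; i <= ir; i++) {   /* shells from the star out to the particle */
        float r = i * dr;
        float A = (r*r + r*dr + 0.5f*dr*dr) * dtheta * dphi;  /* A(i) above */
        sum += N[i] / A;
    }
    return gam * Astar * sum;         /* tau = gamma * A_* * sum(N_i/A(i)) */
}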

The program is slow because I hurried to finish it, and I will need to spend some time optimizing. But now I need to convert it to CUDA C by Thursday, since I plan to leave for my cottage on Friday.

EDIT: again, I just realized that nothing is visible at Google's downscaled resolution, so I'll post on YouTube when I get a chance and post a link.

Wednesday, June 16, 2010

[from Josh] r.. (..=double derivative) in r,theta,phi coord system

The coord system I use, along with r.. in spherical coords. I'm nearly positive it's accurate; it was very messy in the rough, but I think I caught everything and simplified as far as can be done.




Tuesday, June 15, 2010

[from Josh] r.. theta.. and phi.. (..=double derivative) in x,y,z coord system


When using spherical coords (r,theta,phi)
where r is the radius from center,
theta is the angle from the x axis (0..2pi),
and phi is the angle from the x-y plane (-pi/2..pi/2):

x = r*cos(theta)*cos(phi)
y = r*sin(theta)*cos(phi)
z = r*sin(phi)

In reverse:

r = sqrt(x^2+y^2+z^2)
theta = arctan(y/x)
phi = arctan(z/sqrt(x^2+y^2))
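(A side note for the code: arctan(y/x) only returns angles in (-pi/2, pi/2), so the quadrant must be fixed by hand; the standard atan2 form handles it automatically. A small sketch, not taken from the simulation code:)

#include <math.h>

void cart2sph(double x, double y, double z,
              double *r, double *theta, double *phi)
{
    *r     = sqrt(x*x + y*y + z*z);
    *theta = atan2(y, x);                /* full circle, (-pi, pi] */
    *phi   = atan2(z, sqrt(x*x + y*y));  /* angle from x-y plane, [-pi/2, pi/2] */
}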

The following time derivatives were derived using the above three equations for
r, theta, and phi. I tried to make them as legible as possible;
let me know if you would like a neater copy and I'll post a picture:


The first time derivative of r(t) is:

\dot r = \frac{x \dot x + y \dot y + z \dot z}{(x^2 + y^2 + z^2)^{1/2}}

The second time derivative of r(t) is:

\ddot r = \frac{\dot x^2 + x \ddot x + \dot y^2 + y \ddot y + \dot z^2 + z \ddot z}{(x^2 + y^2 + z^2)^{1/2}} - \frac{(x \dot x + y \dot y + z \dot z)^2}{(x^2 + y^2 + z^2)^{3/2}}

The first time derivative of theta is:

\dot\theta = \frac{x \dot y - y \dot x}{x^2 + y^2}

The second time derivative of theta is:

\ddot\theta = \frac{x \ddot y - y \ddot x}{x^2 + y^2} - \frac{(x \dot y - y \dot x)(2 x \dot x + 2 y \dot y)}{(x^2 + y^2)^2}

The first time derivative of phi is (writing \rho = (x^2 + y^2)^{1/2} and R = (x^2 + y^2 + z^2)^{1/2}):

\dot\phi = \frac{\dot z \rho^2 - z (x \dot x + y \dot y)}{\rho R^2}

The second time derivative of phi is (might as well shoot yourself now); with \dot\rho = (x \dot x + y \dot y)/\rho, \ddot\rho = (\dot x^2 + \dot y^2 + x \ddot x + y \ddot y - \dot\rho^2)/\rho and \dot R = (x \dot x + y \dot y + z \dot z)/R, it collapses to:

\ddot\phi = \frac{\ddot z \rho - z \ddot\rho}{R^2} - \frac{2 (\dot z \rho - z \dot\rho) \dot R}{R^3}

Happy Coding,

Josh,

[from Jeffrey] mathgl on k3

The mathgl library is now available on k3 for everyone to use. Here is a simple example for it:

//////////////////////////////////////////////////////

#include </usr/local/include/mgl/mgl_zb.h>
#include <iostream> // for cout
using namespace std;

void image(char *fname){

int imax=60;
int jmax=60;
mglGraph *gr = new mglGraphZB;

mglData dat(imax,jmax);
//dat is declared as 2d with dimensions imax by jmax.

for (int i=0; i<imax; i++){
for (int j=0; j<jmax; j++){
dat.a[i+imax*j]={insert data to array};
}
}

//double *a is a public variable in
//the mglData class where data is
//saved as a 1d array.

//similarly for 3d plotting, dat can be declared as:
//
//mglData dat(imax,jmax,kmax);
//
//and each number in the array can be accessed through:
//
//for (int i=0; i<imax; i++){
// for (int j=0; j<jmax; j++){
// for (int k=0; k<kmax; k++){
// dat.a[i+imax*(j+jmax*k)]={insert data to array};
// }
// }
//}

gr->Box();
gr->Dens(dat);
gr->Colorbar();

cout << fname << " is being saved." << endl;
gr->WritePNG(fname);
delete gr; // free the graph object

return;
}

/////////////////////////////////////////////////////////

The function above produces a 2d density plot which is saved as a png file. For fancier plots, check out the mathgl manual. Have fun!

In other news, my PPM sweep is not working quite right, for unknown reasons. More debugging for me.

Monday, June 14, 2010

[from Pawel]

Yes, these things are challenging. Last Friday(?) Jeffrey and I played around with the SDK demos on cudak4.

Some don't work (notably fluidsGL!), and with others we've had system freeze-ups. :-| They may not be related to the SDK demos, however; it could be a thermal instability(?) of something on the mobo.

We need to investigate - hardware team, can you spend some time running things on cudak4, e.g. multiple SDK demos, to heat up the box and see if this occurs again? Before that, please install the additional 14 cm fans (one in the vertical position behind the harddisk compartment). The fans should snap into the black plastic frames, which attach to the bottom of the case with 3 screws, I believe.
* * *

Next thing - networking with infiniband; we need to install drivers.

* * *
NOTICE!!
I've rescheduled the 24 June meeting to 29 June, rm AA401 UTSC, at 10-11am.
Time is very short and we must spend full time working on this now, OK?

I'd like us to deliver the working pair of nodes, with our software demos and networking,
by the end of this month. If we fail to deliver any working, useful programs, some of us may see a pay cut. :-(
cheers

P

Friday, June 11, 2010

[From Robert] Installing the hardware

Our supercomputer hardware set-up proved to be a more challenging step than we originally planned for.
Custom-built for high performance, each computer consists of:
1 Thermaltake VH6000BWS tower case
1 Asus P6T7 WS Supercomputer motherboard
1 Intel i7 930 2.8GHz processor
6 OCZ 2GB DDR3-1333MHz Triple Channel memory giving us 12GB in total
1 Western Digital 2TB Green Caviar
3 EVGA Nvidia GTX 480 with 1.5GB Video memory
1 LG GH22N S50 DVD Writer
1 Silverstone Strider ST1500 1.5kW power supply
1 Infiniband InfiniHost III Lx Single-Port MHGS18 20Gb/s card

There were numerous challenges when putting the parts together.
Our first speed bump happened when our newly built system would not recognize more than 8GB of RAM, and at other times no more than 2GB. It was a very peculiar error, as the amount the computer saw changed randomly. Moving the RAM sticks around did not help, as the machine sometimes would not even start up. After confirming that it was not the memory modules bringing the machine down (by verifying them in the other computer), we determined that it was the motherboard.
A repair order was placed, but unfortunately the repair was said to take 1-2 weeks, which we could not afford, and hence another motherboard was purchased.
Our second speed bump occurred when the replacement motherboard arrived. After assembling the system, the computer would not turn on. After diagnosing and checking everything over numerous times, we decided that this new motherboard was also defective. It too was sent in for repair, though fortunately our first motherboard had by then been 'fixed' and returned to us.
After receiving the motherboard back, the system was built again, and fortunately this time everything worked flawlessly.

More updates to come...

Wednesday, June 9, 2010

[from Josh] Progress Report



I would like to say I'm almost done my CUDA conversion process, but who knows with debugging. My best guess is that I'll be done in 4 days, so by Monday. It should include a CPU version and a working CUDA version of a tau-modeling polar-coord simulation.

Also, James and I were discussing how an adaptive leapfrog scheme might be possible, and we came to the conclusion that we would have to adjust the time stepping globally to do it. Individual time-step adjustment would lead to a skewed time system, since some particles would not update in synchronous time with all the rest (causing undesired shadow effects).

Josh,

Monday, June 7, 2010

see how it flies

Here's a link to a book by a physicist on flying; it's actually the best I've seen anywhere, as it concentrates on the physical side of things. It's complex reading at times if you haven't encountered aerodynamics before, and therefore it's interesting:

http://www.av8n.com/how/

* * *

please report difficulties and successes in blog entries. Ask for help if you need to.
By now I'd think everybody's got the cpu side working and is deciphering the cuda stuff.

time is T-17 days until we must be ready with our testbed system running cuda examples, and we're not even half way :-|

the infiniband switch is somewhere near - canada customs is already asking for money..
[edit: it's in the UTSC shipping dept!]