Tuesday, June 29, 2010
Robert, James and Pawel spent a lot of time on Monday working on
(i) making InfiniBand work concurrently with Ethernet, and installing all the Open MPI tests. A lot of success with that, but the tests aren't finished yet; a lot of software setup and configuration remains to be done. It looks like we'll get our 20 Gbps!
(ii) simulation of particles.
We decided to skip the tau-effects and concentrate on benchmarking our naive, barely optimized code. After conquering the 512-particles-only problem in the old code, we can now run up to 30 million particles. Around 10 pm we even started doing some "galaxy" simulations by adding external forces. We assumed a non-rotating, weakly barred force.
We benchmarked the compute part of the code at roughly 100 GFLOP/s.
This is approximately what we'd expect from a problem limited by global memory bandwidth:
the card can stream about 120 GB/s from GDDR5 memory to the GPU, which at 4 bytes per float is 30 Gfloat/s.
At 100 GFLOP/s, that means roughly 100/30 ~ 3.3 FLOP are done on every float fetched from RAM, so we are compute/bandwidth balanced. Our leapfrog integration is probably limited by bandwidth, although the GPU heats up to 93 C at times, and the fan becomes a hair-dryer.
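For the curious, here is a minimal sketch of what a bandwidth-bound leapfrog step looks like in CUDA C (illustrative names, not our actual code; it assumes the accelerations were computed by a separate force kernel). Each thread streams one particle's state from global memory, does a handful of FLOPs, and writes it back, which is why memory traffic rather than arithmetic sets the pace:

__global__ void leapfrog_step(float *x, float *y, float *z,
                              float *vx, float *vy, float *vz,
                              const float *ax, const float *ay, const float *az,
                              int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    vx[i] += ax[i] * dt;   /* kick: update velocities */
    vy[i] += ay[i] * dt;
    vz[i] += az[i] * dt;
    x[i] += vx[i] * dt;    /* drift: update positions */
    y[i] += vy[i] * dt;
    z[i] += vz[i] * dt;
}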
Sunday, June 27, 2010
[from Pawel] Josh's code
Josh has left us a CUDA proto-code, so we had our work-party gathering this Sunday. Only Pawel, Jeffrey, Anthony and James were available to enjoy the BBQ'd fish (basa = panga), but there was (i) no flying, because of the Buttonville airport lockdown to please 20 alien leaders, and (ii) no swimming either, because it started raining and we preferred to stay under the umbrella.
The wrong forecast must be due to the lack of CUDA.
Those of us who had looked into the code told the rest how impossible it was to compile on their computers. Anthony had to leave, and the 3 remaining Magnificent retreated upstairs and started up cudak2 (yes, a clandestine copy of cudak1 does exist).
After an hour or two we figured out how to compile Josh's code:
cudak2[152]:~...C/src/tau/CUDA_pSim$ nvcc CUDA_pSim.cu -I/usr/local/cuda/sdk2.3/C/common/inc/ -L/usr/local/cuda/sdk2.3/C/lib/ -lm -lglut -lcudart -lcutil
and after a few additional hours we figured out what the !@#$ it does, such that 500 particles started in a nice disk begin following some very strange orbits: essentially going up and falling onto the central star, then scattering at high speed to infinity.
OK, to make a long story short: we corrected the initial conditions and the equations of motion to the point where some of us (me) thought they now represented the correct, Binney-Tremaine-like equations, only written using Josh's reversed phi <--> theta convention, while some of us (Jeffrey) kept complaining about the signs in front of the sin(phi_Josh) terms. The code was still doing its weird thing, unless we forced it to simulate a 2-D, flat disk, where it worked uncomfortably slowly but apparently OK.
In desperation, we allowed Jeffrey to make the nonsense (-:) change of signs and... everything became as Newton intended: a stable 3-D disk. It turned out Josh's phi and theta were not the spherical-coordinate phi & theta; his phi was actually theta+pi/2, not pi/2-theta! So, given his (non-standard) definition of the meridional axis pointing downward rather than upward, our equations had two out of a dozen signs in front of acceleration terms wrong --> non-conservation of angular momentum --> trouble on a timescale of a few orbits.
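(A quick check of why only some signs flip, assuming the two conventions really are phi_Josh = theta + pi/2 versus pi/2 - theta: by the standard trig identities

sin(theta + pi/2) = cos(theta),   sin(pi/2 - theta) = cos(theta)
cos(theta + pi/2) = -sin(theta),  cos(pi/2 - theta) = +sin(theta)

the sine terms agree between the two conventions, while the cosine terms differ in sign, so only the handful of acceleration terms carrying a cosine of Josh's angle come out wrong.)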
We put the corrected code on cudak3 here:
cudak3[9]:~/cuda/projects/tau$
cudak3[10]:~/cuda/projects/tau$ ls
3Ddisplay.h CUDA_pSim.sln movPos.h
a.out CUDA_pSim_vc90.sln movPos.h~
CUDA_pSim.cu CUDA_pSim_vc90.suo particle.h
CUDA_pSim.cu~ CUDA_pSim_vc90.vcproj particle.h~
CUDA_pSim_gold.cpp CUDA_pSim_vc90.vcproj.Josh-PC.Josh.user vc90.pdb
CUDA_pSim_kernel.cu CUDA_pSim.vcproj
CUDA_pSim_kernel.cu~ forces.h
cudak3[11]:~/cuda/projects/tau$
cudak3[11]:~/cuda/projects/tau$ a.out
size of each particle is 24 bytes
Number of particles: 1000
Mass ratio (0 <>
minimum radius from solar mass (0.0 <>
Time step: .01
Npart=1000 u=1.000e-02 R_min=5.000e-01 dt=1.000e-02 h=2.500e-04 M=1.000e+05 G=1.000e-05
creating particle array with 1000 particles...
setting initial conditions for particle array...
creating optical thickness array with 4 slices on device...
optical thickness array created...
freeglut (a.out): Unable to create direct context rendering for window 'Particle Simulation'
This may hurt performance.
cudak3[12]:~/cuda/projects/tau$
* * *
Interestingly, we did get OpenGL output from cudak3 on our remote machine (cudak2). It was very slow, but it worked. Apparently, if you transfer data from the GPU device back to the host and then plot it with OpenGL (not the fastest way to do it), and you have an ssh -X connection to a CUDA-capable machine, you can get your graphics served over the internet. That's nice. All the SDK demos fail to do this, since they keep the frame buffer for plotting on the GPU running the calculation, and OpenGL extensions that would somehow replicate the data on a remote client are not available, so the demos print error messages and quit.
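A minimal sketch of this slow-but-remotable display path (hypothetical names, not Josh's actual code; assumes <GL/glut.h> and cuda_runtime.h): copy the positions back to the host every frame and draw them with plain OpenGL, which indirect rendering over ssh -X can handle:

void display_frame(float3 *d_pos, float3 *h_pos, int n)
{
    /* bring positions back from the GPU -- slow, but works remotely */
    cudaMemcpy(h_pos, d_pos, n * sizeof(float3), cudaMemcpyDeviceToHost);
    glClear(GL_COLOR_BUFFER_BIT);
    glBegin(GL_POINTS);
    for (int i = 0; i < n; i++)
        glVertex3f(h_pos[i].x, h_pos[i].y, h_pos[i].z);
    glEnd();
    glutSwapBuffers();
}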
* * *
Next step: why is Josh's code running only ~500 particles? (Anything more, and it refuses to move the particles: the display refreshes, but the particles don't evolve in time.)
{Oh! I think I know. Just as I was writing this, I think I realized: of course the max number must be 512, and 512 is the limit of threads per block:
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
this is of course the output from cudak2[6]:~...C/bin/linux/release$ deviceQuery on the GTX 280 cards}
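If that diagnosis is right, the standard fix is to spread the particles over a grid of blocks instead of a single 512-thread block. A minimal sketch of the launch (illustrative names, assuming a one-particle-per-thread kernel like Josh's):

int threads = 256;                                  /* anything <= 512 on the GTX 280 */
int blocks = (nParticles + threads - 1) / threads;  /* enough blocks to cover all particles */
moveParticles<<<blocks, threads>>>(d_particles, nParticles);

and inside the kernel each thread recovers its global particle index:

int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < nParticles) { /* ... integrate particle i ... */ }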
Saturday, June 26, 2010
[From Robert] Installing the software Part 1
Installing the software proved to be just as difficult to set up as the hardware, if not more so. The specifications on NVIDIA's website state that the CUDA software will run on (among many flavours of Linux) Fedora 10.
Because of this, we decided to work with version 10, 64-bit (even though the latest release was 13 at the time). However, due to trouble recognizing our new video cards (GTX 480), we were not able to load the Graphical User Interface (GUI) properly. This prompted us to install Fedora 11 64-bit, but there we ran into the main issue of the NVIDIA drivers not installing due to an incompatible kernel. After much frustration, we decided to take a chance with Fedora 12 64-bit. Sure enough, everything started to work out, and the NVIDIA and CUDA software installations were on their way.
---
First off, to install the NVIDIA drivers, we decided to use the Linux x86_64 Display Driver, Version 256.25 Beta.
Because the NVIDIA driver installer requires a non-X11 environment, the computer needs to exit X first. To do so, a terminal window was opened and we logged in as root:
# su
Then, the OS initialization file called 'inittab' was edited
# cd /etc/
Here, the last line of the file was edited from
id:5:initdefault:
to
id:3:initdefault:
The file was saved and the computer was rebooted. The computer then loaded the OS but presented a text-based log-in screen, due to the change made in the file. We logged in as root again and moved to the location of the NVIDIA driver to run the installation file. In our case:
# cd /Downloads/
# sh NVIDIA-Linux-x86_64-256.25.run
After installing, the inittab file needed to be reverted to 'id:5:initdefault:' so that the GUI would load on boot. Once saved, the computer was restarted.
--
After installing the NVIDIA drivers, the CUDA compiler needed to be installed. In our case:
# cd /Downloads/
# sudo sh cudatoolkit_3.0_linux_64_fedora10.run.sh
This was installed in the default path (/usr/local/cuda).
To make the compiler command (nvcc) work, the .bash_profile and .bashrc files (located in ~/) needed to be edited. Both PATH and LD_LIBRARY_PATH need to point to the CUDA installation:
#
PATH=/usr/local/cuda/bin:$PATH
LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH # on a 32-bit machine, use /usr/local/cuda/lib instead
export PATH LD_LIBRARY_PATH
#
For the .bashrc file, the following needed to be added:
#
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
#
The system then needed to be restarted for the changes to take place.
--
The only thing left was installing the SDK, which includes sample and benchmark programs as well as diagnostics for the hardware. In our case:
# cd /Downloads/
# sh gpucomputingsdk_3.0_linux.run.sh
This was installed in the default path (~/NVIDIA_GPU_Computing_SDK/), although it could have been installed in a non-user path as well.
Once this was installed, make had to be run on the SDK at ~/NVIDIA_GPU_Computing_SDK/C/
On our machine, make did not complete, complaining about " cannot find -lglut "
To fix this, we needed to install the glut packages:
# sudo yum install freeglut
and
# sudo yum install freeglut-devel
Once these were installed, make was run again and it passed, placing the SDK examples in ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/
From there, numerous executables are available, notably deviceQuery and bandwidthTest.
# ./bandwidthTest --device=all
[bandwidthTest]
./bandwidthTest Starting...
!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!
Running on...
Device 0: GeForce GTX 480
Device 1: GeForce GTX 480
Device 2: GeForce GTX 480
Quick Mode
Host to Device Bandwidth, 3 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5155.2
Device to Host Bandwidth, 3 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4267.7
Device to Device Bandwidth, 3 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 120670.9
[bandwidthTest] - Test results:
PASSED
# ./deviceQuery --all
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 3 devices supporting CUDA
Device 0: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1609760768 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Device 1: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1610285056 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Device 2: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1610285056 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4246847, CUDA Runtime Version = 3.00, NumDevs = 3, Device = GeForce GTX 480, Device = GeForce GTX 480
PASSED
At this point, the NVIDIA CUDA was installed and working.
To be continued...
[from Pawel] success with drivers, methinks
Well, it seems that some of our previous problems stemmed from not RTFM: we had mismatched versions of three things in the NVIDIA stack, all of which always need to be updated together.
in /home/student/Downloads/ on both cudak4 and cudak5 we now have the newest driver (newer than was available from nvidia download page a few days ago when I reported last)
NVIDIA-Linux-x86_64-256.35.run [version 256.35]
we also have a toolkit cudatoolkit_3.1_linux_64_fedora12.run, and finally
the SDK in gpucomputingsdk_3.1_linux.run
All these things except the last need to be installed by a superuser, like so:
# sh (file name with .run extension)
I think you should set up a directory, either NVIDIA_GPU_Computing_SDK/ or, like I do, cuda31 (to keep the CUDA toolkit/SDK version visible), **in your own** home directory. Then you can modify the SDK programs in, say, the /home/your-name/cuda31/C/src/ directory at will, and use their makefiles, which put the demos or your own program executables in the /home/your-name/cuda31/C/bin/linux/release/ directory.
* * *
It seems that all the SDK demos are now working properly on all cards, on both cudak4 and cudak5!
[I can see temperatures in the nvidia-settings utility on k4 now, but not yet on k5; I'll try rebooting. In any case, I have stress-tested k4 and it works OK; it also doesn't get stuck in the high-temperature state.]
I'm curious about the CPU speed. The Fedora 12 utility shows that it remains in the slow/cool state (1.6? GHz) even when running multiple CUDA examples. I'd think it should switch to the max ~2.86 GHz. Does that have any effect on benchmark results?
[from Josh]
I have sent an email with my code in it. It is written in C/C++ and CUDA C. It is almost working. What it really needs is someone who knows more about CUDA to take a look at it; Pawel knows more than any of us on the subject, I'm sure. Anyway, I won't be able to work on it at my camp, so I'm leaving it to you. There are 3 days left and I've completed 99% of the code. All that really needs to be done is to adjust the way some things are written, to avoid some memory bugs that won't allow it to run properly (I get through about 5 time steps, then there's an error). I will post what the problems are and the layout of my code to make things easier for you.
1. particle.h defines all the constants and structs (CUDA C only deals with structs, not classes).
2. 3ddisplay.h contains the visualization, which I have debugged completely.
3. movPos.h contains the calling of the kernels and the memcpy from device to host.
4. CUDA_pSim.cu contains the main function; the setup resides in the runSim function.
5. CUDA_pSim_kernel.cu contains the kernels.
Things that need to be checked over: 3,4,5
I think the problem may stem from how I have defined the tau_set. I compressed a 3D array into a 1D array. Since the array has dimensions theta_s x phi_s x r_s, the 1D array goes from [0 .. theta_s*phi_s*r_s - 1], where each vertical slice of the shell starts at the bottom, at (theta=0, phi=-pi/2, r=0). The easiest thing to do here would be for someone who understands CUDA better to implement a 3D array and simplify things greatly.
In the second kernel, each individual particle needs to be placed in the corresponding tau cell, and I had calculated the index to be (int(theta/dtheta)%theta_s)*phi_s*r_s + (int(phi/dphi)%phi_s)*r_s + (int(r/dr)%r_s), based on the particle's theta, phi, r, the tau cell spacing dtheta, dphi, dr, and the total number of pieces in each dimension, theta_s, phi_s, r_s.
However, this did not work, since a negative number is obviously achievable, and that caused an error. Also, it would be good if someone could think of a cheap way of reducing theta and phi to a lower angle once they exceed 2pi or pi/2, respectively.
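A minimal sketch of a negative-safe version of that index (our suggestion, not code from Josh's email). C's % operator keeps the sign of its argument, so the angles are wrapped into range first, which also handles the angle-reduction request above:

__device__ int tau_index(float theta, float phi, float r,
                         float dtheta, float dphi, float dr,
                         int theta_s, int phi_s, int r_s)
{
    const float TWO_PI = 6.2831853f;
    theta = fmodf(theta, TWO_PI);            /* cheap reduction of theta into [0, 2pi) */
    if (theta < 0.0f) theta += TWO_PI;
    float phi0 = phi + 0.5f * 3.14159265f;   /* shift phi from [-pi/2, pi/2] to [0, pi] */
    int it = min((int)(theta / dtheta), theta_s - 1);
    int ip = min((int)(phi0 / dphi), phi_s - 1);
    int ir = min((int)(r / dr), r_s - 1);
    return (it * phi_s + ip) * r_s + ir;     /* same theta-major layout as above */
}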
If you implement a 3D array, then you must modify the kernels so that every loop iteration in movPos calls the first kernel, which zeros the whole tau grid. Then the second kernel must be called (after synchronizing), which counts the number of particles in each cell and saves the count in the respective tau cell of your array. Then the final kernel may be called, again after synchronizing, which integrates the particles by calculating tau and using the beta equation. (A host-side sketch of this sequence follows below.)
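A hedged host-side sketch of that three-kernel sequence (illustrative names, not the actual code). One caveat: __syncthreads() only synchronizes threads within a single block; the grid-wide ordering this scheme needs comes for free from launching the kernels back-to-back on the same stream, which serializes them:

for (int step = 0; step < nSteps; step++) {
    zero_tau_grid<<<gridTau, blockTau>>>(d_tau, nCells);
    count_particles<<<gridP, blockP>>>(d_tau, d_particles, nParticles);
    integrate<<<gridP, blockP>>>(d_tau, d_particles, nParticles, dt);
    if (step % N == 0)    /* retrieve a snapshot every N steps */
        cudaMemcpy(h_particles, d_particles,
                   nParticles * sizeof(particle), cudaMemcpyDeviceToHost);
}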
Everything is well commented, and if worst comes to worst and you just want to display something in CUDA, you can try removing all aspects of tau and simply showing an N-body simulation. That would mean the only array is a 1D particle array, and I already know from my previous version that my integration code works.
Also, to compile my code on Linux you must specify "-lm -lglut -lglu -lcudart -lcutil". You can most likely just use a makefile from another project that has OpenGL enabled.
I'm very sorry that I won't be able to help out over the next few days. I very much wanted to be at the presentation, and had planned on it until the date changed (after I had already made plans with my family).
I wish you all good luck. I will check my email in the next few hours before I leave, to see if any of you respond.
I will also send contact info so that you can reach me and ask questions about my code if you choose to try to fix it. I think only someone good at CUDA can fix the issue, but I think you all should try.
Josh,
Friday, June 25, 2010
[from Pawel] new driver/SDK on cudak5
We now have the 195.36.31 driver (stable, new) on cudak5.
The previous one, 256.25, was a beta version that we needed in order to compile anything with the gcc-3.4 compiler, but it gradually showed some serious problems, like not seeing chosen devices and leaving them permanently in high-gear mode (high T).
Josh installed the appropriate toolkit with SDK, and it now works ~OK!!
I've tested every SDK demo on all cards, and the result is that we no longer have any problems with multi-GPU!
The problems you'll encounter running the nvidia-settings utility & the examples in
/home/student/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
are restricted to the following programs:
====================================================
nvidia-settings does not show:
Thermal Settings
PowerMizer
DFP-1(20 - GeForce GTX 480)
for devices 1 and 2.
To see temperatures you must say
% nvidia-smi -q -a
which prints correct temperatures. I've been able to generate T = 90-94 C while running 3 instances of the nbody test on 3 cards. It's normal.
=======================================================================================
printout from ./deviceQuery
(..)
Concurrent copy and execution: Yes
Run time limit on kernels: No
(...)
and yet...
[student@cudak5 release]$ ./concurrentKernels
[concurrentKernels] - Starting...
CUDA Device GeForce GTX 480 has 15 Multi-Processors
CUDA Device GeForce GTX 480 is NOT capable of concurrent kernel execution
concurrentKernels.cu(111) : cudaSafeCall() Runtime API error : unspecified launch failure.
[student@cudak5 release]$
======================================================================================
[student@cudak5 release]$ ./fluidsGL
[fluidsGL] - [OpenGL/CUDA simulation]
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL_kernels.cu(44) : cutilCheckMsg() CUTIL CUDA error : cudaMemcpy failed : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$
[student@cudak5 release]$
[student@cudak5 release]$ ./fluidsGL --device=2
[fluidsGL] - [OpenGL/CUDA simulation]
Using device 2: GeForce GTX 480
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL.cu(183) : cudaSafeCall() Runtime API error : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$
[student@cudak5 release]$
[student@cudak5 release]$
[student@cudak5 release]$ ./fluidsGL --device=1
[fluidsGL] - [OpenGL/CUDA simulation]
Using device 1: GeForce GTX 480
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL.cu(183) : cudaSafeCall() Runtime API error : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$
[student@cudak5 release]$
====================================================================
Those of you who are close to producing working CUDA programs, please do so on cudak5 in the student account. Don't worry (by necessity) about InfiniBand for now.
While having no internode comm will look moderately bad, not having any CUDA will definitely look bad during our June 29th meeting.
We have very little time... full-time engagement is expected from now to the end of the month.
Thursday, June 24, 2010
[from Josh] CUDAizing progress report
I would say I'm about half way done with the CUDA conversion. Anyway, this is my scheme for now, since it seems the simplest given the time we have. I have a single copy of the tau grid on the device, and a copy of the particle array on both host and device. Once I initialize the data on the host, I send it to the device and call a single kernel. The kernel executes the force computation and integrates a single time step. I only retrieve data from the device at intervals of modulus N. In between kernel calls I have to sync the threads so that the particle array remains synchronous. The only type of memory I am currently using is global memory, for the two arrays; all other data is passed as parameters.

After I get this first version working, I would like to make it work on multiple devices (which requires memory sharing and work-load splitting), have groups of particles in the same tau shell use page-locked memory for the properties of that slice, and process particles in order by slice instead of chronologically. This would have two benefits: page-locked memory is very fast (it is also called pinned memory, and it is very limited in size, so only small usages are applicable), and doing particles slice by slice would avoid retrieving tau multiple times (wasted time).

Anyway, I think I'll be able to make it to the meeting tomorrow; however, I may spend the time finishing my code instead, since I'm leaving tomorrow for my cottage and need to finish the code before the 29th. I'll see if I can get my code running on a single device on c4 or 5 before I leave.
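A minimal sketch of allocating the page-locked (pinned) host memory mentioned above (hypothetical helper, not from Josh's code). One caveat on terminology: pinned memory is host memory that speeds up cudaMemcpy DMA transfers; the fast on-chip memory a thread block shares is the separate __shared__ space:

#include <cuda_runtime.h>

void alloc_particle_buffers(float4 **h_p, float4 **d_p, int n)
{
    /* page-locked host buffer: faster transfers, but a scarce resource,
       so pin only what is copied back every frame */
    cudaHostAlloc((void **)h_p, n * sizeof(float4), cudaHostAllocDefault);
    /* matching device buffer in global memory */
    cudaMalloc((void **)d_p, n * sizeof(float4));
}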