We learn how to build NVIDIA GPU-based (super)computers, and how to network and program them for HPC (High Performance Computing) in the CUDA C language. We apply all this to astrophysical simulations related to forming planetary systems.
Sunday, August 1, 2010
[From Jeffrey] cuda-hydro speedup
I improved the CUDA code in nearly every way I know, so now is the moment of truth. How much faster is it?
16 times.
This is on k3, comparing the speed of my C code to my CUDA C code under exactly the same conditions.
This is somewhat slower than I had hoped. Probably I'm still doing something wrong.
Another disappointing thing is that the maximum resolution I can get is still just 732 by 732. If I push it to 1k by 1k, I get a segmentation fault that points to a memory shortage, which is confusing, since the program should not need that much memory.
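One thing worth ruling out first (my guess, not something I have confirmed on k3): a failed cudaMalloc or kernel launch whose error code is never read, so the failure only surfaces later as a segfault on a bad pointer. A minimal checking sketch; the macro name and the 1k-by-1k example allocation are mine:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* report any CUDA error at the point where it happens,
   instead of segfaulting later on an invalid pointer */
#define CUDA_CHECK(call)                                         \
  do {                                                           \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                    \
      fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
              __FILE__, __LINE__, cudaGetErrorString(err));      \
      exit(1);                                                   \
    }                                                            \
  } while (0)

int main(void)
{
    float *d_rho;
    /* a 1k x 1k grid of floats is only ~4 MB, so if this fails,
       suspect a wrong size argument rather than the card */
    CUDA_CHECK(cudaMalloc((void**)&d_rho, 1024*1024*sizeof(float)));
    /* after a kernel launch: */
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaThreadSynchronize());  /* CUDA 3.x-era sync call */
    cudaFree(d_rho);
    return 0;
}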
The next thing I plan to do is to run it on k4/k5. Sounds like a simple thing to do, but unfortunately k4 & k5 don't have mathgl installed yet. Guess I need to do that first.
Friday, July 30, 2010
[from Anthony] progress cont...
/* march the modal equations outward in r: first-order (Euler) steps
   updating tau_1, I, v and eta from grid point j to j+1 */
tau_1[j+1] = tau_1[j] + (1/(c[j]*c[j]))*v_tau[j]*eta[j]*dr;
I[j+1] = I[j] + f[j]*(((log(r[j+1]*sigma[j+1]*f[j+1]/D[j+1]) - log(r[j]*sigma[j]*f[j]/D[j]))/dr)
       + ((2*m*W[j])/(r[j]*WW[j])))*tau_1[j];
a[j] = -B[j]*v[j] - C[j]*eta[j] - I[j];
v[j+1] = v[j] + a[j]*dr;
eta[j+1] = eta[j] + v[j+1]*dr;
Thursday, July 29, 2010
[From Jeffrey] cuda-hydro
The resolution for this simulation is 732x732 (in case you wonder about such a strange number: 732 = 3*(256-12)). Approximately 10000 time steps were taken, and it took 2 hours on k3. The total duration in simulation time is approximately 2 crossing times.
This is definitely not the fastest it can get. Computations are currently done on only one block, which has a size of 256, so the simulation box is 3x3 = 9 times too large for the block. I should be able to use 9 blocks in parallel for this simulation, which would in principle give me another factor of 9 in speed. The only reason this is not done yet is that memory sharing between blocks is not trivial, and I didn't want to add complications before getting it to work at all. Now that I understand CUDA a little more, I already have a plan for the improvement, and it shouldn't be too difficult.
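For the record, here is the indexing I have in mind, as a minimal sketch (the kernel name, the 16x16 block shape, and the array name are illustrative, not the real code): each thread derives its global cell coordinates from its block and thread indices, so a grid of blocks tiles the whole 732x732 box, and halo cells between blocks are exchanged through global memory between kernel launches.

__global__ void update(float *u, int nx, int ny)
{
    /* global 2-D cell coordinates from block + thread indices */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;   /* guard the ragged edge */
    /* ... stencil update of u[i + nx*j] goes here ... */
}

void launch(float *d_u)
{
    /* tile 732x732 cells with 16x16-thread blocks */
    dim3 threads(16, 16);
    dim3 blocks((732 + 15) / 16, (732 + 15) / 16);
    update<<<blocks, threads>>>(d_u, 732, 732);
}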
Tuesday, July 20, 2010
[from Anthony] Progress report on modal analysis
Thursday, July 1, 2010
[from James] current status of dust-disk sim
For some reason, the program which ran just fine on cudak2 and cudak3 Sunday night wouldn't run on cudak4 or cudak5; the first two cudaMemcpy calls (from host to device) worked fine, but all subsequent cudaMemcpy calls, regardless of direction, threw a "wrong direction" error. Equally strange, commenting out everything tau-related made it work again.
The "taunot" folder contains the working simulation in which everything relating to the tau grid has been commented out (which is why, if you re-compile it, nvcc complains several times about nested comments). Also in this version, just for the fun of it we included a force term in the theta'' calculation to simulate a non-rotating bar. As mentioned previously, you can also ssh into cudak4 and run this sim remotely (although when I tried it from home the display was hideously slow, presumably because I have a slow internet connection and a 5-year-old computer).
The "tau" folder contains the non-working sim-in-progress in which the tau-grid is being re-implemented. After much discussion and debate (even some argument) on Tuesday we decided to re-implement tau using the following scheme:
- use two grids: an integer array "dtau_set" storing the number of particles in each grid cell; and a float array "tau_set" storing the value of tau in each cell (this will of course take up more memory but for now it seems that speed is more important)
- when the disk is initially generated, find out which cell each particle belongs to and increment that cell in the integer dtau_set
- start the simulation loop
- kernel 1: send one thread along each "ray" (solid angle) from r_min to r_max, calculating tau for each cell in the float array tau_set based on the data stored in the dtau_set integer array
- kernel 2: integration, one thread per particle: first figure out which cell the particle is in, retrieve the appropriate value of tau, and calculate beta; then calculate the accelerations and update the new positions and velocities; finally, re-calculate the particle's cell location, and if it has changed cells, decrement the old dtau_set location and increment the new dtau_set location
That's the plan, still in progress. Feel free to work on the code. Clearly this algorithm still isn't super-efficient: there are several calls to global memory, and some values get re-calculated more than once even though they haven't changed (most notably the particle's cell index). However, we have managed to limit ourselves to two kernels, and each thread is independent, with no need to invoke __syncthreads(). The presentation date has come and gone (with no presentation :( ) but for my part I'd still like to see this thing up and running.
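To make kernel 1 concrete, here is a minimal sketch under the scheme above; the grid shape (n_r radial cells per ray, n_ray rays), the cell ordering, and the dtau_per_particle constant are my assumptions, not the final code:

__global__ void compute_tau(const int *dtau_set, float *tau_set,
                            int n_r, int n_ray, float dtau_per_particle)
{
    /* one thread per ray: accumulate optical depth outward in r */
    int ray = blockIdx.x * blockDim.x + threadIdx.x;
    if (ray >= n_ray) return;
    float tau = 0.0f;
    for (int ir = 0; ir < n_r; ir++) {
        int cell = ray * n_r + ir;               /* cells ordered r-fastest */
        tau += dtau_per_particle * dtau_set[cell];
        tau_set[cell] = tau;                     /* tau at this cell's outer edge */
    }
}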
Tuesday, June 29, 2010
[from Pawel] Last minute results
(i) making InfiniBand work concurrently with ethernet, and installing all the openMPI tests - a lot of success with that, but the tests aren't finished yet; a lot of software setup and configuration still needs to be done. It looks like we'll get our 20 Gbps!
(ii) simulation of particles.
we decided to skip tau-effects and concentrate on benchmarking our naive, barely optimized code. After conquering the 512-particles-only problem in the old code, we can now run up to 30 million particles. Around 10pm we even started doing some "galaxy" simulations by adding external forces. We assumed a non-rotating, weakly barred force.
we benchmarked the compute part of the code at roughly 100 GFLOP/s.
this is approximately what we'd expect from a global memory bandwidth-limited problem:
the card can stream 120 GB/s from GDDR5 to the GPU, that is, 30 Gfloat/s.
If about 3.3 FLOP are done on every float fetched from RAM, then we are compute/bandwidth balanced. Our leapfrog integration is probably limited by bandwidth, although the GPU heats up to 93C at times, and the fan becomes a hair-dryer.
Sunday, June 27, 2010
[from Pawel] Josh's code
the wrong forecast must be due to the lack of CUDA.
Those of us who had looked into the code reported how impossible it was to compile on their computers. Anthony had to leave, and the 3 remaining Magnificent retreated upstairs and started up cudak2 (yes, a clandestine copy of cudak1 does exist).
After an hour or two we figured out how to compile Josh's code:
cudak2[152]:~...C/src/tau/CUDA_pSim$ nvcc CUDA_pSim.cu -I/usr/local/cuda/sdk2.3/C/common/inc/ -L/usr/local/cuda/sdk2.3/C/lib/ -lm -lglut -lcudart -lcutil
and after a few additional hours we figured out what the !@#$ it does: 500 particles that were started in a nice disk begin following some very strange orbits, essentially going up and falling onto the central star, then scattering at high speed to infinity.
ok, to make a long story short, we corrected the initial conditions and the equations of motion to the point where some of us (me) thought they now represented the correct, Binney-Tremaine-like equations, only written using Josh's reversed phi <--> theta convention, while some of us (Jeffrey) were complaining about the signs in front of sin(phi_Josh). The code was still doing its weird thing, unless we forced it to simulate a 2-D, flat disk, where it worked uncomfortably slowly, but apparently ok.
In desperation, we allowed Jeffrey to make the nonsense (-:) change of signs and... everything became as Newton intended: a stable 3-D disk. It turned out Josh's phi and theta were not the spherical-coordinates phi & theta: his phi was actually theta+pi/2, not pi/2-theta! So, given his (non-standard) definition of the meridional axis pointing downward rather than upward, our equations had two out of a dozen signs in front of acceleration terms wrong --> non-conservation of angular momentum --> trouble on a timescale of a few orbits.
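For future sign-debugging, here is the reference we should have had on the whiteboard: the accelerations in the standard physics convention (theta = polar angle from the +z axis, so NOT Josh's convention), as a small C sketch with names of my choosing; they follow from the Lagrangian of a point mass in a GM/r potential:

#include <math.h>

/* accelerations in standard spherical coordinates for a central
   potential GM/r; derived from
   L = (rdot^2 + r^2 thdot^2 + r^2 sin^2(th) phdot^2)/2 + GM/r */
void spherical_accel(double GM, double r, double th,
                     double rdot, double thdot, double phdot,
                     double *rdd, double *thdd, double *phdd)
{
    double s = sin(th), c = cos(th);
    *rdd  = r*thdot*thdot + r*s*s*phdot*phdot - GM/(r*r);
    *thdd = s*c*phdot*phdot - 2.0*rdot*thdot/r;
    *phdd = -2.0*rdot*phdot/r - 2.0*(c/s)*thdot*phdot;  /* singular at poles */
}

A convention flip like Josh's moves signs around in precisely those cross terms, which is how angular momentum conservation quietly breaks.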
We put the corrected code on cudak3 here:
cudak3[9]:~/cuda/projects/tau$
cudak3[10]:~/cuda/projects/tau$ ls
3Ddisplay.h CUDA_pSim.sln movPos.h
a.out CUDA_pSim_vc90.sln movPos.h~
CUDA_pSim.cu CUDA_pSim_vc90.suo particle.h
CUDA_pSim.cu~ CUDA_pSim_vc90.vcproj particle.h~
CUDA_pSim_gold.cpp CUDA_pSim_vc90.vcproj.Josh-PC.Josh.user vc90.pdb
CUDA_pSim_kernel.cu CUDA_pSim.vcproj
CUDA_pSim_kernel.cu~ forces.h
cudak3[11]:~/cuda/projects/tau$
cudak3[11]:~/cuda/projects/tau$ a.out
size of each particle is 24 bytes
Number of particles: 1000
Mass ratio (0 < u < 1): .01
minimum radius from solar mass (0.0 < R_min): .5
Time step: .01
Npart=1000 u=1.000e-02 R_min=5.000e-01 dt=1.000e-02 h=2.500e-04 M=1.000e+05 G=1.000e-05
creating particle array with 1000 particles...
setting initial conditions for particle array...
creating optical thickness array with 4 slices on device...
optical thickness array created...
freeglut (a.out): Unable to create direct context rendering for window 'Particle Simulation'
This may hurt performance.
cudak3[12]:~/cuda/projects/tau$
* * *
Interestingly, we did get openGL output from cudak3 on our remote machine (cudak2). It was very slow, but it worked. Apparently, if you transfer the data back to the host from the gpu device and then plot it with openGL (which is not the fastest way to do it), and you have an ssh -X connection to a cuda-capable machine, you can get your graphics served over the internet. That's nice. All the SDK demos fail to do this, since they keep the frame buffer for plotting on the gpu running the calculation, and openGL extensions to somehow replicate that data on a remote client are not available, so the demos print error messages and quit.
* * *
Next step: why does Josh's code run only ~500 particles? (With anything more, it refuses to move the particles; the display refreshes but the particles don't evolve in time.)
{Oh! I think I know. Just as I was writing this, I realized: of course, the maximum number must be 512, and 512 is the limit of threads per block
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
this is of course the output from cudak2[6]:~...C/bin/linux/release$ deviceQuery on the gtx280 cards }
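If that's the cause, the fix is to launch many blocks and guard the tail, roughly like this (a sketch; the kernel and variable names are mine, not Josh's):

__global__ void move_particles(float3 *pos, float3 *vel, int n, float dt)
{
    /* global particle index across all blocks */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;            /* the last block may be partial */
    /* ... integrate particle i here ... */
}

void launch(float3 *d_pos, float3 *d_vel, int n, float dt)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;   /* ceil(n/threads), not 1 */
    move_particles<<<blocks, threads>>>(d_pos, d_vel, n, dt);
}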
Saturday, June 26, 2010
[From Robert] Installing the software Part 1
Because of this, we decided to work with version 10 64-bit (even though the latest build was 13 at the time). However, due to trouble recognizing our new video cards (GTX 480), we were not able to load the Graphical User Interface (GUI) properly. This prompted us to install Fedora 11 64-bit, but we ran into the main issue of the NVIDIA drivers not installing on our system due to an incompatible kernel. With much frustration, we decided to take a chance on Fedora 12 64-bit. Sure enough, everything started to work out, and installing the NVIDIA and CUDA software was on its way.
---
First off, to install the NVIDIA drivers, we decided to use Linux x86_64 Display Driver Version 256.25 Beta.
Because the NVIDIA installer requires a non-X11 interface, the computer needs to exit X first. To do so, a terminal window was opened, and we logged in as root
# su
Then, the OS initialization file called 'inittab' was edited
# cd /etc/
Here, the last line of the file was edited from
id:5:initdefault:
to
id:3:initdefault:
The file was saved and the computer was rebooted. The computer then loaded the OS but presented a text-based log-in screen, due to the change made in the file. We logged in as root and moved to the location of the NVIDIA driver to run the installation file. In our case:
# cd /Downloads/
# sh NVIDIA-Linux-x86_64-256.25.run
After installing, the inittab file needed to be reverted to 'id:5:initdefault:' so that on boot it would load the GUI. Once saved, the computer was restarted.
--
After installing the NVIDIA drivers, the CUDA compiler needed to be installed. In our case:
# cd /Downloads/
# sudo sh cudatoolkit_3.0_linux_64_fedora10.run.sh
This was installed in the default path (/usr/local/cuda)
To make the compiler command (nvcc) work, the .bash_profile and .bashrc files (located in ~/) needed to be edited. Both PATH and LD_LIBRARY_PATH need to point to the CUDA installation:
#
PATH=/usr/local/cuda/bin:$PATH
LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH # on a 32-bit machine, use lib instead of lib64
export PATH LD_LIBRARY_PATH
#
For the .bashrc file, the following needed to be added:
#
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
#
The system then needed to be restarted for the changes to take place.
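A quick sanity check after logging back in (standard commands, nothing machine-specific): which nvcc should print /usr/local/cuda/bin/nvcc, and
# nvcc --version
should report the 3.0 toolkit.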
--
The only thing left now was installing the SDK, which includes sample and benchmark programs and hardware diagnostics. In our case:
# cd /Downloads/
# sh gpucomputingsdk_3.0_linux.run.sh
This was then installed in the default path (~/NVIDIA_GPU_Computing_SDK/), although it could have been installed in a non-user path as well.
Once this was installed, make had to be run on the SDK at ~/NVIDIA_GPU_Computing_SDK/C/
On our machine, make did not finish properly, complaining about " cannot find -lglut "
To fix this, we needed to install the glut packages:
# sudo yum install freeglut
and
# sudo yum install freeglut-devel
Once these were installed, make was run again, and it passed, installing the SDK examples to ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/
From there, numerous executables are available, most notably deviceQuery and bandwidthTest.
# ./bandwidthTest --device=all
[bandwidthTest]
./bandwidthTest Starting...
!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!
Running on...
Device 0: GeForce GTX 480
Device 1: GeForce GTX 480
Device 2: GeForce GTX 480
Quick Mode
Host to Device Bandwidth, 3 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5155.2
Device to Host Bandwidth, 3 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4267.7
Device to Device Bandwidth, 3 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 120670.9
[bandwidthTest] - Test results:
PASSED
# ./deviceQuery --all
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 3 devices supporting CUDA
Device 0: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1609760768 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Device 1: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1610285056 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Device 2: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1610285056 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4246847, CUDA Runtime Version = 3.00, NumDevs = 3, Device = GeForce GTX 480, Device = GeForce GTX 480
PASSED
At this point, NVIDIA CUDA was installed and working.
To be continued...
[from Pawel] success with drivers, methinks
in /home/student/Downloads/ on both cudak4 and cudak5 we now have the newest driver (newer than was available from nvidia download page a few days ago when I reported last)
NVIDIA-Linux-x86_64-256.35.run [version 256.35]
we also have a toolkit cudatoolkit_3.1_linux_64_fedora12.run, and finally
the SDK in gpucomputingsdk_3.1_linux.run
all these things except the last need to be installed by a superuser like
#sh (file name with .run extension)
I think you should set up a directory, either NVIDIA_GPU_Computing_SDK/ or, as I do, cuda31 (to keep the cuda toolkit/SDK version visible), **in your own** home directory. Then you can modify the sdk programs in, say, the /home/your-name/cuda31/C/src/ directory at will, and use their makefiles, which put the demos or your own program executables in the /home/your-name/cuda31/C/bin/linux/release/ directory
* * *
It seems that all the sdk demos are working properly on all cards, on both
cudak4 and cudak5 now!
[I can see temperatures in the nvidia-settings utility on k4 now, but not yet on k5;
I'll try rebooting. In any case, I have stress-tested k4, and it works ok; it also doesn't get stuck in the high-temperature state.]
I'm curious about the CPU speed: the Fedora 12 utility shows it remaining in the slow/cool state at 1.6(?) GHz, even when running multiple cuda examples. I'd think it should switch to the max ~2.86 GHz. Could that affect any benchmark results?
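One way to check (my suggestion; this assumes the standard Linux cpufreq sysfs interface):

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

If the governor is "ondemand", the clock ramps up only under CPU load, and a GPU-bound run may never trigger it; that would mostly affect host-side timings (memcpys, driver calls) rather than kernel throughput.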
[from Josh]
Friday, June 25, 2010
[from Pawel] new driver/SDK on cudak5
the previous one, 256.25, was a beta version that we needed in order to compile anything
with the gcc-3.4 compiler, but it gradually showed some serious problems,
like not seeing chosen devices and putting them in high-gear mode (high T) permanently.
Josh installed the appropriate toolkit with SDK and it now works ~ok!!
I've tested every SDK demo on all cards and the result is that we no longer have
any problems with multi-GPU!
the problems you'll encounter running the nvidia-settings utility & the examples in
/home/student/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
are restricted to the following programs
====================================================
nvidia-settings does not show:
Thermal Settings
PowerMizer
DFP-1(20 - GeForce GTX 480)
for devices 1 and 2.
to see temperatures you must say
% nvidia-smi -q -a
which prints correct temperatures. I've been able to generate T=90-94C while running
3 instances of the nbody test on 3 cards. It's normal.
=======================================================================================
printout from ./deviceQuery
(..)
Concurrent copy and execution: Yes
Run time limit on kernels: No
(...)
and yet...
[student@cudak5 release]$ ./concurrentKernels
[concurrentKernels] - Starting...
CUDA Device GeForce GTX 480 has 15 Multi-Processors
CUDA Device GeForce GTX 480 is NOT capable of concurrent kernel execution
concurrentKernels.cu(111) : cudaSafeCall() Runtime API error : unspecified launch failure.
[student@cudak5 release]$
======================================================================================
[student@cudak5 release]$ ./fluidsGL
[fluidsGL] - [OpenGL/CUDA simulation]
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL_kernels.cu(44) : cutilCheckMsg() CUTIL CUDA error : cudaMemcpy failed : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$
[student@cudak5 release]$
[student@cudak5 release]$ ./fluidsGL --device=2
[fluidsGL] - [OpenGL/CUDA simulation]
Using device 2: GeForce GTX 480
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL.cu(183) : cudaSafeCall() Runtime API error : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$
[student@cudak5 release]$
[student@cudak5 release]$
[student@cudak5 release]$ ./fluidsGL --device=1
[fluidsGL] - [OpenGL/CUDA simulation]
Using device 1: GeForce GTX 480
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL.cu(183) : cudaSafeCall() Runtime API error : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$
[student@cudak5 release]$
====================================================================
Those who are close to producing working CUDA programs, please do so on cudak5 in the student account. Don't worry (by necessity) about infiniband.
While having no internode comm looks moderately bad, not having any CUDA will definitely look bad during our 29th June meeting.
We have very little time.. full-time engagement is expected from now to the end of the month.
Thursday, June 24, 2010
[from Josh] CUDAizing progress report
Wednesday, June 23, 2010
[from Anthony] Brief Update
Tuesday, June 22, 2010
[from James]
leapfrog version 1:
R = r + 0.5*h*v # drift for 1/2 step
T = theta + 0.5*h*omega # drift for 1/2 step
dv = R*omega*omega - GM/(R*R) # acceleration in r
dw = -2.0*omega*v/R # acceleration in theta
v = v + h*dv # kick for full step
omega = omega + h*dw # kick for full step
r = R + 0.5*h*v # drift for remaining 1/2 step
theta = T + 0.5*h*omega # drift for remaining 1/2 step
dE = h*(GM*v/(r*r) + v*dv + r*v*omega*omega + r*r*omega*dw)
error = log(abs(dE/E),10)
Version 2 (this one does an initial 1/2 kick before the loop starts):
# incremental acceleration
dv = L*L/(r*r*r) - GM/(r*r)
dw = -2.0*omega*v/r
# drift for full step
r = r + h*v
theta = theta + h*omega
# Kick for full step - note that because of the initial 1/2 kick, this
# will put the velocities 1/2 step ahead of the positions
v = v + h*dv
omega = omega + h*dw
# compute total energy, as a check:
E2 = 0.5*(v*v + r*r*omega*omega) - GM/r
error = log(abs((E-E2)/E),10)
Version 3 (this one modifies the equations to allow full time steps over
the same time interval, but is mathematically equivalent to the other
versions):
dv1 = L*L/(r*r*r) - GM/(r*r) # incremental acceleration in r at t
dw1 = -2.0*omega*v/r
r = r + h*v + 0.5*h*h*dv1
theta = theta + h*omega + 0.5*h*h*dw1
dv2 = L*L/(r*r*r) - GM/(r*r) # incremental acceleration in r at t+dt
dw2 = -2.0*omega*v/r
v = v + 0.5*h*(dv1 + dv2)
omega = omega + 0.5*h*(dw1 + dw2)
E2 = 0.5*(v*v + r*r*omega*omega) - GM/r
error=log(abs((E-E2)/E),10)
You probably noticed that I calculated the error in total energy differently in the first method than in the second two: the first calculates dE and then the error as dE/E; the second way is to calculate E2 from the updated positions and velocities and then the error as (E-E2)/E. Both methods should yield similar errors.
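For anyone porting these to C before CUDA-izing, here is a compact, self-contained sketch of version 2 (the form with velocities staggered half a step); the circular-orbit initial conditions and step count are just a test case of mine:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double GM = 1.0, h = 1e-3;
    double r = 1.0, theta = 0.0;       /* positions at t */
    double v = 0.0, omega = 1.0;       /* circular orbit: r*omega^2 = GM/r^2 */
    double L = r*r*omega;              /* conserved angular momentum */
    double E = 0.5*(v*v + r*r*omega*omega) - GM/r;

    /* initial half kick puts the velocities at t + h/2 */
    double dv = L*L/(r*r*r) - GM/(r*r);
    double dw = -2.0*omega*v/r;
    v     += 0.5*h*dv;
    omega += 0.5*h*dw;

    for (int n = 0; n < 10000; n++) {
        /* drift for a full step */
        r     += h*v;
        theta += h*omega;
        /* kick for a full step */
        dv = L*L/(r*r*r) - GM/(r*r);
        dw = -2.0*omega*v/r;
        v     += h*dv;
        omega += h*dw;
    }
    double E2 = 0.5*(v*v + r*r*omega*omega) - GM/r;
    printf("log10 relative energy error: %g\n", log10(fabs((E - E2)/E)));
    return 0;
}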
[from Pawel] Mail problem
The St George campus will be closed from Thu, so we'll meet on Fri at UTSC,
and on Sat. at my place. We'll of course spend some time with those of you who haven't been to Buttonville, within a 50 km radius from there >->--
The meeting is 29 June, 2pm, in the AA bldg, rm 401 or something like that (the Vice Principal Research room on the 4th floor). Those who can attend, say so!
There are many things which don't work right now.
cudak4/5 don't have:
1. fixed IP addresses,
2. a working set of driver + SDK that sees all 3 fermi cards properly (I installed driver 195.36.31, which was released a few days ago, but that doesn't work with the more experimental 256.25 (or some such)
that we had before). I didn't have time to solve that and see if all cards are usable with the new recommended driver.
3. device 2 on cudak5 overheated; it was 78C at idle, instead of the normal 56C.
The reason was that it was in the high-gear state all the time (max clock speed on gpu and mem).
After driver installation, the toolkit failed to install, so I don't yet know if that problem went away.
nvidia-settings showed that 1 of the 3 devices (0,1,2) has a different version of
VBIOS; however, it was dev 1, not 2, so that's a false hint.
4. even though most SDK demos ran correctly on dev 0 and 1, there was a disturbingly long startup time, as opposed to the instantaneous load on old cuda setups. dev 2 was also hanging for a looong time before generating the Cuda Mem Alloc Error or Unspecified Launch Error.
This is some kind of software/driver problem. Robert swapped 2 cards, and
dev 2 was still the one having problems (confusingly, it's dev 1 that has problems on cudak4).
Any help in solving these fascinating problems will be appreciated.
Maybe I'll go to UTSC today..
keep up the good work
Pawel
Monday, June 21, 2010
[from Jeffrey] hydrocode
This is a shock tube example done in 2d. The resolution is 500 by 500, and the Courant number is 0.5.
Now onto cuda-izing.
[from Josh] progress update
Wednesday, June 16, 2010
[from Josh] r.. (..=double derivative) in r,theta,phi coord system
Tuesday, June 15, 2010
[from Josh] r.. theta.. and phi.. (..=double derivative) in x,y,z coord system
When using spherical coords (r,theta,phi)
where r is the radius from center,
theta is the angle from the x axis (0..2pi),
and phi is the angle from the x-y plane (-pi/2..pi/2):
x = r*cos(theta)*cos(phi)
y = r*sin(theta)*cos(phi)
z = r*sin(phi)
In reverse:
r = sqrt(x^2+y^2+z^2)
theta = arctan(y/x)
phi = arctan(z/sqrt(x^2+y^2))
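In code, the reverse transform is safest with atan2, which picks the correct quadrant; a small sketch in my naming, following the convention above:

#include <math.h>

/* Josh's convention: theta in the x-y plane from +x (0..2pi),
   phi measured from the x-y plane (-pi/2..pi/2) */
void to_spherical(double x, double y, double z,
                  double *r, double *theta, double *phi)
{
    *r     = sqrt(x*x + y*y + z*z);
    *theta = atan2(y, x);                  /* arctan(y/x), right quadrant */
    if (*theta < 0) *theta += 2.0*M_PI;    /* map into 0..2pi */
    *phi   = atan2(z, sqrt(x*x + y*y));
}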
The following time derivatives were derived using the above three equations for
r, theta, and phi. I tried to make them as legible as possible.
Let me know if you would like a neater copy and I'll post a picture:
The first time derivative of r(t) is:

$$\dot r = \frac{x\dot x + y\dot y + z\dot z}{\left(x^2+y^2+z^2\right)^{1/2}}$$

The second time derivative of r(t) is:

$$\ddot r = \frac{\dot x^2 + x\ddot x + \dot y^2 + y\ddot y + \dot z^2 + z\ddot z}{\left(x^2+y^2+z^2\right)^{1/2}} - \frac{\left(x\dot x + y\dot y + z\dot z\right)^2}{\left(x^2+y^2+z^2\right)^{3/2}}$$

The first time derivative of theta is:

$$\dot\theta = \frac{x\dot y - y\dot x}{x^2+y^2}$$

The second time derivative of theta is:

$$\ddot\theta = \frac{x\ddot y - y\ddot x}{x^2+y^2} - \frac{\left(x\dot y - y\dot x\right)\left(2x\dot x + 2y\dot y\right)}{\left(x^2+y^2\right)^2}$$

The first time derivative of phi is:

$$\dot\varphi = \frac{\dot z\left(x^2+y^2\right) - z\,x\dot x - z\,y\dot y}{\left(x^2+y^2\right)^{1/2}\left(x^2+y^2+z^2\right)}$$

The second time derivative of phi (might as well shoot yourself now) is, written compactly with $\rho^2 = x^2+y^2$ and $s = x^2+y^2+z^2$ (the fully expanded form is what the computer-algebra pretty-printer spat out):

$$\ddot\varphi = \frac{\ddot z\,\rho^2 + \dot z\left(x\dot x + y\dot y\right) - z\left(\dot x^2 + x\ddot x + \dot y^2 + y\ddot y\right)}{\rho\,s} - \frac{\left[\dot z\,\rho^2 - z\left(x\dot x + y\dot y\right)\right]\left[\left(x\dot x + y\dot y\right)s/\rho + 2\rho\left(x\dot x + y\dot y + z\dot z\right)\right]}{\rho^2 s^2}$$
Happy Coding,
Josh,
[from Jeffrey] mathgl on k3
//////////////////////////////////////////////////////
#include </usr/local/include/mgl/mgl_zb.h>
#include <iostream>
using namespace std;
void image(char *fname){
int imax=60;
int jmax=60;
mglGraph *gr = new mglGraphZB;
mglData dat(imax,jmax);
//dat is declared as 2d with dimensions imax by jmax.
for (int i=0; i<imax; i++){
for (int j=0; j<jmax; j++){
dat.a[i+imax*j]={insert data to array};
}
}
//double *a is a public member of the mglData class
//where the data is stored as a 1d array.
//similarly, for 3d plotting, dat can be declared as:
//
//mglData dat(imax,jmax,kmax);
//
//and each number in the array can be accessed through:
//
//for (int i=0; i<imax; i++){
// for (int j=0; j<jmax; j++){
// for (int k=0; k<kmax; k++){
// dat.a[i+imax*(j+jmax*k)]={insert data to array};
// }
// }
//}
gr->Box();
gr->Dens(dat);
gr->Colorbar();
cout << fname << " is being saved." << endl;
gr->WritePNG(fname);
delete gr; //free the plot object when done
return;
}
/////////////////////////////////////////////////////////
The function above produces a 2d density plot, which is saved as a png file. For fancier plots, check out the mathgl manual. Have fun!
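To build something that uses it (my compile line; it assumes mathgl's headers and libraries are installed under /usr/local, as in the include path above, and links with -lmgl):

g++ myplot.cpp -lmgl -o myplot

Then call image("frame0001.png") (or whatever name you generate per output step) after each dump.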
In other news, my PPM sweep is not working quite right, for unknown reasons. More debugging for me.
Monday, June 14, 2010
[from Pawel]
Some demos don't work (notably fluidsGL!), and with others we've had system freeze-ups. :-| They may not be related to the SDK demos, however; it could be a thermal instability(?) of something on the mobo.
We need to investigate. Hardware team, can you spend some time running things on cudak4, e.g., multiple SDK demos, to heat up the box and see if this occurs again? Before that, please install the additional 14 cm fans (one in the vertical position behind the harddisk compartment). The fans should snap into the black plastic frames, which should be attached to the bottom of the case with 3 screws, I believe.
* * *
Next thing - networking with infiniband, we need to install drivers.
* * *
NOTICE!!
I've rescheduled the 24 June meeting for 29 June, rm AA401 UTSC, at 10-11am.
Time is very short and we must spend full time working on this now, ok?
I'd like us to deliver the working pair of nodes, with our software demos and networking,
by the end of this month. If we fail to deliver any working, useful programs, some of us may see a pay cut. :-(
cheers
P
Friday, June 11, 2010
[From Robert] Installing the hardware
Custom build for high performance, each computer consists of:
1 Thermaltake VH6000BWS tower case
1 Asus P6T7 WS Supercomputer motherboard
1 Intel i7 930 2.8GHz processor
6 OCZ 2GB DDR3-1333MHz Triple Channel memory modules, giving us 12GB in total
1 Western Digital 2TB Caviar Green hard drive
3 EVGA NVIDIA GTX 480 cards with 1.5GB video memory each
1 LG GH22N S50 DVD Writer
1 Silverstone Strider ST1500 1.5kW power supply
1 Infiniband InfiniHost III Lx Single-Port MHGS18 20Gb/s card
There were numerous challenges when putting the parts together.
Our first speed bump happened when our newly built system would not recognize more than 8GB of RAM, and at other times no more than 2GB. It was a very peculiar error, as the amount the computer saw changed randomly. Moving the RAM sticks around did not help; sometimes the machine would not even start up. After confirming that it was not the memory modules bringing down the machine (by verifying them in the other computer), it was determined that it was the motherboard.
A repair was arranged, but unfortunately, the repair was said to take 1 - 2 weeks, which we could not afford, and hence another motherboard was purchased.
Our second speed bump occurred when the replacement motherboard arrived. After assembling the system, the computer would not turn on. After diagnosing and checking everything over numerous times, it was decided that this new motherboard was also defective. It too was sent in for repair, though fortunately, our first motherboard had been 'fixed' and returned to us.
After receiving back the motherboard, the system was built again, and fortunately this time, everything worked flawlessly.
More updates to come...
Wednesday, June 9, 2010
[from Josh] Progress Report
I would like to say I'm almost done with my CUDA conversion process, but who knows with debugging. My best guess is that I'll be done in 4 days, so by Monday. It should hopefully include a CPU version and a working CUDA version of a tau-modeling polar-coord simulation.
Also, James and I were discussing how an adaptive leapfrog scheme might be possible, and we came to the conclusion that we would have to adjust the time stepping globally to do it; the idea is sketched below. Individual time-step adjustment would lead to a skewed time system, since some particles would not update in sync with all the rest (causing undesired shadow effects).
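A sketch of what the global adjustment might look like (the eta safety factor and the r/|v| timescale criterion here are illustrative assumptions, not something we settled on): every step, each particle proposes a timestep, and the global dt is the minimum.

#include <math.h>

/* one global dt per step: the smallest timescale any particle needs,
   scaled by a safety factor eta; all particles then advance together */
double global_timestep(int n, const double *r, const double *vmag,
                       double eta, double dt_max)
{
    double dt = dt_max;
    for (int i = 0; i < n; i++) {
        double dt_i = eta * r[i] / (vmag[i] + 1e-30);  /* avoid div by 0 */
        if (dt_i < dt) dt = dt_i;
    }
    return dt;
}

On the GPU this per-particle minimum would become a reduction, but the logic is the same.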
Josh,
Monday, June 7, 2010
see how it flies
http://www.av8n.com/how/
* * *
please report difficulties and successes in blog entries. ask for help if you need to.
by now I'd think everybody's got the cpu side working and is deciphering the cuda stuff.
time is T-17 days for being ready with our testbed system running cuda examples and we're not even half way :-|
the infiniband switch is somewhere near - canada customs is already asking for money..
[edit: it's in the UTSC shipping dept!]
Monday, May 31, 2010
[from Pawel] how to compute tau? Party on Sat.
Suppose the disk is 2D; otherwise you'll have to generalize. Then each particle must deposit its dtau (small contribution to tau).
We talked a lot about this with James, and he's your guru on that (right, James?). Ask him any questions. You first declare an array (2d: r,phi) that will hold dtau(r,phi).
We established that it is much better to use the so-called "scatter" approach than the "gather" approach. In the first, you traverse the list of N particles and donate their dtau's to the appropriate grid locations (grid indexes are not stored but computed each time you need them for a given particle; that should save time and space in ram). In the second approach, you'd traverse the dtau grid and look for the particles inside each grid cell (that's a lot of searching: ~N per cell).
We need exp(-tau) in the end. We can get it by integrating: tau(r,phi) = integral_0^r dtau(r,phi) = the sum along r, from the innermost cell to the particle's cell, of dtau(r,phi).
We've also talked about a nice trick to avoid computing exponential functions later on.
Suppose you have dtau(r) in an array and want exp(-tau). Then you can get a very good approximation to f = exp(-tau) from
f = 1 + integral_0^tau (-f) dtau
Why? Because df(tau) = -f(tau) dtau is the differential equation whose solution is f = exp(-tau).
You need to start with the initial value f = 1 at the inner radius and integrate outward.
What is the dtau contributed by each particle? It's dA/(4 pi r^2), where dA is its geometric cross section.
In general, we don't want to count particles; all we want is to arrive at a tau(r,phi) which
in the initial axisymmetric state has tau(r = outer radius) = tau_infinity = 10 (or 3, or something you input). Therefore, do the integration using dtau = C/r^2 with C = 1 initially, and after getting
tau(outer), or exp(-tau(outer)), computed with the above algorithm, recalculate C so that all the
later calls to the tau-computing procedure give the right exp(-tau_infinity) at the outside.
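A minimal serial sketch of the whole scheme, to fix the ideas (CPU-side C; the grid shape, the linear radial spacing, and all names are my assumptions):

#include <math.h>
#include <string.h>

#define NR   128
#define NPHI  64

/* scatter: each particle deposits dtau = C/r^2 into its (r,phi) cell */
void deposit_dtau(int n, const double *r, const double *phi,
                  double C, double r_in, double r_out,
                  double dtau[NR][NPHI])
{
    memset(dtau, 0, sizeof(double)*NR*NPHI);
    for (int p = 0; p < n; p++) {
        int ir = (int)((r[p] - r_in) / (r_out - r_in) * NR);
        int ip = (int)(phi[p] / (2.0*M_PI) * NPHI);
        if (ir >= 0 && ir < NR && ip >= 0 && ip < NPHI)
            dtau[ir][ip] += C / (r[p]*r[p]);
    }
}

/* integrate outward along each ray: f = exp(-tau) via df = -f dtau,
   starting from f = 1 at the inner radius; no exp() calls needed */
void attenuation(const double dtau[NR][NPHI], double f[NR][NPHI])
{
    for (int ip = 0; ip < NPHI; ip++) {
        double fcur = 1.0;
        for (int ir = 0; ir < NR; ir++) {
            fcur -= fcur * dtau[ir][ip];
            f[ir][ip] = fcur;
        }
    }
}

After one pass, rescale C by comparing the computed tau at the outer edge with the input tau_infinity, exactly as described above.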
* * *
We will meet on Friday downtown; meanwhile, there seems to be enough work, and enough computers at UTSC already doing or starting to do cuda, to drop in there for the rest of the week.
On Sat, as I mentioned, you are invited to a pool-side party (swimming encouraged,
bring your swimming trunks). Bring a friend if you'd like.
I'll make some bbq, but if you want fancy salads, maybe we can make it a potluck (= byo.. party). Hopefully the weather will cooperate so we can go flying for 20 min with each of you out of Buttonville airport (north of 404 & 407). http://planets.utsc.utoronto.ca/~pawel/VANsRV.html
Please let me know how many are coming and whether you have a car (it would be good to have
more than just my car for the Buttonville trip).
Finally, the coordinates (the time is still to be confirmed; we'll see if the Wx is ok for the whole or just part of the day):
37 Fairhill Cres., Toronto, 2pm, Sat 26 May. I can give you a 1-2 km ride from Underhill and Lawrence to my place; send me the details of how you're coming and we'll set a time to meet
at the gas station there. The bus that goes along Lawrence is the 54.
Thursday, May 27, 2010
[from Josh] OpenGL code
"look at how I did this under HTML view to see what a pre tag is"
The way that I have built my code, everything needed to run the simulation (such as "math.h", "freeglut.h", etc.) is "#include"ed in one header file. Also within that header file I define the global variables that will be used for the simulation, such as the number of particles, current runtime, dt, solar mass, etc. I also define a class object called particle and define a dynamic array of particle objects. I then include this first header file in another header file called "forces.h". In "forces.h" I define all the different forces that I want to take effect in my simulation (currently just gravitational, but it's very easy to add other forces). I then include the forces header file in another header file called "movePos.h", where all the numerical integration takes place; it calls functions from "forces.h" and updates the particle positions. I then include "movePos.h" in my graphics header, "3Ddisplay.h", which I have posted below. Here, within my render function, I call functions in "movePos.h" which update the particle object array defined in the very first header file. I then graph the positions of each of the particles, adding visualization techniques, and also display data.
Up until now, everything has been contained in separate header files that are easily updated, debugged, and managed. You should know that header files release their properties to any file that includes them (i.e., files that include header files inherit all things within them). I then have a file (not a header) called "main.cpp". This file includes "3Ddisplay.h" and thus inherits all the other header files' functions and global variables. In "main.cpp" I define a single function called "main" that interactively asks the user what type of simulation they would like to perform and what the parameters of the simulation are, and then calls the displaySim function in "3Ddisplay.h", which displays the simulation as requested. The program is terminated when the user presses "q" in the display window.
Also my simulation utilizes openMP and will run in parallel across all available CPU cores.
I tested my program using "extern" variables instead of in inherited header variables but the complexity rose considerably and made debugging a nightmare. I also test various pointer array object declaration techniques to see which one was the fastest and I found that the using my current method of memory allocation is the fastest. This is similar to the way we would do things with CUDA too since CUDA works fastest with "pinned memory" (memory that is allocated once and remains static for the remainder of the program).
If you would like to use the below code you will have to either a) modify it completely so that it works with your code format, or b) try and follow my format.
In general you really only need to do two things:
- change the displaySim function to "main" and add #include freeglut.h at the top of your program
- in the render function, replace movePos3D with your own code that updates particle positions. (Note: where it says pPart[i], that is my particle object; replace it with whatever your particle is, or the corresponding x,y,z,vx,vy,vz of your code.)
Note: you will need to have freeglut installed on your system and locatable by your compiler. See my previous post on the topic for those using windows.
#include "movePos.h"
/*openGL vars*/
static int left_click = GLUT_UP;
static int right_click = GLUT_UP;
static int xold;
static int yold;
static int width;
static int height;
static int wh;
static int hw;
static int vidtype = 'n';
static int leavetrail = 0;
static int paused = 0;
static double rotate_x = 30;
static double rotate_y = 15;
static double alpha = 0;
static double beta = 0;
static double scale = 100;
static double trans = -0.1*scale;
/*openGL functions*/
void ReshapeFunc(int new_width, int new_height){
width = (double)new_width;
height = (double)new_height;
hw = height / (double) width;
wh = width / (double) height;
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(20 * sqrt(double(1 + hw * hw)), wh, 8, 12);
glMatrixMode(GL_MODELVIEW);
glutPostRedisplay();
}
void MotionFunc(int x, int y){
if (GLUT_DOWN == left_click){
rotate_y = rotate_y + (y - yold) / 5.f;
rotate_x = rotate_x + (x - xold) / 5.f;
if (rotate_y > 90) rotate_y = 90;
if (rotate_y < -90) rotate_y = -90;
glutPostRedisplay();
}
if (GLUT_DOWN == right_click){
beta = beta + (y - yold) / 2.f;
alpha = alpha + (x - xold) / 2.f;
glutPostRedisplay();
}
xold = x;
yold = y;
}
void keyCB(unsigned char key, int x, int y){
if( key == 'q' ){ free(pPart); exit(0); }
if( key == '+' ){ scale *= 2.0/3.0; }
if( key == '-' ){ scale *= 1.5; }
if( key == 'r' ){ vidtype = 'r'; }
if( key == 'v' ){ vidtype = 'v'; }
if( key == 'n' ){ vidtype = 'n'; }
if( key == 't' ){ leavetrail = abs(leavetrail - 1); }
if( key == 'p' ){ paused = abs(paused - 1); }
}
void renderBitmapCharacher(float x, float y, float z, void *font, char *string){
char *c;
glRasterPos3f(x,y,z);
for (c=string; *c != '\0'; c++){
glutBitmapCharacter(font, *c);
}
}
void Render(int face){
double D;
int i;
/* Axis */
glBegin(GL_LINES);
glColor3d(1, 0, 0);
glVertex3d(-0.7*scale, -0.7*scale, -0.7*scale);
glVertex3d( 0.7*scale, -0.7*scale, -0.7*scale);
glColor3d(0, 1, 0);
glVertex3d(-0.7*scale, -0.7*scale, -0.7*scale);
glVertex3d(-0.7*scale, 0.7*scale, -0.7*scale);
glColor3d(0, 0, 1);
glVertex3d(-0.7*scale, -0.7*scale, -0.7*scale);
glVertex3d(-0.7*scale, -0.7*scale, 0.7*scale);
glEnd();
glRotated(beta, 1, 0, 0);
glRotated(alpha, 0, 1, 0);
glColor3d(1, 1, 1);
glBegin(GL_POINTS);
if (face==0) movePos3D();
KE = 0.0;
#pragma omp parallel for
for (i=0;i<Npart;i++){
/* [gap: the particle-plotting loop body, the rest of Render(), and the
beginning of DisplayFunc() were eaten by the blog's HTML between a '<'
and the next '>'; the listing resumes inside DisplayFunc()] */
void DisplayFunc(void){
/* ... */
if (height > width)
glOrtho(-scale, scale, -scale * hw, scale * hw, -scale, scale);
else
glOrtho(-scale * wh, scale * wh, -scale, scale, -scale, scale);
glMatrixMode(GL_MODELVIEW);
/* Perspective view */
glViewport(0, 0, width / 2, height / 2);
glPushMatrix();
/* Display data */
char Npart_str[33];
sprintf(Npart_str,"N=%d",Npart);
renderBitmapCharacher(-scale,scale - 0*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,Npart_str);
char G_str[33];
sprintf(G_str,"G=%2.4Lf",G);
renderBitmapCharacher(-scale,scale - 1*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,G_str);
char t_str[33];
sprintf(t_str,"t=%2.4Lf",t);
renderBitmapCharacher(-scale,scale - 2*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,t_str);
char h_str[33];
sprintf(h_str,"h=%2.4Lf",h);
renderBitmapCharacher(-scale,scale - 3*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,h_str);
char scale_str[33];
sprintf(scale_str,"axis scale=%2.4Lf",scale*2);
renderBitmapCharacher(-scale,scale - 4*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,scale_str);
char K_str[33];
sprintf(K_str,"Kinetic Energy=%2.4Lf",KE);
renderBitmapCharacher(-scale,scale - 5*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,K_str);
glTranslated(0,0,trans);
glRotated(rotate_y, 1, 0, 0);
glRotated(rotate_x, 0, 1, 0);
Render(0);
glPopMatrix();
/* Right view */
glViewport(0, height / 2 + 1, width / 2 + 1, height / 2);
glPushMatrix();
glRotated(90, 0, -1, 0);
Render(1);
glPopMatrix();
/* Face view */
glViewport(width / 2 + 1, height / 2 + 1, width / 2, height / 2);
glPushMatrix();
Render(2);
glPopMatrix();
/* Top view */
glViewport(width / 2 + 1, 0, width / 2, height / 2);
glPushMatrix();
glRotated(90, 1, 0, 0);
Render(3);
glPopMatrix();
/* Projection matrix is restored */
glMatrixMode(GL_PROJECTION);
glPopMatrix();
glMatrixMode(GL_MODELVIEW);
/* End */
glFlush();
glutSwapBuffers();
}
/* Run upon click */
void MouseFunc(int button, int state, int x, int y)
{
if (GLUT_LEFT_BUTTON == button)
left_click = state;
if (GLUT_RIGHT_BUTTON == button)
right_click = state;
xold = x;
yold = y;
}
/* Timed refresh to control refresh rate */
void TimerdisplayCB(int value){
if (paused == 0) glutPostRedisplay();
glutTimerFunc(33,TimerdisplayCB,value);
}
void displaySim(int *argcp, char **argv){
/* Creation of the window */
glutInit(argcp, argv);
glutInitDisplayMode(GLUT_RGBA);
glutInitWindowSize(glutGet(GLUT_SCREEN_WIDTH),glutGet(GLUT_SCREEN_HEIGHT));
glutCreateWindow("Particle Simulation");
/* OpenGL settings */
glClearColor(0, 0, 0, 0);
glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
/* Declaration of the callbacks */
glutDisplayFunc(DisplayFunc);
glutReshapeFunc(ReshapeFunc);
glutKeyboardFunc(keyCB);
glutMouseFunc(MouseFunc);
glutMotionFunc(MotionFunc);
glutTimerFunc(33,TimerdisplayCB,0);
/* Main display */
glutMainLoop();
}
Wednesday, May 26, 2010
[from Robert] cudak4 & cudak5 up and running
[from Anthony] some CUDA examples
https://visualization.hpc.mil/wiki/GPGPU
n=10 iterations=1000 time_gpu=6586.000 time_cpu=1601.000
n=10 iterations=100 time_gpu=615.000 time_cpu=150.000
n=60 iterations=100 time_gpu=617.000 time_cpu=5980.000
I won't be posting their code; it's pretty lengthy. They also have a plot where they compare the CPU/GPU times of executing the program.
By the way, the devices I have in my macbook pro are...
Device 0: "GeForce 9600M GT"
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 268107776 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.25 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)
Device 1: "GeForce 9400M"
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 266010624 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.10 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)