Cuda


Description

CUDA tutorial – MATLAB interface via MEX files

Transcript of Cuda

Page 1: Cuda

February 9, 2007

Uso de placas gráficas em computação de alto-desempenho

(High-performance computing using GPUs)

Mario Alexandre Gazziro (YAH!)

Advisor: Jan F. W. Slaets

24/09/08

Page 2: Cuda

2/9/07 Course Title 2

Part I: Overview

Definition: Introduced in 2006, the Compute Unified Device Architecture (CUDA) is a combined software and hardware architecture (available on NVIDIA G80 GPUs and later) that enables data-parallel, general-purpose computing on graphics hardware. It offers a C-like programming API with some language extensions.

Key Points: The architecture supports massively multithreaded applications and provides inter-thread communication and memory access.

Page 3: Cuda

Why is this topic important?

Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements.

Emerging hardware technologies, such as the CUDA architecture, can significantly boost the performance of a wide range of applications by increasing compute cycles and bandwidth and by reducing latency.

Page 4: Cuda

Where would I encounter this?

Gaming

Raytracing

3D Scanners

Computer Graphics

Number Crunching

Scientific Calculation

Page 5: Cuda

CUDA SDK sample applications

Page 6: Cuda

CUDA SDK sample applications

Page 7: Cuda

CUDA vs Intel

NVIDIA GeForce 8800 GTX vs. Intel Xeon E5335 (2 GHz, 8 MB L2 cache)

Page 8: Cuda

Grid of thread blocks

The computational grid consists of a grid of thread blocks

Each thread executes the kernel

The application specifies the grid and block dimensions

Grid and block layouts can be 1-, 2-, or 3-dimensional

The maximum grid and block sizes are fixed hardware limits of the GPU

Each block has a unique block ID

Each thread has a unique thread ID (within its block)

Page 9: Cuda

Elementwise Matrix Addition

Page 10: Cuda

Elementwise Matrix Addition

The nested for-loops are replaced with an implicit grid

Page 11: Cuda

Memory model

CUDA exposes all the different types of memory on the GPU: registers, local, shared, global, constant, and texture memory.

Page 12: Cuda

Part II: Accelerating MATLAB with CUDA

Case Study: Initial calculation for solving a sparse matrix by the method proposed by Professor Guilherme Sipahi of IFSC

N=1001;
K(1:N) = rand(1,N);
g1(1:2*N) = rand(1,2*N);
k = 1.3;
tic;
for i=1:N
    for j=1:N
        M(i,j) = g1(N+i-j)*(K(i)+k)*(K(j)+k);
    end
end
matlabTime=toc

tic;
M=guilherme_cuda(K,g1);
cudaTime=toc

speedup=matlabTime/cudaTime

Page 13: Cuda

Results: Speedup of 4.77 times using an NVIDIA 8400M with 128 MB

matlabTime =

10.6880

cudaTime =

2.2406

speedup =

4.7701

>>

Page 14: Cuda

The MEX file structure

The main() function is replaced with mexFunction.

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    /* code that handles the interface and calls the computational function */
    return;
}

mexFunction arguments:

- nlhs: The number of lhs (output) arguments.

- plhs: Pointer to an array that will hold the output data; each element is of type mxArray.

- nrhs: The number of rhs (input) arguments.

- prhs: Pointer to an array that holds the input data; each element is of type const mxArray.

Page 15: Cuda

MX Functions

The collection of functions used to manipulate mxArrays is called the MX functions; their names begin with mx. Examples:

• mxArray creation functions: mxCreateNumericArray, mxCreateDoubleMatrix, mxCreateString, mxCreateDoubleScalar.

• Access data members of mxArrays: mxGetPr, mxGetPi, mxGetM, mxGetN.

• Modify data members: mxSetPr, mxSetPi.

• Manage mxArray memory: mxMalloc, mxCalloc, mxFree, mxDestroyArray.

Page 16: Cuda

Mex file for CUDA used in case study – Part 1

Compilation instructions under MATLAB:

nvmex -f nvmexopts.bat square_me_cuda.cu -IC:\cuda\include -LC:\cuda\lib -lcufft -lcudart

#include "cuda.h"
#include "mex.h"

/* Kernel to compute elements of the array on the GPU */
__global__ void guilherme_kernel(float* K, float* g1, float* M, int N)
{
    float k = 1.3f;   /* was "int k = 1.3", which truncates to 1 and
                         would not match the MATLAB code's k = 1.3 */
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i < N && j < N)
        M[i + j*N] = g1[N + i - j] * (K[i] + k) * (K[j] + k);
}

Page 17: Cuda

Mex file for CUDA used in case study – Part 2

/* Gateway function */
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    int j, m_0, m_1, m_o, n_0, n_1, n_o;
    double *data1, *data2, *data3;
    float *data1f, *data2f, *data3f;
    float *data1f_gpu, *data2f_gpu, *data3f_gpu;
    mxClassID category;

    if (nrhs != (nlhs+1))
        mexErrMsgTxt("The number of input and output arguments must be the same.");

    /* Find the dimensions of the data */
    m_0 = mxGetM(prhs[0]);
    n_0 = mxGetN(prhs[0]);

    /* Create an input data array on the GPU */
    cudaMalloc((void **) &data1f_gpu, sizeof(float)*m_0*n_0);

    /* Retrieve the input data */
    data1 = mxGetPr(prhs[0]);

    /* Check if the input array is single or double precision */
    category = mxGetClassID(prhs[0]);
    if (category == mxSINGLE_CLASS)
    {
        /* The input array is single precision: it can be sent directly to the card */
        cudaMemcpy(data1f_gpu, data1, sizeof(float)*m_0*n_0, cudaMemcpyHostToDevice);
    }

Page 18: Cuda

Mex file for CUDA used in case study – Part 3

    /* Find the dimensions of the data */
    m_1 = mxGetM(prhs[1]);
    n_1 = mxGetN(prhs[1]);

    /* Create an input data array on the GPU */
    cudaMalloc((void **) &data2f_gpu, sizeof(float)*m_1*n_1);

    /* Retrieve the input data */
    data2 = mxGetPr(prhs[1]);

    /* Check if the input array is single or double precision */
    category = mxGetClassID(prhs[1]);
    if (category == mxSINGLE_CLASS)
    {
        /* The input array is single precision: it can be sent directly to the card */
        cudaMemcpy(data2f_gpu, data2, sizeof(float)*m_1*n_1, cudaMemcpyHostToDevice);
    }

    /* Find the dimensions of the output */
    m_o = n_0;
    n_o = n_1;

    /* Create the output data array on the GPU */
    cudaMalloc((void **) &data3f_gpu, sizeof(float)*m_o*n_o);

Page 19: Cuda

Mex file for CUDA used in case study – Part 4

    /* Compute execution configuration using 128 threads per block */
    dim3 dimBlock(128);
    dim3 dimGrid((m_o*n_o)/dimBlock.x);
    if ((n_o*m_o) % 128 != 0)
        dimGrid.x += 1;

    /* Call function on GPU */
    guilherme_kernel<<<dimGrid, dimBlock>>>(data1f_gpu, data2f_gpu, data3f_gpu, n_o*m_o);

    data3f = (float *) mxMalloc(sizeof(float)*m_o*n_o);

    /* Copy result back to host */
    cudaMemcpy(data3f, data3f_gpu, sizeof(float)*n_o*m_o, cudaMemcpyDeviceToHost);

    /* Create an mxArray for the output data */
    plhs[0] = mxCreateDoubleMatrix(m_o, n_o, mxREAL);

    /* Create a pointer to the output data */
    data3 = mxGetPr(plhs[0]);

Page 20: Cuda

Part III: Device options

GPU Model   Memory          Threads   Price (R$)
8600 GT     256 MB            3,072       150.00
8600 GT     512 MB            3,072       300.00
8800 GT     512 MB           12,288       800.00
9800 GTX    512 MB (DDR3)    12,288     1,200.00
9800 GX2    1 GB (DDR3)      24,576     2,500.00

Page 21: Cuda

References

Gokhale, M. et al. Hardware Technologies for High-Performance Data-Intensive Computing. IEEE Computer (ISSN 0018-9162), p. 60, 2008.

Lietsch, S. et al. A CUDA-Supported Approach to Remote Rendering. Lecture Notes in Computer Science, 2007.

Fujimoto, N. Faster Matrix-Vector Multiplication on GeForce 8800 GTX. IEEE, 2008.

Book Reference

NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 1.1, 2007.

Page 22: Cuda

Questions ?

So long, and thanks for all the fish!