Cuda


Description

CUDA tutorial – MATLAB interface via MEX files

Transcript of Cuda

Page 1: Cuda

February 9, 2007

Uso de placas gráficas em computação de alto-desempenho

(High-performance computing using GPUs)

Mario Alexandre Gazziro (YAH!)

Advisor: Jan F. W. Slaets

24/09/08

Page 2: Cuda

2/9/07 Course Title 2

Part I: Overview

Definition: Introduced in 2006, the Compute Unified Device Architecture (CUDA) is a combined software and hardware architecture (available on NVIDIA G80 GPUs and later) that enables data-parallel, general-purpose computing on graphics hardware. It offers a C-like programming API with some language extensions.

Key Points: The architecture supports massively multithreaded applications and provides inter-thread communication and memory access.

Page 3: Cuda

Why is this topic important?

Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements.

Emerging hardware technologies, such as the CUDA architecture, can significantly boost the performance of a wide range of applications by increasing compute cycles and bandwidth and by reducing latency.

Page 4: Cuda

Where would I encounter this?

Gaming

Raytracing

3D Scanners

Computer Graphics

Number Crunching

Scientific Calculation

Page 5: Cuda

CUDA SDK sample applications

Page 6: Cuda

CUDA SDK sample applications

Page 7: Cuda

CUDA vs Intel

NVIDIA GeForce 8800 GTX vs. Intel Xeon E5335 (2 GHz, 8 MB L2 cache)

Page 8: Cuda

Grid of thread blocks

The computational grid consists of a grid of thread blocks

Each thread executes the kernel

The application specifies the grid and block dimensions

Grid and block layouts can be 1-, 2-, or 3-dimensional

The maximum grid and block sizes are fixed hardware limits of the GPU

Each block has a unique block ID

Each thread has a unique thread ID (within its block)

Page 9: Cuda

Elementwise Matrix Addition

Page 10: Cuda

Elementwise Matrix Addition

The nested for-loops are replaced with an implicit grid

Page 11: Cuda

Memory model

CUDA exposes all the different types of memory on the GPU: registers, local, shared, global, constant, and texture memory.

Page 12: Cuda

Part II: Accelerating MATLAB with CUDA

Case Study: Initial calculation for solving a sparse matrix by the method proposed by Professor Guilherme Sipahi of IFSC

N=1001;
K(1:N) = rand(1,N);
g1(1:2*N) = rand(1,2*N);
k = 1.3;
tic;
for i=1:N
    for j=1:N
        M(i,j) = g1(N+i-j)*(K(i)+k)*(K(j)+k);
    end
end
matlabTime=toc

tic;
M=guilherme_cuda(K,g1);
cudaTime=toc

speedup=matlabTime/cudaTime

Page 13: Cuda

Results: Speedup of 4.77 times using an NVIDIA 8400M with 128 MB

matlabTime =

10.6880

cudaTime =

2.2406

speedup =

4.7701

>>

Page 14: Cuda

The MEX file structure

The main() function is replaced with mexFunction.

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    /* code that handles the interface and calls the computational function */
    return;
}

mexFunction arguments:

- nlhs: The number of lhs (output) arguments.

- plhs: Pointer to an array that will hold the output data; each element is of type mxArray.

- nrhs: The number of rhs (input) arguments.

- prhs: Pointer to an array that holds the input data; each element is of type const mxArray.

Page 15: Cuda

MX Functions

The collection of functions used to manipulate mxArrays is called the MX functions; their names begin with mx. Examples:

• mxArray creation functions: mxCreateNumericArray, mxCreateDoubleMatrix, mxCreateString, mxCreateDoubleScalar.

• Access data members of mxArrays: mxGetPr, mxGetPi, mxGetM, mxGetN.

• Modify data members: mxSetPr, mxSetPi.

• Manage mxArray memory: mxMalloc, mxCalloc, mxFree, mxDestroyArray.

Page 16: Cuda

Mex file for CUDA used in case study – Part 1

Compilation instructions under MATLAB:

nvmex -f nvmexopts.bat square_me_cuda.cu -IC:\cuda\include -LC:\cuda\lib -lcufft -lcudart

#include "cuda.h"
#include "mex.h"

/* Kernel to compute elements of the array on the GPU */
__global__ void guilherme_kernel(float* K, float* g1, float* M, int N)
{
    float k = 1.3f;   /* was "int k = 1.3", which truncates to 1 and
                         would not match the MATLAB code's k = 1.3 */
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i < N && j < N)
        M[i + j*N] = g1[N + i - j] * (K[i] + k) * (K[j] + k);
}

Page 17: Cuda

Mex file for CUDA used in case study – Part 2

/* Gateway function */
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    int j, m_0, m_1, m_o, n_0, n_1, n_o;
    double *data1, *data2, *data3;
    float *data1f, *data2f, *data3f;
    float *data1f_gpu, *data2f_gpu, *data3f_gpu;
    mxClassID category;

    if (nrhs != (nlhs+1))
        mexErrMsgTxt("The number of input and output arguments must be the same.");

    /* Find the dimensions of the data */
    m_0 = mxGetM(prhs[0]);
    n_0 = mxGetN(prhs[0]);

    /* Create an input data array on the GPU */
    cudaMalloc((void **) &data1f_gpu, sizeof(float)*m_0*n_0);

    /* Retrieve the input data */
    data1 = mxGetPr(prhs[0]);

    /* Check if the input array is single or double precision */
    category = mxGetClassID(prhs[0]);
    if (category == mxSINGLE_CLASS)
    {
        /* The input array is single precision: it can be sent directly to the card */
        cudaMemcpy(data1f_gpu, data1, sizeof(float)*m_0*n_0, cudaMemcpyHostToDevice);
    }

Page 18: Cuda

Mex file for CUDA used in case study – Part 3

    /* Find the dimensions of the data */
    m_1 = mxGetM(prhs[1]);
    n_1 = mxGetN(prhs[1]);

    /* Create an input data array on the GPU */
    cudaMalloc((void **) &data2f_gpu, sizeof(float)*m_1*n_1);

    /* Retrieve the input data */
    data2 = mxGetPr(prhs[1]);

    /* Check if the input array is single or double precision */
    category = mxGetClassID(prhs[1]);
    if (category == mxSINGLE_CLASS)
    {
        /* The input array is single precision: it can be sent directly to the card */
        cudaMemcpy(data2f_gpu, data2, sizeof(float)*m_1*n_1, cudaMemcpyHostToDevice);
    }

    /* Find the dimensions of the output */
    m_o = n_0;
    n_o = n_1;

    /* Create the output data array on the GPU */
    cudaMalloc((void **) &data3f_gpu, sizeof(float)*m_o*n_o);

Page 19: Cuda

Mex file for CUDA used in case study – Part 4

    /* Compute execution configuration using 128 threads per block */
    dim3 dimBlock(128);
    dim3 dimGrid((m_o*n_o)/dimBlock.x);
    if ((n_o*m_o) % 128 != 0)
        dimGrid.x += 1;

    /* Call function on GPU */
    guilherme_kernel<<<dimGrid, dimBlock>>>(data1f_gpu, data2f_gpu, data3f_gpu, n_o*m_o);

    data3f = (float *) mxMalloc(sizeof(float)*m_o*n_o);

    /* Copy result back to host */
    cudaMemcpy(data3f, data3f_gpu, sizeof(float)*n_o*m_o, cudaMemcpyDeviceToHost);

    /* Create an mxArray for the output data */
    plhs[0] = mxCreateDoubleMatrix(m_o, n_o, mxREAL);

    /* Create a pointer to the output data */
    data3 = mxGetPr(plhs[0]);

Page 20: Cuda

Part III: Device options

GPU Model   Memory          Threads   Price (R$)
8600 GT     256 MB            3,072       150.00
8600 GT     512 MB            3,072       300.00
8800 GT     512 MB           12,288       800.00
9800 GTX    512 MB (DDR3)    12,288     1,200.00
9800 GX2    1 GB (DDR3)      24,576     2,500.00

Page 21: Cuda

References

Gokhale, M. et al. Hardware Technologies for High-Performance Data-Intensive Computing. IEEE Computer (ISSN 0018-9162), p. 60, 2008.

Lietsch, S. et al. A CUDA-Supported Approach to Remote Rendering. Lecture Notes in Computer Science, 2007.

Fujimoto, N. Faster Matrix-Vector Multiplication on GeForce 8800 GTX. IEEE, 2008.

Book Reference

NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 1.1, 2007.

Page 22: Cuda

Questions ?

So long, and thanks for all the fish!