When processing 2D images with CUDA, a natural intuition is to use a 2D block and grid shape. If we want to use the maximum possible block size, we have to make sure that the product of its dimensions does not exceed the block size limit. Keeping in mind the block size limit of 1024 threads, a block of size (32,32), for example, is valid because 32 x 32 = 1024.

Next comes the calculation of the 2D grid size. If we want to map a thread to every pixel, the grid should be created such that the total number of threads in each dimension is at least equal to the corresponding image dimension. Remember that grid size means the number of blocks in each dimension, so the total number of threads in a dimension is equal to the product of grid size and block size in that dimension. For a 2D grid, the number of threads in the X dimension is equal to block.x * grid.x and in the Y dimension equal to block.y * grid.y.

Assume you have an image of size 400 x 900; the total number of threads in each dimension should then be at least 400 and 900 respectively. Let's say you choose a block of size (32,32). Then the number of blocks for the x and y dimensions of the image should be 400/32 and 900/32. But neither image dimension is an integer multiple of the corresponding block dimension, so due to integer division we end up creating a grid of size 12 x 28, which results in a total number of threads equal to 384 x 896 (because 32 x 12 = 384 and 32 x 28 = 896).

As we can see, the total number of threads in each dimension is less than the corresponding image dimension. What we need to do is round up the number of blocks so that, if the image dimension is not a multiple of the block dimension, we create an additional block which covers the remaining pixels. Instead of integer division to calculate the number of blocks, we use floating-point division and take the ceiling of the result:

    int image_width = 400;
    int image_height = 900;
    grid.x = ceil( float(image_width) / block.x );
    grid.y = ceil( float(image_height) / block.y );

Another smart way is to use the following formula:

    grid.x = (image_width + block.x - 1) / block.x;
    grid.y = (image_height + block.y - 1) / block.y;

When the grid is created in either of the above ways, you end up with a grid of size 13 x 29, which results in a total number of threads equal to 416 x 928.

Now the total number of threads in each dimension is greater than the corresponding image dimension. Some of the threads will therefore access memory outside the image bounds, causing undefined behavior. The solution is to perform bounds checks inside the kernel and do processing only with those threads which fall inside the image bounds. Of course, to do that, we need to pass the image dimensions as arguments to the kernel. The following sample kernel shows this process.

    __global__ void kernel(unsigned char* image, int width, int height)
    {
        int xIndex = blockIdx.x * blockDim.x + threadIdx.x; // image x index, or column number
        int yIndex = blockIdx.y * blockDim.y + threadIdx.y; // image y index, or row number

        if (xIndex < width && yIndex < height)
        {
            // do the per-pixel processing only inside this check
        }
    }

    kernel<<<grid, block>>>(..., image_width, image_height);

The same indexing applies in one dimension. A vector kernel takes 4 arguments: pointers to the input vectors A and B, a pointer to the output vector C, and the size of a vector. There we have a 1D grid with multiple thread blocks, so we can calculate the global index by summing the block offset (blockIdx.x * blockDim.x) and the threadIdx.x value.
I am doing a matrix multiplication in CUDA. The following setup works:

    int TILE = 8;
    dim3 DimGrid((numCColumns - 1)/TILE + 1, (numCRows - 1)/TILE + 1, 1);

But if I use one block for the whole image, it returns all zeros. What is the reason for that? Assume one block can contain the whole image (the input is 64x64).

This is how I call the kernel in the main function:

    matrixMultiply<<<...>>>(deviceA, deviceB, deviceC, ...

And the kernel:

    __global__ void matrixMultiply(float * A, float * B, float * C,
                                   /* ... */
                                   int numCRows, int numCColumns)
    {
        // Insert code to implement matrix multiplication here
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;
        /* ... */
    }