Determining threads per block and block per grid

Let me start by saying that I've carefully read all similar questions on SO.

My intention is to calculate these values dynamically (rather than hardcoding them) for a feed-forward neural net library I am developing.

My data is not a square lattice (a matrix), as it is in most examples I've seen. It is instead two vectors producing a matrix, with unequal numbers of rows and columns:

    float x
    float * i_ptr = thrust::raw_pointer_cast( in_vec.data() );
    float * w_ptr = thrust::raw_pointer_cast( w_vec.data() );
    float * out_ptr = thrust::raw_pointer_cast( mtx_vec.data() );

And the kernel:

    __global__ void prop_mtx( float * w, float * i, float * o, int s )
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

The reason I've taken this approach is that it makes sense in ANN computation when it comes to vector/matrix calculations. I'd like to keep this consistent, and AFAIK using a 2D grid for Weight * Input calculations is reasonable.

I have to compute my threads per block as a 2D layout with unequal numbers of threads in the grid.

I am using a GTX 660, which has:

    CUDA Capability Major/Minor version number: 3.0
    Total amount of global memory: 2047 MBytes
    ( 5) Multiprocessors, (192) CUDA Cores/MP: 960 CUDA Cores
    Maximum number of threads per multiprocessor: 2048
    Maximum number of threads per block: 1024
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)

I am trying to understand how I can deduce/compute the grid size, threads per block, and number of blocks.

Let us assume I have a weight vector of 800 items and an input vector of 6500 items.

Does this imply that what I really need is a 2D grid of 800 x 6500? As far as I understand, anything else will produce incorrect results?

I know my maximum threads per block is 1024, but because it's a 2D grid, it would more likely be:

    dim3 threadPerBlock(X,Y);

Because my grid is not a square matrix, do I need to calculate the X, Y threads per block in a different way? Or do I need to deduce the number of blocks needed first?

Finally, since my thread warp size is 32:

Does the minimum grid size, regardless of all other parameters, need to be at least 32, or a multiple of 32?

Do I need at least 32 threads per block, or a grid size where the smallest number is 32?

Any pseudo-code, or explanation of how I should go about this, would be greatly appreciated.

What I have tried is to calculate my 2D grid size by dividing my data by the warp size of 32.

Then I considered calculating the grid threads by using the available SMs. For example, 800 weights / 5 SMs = 160 x's per SM. But I didn't know what to do from there on.

Finally, I considered finding the input-weight ratio first: 6500/800 = 8.125. This implies that, using the minimum of 32 for X, Y would have to be 8.125 * 32 = 260. Hence, my threadsPerBlock would be:

    dim3 threadsPerBlock(32,260);

That is, of course, 8320 threads per block, which far exceeds the limit of 1024 per block.

So this is my issue: how do I not exceed the 1024 threads per block, whilst retaining the correct grid size for my data?

PS: My question is not about optimising the code, but about understanding how to distribute the threads and grid data over the device.

One approach to categorizing computation problems is to discuss transformations and reductions.

A reduction is a category of problem which takes a large input data set and produces a small output data set. For example, taking an image and finding the maximum pixel value would be a reduction. For this discussion, we will ignore reductions.

A transformation is a category of computation where the output data set size (number of elements) is either "large" or "approximately the same" as the input data set size. For example, taking an image and producing a blurred image would be a transformation.

For transformations, a common approach ("thread strategy") to writing a CUDA kernel (the thread code) is to make one unique thread responsible for each point in the output array. Therefore, the minimum total number of threads that I must have is equal to the size of my output array. The thread code is just the set of computations needed on the input data in order to produce one output data point.
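One way to apply the "one thread per output point" strategy to the 800 x 6500 case is to fix a block shape that stays under the 1024-thread limit and let the number of blocks grow to cover the data. The following is a minimal host-side sketch, not the asker's library code. It is written as plain C++ so it can be checked without a GPU, and it should also compile unchanged inside a .cu file. `Dim2`, `ceil_div`, `grid_for` and `in_bounds` are hypothetical names introduced here for illustration, the 16 x 16 block shape is an assumption, and so is the choice of putting the 800 weights along x and the 6500 inputs along y.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3, so the sketch builds as plain C++.
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Integer ceiling division: the smallest q such that q * b >= a.
inline unsigned ceil_div(unsigned a, unsigned b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Number of blocks needed in each dimension so that grid * block covers
// at least width x height output points (one thread per output point).
inline Dim2 grid_for(unsigned width, unsigned height, Dim2 block) {
    // 1024 is the per-block thread limit reported for the GTX 660 above.
    assert(block.x * block.y <= 1024);
    return Dim2{ceil_div(width, block.x), ceil_div(height, block.y)};
}

// Mirrors the guard a kernel needs: the launched grid usually overshoots
// the data, so threads whose (x, y) falls outside it must do no work.
inline bool in_bounds(unsigned x, unsigned y, unsigned width, unsigned height) {
    return x < width && y < height;
}
```

With a 16 x 16 block (256 threads), `grid_for(800, 6500, Dim2{16, 16})` gives a 50 x 407 grid, which launches 800 x 6512 threads. The 12 surplus rows are the ones the bounds check has to reject. In real CUDA code the two values would be `dim3` variables passed to the kernel as `<<<grid, block>>>`. The block shape stays fixed and only the grid scales with the data, so the per-block limit is never exceeded.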