algorithm - OpenCL Cholesky Decomposition -
algorithm - OpenCL Cholesky Decomposition -
i implemented next cholesky decomposition algorithm using opencl. code exhibiting random behavior. matches cpu output times. can please help me figure out wrong implementation.
here algorithm:
procedure cholesky(a) int i, j, k; k := 0 n − 1 /* 1st loop */     /* obtain square root of diagonal element. */     a[k, k] := a[k, k];      j := k + 1 n − 1 /* 2nd loop */     /*  partition step. */         a[k, j] := a[k, j]/a[k, k];     end      := k + 1 n − 1 /* 3rd loop */         j := n − 1 /* 4th loop */             /* elimination step. */             a[i, j] := a[i, j] - a[k, i] × a[k, j];          end     end end    methodology parallelize above algorithm:
from algorithm, elimination step expensive. have outermost loop in host code, , phone call kernel within loop. single run of kernel corresponds single iteration of 3rd loop. therefore, launch (n-1 )- (k+1) + 1 work groups. number of work items within workgroup set n/2. 2nd loop computed within kernel, allow first workgroup it.
relevant host code
// 10 x 10 matrix, matrix_size = 10 localworksize[0] = matrix_size/2; stride = matrix_size/2; cl_event             event;  for(k = 0; k < matrix_size; k++) {      int isize = (matrix_size-1) - (k+1) + 1;     int num_blocks = isize;     if(num_blocks <= 0)             num_blocks = 1;     globalworksize[0] = num_blocks * wa/2;      errcode = clsetkernelarg(clkernel, 0, sizeof(int), (void *)&k);     errcode |= clsetkernelarg(clkernel, 1, sizeof(cl_mem), (void *)&d_a);     errcode |= clsetkernelarg(clkernel, 2, sizeof(int), (void *)&stride);      errcode = clenqueuendrangekernel(clcommandqueue,           clkernel, 1, null, globalworksize,           localworksize, 0, null, &event);      opencl_checkerror(errcode, "clenqueuendrangekernel");     clfinish(clcommandqueue); }    kernel code
__kernel void batchedcholesky(__global float *u, int k, int stride) {      int tx = get_global_id(0);     unsigned int j;     unsigned int num_rows = matrix_size;      if(tx==0)     {             // take square root of diagonal element             u[k * num_rows + k] = sqrt(u[k * num_rows + k]);     }     barrier(clk_global_mem_fence);      int offset = (k+1); //from original loop     int jstart = get_local_id(0) + offset;     int jstep = stride;     int jtop = num_rows - 1;      int jbottom = (k + 1);     //do work iteration     //division step     if(get_group_id(0) == 0)     {             for(j = jstart; (j >= jbottom) && (j <= jtop); j+=jstep)             {                     u[k * num_rows + j] /= u[k * num_rows + k]; //  partition step             }     }     barrier(clk_global_mem_fence);      j = 0;      int = get_group_id(0) + (k+1);      offset = i;      jstart = get_local_id(0) + offset;      jbottom = i;      for( j = jstart; j >= jbottom && j <= jtop; j += jstep)              u[i * num_rows + j] -= u[k * num_rows + i] * u[k * num_rows + j];      barrier(clk_global_mem_fence);    }       
not of work items execute @ same time, may run in batches. code running prior clk_global_mem_fence won't include every value. may source of errors.
if require global synchronization, utilize multiple kernels.
 algorithm opencl blas 
 
  
Comments
Post a Comment