Monday, May 7, 2012

Logistic regression with OpenCL and .NET [part 3 of 4]

In the previous posts I briefly showed how to implement logistic regression using OpenCL and .NET. Unfortunately, that implementation is far from optimal: in fact, it runs faster on the CPU than on the GPU. Let's find out how to make it run faster on my AMD Radeon 5870.

Let's look at the float4-optimized grad1 kernel function from part 1:

kernel void grad14(global const float* X, global const float* Y,
                        global const float4* theta, global float* o, int feature_count)
{
    int j = get_global_id(0);  // one work item per example
    float4 s = (float4)(0.0f);
    global const float4* x4 = (global const float4*)(X + j * feature_count);
    for(int i = 0; i < (feature_count >> 2); ++i)  // feature_count / 4 float4 elements
    {
        s += x4[i] * theta[i];
    }
    o[j] = sigmoid(s.x + s.y + s.z + s.w) - Y[j];  // prediction error for example j
}
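
These kernels rely on the sigmoid() and sigmoid4() helper functions defined back in part 1. For reference, they look roughly like this (a sketch of the usual definitions, not the exact code from part 1):

float sigmoid(float x)
{
    return 1.0f / (1.0f + exp(-x));
}

float4 sigmoid4(float4 x)
{
    // exp() and the scalar arithmetic are overloaded for vectors in OpenCL
    return 1.0f / (1.0f + exp(-x));
}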

grad14() runs 4.5 times faster than the plain float version, but if we look at the performance counters collected by AMD APP Profiler, we'll see that FetchUnitStalled is quite high. That means we've got channel/bank conflicts. The problem is that on a GPU we want adjacent work items (i.e. threads) to access adjacent memory addresses. (Refer to the AMD APP Programming Guide for details.) To achieve that, we need to transpose X:

kernel void grad1_ns4(global const float* Xtranspose, global const float4* Y,
                       global const float* theta, global float4* o, int feature_count, int Xtransposestride)
{
    int j = get_global_id(0);  // one work item per 4 examples
    float4 s = (float4)(0.0f);
    for(int i = 0, k = 0; i < feature_count; ++i, k += Xtransposestride)
    {
        // feature i of 4 adjacent examples: adjacent work items read adjacent addresses
        float4 x4 = ((global const float4*)(Xtranspose + k))[j];
        s += x4 * theta[i];
    }
    o[j] = sigmoid4(s) - Y[j];  // errors for 4 examples at once
}
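
On the host side this means uploading X in transposed form. A minimal C# sketch of the transposition (the array and variable names here are my own, not code from the earlier parts):

// Hypothetical host-side transpose: X is exampleCount rows of featureCount floats;
// Xtranspose is featureCount rows of 'stride' floats, where stride is the example
// count rounded up to a multiple of 4 so each row can be read as float4s.
int stride = (exampleCount + 3) / 4 * 4;
var Xtranspose = new float[featureCount * stride];
for (int j = 0; j < exampleCount; j++)
    for (int i = 0; i < featureCount; i++)
        Xtranspose[i * stride + j] = X[j * featureCount + i];

The padded stride is what the kernel receives as Xtransposestride.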

Kernel function grad1_ns4() runs 3 times faster than grad14(). That looks good, so let's turn our attention to the grad2 function.

Let's take the grad2 variant that already makes use of the transposed X and float4 optimizations:

kernel void grad2_ns(global const float* X, global const float* o,
                        global float4* theta, int example_count, float alpha, int Xstride)
{
    int i = get_global_id(0);  // one work item per 4 features

    float4 s = (float4)(0.0f);

    for(int j = 0, k = 0; j < example_count; j++, k += Xstride)
    {
        // features 4i..4i+3 of example j
        float4 x4 = ((global const float4*)(X + k))[i];
        s += x4 * o[j];
    }

    theta[i] -= alpha * s / example_count;  // gradient descent update
}
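
For completeness, these one-dimensional kernels are enqueued with just a global work size, letting the runtime pick the work group size; something along these lines (the kernel variable names are assumptions):

// grad1_ns4: one work item per 4 examples
commands.Execute(kernelGrad1, null, new long[] { exampleCount / 4 }, null, null);
// grad2_ns: one work item per 4 features
commands.Execute(kernelGrad2, null, new long[] { featureCount / 4 }, null, null);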

The main problem with grad2_ns() is that the feature count is usually too low to fully load the device. In my case I have only 292 features, and with the float4 optimization that shrinks to 73 work items. But my Radeon 5870 has 20 compute units, each capable of executing 64 work items at a time, so fully loading the GPU takes 20*64 = 1280 work items, which would mean 5120 features. When we don't have that many features, we need to increase the work item count some other way.

To do so we are going to use local memory and work groups. Local memory is local to a compute unit, shared between all work items in a work group, and fast. To keep things easy, let's make sure that our example count is a multiple of 64 (I just sample extra items from the existing example set; a host-side sketch follows the Execute call below). Now we can schedule our kernel like this:

commands.Execute(kernelB, null, new long[] { featureCount / 4, 64 }, new long[] { 1, 64 }, null);
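
This launches featureCount/4 * 64 work items in groups of 1 * 64, so each group of 64 work items cooperates on a single float4 of theta. Padding the example set to a multiple of 64 on the host might look like this (a sketch with hypothetical names; the original code isn't shown here):

// Round exampleCount up to a multiple of 64 and fill the extra
// rows by sampling from the existing examples.
int padded = (exampleCount + 63) / 64 * 64;
var Xpadded = new float[padded * Xstride];
var Ypadded = new float[padded];
Array.Copy(Xhost, Xpadded, exampleCount * Xstride);
Array.Copy(Yhost, Ypadded, exampleCount);
var rnd = new Random();
for (int j = exampleCount; j < padded; j++)
{
    int src = rnd.Next(exampleCount);
    Array.Copy(Xhost, src * Xstride, Xpadded, j * Xstride, Xstride);
    Ypadded[j] = Yhost[src];
}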

And the kernel function becomes:

#define SZ 64

kernel void grad2_ns_loc(global const float* X, global const float* o,
                            global float4* theta, int example_count, float alpha, int Xstride)
{
    int i = get_global_id(0);   // which float4 of theta this group works on
    int loc = get_local_id(1);  // position within the 64-item work group

    float4 ls = (float4)(0.0f);
    for(int j = 0; j < example_count; j += SZ)
    {
        float4 x4 = ((global const float4*)(X + (j + loc) * Xstride))[i];
        ls += x4 * o[j + loc];
    }

    local float4 sum[SZ]; // allocate local memory
    sum[loc] = ls;

    // make sure all work items in the group
    // have finished executing sum[loc] = ls
    barrier(CLK_LOCAL_MEM_FENCE);

    if(loc == 0)
    {
        float4 s = (float4)(0.0f);
        for(int k = 0; k < SZ; k++)
        {
            s += sum[k];
        }
        theta[i] -= s * alpha / example_count;
    }
}

Now we have 64 times more work items, which is enough to fully load the Cypress GPU. With 292 features this results in ten times faster execution; my benchmark now shows the GPU running 17 times faster than the CPU, so I'm satisfied for now. The performance counters still show fetch stalls for both local and global memory, so I guess it can be improved further, but I'm going to stop here.
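
For the curious, one standard next step would be to replace the serial summation in work item 0 with a tree reduction in local memory; a sketch (not benchmarked here):

// Replaces the loc == 0 loop at the end of grad2_ns_loc.
for(int stride = SZ / 2; stride > 0; stride >>= 1)
{
    if(loc < stride)
        sum[loc] += sum[loc + stride];
    barrier(CLK_LOCAL_MEM_FENCE); // every work item must reach the barrier
}
if(loc == 0)
    theta[i] -= sum[0] * alpha / example_count;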
