Tuesday, May 8, 2012

Logistic Regression with OpenCL and .NET [part 4 of 4]

This is the last post about the OpenCL implementation of the gradient descent algorithm applied to logistic regression model training. It will be rather short, since all I want to do is give you tangible source code to play with, and here it is: http://simpleclml.codeplex.com/.

Monday, May 7, 2012

Logistic regression with OpenCL and .NET [part 3 of 4]

In the previous posts I briefly showed you how to implement logistic regression using OpenCL and .NET. Unfortunately, that implementation is far from optimal; in fact, it runs faster on the CPU than on the GPU. Let's find out how to make it run faster on my AMD Radeon 5870.

Let's look at the float4-optimized grad1 kernel function from part 1:

kernel void grad14(global read_only float* X, global read_only float* Y, 
                        global read_only float4* theta, global write_only float* o, int feature_count)
{
    int j = get_global_id(0);        
    float4 s = 0.0;    
    global float4* x4 = (global float4*)(X+j*feature_count);    
    for(int i=0;i<feature_count>>2;++i)
    {
        s += x4[i] * theta[i];
    }
    o[j] = sigmoid(s.x+s.y+s.z+s.w) - Y[j];
}

It runs 4.5 times faster than the plain float version, but if we look at the performance counters collected by the AMD APP Profiler, we'll see that FetchUnitStalled is quite high. That means we've got channel/bank conflicts. The problem is that on the GPU we want adjacent work items (i.e. threads) to access adjacent memory addresses (refer to the AMD APP Programming Guide for details). To do so, we need to transpose X:

kernel void grad1_ns4(global read_only float* Xtranspose, global read_only float4* Y, 
                       global read_only float* theta, global write_only float4* o, int feature_count, int Xtransposestride)
{
    int j = get_global_id(0);  /* each work item now handles four consecutive examples */
    float4 s = 0.0;
    for(int i=0,k=0;i<feature_count;++i,k+=Xtransposestride)
    {
        float4 x4 = ((global float4*)(Xtranspose+k))[j];
        s += x4 * theta[i];
    }
    o[j] = sigmoid4(s) - Y[j]; /* sigmoid4 is a component-wise float4 version of sigmoid */
}
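
From the host, grad1_ns4 handles four consecutive examples per work item (Y and o are accessed as float4), so it is scheduled with exampleCount/4 work items, and the extra stride argument is the length of one feature row of Xtranspose in floats. Here is a rough sketch using the Cloo objects from part 2 (the kernel variable name and the stride value are my assumptions, not code from the original post):

kernelA = program.CreateKernel("grad1_ns4");
kernelA.SetMemoryArgument(0, bufXtranspose);
kernelA.SetMemoryArgument(1, bufY);
kernelA.SetMemoryArgument(2, bufTheta);
kernelA.SetMemoryArgument(3, bufO);
kernelA.SetValueArgument(4, featureCount);
kernelA.SetValueArgument(5, exampleCount); // assumed Xtransposestride: floats per feature row of Xtranspose

// one work item per four examples
commands.Execute(kernelA, null, new long[] { exampleCount / 4 }, null, null);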

Kernel function grad1_ns4() runs 3 times faster than grad14(). That looks OK, so let's turn our attention to the grad2 function.

Let's take a grad2 variant that already incorporates the memory layout fix (it reads the example-major X, so adjacent work items access adjacent addresses) and the float4 optimizations:

kernel void grad2_ns(global read_only float* X, global read_only float* o, 
                        global float4* theta, int example_count, float alpha, int Xstride)
{
    int i = get_global_id(0);   /* each work item updates four consecutive theta entries (theta is read as float4) */
    
    float4 s = 0.0;
    
    for(int j=0,k=0;j<example_count;j++,k+=Xstride)
    {
        float4 x4 = ((global float4*)(X+k))[i];
        s += x4 * o[j];
    }    

    theta[i] -= alpha * s / example_count;    
}

The main problem with this function is that the feature count is usually too low to fully load the device. In my case I have only 292 features; with the float4 optimization that's reduced to 73 work items. But my Radeon 5870 has 20 compute units, each capable of executing a 64-work-item wavefront, so to fully load the GPU we need 20*64 = 1280 work items. That means 5120 features! If we don't have that many features, we need to increase the work item count some other way.

To do so we are going to use local memory and work groups. Local memory is local to a compute unit, it's shared between all work items in a work group, and it's fast. To keep things simple, let's make sure that our example count is a multiple of 64 (I just sample extra items from the existing example set). Now we can schedule our kernel like this:

commands.Execute(kernelB, null, new long[] { featureCount / 4, 64 }, new long[] { 1, 64 }, null);

And the kernel function becomes:

#define SZ 64

kernel void grad2_ns_loc(global read_only float* X, global read_only float* o, 
                            global float4* theta, int example_count, float alpha, int Xstride)
{
    int i = get_global_id(0);       
    int loc = get_local_id(1);        
    
    float4 ls = 0.0;   // per-work-item partial sum over a strided subset of examples
    for(int j=0;j<example_count;j+=SZ)
    {        
        float4 x4 = ((global float4*)(X+(j+loc)*Xstride))[i];
        ls += x4 * o[j+loc];        
    }

    local float4 sum[SZ]; // allocate local memory
    sum[loc] = ls;    

    // make sure all work items in the group
    // finished executing sum[loc] = ls
    barrier(CLK_LOCAL_MEM_FENCE);
    
    if(loc == 0)
    {        
        float4 s = 0.0;
        for(int k=0;k<SZ;k++)
        {
            s+=sum[k];
        }
        theta[i] -= s * alpha / example_count;
    }
}

Now we have 64 times more work items, and that's enough to fully load the Cypress GPU. With 292 features this results in roughly ten times faster execution. Performance counters still show fetch stalls for both local and global memory, so I guess it can be improved further, but I'm going to stop here. My benchmark now shows that training is 17 times faster on the GPU than on the CPU, so I'm satisfied for now.

Sunday, May 6, 2012

Logistic regression with OpenCL and .NET [part 2 of 4]

In the previous post I showed two kernel functions for OpenCL. Today we'll learn how to run them on hardware. To use OpenCL from .NET we will use Cloo.

First we need to select a platform and create a context:

if (ComputePlatform.Platforms.Count == 0)
    throw new Exception("No OpenCL platforms available");

var platform = ComputePlatform.Platforms[0];            
ComputeContextPropertyList properties = new ComputeContextPropertyList(platform);
context = new ComputeContext(platform.QueryDevices().ToList(), properties, null, IntPtr.Zero);

I have only one platform, so I don't bother selecting.
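
If there were several platforms, a simple way to pick one would be to look for a platform that exposes a GPU device. A sketch (assuming Cloo's ComputeDevice.Type property and the ComputeDeviceTypes enum; requires System.Linq):

// Hypothetical platform selection: prefer a platform with a GPU device, otherwise take the first one.
var platform = ComputePlatform.Platforms
    .FirstOrDefault(p => p.Devices.Any(d => d.Type == ComputeDeviceTypes.Gpu))
    ?? ComputePlatform.Platforms[0];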

After that we need to compile our kernels:

private const string clProgramSource = " ... kernel code written in OpenCL C (C99-based) ... ";

/// .....

program = new ComputeProgram(context, clProgramSource);
program.Build(null, null, null, IntPtr.Zero);

If you have an error in your OpenCL C source, the build will fail without providing any detailed error information. To find out what's wrong with your code, use the AMD APP Kernel Analyzer.
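
Depending on your Cloo version you may also be able to read the compiler output directly. A hedged sketch (assuming ComputeProgram.GetBuildLog(ComputeDevice) is available):

try
{
    program.Build(null, null, null, IntPtr.Zero);
}
catch (ComputeException)
{
    // The build log usually contains the actual compiler error messages.
    foreach (var device in context.Devices)
        Console.WriteLine(program.GetBuildLog(device));
    throw;
}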

To feed our kernels with data we need to create buffers:

float[] X = new float[exampleCount * featureCount];
float[] Xtranspose = new float[featureCount * exampleCount];
float[] Y = new float[exampleCount];
float[] arrTheta = new float[featureCount];

// ... fill arrays with data ...

bufX = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, X);
bufXtranspose = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, Xtranspose);
bufY = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, Y);
bufTheta = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, arrTheta);
bufO = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite, exampleCount);
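
Filling Xtranspose from X is just an element reorder on the host. A minimal sketch (hypothetical helper, assuming the example-major layout of X used above):

// X is example-major (exampleCount rows of featureCount floats);
// Xtranspose is feature-major, so each feature's values are contiguous across examples.
static float[] TransposeExamples(float[] X, int exampleCount, int featureCount)
{
    var Xtranspose = new float[featureCount * exampleCount];
    for (int j = 0; j < exampleCount; j++)
        for (int i = 0; i < featureCount; i++)
            Xtranspose[i * exampleCount + j] = X[j * featureCount + i];
    return Xtranspose;
}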

As the last preparation step, we need to set up the kernel functions:

kernelA = program.CreateKernel("grad1");
kernelA.SetMemoryArgument(0, bufX);
kernelA.SetMemoryArgument(1, bufY);
kernelA.SetMemoryArgument(2, bufTheta);
kernelA.SetMemoryArgument(3, bufO);
kernelA.SetValueArgument(4, featureCount);

kernelB = program.CreateKernel("grad2");
kernelB.SetMemoryArgument(0, bufXtranspose);
kernelB.SetMemoryArgument(1, bufO);
kernelB.SetMemoryArgument(2, bufTheta);
kernelB.SetValueArgument(3, exampleCount);

To schedule these kernels for execution we need a command queue:

var device = context.Devices.OrderByDescending(d => d.MaxComputeUnits).First();
using (ComputeCommandQueue commands = new ComputeCommandQueue(context, device, ComputeCommandQueueFlags.None))
{
    // do computations
}

Let's schedule kernels:

double error = Single.PositiveInfinity;
int count = 0;
float[] thetabuf = new float[featureCount];
for (int i = 0; i < maxIterations / smallStep; i++)
{
    kernelB.SetValueArgument(4, alpha);

    // to compensate for ReadFromBuffer cost, execute several steps w/o checking
    for (int j = 0; j < smallStep; j++, count++)
    {
        commands.Execute(kernelA, null, new long[] { exampleCount}, null, null);
        commands.AddBarrier(); // ensures that next commands wait for previous to complete
        commands.Execute(kernelB, null, new long[] { featureCount}, null, null);
        commands.AddBarrier();
    }

    // Read updated theta from GPU
    commands.ReadFromBuffer(bufTheta, ref thetabuf, true, null);

    // ... calculate error, update learning ratio if needed, check exit conditions
}
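
The error check itself is elided above. One way to do it on the host is to recompute the logistic regression cost from the theta we just read back. A rough sketch (hypothetical helper, not part of the original code):

// Mean cross-entropy J(theta) = -1/m * sum( y*log(h) + (1-y)*log(1-h) ), where h = sigmoid(x . theta)
static double ComputeCost(float[] X, float[] Y, float[] theta, int exampleCount, int featureCount)
{
    double cost = 0.0;
    for (int j = 0; j < exampleCount; j++)
    {
        double z = 0.0;
        for (int i = 0; i < featureCount; i++)
            z += X[j * featureCount + i] * theta[i];
        double h = 1.0 / (1.0 + Math.Exp(-z));
        cost += Y[j] * Math.Log(h) + (1 - Y[j]) * Math.Log(1 - h);
    }
    return -cost / exampleCount;
}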

That's basically it. Note that all data except theta stays on the GPU side. But performance is far from stellar; I'll show you in the next post what can be done about that.

Saturday, May 5, 2012

Logistic regression with OpenCL and .NET [part 1 of 4]

Logistic regression is a simple tool for solving classification problems. It's much easier to understand and implement than neural networks or SVMs, and it's still pretty powerful. In this post I'd like to show how to train a logistic regression model on the GPU using the simple gradient descent algorithm.

I'll assume that you're familiar with logistic regression and the gradient descent algorithm. If not, please take a look at Andrew Ng's course on ML: https://www.coursera.org/course/ml.

First, we need to create the OpenCL kernel functions. To do so we have to use the OpenCL C language (a C99 derivative).

The first kernel function calculates the difference between the value returned by the hypothesis function for training example j and the actual label of that example. The vectorized form of this calculation is o = sigmoid(X*theta) - Y. To compute this difference for example j we need the whole theta vector but only the values of X and Y that belong to that example, so the difference can be calculated for each example in parallel:

float sigmoid(float z){return 1.0f/(1.0f+exp(-z));}

kernel void grad1(global read_only float* X, global read_only float* Y, global read_only float* theta, global write_only float* o, int feature_count)
{
    int j = get_global_id(0);  /* getting example index for the current thread */
    float s = 0.0;    
    global float* x = X+j*feature_count;
    for(int i=0;i<feature_count;++i)
    {
        s += x[i] * theta[i];
    }
    o[j] = sigmoid(s) - Y[j];
}

The second kernel function calculates the gradient vector, multiplies it by the learning rate alpha and then decrements theta by that value. The vectorized form of this update is theta = theta - transpose(X)*o*alpha/example_count. The matrix multiplication of transpose(X) and o means that we have a data dependency between examples, but no dependency between features, so we run the update for each element of vector theta in parallel:

kernel void grad2(global read_only float* Xtranspose, global read_only float* o, global float* theta, int example_count, float alpha)
{
    int i = get_global_id(0); /* getting feature index for the current thread */  
    int feature_count = get_global_size(0);
    float sum = 0.0;    
    global float* x = Xtranspose+i*example_count;
    for(int j=0;j<example_count;++j)
    {
        sum += x[j] * o[j];
    }
    theta[i] -= alpha * sum / example_count;
}
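
Before moving on, it can be handy to have a plain C# version of one gradient descent step to validate the GPU results against. A sketch mirroring grad1 and grad2 (my own reference code, not from the original post):

static void GradientDescentStep(float[] X, float[] Y, float[] theta,
                                int exampleCount, int featureCount, float alpha)
{
    // grad1: o = sigmoid(X*theta) - Y, one entry per training example
    var o = new float[exampleCount];
    for (int j = 0; j < exampleCount; j++)
    {
        float s = 0.0f;
        for (int i = 0; i < featureCount; i++)
            s += X[j * featureCount + i] * theta[i];
        o[j] = (float)(1.0 / (1.0 + Math.Exp(-s))) - Y[j];
    }

    // grad2: theta -= alpha * transpose(X)*o / example_count, one entry per feature
    for (int i = 0; i < featureCount; i++)
    {
        float sum = 0.0f;
        for (int j = 0; j < exampleCount; j++)
            sum += X[j * featureCount + i] * o[j];
        theta[i] -= alpha * sum / exampleCount;
    }
}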

If the feature count is a multiple of 4 (and, for grad2_tf4 below, the example count as well), we can use the float4 type to better exploit the GPU's vector registers:

kernel void grad1_ff4(global float* X, global float* Y, global float* theta, global float* o, int feature_count)
{
    int j = get_global_id(0);        
    float4 s = (float4)(0.0,0.0,0.0,0.0);    
    global float4* x4 = (global float4*)(X+j*feature_count);
    global float4* theta4 = (global float4*)theta;
    for(int i=0;i<feature_count>>2;++i)
    {
        s += x4[i] * theta4[i];
    }
    o[j] = sigmoid(s.x+s.y+s.z+s.w) - Y[j];
}

kernel void grad2_tf4(global float* X, global float* o, global float* theta, int example_count, float alpha)
{
    int i = get_global_id(0);   
    int feature_count = get_global_size(0);
    float4 sum = (float4)(0.0, 0.0, 0.0, 0.0);    
    global float4* x4 = (global float4*)(X+i*example_count);
    global float4* o4 = (global float4*)o;

    for(int j=0;j<example_count>>2;++j)
    {
        sum += x4[j] * o4[j];
    }
    theta[i] -= alpha * (sum.x+sum.y+sum.z+sum.w) / example_count;
}
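
If the counts don't line up, one simple option is to zero-pad the feature dimension on the host: padded features contribute nothing to X*theta and their gradient stays zero, so the result is unchanged. A minimal sketch (hypothetical helper, not from the post):

// Round featureCount up to a multiple of 4 and zero-pad each example's feature vector.
static float[] PadFeatures(float[] X, int exampleCount, int featureCount, out int paddedFeatureCount)
{
    paddedFeatureCount = (featureCount + 3) & ~3;
    var padded = new float[exampleCount * paddedFeatureCount];
    for (int j = 0; j < exampleCount; j++)
        Array.Copy(X, j * featureCount, padded, j * paddedFeatureCount, featureCount);
    return padded;
}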

That's all for today. In the next part I'll show how to execute these kernels on GPU from .NET using Cloo.

UPDATE: it turns out that OpenCL memory access optimization is a bit tricky, so these kernels in fact use X and Xtranspose incorrectly: grad1 should use Xtranspose and grad2 should use X. That leads to cache misses, but on the GPU it's better when adjacent work items access adjacent addresses... I'll write more on that in part 3.