Sunday, May 6, 2012

Logistic regression with OpenCL and .NET [part 2 of 4]

In the previous post I showed two kernel functions for OpenCL. Today we'll learn how to run them on actual hardware. To use OpenCL from .NET we'll use the Cloo library.

First we need to select a platform and create a context:

if (ComputePlatform.Platforms.Count == 0)
    throw new Exception("No OpenCL platforms available");

// take the first platform and create a context spanning all of its devices
var platform = ComputePlatform.Platforms[0];
ComputeContextPropertyList properties = new ComputeContextPropertyList(platform);
context = new ComputeContext(platform.QueryDevices().ToList(), properties, null, IntPtr.Zero);
I have only one platform on this machine, so I don't bother selecting one.
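
If you have more than one platform (say, a CPU runtime and a GPU driver), you'll want to pick one explicitly. A minimal sketch of selecting by vendor string; the "AMD"/"NVIDIA" substrings are just examples of what ComputePlatform.Vendor may report on your machine, and the query uses LINQ:

// prefer a platform whose vendor string matches, otherwise fall back to the first one
var platform = ComputePlatform.Platforms
    .FirstOrDefault(p => p.Vendor.Contains("AMD") || p.Vendor.Contains("NVIDIA"))
    ?? ComputePlatform.Platforms[0];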

After that we need to compile our kernels:

private const string clProgramSource = " ... kernel code written in OpenCL C (C99-based) ... ";

/// .....

program = new ComputeProgram(context, clProgramSource);
program.Build(null, null, null, IntPtr.Zero);
If there is an error in your OpenCL C source, the build will fail without any detailed error information. To find out what's wrong with your code, use the AMD APP Kernel Analyzer.
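
Alternatively, you can catch the build failure and dump the compiler output from inside your program. This is only a sketch; it assumes your Cloo version exposes ComputeProgram.GetBuildLog (a wrapper around clGetProgramBuildInfo):

try
{
    program.Build(null, null, null, IntPtr.Zero);
}
catch (ComputeException)
{
    // print the compiler output for every device we built for
    foreach (var dev in context.Devices)
        Console.WriteLine("Build log for {0}:\n{1}", dev.Name, program.GetBuildLog(dev));
    throw;
}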

To feed our kernels with data we need to create buffers:

float[] X = new float[exampleCount * featureCount];          // feature matrix
float[] Xtranspose = new float[featureCount * exampleCount]; // the same matrix, transposed
float[] Y = new float[exampleCount];                         // labels
float[] arrTheta = new float[featureCount];                  // model parameters

// ... fill arrays with data ...

// CopyHostPointer uploads the array contents to the device when the buffer is created
bufX = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, X);
bufXtranspose = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, Xtranspose);
bufY = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, Y);
bufTheta = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, arrTheta);
// bufO holds per-example intermediate results and never needs to leave the device
bufO = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite, exampleCount);
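
For what it's worth, here is how I think about the layout these arrays expect: X is stored row-major with one training example per row, and Xtranspose is the same matrix with the indices swapped. The fill loop below is purely illustrative (trainingSet and labels are made-up names, not part of the original code), so double-check the layout against the kernels from part 1:

// Hypothetical fill: trainingSet[i][j] is feature j of example i, labels[i] is its class (0 or 1)
for (int i = 0; i < exampleCount; i++)
{
    Y[i] = labels[i];
    for (int j = 0; j < featureCount; j++)
    {
        X[i * featureCount + j] = trainingSet[i][j];          // row-major: example i, feature j
        Xtranspose[j * exampleCount + i] = trainingSet[i][j]; // transposed layout
    }
}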

As the last preparation step we need to set up the kernel functions and bind their arguments:

kernelA = program.CreateKernel("grad1");
kernelA.SetMemoryArgument(0, bufX);
kernelA.SetMemoryArgument(1, bufY);
kernelA.SetMemoryArgument(2, bufTheta);
kernelA.SetMemoryArgument(3, bufO);
kernelA.SetValueArgument(4, featureCount);

kernelB = program.CreateKernel("grad2");
kernelB.SetMemoryArgument(0, bufXtranspose);
kernelB.SetMemoryArgument(1, bufO);
kernelB.SetMemoryArgument(2, bufTheta);
kernelB.SetValueArgument(3, exampleCount);
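
One gotcha with SetValueArgument: it is generic over the CLR type of the value, and that type has to match the kernel parameter exactly, otherwise the underlying clSetKernelArg call rejects the argument because of a size mismatch. So if grad2 declares alpha as float, make sure you pass a C# float, not a double. A tiny sketch (declaring alpha locally here is just for illustration):

float alpha = 0.01f;                  // float, to match the kernel's float parameter
kernelB.SetValueArgument(4, alpha);   // fine
// kernelB.SetValueArgument(4, 0.01); // wrong: 0.01 is a double, the argument size won't match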

To schedule these kernels for execution we need a command queue:

var device = context.Devices.OrderByDescending(d => d.MaxComputeUnits).First();
using (ComputeCommandQueue commands = new ComputeCommandQueue(context, device, ComputeCommandQueueFlags.None))
{
    // do computations
}
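
If the platform exposes both CPU and GPU devices, it can make sense to restrict the choice to GPUs before picking the one with the most compute units. A small sketch of that variation (again using LINQ):

// prefer a GPU; fall back to whatever device has the most compute units
var device = context.Devices
    .Where(d => d.Type == ComputeDeviceTypes.Gpu)
    .OrderByDescending(d => d.MaxComputeUnits)
    .FirstOrDefault()
    ?? context.Devices.OrderByDescending(d => d.MaxComputeUnits).First();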

Let's schedule the kernels:

double error = Double.PositiveInfinity;
int count = 0;
float[] thetabuf = new float[featureCount];
for (int i = 0; i < maxIterations / smallStep; i++)
{
    kernelB.SetValueArgument(4, alpha);

    // to compensate for the ReadFromBuffer cost, execute several steps without checking convergence
    for (int j = 0; j < smallStep; j++, count++)
    {
        commands.Execute(kernelA, null, new long[] { exampleCount }, null, null);
        commands.AddBarrier(); // ensures that the next command waits for the previous ones to complete
        commands.Execute(kernelB, null, new long[] { featureCount }, null, null);
        commands.AddBarrier();
    }

    // Read the updated theta back from the GPU
    commands.ReadFromBuffer(bufTheta, ref thetabuf, true, null);

    // ... calculate error, update the learning rate if needed, check exit conditions
}
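
When the loop finishes it is good practice to release the OpenCL objects explicitly; in Cloo they implement IDisposable. A minimal cleanup sketch, assuming the fields are named as above:

// release device-side resources; the GC has no idea how much GPU memory they hold
kernelA.Dispose();
kernelB.Dispose();
program.Dispose();
bufX.Dispose();
bufXtranspose.Dispose();
bufY.Dispose();
bufTheta.Dispose();
bufO.Dispose();
context.Dispose();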

That's basically it. Note that all data except theta stays on the GPU side. But performance is far from stellar; I'll show you in the next post what can be done about that.
