Logistic regression is a simple tool to solve classification problems. It's much easier to understand and implement than neural networks or SVM and it's still pretty powerful. In this post I'd like to show how to train logistic regression model on GPU using simple gradient descent algorithm.
Let me assume that you're familiar with logistic regression and gradient descent algorithm. If not, please take a look at Andrew Ng's course on ML: https://www.coursera.org/course/ml.
First, we need to create OpenCL kernel functions. To do so we have to use CL C99 language.
First kernel function calculates difference between value returned by hypothesis function for training example j and actual label for that example. Vectorized form of this calculation looks like o = sigmoid(X*theta)-Y. That means that to calculate such a difference for example j we need all values from theta vector and only values from X and y that are related to the example, and thus this difference can be calculated for each example in parallel:
float sigmoid(float z){return 1.0/(1.0+exp(-z));} kernel void grad1(global read_only float* X, global read_only float* Y, global read_only float* theta, global write_only float* o, int feature_count) { int j = get_global_id(0); /* getting example index for the current thread */ float s = 0.0; global float* x = X+j*feature_count; for(int i=0;i<feature_count;++i) { s += x[i] * theta[i]; } o[j] = sigmoid(s) - Y[j]; }
Second kernel function calculates gradient vector, multiplies it by learn ratio alpha and then decrements theta by its value. Vectorized form of this update looks like theta = theta - transpose(X)*o*alpha/example_count. Matrix multiplication of transpose(X) and o means that we have data dependency between examples, but we have no dependency between features, so we run update for each element of vector theta in parallel:
kernel void grad2(global read_only float* Xtranspose, global read_only float* o, global float* theta, int example_count, float alpha) { int i = get_global_id(0); /* getting feature index for the current thread */ int feature_count = get_global_size(0); float sum = 0.0; global float* x = Xtranspose+i*example_count; for(int j=0;j<example_count;++j) { sum += x[j] * o[j]; } theta[i] -= alpha * sum / example_count; }
If feature count is a multiply of 4, we can use float4 type to fully exploit GPU registers:
kernel void grad1_ff4(global float* X, global float* Y, global float* theta, global float* o, int feature_count) { int j = get_global_id(0); float4 s = (float4)(0.0,0.0,0.0,0.0); global float4* x4 = (global float4*)(X+j*feature_count); global float4* theta4 = (global float4*)theta; for(int i=0;i<feature_count>>2;++i) { s += x4[i] * theta4[i]; } o[j] = sigmoid(s.x+s.y+s.z+s.w) - Y[j]; } kernel void grad2_tf4(global float* X, global float* o, global float* theta, int example_count, float alpha) { int i = get_global_id(0); int feature_count = get_global_size(0); float4 sum = (float4)(0.0, 0.0, 0.0, 0.0); global float4* x4 = (global float4*)(X+i*example_count); global float4* o4 = (global float4*)o; for(int j=0;j<example_count>>2;++j) { sum += x4[j] * o4[j]; } theta[i] -= alpha * (sum.x+sum.y+sum.z+sum.w) / example_count; }
That's all for today. In the next part I'll show how to execute these kernels on GPU from .NET using Cloo.
UPDATE: it turns out that OpenCL memory access optimization is a bit tricky, so these kernels in fact use X and Xtranspose incorrectly. Function grad1 should use Xtranspose and grad2 should use X. That leads to cache misses, but it's better when adjacent work items access adjacent addresses... I'll write more on that later.
No comments:
Post a Comment