GPU kernel stalls indefinitely using Numba with CUDA

In an application I need to bin 3D point clouds into a voxel grid (a 3D histogram). Initially scipy.stats.binned_statistic_dd was used for this, but it was too slow to keep up with the real-time application we are building. To speed up the process, the choice was made to try Numba with CUDA and perform the binning on the GPU.
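For context, the original CPU implementation was roughly equivalent to the following sketch (simplified; the exact call in our application differs):

import numpy as np
from scipy.stats import binned_statistic_dd

# Simplified sketch of the scipy-based CPU approach: bin the 3D points into a
# 100^3 grid and compute a per-voxel statistic of the weights. This could not
# keep up with a 25 Hz loop of ~70,000 points.
points = np.random.rand(70000, 3) * 50.0    # x, y, z columns
weights = np.random.rand(70000) * 256.0

counts, edges, _ = binned_statistic_dd(points, weights, statistic='count',
                                       bins=100, range=[(0., 50.)] * 3)

The following program is my first attempt at the CUDA version, and it is a nice speedup over what scipy could offer: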

import numpy as np
import numba
from numba import cuda
import time
@numba.jit(nopython=True) 
def compute_bin(x, n, xmin, xmax):
    # computes the bin that the given location should be in
    # special case to mirror NumPy behavior for last bin
    if x == xmax:
        return n - 1 # a_max always in last bin

    bin = np.int32(n * (x - xmin) / (xmax - xmin))

    if bin < 0 or bin >= n:
        return None
    else:
        return bin

@cuda.jit()
def histogram(dim1, dim2, dim3, dim1min, dim1max, dim2min, dim2max, dim3min, dim3max, val, histogram_out, histogram_sum_out, histogram_mean_out, nbins):
    # Adds the given pointcloud to the three histograms that are on the gpu
    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for i in range(start, dim1.shape[0], stride):
        # note that calling a numba.jit function from CUDA automatically
        # compiles an equivalent CUDA device function!
        bin_number_1 = compute_bin(dim1[i], nbins, dim1min, dim1max)
        bin_number_2 = compute_bin(dim2[i], nbins, dim2min, dim2max)
        bin_number_3 = compute_bin(dim3[i], nbins, dim3min, dim3max)

        if (bin_number_1 >= 0 and bin_number_1 < histogram_out.shape[0])\
            and (bin_number_2 >= 0 and bin_number_2 < histogram_out.shape[0])\
            and (bin_number_3 >= 0 and bin_number_3 < histogram_out.shape[0]):
            index = bin_number_3 + bin_number_2 * nbins + bin_number_1 * nbins**2
            cuda.atomic.add(histogram_out, index, 1)
            cuda.atomic.add(histogram_sum_out, index, val[i])
            # store the running mean: clamp the old value down to 0 with atomic.min,
            # then write the new mean with atomic.max (used here as an atomic store)
            meanval = histogram_sum_out[index] / histogram_out[index]
            cuda.atomic.min(histogram_mean_out, index, 0)
            cuda.atomic.max(histogram_mean_out, index, meanval)

@cuda.jit
def set_histogram_to_zeros(histogram_out, histogram_sum_out, histogram_mean_out):
    # set all histograms on the gpu back to 0.
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, histogram_out.shape[0], stride):
        cuda.atomic.min(histogram_out, i, 0)
        cuda.atomic.min(histogram_sum_out, i, 0)
        cuda.atomic.min(histogram_mean_out, i, 0)
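As a quick sanity check (not part of the application itself), the flattened index used in the kernel is consistent with the C-order reshape applied later when the histograms are copied back to the host:

import numpy as np

# Voxel (b1, b2, b3) is stored at flat index b3 + b2*nbins + b1*nbins**2, which
# is exactly NumPy's default C-order layout for a (nbins, nbins, nbins) array.
nbins = 4
cube = np.arange(nbins ** 3).reshape(nbins, nbins, nbins)
b1, b2, b3 = 1, 2, 3
assert cube[b1, b2, b3] == b3 + b2 * nbins + b1 * nbins ** 2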

This class is used to keep track of the binned data, where bins is the number of bins in each of the three dimensions and the three histograms keep track of the summed values, count, and mean value of the input cloud:

class histoholder:
    # class that takes care of the interface to the cuda functions, it creates three histograms of the appropriate size. 
    def __init__(self, bins):
        self.bins = bins
        self.histogram_out = cuda.to_device(np.zeros(shape=(bins * bins * bins,), dtype=np.int32))
        self.histogram_out_sum = cuda.to_device(np.zeros(shape=(bins * bins * bins,), dtype=np.float32))
        self.histogram_out_mean = cuda.to_device(np.zeros(shape=(bins * bins * bins,), dtype=np.float32))

    def numba_gpu_histogram(self, dim1, dim2, dim3, w):

        # Move data to GPU so we can do two operations on it

        dim1_gpu = cuda.to_device(dim1)
        dim2_gpu = cuda.to_device(dim2)
        dim3_gpu = cuda.to_device(dim3)
        w_gpu = cuda.to_device(w)

        # add the pointcloud to the existing histograms
        histogram[128, 512](dim1_gpu, dim2_gpu, dim3_gpu, 0., 50., 0., 50., 0., 50., w_gpu, self.histogram_out,
                         self.histogram_out_sum, self.histogram_out_mean, self.bins)

    def get_histograms(self):
        # copies the histograms on the gpu back to the cpu so they can be used. 
        cuda.synchronize()
        cpu_histogram = self.histogram_out.copy_to_host()
        cpu_histogram_sum = self.histogram_out_sum.copy_to_host()
        cpu_histogram_mean = self.histogram_out_mean.copy_to_host()
        return np.reshape(cpu_histogram, (self.bins, self.bins, self.bins)), \
           np.reshape(cpu_histogram_sum, (self.bins, self.bins, self.bins)), \
           np.reshape(cpu_histogram_mean, (self.bins, self.bins, self.bins))

    def reset_histogram(self):
        # set all bins in the histograms back to 0. 
        set_histogram_to_zeros(self.histogram_out, self.histogram_out_sum, self.histogram_out_mean)
        cuda.synchronize()

When using this in our application, the binning seems to stall at random moments for arbitrary amounts of time while the kernel is running on the GPU.

The following test calls these functions in an endless loop to try to reproduce the stalling behaviour outside the application. This example runs indefinitely without issue, which suggests to me that there is not necessarily something wrong with the code itself.

def testcycles():

    bins = 100
    hh = histoholder(bins=bins)
    reset_counter = 0
    max_time = 0
    while(True):
        xvals = np.random.rand(70000) * 100.
        yvals = np.random.rand(70000) * 100.
        zvals = np.random.rand(70000) * 100.
        wvals = np.random.rand(70000) * 256.
        reset_counter += 1
        tmptime = time.time()
        hh.numba_gpu_histogram(xvals, yvals, zvals, wvals)
        _, _, _ = hh.get_histograms()

        if reset_counter > 5000:
            reset_counter = 0
            hh.reset_histogram()
        time_taken = time.time() - tmptime
        if time_taken > max_time:
            max_time = time_taken
        print('%.4f, %.5f' % (time_taken, max_time))

This test runs perfectly fine. In the application I am using it in, however, the GPU kernel seems to stall for an unspecified amount of time (varying from 2 seconds to 43 seconds, or so long that I had to kill the application).

I already tried a clean reinstall of everything.

It will run for 10 minutes or half an hour and then hang all of a sudden (the binning frequency is about 25 Hz with approximately 70,000 data points per frame). During the hang, nvidia-smi shows that the "Volatile GPU-Util" is constantly at 100%, and the CPU core running the process also goes to 100%. The temperature of the GPU is around 50 degrees Celsius, which makes me doubt it is a temperature problem (NVIDIA states that up to 85 is fine). The used memory on the card also does not exceed 100 MB, while the card has 8 GB available.

I have tried different versions of CUDA and different versions of the NVIDIA driver, but all to no avail.

No error or exception is raised, however. The GPU I use for the calculations is the second one in the PC. On Ubuntu there is no timeout for kernel calls on the GPU that I am aware of, as long as the GPU is not connected to a screen.
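In case it is relevant: one way to pin the computation to that second device (a sketch; the device index 1 is just for illustration, and it can also be controlled through the CUDA_VISIBLE_DEVICES environment variable) is:

from numba import cuda

# Select the second CUDA device before any kernels are compiled or launched.
# (Index 1 is an assumption here; the numbering depends on the system.)
cuda.select_device(1)
print(cuda.get_current_device().name)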

So my question is: how can I debug this issue?
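One thing I am considering is timing each GPU step separately with explicit synchronization, to see whether the stall happens in the host-to-device copies, the kernel launch, or the copy back to the host. A sketch of what I have in mind (timed is a hypothetical helper wrapped around the existing methods):

import time
from numba import cuda

def timed(label, fn, *args):
    # Run one step, force completion with cuda.synchronize(), and report how
    # long it took, to narrow down which GPU operation is the one that stalls.
    t0 = time.time()
    out = fn(*args)
    cuda.synchronize()
    print('%s took %.4f s' % (label, time.time() - t0))
    return out

# e.g. inside the loop of testcycles():
# timed('histogram', hh.numba_gpu_histogram, xvals, yvals, zvals, wvals)
# timed('copy back', hh.get_histograms)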

The GPU: NVIDIA Quadro M4000 (the other GPU in the machine is a GeForce GTX 960)

OS: Ubuntu 16.04

numba: 0.40.1

cuda: 9.2.148-1

NVIDIA driver: 410.78

Any tips on improving this Numba code are of course also welcome!
