I am confused about what my options are in the following case:
We have a set of data. First, we run a kernel to calculate some statistics of this data. Second, we use these statistics to calculate a small table (8 shorts in size). Third, we use this table to transform every data point in the set.
My problem is with how to use the table from step two in the third step. The table does not change in step three!
My first approach was: the table is already on the device, so I can pass the kernel a pointer to it. But then the kernel has to load the needed table entry from global memory whenever it is used in the transform.
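A minimal sketch of this pointer-based variant, for reference. The transform itself is not given above, so the one here (the low 3 bits of a point select the table entry that is added to it) is purely an assumption, as are the names `transformGlobal`, `runGlobal` and `kTableSize`. Marking the pointer `const ... __restrict__` is a hint that may let the compiler serve the table reads from the read-only data cache on hardware that has one.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTableSize = 8;  // "8 shorts" from the post

// Hypothetical transform: the low 3 bits of a point pick the table entry added to it.
__global__ void transformGlobal(short* data, int n, const short* __restrict__ table)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        short v = data[i];
        // The table entry is read through the global-memory pointer on every use.
        data[i] = static_cast<short>(v + table[v & (kTableSize - 1)]);
    }
}

// Hypothetical host helper. Uploading `table` here only stands in for step two,
// which in the real pipeline would leave the table in device memory already.
std::vector<short> runGlobal(std::vector<short> points, const short* table)
{
    const int n = static_cast<int>(points.size());
    if (n == 0) return points;

    short* dData = nullptr;
    short* dTable = nullptr;
    cudaMalloc(&dData, n * sizeof(short));
    cudaMalloc(&dTable, kTableSize * sizeof(short));
    cudaMemcpy(dData, points.data(), n * sizeof(short), cudaMemcpyHostToDevice);
    cudaMemcpy(dTable, table, kTableSize * sizeof(short), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    transformGlobal<<<blocks, threads>>>(dData, n, dTable);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(points.data(), dData, n * sizeof(short), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dTable);
    return points;
}
```

Since the whole table is only 16 bytes, repeated reads are likely to hit in cache after the first access, so the cost of "loading it again and again" may be smaller than it looks; profiling would have to confirm that.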
Therefore I thought: it's wasteful to load the table again and again, so load it once and reuse it. But now every thread seems to have its own copy of the table.
Next try: share the table between the threads of a block. That was worse than approach two.
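A minimal sketch of what a shared-memory version might look like, under the same assumed transform (low 3 bits of a point select the table entry that is added to it). The names `transformShared`, `runShared` and `kTableSize` are hypothetical. Each block pays for one table load plus a `__syncthreads()`, so this sketch uses a grid-stride loop to spread that fixed cost over many points per block; whether that helps in practice is something only measurement can tell.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTableSize = 8;  // "8 shorts" from the post

// Assumes blockDim.x >= kTableSize so the first 8 threads can fill the shared copy.
__global__ void transformShared(short* data, int n, const short* table)
{
    __shared__ short sTable[kTableSize];
    if (threadIdx.x < kTableSize) {
        sTable[threadIdx.x] = table[threadIdx.x];  // one global read per entry per block
    }
    __syncthreads();  // every thread in the block must see the filled table

    // Grid-stride loop: each block handles many points after loading the table once.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
        short v = data[i];
        // Hypothetical transform, not taken from the post.
        data[i] = static_cast<short>(v + sTable[v & (kTableSize - 1)]);
    }
}

// Hypothetical host helper. Uploading `table` only stands in for step two.
std::vector<short> runShared(std::vector<short> points, const short* table)
{
    const int n = static_cast<int>(points.size());
    if (n == 0) return points;

    short* dData = nullptr;
    short* dTable = nullptr;
    cudaMalloc(&dData, n * sizeof(short));
    cudaMalloc(&dTable, kTableSize * sizeof(short));
    cudaMemcpy(dData, points.data(), n * sizeof(short), cudaMemcpyHostToDevice);
    cudaMemcpy(dTable, table, kTableSize * sizeof(short), cudaMemcpyHostToDevice);

    const int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (blocks > 64) blocks = 64;  // arbitrary cap so each block loops over several points
    transformShared<<<blocks, threads>>>(dData, n, dTable);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(points.data(), dData, n * sizeof(short), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dTable);
    return points;
}
```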
Next idea would be: use constant memory. But as far as I know, this would mean: copy the table to host, copy the table to symbol. I am not familiar with constant memory, and the symbol part looks like more trouble.
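For what it is worth, `cudaMemcpyToSymbol` accepts `cudaMemcpyDeviceToDevice` as its copy kind, so the round trip through the host should not be necessary. A minimal sketch, assuming the table from step two already sits in a device buffer; the names `cTable`, `transformConstant`, `runConstant` and `kTableSize` are hypothetical, and so is the transform (low 3 bits of a point select the table entry that is added to it).

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTableSize = 8;  // "8 shorts" from the post

// The "symbol": a file-scope array placed in constant memory.
__constant__ short cTable[kTableSize];

__global__ void transformConstant(short* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        short v = data[i];
        // Hypothetical transform, not taken from the post.
        data[i] = static_cast<short>(v + cTable[v & (kTableSize - 1)]);
    }
}

// Hypothetical host helper. Uploading `table` to dTable only stands in for step two,
// which in the real pipeline would leave the table in device memory already.
std::vector<short> runConstant(std::vector<short> points, const short* table)
{
    const int n = static_cast<int>(points.size());
    if (n == 0) return points;

    short* dData = nullptr;
    short* dTable = nullptr;
    cudaMalloc(&dData, n * sizeof(short));
    cudaMalloc(&dTable, kTableSize * sizeof(short));
    cudaMemcpy(dData, points.data(), n * sizeof(short), cudaMemcpyHostToDevice);
    cudaMemcpy(dTable, table, kTableSize * sizeof(short), cudaMemcpyHostToDevice);

    // Device buffer -> constant symbol, without staging the table on the host.
    cudaMemcpyToSymbol(cTable, dTable, kTableSize * sizeof(short), 0,
                       cudaMemcpyDeviceToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    transformConstant<<<blocks, threads>>>(dData, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(points.data(), dData, n * sizeof(short), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dTable);
    return points;
}
```

One caveat: constant memory is at its best when all threads of a warp read the same address. If the table index depends on the data, as assumed here, reads to different entries within a warp are serialized, so this variant may or may not beat the others.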
Is there any other way to give my threads access to this constant table that may lead to better performance?