get_local_id etc. cannot be called from functions

Bug #905846 reported by Erik Schnetter
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
pocl
Fix Released
High
Carlos Sánchez de La Lama

Bug Description

get_local_id(), get_group_id(), and some similar routines cannot be called from functions that are called from kernel functions. The example below illustrates this, using the "noinline" attribute to avoid having to use a much larger function that llvm would not inline.

This OpenCL code does not run; it aborts with the error message

Undefined symbols for architecture x86_64:
  "__local_id_x", referenced from:
      _dot_product1 in parallel.so.o
ld: symbol(s) not found for inferred architecture x86_64
Assertion failed: (error == 0), function pocl_pthread_run, file devices/pthread/pthread.c, line 406.

void
dot_product1 (__global const float4 *a,
       __global const float4 *b, __global float *c)
  __attribute__((__noinline__));
void
dot_product1 (__global const float4 *a,
       __global const float4 *b, __global float *c)
{
  int gid = get_local_id(0);

  c[gid] = dot(a[gid], b[gid]);
}

__kernel void
dot_product (__global const float4 *a,
      __global const float4 *b, __global float *c)
{
  dot_product1 (a, b, c);
}

Related branches

Revision history for this message
Carlos Sánchez de La Lama (csanchezdll) wrote :

I removed the forced inlining from the passes, as in some cases it might not be the optimal to inline everything (for example to make better use of instruction caches) and it is not a strict requirement for the replication algorithm. But then this might happen. The "privatization" step on the passes has to be fixed to either 1) inline always any function leading to a get_xxx or 2) do not privatize the related global variables if there is a get_xxx called from a subfunction. I think I will do 1) as the other option is thread-unsafe.

Changed in pocl:
assignee: nobody → Carlos Sánchez de La Lama (csanchezdll)
importance: Undecided → High
status: New → In Progress
Revision history for this message
Erik Schnetter (schnetter) wrote :

One of the issues I am fighting in my application (solving the Einstein equations) is the limited instruction cache size. We have to split our kernel into to or more (incurring some code duplication), and we are also using some functions explicitly marked "noinline" to reduce code size.

We currently do this in C++, and I want to port this code to OpenCL. Unconditional inlining of all functions would not be good for this application. Would it be possible to skip functions that don't call a get_*() function, or to skip inlining functions marked "noinline"?

Instead of privatizing the code for each thread, is it possible to privatize these variables on which the get_*() functions are based? With hyperthreading or modern AMD processors, it can be beneficial to have several threads executing the same code, even if some expressions cannot be evaluated at build time.

Revision history for this message
Carlos Sánchez de La Lama (csanchezdll) wrote :

I just committed a new pass that adds "always inline" attribute (and removes "no inline" if present) of all functions leading to a use of the global workgroup variables. Can you check if your big function gets inlined now?

Changed in pocl:
status: In Progress → Fix Committed
Revision history for this message
Erik Schnetter (schnetter) wrote :

I still receive this error:

Undefined symbols for architecture x86_64:
  "__group_id_z", referenced from:
      _do_step in parallel.so.o
  "__local_id_z", referenced from:
      _do_step in parallel.so.o
  "__group_id_y", referenced from:
      _do_step in parallel.so.o
  "__local_id_y", referenced from:
      _do_step in parallel.so.o
  "__group_id_x", referenced from:
      _do_step in parallel.so.o
  "__local_id_x", referenced from:
      _do_step in parallel.so.o
ld: symbol(s) not found for inferred architecture x86_64
Assertion failed: (error == 0), function pocl_pthread_run, file devices/pthread/pthread.c, line 406.
Rank 0 with PID 30783 got signal6

When I look at the generated assembler code (parallel.s), I see that the function is not inlined.

Revision history for this message
Carlos Sánchez de La Lama (csanchezdll) wrote :

My bad, it was a mistake in the pass. Now it is fixed & checked.

Changed in pocl:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.