Replicating a for-loop with barriers and an update of an out-of-loop variable produces invalid code

Bug #927573 reported by Pekka Jääskeläinen
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
pocl
Fix Released
Critical
Unassigned

Bug Description

__attribute__((reqd_work_group_size(2, 1, 1)))
kernel
void test_kernel(__global float *input,
                 __global int *result) {
  int gid = get_global_id(0);
  float global_sum = 0.0f;
  int i;

  for (i=0; i < 512; ++i) {
    float value = input[gid+i];
    float product = value * i;
    global_sum += product;
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  global_sum *= 123;
  result[gid] = global_sum;
}

This test case updates a function scope private variable in a for-loop. When replicating, the other than the first work item do not update a replicated global_sum but add always to a constant 0.0 which is the initializer in the global_sum.

pocl-standalone -h header.h -t tce -o pocl-output.bc test_case.cl

(I use TCE target only because reqd_work_group_size doesn't work for other target yet due to LLVM support missing)

The replicated content of the for-loop is as follows:

for.body.phibarrier.latchbarrier.prebarrier.postbarrier.wi_0_0_0: ; preds = %for.body.phibarrier.latchbarrier.prebarrier
  store i32 0, i32* @_local_id_x
  store i32 0, i32* @_local_id_y
  store i32 0, i32* @_local_id_z
  %add = add nsw i32 %i.06, %add1.i
  %arrayidx = getelementptr inbounds float addrspace(3)* %input, i32 %add
  %8 = load float addrspace(3)* %arrayidx, align 4, !tbaa !2
  %conv = sitofp i32 %i.06 to float
  %mul = fmul float %8, %conv
  %add1 = fadd float %global_sum.05, %mul
                                       ----------^ the global_sum for the first WI is updated correctly.

  store i32 1, i32* @_local_id_x
  store i32 0, i32* @_local_id_y
  store i32 0, i32* @_local_id_z
  %add.wi_1_0_0 = add nsw i32 0, %add1.i.wi_1_0_0
  %arrayidx.wi_1_0_0 = getelementptr inbounds float addrspace(3)* %input, i32 %add.wi_1_0_0
  %9 = load float addrspace(3)* %arrayidx.wi_1_0_0, align 4, !tbaa !2
  %conv.wi_1_0_0 = sitofp i32 0 to float
  %mul.wi_1_0_0 = fmul float %9, %conv.wi_1_0_0
  %add1.wi_1_0_0 = fadd float 0.000000e+00, %mul.wi_1_0_0

                                       ----------^ the 2nd WI gets the 0.0 initialization value constant propagated here thus does not update the global_sum

  br label %for.body.phibarrier.latchbarrier

After running -O3 the code for the 2nd WI gets optimized as loop invariant code, it's moved outside the loop.

Related branches

Revision history for this message
Pekka Jääskeläinen (pekka-jaaskelainen) wrote :

It doesn't help to force the initialization of global_sum to be outside the loop (to avoid which I thought was "constant propagation"). Still, the 2nd WI refers directly to the global_sum outside the loop, not a phi-defined variable (thus doesn't accumulate the sum but always overwrite it on every iteration). Seems the phi node for the 'global_sum' is not replicated correctly per WI.

Revision history for this message
Pekka Jääskeläinen (pekka-jaaskelainen) wrote :

Notes for the record: it seems the preheader block is not replicated at all. The block contains the phi statements which must be replicated. Now they are not so only wi0 gets the correct (phi) value. Whether the induction variable(s) could/should be replicated is to be seen. It might be possible to replicate them and let the DCE to remove the extra uses (only wi0 uses it for conditional execution due to the WI-barrier execution semantics) or some other pass merge the value as it's identical for all WIs. Let's see...

Changed in pocl:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.