Comment 63 for bug 43154

In , Shankargiri (shankargiri) wrote :

(In reply to comment #2)
> The situation appears to be more complicated than I thought initially. I did
> additional debugging and now think that all games are suffering from the same
> bug in a driver - the symptoms are quite similar. However, there are many ways
> to "activate" this bug, so every game (or GL app in general) has its own
> workaround. For instance, in Trigger you should avoid GL_LINEAR_MIPMAP_LINEAR,
> in Torcs you should disable GL_ALPHA_TEST when rendering multitextures (it is
> always connected with textures somehow) and so on. Usually, the program hangs
> between return statement and the next line of code, i.e. in the sample code below:
>
> int some_func() {
> ...
> printf("BEFORE\n");
> return 1;
> }
> ...
> while (some_func()) {
> printf("AFTER\n");
> }
>
> you will see "BEFORE" line but not "AFTER".
>
> So, I've prepared a very simple demo program (see attachment) which hangs
> my computer. Hope it will help in debugging the driver. More details are in
> the attachment comments.

I looked into this a bit more and infer the following. I realise that you approached the bug in terms of high-level errors in the Mesa DRI code. I, however, went lower than that to get closer to the exact root cause. My observations may differ from yours, but here goes.
1. Debugging this with gdb showed a hard lock in the glFlush() portion of the GLX code, which in turn calls __mesa_Flush() in unichrome_dri.so. The lock occurs at different points in the code, so I concluded that some asynchronous, event-driven code is causing it.
2. I then went into the DRM portion of the code (libdrm), which uses ioctls to run kernel-level code from user space.
3. Adding printk calls to the DRM code finally isolated the problem. There is a function in via_irq.c called via_driver_vblank_wait(), which is probably serviced when the VIA_IRQ_VBLANK_PENDING interrupt bit is set. It calls viadrv_acknowledge_irqs().
4. This reads VIA_INTERRUPTS_REG using the VIA_READ macro (a readl, i.e. an MMIO read over PCI), ORs the value with the VIA_IRQ_VBLANK_PENDING bit, and then writes the result back to VIA_INTERRUPTS_REG using VIA_WRITE. QUESTION: If this is interrupt driven, the bit should already be set. Why is it being set during acknowledge?
5. Looking at the sequence of printk output, I see that VIA_READ and VIA_WRITE happen several times and that at one point VIA_READ simply locks up.
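
The read/OR/write sequence from step 4 can be sketched in user space as below. This is only a simulation under stated assumptions: struct sim_dev, sim_read(), sim_write() and the SIM_* bit positions are hypothetical stand-ins, not the definitions from the VIA DRM headers. It also assumes write-1-to-clear behaviour for the pending bit, which is common for interrupt status registers but has not been confirmed for this chipset. If it does hold, it would be one possible reason for ORing the bit in during acknowledge.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only (not from the VIA headers). */
#define SIM_IRQ_VBLANK_PENDING (1u << 3)
#define SIM_IRQ_VBLANK_ENABLE  (1u << 19)
#define SIM_PENDING_MASK       SIM_IRQ_VBLANK_PENDING

/* Simulated device with a single interrupt register. */
struct sim_dev {
    uint32_t irq_status; /* stands in for VIA_INTERRUPTS_REG */
    unsigned reads;      /* number of register reads so far */
    unsigned writes;     /* number of register writes so far */
};

/* Stand-in for VIA_READ on the interrupt register. */
static uint32_t sim_read(struct sim_dev *d)
{
    d->reads++;
    return d->irq_status;
}

/*
 * Stand-in for VIA_WRITE on the interrupt register.
 * ASSUMPTION: pending bits are write-1-to-clear, so writing 1 to a pending
 * bit clears it. All other bits are stored exactly as written.
 */
static void sim_write(struct sim_dev *d, uint32_t value)
{
    uint32_t pending = d->irq_status & SIM_PENDING_MASK;

    d->writes++;
    pending &= ~(value & SIM_PENDING_MASK);
    d->irq_status = (value & ~SIM_PENDING_MASK) | pending;
}

/* The acknowledge sequence from step 4: read, OR in the pending bit, write back. */
static void sim_acknowledge_vblank(struct sim_dev *d)
{
    uint32_t status;

    assert(d != NULL);
    status = sim_read(d);
    sim_write(d, status | SIM_IRQ_VBLANK_PENDING);
}
```

Under that assumption, calling sim_acknowledge_vblank() on a device that has the pending bit set leaves the pending bit clear and the enable bit untouched, using exactly one read and one write.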

Observations:

1. Since the lock-up happens in an MMIO access (PCI posting), it probably points to a bus arbitration problem (the memory space must be mapped through agpgart). So is the bug in agpgart? Or is there something in the hardware that forbids continuous back-to-back reads and writes of HW registers over PCI, so that gaps or delays are needed between the READs and WRITEs?
2. Since the HW is MMIO, I would expect PCI posting (reading and writing together), although non-blocking, to be handled properly by the bus arbitration queue. It would be a great help if we had the manufacturer's specs. This is all the weirder because it happens only on a few VIA chipsets (Unichrome Pro B).
3. I think it must be related to HW timing differences between the chipsets. Matters are not helped by the fact that the bug seems to lie in kernel space, where debugging is a lot more difficult. Debugging with Linice seems like a good way to save time, but I don't think it is stable enough for the latest 2.6.x kernels.
4. Finally, just adding arbitrary udelay calls does not seem to solve the problem. They only slow the system down further, and VIA_READ still hangs. If it is a timing issue, then there is more to it than a simple delay between reading and writing HW registers.
5. I would very much like someone to go further into this and, if possible, get help from the DRI architects, as they may be the best people to deal with this problem, with or without HW specs for the chipset.
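
For anyone who wants to repeat the printk bracketing that located the hanging VIA_READ, the idea can be sketched in user space as below. Everything here is a hypothetical illustration: trace_log, traced_read(), traced_write() and trace_stuck() are made-up names, the register file is a plain array rather than real MMIO, and in an actual driver the begin/end markers would be printk lines (ideally on a serial console), since an in-memory log does not survive a hard lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum trace_op { TRACE_READ, TRACE_WRITE };

struct trace_entry {
    enum trace_op op;
    uint32_t reg; /* byte offset of the register */
    int done;     /* 0 until the access has returned */
};

#define TRACE_CAP 64

struct trace_log {
    struct trace_entry e[TRACE_CAP]; /* ring buffer of recent accesses */
    size_t count;                    /* total accesses begun so far */
};

/* Record that an access is about to start and return its slot. */
static struct trace_entry *trace_begin(struct trace_log *t, enum trace_op op,
                                       uint32_t reg)
{
    struct trace_entry *slot = &t->e[t->count % TRACE_CAP];

    slot->op = op;
    slot->reg = reg;
    slot->done = 0;
    t->count++;
    return slot;
}

/* Record that the access returned. */
static void trace_end(struct trace_entry *slot)
{
    assert(slot != NULL);
    slot->done = 1;
}

/* Return the most recent access if it never completed, otherwise NULL. */
static const struct trace_entry *trace_stuck(const struct trace_log *t)
{
    const struct trace_entry *last;

    if (t->count == 0)
        return NULL;
    last = &t->e[(t->count - 1) % TRACE_CAP];
    return last->done ? NULL : last;
}

/* Bracketed 32-bit read; regs stands in for the mapped register window. */
static uint32_t traced_read(struct trace_log *t, const volatile uint32_t *regs,
                            uint32_t reg)
{
    struct trace_entry *slot = trace_begin(t, TRACE_READ, reg);
    uint32_t value = regs[reg / 4];

    trace_end(slot);
    return value;
}

/* Bracketed 32-bit write. */
static void traced_write(struct trace_log *t, volatile uint32_t *regs,
                         uint32_t reg, uint32_t value)
{
    struct trace_entry *slot = trace_begin(t, TRACE_WRITE, reg);

    regs[reg / 4] = value;
    trace_end(slot);
}
```

If an access begins but never ends, trace_stuck() reports its operation and register offset, which is the same information the last printk before the lock gives.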
I would welcome comments or corrections on what I have written. Your code flow may be entirely different from mine; please let me know if so.

Hope this helps.