snd-hda-intel: distorted sound after resume, until the module is reloaded

Bug #100114 reported by wvengen on 2007-04-02
16
Affects Status Importance Assigned to Milestone
linux-source-2.6.20 (Ubuntu)
Undecided
Unassigned
linux-source-2.6.22 (Ubuntu)
Undecided
Unassigned

Bug Description

Binary package hint: linux-source-2.6.20

The ICH7 sound card on this Sony VAIO VGN-FE41E notebook works fine, until it is suspended to RAM and resumed again. When a sound is played, it seems to take about half a second of sound and repeats it lots of times. Then it takes the next half a second and does it again, etc. until the sound is finished.
Unloading/reloading the snd-hda-intel module makes the soundcard work again correctly.

Inspired by bug #55071, I looked at the contents of /proc/asound/card0/codec#0 and found these differences between its contents when working correctly, and after a module reload. There was no difference between a fresh boot and a module reload after suspend:
  --- card0codec0pin_freshboot 2007-04-02 11:47:10.000000000 +0200
  +++ card0codec0pin_resume 2007-04-02 11:45:32.000000000 +0200
  @@ -40,7 +40,7 @@
        0x09
   Node 0x09 [Audio Selector] wcaps 0x300903: Stereo Amp-In
     Amp-In caps: N/A
  - Amp-In vals: [0x0f 0x0f]
  + Amp-In vals: [0x80 0x80]
     Connection: 1
        0x15
   Node 0x0a [Pin Complex] wcaps 0x400181: Stereo

This happens

Andrew Ash (ash211) on 2007-04-14
Changed in linux-source-2.6.20:
assignee: nobody → ubuntu-kernel-team
status: Unconfirmed → Confirmed
TJ (tj) wrote :

I'm hearing the same problem on a Sony Vaio VGN-FE41Z.

I read where you say "Unloading/reloading the snd-hda-intel module makes the soundcard work again correctly". I've been unable to do that because the module is always in-use although I don't know by what:

$ sudo modprobe -r snd-hda-intel
FATAL: Module snd_hda_intel is in use.

wvengen (wvengen) wrote :

What about the volume control? I removed the volume control applet from my panel at the top of the screen; this is no problem since there are special keys for the volume anyway. Running an audio application, or even Firefox with a flash object may also prevent the module from unloading; this is something I haven't solved.

TJ (tj) wrote :

Re: unloading the module. It was in-use even after I stopped GDM and was working from a console.

I've been debugging this issue and have made a little progress, albeit not the kind we want.

The difference in the outputs reported here and in bug #55071 pre- and post-suspend of codex#0 is not of consequence. It represents the stereo-input device amplifier state, where 0x80 is mute. If after resume you use the volume control to alter the Recording->Capture state or level you'll see the change reflected in the Amp-In vals of Node 9.

You can obtain the same value during normal operations by toggling 'audio recording from capture' and reducing the level to zero; Node 9 will show val 0x80.

The user experience of this issue sounds to me like a problem I dealt with back in the 1980's and it is to do with handling of DMA buffers. To my ear at least, our current issue sounds like the DMA buffer isn't being updated when the device exhausts its contents, and the DMA transfer to the device is looping several times from the same buffer. What should happen is the device should raise an interrupt which hda_intel.c::azx_interrupt() should handle by setting up a new DMA transfer from another buffer that is ready to go.

Some observations:

The patch in use on these Sony Vaio FE4x series is Sigmatel.

In hda_codec.c::snd_hda_suspend() the code calls a patch-specific suspend function:

  if (codec->patch_ops.suspend)
   codec->patch_ops.suspend(codec, state);

and during resume hda_codec.c::snd_hda_resume() calls the reciprical resume function:

  if (codec->patch_ops.resume)
   codec->patch_ops.resume(codec);

However (this may or may-not be significant) the Sigmatel patch only defines a resume function (checking the other patches they all seem to omit a suspend function). In patch_sigmatel.c:1854

#ifdef CONFIG_PM
 .resume = stac92xx_resume,
#endif

There is a note in hda_codec.c::snd_hda_suspend() which may or may not be a clue:

/* FIXME: should handle power widget capabilities */

and another at hda_intel.c:1056 in struct snd_pcm_hardware azx_pcm_hw.info

     /* No full-resume yet implemented */
     /* SNDRV_PCM_INFO_RESUME |*/

Which, according to include/sound/asound.h:252:

#define SNDRV_PCM_INFO_RESUME 0x00040000 /* hardware supports stream resume after suspend */

When I find some time I am going to build a debug-kernel with hda_intel.c outputting debug info to test if hda_codec.c::snd_hda_resume() is missing something or possibly snd_hda_suspend() is neglecting to fully save the state.

TJ (tj) wrote :

I decided to track the interrupt/DMA enabling/disabling. I'll put my notes here in case others want to investigate.

I analysed the sequence of operations in hda_intel.c::azx_suspend(), azx_resume() and the device initialisation functions.

I noticed that azx_suspend() calls snd_hda_suspend() then azx_free_cmd_io() which does:

 /* disable ringbuffer DMAs */
 azx_writeb(chip, RIRBCTL, 0);
 azx_writeb(chip, CORBCTL, 0);
}

I then looked at the corresponding resume operations. Immediately prior to azx_resume() calling snd_hda_resume() it calls
 azx_init_chip(). It in turn does:

 /* initialize the codec command I/O */
 if (!chip->single_cmd)
  azx_init_cmd_io(chip);

Which does:

 /* enable corb dma */
 azx_writeb(chip, CORBCTL, ICH6_RBCTL_DMA_EN);

 /* enable rirb dma and response irq */
 azx_writeb(chip, RIRBCTL, ICH6_RBCTL_DMA_EN | ICH6_RBCTL_IRQ_EN);

I then looked at the Sigmatel patch. On the Sony Vaio VGN-FE41Z the chipset reports as:

  Codec: SigmaTel STAC9872AK
  Address: 0
  Vendor Id: 0x83847662
  Subsystem Id: 0x104d0c00
  Revision Id: 0x100201

which corresponds with patch_sigmatel.c struct hda_codec_preset snd_hda_preset_sigmatel[]:

  { .id = 0x83847662, .name = "STAC9872AK", .patch = patch_stac9872 },

In patch_stac9872() the following code is executed:

 switch (board_config) {
 case CXD9872RD_VAIO:
 case STAC9872AK_VAIO: // *** this is the matching config ***
 case STAC9872K_VAIO:

followed by:

 codec->patch_ops = stac9872_patch_ops;

stac9872_patch_ops is defined as:

static struct hda_codec_ops stac9872_patch_ops = {
 .build_controls = stac92xx_build_controls,
 .build_pcms = stac92xx_build_pcms,
 .init = stac92xx_init,
 .free = stac92xx_free,
#ifdef CONFIG_PM
 .resume = stac92xx_resume,
#endif
};

stac9872_patch_ops.resume points to stac92xx_resume().

stac92xx_resume() calls the same function as is used on device init, stac92xx_init() so there doesn't appear to be any difference between cold-start and resume.

wvengen (wvengen) wrote :

Great to see some work on this! I'm not too familiar with DMA, but I'll try to help a bit. I started to trace the DMA transfer process by verifying that the interrupts are correctly setup after resume. It seems this is working: azx_interrupt() is called a number of times proportional to the length of playback. The attached file explains this in more detail.

By the way, the length of playback after resume is not constant for me; a sound of less than a second may take a couple of seconds or may seem to have no end.

TJ (tj) wrote :

Oh cool! Someone else that knows how to hack :)

I've been reading the Intel HDA Programmers Reference Manual (http://www.intel.com/design/chipsets/hdaudio.htm) and from what I've learned so far the problem appears to be with the handling of the output buffers. Here's the description of what should be happening:

ICH6 HDA/AC'97 Programmers Reference Manual

1.2.6.3 Resume Context Recovery
        When the system is placed in a power-down state (S3 or greater) power is removed from the ICH6
        AC '97 controller. In this state, DMA registers lose information regarding current position and
        pointer values. The device driver must be able to save the context of all DMA engines before
        acknowledging a D3 state. Device drivers must assume that D3 state request precedes a total
        system power lost therefore context for DMA engines. Upon system power resume, the device
        driver must restore the DMA Engine register accordingly to saved values prior the suspend event.

4.6.6 Stream Steady State Operation
      Once the stream has been started, the hardware will continue fetching data from the Cyclic Buffer
      described by the Buffer Descriptor List. For an output stream, software is responsible for making
      sure that valid data is present in the buffers before it is fetched by the hardware. For an input
      stream, data must be removed from the buffers before being overwritten by hardware.

     *** When the hardware has reached the end of the Cyclic Buffer, it will automatically wrap back around to the
      beginning of the buffer and continue to process the stream data until the stream is stopped by the
      software by clearing the RUN bit. ***

      Software can either use interrupts at the end of selected buffers by setting the IOC bit in the BDL
      entry or can poll the stream position to determine when to process the stream data.

So it seems to me we should focus our attention on how the BDL and memory allocation is handled, and specifically what is setup prior to a suspend (even when no output is active) that is subsequently used when outputting sound. I suspect what we will discover is that during suspend/resume one or more structures and/or memory relating to Buffer Descriptor Lists or page memory buffers is not being restored, or else the IOC bit isn't being restored or acted upon. Note the text I've highlighted above with *** ... *** which would on the face of it describe the symptoms we are hearing (is the ISR countdow.

We should try to divide our labours: can I suggest that you continue hacking at the kernel, focusing in on the buffer code whilst I continue reading up on HDA and analysing the source-code? For various reasons I'm not keen to be disturbing the kernel right now because of other things I'm working on.

wvengen (wvengen) wrote :

Allright, I can see the DMA pointer not being updated. az_interrupt() calls snd_pcm_period_elapsed(), where the position pointer is read using snd_pcm_update_hw_ptr_pos(), which gets the current position from the driver: azx_pcm_pointer() in our case. Printing pos shows that it its value remains the same after a resume, but it is updated fine after a module reload.
I'm getting to the borders of my DMA knowledge, where is posbuf actually updated, because that's the source of the pointer?

TJ (tj) wrote :

Funny how you can arrive at the same point from different directions. I noticed in your tracing that int_cnt had far more interrupts (for the same file) just after resume than when the module had been freshly loaded (389 vs 11) and had traced through to the posbuf from struct azx_dev:

hda_intel.c:257
 struct azx_dev {
  u32 *posbuf; /* position buffer pointer */

hda_intel.c:334
 struct snd_dma_buffer posbuf;

/include/sound/memalloc.h:57

 struct snd_dma_buffer {
  struct snd_dma_device dev; /* device type */
  unsigned char *area; /* virtual pointer */
  dma_addr_t addr; /* physical address */
  size_t bytes; /* buffer size in bytes */
  void *private_data; /* private for allocator; don't touch */
 };

In azx_init_chip():

 /* program the position buffer */
 azx_writel(chip, DPLBASE, (u32)chip->posbuf.addr);
 azx_writel(chip, DPUBASE, upper_32bit(chip->posbuf.addr));

and in azx_setup_controller():

 /* enable the position buffer */
 if (! (azx_readl(chip, DPLBASE) & ICH6_DPLBASE_ENABLE))
  azx_writel(chip, DPLBASE, (u32)chip->posbuf.addr | ICH6_DPLBASE_ENABLE);

Notice how that code is conditional? I've been looking for conditional code because its the only place where something different could be done during resume compared to device start-up. I'm not suggesting this particular code is the culprit but this or something similar is probably the root cause of the issue.

Can you tell me precisely where in azx_pcm_pointer() you are reporting the value of pos, and also which code-path is being executed (there's an if...else in there) ?

Also, could you show a sample of the pos values for the good and bugged scenarios from a printk (kernel times could be useful) ?

TJ (tj) wrote :

hda_intel.c:1392:

 azx_init_stream() {
  ...
  azx_dev->posbuf = (u32 __iomem *)(chip->posbuf.area + i * 8);

TJ (tj) wrote :

azx_setup_controller() is called azx_pcm_prepare():

 snd_printdd("azx_pcm_prepare: bufsize=0x%x, fragsize=0x%x, format=0x%x\n",
      azx_dev->bufsize, azx_dev->fragsize, azx_dev->format_val);
 azx_setup_periods(azx_dev);
 azx_setup_controller(chip, azx_dev);

Now, we know from your earlier trace that azx_pcm_prepare() is being called 4 times and that the code above must be being executed since it comes directly after the snd_printdd() call.

So I think we're safe to focus on the code in and around azx_setup_controller() where it conditionally enables the position buffer.

TJ (tj) wrote :

Reading this extract from the HDA PRM in conjunction with azx_setup_controller() (note my ***...*** highlights) might suggest a problem with "...the hardware may still have pre-fetched descriptors that will not be flushed when the stream is restarted...":

4.6.6

After the RUN bit has been set, the buffer described by the BDL should not be changed by the
software. The hardware may pre-fetch and/or cache an arbitrary number of BDL Entries from the
list, so there is no way to ensure when or if any changes to the BDL list would be visible to the
hardware.
*** Even when the RUN bit has been cleared to pause the stream after it has been running,
the hardware may still have pre-fetched descriptors that will not be flushed when the stream is
restarted. Therefore, the software should only modify the BDL before the RUN bit has been set for
the first time after a Stream Reset. ***

 /* make sure the run bit is zero for SD */
 azx_sd_writeb(azx_dev, SD_CTL, azx_sd_readb(azx_dev, SD_CTL) & ~SD_CTL_DMA_START);
 /* reset stream */
 azx_sd_writeb(azx_dev, SD_CTL, azx_sd_readb(azx_dev, SD_CTL) | SD_CTL_STREAM_RESET);
 udelay(3);
 timeout = 300;
 while (!((val = azx_sd_readb(azx_dev, SD_CTL)) & SD_CTL_STREAM_RESET) &&
        --timeout)
  ;
 val &= ~SD_CTL_STREAM_RESET;
 azx_sd_writeb(azx_dev, SD_CTL, val);
 udelay(3);

 timeout = 300;
 /* waiting for hardware to report that the stream is out of reset */
 while (((val = azx_sd_readb(azx_dev, SD_CTL)) & SD_CTL_STREAM_RESET) &&
        --timeout);

wvengen (wvengen) wrote :

Wow, you've replied quite a lot already, here the answers to questions you posed.

posbuf was reported in azx_pcm_pointer(), just after the line
  pos = le32_to_cpu(*azx_dev->posbuf);
( http://lxr.linux.no/source/sound/pci/hda/hda_intel.c#L1215 )
I had another printk at read_lpib, but that one wasn't triggered at all. One of the higher functions does some alignment (snd_pcm_update_hw_ptr_pos() iirc), but that has resulted in no change in all my observations.
You can also see a printk in pcm_lib.c, that's at the end of snd_pcm_update_hw_ptr_pos() just before return.

I think azx_dev is an internal structure used to access the different streams. Hardware equivalent is accessible by memory mapped registers (ICH6 paragraph 18.2), beginning at address 80h and 20h bytes long. The macros azx_sd_* are used to access them. You probabely already found this out.

For posbuf values, please see attachment. Before suspend, one can see the values increasing. Output is usually the same (although I have seen a difference sometime I don't understand). After suspend, values are generally different for different suspend/resumes, but equal within one play. There is a point however when the value changes.

TJ (tj) wrote :

Thanks for that report, it was just what I needed. I was already tracking the various code responsible for updating posbuf and your trace has confirmed some of what I was expecting. There are several reasons why posbuf/pos won't be updated so we should try eliminating possibilities.

In snd_pcm_update_hw_ptr_pos() it is possible that a buffer over-run (XRUN) could prevent pos being updated:

  pos = substream->ops->pointer(substream);
  if (pos == SNDRV_PCM_POS_XRUN)
   return pos; /* XRUN */

Because if the function didn't return at that point it would update pos:

  pos -= pos % runtime->min_align;

Using CONFIG_SND_PCM_XRUN_DEBUG, in snd_pcm_update_hw_ptr_interrupt() we might see something useful from:

 #ifdef CONFIG_SND_PCM_XRUN_DEBUG
  if (runtime->periods > 1 && substream->pstr->xrun_debug) {
   snd_printd(KERN_ERR "Unexpected hw_pointer value [1] (stream = %i, delta: -%ld, max jitter = %ld): wrong interrupt acknowledge?\n", substream->stream, (long) delta, runtime->buffer_size / 2);
   if (substream->pstr->xrun_debug > 1)
    dump_stack();
 }
 #endif

assuming that the function didn't return before it got to that point:

 pos = snd_pcm_update_hw_ptr_pos(substream, runtime);
 if (pos == SNDRV_PCM_POS_XRUN) {
  xrun(substream);
  return -EPIPE;
 }

Alternatively, we need to watch snd_pcm_update_hw_ptr_interrupt() where control can branch:

 if (runtime->period_size == runtime->buffer_size)
  goto __next_buf;
 new_hw_ptr = runtime->hw_ptr_base + pos;
 hw_ptr_interrupt = runtime->hw_ptr_interrupt + runtime->period_size;

wvengen (wvengen) wrote :

I don't think the line
    pos -= pos % runtime->min_align;
updates the position buffer each time. You can see in posbuf-trace-01.txt that runtime->min_align=1, so pos remains the same. I can find no place where posbuf is updated, so I think the hardware does that. Before suspend, hardware and software agree about their buffers, after suspend they don't, which may be a result of your ***...*** marking in comment 12. But I still wonder why that would have any impact, since the stream is reset, so the hardware should be ready to accept new settings.

To verify this, I printed the lpib position (value of SD_LPIB register for the stream) in azx_pcm_pointer(), in addition to azx_dev->posbuf (see attachment). One can see that the lpib values are equal before and after suspend, while the computed posbuf is wrong. This probabely means that the hardware and software have a different idea about the dma buffer.

TJ (tj) wrote :

Yes, I agree. I found some time yesterday to build the alsa-core and snd-hda modules myself and have been watching the updating of the hw_ptr_interrupt. I've attached the log. Also, the line

pos -= pos % runtime->min_align;

is, if you read it carefully, not intended to update pos but simply to ensure pos is aligned on the correct byte/word/dword boundary. That is why in normal circumstances it doesn't change at that point.

I was comparing the resume operations to initial setup to see if I could spot any hardware actions (such as writel()) that aren't done during resume. So far, I've not found anything.

I also checked where azx_dev->posbuf is set, and it is only in one place and the way the code reads azx_dev->posbuf should always point to the base address so at the moment I've confused myself as to why pos returned from azx_pcm_pointer() advances.

If the DMA harware registers were responsible for updating it the declaration of the structure should be different, or we should find an azx_dev->posbuf = azx_readl(); type construct.

wvengen (wvengen) wrote :

A short overview of where posbuf is used in hda_intel.c:

In azx_init_chip():
  azx_writel(chip, DPLBASE, (u32)chip->posbuf.addr);
  azx_writel(chip, DPUBASE, upper_32bit(chip->posbuf.addr));

In azx_setup_controller():
  // enable the position buffer
  if (! (azx_readl(chip, DPLBASE) & ICH6_DPLBASE_ENABLE))
    azx_writel(chip, DPLBASE, (u32)chip->posbuf.addr | ICH6_DPLBASE_ENABLE);

In azx_init_stream():
  azx_dev->posbuf = (u32 __iomem *)(chip->posbuf.area + i * 8);

In azx_pcm_pointer():
  pos = le32_to_cpu(*azx_dev->posbuf);

In azx_create():
  snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV, snd_dma_pci_data(chip->pci),
                      chip->num_streams * 8, &chip->posbuf)) < 0)

So, chip->posbuf is a dma buffer that contains stream pointers (DMA Position Base Address), which are initialised in azx_init_stream(). Addresses must be 128-byte aligned. When the lowest bit is set, "the controller will write the DMA positions of each of the DMA engines to the buffer in the main memory periodically (typically once per frame). Software can use this value to know what data in memory is valid data." (18.2.31 DPLBASE) - it seems that the hardware does report back the position.

Note 1 mentioned in 18.2.6 GCTL struck me (about reset bit): "The CORB/RIRB RUN bits and all stream RUN bits must be verified cleared to 0 before writing a 0 to this bit in order to assure a clean re-start." I don't see this happening in/before azx_reset().

wvengen (wvengen) wrote :

Ok, I tracked DPLBASE and posbuf before and after suspend/resume and they look pretty similar. In both cases, the lowest bit of DPLBASE is set, so it should report back the position. Before:
  posbuf 0x340, lpibpos 0x294, posbuf=0x1b881020 DPLBASE=0x1b881001
  posbuf 0xfe0, lpibpos 0xf40, posbuf=0x1b881020 DPLBASE=0x1b881001
  posbuf 0x1fe0, lpibpos 0x1f40, posbuf=0x1b881020 DPLBASE=0x1b881001
and after:
  posbuf 0x340, lpibpos 0x294, posbuf=0x1b881020 DPLBASE=0x1b881001
  posbuf 0x340, lpibpos 0xf40, posbuf=0x1b881020 DPLBASE=0x1b881001
  posbuf 0x340, lpibpos 0x1f40, posbuf=0x1b881020 DPLBASE=0x1b881001
(another test showed that DPUBASE remains zero) So the issue remains: how is the re-initialisation different on resume?

wvengen (wvengen) wrote :

I hope I'm not spamming this bug with information; as the cause is somewhat unclear, all these things may or may not be important in the end. There's two kinds of config data: pci and memory mapped. PCI can be checked using sysfs,
About memory mapped registers. Find attached a dump of these before and after resume (and also a dump of the stream's BDL). Differences:
* pci 0x044 TCSEL traffic class select
* mem 0x030-0x033 wall clock
* mem 0x048-0x04a reserved
* mem 0x058-0x059 RIRB write pointer (constant during stream, differs before/after resume)
* mem 0x104-0x107 LPIB0 is different within stream, that was known already (differences before/after resume probably due to timing)
I shouldn't have missed anything, but I'm a bit tired at the moment.

And now for the really good news: try this after suspend ...
  # cd /sys/bus/pci/devices/0000:00:1b.0
  # dd if=config of=/tmp/foo
  # printf "\x0" | dd of=/tmp/foo bs=1 seek=68
  # dd if=/tmp/foo of=config
and play a sound file ...
Replace "\x0" with "\x7" to go back to the non-working state.

I don't know yet why TCSEL is 7 after resume. This note about TCSEL in the HDA document may have an impact: "These bits are not reset on D3HOT to D0 transition; however, they are reset by PLTRST#." Its default is zero however.

TJ (tj) wrote :

I'm afraid I'm seeing no difference here. I had previously checked TCSEL because of the comment in the head of azx_init_chip(), and I had also made sure that azx_resume() was called, and in-turn was calling azx_init_chip().

I also put a check of the value of TCSEL in azx_init_chip() to see what value it had upon resume, and another at the end of azx_resume() to ensure the reset had been done.

... start-up ...
Jul 11 22:46:03 bagoas kernel: [ 17.736000] hda-intel: Allocated posbuf size=64
Jul 11 22:46:03 bagoas kernel: [ 17.736000] hda-intel: azx_init_chip: TCSEL=7
... Resume ...
Jul 11 22:50:33 bagoas kernel: [ 291.960000] hda-intel: azx_init_chip: TCSEL=7
Jul 11 22:50:33 bagoas kernel: [ 292.024000] hda-intel: azx_resume: TCSEL=0

With TCSEL confirmed as 0 I am still hearing the same problem after resume.

00000060 05 70 80 00 00 00 00 00 00 00 00 00 00 00 00 00 |.p..............|

static void azx_init_chip(struct azx *chip)
{
 unsigned char reg;

 /* Clear bits 0-2 of PCI register TCSEL (at offset 0x44)
  * TCSEL == Traffic Class Select Register, which sets PCI express QOS
  * Ensuring these bits are 0 clears playback static on some HD Audio codecs
  */
 pci_read_config_byte (chip->pci, ICH6_PCIREG_TCSEL, &reg);
 pci_write_config_byte(chip->pci, ICH6_PCIREG_TCSEL, reg & 0xf8);
#ifdef DEBUG_RESUME
 snd_printk(KERN_DEBUG SFX "%s: TCSEL=%X\n", __FUNCTION__, reg);
#endif

static int azx_resume(struct pci_dev *pci)
{
 struct snd_card *card = pci_get_drvdata(pci);
 struct azx *chip = card->private_data;
#ifdef DEBUG_RESUME
 unsigned char reg;
#endif
...
#ifdef DEBUG_RESUME
 pci_read_config_byte (chip->pci, ICH6_PCIREG_TCSEL, &reg);
 snd_printk(KERN_DEBUG SFX "%s: TCSEL=%X\n", __FUNCTION__ , reg);
#endif

 return 0;
}

wvengen (wvengen) wrote :

Hmm I didn't clearly indicate the base of numbers, but I meant 68d=44h. And the value of TCSEL is correctly put back at resume, but somehow it changes between initialisation and start of playback; see attachment for tracing of TCSEL at various places in hda_intel.c. It is interesting how its value is reset to zero, but pops up as zeven later on:
   [ 3.585793] HDA Intel 0000:00:1b.0: resuming
   [ 3.601572] ALSA sound/pci/hda/hda_intel.c:412: TCSEL=0x07 in azx_resume()
   (...)
   [ 3.665477] ALSA sound/pci/hda/hda_intel.c:412: TCSEL=0x00 in azx_resume()
   (...)
   [ 39.396663] ALSA sound/pci/hda/hda_intel.c:412: TCSEL=0x07 in azx_pcm_open()

If you compare your pci registers before and after resume, do you see any difference?

TJ (tj) wrote :

Sorry - I pasted the wrong line from 'config'!

I see what you mean about TCSEL being reset to 07 when a sound is played, and when I do 'hexedt config' and change 0x44 to 0, the sound is corrected - nice one.

So the question is, is there another chip reset operation between azx_resume() and playback that causes TCSEL to set to 0x07, or is something else accidentally writing values to that location?

wvengen (wvengen) wrote :

azx_reset() is only called from azx_init_chip(), to which I added printk's; so I don't think the hda driver does another initialisation. I'm currently writing a kprobe on pci_bus_write_config_* to see if something else is writing. If that's not the case, it may even be the hardware; that's possible at least theoretically.
I'm relieved that we have the same problem, that keeps it less complicated :)

wvengen (wvengen) wrote :

kprobe results: nothing to see. The module (for source see attachment) just monitors calls to pci_bus_write_config_*() so pci configuration writes that bypass this function are not detected.
Hypothesis: a(nother) module or the kernel corrupts the pci configuration of this device. Can a pci configuration write go out of bounds and end up in another device's configuration space?

TJ (tj) wrote :

OK, it seems we have identified the reason for this problem.

Willem and I worked together on tracing the value of TCSEL after resume. The code that finally pinpointed the issue was a kernel timer started in azx_resume() and stopped in azx_pcm_open() with a jiffie interval of 10, which appears to equate to 0.04 seconds. The timer writes the value of TCSEL to kern.log, so we simply look through the log for when the value of TCSEL changed from 0 to 7, and what happened in between the two timer interrupts.

It appears the cause is the 2nd CPU being brought online in the SMP kernel. We are now working on *why* this is - is it a kernel problem or something specific to the Sony Vaio or the ICH7 chipset:

[ 1630.936000] hda-intel: timer: TCSEL=0
[ 1630.944000] Calibrating delay using timer specific routine.. 3990.08 BogoMIPS (lpj=7980178)
[ 1630.944000] CPU: After generic identify, caps: bfebfbff 20100000 00000000 00000000 0000e3bd 00000000 00000001
[ 1630.944000] monitor/mwait feature present.
[ 1630.944000] CPU: L1 I cache: 32K, L1 D cache: 32K
[ 1630.944000] CPU: L2 cache: 4096K
[ 1630.944000] CPU: Physical Processor ID: 0
[ 1630.944000] CPU: Processor Core ID: 1
[ 1630.944000] CPU: After all inits, caps: bfebfbff 20100000 00000000 00003940 0000e3bd 00000000 00000001
[ 1630.944000] CPU1: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz stepping 06
[ 1630.948000] CPU1 is up
[ 1630.952000] sdb : READ CAPACITY failed.
[ 1630.952000] sdb : status=0, message=00, host=7, driver=00
[ 1630.952000] sdb : sense not available.
[ 1630.964000] sdb: Write Protect is off
[ 1630.964000] sdb: Mode Sense: 00 00 00 00
[ 1630.964000] sdb: assuming drive cache: write through
[ 1630.976000] hda-intel: timer: TCSEL=7

wvengen (wvengen) wrote :

Here's another log that pinpoints to enabling CPU1 beyond doubt:
 [ 9.868250] ALSA sound/pci/hda/hda_intel.c:796: hda-intel: timer: TCSEL=0
 [ 9.908187] ALSA sound/pci/hda/hda_intel.c:796: hda-intel: timer: TCSEL=0
 [ 9.924171] usb 5-8: new high speed USB device using ehci_hcd and address 4
 [ 9.948127] ALSA sound/pci/hda/hda_intel.c:796: hda-intel: timer: TCSEL=0
 [ 9.966250] done.
 [ 9.966292] Enabling non-boot CPUs ...
 [ 9.988065] ALSA sound/pci/hda/hda_intel.c:796: hda-intel: timer: TCSEL=7
 [ 10.028003] ALSA sound/pci/hda/hda_intel.c:796: hda-intel: timer: TCSEL=7
 [ 10.059394] usb 5-8: configuration #1 chosen from 1 choice
this time booted with maxcpus=1

TJ (tj) wrote :

What we should have made clear in the two comments above is that we found by monitoring the PCI config space when the ALSA drivers weren't loaded that the TCSEL value was still being changed to 0x07 from 0x0. That was when we decided to implement a timer to find out when the value was changing.

TJ (tj) wrote :

Is it significant that upon resume the 2nd CPU is resumed *after* PCI config has been written back to the device and it has been reinitialised by its driver? When cold-starting the CPU has been started before the PCI device is configured.

See the attached log-file that compares the relevant sections of kern.log for a resume and cold-start.

TJ (tj) wrote :

I've tested the latest Gutsy kernel from git (2.6.22-8.18) and found it doesn't suffer from this bug. In that, the 2nd CPU is started before the PCI devices are resumed. After resume the HDA device PCI config shows the TCSEL register is 0x00.

Therefore I'm attaching a patch against Feisty's 2.6.20-16 kernel source. In sound/pci/hda/hda_intel.c::azx_pcm_open() it checks the value of TCSEL and if it isn't 0, writes 0 to the TCSEL register. This will provide a transparent fix that won't cause adverse reaction in other aspects of the code.

TJ (tj) wrote :

Changed status to "Fix Committed" now cause is known and patch is available.

Changed in linux-source-2.6.20:
status: Confirmed → Fix Committed
wvengen (wvengen) wrote :

I'd agree with you on the CPU start order.
To pinpoint the offending code, I instrumented some resume code with printk's and an msleep(10) after them (HZ=1000), so the TCSEL timer can jump in and report (it reports only when it detects a difference with the previous value). Result from this kernel which has lots of debugging info too:
  [ 1.240004] CPU1 is up
  [ 1.240008] kernel/power/main.c#145: suspend_finish
  [ 1.262189] ALSA sound/pci/hda/hda_intel.c:805: hda-intel: timer: TCSEL=7
  [ 1.322099] usb 5-8: new high speed USB device using ehci_hcd and address 3
  (...)
  [ 1.759403] ACPI: EC: acpi_ec_wait timeout, status = 0, expect_event = 1
  (...)
  [ 1.761131] kernel/power/main.c#148: suspend_finish
  [ 1.771415] kernel/power/main.c#150: suspend_finish
This tells us that suspend_finish() in kernel/power/main.c has the culprit:
  pm_ops->finish(state);
http://lxr.linux.no/source/kernel/power/main.c#L142 (my line numbers are slightly different)
Looking at the source, the only relevant call to pm_set_ops() is found in ACPI, so we need to look at acpi_pm_finish() http://lxr.linux.no/source/drivers/acpi/sleep/main.c#L138

wvengen (wvengen) wrote :

There is also a workaround that doesn't require patching the kernel, maybe this is of help to someone experiencing the problem right now.
Create a file /etc/acpi/resume.d/66-snd-hda-intel.sh that contains:

#!/bin/sh
# workaround resume bug on 2.6.20 (fixed in .22)
# see https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/100114
# this requires xserver-xorg-core to be installed, which is usually the case
pcitweak -w 00:1b:0 0x44 0

and make it executable (`chmod a+x /etc/acpi/resume.d/66-snd-hda-intel.sh`)
Please check that your audio controller is indeed on address 00:1b:0 using ` lspci | grep 'High Definition Audio Controller' `; if not, adjust accordingly.

TJ (tj) wrote :

Nice work-around Willem!

I've enhanced your script to make it more modular. Users can install this script and not have to check and set the PCI device ID as the script will figure it out.

#!/bin/bash
# this file is /etc/acpi/resume.d/66-snd-hda-intel.sh
# workaround for HDA resume bug on 2.6.20 (fixed in .22)
# see https://bugs.launchpad.net/ubuntu/+bug/100114
# this requires xserver-xorg-core to be installed, which is usually the case

DEVICEPATH="`ls -d /sys/module/snd_hda_intel/drivers/pci\:HDA\ Intel/0*`"
if [ "x$DEVICEPATH" != "x" ]; then
 # extract PCI device ID from path
 PCI=`expr match "$DEVICEPATH" '.*\/[0-9]\+:\(.*$\)'`
 # replace dot with colon
 PCI=`echo $PCI | tr "." ":"`
 # read the register
 TCSEL=`pcitweak -r $PCI 0x44`
 echo "TCSEL=$TCSEL"
 # test TCSEL against required value
 if [ "$TCSEL" != "0x00000000" ]; then
  echo "HDA TCSEL register needs correcting"
  pcitweak -w $PCI 0x44 0
 else
  echo "HDA TCSEL register okay"
 fi
fi

wvengen (wvengen) wrote :

Thanks TJ, that's much better :)

Changed in linux-source-2.6.22:
status: New → Fix Released
Changed in linux-source-2.6.20:
status: Fix Committed → Won't Fix
MattBlack (oli-ellis) wrote :

I've got Ubuntu 8.04 with AC97 sound, this seems to be happening to me. If I run the above script as root after resume from standby it reports

HDA TCSEL register needs correcting

But pcitweak doesn't seem to be available with hardy; neither is the above script installed

Per a decision made by the Ubuntu Kernel Team, bugs will longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers