Ubuntu

faxgetty segfault

Reported by Valentijn Sessink on 2010-06-30
This bug affects 10 people
Affects            Status         Importance   Assigned to      Milestone
HylaFAX            Fix Released   Medium
hylafax (Debian)   Incomplete     Unknown
hylafax (Ubuntu)                  Undecided    Giuseppe Sacco

Bug Description

Until recently, my HylaFAX installation worked flawlessly. Since upgrading from Dapper, through Hardy, to Lucid, the faxgetty process segfaults. The message is always the same:

 Jun 30 15:13:18 machinename kernel: [4917851.100056] faxgetty[26509]: segfault at a3c ip 0805c083 sp bf8304b0 error 4 in faxgetty[8048000+72000]
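
For readers unfamiliar with this kernel message, the following is a minimal sketch of how its fields can be pulled apart. It assumes the x86 Linux layout seen in this report (fault address, instruction pointer, stack pointer, page-fault error code, then the mapping as base+size in hex). The function and key names are made up for illustration and are not part of HylaFAX or any tool.

```python
import re

# Layout of the x86 Linux "segfault at ..." message as it appears in this report.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Decode one kernel segfault line; return None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    addr, ip, err, base, size = (
        int(m[k], 16) for k in ("addr", "ip", "err", "base", "size")
    )
    return {
        "process": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": addr,          # address the instruction tried to access
        "object": m["obj"],          # mapping that contains the instruction
        "ip_offset": ip - base,      # instruction offset inside that mapping
        "ip_in_mapping": base <= ip < base + size,
        "access": "write" if err & 2 else "read",   # bit 1 of the error code
        "user_mode": bool(err & 4),                 # bit 2 of the error code
        "page_present": bool(err & 1),              # bit 0 of the error code
    }


if __name__ == "__main__":
    line = ("Jun 30 15:13:18 machinename kernel: [4917851.100056] faxgetty[26509]: "
            "segfault at a3c ip 0805c083 sp bf8304b0 error 4 in faxgetty[8048000+72000]")
    info = decode_segfault(line)
    print(hex(info["fault_addr"]), hex(info["ip_offset"]), info["access"])
```

Read this way, the line describes a user-mode read of an unmapped page at address 0xa3c, issued by an instruction 0x14083 bytes into the faxgetty mapping. A fault address that close to zero is what a null pointer plus a small offset typically looks like, although the message alone cannot prove that.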

Peter Childs (pchilds-bcs) wrote :

This happens while receiving a fax. I've managed to get this exact bug to happen on more than one computer.

If you need any more detail let me know.

Jun 29 16:20:49.30: [ 9678]: SESSION BEGIN 000000090 442074031930
Jun 29 16:20:49.30: [ 9678]: HylaFAX (tm) Version 6.0.3
Jun 29 16:20:49.30: [ 9678]: <-- [4:ATA\r]
Jun 29 16:20:56.26: [ 9678]: --> [7:CONNECT]
Jun 29 16:20:56.26: [ 9678]: ANSWER: FAX CONNECTION DEVICE '/dev/ttyS0'
Jun 29 16:20:56.26: [ 9678]: RECV FAX: begin
Jun 29 16:20:56.26: [ 9678]: <-- data [32]
Jun 29 16:20:56.26: [ 9678]: <-- data [2]
Jun 29 16:20:58.24: [ 9678]: --> [7:CONNECT]
Jun 29 16:20:58.24: [ 9678]: <-- data [23]
Jun 29 16:20:58.24: [ 9678]: <-- data [2]
Jun 29 16:20:58.99: [ 9678]: --> [7:CONNECT]
Jun 29 16:20:58.99: [ 9678]: <-- data [13]
Jun 29 16:20:58.99: [ 9678]: <-- data [2]
Jun 29 16:20:59.48: [ 9678]: --> [2:OK]
Jun 29 16:20:59.48: [ 9678]: <-- [9:AT+FRH=3\r]
Jun 29 16:20:59.94: [ 9678]: --> [7:CONNECT]
Jun 29 16:21:01.69: [ 9678]: --> [2:OK]
Jun 29 16:21:01.69: [ 9678]: RECV recv TSI (sender id)
Jun 29 16:21:01.69: [ 9678]: REMOTE TSI ""
Jun 29 16:21:01.69: [ 9678]: <-- [9:AT+FRH=3\r]
Jun 29 16:21:01.70: [ 9678]: --> [7:CONNECT]
Jun 29 16:21:01.96: [ 9678]: --> [2:OK]
Jun 29 16:21:01.96: [ 9678]: RECV recv DCS (command signal)
Jun 29 16:21:01.96: [ 9678]: REMOTE wants 14400 bit/s
Jun 29 16:21:01.96: [ 9678]: REMOTE wants A4 page width (215 mm)
Jun 29 16:21:01.96: [ 9678]: REMOTE wants unlimited page length
Jun 29 16:21:01.96: [ 9678]: REMOTE wants 7.7 line/mm
Jun 29 16:21:01.96: [ 9678]: REMOTE wants 1-D MH
Jun 29 16:21:01.96: [ 9678]: RECV training at v.17 14400 bit/s
Jun 29 16:21:01.96: [ 9678]: <-- [11:AT+FRM=145\r]
Jun 29 16:21:03.60: [ 9678]: --> [7:CONNECT]
Jun 29 16:21:05.17: [ 9678]: RECV: TCF 2774 bytes, 1% non-zero, 2699 zero-run
Jun 29 16:21:05.17: [ 9678]: --> [10:NO CARRIER]
Jun 29 16:21:05.17: [ 9678]: DELAY 70 ms
Jun 29 16:21:05.24: [ 9678]: TRAINING succeeded
Jun 29 16:21:05.24: [ 9678]: <-- [9:AT+FTH=3\r]
Jun 29 16:21:05.26: [ 9678]: --> [7:CONNECT]
Jun 29 16:21:05.26: [ 9678]: <-- data [3]
Jun 29 16:21:05.26: [ 9678]: <-- data [2]
Jun 29 16:21:06.47: [ 9678]: --> [2:OK]
Jun 29 16:21:06.47: [ 9678]: <-- [11:AT+FRM=146\r]
Jun 29 16:21:07.36: [ 9678]: --> [7:CONNECT]
Jun 29 16:21:07.36: [ 9678]: RECV: begin page

Giuseppe Sacco (eppesuig) wrote :

Could you provide a complete session log created with sessiontracing 0x08FFF ?

Thanks,
Giuseppe

Changed in hylafax (Ubuntu):
assignee: nobody → Giuseppe Sacco (giuseppe-eppesuigoccas)
yaztromo (tromo) wrote :

Just to add, I am also experiencing this since upgrading from Hardy to Lucid server. faxgetty segfaults just after "RECV: begin page".

Sep 01 12:32:20.60: [23473]: SESSION BEGIN 000022359 441709369264
Sep 01 12:32:20.60: [23473]: HylaFAX (tm) Version 6.0.3
Sep 01 12:32:20.60: [23473]: <-- [4:ATA\r]
Sep 01 12:32:25.81: [23473]: --> [7:CONNECT]
Sep 01 12:32:25.81: [23473]: ANSWER: FAX CONNECTION DEVICE '/dev/ttyS0'
Sep 01 12:32:25.81: [23473]: RECV FAX: begin
Sep 01 12:32:25.83: [23473]: <-- data [32]
Sep 01 12:32:25.83: [23473]: <-- data [2]
Sep 01 12:32:26.80: [23473]: --> [7:CONNECT]
Sep 01 12:32:26.80: [23473]: <-- data [23]
Sep 01 12:32:26.80: [23473]: <-- data [2]
Sep 01 12:32:26.83: [23473]: --> [7:CONNECT]
Sep 01 12:32:26.83: [23473]: <-- data [13]
Sep 01 12:32:26.83: [23473]: <-- data [2]
Sep 01 12:32:28.96: [23473]: --> [2:OK]
Sep 01 12:32:28.96: [23473]: <-- [9:AT+FRH=3\r]
Sep 01 12:32:29.50: [23473]: --> [7:CONNECT]
Sep 01 12:32:30.62: [23473]: --> [2:OK]
Sep 01 12:32:30.62: [23473]: RECV recv DCS (command signal)
Sep 01 12:32:30.62: [23473]: REMOTE wants 14400 bit/s
Sep 01 12:32:30.62: [23473]: REMOTE wants A4 page width (215 mm)
Sep 01 12:32:30.62: [23473]: REMOTE wants unlimited page length
Sep 01 12:32:30.62: [23473]: REMOTE wants 7.7 line/mm
Sep 01 12:32:30.62: [23473]: REMOTE wants 1-D MH
Sep 01 12:32:30.62: [23473]: RECV training at v.17 14400 bit/s
Sep 01 12:32:30.62: [23473]: <-- [11:AT+FRM=145\r]
Sep 01 12:32:32.13: [23473]: --> [7:CONNECT]
Sep 01 12:32:33.95: [23473]: RECV: TCF 2934 bytes, 6% non-zero, 2732 zero-run
Sep 01 12:32:34.16: [23473]: --> [10:NO CARRIER]
Sep 01 12:32:34.16: [23473]: <-- [9:AT+FRS=7\r]
Sep 01 12:32:34.17: [23473]: --> [2:OK]
Sep 01 12:32:34.17: [23473]: TRAINING succeeded
Sep 01 12:32:34.17: [23473]: <-- [9:AT+FTH=3\r]
Sep 01 12:32:34.22: [23473]: --> [7:CONNECT]
Sep 01 12:32:34.22: [23473]: <-- data [3]
Sep 01 12:32:34.22: [23473]: <-- data [2]
Sep 01 12:32:35.46: [23473]: --> [2:OK]
Sep 01 12:32:35.46: [23473]: <-- [11:AT+FRM=146\r]
Sep 01 12:32:37.86: [23473]: --> [7:CONNECT]
Sep 01 12:32:37.86: [23473]: RECV: begin page

And from dmesg:
[316471.104213] faxgetty[23473]: segfault at a3c ip 0805c083 sp bf88e390 error 4 in faxgetty[8048000+72000]

I have changed sessiontracing to 0x08FFF in config.tty0. There's also config and config.sav; do I need to change it there too? I can give you a couple of days' output, then I will have to revert to Hardy from a dd image, since I risk losing the company's faxed purchase orders.

Leonardo (rnalrd) wrote :

Disabling/stopping apparmor "fixes" the issue for me

yaztromo (tromo) wrote :

Here are logs with session tracing set to 0x08FFF and server tracing set to 0x0FFFF. Leonardo, I will try your solution tomorrow; I have no use for apparmor anyway.

From the faxlog:
Sep 02 12:22:29.69: [ 4541]: SESSION BEGIN <number hidden>
Sep 02 12:22:29.69: [ 4541]: HylaFAX (tm) Version 6.0.3
Sep 02 12:22:29.69: [ 4541]: <-- [4:ATA\r]
Sep 02 12:22:34.94: [ 4541]: --> [7:CONNECT]
Sep 02 12:22:34.94: [ 4541]: ANSWER: FAX CONNECTION DEVICE '/dev/ttyS0'
Sep 02 12:22:34.94: [ 4541]: STATE CHANGE: ANSWERING -> RECEIVING
Sep 02 12:22:34.94: [ 4541]: RECV FAX: begin
Sep 02 12:22:34.95: [ 4541]: <-- HDLC<32:FF C0 04 B5 00 AA 12 9E 36 86 62 82 1A 04 14 2E B6 94 04 6A A6 4E CE 96 F6 76 04 6C 74 0C 74 CC>
Sep 02 12:22:34.95: [ 4541]: <-- data [32]
Sep 02 12:22:34.95: [ 4541]: <-- data [2]
Sep 02 12:22:35.94: [ 4541]: --> [7:CONNECT]
Sep 02 12:22:35.94: [ 4541]: <-- HDLC<23:FF C0 02 CE 4E A6 76 2E 4E 86 0A 64 2E 66 F6 4E C6 A6 A6 42 04 04 04>
Sep 02 12:22:35.94: [ 4541]: <-- data [23]
Sep 02 12:22:35.94: [ 4541]: <-- data [2]
Sep 02 12:22:35.96: [ 4541]: --> [7:CONNECT]
Sep 02 12:22:35.96: [ 4541]: <-- HDLC<13:FF C8 01 00 77 5F 23 01 FB C1 01 01 18>
Sep 02 12:22:35.96: [ 4541]: <-- data [13]
Sep 02 12:22:35.96: [ 4541]: <-- data [2]
Sep 02 12:22:38.08: [ 4541]: --> [2:OK]
Sep 02 12:22:38.08: [ 4541]: <-- [9:AT+FRH=3\r]
Sep 02 12:22:38.57: [ 4541]: --> [7:CONNECT]
Sep 02 12:22:39.74: [ 4541]: --> HDLC<9:FF C8 C1 00 46 1F 00 3D 9B>
Sep 02 12:22:39.75: [ 4541]: --> [2:OK]
Sep 02 12:22:39.75: [ 4541]: RECV recv DCS (command signal)
Sep 02 12:22:39.75: [ 4541]: REMOTE wants 14400 bit/s
Sep 02 12:22:39.75: [ 4541]: REMOTE wants A4 page width (215 mm)
Sep 02 12:22:39.75: [ 4541]: REMOTE wants unlimited page length
Sep 02 12:22:39.75: [ 4541]: REMOTE wants 7.7 line/mm
Sep 02 12:22:39.75: [ 4541]: REMOTE wants 1-D MH
Sep 02 12:22:39.75: [ 4541]: RECV training at v.17 14400 bit/s
Sep 02 12:22:39.75: [ 4541]: <-- [11:AT+FRM=145\r]
Sep 02 12:22:41.24: [ 4541]: --> [7:CONNECT]
Sep 02 12:22:43.06: [ 4541]: RECV: TCF 2934 bytes, 6% non-zero, 2732 zero-run
Sep 02 12:22:43.27: [ 4541]: --> [10:NO CARRIER]
Sep 02 12:22:43.27: [ 4541]: <-- [9:AT+FRS=7\r]
Sep 02 12:22:43.28: [ 4541]: --> [2:OK]
Sep 02 12:22:43.28: [ 4541]: TRAINING succeeded
Sep 02 12:22:43.28: [ 4541]: <-- [9:AT+FTH=3\r]
Sep 02 12:22:43.33: [ 4541]: --> [7:CONNECT]
Sep 02 12:22:43.33: [ 4541]: <-- HDLC<3:FF C8 21>
Sep 02 12:22:43.33: [ 4541]: <-- data [3]
Sep 02 12:22:43.33: [ 4541]: <-- data [2]
Sep 02 12:22:44.56: [ 4541]: --> [2:OK]
Sep 02 12:22:44.56: [ 4541]: <-- [11:AT+FRM=146\r]
Sep 02 12:22:46.96: [ 4541]: --> [7:CONNECT]
Sep 02 12:22:46.96: [ 4541]: MODEM input buffering enabled
Sep 02 12:22:46.96: [ 4541]: RECV: begin page
Sep 02 12:22:47.09: [ 4541]: RECV/CQ: Bad 1D pixel count, row 0, got 1843, expected 1728
Sep 02 12:22:47.38: [ 4541]: RECV/CQ: Bad 1D pixel count, row 1, got 1830, expected 1728
Sep 02 12:22:47.52: [ 4541]: RECV/CQ: Bad 1D pixel count, row 2, got 1745, expected 1728
Sep 02 12:22:47.52: [ 4541]: RECV/CQ: Bad 1D pixel count, row 3, got 3722, expected 1728
Sep 02 12:22:47.66: [ 4541]: RECV/CQ: Bad 1D pixel count, row 4, got 3187, expected 1728
Sep 02 12:22:48.01: [ 4541]: RECV/...
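
The copy-quality lines at the end of this trace can be summarised mechanically. Below is a minimal sketch that extracts the row number and pixel counts from "RECV/CQ: Bad 1D pixel count" messages. The message format is taken from the lines above; the helper name is hypothetical, and the excerpt is limited to the five complaints visible before the log is cut off.

```python
import re

# Format of the copy-quality complaint as it appears in the session log above.
CQ_RE = re.compile(
    r"RECV/CQ: Bad 1D pixel count, row (\d+), got (\d+), expected (\d+)"
)


def bad_rows(log_text):
    """Return a (row, got, expected) tuple for every copy-quality complaint."""
    return [tuple(int(g) for g in m.groups()) for m in CQ_RE.finditer(log_text)]


if __name__ == "__main__":
    excerpt = """\
Sep 02 12:22:47.09: [ 4541]: RECV/CQ: Bad 1D pixel count, row 0, got 1843, expected 1728
Sep 02 12:22:47.38: [ 4541]: RECV/CQ: Bad 1D pixel count, row 1, got 1830, expected 1728
Sep 02 12:22:47.52: [ 4541]: RECV/CQ: Bad 1D pixel count, row 2, got 1745, expected 1728
Sep 02 12:22:47.52: [ 4541]: RECV/CQ: Bad 1D pixel count, row 3, got 3722, expected 1728
Sep 02 12:22:47.66: [ 4541]: RECV/CQ: Bad 1D pixel count, row 4, got 3187, expected 1728
"""
    for row, got, expected in bad_rows(excerpt):
        print(f"row {row}: {got - expected} pixels too many")
```

In this excerpt every one of the first five rows decodes to more than the expected 1728 pixels, so the page data is already being rejected from row 0 onwards. Whether that is related to the segfault cannot be told from the log alone.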

yaztromo (tromo) wrote :

Disabling apparmor does not work for me.

I'm reverting the server back to hardy, before I get in trouble for lost faxes :)

Giuseppe Sacco (eppesuig) wrote :

Could you try 6.0.4-10 as published in Debian? It should work out of the box on Lucid.

Thanks,
Giuseppe

yaztromo (tromo) wrote :

Tested. Seems to work perfectly.

Giuseppe Sacco (eppesuig) wrote :

Let's wait a couple of days, then I will close this bug report.

Changed in hylafax (Ubuntu):
status: New → Fix Committed
yaztromo (tromo) wrote :

Will this fix be released as an updated set of packages for Lucid?

I will keep the Hardy hylafax installed until then, since I trust it.

Francesco (francesco-colista) wrote :

Installing backport from maverick to lucid does not work.
Hylafax server and client 6.0.4-10

Giuseppe Sacco (eppesuig) wrote :

In order to collect more information, could you change ServerTracing and SessionTracing to 8FFFF, restart hylafax, produce the error, and then send the relevant log from /var/log/daemon and/or /var/log/messages ?
Thanks

Francesco (francesco-colista) wrote :

That's my /var/log/messages.
__________________________________________________________________________________________

Sep 16 11:01:51 itras01 FaxGetty[32411]: ANSWER: FAX CONNECTION DEVICE '/dev/ttyS11'
Sep 16 11:02:03 itras01 ntpd[798]: kernel time sync status change 6001
Sep 16 11:02:19 itras01 kernel: [854106.814226] faxgetty[32411]: segfault at a3c ip 0805b6a3 sp bf911c60 error 4 in faxgetty[8048000+72000]
Sep 16 11:02:25 itras01 HylaFAX[7364]: checkHostIdentity("itmon01")
Sep 16 11:03:25 itras01 HylaFAX[7365]: checkHostIdentity("itmon01")

The error is recorded in dmesg too:

[854106.814226] faxgetty[32411]: segfault at a3c ip 0805b6a3 sp bf911c60 error 4 in faxgetty[8048000+72000]

I cannot reproduce the error. Anyway, it seems that the segfaults happen only with ttyS11.
I have a multi-modem card with an ST16654 chipset.
Everything works perfectly with 8.04.

Giuseppe Sacco (eppesuig) wrote :

[I will ask for something somewhat peculiar, hopefully not too difficult to do.]
Could you check whether faxgetty leaves a core dump? A segfault usually creates such a file unless ulimit forbids it.
The limits are specified in /etc/security/limits.conf; you should add two lines to that file:

uucp soft core 100000
uucp hard core 100000

Since I think that file is read at login time, you should log out, log in again, and restart the hylafax server.
Once faxgetty segfaults, you should find its core file somewhere, probably in /var/spool/hylafax. Then run gdb against the core file and, once in gdb, type the command "bt" to get a backtrace. Please send me its output.

Thank you very much,
Giuseppe
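
Before waiting for the next crash, it can help to confirm that the core limit really took effect. The sketch below is only an illustration, not part of HylaFAX: it reads the limit of the current process through Python's resource module, converts the limits.conf value (which is given in KB) to bytes, and parses the "Max core file size" line of /proc/<pid>/limits so the limit of an already running faxgetty can be inspected. The sample line in the demo is made up to show the format.

```python
import resource


def current_core_limit():
    """Return the (soft, hard) core file size limit of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def describe(limit):
    """Render a limit value the way a human would read it."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


def limits_conf_core_to_bytes(kilobytes):
    """The 'core' item in limits.conf is expressed in KB."""
    return kilobytes * 1024


def parse_proc_limits(text):
    """Extract (soft, hard) as strings from the text of /proc/<pid>/limits."""
    prefix = "Max core file size"
    for line in text.splitlines():
        if line.startswith(prefix):
            soft, hard = line[len(prefix):].split()[:2]
            return soft, hard
    return None


if __name__ == "__main__":
    soft, hard = current_core_limit()
    print("this process:", describe(soft), "/", describe(hard))
    print("limits.conf value 100000 means", limits_conf_core_to_bytes(100000), "bytes")
    # Illustrative line only; read the real file for a running faxgetty PID.
    sample = "Max core file size        0                    unlimited            bytes"
    print(parse_proc_limits(sample))
```

The first function only reports the limit of whichever process runs it, so it says nothing about a faxgetty started by init. Reading /proc/<pid>/limits for the faxgetty PID is the more direct check. A soft limit of 0 there would explain a missing core file.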

Giuseppe Sacco (eppesuig) wrote :

Francesco,
another (simpler) way to check it using gdb is to follow the instructions at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=553027#15

Bye,
Giuseppe

Francesco (francesco-colista) wrote :

Well, after upgrading to kernel 2.6.32-24-generic-pae, it seems that the segfault no longer occurs.
Anyway, I cannot reproduce the bug.

Mike Nielsen (mike-getbent) wrote :

Updating to kernel 2.6.32-24-generic-pae seems to have helped but has not fixed my issue. Instead of segfaulting every hour or so, I can now get nearly a day out of HylaFAX before it crashes.

Apparmor is disabled.

The system in question has 3 modems attached, all of different makes and models. Regardless of the modem or PID, the error is always the same:

Sep 25 14:15:11 ITMFAX-XX kernel: [51933.292051] faxgetty[2432]: segfault at a3c ip 0805c083 sp bffdd890 error 4 in faxgetty[8048000+72000]
Sep 25 14:17:29 ITMFAX-XX kernel: [52071.504050] faxgetty[2433]: segfault at a3c ip 0805c083 sp bfb335e0 error 4 in faxgetty[8048000+72000]

Giuseppe Sacco (eppesuig) wrote :

Hi Mike, could you please check it using gdb, as explained in previous messages in this bug report?

Thanks,
Giuseppe

Mike Nielsen (mike-getbent) wrote :

Just got a different error, Hylafax was in the middle of receiving at the time.

Sep 27 09:06:17 ITMFAX-XX kernel: [206199.152049] faxgetty[9480]: segfault at 1850 ip b7499785 sp bff8c038 error 4 in libc-2.11.1.so[b7426000+153000]
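
To see whether two segfault messages point at the same place, one option is to group them by the mapped object and by the offset of the instruction pointer inside it. The following is a rough sketch using Mike's three lines. The helper name is hypothetical, and the comparison only makes sense between machines running the same build of the binary or library.

```python
import re
from collections import Counter

# Only the fields needed to identify the crash site are captured here.
SITE_RE = re.compile(
    r"segfault at [0-9a-f]+ ip (?P<ip>[0-9a-f]+) .* in "
    r"(?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+[0-9a-f]+\]"
)


def crash_sites(lines):
    """Count segfaults per (object, offset of ip inside that object)."""
    sites = Counter()
    for line in lines:
        m = SITE_RE.search(line)
        if m:
            offset = int(m["ip"], 16) - int(m["base"], 16)
            sites[(m["obj"], hex(offset))] += 1
    return sites


if __name__ == "__main__":
    lines = [
        "Sep 25 14:15:11 ITMFAX-XX kernel: [51933.292051] faxgetty[2432]: segfault at a3c ip 0805c083 sp bffdd890 error 4 in faxgetty[8048000+72000]",
        "Sep 25 14:17:29 ITMFAX-XX kernel: [52071.504050] faxgetty[2433]: segfault at a3c ip 0805c083 sp bfb335e0 error 4 in faxgetty[8048000+72000]",
        "Sep 27 09:06:17 ITMFAX-XX kernel: [206199.152049] faxgetty[9480]: segfault at 1850 ip b7499785 sp bff8c038 error 4 in libc-2.11.1.so[b7426000+153000]",
    ]
    for (obj, offset), count in crash_sites(lines).items():
        print(f"{obj}+{offset}: {count}")
```

Grouped this way, the first two crashes share one site inside faxgetty, while the third faults inside libc at a different offset and with a different fault address, so it may be a separate failure or a later symptom of the same corruption.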

Mike Nielsen (mike-getbent) wrote :

I've set the limits.conf lines above. Any guess on where I might find the core?

Giuseppe Sacco (eppesuig) wrote :

If you follow the instructions at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=553027#15 then you should find them in /tmp/faxsend-*

Mike Nielsen (mike-getbent) wrote :

I have the faxsend wrappers in place but still get no core dumps on subsequent segfaults. I'm unsure if it matters but I have never segfaulted on a send, only on reception of a fax.

I also upgraded to kernel 2.6.32-25-generic-pae this morning with no change in the problem.

What other information can I get you? Is there another way to wrap the process for debugging?

Giuseppe Sacco (eppesuig) wrote :

Hi Mike, you are right! I didn't mean to use the faxsend script for faxsend, but for faxgetty.
You should «killall faxgetty», download the general script gdb-wrapper.sh from http://people.ifax.com/~aidan/hylafax/gdb/ and run it with the argument "ttyS0" (or whichever device name your modem is available on).

yaztromo (tromo) wrote :

What is the status of this bug? It seems to be ongoing, yet the status is marked as Fix Committed.

I would like to be able to use the Lucid hylafax package instead of having Hardy packages installed on a Lucid server.

Changed in hylafax (Debian):
status: Unknown → Incomplete

Hello

Fresh Lucid server 64-bit install, 2.6.32-25-generic-pae, system hangs once per day after receiving a fax.

Best regards
Antonio

Giuseppe Sacco (eppesuig) wrote :

Hi all,
I just uploaded a new hylafax package, version 6.0.5-4, with two new binary packages containing debugging symbols for client and server. They are currently sitting in the NEW queue for Debian, since new binary packages require manual approval. Once approved, the package should migrate to Debian unstable and start being rebuilt for all architectures.

You may wait a few days and get it for your architecture, or you may download its source from the NEW queue and rebuild it yourself. Have a look at http://ftp-master.debian.org/new.html for information on the NEW queue.

Please install all binary packages, including the new ones with debugging information, and try again to use gdb in order to produce a backtrace. The new backtrace should hopefully show all debugging information, which in turn should make it simpler to find this problem.

Good morning

Regarding #25, the cause of the described hang does not appear to be in hylafax; the machine in question exhibited an intermittent hardware problem, which was corrected. Further investigation is being performed, and I will have more data available by the beginning of next week.

All the best

Antonio

Changed in hylafax (Ubuntu):
status: Fix Committed → Confirmed
Changed in hylafax:
status: Unknown → Confirmed
Andreas Oster (aoster) wrote :

Hello all,

what is the current status? Is someone working on the problem?
I have recently upgraded our hylafax installation from an old 4.X
version to 6.0.4 (Ubuntu 10.10). I am using t38modem 1.2.0 and
am now facing the exact same problem with faxgetty segfaulting.
I have noticed that most of the time this happens, the peer
machine is a software fax. With hardware-based fax machines
this does not seem to happen very often.

I have tested several hylafax versions, 6.0.4, 6.0.5 and 6.1git, all
with the same issues.

Has someone built a working hylafax 4.4.7 deb package for Ubuntu 10.10
that I could try?

Thank you for your kind help

Andreas

Giuseppe Sacco (eppesuig) wrote :

Hi Andreas,
I received good news a few hours ago. An Ubuntu user told me that the Natty packages are working fine on Ubuntu 10.10. Could you try them (and also install the -dbg package so that you can get gdb traces if needed)?

Thanks,
Giuseppe

Andreas Oster (aoster) wrote :

Hello Giuseppe,

thank you for the fast reply.

I will try the new Natty packages and will report my findings.

Unfortunately I have no experience with debugging. How do
I get the traces ? Is there a howto I can read ?

regards

Andreas

Giuseppe Sacco (eppesuig) wrote :

Hi Andreas, check comment #23 for a few instructions about debugging.

Andreas Oster (aoster) wrote :

Hello Giuseppe,

I've tested the new packages yesterday evening and the segfault seems
to have gone :-)

Unfortunately I was still not able to receive my twelve-page test facsimile
from a software fax solution. I have tried several times but always got the
following error message in the logs:

"V.21 signal reception timeout; expected page possibly not received in full"

Any idea what could cause this ?

Thank you for your kind help

Andreas

Andreas Oster (aoster) wrote :

Hello Giuseppe,

seems like I have found the problem. While testing, I had
disabled ECM on the Cisco gateway (dial-peer). After re-enabling
ECM everything works as expected.

Thanks

Andreas

Changed in hylafax:
importance: Unknown → Medium
Simon G. Stikkelorum (iah) wrote :

Can I please ask for the situation at this moment? I am running a fully updated server Ubuntu 10.04 Linux 2.6.32-29 and still have the problem.
Thank you

Simon

On Tue, 15 Mar 2011 12:33:30 -0000, "Simon G. Stikkelorum"
<email address hidden> wrote:
> Can I please ask for the situation at this moment? I am running a
> fully updated server Ubuntu 10.04 Linux 2.6.32-29 and still have the
> problem.
> Thank you
>
> Simon

Because of this bug (opened several months ago) I've removed 10.04 and
rolled back to 8.04.

8.04 is not affected.

--
:: Francesco ::
Jabber: <email address hidden>
E-Mail: <email address hidden>
GnuPG: FE9DDD5F

Giuseppe Sacco (eppesuig) wrote :

Hi,
this bug report covers many different problems. The only way to fix each of them is to supply all the information using a debugger, as explained in comments #15, #21, and #23. If you can produce that data, then your bug will probably be found and fixed.

You may add your data to this report or you may send it directly to me.

Bye,
Giuseppe

Hi

I am running 10.04 server on a 32-bit PC, and everything is running flawlessly; my previous problem was an intermittent hardware one.

Cheers

Antonio

Simon G. Stikkelorum (iah) wrote :

Gentlemen,

Thank you for your prompt response. I did not subscribe to this topic, so I am a bit late reacting, sorry. I am happy to see commitment here.

@Giuseppe: I will run faxgetty in the wrapper and get back here with
           results. Although I should be able to dig down into the
           code as well. The problem occurs about once every 2
           weeks with a station that does not send ID. Other faxes
           we receive without ID are all spam-advertisements.

@Francesco: It sure is a good thing to know the old version works.
           I have been running hylafax (with this modem) for over
           10 years, so I know it can be as solid as a rock. I can
           allow for this situation to continue a bit longer. So
           I can contribute to solving this. I'd rather go this
           way than to revert back to an old version. I understand
           why you would choose for that, though.

@Antonio: I will check, I have been moving the server. But would
           you not agree that software should not segfault as a
           result of hardware problems?

More information is that the failure seems to occur at the end of the page, not at the beginning as you'd think from the logging. In the recvq I find a tif file of a reasonable length for one page.
But there are two things wrong with the file:
It has the wrong permissions (rwx------ for user uucp), and when I do chmod 777 the file still produces no picture. This leads me to think that something goes wrong at the end of the page.

That's all for now. I'll start faxgetty in the wrapper and I'll be back with the results. Maybe I should take a look at the source; I should have some two weeks before the next failure.

Regards,

Simon.

Simon G. Stikkelorum (iah) wrote :

Giuseppe,

I have tried the wrapper but it does not seem to work. So I tried to just run the gdb command found in the wrapper from the command line, and found that gdb complains that it does not have debugging symbols. I am sorry, but I have little experience with debugging with gdb.
I think I have to obtain the source, compile it with specific options and then run it with this gdb command. It is no problem for me to run this directly from the command line, if needed.

Would you please have some pointers on how to go forward?

Regards,

Simon.

Giuseppe Sacco (eppesuig) wrote :

Simon,
you should get the packages from Debian and rebuild them. I am unsure whether you can install them as they are. They might work. If you get any problem while installing them via dpkg, then you'll have to rebuild them.

In Debian you will find two more packages with names ending in -dbg, that include all debug symbols.

http://packages.debian.org/source/testing/hylafax

Bye
Giuseppe

yaztromo (tromo) wrote :

Picked up a segfault!

yaztromo (tromo) wrote :
Giuseppe Sacco (eppesuig) wrote :

Hi Yaztromo,
the trace shows exactly the same problem I am referring to, so I think this is not solved at all. I still think the problem is in g++, so I rebuilt all packages without any optimization. They are available at:
http://eppesuigoccas.homedns.org/~giuseppe/debian/hylafax/lucid-i386/bug%23600219-no-optimization/

Could you please use them and test it again?

Thanks,
Giuseppe

yaztromo (tromo) wrote :

Installed and running :) Let's see!

Just for reference my CPU specs and kernel in case they are the problem.

Linux Shula 2.6.32-39-generic-pae #86-Ubuntu SMP Mon Feb 13 23:05:11 UTC 2012 i686 GNU/Linux

processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 3
model name : AMD Duron(tm) Processor
stepping : 1
cpu MHz : 796.431
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow up
bogomips : 1592.86
clflush size : 32
cache_alignment : 32
address sizes : 36 bits physical, 32 bits virtual

MarkA (deeptht69-4) wrote :

Giuseppe:

Just to confirm, I am coming up on one week since installing the "-O" compiled packages with no segfaults so far. Could this bug be depending on the CPU type? My system is also an AMD processor, but a more recent one. (Athlon, maybe?)

Anyway, I'm hoping I've seen my last segfault.

Mark

Simon G. Stikkelorum (iah) wrote :

Giuseppe,

I have been digging into the 6.0.5 code using the input Yaztromo provided with the segfault he logged. I would like to make two statements and have your view on them:

1. I found that the actual code on that famous line 950 is just "if (log) {". This "log" could be a handle to a log file, and yes, it may be NULL; for example, in line 700 it is set to NULL. Could it simply be that log is set to NULL and then makes line 950 do nasty things?

2. If you say it is a compiler error, then you should always see this with the same fax partner, as it will use the same protocol. My feeling is that the problem occurs more randomly. It seems more like a "once per fifty faxes" thing, not a structural compiler blunder.

Looking forward to your angle on this.

Simon.

Giuseppe Sacco (eppesuig) wrote :

Hi Simon,
I try to reply to your questions.

1. The problem about log. The error is due to "this" being NULL. Basically, in C++ a member variable does not live at a fixed address: you need a base pointer to your instance (called "this"), and the variable is found at an offset from that address. "this" is a pointer to a complex data structure and "log" is a variable inside it, so once you have "this" you can address "log". Since "this" is NULL, the computed address is wrong and you get a segmentation violation. So the value of "log" is never even read, whether or not it is NULL.

2. About the compiler error. "this" is not an explicit variable that you change in your code; you cannot change it once it is created. You can create instances of an object and delete them, but user code cannot change "this", and this is why it may be a compiler error. I do not have an answer as to why this happens randomly even when communicating with the same fax partner.

Of course I might be wrong, but this is my opinion so far.

Bye,
Giuseppe
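
The "base pointer plus offset" arithmetic described above can be illustrated with a small model. This is only a sketch under a loud assumption: the structure layout below is hypothetical and simply places a 32-bit member 0xa3c bytes into the object so that it lines up with the "segfault at a3c" kernel messages; the real layout of ModemServer is not known from this thread.

```python
import ctypes

FAULT_ADDR = 0xA3C  # "segfault at a3c" as reported by the kernel in this bug


class ModemServerLike(ctypes.Structure):
    """HYPOTHETICAL stand-in for a C++ object; not the real ModemServer layout."""
    _fields_ = [
        # Everything that precedes the member, lumped together as raw bytes.
        ("other_members", ctypes.c_uint8 * FAULT_ADDR),
        # Stand-in for a 32-bit pointer member such as "log" on an i386 build.
        ("log", ctypes.c_uint32),
    ]


def member_address(this, field):
    """Address the CPU would access for this->field: base pointer plus offset."""
    return this + getattr(ModemServerLike, field).offset


if __name__ == "__main__":
    # A valid object somewhere on the heap (made-up address).
    print(hex(member_address(0x09000000, "log")))
    # The same access with this == NULL lands in the unmapped first page.
    print(hex(member_address(0, "log")))
```

With a null base pointer the access lands at the member's offset itself, which is why a tiny fault address such as 0xa3c is consistent with "this" being NULL. The model says nothing about how "this" became NULL in the first place.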

yaztromo (tromo) wrote :

I was thinking maybe the crash is only triggered one in every x faxes because most of the time the bad compiled code is not touched upon often. Possibly in this case it is "Adjusting for RTC (real time clock?) found at row" in "recvPageDLEData". Maybe this does not happen often, and when it does, then "this" is corrupted, and we get segfault.

It would look like a compiler bug, since other distros do not appear to be suffering from it. Although I do not know how much hylafax is used in the wild these days. Many people now use hylafax+ too.

Of course I am speaking as a total layman, but just looking at the gdb log against the source code, it is hard to see why the source would corrupt "this" pointer in the run up to the segfault.

I am suspicious that it is only one sender that is causing the segfault for me. The sender is using a USR modem with a very old version of VSI-FAX on SCO UNIX. Both USR modems and VSI fax are known to be problematic. That sender has always been troublesome, one in every 5 or so transmissions has always failed without segfault.

MarkA (deeptht69-4) wrote :

Adding my two cents.....

As yaztromo says, segfaults seem to be more likely when the fax is originating from a particular sender. However, for me, when the "problem sender" would try to re-send the same fax, it could segfault in a different place each time. This would suggest that it is nothing in the data stream itself that triggers the error, but something in the control signals being exchanged perhaps? The people in the sending office told me that they have "a lot of problems with that machine (a Konica/Minolta 2900)", but I don't know exactly what that means. What office worker doesn't like to complain about their equipment? I looked up the brochure for the 2900, and it looks like a quality device, and not that old. Perhaps the control/status signals are causing an unexpected condition, or triggering the execution of badly-optimized code that clobbers the "this" pointer?

Simon G. Stikkelorum (iah) wrote :

Hi All,

Thank you all for your input. I'll put in a bit more detail: RTC means Return To Control. What I know about faxes is that they start up and negotiate how they are going to do "it". So the exact protocol, page size, and transfer speed are agreed. This happens with a bi-directional communication of 1200 or 2400 Baud. Then the actual transfer is done, say at 9600 Baud, which was blistering fast in those days, but at that moment the transfer is one-way only. So the sender pumps the data and the receiver simply has to chew it. Back then telephone connections were slightly more troublesome. That is why bad lines are counted.
At the end of a page the fax goes back to command mode and negotiates further with the sender. Eventually another page is sent, and after the last page the whole thing is concluded. Again in command mode.
The switchback at the end of the page is triggered by a particular sequence of data sent in high speed transfer. Once this RTC code is seen, the "offending code" is executed.

So, that leads me to the insight that this code is not rarely executed. At least it is used for each fax that is sent in a particular mode. And that is why I have trouble believing it is a compiler problem. And I sure agree with Giuseppe that the NULL value for "this" is wrong. But on the other hand we also get the remark "No such file or directory." And I cannot see how "if (log) {" can result in this additional "No such file or directory." I'll do some more digging in the code tonight.

Bye for now,
Simon.
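
The "particular sequence of data" that ends a page can be made concrete. In ITU-T T.4 one-dimensional coding, which is what these sessions negotiate ("REMOTE wants 1-D MH"), the end of a page is marked by RTC (return to control): six consecutive EOL codes, each EOL being eleven zero bits followed by a one. The sketch below looks for that pattern in a string of '0' and '1' characters. It is an illustration only, not HylaFAX's decoder, and it assumes the bits have already been unpacked in transmission order.

```python
import re

EOL = "0" * 11 + "1"  # T.4 end-of-line code: eleven zeros, then a one

# Six EOLs back to back; extra zeros before an EOL are tolerated as fill bits.
RTC_RE = re.compile(r"(?:0{11,}1){6}")


def find_rtc(bits):
    """Return the index where an RTC sequence starts, or -1 if there is none."""
    m = RTC_RE.search(bits)
    return m.start() if m else -1


if __name__ == "__main__":
    page_tail = "1101" + EOL * 6          # some image bits, then RTC
    print(find_rtc(page_tail))            # position of the first EOL of the RTC
    print(find_rtc("1101" + EOL * 5))     # only five EOLs: no RTC
```

A receiver has to spot this pattern inside the high-speed page data in order to know that the page is over, which matches the description of RTC handling running at least once for every received page.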

Deeptht69 (deeptht69) wrote :

I have reached one week since installing the "-O" packages. 77 faxes have been received during that time, with no segfaults. The OTHER problem I was having (the computer occasionally freezing until the mouse is moved) also seems to be fixed. I hope everything continues to work when I install Ubuntu 12.04 LTS next month!
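
How much weight a crash-free run carries depends on how often the crash used to occur. As a rough worked example, assume each fax fails independently with a fixed probability, and take the "once per fifty faxes" figure mentioned earlier in the thread as the rate. Both are assumptions for illustration, since the observed rate clearly varied between senders and systems.

```python
import math


def p_no_crash(n_faxes, crash_rate):
    """Probability of n_faxes clean receptions if the bug were still present."""
    return (1.0 - crash_rate) ** n_faxes


def faxes_needed(crash_rate, confidence):
    """Clean faxes needed before a still-present bug would have shown up
    with the given probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - crash_rate))


if __name__ == "__main__":
    rate = 1 / 50  # assumed rate, taken from the rough estimate in the thread
    print(round(p_no_crash(77, rate), 3))
    print(faxes_needed(rate, 0.95))
```

Under these assumptions, 77 clean faxes would still happen by luck roughly one time in five with the bug present, and about 149 clean faxes would be needed for 95% confidence. If most crashes came from a single problem sender, the number of faxes received from that sender is the more telling figure.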

Deeptht69 (deeptht69) wrote :

Simon:

I'm not a compiler expert, but it seems to me that "the offending code" being executed is only a problem when some other process has previously clobbered the "this" pointer. The offending code itself is probably fine, but somewhere, perhaps in some data buffer management process elsewhere, the address of the current "ModemServer" instance is getting replaced by "0". Once that happens, any attempt to access a member of ModemServer will cause a segfault. The problem isn't at line 950 (if (log) { ), it's somewhere before that. Line 950 is just the first time the program has a chance to fail because of the error.

Mark

Giuseppe Sacco (eppesuig) wrote :

Hi Mark,
this is exactly what the hylafax author wrote in http://bugs.hylafax.org/show_bug.cgi?id=941#c9

Bye,
Giuseppe

Deeptht69 (deeptht69) wrote :

I was just browsing the source code for ModemServer.c++, and I notice that ModemServer::vtraceStatus is called by ModemServer::traceStatus, which has, as one of its arguments, a variable length format string. It makes me wonder if the format string trailing NULL is what's clobbering the "this" pointer when it makes the call to vtraceStatus? Just a thought.

Mark

Simon G. Stikkelorum (iah) wrote :

Hi Mark, hi Giuseppe,

Yes, sure, we are all on the same page here. And Mark, in my mind there is no doubt that "if (log) {" is not causing the problem. And yes, the "this" pointer is damaged some other way. BUT if this were due to a compiler issue, I have trouble accepting that it happens only once every 50 faxes. So, what I am trying to say is: could it not be a combination of a programming error (buffer overrun), malformed input (damaged or ill-formatted input data) and the optimisation used by the compiler (a different -O option produces different code that may or may not fail)?

So, I think that by more analysis we may learn more. We may be able to corner the problem further if we could describe more exactly when this happens. Is it, for example, only in multi-page faxes? Or is it only after the last page?
That was also why I noted that this code is running often. Every fax received does an RTC at least once. So, I tried to ask, what is the extraordinary (or additional) circumstance on top of doing this RTC logging that causes the failure? Sorry for not having been clearer before.

At the same time I still do not understand where the "no such file or directory" comes from. What is that trying to tell us?

Regards, Simon.

Giuseppe Sacco (eppesuig) wrote :

Hi Simon,
I think the "No such file" is a message from gdb that alerts the user that gdb cannot find the source file.

Bye,
Giuseppe

Deeptht69 (deeptht69) wrote :

Simon:

Good points. As I said earlier, back when I was having segfaults, these patterns emerged: 1) segfaults were very common when receiving from one particular sender, 2) when a segfault occurred from the problem sender, the sender would automatically try to send the fax again. It would often segfault again, but on a different page, 3) I have had segfaults occur during receipt of the first page, or during receipt of later pages in a multi-page fax.

Note that I'm looking at a fairly small sample size: most of the segfaults I analyzed were on a single day. Being in a live office setting, if too many segfaults were occurring, my office staff would hook up the dedicated fax machine, and that was the end of data collection (as well as hylafax fax reception).

If a dedicated fax machine, like the Minolta 2900, tries to re-send a fax that failed because the receiver (hylafax) crashed during the transfer, would it not send *exactly* the same data as the first time? That would make me think that it is not related to the actual data being sent, but definitely has something to do with the control signals.

Is there any way to monitor the actual data transfer process, to see exactly what is going on just before a segfault?

Mark

yaztromo (tromo) wrote :

I'm going to tentatively say that the no optimisations package is bug free.

Two days running and no segfaults. Normally I would have one or two by now.

Deeptht69 (deeptht69) wrote :

yaztromo:

Glad to hear your setup is working. I am now at 8 days of continuous running without a segfault, using the "-O" compiled packages. I wonder why the "-O" compiled packages didn't work for you? Hardware difference? (My fax server uses an AMD Athlon CPU). Anyway, I hope this is the end of this bug, and that it doesn't re-surface when Ubuntu 12.04 LTS comes out next month!

Mark

yaztromo (tromo) wrote :

Hi Deep,

I've been wondering the same too, and the only difference I can find is that Athlon supports SSE and Duron doesn't. Does -O enable SSE support? I don't know anything about gcc but would've thought enabling SSE would need a separate flag.

A mystery!

Deeptht69 (deeptht69) wrote :

BAD NEWS!!!

After going for 8 days without a problem, I just had 3 segfaults back-to-back! All occurred before the originating phone number was logged, so I can't tell if they were all from the same sender. The last fax received before the segfaults was from the sender that was causing problems before, but the most recent fax completed OK.

I will, over the weekend, install the completely un-optimized packages, and see if they can run fault free.

Mark

Giuseppe Sacco (eppesuig) wrote :

Hi all,
I changed the source code in order to create a new logfile in /tmp. This file contains the "this" value in many lines of function "recvPageDLEData". Could you please test it and check the new logfile?

Packages are available at http://eppesuigoccas.homedns.org/~giuseppe/debian/hylafax/lucid-i386/bug%23600219-extralogging/ .

Please note that the package version number changes from 6.0.5-5 to 6.0.5-5.1.

Thanks,
Giuseppe

yaztromo (tromo) wrote :

OK Giuseppe, will do.

Are these packages unoptimised? No -O?

Do I need to run with gdb to get the log in tmp?

Welcome aboard the unoptimised train Mark! So far so good here.

Giuseppe Sacco (eppesuig) wrote :

Hi Yaztromo,
the latest package has standard optimization options (should be -O2). The source code has been changed in order to create a logfile for each recvPageDLEData() call. There is no need to run gdb, and I do not have any estimate on how many log files will be written.

Bye,
Giuseppe

Hi all,
this is a request for help for fixing a bug in g++ compiler shipped with
Lucid.

The problem is shown in bug https://bugs.launchpad.net/bugs/600219 and
it has also been briefly reported against g++ as launchpad bug #955013. A
quick summary of the g++ relevant part is also on hylafax bugzilla:
http://bugs.hylafax.org/show_bug.cgi?id=941#c9 .

What happens is that a pointer to an object instance, "this", is trashed
during a method execution.

Is there anyone who can help fix this?

Thank you very much,
Giuseppe

yaztromo (tromo) wrote :

Well, I already got a segfault with the new packages. However, it doesn't look like the log recorded the final value of 'this' before the crash.

Simon G. Stikkelorum (iah) wrote :

Giuseppe,

Did you flush(fd) the logging every time you log something? Writing to a file is heavily asynchronous.

Simon

Giuseppe Sacco (eppesuig) wrote :

Hi Simon,
I hope it is line-buffered output. I would like to be sure that the logs hit the disk before the process crashes. Is it creating too much disk I/O? I thought about putting it in a tmpfs, but that would have involved you setting up a file system for that.
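
For the flushing question, this is a minimal sketch of what "flush after every log line" could look like with stdio. It is not the code in the extra-logging package: logPointer and its message format are hypothetical. The point is that fflush hands the buffered text to the kernel straight away, so a later segfault in the process should not lose it (fsync would only matter if the whole machine went down).

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: write one trace line and push it out of the stdio
// buffer immediately, so it is not lost if the process crashes right after.
bool logPointer(std::FILE* out, const char* where, const void* value) {
    assert(out != nullptr);
    if (std::fprintf(out, "%s: this=%p\n", where,
                     const_cast<void*>(value)) < 0) {
        return false;  // formatting or write error
    }
    // fflush moves the buffered bytes from the process into the kernel.
    return std::fflush(out) == 0;
}
```

A call such as logPointer(fp, "recvPageDLEData", this) at several points in the function would then leave a line per call in the file even if the next statement faults. Calling setvbuf(fp, NULL, _IOLBF, 0) right after opening the file is an alternative that makes stdio flush at each newline.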

Bye,
Giuseppe

Deeptht69 (deeptht69) wrote :

Any new developments with this bug? I have not (yet) installed the "enhanced logging" packages, as it seemed from Yaztromo's message that they may not be providing the required information. As before, I am running the non-optimized packages, and have not yet had another segfault. If it will help the effort, I could install the enhanced logging packages, which will presumably be more likely to segfault, and send along whatever logs they generate.

Mark

yaztromo (tromo) wrote :

I left the logging going for a few more days. But even though I had several segfaults, the value for 'this' never showed anything different. So I then reverted to the completely unoptimised package, and I have not had any segfaults since.

Can we cautiously say the fix has been found, then? If the answer is 'yes', what happens next?

Also, Giuseppe, if you discover why the logging doesn't catch the moment 'this' gets trashed, I will install any new attempt at logging what's happening.

Deeptht69 (deeptht69) wrote :

Yaztromo:

It looks as though nobody has seen any segfaults with the completely un-optimized packages? I am coming up on a week of running un-optimized packages without any segfaults, though it took 8 days to catch one when I was running the "-O" packages. I will have to run unoptimized right up until I upgrade to 12.04 LTS in another month, before considering it error-free. Then, we can start all over again!

Mark

Giuseppe Sacco (eppesuig) wrote :

Hi,
I am still waiting for a reply on the Ubuntu developer list for the g++ bug. Today I sent a second message to the list. If anyone replies, I'll update this bug report.

Anyway, I do not expect the problem to be present in any other ubuntu version since the compiler has been fixed in later versions.

Bye,
Giuseppe

MarkA (deeptht69-4) wrote :

Giuseppe:

Do we know that this bug is no longer present in the g++ compiler that will ship with 12.04? If that's the case, I'll just wait it out, running the un-optimized packages, until 12.04 is released. There's no point in chasing a bug that's already been squashed!

Mark

Giuseppe Sacco (eppesuig) wrote :

Hi Mark,
this is a guess: there are no later Ubuntu releases reporting this error. And no Debian user reported such a problem.

But 10.04 is LTS, so it should be supported for a lot of years. That's why I really do not understand why ubuntu developers ignore this problem. I know Hylafax is not included in this support, but g++ is.

Bye,
Giuseppe

Deeptht69 (deeptht69) wrote :

Giuseppe:

I see your point. My hylafax server is running on a computer dedicated to that function. I am looking forward to upgrading to 12.04 LTS specifically because of this bug.

Mark

Matthias Klose (doko) wrote :

I would suggest fixing this by building the problematic file with -O0. Can you provide such a patch for hylafax?

Matthias Klose (doko) wrote :

see as well bug 955013

Changed in hylafax:
status: Confirmed → Fix Released

Giuseppe Sacco (eppesuig) wrote :

Hi Matthias,
I can certainly provide a patch. Would you sponsor the upload?

Thanks,
Giuseppe

yaztromo (tromo) wrote :

Could someone explain what -O0 does? From the man pages:

"Reduce compilation time and make debugging produce the expected results. This is the default. "

So is that the same as not using -O at all? Or is it the same as just -O, or something entirely different?

Giuseppe Sacco (eppesuig) wrote :

Hi,
I believe that -O0 means "no optimization at all".

As you may see at the very beginning of http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, "Without any optimization option, the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results". So "-O0" is equivalent to "without optimization option".
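
One way to check this yourself, as a small sketch: GCC predefines the macro __OPTIMIZE__ in optimizing compilations, so a tiny program can report how it was built. The function name optimizationMode is made up for this example.

```cpp
#include <cassert>
#include <string>

// Reports whether this translation unit was compiled with optimization.
// GCC defines __OPTIMIZE__ for optimizing builds (-O, -O1, -O2, ...);
// it is left undefined with -O0 and when no -O option is given at all.
std::string optimizationMode() {
#ifdef __OPTIMIZE__
    return "optimized";
#else
    return "unoptimized";
#endif
}
```

Built with g++ and no -O option, or with -O0, this should return "unoptimized" in both cases; with -O or -O2 it should return "optimized".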

Bye,
Giuseppe
