Wireless drops outside world, but seems alive internally

Bug #1321078 reported by Loren Amelang
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
wicd
New
Undecided
Unassigned

Bug Description

For the first couple of months after installation, wicd wrote nice details to /var/log/wicd/wicd.log. Then it just stopped... The network still (almost always) works, and the user interface seems normal, but where are the log messages? I'm certainly not aware of changing anything.

I ask because twice now, while running unattended updating a cloud server with occasional sensor data, the system loses access to the server, and when I try to SSH or VNC in to investigate, I'm ignored. But sniffing the network shows it is successfully requesting and receiving NTP updates... I doubt this has anything to do with wicd, but I'm digging for any log info I can get!

In the old logs, it looks like wicd does its own DHCP, or at least logged it in detail. I do notice in my router logs that there seem to be gaps in the DHCP renew sequence, and that recovery from the network problem involves a new DHCPDISCOVER instead of just the REQUEST/ACK.

Is there some way to probe into wicd historical events from the command line or python?

More Information:

I've captured info on six of these network dropouts now. A few details vary (as noted below), but the pattern is pretty clear. The Wi-Fi is convenient, but not essential to my project, so I've plugged in the wired ethernet and moved on. Still, it would be nice to have wireless under control, and maybe someone else could be helped by my sharing this information.

This is a BeagleBone Black with Ubuntu 12.04 and wicd 1.7.2.3 (bzr-r761).
Wi-Fi is the MediaTek (Ralink) UWN100 Wi-Fi adapter with MT7601 chip and RT2870STA.

Wireless connection drops for no obvious reason, typically overnight.
It is being used at least every five minutes to update a remote cloud site.
Router logs show it has plenty of DHCP lease left.

Netstat sometimes shows a remnant of the proper IP addresses:
udp 0 0 10.1.1.121:123 0.0.0.0:* <-- missing from some earlier logs
udp6 0 0 fe80::20c:43ff:fe00:123 :::* <-- missing from some earlier logs
But all other entries have changed to 0.0.0.0.

Most ps displays show all of the wicd python modules running, but one event is missing this one:
ubuntu 21845 1 0 May15 ? 00:05:05 /usr/bin/python -O /usr/share/wicd/gtk/wicd-client.py

Ifconfig and iwconfig always seem normal, and report increasing byte counts even though nothing is getting through to the outside world.

"iwlist ra0 s" shows the other local Wi-Fi AP, but not the one this board was previously connected to.
"wicd-cli --wireless -l" and wicd-curses always show both Wi-Fi APs

"iwlist ra0 ap" shows only the "other" AP, and gives it a strangely bogus MAC address, that changes somewhat between the different failures:
---
$ iwlist ra0 ap
ra0 Peers/Access-Points in range:
    48:DF:A8:5E:07:DE : Quality=63/100 Signal level=-65 dBm Noise level=-92 dBm
---
--> Similar strange MAC appears in next day's report as well, but nowhere else in any document relating to my networks
    2A:DE:A8:5E:54:DE : Quality=57/100 Signal level=-67 dBm Noise level=-92 dBm
--> previous day:
    50:DF:A8:7E:4A:DF : Quality=73/100 Signal level=-61 dBm Noise level=-92 dBm

In wicd-curses, the 'R' refresh command produces the "Connected" message, but no connection.
The 'C' connect command restores the network connection properly.

Next level of detail...

stopped updating emoncms
5/17/2014 8:55:00 PM GMT-7

Router connections:
eth1 00:21:6A:74:C4:A4
[oui] [static] [wfilter] 10.1.1.108 J3400 -83 dBm 16 0 days, 22:32:16
br0 00:0C:43:00:7D:7F
[oui] [static] 10.1.1.121 ubuntu-armhf 0 days, 12:56:15

Router log:
May 16 23:24:09 Linksys3929 daemon.info dnsmasq-dhcp[99]: DHCPREQUEST(br0) 10.1.1.121 00:0c:43:00:7d:7f
May 16 23:24:09 Linksys3929 daemon.info dnsmasq-dhcp[99]: DHCPACK(br0) 10.1.1.121 00:0c:43:00:7d:7f ubuntu-armhf

May 17 08:50:15 Linksys3929 daemon.info dnsmasq-dhcp[99]: DHCPREQUEST(br0) 10.1.1.121 00:0c:43:00:7d:7f
May 17 08:50:15 Linksys3929 daemon.info dnsmasq-dhcp[99]: DHCPACK(br0) 10.1.1.121 00:0c:43:00:7d:7f ubuntu-armhf

May 17 11:30:06 Linksys3929 daemon.info dnsmasq-dhcp[99]: DHCPDISCOVER(br0) 10.1.1.121 00:0c:43:00:7d:7f
May 17 11:30:06 Linksys3929 daemon.info dnsmasq-dhcp[99]: DHCPOFFER(br0) 10.1.1.121 00:0c:43:00:7d:7f
May 17 11:30:06 Linksys3929 daemon.info dnsmasq-dhcp[99]: DHCPREQUEST(br0) 10.1.1.121 00:0c:43:00:7d:7f
May 17 11:30:06 Linksys3929 daemon.info dnsmasq-dhcp[99]: DHCPACK(br0) 10.1.1.121 00:0c:43:00:7d:7f ubuntu-armhf
[nothing beyond there]

Last login: Sat May 17 11:31:03 PDT 2014 from j3400.crr on pts/0
Welcome to Ubuntu 12.04.4 LTS (GNU/Linux 3.8.13-bone30 armv7l)

 * Documentation: https://help.ubuntu.com/
ubuntu@ubuntu-armhf:~$ who
ubuntu ttyO0 2014-05-17 22:40

--> in earlier tests, (BBB Ubuntu 12_04 uSD 140126 CanopyOut, wicd-curses log.txt),
    when "Link is up", ==>rt_ioctl_giwmode(mode=2)

--> Just realized that 140126 event was similar - the upstream Canopy node between me and the internet dropped offline, and the BBB Wi-Fi was unable to recover on its own when connectivity returned. In that case, "sudo ifconfig ra0 down" and then back up worked. In the current cases, "sudo ifconfig ra0 down" does not appear to take the interface down (according to ifconfig), but does fix the connection.

ubuntu@ubuntu-armhf:~$ sudo tail /var/log/syslog
May 17 22:41:38 ubuntu-armhf kernel: [40442.089562] ==>rt_ioctl_giwmode(mode=0)
May 17 22:41:38 ubuntu-armhf kernel: [40442.089618] ===>rt_ioctl_giwrange
May 17 22:41:38 ubuntu-armhf kernel: [40442.089806] rt28xx_get_wireless_stats --->
May 17 22:41:38 ubuntu-armhf kernel: [40442.089843] <--- rt28xx_get_wireless_stats
May 17 22:41:43 ubuntu-armhf kernel: [40447.085611] ==>rt_ioctl_giwfreq 1
May 17 22:41:43 ubuntu-armhf kernel: [40447.085797] ===>rt_ioctl_giwencode 0
May 17 22:41:43 ubuntu-armhf kernel: [40447.085874] ==>rt_ioctl_giwmode(mode=0)
May 17 22:41:43 ubuntu-armhf kernel: [40447.085944] ===>rt_ioctl_giwrange
May 17 22:41:43 ubuntu-armhf kernel: [40447.086138] rt28xx_get_wireless_stats --->
May 17 22:41:43 ubuntu-armhf kernel: [40447.086175] <--- rt28xx_get_wireless_stats
ubuntu@ubuntu-armhf:~$

--> Unless noted, results are the same in all events:

ubuntu@ubuntu-armhf:~$ netstat -lntu
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:1880 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:2120 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:2121 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:3306 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:4304 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp6 0 0 :::80 :::* LISTEN
tcp6 0 0 :::22 :::* LISTEN
udp 0 0 10.1.1.121:123 0.0.0.0:* <-- missing from some earlier logs
udp 0 0 127.0.0.1:123 0.0.0.0:*
udp 0 0 0.0.0.0:123 0.0.0.0:*
udp 0 0 0.0.0.0:68 0.0.0.0:*
udp 0 0 0.0.0.0:68 0.0.0.0:*
udp6 0 0 fe80::20c:43ff:fe00:123 :::* <-- missing from some earlier logs
udp6 0 0 ::1:123 :::*
udp6 0 0 :::123 :::*
ubuntu@ubuntu-armhf:~$ netstat -punt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 1 0 127.0.0.1:50255 127.0.0.1:4304 CLOSE_WAIT -
tcp 1 0 127.0.0.1:50263 127.0.0.1:4304 CLOSE_WAIT -
ubuntu@ubuntu-armhf:~$

--> During first event, when it seemed Node-RED really had crashed:
tcp 358 0 10.1.1.121:1880 10.1.1.108:14505 ESTABLISHED -
--> Killed and restarted it (before fixing network), still not receiving:
tcp 440 0 10.1.1.121:1880 10.1.1.108:14502 CLOSE_WAIT -

ubuntu@ubuntu-armhf:~$ ps -ef | grep .py
root 649 1 1 11:29 ? 00:13:03 /usr/bin/python -O /usr/share/wicd/daemon/wicd-daemon.py
root 731 649 0 11:29 ? 00:04:49 /usr/bin/python -O /usr/share/wicd/daemon/monitor.py
ubuntu 26159 25997 0 22:44 ttyO0 00:00:00 grep --color=auto .py
ubuntu@ubuntu-armhf:~$

--> Included in previous logs, missing here:
ubuntu 21845 1 0 May15 ? 00:05:05 /usr/bin/python -O /usr/share/wicd/gtk/wicd-client.py

ubuntu@ubuntu-armhf:~$ ifconfig ra0
ra0 Link encap:Ethernet HWaddr 00:0c:43:00:7d:7f
          inet addr:10.1.1.121 Bcast:10.1.1.255 Mask:255.255.255.0
          inet6 addr: fe80::20c:43ff:fe00:7d7f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:411667 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5331 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:39706150 (39.7 MB) TX bytes:373956 (373.9 KB)
--> Even when network seems unresponsive from outside, the byte counts here increase!

ubuntu@ubuntu-armhf:~$ iwconfig ra0
ra0 Ralink STA ESSID:"101Netlink2" Nickname:"MT7601STA"
          Mode:Auto Frequency=2.412 GHz Access Point: 00:14:BF:05:4A:4D
          Bit Rate=1 Mb/s <-- goes to 54 when working
          RTS thr:off Fragment thr:off
          Link Quality=90/100 Signal level:-78 dBm Noise level:-78 dBm
          Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
          Tx excessive retries:0 Invalid misc:0 Missed beacon:0

ubuntu@ubuntu-armhf:~$ iwlist ra0 s
ra0 Scan completed :
          Cell 01 - Address: 00:14:BF:05:4A:50
                    Protocol:11b/g BW20
                    ESSID:"Talisman"
                    Mode:Managed
                    Frequency:2.437 GHz (Channel 6)
                    Quality=63/100 Signal level=-65 dBm Noise level=-92 dBm
                    Encryption key:off
                    Bit Rates:54 Mb/s

ubuntu@ubuntu-armhf:~$

ubuntu@ubuntu-armhf:~$ iwlist ra0 ap
ra0 Peers/Access-Points in range:
    48:DF:A8:5E:07:DE : Quality=63/100 Signal level=-65 dBm Noise level=-92 dBm

--> ??? Similar strange MAC appears in next day's report as well, but nowhere else in any document relating to my networks ???
    2A:DE:A8:5E:54:DE : Quality=57/100 Signal level=-67 dBm Noise level=-92 dBm
--> previous day:
    50:DF:A8:7E:4A:DF : Quality=73/100 Signal level=-61 dBm Noise level=-92 dBm

ubuntu@ubuntu-armhf:~$ wicd-cli --wireless -l
# BSSID Channel ESSID
0 00:14:BF:05:4A:50 6 Talisman
1 00:14:BF:05:4A:4D 1 101Netlink2
ubuntu@ubuntu-armhf:~$ wicd-cli --wireless -n 1 -d
Essid: 101Netlink2
Bssid: 00:14:BF:05:4A:4D
Encryption: Off
Quality: 31
Mode: Managed
Channel: 1
Bit Rates: 54 Mb/s
ubuntu@ubuntu-armhf:~$ wicd-curses

                                                           Wicd Curses Interface
Wired Networks
   0 W-Static-10.1.1.4 vvv
Wireless Networks
C STR ESSID ENCRYPT BSSID MODE CHNL
  63% Talisman Unsecured 00:14:BF:05:4A:50 Managed 6
  31% 101Netlink2 Unsecured 00:14:BF:05:4A:4D Managed 1

H:Help ->:ConfiK:RfKillC:ConnecD:DisconR:RefresP:Prefs I:HiddenA:About Q:Quit
           g t n h
Connected to 101Netlink2" Nickname:"MT7601STA at 61% (IP: 10.1.1.121)

--> 'R' refresh command produces same "Connected" message, but no connection.
--> 'C' connect command works.

First event top display:
===
top - 20:35:46 up 4 days, 4:39, 1 user, load average: 1.02, 1.07, 1.12
Tasks: 112 total, 2 running, 110 sleeping, 0 stopped, 0 zombie
Cpu(s): 96.7%us, 3.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 507444k total, 494500k used, 12944k free, 37768k buffers
Swap: 0k total, 0k used, 0k free, 159276k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15923 ubuntu 20 0 251m 183m 4552 R 93.7 37.0 1264:09 node
  656 root 20 0 23036 6796 1564 S 1.6 1.3 109:21.63 wicd <-- seems a lot of time?
 1413 root 20 0 0 0 0 S 0.7 0.0 6:57.74 w1_bus_master1
 6403 ubuntu 20 0 2152 1020 760 R 0.7 0.2 0:00.17 top
  317 syslog 20 0 29452 1556 784 S 0.3 0.3 35:14.34 rsyslogd
  318 messageb 20 0 2600 1056 616 S 0.3 0.2 46:00.28 dbus-daemon
  491 mysql 20 0 315m 29m 1388 S 0.3 5.9 25:55.39 mysqld
  738 root 20 0 12664 6168 2216 S 0.3 1.2 42:19.33 wicd-monitor
===

ubuntu@ubuntu-armhf:~$ ps aux <-- later, reconnected to network?
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
syslog 317 0.5 0.3 29612 1612 ? Sl May08 35:40 rsyslogd -c5
103 318 0.7 0.2 2600 1056 ? Ss May08 46:25 dbus-daemon --s
...
root 656 1.7 1.3 23168 6968 ? S May08 110:20 /usr/bin/python
root 738 0.6 1.2 12664 6188 ? S May08 42:41 /usr/bin/python
...
ubuntu 5937 0.0 0.1 2872 596 ? S May08 0:00 /usr/bin/dbus-l
ubuntu 5938 0.0 0.1 2748 844 ? Ss May08 0:00 //bin/dbus-daem
...
ubuntu 5954 0.0 2.2 25572 11208 ? S May08 2:18 /usr/bin/python
...
ubuntu 15923 27.5 36.9 257728 187500 pts/1 Tl May09 1350:40 node --max-old <-- looks like Node really was maxed out...
...
ubuntu@ubuntu-armhf:~$

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Related questions

Remote bug watches

Bug watches keep track of this bug in other bug trackers.