juju bootstrap fails to successfully configure the bridge juju-br0 when deploying with wily 4.2 kernel

Bug #1496972 reported by Sean Feole
This bug affects 1 person
Affects      Status         Importance   Assigned to        Milestone
juju-core    Fix Released   High         Andrew McDermott
  1.25       Fix Released   High         Andrew McDermott
Ubuntu       Invalid        High         Joseph Salisbury
  Wily       Invalid        High         Joseph Salisbury

Bug Description

MAAS: MAAS Version 1.8.0+bzr4001-0ubuntu2 (yarmouth.2)
Juju Version: 1.24.4-0ubuntu1~14.04.1~juju1
User Space: Trusty
HW: In-development ARM64 platform (Host) and HP Moonshot m400 (McDivitt) (Host1), also ARM64

Problem Description:

NOTE: The problem described below is also reproducible on a shipping ARM64 system (HP Moonshot McDivitt - m400) with Trusty userspace + 4.2 kernel from Wily.
When juju bootstrap is issued against the state server on the currently in-development ARM64 hardware platform, it creates a bridge device bound to the PXE NIC (eth1) as expected. eth1 should then release its IP address, and the bridge should take over that address and route all traffic. This occurs reliably when using a trusty cloud image and the appropriate trusty kernel.

In this case, we are enabling some hardware, and I need to specifically use a hacked cloud root-tgz (modified to add the wily kernel (4.2) to a trusty userspace). I have done all that correctly and am able to land the image onto its assigned hardware using MAAS 1.8.

$ uname -a
Linux ms10-39-host 4.2.0-10-generic #11-Ubuntu SMP Sun Sep 13 11:26:21 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

Now when I use juju to bootstrap the image onto the assigned hardware, there appears to be a problem with the juju bridge and the default PXE NIC: the assigned interface does not let go of its IPv4 address and hand it over to the bridge. It is almost as if "sudo ifdown eth0" never runs successfully.

 We constantly see the message "received packet on eth1 with own address as source address" in syslog

$ ifconfig
eth0 Link encap:Ethernet HWaddr fc:15:b4:21:00:c2
          inet addr:10.229.65.139 Bcast:10.229.255.255 Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:2210 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1627 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:208450 (208.4 KB) TX bytes:297812 (297.8 KB)

juju-br0 Link encap:Ethernet HWaddr fc:15:b4:21:00:c2
          inet addr:10.229.65.139 Bcast:10.229.255.255 Mask:255.255.0.0
          inet6 addr: fe80::fe15:b4ff:fe21:c2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:2212 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1478 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:177722 (177.7 KB) TX bytes:288314 (288.3 KB)
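
The symptom above (the port and the bridge both holding 10.229.65.139) can be checked mechanically. The following is only a minimal Python sketch, assuming the trusty net-tools ifconfig format shown above; the function names are made up for illustration, and the sample text is an abbreviated copy of that output.

```python
import re
from collections import defaultdict

# Abbreviated copy of the ifconfig output above (net-tools format).
IFCONFIG_OUTPUT = """\
eth0 Link encap:Ethernet HWaddr fc:15:b4:21:00:c2
          inet addr:10.229.65.139 Bcast:10.229.255.255 Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

juju-br0 Link encap:Ethernet HWaddr fc:15:b4:21:00:c2
          inet addr:10.229.65.139 Bcast:10.229.255.255 Mask:255.255.0.0
          inet6 addr: fe80::fe15:b4ff:fe21:c2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
"""


def ipv4_by_interface(text):
    """Map interface name -> IPv4 address for each stanza that has one."""
    addresses = {}
    current = None
    for line in text.splitlines():
        if line and not line[0].isspace():
            # A stanza starts with the interface name in column 0.
            current = line.split()[0]
        match = re.search(r"inet addr:(\S+)", line)
        if match and current:
            addresses[current] = match.group(1)
    return addresses


def shared_addresses(addresses):
    """Return {ip: [interfaces]} for every IP held by more than one interface."""
    holders = defaultdict(list)
    for name, ip in addresses.items():
        holders[ip].append(name)
    return {ip: sorted(names) for ip, names in holders.items() if len(names) > 1}


if __name__ == "__main__":
    shared = shared_addresses(ipv4_by_interface(IFCONFIG_OUTPUT))
    for ip, names in shared.items():
        print(f"{ip} is configured on: {', '.join(names)}")
```

On a healthy bootstrap only juju-br0 would be expected to carry the address, so a non-empty result here matches the failure being described.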

I also noticed that /etc/network/interfaces was written to and modified:

$ cat /etc/network/interfaces
auto lo

iface eth1 inet dhcp

# Primary interface (defining the default route)
iface eth0 inet manual

# Bridge to use for LXC/KVM containers
auto juju-br0
iface juju-br0 inet dhcp
    bridge_ports eth0
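
To compare what this file asks for with what the host is actually doing, a small parser helps. This is a hedged sketch only: it handles just the `auto`, `iface` and `bridge_ports` keywords that appear in the file above, not the full ifupdown syntax, and `parse_interfaces` / `misconfigured_ports` are hypothetical helper names.

```python
# Copy of the /etc/network/interfaces content shown above.
INTERFACES = """\
auto lo

iface eth1 inet dhcp

# Primary interface (defining the default route)
iface eth0 inet manual

# Bridge to use for LXC/KVM containers
auto juju-br0
iface juju-br0 inet dhcp
    bridge_ports eth0
"""


def parse_interfaces(text):
    """Return (auto, methods, bridges) from an ifupdown-style file.

    auto:    set of interface names brought up at boot
    methods: interface name -> method (dhcp, manual, ...)
    bridges: bridge name -> list of port names
    """
    auto, methods, bridges = set(), {}, {}
    current = None
    for raw in text.splitlines():
        words = raw.split("#", 1)[0].split()  # drop comments
        if not words:
            continue
        if words[0] == "auto":
            auto.update(words[1:])
        elif words[0] == "iface" and len(words) >= 4:
            current = words[1]
            methods[current] = words[3]
        elif words[0] == "bridge_ports" and current:
            bridges[current] = words[1:]
    return auto, methods, bridges


def misconfigured_ports(methods, bridges):
    """List (bridge, port, method) where a bridge port is not 'manual'."""
    return [
        (bridge, port, methods.get(port))
        for bridge, ports in bridges.items()
        for port in ports
        if methods.get(port) != "manual"
    ]


if __name__ == "__main__":
    auto, methods, bridges = parse_interfaces(INTERFACES)
    print("auto:", sorted(auto))
    print("bridges:", bridges)
    print("ports that would take their own address:", misconfigured_ports(methods, bridges))
```

For the file above the last list is empty: the bridge port is `manual` and only juju-br0 is `dhcp`. So the file itself looks consistent, and the address still sitting on the port points at runtime state rather than at the rewritten configuration.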

--------------------------------------------------------------------------------------------------------

Here is the syslog output from the 2 different stateserver attempts. The first set of logs from 'host' is running a Trusty userspace with wily kernel. Which displays the failure.

The 2nd snippet of syslog 'host1' displays a Trusty userspace and Trusty Kernel, which eventually completes the bootstrap as expected.

Aug 24 18:15:14 host acpid: 1 rule loaded
Aug 24 18:15:14 host acpid: waiting for events: event logging is off
Aug 24 18:15:15 host kernel: [ 46.174096] init: plymouth-upstart-bridge main process ended, respawning
Aug 24 18:15:17 host ntpdate[1216]: adjust time server 91.189.89.199 offset 0.000248 sec
Aug 24 18:15:36 host dhclient: receive_packet failed on eth1: Network is down
Aug 24 18:15:36 host kernel: [ 66.764788] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
Aug 24 18:15:36 host kernel: [ 66.772004] device eth1 entered promiscuous mode
Aug 24 18:15:37 host kernel: [ 68.144483] juju-br0: port 1(eth1) entered forwarding state
Aug 24 18:15:37 host kernel: [ 68.144504] juju-br0: port 1(eth1) entered forwarding state
Aug 24 18:15:37 host kernel: [ 68.160693] juju-br0: received packet on eth1 with own address as source address
Aug 24 18:15:37 host dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 3 (xid=0x40e77812)
Aug 24 18:15:37 host kernel: [ 68.189099] juju-br0: received packet on eth1 with own address as source address
Aug 24 18:15:37 host dhclient: DHCPREQUEST of 10.110.24.114 on eth1 to 255.255.255.255 port 67 (xid=0x1278e740)
Aug 24 18:15:37 host dhclient: DHCPOFFER of 10.110.24.114 from 10.110.24.210
Aug 24 18:15:37 host kernel: [ 68.189891] juju-br0: received packet on eth1 with own address as source address
Aug 24 18:15:37 host dhclient: DHCPACK of 10.110.24.114 from 10.110.24.210
Aug 24 18:15:37 host dhclient: bound to 10.110.24.114 -- renewal in 298 seconds.
Aug 24 18:15:37 host kernel: [ 68.390614] thunder-nicvf 0002:01:00.2 eth1: eth1: Link is Up 10000 Mbps Full duplex
Aug 24 18:15:37 host dhclient: Internet Systems Consortium DHCP Client 4.2.4
Aug 24 18:15:37 host dhclient: Copyright 2004-2012 Internet Systems Consortium.
Aug 24 18:15:37 host dhclient: All rights reserved.

-----------------------------------------------

Below is the output from a "SUCCESSFUL" bootstrap using a trusty user space and trusty kernel:

Aug 25 19:02:59 ms10-33-host1 acpid: 1 rule loaded
Aug 25 19:02:59 ms10-33-host1 acpid: waiting for events: event logging is off
Aug 25 19:02:59 ms10-33-host1 cron[1298]: (CRON) INFO (Running @reboot jobs)
Aug 25 19:02:59 ms10-33-host1 iscsid: iSCSI daemon with pid=1196 started!
Aug 25 19:03:00 ms10-33-host1 kernel: [ 34.028770] init: plymouth-upstart-bridge main process ended, respawning
Aug 25 19:03:07 ms10-33-host1 ntpdate[1392]: adjust time server 91.189.89.199 offset 0.000016 sec
Aug 25 19:03:07 ms10-33-host1 kernel: [ 41.596548] mlx4_en: eth0: Close port called
Aug 25 19:03:09 ms10-33-host1 dhclient: receive_packet failed on eth0: Network is down
Aug 25 19:03:09 ms10-33-host1 kernel: [ 43.114195] mlx4_en: eth0: Link Down
Aug 25 19:03:09 ms10-33-host1 kernel: [ 43.135229] Bridge firewalling registered
Aug 25 19:03:09 ms10-33-host1 kernel: [ 43.139025] device eth0 entered promiscuous mode
Aug 25 19:03:09 ms10-33-host1 kernel: [ 43.140380] mlx4_en: eth0: frag:0 - size:1526 prefix:0 align:2 stride:1536
Aug 25 19:03:09 ms10-33-host1 kernel: [ 43.289820] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Aug 25 19:03:09 ms10-33-host1 kernel: [ 43.291284] IPv6: ADDRCONF(NETDEV_UP): juju-br0: link is not ready
Aug 25 19:03:10 ms10-33-host1 kernel: [ 44.487804] mlx4_en: eth0: Link Up
Aug 25 19:03:10 ms10-33-host1 kernel: [ 44.487887] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Aug 25 19:03:10 ms10-33-host1 kernel: [ 44.488297] juju-br0: port 1(eth0) entered forwarding state
Aug 25 19:03:10 ms10-33-host1 kernel: [ 44.488305] juju-br0: port 1(eth0) entered forwarding state
Aug 25 19:03:10 ms10-33-host1 kernel: [ 44.488321] IPv6: ADDRCONF(NETDEV_CHANGE): juju-br0: link becomes ready
Aug 25 19:03:10 ms10-33-host1 dhclient: Internet Systems Consortium DHCP Client 4.2.4
Aug 25 19:03:10 ms10-33-host1 dhclient: Copyright 2004-2012 Internet Systems Consortium.
Aug 25 19:03:10 ms10-33-host1 dhclient: All rights reserved.
Aug 25 19:03:10 ms10-33-host1 dhclient: For info, please visit https://www.isc.org/software/dhcp/
Aug 25 19:03:10 ms10-33-host1 dhclient:
Aug 25 19:03:11 ms10-33-host1 dhclient: Listening on LPF/juju-br0/14:58:d0:58:b3:92
Aug 25 19:03:11 ms10-33-host1 dhclient: Sending on LPF/juju-br0/14:58:d0:58:b3:92

-----------------------------------------------------------------------------------------------------------------------------------

After this problem occurs, there are two workarounds:
1.) Restart the host. The system then boots with its correct network config as outlined in /etc/network/interfaces, which allows outbound network traffic.
2.) Manually ifdown / ifup eth1. This is easier than workaround 1.

After restarting the host at least once, the route tables appear to fix themselves and I can ssh into the host from a system outside of the 10.229/16 net (if vpn allows)

I can provide hardware access for anyone who requests it.

Sean Feole (sfeole)
Changed in juju-core:
status: New → Confirmed
Andrew McDermott (frobware) wrote :

Is this issue related to arm64 only; can juju bootstrap on x86 wily today?

Raghuram Kota (rkota) wrote :

Hi Andrew,

We haven't tested whether this occurs on x86 wily.
Sean's team works on ARM64 platform enablement and this bug is reproducible both on a shipping ARM64 platform ( HP moonshot McDivitt m400) and a currently in development ARM64 platform.

Raghu

Andrew McDermott (frobware) wrote :

Does it bootstrap OK if you force the machine to have only eth0 (if that's at all possible)?

Curtis Hovey (sinzui)
Changed in juju-core:
status: Confirmed → Triaged
importance: Undecided → High
milestone: none → 1.25-beta2
Sean Feole (sfeole) wrote :

Hey Andrew, I did try disconnecting the 2nd NIC from the MAAS controller. That did not appear to have any effect on the outcome.

Raghuram Kota (rkota) wrote :

Hi Andrew,

If you need direct access to hw for other experiments, please let us know and we'd be happy to provide that.

Raghu

Curtis Hovey (sinzui)
tags: added: network
Andrew McDermott (frobware) wrote :

Please could you provide access to your system and I will start to investigate. Instructions via email would be fine if not appropriate to share here.

Andrew Cloke (andrew-cloke) wrote :

Andrew, Sean has raised rt#84995 to give you access to our lab in 1SS with the McDivitt hardware you'll need. I think IS has already completed that ticket, so if you just ping Sean Feole on IRC he should be able to give you a MAAS account to access a node.
Thanks, Andy.

Changed in juju-core:
assignee: nobody → Andrew McDermott (frobware)
Dimiter Naydenov (dimitern) wrote :

It seems Juju discovered the primary NIC of the machine should be eth0, according to the default route early at first boot. Can we see the contents of /var/log/cloud-init-output.log from the machine after the issue happened?

Sean Feole (sfeole) wrote :

Here is the output from various networking commands on a Cavium host deployed with juju; note the juju-br0 issues as discussed with Andrew and Dimiter.

ubuntu@cvm6:~$ netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.110.24.1 0.0.0.0 UG 0 0 0 eth1
10.110.24.0 0.0.0.0 255.255.248.0 U 0 0 0 juju-br0
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$ cat /etc/network/interfaces
auto lo

iface eth0 inet dhcp

iface eth2 inet dhcp

# Primary interface (defining the default route)
iface eth1 inet manual

# Bridge to use for LXC/KVM containers
auto juju-br0
iface juju-br0 inet dhcp
    bridge_ports eth1
ubuntu@cvm6:~$ ip route
default via 10.110.24.1 dev eth1
10.110.24.0/21 dev juju-br0 proto kernel scope link src 10.110.24.116
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$ sudo less /var/log/syslog
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$
ubuntu@cvm6:~$ ifconfig
eth1 Link encap:Ethernet HWaddr 48:ce:e4:26:d0:c0
          inet addr:10.110.24.116 Bcast:10.110.255.255 Mask:255.255.248.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:6183 errors:0 dropped:48 overruns:0 frame:0
          TX packets:4767 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:619071 (619.0 KB) TX bytes:732605 (732.6 KB)

juju-br0 Link encap:Ethernet HWaddr 48:ce:e4:26:d0:c0
          inet addr:10.110.24.116 Bcast:10.110.255.255 Mask:255.255.248.0
          inet6 addr: fe80::4ace:e4ff:fe26:d0c0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:5660 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3605 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:399291 (399.2 KB) TX bytes:628905 (628.9 KB)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:482 errors:0 dropped:0 overruns:0 frame:0
          TX packets:482 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:42532 (42.5 KB) TX bytes:42532 (42.5 KB)

ubuntu@cvm6:~$ sudo dhclient -v ^C
ubuntu@cvm6:~$ sudo brtctl show
sudo: brtctl: command not found
ubuntu@cvm6:~$ sudo brctl show
bridge name bridge id STP enabled interfaces
juju-br0 8000.48cee426d0c0 no eth1
ubuntu@cvm6:~$ sudo brctl show macs
bridge name bridge id STP enabled interfaces
macs can't get info No such device
ubuntu@cvm6:~$ sudo brctl showmacs
Incorrect number of arguments for command
Usage: brctl showmacs <bridge> show a list of mac addrs
ubuntu@cvm6:~$ sudo brctl showmacs juju-br0
port no mac addr is local? ageing timer
  1 18:35:75:f5:ed:d0 no 0.07
  1 40:a8:f0:20:29:ad no 0.67
  1 48:ce:e4:26:d0:c0 yes 0.00
  1 48:ce:e4:26:d0:c0 yes 0.00
  1 52:40:1e:...
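
One detail in the `ip route` output above is that the default route still leaves through eth1 while the subnet route is on juju-br0. A rough Python sketch of that check follows; it assumes the simple two-line `ip route` format shown above, takes the port list from the `brctl show` output by hand, and uses illustrative function names.

```python
# Copy of the `ip route` output above.
IP_ROUTE = """\
default via 10.110.24.1 dev eth1
10.110.24.0/21 dev juju-br0 proto kernel scope link src 10.110.24.116
"""


def route_devices(text):
    """Map each route destination to the device it leaves through."""
    devices = {}
    for line in text.splitlines():
        words = line.split()
        if "dev" in words:
            devices[words[0]] = words[words.index("dev") + 1]
    return devices


def default_route_on_port(text, bridge_ports):
    """Return the default-route device if it is a bridge port, else None."""
    dev = route_devices(text).get("default")
    return dev if dev in bridge_ports else None


if __name__ == "__main__":
    # eth1 is the only port listed for juju-br0 in `brctl show` above.
    print(default_route_on_port(IP_ROUTE, {"eth1"}))
```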

Jay Vosburgh (jvosburgh) wrote :

 I've read the bug log a bit, am I reading it correctly that the
trusty (good) vs wily (bad) cases are selecting different interfaces to
use as bridge ports? E.g., from the syslog excerpt in the bug:

Aug 24 18:15:36 host dhclient: receive_packet failed on eth1: Network is down
Aug 24 18:15:36 host kernel: [ 66.764788] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
Aug 24 18:15:36 host kernel: [ 66.772004] device eth1 entered promiscuous mode
Aug 24 18:15:37 host kernel: [ 68.144483] juju-br0: port 1(eth1) entered forwarding state
Aug 24 18:15:37 host kernel: [ 68.144504] juju-br0: port 1(eth1) entered forwarding state
Aug 24 18:15:37 host kernel: [ 68.160693] juju-br0: received packet on eth1 with own address as source address
Aug 24 18:15:37 host dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 3 (xid=0x40e77812)

 The above is the bad one, eth1 is the bridge port. That seems
to contradict the /etc/network/interfaces included with the bug (which
shows eth0 as the bridge port for juju-br0) and the "good" boot output.

 Also note that the DHCPDISCOVER is on eth1, not on juju-br0.
That is plausibly related to the observation that eth1 still has the IP
address information that (as I understand things) should nominally be
associated with juju-br0.
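
 To make that observation repeatable against a full syslog, something like the following Python sketch could be used. It is only an illustration: the regular expressions are written against the exact message formats quoted above, and `dhcp_on_bridge_ports` is a made-up name.

```python
import re

# A few lines from the failing boot quoted above.
SYSLOG = """\
Aug 24 18:15:36 host kernel: [ 66.772004] device eth1 entered promiscuous mode
Aug 24 18:15:37 host kernel: [ 68.144483] juju-br0: port 1(eth1) entered forwarding state
Aug 24 18:15:37 host dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 3 (xid=0x40e77812)
Aug 24 18:15:37 host dhclient: DHCPREQUEST of 10.110.24.114 on eth1 to 255.255.255.255 port 67 (xid=0x1278e740)
"""

# "<bridge>: port N(<port>) entered forwarding state"
PORT_RE = re.compile(r"(\S+): port \d+\((\S+)\) entered forwarding state")
# "dhclient: DHCPDISCOVER on <iface> to ..." / "DHCPREQUEST of <ip> on <iface> to ..."
DHCP_RE = re.compile(r"dhclient: DHCP(?:DISCOVER|REQUEST)\b.* on (\S+) to ")


def dhcp_on_bridge_ports(log):
    """Return sorted (port, bridge) pairs where dhclient talks on a bridge port."""
    lines = log.splitlines()
    # First pass: learn which interfaces are bridge ports.
    ports = {}
    for line in lines:
        match = PORT_RE.search(line)
        if match:
            ports[match.group(2)] = match.group(1)
    # Second pass: find DHCP traffic sent on any of those ports.
    offenders = set()
    for line in lines:
        match = DHCP_RE.search(line)
        if match and match.group(1) in ports:
            offenders.add((match.group(1), ports[match.group(1)]))
    return sorted(offenders)


if __name__ == "__main__":
    for port, bridge in dhcp_on_bridge_ports(SYSLOG):
        print(f"dhclient is active on {port}, which is a port of {bridge}")
```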

 I'm curious to know what the command line arguments for dhclient
are, and when it was started. Some of that might be in the log, prior
to the excerpt in the bug.

 In that case, it may be that the flaw lies in whatever logic it
is that moves the IP address from the underlying interface (eth1, here)
to the bridge itself. That doesn't happen automatically, something
somewhere has to perform that action.
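
 For reference, the kind of sequence I mean is sketched below in Python as a dry run that only builds the command lines and executes nothing. This is an assumption about one workable ordering using standard iproute2 and dhclient invocations, not a description of what juju or ifupdown actually runs; `bridge_takeover_plan` is a hypothetical name.

```python
def bridge_takeover_plan(port, bridge):
    """Return the commands, in order, as argument lists (nothing is executed)."""
    return [
        ["dhclient", "-r", port],                 # release the port's DHCP lease
        ["ip", "addr", "flush", "dev", port],     # drop any address left on the port
        ["ip", "link", "add", "name", bridge, "type", "bridge"],
        ["ip", "link", "set", "dev", port, "master", bridge],
        ["ip", "link", "set", "dev", bridge, "up"],
        ["dhclient", bridge],                     # request the lease on the bridge
    ]


if __name__ == "__main__":
    for command in bridge_takeover_plan("eth1", "juju-br0"):
        print("sudo " + " ".join(command))
```

 The ordering is the point: the port gives up its lease and address before it is enslaved, and DHCP is only started on the bridge afterwards. Anyone running such a sequence by hand would want console access, since it drops the address that SSH is using.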

 The bug seems to imply that "ifdown" should be run on the
interface; that looks like it does happen in the "working" case, as
there are link up messages:

Aug 25 19:03:10 ms10-33-host1 kernel: [ 44.487804] mlx4_en: eth0: Link Up
Aug 25 19:03:10 ms10-33-host1 kernel: [ 44.487887] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

 that are not in the "failing" case.

 It may also be that whatever it is that starts dhclient is
similarly confused; it seems that dhclient is already running before
eth1 is added to the bridge. Note the error (first line of the syslog
chunk, above) showing that dhclient is attempting activity on eth1, then
"device eth1 entered promiscuous mode"; that suggests that either the
dhclient is being run with incorrect arguments, or perhaps something is
taking place out of order.

 There is also a bug,

https://bugs.launchpad.net/ubuntu/+source/ifupdown/+bug/1337873

 that might be related to this; there is a race condition within
ifupdown itself that seems to manifest mostly with bonding setup.
Perhaps you've found a new way to make it happen, and the race is only
hit due to some subtle timing change between kernels.

Sean Feole (sfeole) wrote :

Something that's not well described in the bug description is that the two examples are taken from 2 different hardware platforms, which explains one example showing eth0 as the primary interface and the other showing eth1.

I thought that was clear in the following blurb ..

2 different hardware platforms, 2 different userspaces and kernels.

"
Here is the syslog output from the 2 different stateserver attempts. The first set of logs from 'host' is running a Trusty userspace with wily kernel. Which displays the failure.

The 2nd snippet of syslog 'host1' displays a Trusty userspace and Trusty Kernel, which eventually completes the bootstrap as expected.
"

sorry if it caused any confusion.

Furthermore, after a lengthy discussion with Andrew and Dimiter today, it was decided I would try to reproduce this problem using a wily userspace/kernel environment (Wily ephemeral daily MAAS image). Due to some existing bugs, I'm blocked on that as well for testing on arm64:

https://bugs.launchpad.net/maas/+bug/1499869

(Curtin Fixes)
https://bugs.launchpad.net/maas/+bug/1402042

Below is a full pastebin of the host logs, after curtin copies the root-tgz image over and reboots the host (I know you were asking for this in the previous comments):

http://pastebin.ubuntu.com/12606125/

Andrew McDermott (frobware) wrote :

I tried to reproduce the net effect of juju bootstrap without using juju.

I booted cvm3 and changed /etc/network/interfaces to:

ubuntu@cvm3:~$ cat /etc/network/interfaces
auto lo

iface eth0 inet dhcp
iface eth2 inet dhcp

# Primary interface (defining the default route)
iface eth1 inet manual

# Bridge to use for LXC/KVM containers
auto juju-br0
iface juju-br0 inet dhcp
    bridge_ports eth1

I then added the bridge:

ubuntu@cvm3:~$ history |grep 'sudo ip link add'
   10 sudo ip link add name juju-br0 type bridge

Then noted that I get the following ifconfig:

ubuntu@cvm3:~$ history |grep 'sudo ip link add'
   10 sudo ip link add name juju-br0 type bridge
   22 history |grep 'sudo ip link add'
ubuntu@cvm3:~$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 18:35:75:f5:ed:cf
          BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:26 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

eth1 Link encap:Ethernet HWaddr 18:35:75:f5:ed:d0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:1697 errors:0 dropped:14 overruns:0 frame:0
          TX packets:1036 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:220782 (220.7 KB) TX bytes:147586 (147.5 KB)

eth2 Link encap:Ethernet HWaddr 18:35:75:f5:ed:d1
          BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

juju-br0 Link encap:Ethernet HWaddr 18:35:75:f5:ed:d0
          inet addr:10.110.24.113 Bcast:10.110.255.255 Mask:255.255.248.0
          inet6 addr: fe80::1a35:75ff:fef5:edd0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:483 errors:0 dropped:0 overruns:0 frame:0
          TX packets:278 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:37936 (37.9 KB) TX bytes:40272 (40.2 KB)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

and eth1 is now on the juju-br0:

ubuntu@cvm3:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 18:35:75:f5:ed:cf brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master juju-br0 state UP group de...

Andrew McDermott (frobware) wrote :

And since my last comment in #12 I noticed that the machine has hung. Nothing reported on the console either. Coincidence? (Don't know!)

Andrew McDermott (frobware) wrote :

@sfeole I'm having trouble booting cvm3 today. It worked once today (see my comments in #12), but thereafter, when deploying via MAAS, I don't see the boot complete when watching the console.

EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 4.2.0-7.7-generic (buildd@magic) (gcc version 5.2.1 20150903 (Ubuntu 5.2.1-16ubuntu1) ) #7+thunder.1-Ubuntu SMP Wed Sep 9 00:11:11 UTC 2015 (Ubuntu 4.2.0-7.7.7+thunder.1-generic 4.2.0)
[ 0.000000] CPU: AArch64 Processor [430f0a10] revision 0
[ 0.000000] Detected VIPT I-cache on CPU0
[ 0.000000] alternatives: detected feature GIC system register CPU interface
[ 0.000000] efi: Getting EFI parameters from FDT:
[ 0.000000] EFI v2.40 by Cavium Thunder cn88xx EFI Aug 12 2015 18:25:45
[ 0.000000] efi: ACPI=0xfffff000 ACPI 2.0=0xfffff014 SMBIOS 3.0=0xffaa72000
[ 0.000000] psci: probing for conduit method from DT.
[ 0.000000] psci: PSCIv0.2 detected in firmware.
[ 0.000000] psci: Using standard PSCI v0.2 function IDs
[ 0.000000] psci: Trusted OS migration not required
[ 0.000000] PERCPU: Embedded 17 pages/cpu @ffffffcffe6c7000 s30976 r8192 d30464 u69632
[ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 16510032
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.2.0-7.7-generic root=UUID=615addfa-5d6b-46e4-86df-539658c4c804 ro console=ttyAMA0
[ 0.000000] log_buf_len individual max cpu contribution: 4096 bytes
[ 0.000000] log_buf_len total cpu_extra contributions: 192512 bytes
[ 0.000000] log_buf_len min size: 16384 bytes
[ 0.000000] log_buf_len: 262144 bytes
[ 0.000000] early log buf free: 13904(84%)
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)
[ 0.000000] Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes)
[ 0.000000] software IO TLB [mem 0xfbff1000-0xffff1000] (64MB) mapped at [ffffffc0fabf1000-ffffffc0febf0fff]
[ 0.000000] Memory: 65831136K/67088384K available (8127K kernel code, 958K rwdata, 3552K rodata, 696K init, 781K bss, 1257248K reserved, 0K cma-reserved)
[ 0.000000] Virtual kernel memory layout:
[ 0.000000] vmalloc : 0xffffff8000000000 - 0xffffffbdbfff0000 ( 246 GB)
[ 0.000000] vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000 ( 8 GB maximum)
[ 0.000000] 0xffffffbdc0050000 - 0xffffffbe00000000 ( 1023 MB actual)
[ 0.000000] fixed : 0xffffffbffa7fd000 - 0xffffffbffac00000 ( 4108 KB)
[ 0.000000] PCI I/O : 0xffffffbffae00000 - 0xffffffbffbe00000 ( 16 MB)
[ 0.000000] modules : 0xffffffbffc000000 - 0xffffffc000000000 ( 64 MB)
[ 0.000000] memory : 0xffffffc000000000 - 0xffffffcffec00000 ( 65516 MB)
[ 0.000000] .init : 0xffffffc000bea000 - 0xffffffc000c98000 ( 696 KB)
[ 0...

Andrew McDermott (frobware) wrote :

I let the machine boot, then did:

ubuntu@cvm3:~$ ping 10.110.24.1
PING 10.110.24.1 (10.110.24.1) 56(84) bytes of data.
64 bytes from 10.110.24.1: icmp_seq=1 ttl=64 time=0.133 ms
64 bytes from 10.110.24.1: icmp_seq=2 ttl=64 time=0.271 ms
64 bytes from 10.110.24.1: icmp_seq=3 ttl=64 time=0.201 ms
64 bytes from 10.110.24.1: icmp_seq=4 ttl=64 time=0.181 ms
64 bytes from 10.110.24.1: icmp_seq=5 ttl=64 time=0.202 ms
64 bytes from 10.110.24.1: icmp_seq=6 ttl=64 time=0.166 ms
64 bytes from 10.110.24.1: icmp_seq=7 ttl=64 time=0.189 ms
64 bytes from 10.110.24.1: icmp_seq=8 ttl=64 time=0.186 ms
64 bytes from 10.110.24.1: icmp_seq=9 ttl=64 time=0.162 ms
64 bytes from 10.110.24.1: icmp_seq=10 ttl=64 time=0.252 ms
64 bytes from 10.110.24.1: icmp_seq=11 ttl=64 time=0.192 ms
64 bytes from 10.110.24.1: icmp_seq=12 ttl=64 time=0.185 ms
64 bytes from 10.110.24.1: icmp_seq=13 ttl=64 time=0.275 ms
64 bytes from 10.110.24.1: icmp_seq=14 ttl=64 time=0.189 ms
64 bytes from 10.110.24.1: icmp_seq=15 ttl=64 time=0.279 ms
64 bytes from 10.110.24.1: icmp_seq=16 ttl=64 time=0.192 ms
64 bytes from 10.110.24.1: icmp_seq=17 ttl=64 time=0.162 ms
64 bytes from 10.110.24.1: icmp_seq=18 ttl=64 time=0.253 ms
64 bytes from 10.110.24.1: icmp_seq=19 ttl=64 time=0.180 ms
64 bytes from 10.110.24.1: icmp_seq=20 ttl=64 time=0.274 ms
64 bytes from 10.110.24.1: icmp_seq=21 ttl=64 time=0.189 ms
64 bytes from 10.110.24.1: icmp_seq=22 ttl=64 time=0.181 ms
64 bytes from 10.110.24.1: icmp_seq=23 ttl=64 time=0.274 ms
64 bytes from 10.110.24.1: icmp_seq=24 ttl=64 time=0.188 ms
64 bytes from 10.110.24.1: icmp_seq=25 ttl=64 time=0.276 ms
64 bytes from 10.110.24.1: icmp_seq=26 ttl=64 time=0.162 ms
64 bytes from 10.110.24.1: icmp_seq=27 ttl=64 time=0.258 ms
64 bytes from 10.110.24.1: icmp_seq=28 ttl=64 time=0.163 ms
64 bytes from 10.110.24.1: icmp_seq=29 ttl=64 time=0.162 ms
64 bytes from 10.110.24.1: icmp_seq=30 ttl=64 time=0.255 ms
64 bytes from 10.110.24.1: icmp_seq=31 ttl=64 time=0.199 ms
64 bytes from 10.110.24.1: icmp_seq=32 ttl=64 time=0.192 ms
64 bytes from 10.110.24.1: icmp_seq=33 ttl=64 time=0.207 ms
64 bytes from 10.110.24.1: icmp_seq=34 ttl=64 time=0.188 ms
64 bytes from 10.110.24.1: icmp_seq=35 ttl=64 time=0.160 ms
64 bytes from 10.110.24.1: icmp_seq=36 ttl=64 time=0.253 ms
64 bytes from 10.110.24.1: icmp_seq=37 ttl=64 time=0.158 ms
64 bytes from 10.110.24.1: icmp_seq=38 ttl=64 time=0.122 ms
64 bytes from 10.110.24.1: icmp_seq=39 ttl=64 time=0.223 ms

and at this point it stopped. No changes were made other than logging in; I had previously run `apt-get upgrade', but purely to generate some network traffic. Hitting return on the console gives me no 'login:' prompt.

Andrew McDermott (frobware) wrote :

I tried to repeat my experiment in #15 but this time on cvm4 -- #15 was done on cvm3.

However, during the deploy and watching the console I saw the following:

[Enter `^Ec?' for help]
**********************************************************************************
Node 0, Core 00: Unhandled Exception
ESR EC=0x0000000000000025(DATA_ABORT_CURRENT_EL) ISS=0x0000000000000250( EXTERNAL WRITE EXTERNAL_ABORT)
**********************************************************************************
pc : 0x00000000000124bc esr: 0x0000000096000250
far: 0x0000040000000000 thread: 0x0000000000047340
x00: 0x0000000000000000 x16: 0x0000000000039bf8
x01: 0x0000040000000000 x17: 0x0000000000039bd0
x02: 0xfffffffffff00000 x18: 0x0000000000000000
x03: 0x000001e1000001e1 x19: 0x0000000000000000
x04: 0x0000000000000000 x20: 0x0000000000440000
x05: 0x0000000000038b60 x21: 0x0000000000440000
x06: 0xffffffffffffffff x22: 0x00000000ffffffff
x07: 0x0000000000000001 x23: 0x0000000000039b48
x08: 0x0000000000000001 x24: 0x0000000000000000
x09: 0x00000008a00a08a0 x25: 0x0000000000000000
x10: 0x0000000000000101 x26: 0x0000000000000000
x11: 0x000000000003a8f8 x27: 0x0000000000000000
x12: 0x0000000001010404 x28: 0x0000000000000000
x13: 0x0000280880280880 x29: 0x000000000004ba70
x14: 0x0000000000000002 x30: 0x0000000000012454
x15: 0x0000000000039cd8 x31: 0x000000000004ba60

q00: 0x00000000000023a1_0x0000000000000000 q16: 0x0000000000000000_0x0000000000000000
q01: 0x0000000000000000_0x0000000000000000 q17: 0x0000000000000000_0x0000000000000000
q02: 0x0000000000000000_0x0000000000000000 q18: 0x0000000000000000_0x0000000000000000
q03: 0x0000000000000000_0x0000000000000000 q19: 0x0000000000000000_0x0000000000000000
q04: 0x0000000000000000_0x0000000000000000 q20: 0x0000000000000000_0x0000000000000000
q05: 0x0000000000000000_0x0000000000000000 q21: 0x0000000000000000_0x0000000000000000
q06: 0x0000000000000000_0x0000000000000000 q22: 0x0000000000000000_0x0000000000000000
q07: 0x0000000000000000_0x0000000000000000 q23: 0x0000000000000000_0x0000000000000000
q08: 0x0000000000000000_0x0000000000000000 q24: 0x0000000000000000_0x0000000000000000
q09: 0x0000000000000000_0x0000000000000000 q25: 0x0000000000000000_0x0000000000000000
q10: 0x0000000000000000_0x0000000000000000 q26: 0x0000000000000000_0x0000000000000000
q11: 0x0000000000000000_0x0000000000000000 q27: 0x0000000000000000_0x0000000000000000
q12: 0x0000000000000000_0x0000000000000000 q28: 0x0000000000000000_0x0000000000000000
q13: 0x0000000000000000_0x0000000000000000 q29: 0x0000000000000000_0x0000000000000000
q14: 0x0000000000000000_0x0000000000000000 q30: 0x0000000000000000_0x0000000000000000
q15: 0x0000000000000000_0x0000000000000000 q31: 0x0000000000000000_0x0000000000000000

stack[0x000000000004ba60] = 0x0000000000000000
stack[0x000000000004ba68] = 0x0000000000000000
stack[0x000000000004ba70] = 0x000000000004bad0
stack[0x000000000004ba78] = 0x0000000000010348
stack[0x000000000004ba80] = 0x0000000000000000
stack[0x000000000004ba88] = 0x0000000000000000
stack[0x000000000004ba90] = 0x0000000000000000
stack[0x000000000004ba98] = 0x0000000000039b40
stack[0x00000000000...

Sean Feole (sfeole) wrote :

Hey Andy, the reason you see the stack trace is that you need to manually power cycle the node after MAAS initially provisions it. When the reboot occurs, issue the following:

$ ipmitool -I lanplus -H 10.110.24.14 -U admin -P admin chassis power reset

Sean Feole (sfeole) wrote :

Today I was able to bootstrap to an x64 node using Juju with the Wily image.

I made sure to bootstrap the latest wily ephemeral image, which is now using the correct wily userspace + kernel.

The following pastebin is the output of the host wily bootstrapping:
http://pastebin.ubuntu.com/12624990/

The following pastebin is the output of the host syslog after it comes up:
http://pastebin.ubuntu.com/12625026/

The following pastebin is the output of the juju client --debug logs:
http://pastebin.ubuntu.com/12625067/

Take note, eth0 and juju-br0: both have the same ipv4 address , in the syslog, you can see dhclient trying to grab an ip for both eth0 && juju-br0

However, unlike its arm64 counterparts, I'm able to reach the host via ICMP (ping), which does not work on arm64:
$ ping 10.229.65.108
PING 10.229.65.108 (10.229.65.108) 56(84) bytes of data.
64 bytes from 10.229.65.108: icmp_seq=1 ttl=64 time=0.154 ms
64 bytes from 10.229.65.108: icmp_seq=2 ttl=64 time=0.160 ms
64 bytes from 10.229.65.108: icmp_seq=3 ttl=64 time=0.149 ms
64 bytes from 10.229.65.108: icmp_seq=4 ttl=64 time=0.143 ms
^C
--- 10.229.65.108 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2997ms
rtt min/avg/max/mdev = 0.143/0.151/0.160/0.013 ms
sfeole@bates:~$

dhclient still tries to request an address via eth0:

Sep 30 15:12:54 ms10-08-avaton dhclient: DHCPREQUEST of 10.229.65.108 on juju-br0 to 10.229.0.106 port 67 (xid=0x129821b9)
Sep 30 15:13:01 ms10-08-avaton mongod.37017[6863]: Wed Sep 30 15:13:01.793 [conn56] authenticate db: admin { authenticate: 1, nonce: "60937017b563ce91", user: "machine-0", key: "1998d289125cc15784fd00eba7bd72c0" }
Sep 30 15:13:04 ms10-08-avaton dhclient: DHCPREQUEST of 10.229.65.108 on eth0 to 10.229.0.106 port 67 (xid=0x47f45aa2)
Sep 30 15:13:09 ms10-08-avaton dhclient: DHCPREQUEST of 10.229.65.108 on juju-br0 to 10.229.0.106 port 67 (xid=0x129821b9)
Sep 30 15:13:15 ms10-08-avaton mongod.37017[6863]: Wed Sep 30 15:13:15.537 [conn56] authenticate db: admin { authenticate: 1, nonce: "b38c26a17954eabe", user: "machine-0", key: "b9e69e4c4b57548617a9c18096b6284d" }
Sep 30 15:13:19 ms10-08-avaton dhclient: DHCPREQUEST of 10.229.65.108 on eth0 to 10.229.0.106 port 67 (xid=0x47f45aa2)
Sep 30 15:13:19 ms10-08-avaton dhclient: DHCPREQUEST of 10.229.65.108 on juju-br0 to 10.229.0.106 port 67 (xid=0x129821b9)
Sep 30 15:13:23 ms10-08-avaton mongod.37017[6863]: Wed Sep 30 15:13:23.448 [conn53] authenticate db: admin { authenticate: 1, nonce: "ec98fea42648e711", user: "machine-0", key: "505f0e760df002df952bfa047cd01713" }
Sep 30 15:13:23 ms10-08-avaton mongod.37017[6863]: Wed Sep 30 15:13:23.587 [conn56] authenticate db: admin { authenticate: 1, nonce: "89829a096b653da4", user: "machine-0", key: "ad6df464863e3634fcb0b78cc915c4e2" }
Sep 30 15:13:33 ms10-08-avaton dhclient: DHCPREQUEST of 10.229.65.108 on juju-br0 to 10.229.0.106 port 67 (xid=0x129821b9)
Sep 30 15:13:35 ms10-08-avaton dhclient: DHCPREQUEST of 10.229.65.108 on eth0 to 10.229.0.106 port 67 (xid=0x47f45aa2)
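
To see at a glance which interfaces dhclient is still requesting leases on, a rough sketch like the following can summarise a syslog excerpt such as the one above. The helper name `dhcp_requests_by_iface` is made up for illustration, and it assumes the log line format shown here ("DHCPREQUEST of ADDR on IFACE to ..."). Syslog "message repeated N times" lines are deliberately not counted.

```shell
#!/bin/sh
# dhcp_requests_by_iface: hypothetical helper, not part of dhclient or juju.
# Reads syslog lines on stdin and prints "IFACE COUNT" for every interface
# that dhclient sent a DHCPREQUEST on, sorted by interface name.
dhcp_requests_by_iface() {
    awk '
        /dhclient: DHCPREQUEST of/ {
            # The interface name is the word following "on".
            for (i = 1; i < NF; i++)
                if ($i == "on") { count[$(i + 1)]++; break }
        }
        END {
            for (iface in count)
                print iface, count[iface]
        }
    ' | sort
}
```

Usage would be something like `grep dhclient /var/log/syslog | dhcp_requests_by_iface`. Seeing both eth0 and juju-br0 in the output suggests that two dhclient instances are active, one per interface.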

---------------------------------------------------------------

$ ifconfig
eth0 Link encap:Ethernet HWaddr f0:92:1c:b4:d4:44
       ...

Revision history for this message
Sean Feole (sfeole) wrote :

Andrew, using your steps as outlined in comment #12, here are my results:

#I modified the following to reflect the added bridge.

$ cat /etc/network/interfaces
iface eth1 inet dhcp

iface eth0 inet dhcp
iface eth1 inet dhcp

# Primary interface (defining the default route)
iface eth0 inet manual

# Bridge to use for LXC/KVM containers
auto juju-br0
iface juju-br0 inet dhcp
    bridge_ports eth0

Followed by:

$ sudo ip link add name juju-br0 type bridge

This command adds the juju-br0 interface and enables it, at which point I lose network connectivity. The console is not active due to cloud-init, and the node is under MAAS control.

So at this time I would suspect the lockup is due to both interfaces being up and having the same IP. I was not able to get the fluid transition you experienced in comment #12.

If I reboot the host, everything then functions as expected.

$ ifconfig
eth0 Link encap:Ethernet HWaddr f0:92:1c:b4:d5:b8
          inet6 addr: fe80::f292:1cff:feb4:d5b8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:175 errors:0 dropped:0 overruns:0 frame:0
          TX packets:95 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:19356 (19.3 KB) TX bytes:13250 (13.2 KB)
          Memory:fbf40000-fbf5ffff

juju-br0 Link encap:Ethernet HWaddr f0:92:1c:b4:d5:b8
          inet addr:10.229.65.110 Bcast:10.229.255.255 Mask:255.255.0.0
          inet6 addr: fe80::f292:1cff:feb4:d5b8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:172 errors:0 dropped:0 overruns:0 frame:0
          TX packets:87 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:16762 (16.7 KB) TX bytes:12602 (12.6 KB)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:80 errors:0 dropped:0 overruns:0 frame:0
          TX packets:80 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:5920 (5.9 KB) TX bytes:5920 (5.9 KB)

Revision history for this message
Sean Feole (sfeole) wrote :

$ cat /etc/network/interfaces
iface eth0 inet dhcp
iface eth1 inet dhcp

# Primary interface (defining the default route)
iface eth0 inet manual

# Bridge to use for LXC/KVM containers
auto juju-br0
iface juju-br0 inet dhcp
    bridge_ports eth0

I fat-fingered the copy-paste in the last comment.

Revision history for this message
Sean Feole (sfeole) wrote :

I took the time to boot the same x64 system with Trusty (userspace and kernel) and performed the exact same steps outlined above.

I was able to successfully transition from eth0 to juju-br0 without any drop in service. It would appear to be possibly an issue with wily at this point.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty
ubuntu@ms10-10-avaton:~$

$ sudo ip link add name juju-br0 type bridge

$ ifconfig
eth0 Link encap:Ethernet HWaddr f0:92:1c:b4:d5:b8
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:2273 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1351 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1829932 (1.8 MB) TX bytes:169936 (169.9 KB)
          Memory:fbf40000-fbf60000

juju-br0 Link encap:Ethernet HWaddr f0:92:1c:b4:d5:b8
          inet addr:10.229.65.110 Bcast:10.229.255.255 Mask:255.255.0.0
          inet6 addr: fe80::f292:1cff:feb4:d5b8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:33 errors:0 dropped:0 overruns:0 frame:0
          TX packets:26 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3142 (3.1 KB) TX bytes:3020 (3.0 KB)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

Oct 1 19:13:29 ms10-10-avaton dhclient: DHCPREQUEST of 10.229.65.110 on eth0 to 10.229.0.106 port 67 (xid=0x7dbf0650)
Oct 1 19:17:01 ms10-10-avaton CRON[1475]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Oct 1 19:17:27 ms10-10-avaton dhclient: message repeated 18 times: [ DHCPREQUEST of 10.229.65.110 on eth0 to 10.229.0.106 port 67 (xid=0x7dbf0650)]
Oct 1 19:17:46 ms10-10-avaton dhclient: DHCPREQUEST of 10.229.65.110 on eth0 to 255.255.255.255 port 67 (xid=0x7dbf0650)
Oct 1 19:17:46 ms10-10-avaton dhclient: DHCPACK of 10.229.65.110 from 10.229.0.101
Oct 1 19:17:46 ms10-10-avaton dhclient: bound to 10.229.65.110 -- renewal in 276 seconds.
Oct 1 19:18:49 ms10-10-avaton kernel: [ 616.901230] Bridge firewalling registered
Oct 1 19:18:49 ms10-10-avaton kernel: [ 616.931620] device eth0 entered promiscuous mode
Oct 1 19:18:49 ms10-10-avaton kernel: [ 616.935514] juju-br0: port 1(eth0) entered forwarding state
Oct 1 19:18:49 ms10-10-avaton kernel: [ 616.935538] juju-br0: port 1(eth0) entered forwarding state
Oct 1 19:18:49 ms10-10-avaton dhclient: Internet Systems Consortium DHCP Client 4.2.4
Oct 1 19:18:49 ms10-10-avaton dhclient: Copyright 2004-2012 Internet Systems Consortium.
Oct 1 19:18:49 ms10-10-avaton dhclient: All rights reserved.
Oct 1 19:18:49 ms10-10-avaton dhclient: For info, please visit https://www.isc.org/software/dhcp/
Oct 1 19:18:49 ms10-10-avaton dhclient:
Oct 1 19:18:49 ms10-10-a...

Revision history for this message
Andrew McDermott (frobware) wrote :

@sfeole - you mentioned "it would appear to be possibly an issue with wily at this point"

Given that we are just using commands from iproute2 can we also assume that this is not a juju bug?

tags: added: kernel-da-key
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the latest prior kernel that did not have this issue.

Can you test the following kernels and post back? We are looking for the first kernel version that exhibits this bug:

3.19 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19-vivid/
4.0 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.0-vivid/
4.1 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.1-unstable/
4.2 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2-unstable/

You don't have to test every kernel, just up until the kernel that first has this bug. If the 3.19 Final kernel exhibits the bug, then we will have to test older kernels closer to Trusty.

These are all upstream kernels, so if none exhibit the bug, then that tells us it's an Ubuntu kernel specific issue.
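
As a rough illustration of the search described above (this is not the kernel team's tooling), the sketch below bisects an ordered list of kernel versions for the first one that fails. The names `first_bad` and the probe command are hypothetical. It assumes the list is ordered oldest to newest, that the newest entry is known to be bad, and that every kernel after the first bad one is also bad.

```shell
#!/bin/sh
# first_bad: hypothetical helper illustrating a manual kernel bisect.
# Usage: first_bad PROBE VERSION...
#   PROBE is a command that exits 0 if the given kernel version is good
#   (for example, "bootstrap worked and the bridge came up cleanly").
#   VERSION... is ordered oldest to newest; the newest is assumed bad.
first_bad() {
    probe=$1; shift
    lo=1        # lowest index that could still be the first bad kernel
    hi=$#       # highest index; assumed bad
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"
        if "$probe" "$candidate"; then
            lo=$((mid + 1))     # candidate is good: first bad one is newer
        else
            hi=$mid             # candidate is bad: first bad one is it or older
        fi
    done
    eval "echo \${$lo}"
}
```

In practice the probe is a person installing the kernel and re-running the bootstrap, so each step is slow, which is why halving the range matters. With the four kernels listed above this needs at most two test boots.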

Thanks in advance!

affects: juju-core → linux
Changed in linux:
assignee: Andrew McDermott (frobware) → Joseph Salisbury (jsalisbury)
milestone: 1.25-beta2 → none
Revision history for this message
Sean Feole (sfeole) wrote :

Hey Joe,

Can you give me some test kernels for arm64, if at all possible?

Last night I was more or less running through what you gave me on amd64. I was able to run the 3.19-vivid kernel on amd64; however, I would like to see what the arm64 equivalent does.

On ARM64
The core of the problem starts in between here: 3.13.0-65 -> 3.19.0-30.

Linux ms10-38-mcdivittA3 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:14:48 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
Linux ms10-40-mcdivittA3 3.19.0-30-generic #34-Ubuntu SMP Fri Oct 2 22:09:54 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

On AMD64
I am able to bootstrap 3.13 -> 3.19; however, I do see some concerns that may or may not be a problem.

Revision history for this message
Sean Feole (sfeole) wrote :

Hey Joe,

So far I have tested all of the trusty kernels on arm64, even 3.13.0-66 in -proposed. All of them pass my arm64 tests.

The first one to fail is the vivid release kernel, 3.19.0.15.14.

Will you be able to build the earlier versions, at least one in each series (3.16/3.17)? I can test them if you like. I was not able to find prebuilt debs for the older series. I would like to start with something older than 3.18, since I first observed this problem on a custom build of the 3.18 kernel.

So just 4 left...

3.14 / 3.15 / 3.16 / 3.17

I can start backwards with 3.17 if possible. Just need to know where to grab the built debs from.

Revision history for this message
Sean Feole (sfeole) wrote :

I actually found some here which I can play with now: https://launchpad.net/ubuntu/vivid/+source/linux

Let me know if there is a different location you would like me to pull from.

Revision history for this message
Sean Feole (sfeole) wrote :

Update to comment #25:

I was able to test with a few kernels from utopic.
The last one tested was '3.16.0-44-generic' from -proposed; this did appear to bootstrap as expected.

As of now:

Last working kernel: 3.16.0-44-generic
Last Discovered Known broken kernel: 3.19.0-30

I mentioned in a previous comment that I had seen the problem with 3.18 although I have not officially tested it. Will do so next.

Narrowed the gap to 3.17 / 3.18

Revision history for this message
Sean Feole (sfeole) wrote :

As of today, here is where we stand:

Last Working kernel: 3.16.0-44-generic (utopic -proposed)
First Discovered Known Broken Kernel: 3.19.0-30 (vivid-release)

I am testing on HP Moonshot m400 cartridges for now.

To test a 3.18 kernel we will need to build one for the m400. The kernel team has already confirmed they will do that for me; we owe them a list of patches to backport from 3.19 in order to build a working 3.18 kernel. This way we can identify whether this problem originated in the 3.18 series or in 3.19.

Making progress slowly.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hey Sean,

I'm still in the process of backporting and identifying all of the prereqs needed to apply the McDivitt patches to a 3.18 kernel.

I noticed that you first discovered the issue in 3.19.0-30. All of the McDivitt patches were added to the Vivid kernel in 3.19.0-26. I was wondering if you could give that kernel a test while I continue the backport?

The 3.19.0-26 kernel can be downloaded from:
https://launchpad.net/ubuntu/+source/linux/3.19.0-26.27

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I was also able to build a 3.18.0-11 kernel with the McDivitt patches. Ming's patches required some backporting and prerequisite commits. The patches needed for 3.18 are:

Config changes for 3.18:
aaec082 UBUNTU: [Config] switch on "all" dtbs
cbdfce2 UBUNTU: [Config] follow move of arm64 dts' into vendor directories

Patches from Ming:
e82a3ff (no-up) arm64: dts: add APM Merlin Board device tree
597fad6 dts, arm64: Move dts files to vendor subdirs
147c601 drivers: net: xgene: Add SGMII based 1GbE support with ring manager v2
987cb75 drivers: net: xgene: Add 10GbE support with ring manager v2
2bd7cda drivers: net: xgene: Add ring manager v2 functions
f905136 drivers: net: xgene: Change ring manager to use function pointers
3e92a9f drivers: net: xgene: Add separate tx completion ring
60eabc2 dtb: xgene: Add interrupt for Tx completion
096ea4d Documentation: dts: xgene: Update interrupt field description
0b09f89 ata: ahci_xgene: Add AHCI Support for 2nd HW version of APM X-Gene SoC AHCI SATA Host controller.
3a25c40 libahci: Add support to handle HOST_IRQ_STAT as edge trigger latch.
016ea5d libahci: Refactoring of ahci_single_irq_intr function.
e0c1b5b ata: ahci_platform: fix owner module reference mismatch for scsi host

3.18 specific prerequisites:
0fd1ac7 drivers: net: xgene: Add second SGMII based 1G interface
f47c638 drivers: net: xgene: Make xgene_enet_of_match depend on CONFIG_OF
46fd949 net: eth: xgene: change APM X-Gene SoC platform ethernet to support ACPI
b76ec33 Driver core: Unified device properties interface for platform firmware

The 3.18 test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1496972/3.18.0-11.12-Patched/

Revision history for this message
Sean Feole (sfeole) wrote :

Hey Joe,

I tested kernel 3.19.0-26.27 and we still have the issue:

$ ifconfig
eth0 Link encap:Ethernet HWaddr fc:15:b4:21:00:c2
          inet addr:10.229.65.139 Bcast:10.229.255.255 Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:219 errors:0 dropped:0 overruns:0 frame:0
          TX packets:164 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:49226 (49.2 KB) TX bytes:23267 (23.2 KB)

juju-br0 Link encap:Ethernet HWaddr fc:15:b4:21:00:c2
          inet addr:10.229.65.139 Bcast:10.229.255.255 Mask:255.255.0.0
          inet6 addr: fe80::fe15:b4ff:fe21:c2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:213 errors:0 dropped:0 overruns:0 frame:0
          TX packets:134 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:45928 (45.9 KB) TX bytes:20959 (20.9 KB)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:18 errors:0 dropped:0 overruns:0 frame:0
          TX packets:18 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1744 (1.7 KB) TX bytes:1744 (1.7 KB)

ubuntu@ms10-39-mcdivittB0:~$ netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.229.0.1 0.0.0.0 UG 0 0 0 eth0
10.229.0.0 0.0.0.0 255.255.0.0 U 0 0 0 juju-br0
ubuntu@ms10-39-mcdivittB0:~$ uname -a
Linux ms10-39-mcdivittB0 3.19.0-26-generic #27-Ubuntu SMP Tue Jul 28 18:47:37 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

from the juju side:

2015-10-23 12:14:37 INFO juju.provider.maas environ.go:1088 could not acquire a node in zone "avaton", trying another zone
2015-10-23 12:14:43 INFO juju.provider.maas environ.go:1088 could not acquire a node in zone "default", trying another zone
2015-10-23 12:14:43 DEBUG juju.provider.maas environ.go:1819 node "/MAAS/api/1.0/nodes/node-ae1dfe5e-5bc1-11e5-872f-00163ec335e8/" skipping disabled network interface "eth1"
2015-10-23 12:14:43 DEBUG juju.provider.maas environ.go:1811 node "/MAAS/api/1.0/nodes/node-ae1dfe5e-5bc1-11e5-872f-00163ec335e8/" primary network interface is "eth0"
2015-10-23 12:14:43 DEBUG juju.provider.maas environ.go:773 node "/MAAS/api/1.0/nodes/node-ae1dfe5e-5bc1-11e5-872f-00163ec335e8/" has network interfaces map[fc:15:b4:21:00:c3:{0 eth1 true} fc:15:b4:21:00:c2:{1 eth0 false}]
2015-10-23 12:14:43 DEBUG juju.provider.maas environ.go:778 node "/MAAS/api/1.0/nodes/node-ae1dfe5e-5bc1-11e5-872f-00163ec335e8/" has networks []
2015-10-23 12:14:43 DEBUG juju.provider.maas environ.go:820 node "/MAAS/api/1.0/nodes/node-ae1dfe5e-5bc1-11e5-872f-00163ec335e8/" network information: []network.InterfaceInfo(nil)
2015-10-23 12:14:43 DEBUG juju.cloudconfig.instancecfg instancecfg.go:521 Setting numa ctl preference to false
2015-10-23 12:14:43 DEBUG juju.service discovery.go:65 discovered init system "upst...

Revision history for this message
Sean Feole (sfeole) wrote :

Joe,

I can't seem to get the 3.18.0-11.12-Patched kernel to boot, unless it is booting and just takes longer than usual to show the kernel messages. I'll leave it again for about 5 minutes and see what happens.

On the console this is all I see. As you can see, 3.18.0-11-generic is booting.

Booting M.2
252 bytes read in 24 ms (9.8 KiB/s)
## Executing script at 4004000000
11905464 bytes read in 316 ms (35.9 MiB/s)
25623916 bytes read in 658 ms (37.1 MiB/s)
## Booting kernel from Legacy Image at 4002000000 ...
   Image Name: kernel 3.18.0-11-generic
   Created: 2015-10-23 12:55:23 UTC
   Image Type: ARM Linux Kernel Image (uncompressed)
   Data Size: 11905400 Bytes = 11.4 MiB
   Load Address: 00080000
   Entry Point: 00080000
   Verifying Checksum ... OK
## Loading init Ramdisk from Legacy Image at 4005000000 ...
   Image Name: ramdisk 3.18.0-11-generic
   Created: 2015-10-23 12:55:23 UTC
   Image Type: ARM Linux RAMDisk Image (gzip compressed)
   Data Size: 25623852 Bytes = 24.4 MiB
   Load Address: 00000000
   Entry Point: 00000000
   Verifying Checksum ... OK
## Flattened Device Tree blob at 4003000000
   Booting using the fdt blob at 0x0000004003000000
   Loading Kernel Image ... OK
OK
   Loading Ramdisk to 4fee790000, end 4feffffd2c ... OK
   Loading Device Tree to 0000004000ff8000, end 0000004000fffb0d ... OK

Starting kernel ...

L3C: 8MB

affects: linux → ubuntu
Changed in ubuntu:
status: Triaged → In Progress
Revision history for this message
dann frazier (dannf) wrote :

I took a look at this bug today since it's bubbling up to the critical level.

I could easily reproduce it (trusty + a 3.18 kernel). I noticed that, after juju does the ifdown, dhclient is still running for the primary interface. Even if ifdown successfully deconfigured the interface, dhclient is just going to bring it back up/configure it when it refreshes its lease.

So the question is, why didn't ifdown kill dhclient? Well, I looked at the juju code and I see that it generates a new /e/n/interfaces file before running ifdown. I've experienced issues with this before. I don't believe that ifdown itself keeps track of how interfaces were brought up; it instead consults /e/n/interfaces again. Since /e/n/i was already updated, ifdown doesn't know how the primary interface was brought up, so it fails to bring it down cleanly.

My suggestion is to bring down the primary interface *before* updating /e/n/i, then bring up the bridge after.
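
The suggested ordering can be sketched as follows. This is only an illustration of the sequence, not the patch that was merged into juju. The function name `bridge_primary`, its arguments, and the `RUN` variable are made up for this sketch. By default it only prints the commands it would run; setting `RUN=sudo` (an assumption about how you would want to execute it) runs them for real.

```shell
#!/bin/sh
# bridge_primary: hypothetical sketch of the ordering suggested above.
# Usage: bridge_primary PRIMARY BRIDGE NEW_CONFIG [TARGET]
#   PRIMARY     primary interface, e.g. eth0
#   BRIDGE      bridge to bring up, e.g. juju-br0
#   NEW_CONFIG  pre-generated interfaces file that defines the bridge
#   TARGET      file to replace (defaults to /etc/network/interfaces)
# RUN is the command prefix; when unset or empty it is "echo" (dry run).
bridge_primary() {
    primary=$1
    bridge=$2
    new_config=$3
    target=${4:-/etc/network/interfaces}

    # 1. Take the primary interface down while the old config still
    #    describes how it was brought up, so ifdown can stop dhclient.
    ${RUN:-echo} ifdown "$primary"

    # 2. Only now install the config that defines the bridge.
    ${RUN:-echo} cp "$new_config" "$target"

    # 3. Bring up the bridge, which enslaves the primary interface.
    ${RUN:-echo} ifup "$bridge"
}
```

For example, `bridge_primary eth0 juju-br0 /tmp/interfaces.new` prints the three commands in order. The important part is that step 1 happens before step 2; swapping them recreates the situation described above, where dhclient for the primary interface is left running.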

Changed in ubuntu:
status: In Progress → Invalid
Revision history for this message
Raghuram Kota (rkota) wrote :

Re-assigning the bug to Andrew M based on comment #33

Changed in juju-core:
assignee: nobody → Andrew McDermott (frobware)
Revision history for this message
Raghuram Kota (rkota) wrote :

Hi Andrew,

Dann attached a tested fix patch to comment #33. Can you please help test and merge this into the appropriate juju releases soon? This bug has been blocking us for some time now.

Thanks,
Raghu

Revision history for this message
Andrew McDermott (frobware) wrote :

Will look at the patch. Thanks.

Revision history for this message
Andrew McDermott (frobware) wrote :

I see in comment #33 that the bug became invalid and was reassigned. However, it doesn't explain the behaviour I previously saw in comment #15, where pinging a gateway interface appeared to lock up the machine; there were no other changes made to /e/n/i in that case.

Changed in juju-core:
milestone: none → 1.24.8
Changed in juju-core:
milestone: 1.24.8 → 1.25.1
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Andrew McDermott (frobware) wrote :

I can see this same issue on amd64.

ubuntu@maas-node2:~$ uname -a
Linux maas-node2 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 15:35:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@maas-node2:~$ lsb_release -cs
wily

And, having bootstrapped a node, I see the following:

ubuntu@maas-node2:~$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 52:54:00:c3:72:59
          inet addr:10.17.17.103 Bcast:10.17.17.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:28657 errors:0 dropped:2 overruns:0 frame:0
          TX packets:20914 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:55451555 (55.4 MB) TX bytes:1666875 (1.6 MB)

juju-br0 Link encap:Ethernet HWaddr 52:54:00:c3:72:59
          inet addr:10.17.17.103 Bcast:10.17.17.255 Mask:255.255.255.0
          inet6 addr: fe80::5054:ff:fec3:7259/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:18112 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17050 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:40259427 (40.2 MB) TX bytes:1402232 (1.4 MB)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:9584 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9584 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19054312 (19.0 MB) TX bytes:19054312 (19.0 MB)

lxcbr0 Link encap:Ethernet HWaddr 3a:f6:90:64:c4:73
          inet addr:10.0.3.1 Bcast:0.0.0.0 Mask:255.255.255.0
          inet6 addr: fe80::38f6:90ff:fe64:c473/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B) TX bytes:570 (570.0 B)

Networking does appear to continue working on amd64, but not on arm64.

Changed in juju-core:
milestone: 1.25.1 → 1.26-alpha1
Revision history for this message
Andrew McDermott (frobware) wrote :
Changed in juju-core:
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.26-alpha1 → 1.26-alpha2
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released