After unlock controller-0 is not possible to log in

Bug #1829941 reported by Maria Guadalupe Perez Ibara
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
yong hu

Bug Description

Brief Description
-----------------

Is not possible to get a shell prompt after login via ssh or serial over lan.

Severity
--------
Major

Steps to Reproduce
------------------
After running sanity test and completed successfully, then try to login to the system through ssh or serial.

Expected Behavior
------------------
The system should show the shell prompt.

Actual Behavior
----------------
The system gets stuck in the following screen without showing the prompt:

WARNING: Unauthorized access to this system is forbidden and will be
prosecuted by law. By accessing this system, you agree that your
actions may be monitored if unauthorized usage is suspected.

Reproducibility
---------------
100% in standard, standard with external storage and duplex.

System Configuration
--------------------
Duplex, standard and external storage.

Branch/Pull Time/Commit
-----------------------
OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190521T132734Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="111"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-21 13:27:34 +0000"

Last Pass
---------
2019-05-20

Timestamp/Logs
--------------
is not possible to get logs as the system is unreachable.

Test Activity
-------------
[Sanity]

Ghada Khalil (gkhalil)
tags: added: stx.sanity
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As discussed in the stx community call, there isn't enough information to investigate this issue. Christopher Lemus took the action to update. Marking as Incomplete.

Changed in starlingx:
status: New → Incomplete
assignee: nobody → Christopher Lemus (j3t3r-23)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Any update on this issue?

Revision history for this message
Erich Cordoba (ericho) wrote :
Download full text (6.7 KiB)

We hit another instance of this bug in a standard configuration after the an unlock was attempted.

With "ssh wrsroot@192.168.200.82" is not possible to get the prompt, however after doing a "ssh wrsroot@192.168.200.82 -t bash" works.

Once logged in the controller-0 some kernel oops where discovered. (see complete kern.log in the attached controller-0.tar.gz)

[11854.061985] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11854.069802] jbd2/rbd0-8 D ffff895e278b0ba0 0 915137 2 0x00000000
[11854.069806] Call Trace:
[11854.069814] [<ffffffffb5ad5ca1>] ? cpuacct_charge+0x61/0x70
[11854.069834] [<ffffffffb620fb30>] ? bit_wait+0x50/0x50
[11854.069837] [<ffffffffb6212709>] schedule+0x29/0x70
[11854.069841] [<ffffffffb620f4f9>] schedule_timeout+0x289/0x330
[11854.069844] [<ffffffffb5ae5cc2>] ? ktime_get_ts64+0x52/0xf0
[11854.069846] [<ffffffffb620fb30>] ? bit_wait+0x50/0x50
[11854.069848] [<ffffffffb621121d>] io_schedule_timeout+0xad/0x130
[11854.069850] [<ffffffffb62112b8>] io_schedule+0x18/0x20
[11854.069851] [<ffffffffb620fb41>] bit_wait_io+0x11/0x50
[11854.069852] [<ffffffffb620f6f1>] __wait_on_bit_lock+0x61/0xc0
[11854.069855] [<ffffffffb5b83c14>] __lock_page+0x74/0x90
[11854.069859] [<ffffffffb5aa7060>] ? wake_bit_function+0x40/0x40 ...

Read more...

Revision history for this message
Erich Cordoba (ericho) wrote :

collect --all failed with a timeout error so I tar -czvf /var/log

Revision history for this message
Erich Cordoba (ericho) wrote :

Changing to Confirmed as the issue was saw again.

Changed in starlingx:
status: Incomplete → Confirmed
Revision history for this message
Erich Cordoba (ericho) wrote :

The kernel oops looks similar to https://bugs.launchpad.net/starlingx/+bug/1814595 but the symptoms are different.

Revision history for this message
Erich Cordoba (ericho) wrote :
summary: - Cannot get prompt after login in ssh or serial
+ After unlock controller-0 is not possible to log in
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

The issue is present with build:20190602T233000Z

The workflow to reproduce this issue is:

1. Swact to controller-1
2. lock controller-0
3. unlock controller-0
   This reboots the node
4. Try to login.

This is what we see when we try to login:

```
ssh 192.168.200.82
Warning: Permanently added '192.168.200.82' (ECDSA) to the list of known hosts.
Release 19.01
------------------------------------------------------------------------
W A R N I N G *** W A R N I N G *** W A R N I N G *** W A R N I N G ***
------------------------------------------------------------------------
THIS IS A PRIVATE COMPUTER SYSTEM.
This computer system including all related equipment, network devices
(specifically including Internet access), are provided only for authorized use.
All computer systems may be monitored for all lawful purposes, including to
ensure that their use is authorized, for management of the system, to
facilitate protection against unauthorized access, and to verify security
procedures, survivability and operational security. Monitoring includes active
attacks by authorized personnel and their entities to test or verify the
security of the system. During monitoring, information may be examined,
recorded, copied and used for authorized purposes. All information including
personal information, placed on or sent over this system may be monitored. Uses
of this system, authorized or unauthorized, constitutes consent to monitoring
of this system. Unauthorized use may subject you to criminal prosecution.
Evidence of any such unauthorized use collected during monitoring may be used
for administrative, criminal or other adverse action. Use of this system
constitutes consent to monitoring for these purposes.

wrsroot@192.168.200.82's password:
Last login: Tue Jun 4 10:57:40 2019 from 192.168.200.3
/etc/motd.d/00-header:

WARNING: Unauthorized access to this system is forbidden and will be
prosecuted by law. By accessing this system, you agree that your
actions may be monitored if unauthorized usage is suspected.

```

As you can see above, we are able to open the SSH connection, and also, we are able to put the password, but, that's it. There is no shell, nor error message.

As Erich pointed out, with `ssh 192.168.200.82 -t bash` we can login into the system, but mysql has a huge CPU usage. I attached the output of top.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

This is suspected to be a storage or low-level issue; assigning to the stx.distro.other team.

Marking as release gating as the issue is consistently reported from sanity.

tags: added: stx.2.0 stx.distro.other
Changed in starlingx:
importance: Undecided → High
assignee: Christopher Lemus (j3t3r-23) → Cindy Xie (xxie1)
Revision history for this message
Cindy Xie (xxie1) wrote :

how often we can reproduce it? Do you have a setup which we can access from Shanghai?

>> Reproducibility
---------------
100% in standard, standard with external storage and duplex.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

I send via email the details of a configuration with this behavior. Is a standard configuration (2+2) with ISO: 20190604T144018Z.

I'm not sure if the prompt will work after some hours, but it has been like that for ~1 hour and it still doesn't return a prompt. Hopefully it'll help to track down the issue.

Revision history for this message
yong hu (yhu6) wrote :

right now I can access the shell, see the attachment.
Maybe it's been many hours after that issue happened.

BTW: what's the password for this setup?
I might login for a few experiments.

Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → yong hu (yhu6)
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Yong,

The prompt wasn't available for, at least, 12 hours. At some point in time between Friday and Sunday when you checked, it came back to normal behavior.

Today we spend the whole day troubleshooting and documenting different bugs that prevented this one for being reproduced (such bugs are during the setup of starlingx).

I'll be sure to reproduce this one and leave a configuration running for you to login (I'll email you the login details).

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Yong,

I tried to reproduce this bug with 20190611T000451Z. It was not possible, and I tried 3 times on Standard.

Does it worth it going back to a previous image where this was failing? I could do that and leave the config running for you to check.

Or can we close this and re-open if it happens in the future?

Revision history for this message
Cindy Xie (xxie1) wrote :

Hi, Cristopher, please keep it open and run the same testing for at least 3 builds. If this is still cannot repo, we can mark it as "fix released" by some unknown patch.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi All,

Just to keep this one properly updated. I verified on the logs, and this issue was not presented on the 3 green sanities that we had last week, 06/11, 06/12 and 06/13.

The actual Sanity is blocked since 06/14 (different failures during setup), we have not had a chance to re-test this scenario. As soon as we re-test this part, I'll update the results.

Cindy Xie (xxie1)
Changed in starlingx:
importance: High → Medium
Revision history for this message
yong hu (yhu6) wrote :

@Cristopher, next time if we have a chance to reproduce this issue, we might capture more log by setting "LogLevel DEBUG3" in "/etc/ssh/sshd_config" (note: we need to restart sshd service by # sudo systemctl restart sshd) and login again.
no matter the issue comes out or not, we can see more logs from "/var/log/auth.log" and "/var/log/daemon.log". And we will debug from these logs.

BTW: why did "ssh -t bash" work? "-t" means Force pseudo-terminal allocation and it forces tty allocation for "bash", and it is not using the default tty which was created by sshd.

Revision history for this message
yong hu (yhu6) wrote :

whenever login by ssh, sshd daemon will create new processes hierarchically.

NOTE: the following cmd must be executed right after ssh login, so within 30 seconds.
==============================================================================
controller-0:~$ all_pids=`tail -n 30 /var/log/auth.log | grep sshd | cut -d " " -f 3 | sort -u | cut -d "[" -f 2 | cut -d "]" -f 1`; date; for p in $all_pids; do ps -wo pid,lstart,cmd -p $p; done; pstree -aps $p ; ps ax -wo pid,lstart,cmd | grep sshd | grep -v grep
2019年 06月 20日 星期四 16:05:47 CST
    PID STARTED CMD
1707579 Thu Jun 20 16:05:32 2019 sshd: wrsroot [priv]
    PID STARTED CMD
1707622 Thu Jun 20 16:05:35 2019 sshd: wrsroot@pts/0
    PID STARTED CMD
1707635 Thu Jun 20 16:05:35 2019 -sh
systemd,1 --switched-root --system --deserialize 21
  └─sshd,1637002
      └─sshd,1707579
          └─sshd,1707622
              └─sh,1707635
                  ├─pstree,1707789 -aps 1707635
                  └─sh,1707659
1637002 Thu Jun 20 14:47:47 2019 /usr/sbin/sshd
1707579 Thu Jun 20 16:05:32 2019 sshd: wrsroot [priv]
1707622 Thu Jun 20 16:05:35 2019 sshd: wrsroot@pts/0

==============================================================================

From the output above, we can see 3 new processes were created by sshd daemon every time when the user does the login, in order 1637002 ->> 1707579 -> 1707622 -> 1707659

If Linux system in Controller is too busy, possibly the process generation chain would be halt or blocked and consequently user cannot see the shell prompt (the last process generated in this chain).

Therefore, we need MORE log from sshd to debug if something goes wrong. please refer the method to enable more logs in sshd from the previous comment I made.

Changed in starlingx:
status: Confirmed → Incomplete
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

This bug was not reproduced with Green ISOS from 06/20 and 06/21. I think that is safe to close this one.

Yong, thanks for the information, if by any chance this re-appears (not likely), I'll capture as much information as possible to debug it. Thanks.

Cindy Xie (xxie1)
Changed in starlingx:
status: Incomplete → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers