ssh-keyscan(1) exits prematurely on some non-fatal errors

Bug #483928 reported by Daniel Richard G.
This bug affects 2 people
Affects            Status     Importance  Assigned to  Milestone
portable OpenSSH   Confirmed  High
openssh (Ubuntu)   Confirmed  Wishlist    Unassigned

Bug Description

Binary package hint: openssh-client

This concerns openssh-client 1:5.1p1-5ubuntu1 in Karmic.

I am using ssh-keyscan(1) for its intended purpose: building an ssh_known_hosts file for a large network. Most of the hosts on this network are well-maintained systems, with properly-functioning SSH servers, and present no difficulty to the program.

However, a handful of hosts are barely alive, with SSH servers that are not exactly in good working order. ssh-keyscan(1) currently will scan these systems, encounter some form of error, and then---right here is the problem---exit in the middle of the scan. The last bit of stderr output may look like

 # A.B.C.D SSH-2.0-OpenSSH_4.3
 # A.B.C.E SSH-2.0-OpenSSH_4.3
 # A.B.C.F SSH-1.99-OpenSSH_3.7p1
 Connection closed by A.B.C.F

or

 # A.B.C.D SSH-2.0-OpenSSH_4.1
 # A.B.C.E SSH-2.0-OpenSSH_4.1
 # A.B.C.F SSH-2.0-mpSSH_0.1.0
 Received disconnect from A.B.C.F: 10: Protocol error

or

 # A.B.C.D SSH-2.0-OpenSSH_4.4p1
 # A.B.C.E SSH-2.0-OpenSSH_5.0p1
 # A.B.C.F SSH-2.0-mpSSH_0.1.0
 Received disconnect from A.B.C.F: 11: SSH Disabled

(These are the different failure modes I've observed to date)

ssh-keyscan(1) needs to be robust to these kinds of errors---simply make a note of them, and continue on with the scan. I don't want to have to find out which systems are misbehaving by running and re-running the scan (each run yields at most one bad host, obviously), nor manually edit out the few bad apples from the input list of hosts (especially considering that this particular subset can change over time). Neither is feasible when the number of hosts being scanned is very large.
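
Until the scan itself tolerates these errors, one possible workaround is to run a separate ssh-keyscan process per host, so that an abort costs only that host. The following is a minimal sketch under that assumption, not a fix for the bug: `scan_one` and `scan_all` are hypothetical helper names, and the 5-second `-T` timeout is an arbitrary choice.

```python
import subprocess

def scan_one(host, key_type="rsa", timeout=5):
    """Run ssh-keyscan for a single host; return (stdout, stderr, exit status)."""
    proc = subprocess.run(
        ["ssh-keyscan", "-T", str(timeout), "-t", key_type, host],
        capture_output=True,
        text=True,
    )
    return proc.stdout, proc.stderr, proc.returncode

def scan_all(hosts, scan=scan_one):
    """Scan hosts one process at a time so a failure affects only that host."""
    keys, failures = [], {}
    for host in hosts:
        out, err, status = scan(host)
        if out.strip():
            keys.extend(out.splitlines())
        else:
            # Record why this host produced no key, then keep going.
            failures[host] = err.strip() or f"no key returned (exit status {status})"
    return keys, failures
```

The trade-off is speed: ssh-keyscan normally contacts many hosts in parallel, and one process per host gives that up, so this is likely to be slow on a very large network.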

Revision history for this message
In , Tryponraj (tryponraj) wrote :

Hello All,

I'm using OpenSSH 4.3p2 and trying to scan a list of 40 machines in my
network with the ssh-keyscan utility. I used the following command,

ssh-keyscan -t rsa -f hosts.txt

The man page says that this utility displays the host keys irrespective of whether ssh or the host is up or down, and it's working great. But if the scan stops at the 30th host due to some protocol problem, the utility exits and doesn't display the host keys for the remaining machines. I think this is expected behaviour, but it would be better to ignore that host and continue to the end, or at least to document this specifically in the man page.

I dug into this problem further; my findings are below.

ssh-keyscan ignores hosts that are not up or are not running sshd
when used with the -f <file> option. But when it encounters any error while
retrieving the host key from a machine which is up and has sshd running, it simply exits. This may be due to the transport layer implementation in packet.c, at packet_read_poll_seqnr(), which results in exiting.

My guess is that, as packet.c is used by all OpenSSH utilities
including ssh-keyscan, we can't make ssh-keyscan continue with the
remaining hosts specified in -f <file> in case of an error. But I also vote for at least documenting this.

Detailed debug traces are given below:
--------------------------------------
# ssh-keyscan -vvv -t rsa host.server.com
debug2: fd 3 setting O_NONBLOCK
debug1: no match: mpSSH_0.1.0
# host.server.com SSH-2.0-mpSSH_0.1.0
debug1: Enabling compatibility mode for protocol 2.0
debug3: RNG is ready, skipping seeding
debug1: SSH2_MSG_KEXINIT sent
Received disconnect from 16.245.97.226: 11: SSH Disabled

# ssh -vvv host.server.com
OpenSSH_4.3p2-hpn, OpenSSL 0.9.7i 14 Oct 2005
HP-UX Secure Shell-A.04.30.005, HP-UX Secure Shell version
debug1: Reading configuration data /opt/ssh/etc/ssh_config
debug3: RNG is ready, skipping seeding
debug2: ssh_connect: needpriv 0
debug1: Connecting to host.server.com [16.245.97.226] port 22.
debug1: Connection established.
debug1: permanently_set_uid: 0/3
debug1: identity file /.ssh/identity type 0
debug3: Not a RSA1 key file /.ssh/id_rsa.
debug2: key_type_from_name: unknown key type '-----BEGIN'
debug3: key_read: missing keytype
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file /.ssh/id_rsa type 1
debug3: Not a RSA1 key file /.ssh/id_dsa.
debug2: key_type_from_name: unknown key type '-----BEGIN'
debug3: key_read: missing keytype
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_rea...


Revision history for this message
In , Paul Wouters (paul-cypherpunks) wrote :

I was going to open a new bug report, but I think I am reporting the same bug as this one.

ssh-keyscan aborts when it encounters glue without the proper authoritative data. eg:

hostname.domain.com IN NS hostname.domain.com
hostname.domain.com IN A 1.2.3.4

Where hostname.domain.com is itself not running a nameserver.
It is correct in not processing this entry, as the glue is non-authoritative data, and cannot be confirmed by the nameserver of the child zone.
However, ssh-keyscan should just skip this entry, not abort.

I noticed this when writing ftp://ftp.xelerance.com/sshfp/ which is a python script that can use ssh-keyscan (or known_hosts files) to generate SSHFP records.

Revision history for this message
In , Senthilkumar-sen (senthilkumar-sen) wrote :

Is there any chance that this bug will get fixed for the next release?

Chuck Short (zulcss)
Changed in openssh (Ubuntu):
importance: Undecided → Wishlist
status: New → Confirmed
Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Created attachment 1961
One attempt at getting the rsa key from a remote server that was having a number of problems.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

I believe I've encountered the same or similar ssh-keyscan problem.
local ssh - OpenSSH_5.1p1 Debian-5, OpenSSL 0.9.8g 19 Oct 2007
remote ssh - OpenSSH_4.3p2, OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008
The remote server was having "problems": 1) no connection; 2) connection and key returned; or 3) connection but hanging until remote time out and
disconnect. With the latter, ssh-keyscan aborted immediately with exit-code=255 (see attachment).

I disagree with the original poster in that I think that ssh-keyscan should continue in all cases except for an internal error. In our case, ssh-keyscan is buried several layers deep in wrapper scripts where it is being fed (today) 3690+ host names. Per the man pages, I was expecting it to continue regardless of what the remote servers did or didn't do.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Created attachment 1969
Fix(?) for premature ssh-keyscan abort.

This adds a local/static `cleanup_exit()' function to ssh-keyscan so that aborts in non-ssh-keyscan code can be converted to "continue"s while the `dispatch_run()' function is being executed. It mimics the already extant local/static `fatal()' function in using `exit()' instead of the `_exit()' used in the default cleanup.c.

Two observations:
1) I also incremented the `howmany()' argument #1 count by 1. This is probably unnecessary but I note that all other occasions where `howmany()' is used do this (and I'm chicken ...).
2) The current local/static `fatal()' function could possibly be removed and the default one, defined in fatal.c, be used.
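
The idea of turning an abort in shared code into a "continue" can be illustrated with a small Python analogy. This is only a sketch of the control flow described above, not the attached C patch: `HostAbort`, `in_dispatch`, and the stand-in `read_packet()` are hypothetical names, and the "bad" host prefix merely simulates a server that closes the connection.

```python
class HostAbort(Exception):
    """Raised in place of exiting when a per-host error occurs during dispatch."""

in_dispatch = False  # hypothetical flag: True while per-host work is running

def cleanup_exit(status):
    """Stand-in for the shared exit hook: abort the host, not the process."""
    if in_dispatch:
        raise HostAbort(status)
    raise SystemExit(status)

def read_packet(host):
    """Stand-in for shared transport code that bails out on a closed connection."""
    if host.startswith("bad"):
        print(f"Connection closed by {host}")
        cleanup_exit(255)
    return f"{host} ssh-rsa AAAA"

def scan(hosts):
    """Collect keys, skipping any host whose handling called cleanup_exit()."""
    global in_dispatch
    keys = []
    for host in hosts:
        in_dispatch = True
        try:
            keys.append(read_packet(host))
        except HostAbort:
            continue  # note the failure and move on to the next host
        finally:
            in_dispatch = False
    return keys
```

Outside the dispatch loop `cleanup_exit()` still terminates the program, which keeps genuine internal errors fatal.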

Revision history for this message
In , Count-mindrot (count-mindrot) wrote :

I'm running into the same problem on recent versions.

Revision history for this message
In , Count-mindrot (count-mindrot) wrote :

btw: I've elevated this to 'major', as it completely breaks the usefulness for ssh-keyscan in large networks, as the error condition (len == 0 in packet_read_seqnr() in packet.c; resulting in logit("Connection closed ... etc") and cleanup_exit(255);) is much easier to hit. On 10 runs of ssh-keyscan over ~3800 IPs I couldn't get a single complete run without hitting this. Please fix.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Mr. Kotes, I have a patch against openssh-5.[678]p1 for our problem that could be called a workaround or a fix depending on your way of looking at it. The probable reason that `packet_read_seqnr()' gets the len==0 is that one of the IPs from which you're attempting to get a key has a bad `sshd' server that times out because of the "LoginGraceTime". This, in turn, causes almost all of the other servers that have open sockets at that time to "LoginGraceTime" out as well. To back up a bit, `packet_read_seqnr()' calls the vanilla `cleanup_exit()' that in the current ssh-keyscan aborts immediately rather than continuing like ssh-keyscan's `fatal()' call does. This is part 1 of the fix. The second part is to teach ssh-keyscan how to deal with the problem when a bad server times out. My patch does both although the code seems a bit kludgy to me.

Unfortunately, we haven't had a bad server recently so I can't completely test the patch (I'm using it in test mode now) and, until then, I don't want to send it to the OpenSSH folks. FWIW - our host farm is 3500+ with an additional 1200+ to be online soon and probably more in the late summer.

In my opinion, this should be marked as a bug against the current openssh variant. How do I go about doing that?

If you'd like to have a copy of the current patch so you can test it, please tell me where to send it.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

I've noted that this is a ssh-keyscan bug and I've attached it to the openssh-5.8p1 release.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Oops, can't read. ssh-keygen ain't ssh-keyscan. Changed the component back to Miscellaneous. Hey, isn't ssh-keyscan a component also?

Revision history for this message
Daniel Richard G. (skunk) wrote :

I'm still seeing this with openssh-client 1:5.5p1-4ubuntu5. From a makefile that invokes "ssh-keyscan -v":

[...]
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: match: OpenSSH_3.6.1p2 pat OpenSSH_3.*
# A.B.C.D SSH-1.99-OpenSSH_3.6.1p2
debug1: Enabling compatibility mode for protocol 2.0
debug1: SSH2_MSG_KEXINIT sent
Connection closed by A.B.C.D
make: *** [ssh_known_hosts.new] Error 255

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

I reported this a while ago on the Ubuntu Launchpad bug tracker:

    https://bugs.launchpad.net/openssh/+bug/483928

I've also confirmed that the bug persists in OpenSSH 5.8p1, and I gave your patch a try to scan a corporate network of 6000+ hosts.

Most of the hosts don't appear to be running SSH, but I can't be sure if that's really the case, or if ssh-keyscan(1) is bugging out on many of the connections. It does run through to the end of the list, but with some anomalies, like "Connection closed by A.B.C.D" or "Received disconnect from A.B.C.D: 2: Client Disconnect" messages that crop up multiple times for the same IP address.

Is it possible that one bad connection can still take down active good connections, even with this patch?

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Ummm. If you're referring to the "original" patch that I submitted, It's out-of-date. It was written before I had a complete(?) handle on what was going wrong. Included with this comment is an attachment with the newer patch against the openssh-5.8p1 source.

A bit of explanation. Some of the mods are for clarity. When you're working, as we are, with a large number of hosts, "socket" doesn't tell you very much as to where the problem is occurring. Same with "Bad hostkey alg".

In the patch, I've attempted to allow `ssh-keyscan' to continue if the encountered problem is external in origin. Some of the items that you noticed are (I think) addressed by this patch.

NOTE - NOTE - NOTE - this patch has NOT been completely verified. The "closed by remote because of LoginGraceTime" outs need a bad remote server so that that can be done. Unfortunately, all of our servers are playing nice-nice at present. I did have an earlier buggy variant of the patch that "tried" to execute the patch code but I screwed up and generated an infinite loop instead. The basic code is running as the `ssh-keyscan' of choice in our setup.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Created attachment 2000
openssh-5.8p1 - patch for ssh-keyscan

Is this comment different from the other one????

Later (better?) patch to fix `ssh-keyscan's premature aborting observed in large network scans. Hopefully, there are sufficient comments in the code to describe the fix. Please ask if you find something annoying. I also have patches for 5.6p1 and 5.7p1.

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

With this updated patch, I'm seeing at least twice as many host keys returned as before (up to ~2400, from ~1000), and the "multiple errors from the same IP" oddness is gone now.

The more-specific error messages are very helpful. I do notice that hosts which are firewalled or otherwise fail to yield a server banner are not cited with an error message to stderr. I think this would be useful if it can be done, that every host listed in the input is spoken for one way or the other in the output, because that way you can be sure that no host is being silently dropped by the program.
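
Checking that every input host is "spoken for" can also be done outside the program, by comparing the input list against the captured stderr. The sketch below assumes the three line shapes quoted earlier in this report (banner lines, "Connection closed by", and "Received disconnect from"); any other message format would need its own pattern, and the function names are hypothetical.

```python
import re

# Line shapes taken from the stderr samples in this report.
PATTERNS = [
    re.compile(r"^# (\S+) SSH-"),                       # server banner
    re.compile(r"^Connection closed by (\S+)"),
    re.compile(r"^Received disconnect from (\S+?): "),
]

def mentioned_hosts(stderr_text):
    """Collect every host that appears in a banner or error line."""
    seen = set()
    for line in stderr_text.splitlines():
        for pattern in PATTERNS:
            match = pattern.match(line.strip())
            if match:
                seen.add(match.group(1))
                break
    return seen

def unaccounted(input_hosts, stderr_text):
    """Return the input hosts that stderr never mentions, in input order."""
    seen = mentioned_hosts(stderr_text)
    return [host for host in input_hosts if host not in seen]
```

Hosts returned by `unaccounted()` are the ones that were dropped silently, for example because they were firewalled and never sent a banner.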

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Created attachment 2005
Upgraded(?) patch to include extra ssh-keyscan logging.

Try this to log all attempt failures. I put it under control of a command line option, '-L'. One failure noted by ssh-keyscan is the ECONNREFUSED that I think should have caused a standard error message to be elided. Except for the ECONNREFUSED, all of the new messages are written by the `logit()' function. FWIW - this patch may or may not obsolete the patch supplied with attachment 2000 so I didn't check the obsolete:2000 box. I didn't test this patch out very thoroughly but what testing I did showed what I wanted.

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

aab, thanks for putting together this updated patch. I gave it a try, and whether due to the patch or another issue that I hadn't encountered before, it bombed out with this error:

[...]
# A.B.C.D SSH-2.0-dropbear_0.50
# W.X.Y.Z SSH-1.99-OpenSSH_3.9p1
# A.B.C.E SSH-2.0-dropbear_0.50
Connection closed by A.B.C.E
conalloc: attempt to reuse fdno 47
make: *** [ssh_known_hosts.unx.new] Error 255

A couple of ancillary notes on the patch:

1. The old and new filenames both have the .orig extension! I had to edit one of each pair so that the patch could apply.

2. IMO, there isn't a need to add a new -L option... are "Connection closed" and e.g. "no 'blah' hostkey alg(s)" really categorically distinct to the end user?

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

># A.B.C.D SSH-2.0-dropbear_0.50
># W.X.Y.Z SSH-1.99-OpenSSH_3.9p1
># A.B.C.E SSH-2.0-dropbear_0.50
>Connection closed by A.B.C.E
>conalloc: attempt to reuse fdno 47
>make: *** [ssh_known_hosts.unx.new] Error 255

Oh boy, I missed something. Is this repeatable? I think I saw this myself somewhere along the line but I thought I had fixed the problem. Since my time is pretty much taken up for the next week or so, I don't know when I'll be able to check.

>1. The old and new filenames both have the .orig extension! I had to
>edit one of each pair so that the patch could apply.

I just looked at the attachment. There are two ".orig"s per file. One is on the `diff' statement and is ignored (I hope) by `patch'. The second is one line down on the "old" file identifier (---) and `patch' does use that. Which one was your `patch' making complaints about?

>2. IMO, there isn't a need to add a new -L option... are "Connection
>closed" and e.g. "no 'blah' hostkey alg(s)" really categorically
>distinct to the end user?

STDERR is extremely noisy as it is. In my case, at this time, I think I'd add on the order of 7000+ extra lines when I use '-L' that I'd need to winnow to find any important data. Besides, you can't forget that god called "upward compatibility" you know (;-}).

And yes, if you meant "Connection timed out", I think that they are distinct at least from a Systems Administrator (me) point of view.

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

(In reply to comment #17)
>
> Oh boy, I missed something. Is this repeatable? I think I saw this
> myself somewhere along the line but I thought I had fixed the problem.
> Since my time is pretty much taken up for the next week or so, I don't
> know when I'll be able to check.

Well, I tried it again, and it ran to completion. Must be a rare failure mode.

> I just looked at the attachment. There are two ".orig"s per file. One
> is on the `diff' statement and is ignored (I hope) by `patch'. The
> second is one line down on the "old" file identifier (---) and `patch'
> does use that. Which one was your `patch' making complaints about?

Presumably the second one. It was looking for e.g. kex.c.orig rather than kex.c.

> STDERR is extremely noisy as it is. In my case, at this time, I think
> I'd add on the order of 7000+ extra lines when I use '-L' that I'd need
> to winnow to find any important data. Besides, you can't forget that
> god called "upward compatibility" you know (;-}).
>
> And yes, if you meant "Connection timed out", I think that they are
> distinct at least from a Systems Administrator (me) point of view.

*shrugs* I'd pretty much expect a flood of information anyway. Given a large network, you have to use grep(1) or the like to make any sense of it.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Created attachment 2008
patch - fixes bug in previous patch

>> Oh boy, I missed something. Is this repeatable? I think I saw this
>> myself somewhere along the line but I thought I had fixed the problem.
>> Since my time is pretty much taken up for the next week or so, I don't
>> know when I'll be able to check.
>
>Well, I tried it again, and it ran to completion. Must be a rare
>failure mode.

Yep, I missed something. The sockets associated with ALL connections processed by the `keygrab_ssh2()' function are closed twice. I missed the close in the `packet.c:packet_close()' function that's called at the bottom of the `keygrab_ssh2()' function. I had assumed (bad bad word) that the only close was in the `confree()' function. Work/not work is up to the gods and the relative connection timings I think.
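
The timing dependence follows from how descriptor numbers are reused: once a descriptor is closed, the kernel may hand the same number to the next socket, so a second close through the stale number can hit a different, live connection. A common guard is to make close idempotent. The sketch below shows that guard in Python; the `Connection` class is hypothetical and is not how the patch itself addresses the problem.

```python
import os

class Connection:
    """Owns one file descriptor and closes it at most once."""

    def __init__(self, fd):
        self.fd = fd

    def close(self):
        """Close the descriptor; return False if it was already closed."""
        if self.fd is None:
            return False
        os.close(self.fd)
        self.fd = None  # forget the number so it can never be closed again
        return True
```

With this guard, a second `close()` from a different code path becomes a harmless no-op rather than closing whatever connection has since been given the same descriptor number.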

>> I just looked at the attachment. There are two ".orig"s per file. One
>> is on the `diff' statement and is ignored (I hope) by `patch'. The
>> second is one line down on the "old" file identifier (---) and `patch'
>> does use that. Which one was your `patch' making complaints about?
>
>Presumably the second one. It was looking for e.g. kex.c.orig rather
>than kex.c.

The format of this patch is the same as before. If you are using the current GNU `patch', you should be able to `patch [-p0] < patch' in the "openssh-5.8p1" parent directory. If you're in the "openssh-5.8p1" directory itself, you should be able to `patch -p1 <patch'.

>> STDERR is extremely noisy as it is. In my case, at this time, I think
>> I'd add on the order of 7000+ extra lines when I use '-L' that I'd need
>> to winnow to find any important data. Besides, you can't forget that
>> god called "upward compatibility" you know (;-}).
>>
>> And yes, if you meant "Connection timed out", I think that they are
>> distinct at least from a Systems Administrator (me) point of view.
>
>*shrugs* I'd pretty much expect a flood of information anyway. Given a
>large network, you have to use grep(1) or the like to make any sense of
>it.

I think that, if/when this patch is actually submitted to the OpenSSH folks, I'll let the mavens there decide whether or not to have a '-L' option.

To satisfy my curiosity, did you observe any missing hosts when you use the '-L' option (and it actually completes)?

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

(In reply to comment #19)
>
> Yep, I missed something. The sockets associated with ALL connections
> processed by the `keygrab_ssh2()' function are closed twice. I missed
> the close in the `packet.c:packet_close()' function that's called at
> the bottom of the `keygrab_ssh2()' function. I had assumed (bad bad
> word) that the only close was in the `confree()' function. Work/not
> work is up to the gods and the relative connection timings I think.

I tried the new patch, and no errors. I'll give it a few more runs to see if anything breaks again.

> The format of this patch is the same as before. If you are using the
> current GNU `patch', you should be able to `patch [-p0] < patch' in the
> "openssh-5.8p1" parent directory. If your in the "openssh-5.8p1"
> directory itself, you should be able to `patch -p1 <patch'.

Oh, I know about -p0 vs. -p1 and such. The problem is that the patch, as up currently, looks for foo.c.orig instead of foo.c. In other words,

    --- dir/foo.c.orig
    +++ dir/foo.c.orig (WRONG)

    --- dir/foo.c.orig
    +++ dir/foo.c (CORRECT)

> I think that, if/when this patch is actually submitted to the OpenSSH
> folks, I'll let the mavins there decide whether or not to have a '-L'
> option.

Fair enough, though I think there might be more value in just (unconditionally) printing a tally at the end of how many valid hosts were found, how many had no host algs, etc. (a bit like what "md5sum -c" does when it encounters errors).
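
A tally of this kind can be approximated today by post-processing the captured stderr. The sketch below counts only the three line shapes shown earlier in this report; the category labels and function names are made up for illustration, and a banner line only means the server answered, not that a key was obtained.

```python
from collections import Counter

def tally(stderr_text):
    """Count stderr lines by outcome category."""
    counts = Counter()
    for line in stderr_text.splitlines():
        line = line.strip()
        if line.startswith("# "):
            counts["banner received"] += 1
        elif line.startswith("Connection closed by "):
            counts["connection closed"] += 1
        elif line.startswith("Received disconnect from "):
            counts["disconnect received"] += 1
    return counts

def format_tally(counts):
    """Render the counts one per line, sorted by label."""
    return "\n".join(f"{count} {label}" for label, count in sorted(counts.items()))
```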

> To satisfy my curiosity, did you observe any missing hosts when you use
> the '-L' option (and it actually completes)?

Ah, I forgot to report on this; my bad!

I do see a few hosts in the input list that are not mentioned anywhere in the stderr output. These appear to be strictly "alias" IP addresses, e.g. for an input line of

    10.0.0.1,10.0.0.2,10.0.0.3 host.example.com,10.0.0.1,10.0.0.2,...
             ^^^^^^^^ ^^^^^^^^
                   these

This is the correct behavior, I take it?

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

(In reply to comment #20)
> (In reply to comment #19)
>
>> The format of this patch is the same as before. If you are using the
>> current GNU `patch', you should be able to `patch [-p0] < patch' in the
>> "openssh-5.8p1" parent directory. If your in the "openssh-5.8p1"
>> directory itself, you should be able to `patch -p1 <patch'.
>
>Oh, I know about -p0 vs. -p1 and such. The problem is that the patch,
>as up currently, looks for foo.c.orig instead of foo.c. In other words,
>
> --- dir/foo.c.orig
> +++ dir/foo.c.orig (WRONG)
>
> --- dir/foo.c.orig
> +++ dir/foo.c (CORRECT)

Hmmm, but the patch doesn't have two consecutive lines with ".orig" as you describe above. From observation, the first three lines for each modified file are similar to

diff -u openssh-5.8p1/kex.c.orig openssh-5.8p1/kex.c
--- openssh-5.8p1/kex.c.orig 2010-09-24 08:11:14.000000000 -0400
+++ openssh-5.8p1/kex.c 2011-02-11 18:14:03.396688000 -0500

Are you using the GNU patch? The attached patch text works for me with no changes whatsoever. Or to ask it somewhat differently, does your `patch' process WRONG even though the text is actually CORRECT? Is it possible that your `patch' is not ignoring the "diff" line?

>> I think that, if/when this patch is actually submitted to the OpenSSH
>> folks, I'll let the mavins there decide whether or not to have a '-L'
>> option.
>
> Fair enough, though I think there might be more value in just
> (unconditionally) printing a tally at the end of how many valid hosts
> were found, how many had no host algs, etc. (a bit like what "md5sum
> -c" does when it encounters errors).

Actually, after I had sent the previous, I thought I should have added that the described approach is a cop out on my part (;-}).

>> To satisfy my curiosity, did you observe any missing hosts when you use
>> the '-L' option (and it actually completes)?
>
> Ah, I forgot to report on this; my bad!
>
> I do see a few hosts in the input list that are not mentioned anywhere
> in the stderr output. These appear to be strictly "alias" IP addresses,
> e.g. for an input line of
>
> 10.0.0.1,10.0.0.2,10.0.0.3 host.example.com,10.0.0.1,10.0.0.2,...
> ^^^^^^^^ ^^^^^^^^
> these
>
> This is the correct behavior, I take it?

I submit hosts, one per line, as the data to ssh-keyscan and am not familiar with the "alias" format. In fact, your comments clarified it somewhat for me. If you meant that "10.0.0.1" was seen in stderr and the others weren't, I believe that this is the "correct" behavior if ssh-keyscan had success with "10.0.0.1". I think the code tells me that it stops looking after the first IP/host with which it has success.
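
The behavior described here, trying each address in the first column and stopping at the first success, can be sketched as follows. This reflects only the reading of the code given in this comment, not verified ssh-keyscan behavior; `parse_input_line`, `first_success`, and the `try_scan` callback are hypothetical.

```python
def parse_input_line(line):
    """Split 'addr1,addr2 name1,name2,...' into (address list, names field)."""
    fields = line.split()
    addresses = fields[0].split(",")
    # With a single column, the same text serves as both address and name.
    names = fields[1] if len(fields) > 1 else fields[0]
    return addresses, names

def first_success(addresses, try_scan):
    """Try each address in order; stop at the first one that yields a key."""
    for address in addresses:
        key = try_scan(address)
        if key is not None:
            return address, key
    return None, None
```

Under this model, the addresses listed after the first successful one are never contacted, which would explain why they do not show up in stderr.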

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

(In reply to comment #21)
>
> Hmmm, but the patch doesn't have two consecutive lines with ".orig" as
> you describe above. From observation, the first three lines for each
> modified file are similar to
>
> diff -u openssh-5.8p1/kex.c.orig openssh-5.8p1/kex.c
> --- openssh-5.8p1/kex.c.orig 2010-09-24 08:11:14.000000000 -0400
> +++ openssh-5.8p1/kex.c 2011-02-11 18:14:03.396688000 -0500

Um. Are we looking at the same file? Here are the first three lines of your most recent patch (attachment 2008, in comment #19):

--- openssh-5.8p1/kex.c.orig 2010-09-24 08:11:14.000000000 -0400
+++ openssh-5.8p1/kex.c.orig 2011-02-11 18:14:03.396688000 -0500
@@ -49,6 +49,7 @@

> Are you using the GNU patch? The attached patch text works for me with
> no changes whatsoever. Or to ask it somewhat differently, does your
> `patch' process WRONG even though the text is actually CORRECT? Is it
> possible that your`patch' is not ignoring the "diff" line?

This is on an Ubuntu Linux system:

host:/tmp/openssh-5.8p1$ patch -p1 --dry-run <aab-2008.patch
patching file kex.c.orig
Hunk #1 FAILED at 49.
Hunk #2 FAILED at 367.
2 out of 2 hunks FAILED -- saving rejects to file kex.c.orig.rej
patching file packet.c.orig
Hunk #1 FAILED at 1025.
Hunk #2 FAILED at 1035.
Hunk #3 FAILED at 1100.
3 out of 3 hunks FAILED -- saving rejects to file packet.c.orig.rej
[...]

If I edit each "+++" line in the patch, it applies cleanly.
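
That manual edit can be automated. The sketch below strips a trailing ".orig" from each "+++" header that directly follows a "---" header and leaves everything else alone; `fix_new_file_headers` is a hypothetical helper written for this report, not part of any patch tool.

```python
import re

def fix_new_file_headers(patch_text):
    """Drop '.orig' from '+++' file headers that follow a '---' header."""
    fixed = []
    previous = ""
    for line in patch_text.splitlines():
        if previous.startswith("--- ") and line.startswith("+++ "):
            # Only the file name is touched; the timestamp is kept as-is.
            line = re.sub(r"^(\+\+\+ \S+?)\.orig(?=\s|$)", r"\1", line)
        fixed.append(line)
        previous = line
    return "\n".join(fixed) + "\n"
```

The result can then be fed to `patch -p1` as before.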

> I submit hosts, one per line, as the data to ssh-keyscan and am not
> familiar with the "alias" format. In fact, your comments clarified it
> somewhat for me. If you meant that "10.0.0.1" was seen in stderr and
> the others weren't, I believe that this is the "correct" behavior if
> ssh-keyscan had success with "10.0.0.1". I think the code tells me
> that it stops looking after the first IP/host with which it has
> success.

Okay, that seems reasonable. (Yes, I only saw 10.0.0.1 and not the other two.)

The sample "Input format" line in the ssh-keyscan man page has two IP addresses in the first column, though the semantics of this are left unexplained. My assumption is that it's meant for hosts with round-robined DNS names, where the SSH server at each address uses the same host keys. (Which would be consistent with what you describe.)

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

(In reply to comment #22)
> (In reply to comment #21)
>>
>> Hmmm, but the patch doesn't have two consecutive lines with ".orig" as
>> you describe above. From observation, the first three lines for each
>> modified file are similar to
>>
>> diff -u openssh-5.8p1/kex.c.orig openssh-5.8p1/kex.c
>> --- openssh-5.8p1/kex.c.orig 2010-09-24 08:11:14.000000000 -0400
>> +++ openssh-5.8p1/kex.c 2011-02-11 18:14:03.396688000 -0500
>
> Um. Are we looking at the same file? Here are the first three lines of
> your most recent patch (attachment 2008 [details], in comment #19):
>
> --- openssh-5.8p1/kex.c.orig 2010-09-24 08:11:14.000000000 -0400
> +++ openssh-5.8p1/kex.c.orig 2011-02-11 18:14:03.396688000 -0500
> @@ -49,6 +49,7 @@

Boy, I'm not sure that we are looking at the same file. I just did a

  wget -Ojunk https://bugzilla.mindrot.org/attachment.cgi?id=2008

and got my version. When I click on the attachment line near the top of the bug #1213 comments (this page - "patch - fixes bug ..."), I get my version. Clicking on the "details" button that you specified above, I get my version.

Have we encountered a bug in yet another utility? Browser problem?

I should have thanked you earlier for "testing" the patch so I'll do so now - THANKS.

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

Okay, I think I see what's going on here.

When you click on the "attachment 2008" link, you're taken to a fancy side-by-side rendition of the diff. At the top, there are a series of links:

    View | Details | Raw Unified | Return to bug 1213 | Differences ...

I was clicking on "Raw Unified," and got the broken patch. "View" goes to the URL you gave (which yields the correct patch). Confusing, isn't it?

Anyway, I'm happy to test your patches, because that means I can get the company-wide ssh_known_hosts file I've been needing so much :-)

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

(In reply to comment #24)
> Okay, I think I see what's going on here.
>
> When you click on the "attachment 2008 [details]" link, you're taken to a fancy
> side-by-side rendition of the diff. At the top, there are a series of
> links:
>
> View | Details | Raw Unified | Return to bug 1213 | Differences ...
>
> I was clicking on "Raw Unified," and got the broken patch. "View" goes
> to the URL you gave (which yields the correct patch). Confusing, isn't
> it?

Yes, it is indeed confusing. I've never used the exact path you used to get to the patch so I missed seeing the "bad" representation of it.

One of the things that I've observed in generating the "ssh_known_hosts" file is that it can end up having a quite variable keyset as it depends on ALL of the hosts ALWAYS being up (don't we wish). It's probably overkill but we generate the "hosts" file once an hour via a set of wrapper scripts. Included within the scripts is a database that contains the current keys for all hosts that are currently supposed to be active (previously acquired via these same scripts). This allows us two capabilities: 1) if there is no key returned for some host, the database can supply the last one and 2) it allows us to see if there have been any changes in the keys that might signify a security break.

A second part is a condensation of the keys via globbing. This assumes that a number of the hosts have the same key. The cluster nodes on our private networks are basically all cloned so we do get considerable condensation. Right now, for 4700+ hosts, the "hosts" file has 334 entries.
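
A simplified version of this condensation groups hosts that share a key onto one known_hosts line. Note the substitution: the sketch joins host names with commas, which the known_hosts format allows, rather than generating glob patterns as the scripts described here do. `condense` is a hypothetical name.

```python
def condense(entries):
    """Group (host, key_type, key_blob) tuples into one line per distinct key."""
    groups = {}
    for host, key_type, key_blob in entries:
        groups.setdefault((key_type, key_blob), []).append(host)
    return [
        f"{','.join(hosts)} {key_type} {key_blob}"
        for (key_type, key_blob), hosts in groups.items()
    ]
```

For cloned cluster nodes this collapses many entries into a few lines, though glob patterns would be shorter still.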

The core script is a highly modified variant of the GNU licensed script, "make_ssh_known_hosts.pl", that was in "ssh-1.0.0" (circa 1998). Note that's "ssh" not "openssh". My original came from "ssh-1.2.26". For some reason, it disappeared when the OpenSSH folks took over. For Linux boxes, it's still dependent on my bind 9 hack of `nslookup' as I haven't had time to modify it to use the current GNU `host'.

Would you be interested in anything like this?

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

(In reply to comment #25)
>
> Yes, it is indeed confusing. I've never used the exact path you used
> to get to the patch so I missed seeing the "bad" representation of it.

Lord knows what the point of that link even is... I clicked on it only because "Raw" suggested that it would yield the "real" text/plain diff instead of a fancy HTML rendition.

> Would you be interested in anything like this?

I appreciate the offer, but a database would be overkill for my use case. I'm not in my company's IT department, and metamorphosing host keys on those 6000+ hosts are waaaay out of my purview. (I can't get too worked up over the security implications, either, since much worse than that is officially tolerated.)

If anything, the most I would do is put together a Perl script to merge an old and new known_hosts file, such that new entries override old ones, and old ones that don't have a newer replacement are kept.
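For what it's worth, the merge described here is small enough to sketch. The version below is in Python rather than Perl, and it assumes plain, unhashed known_hosts lines of the form "hosts keytype key" with no marker prefixes. The function name `merge_known_hosts` is made up for illustration.

```python
def merge_known_hosts(old_lines, new_lines):
    """Merge two known_hosts listings so that new entries override old ones.

    Entries are keyed on (host field, key type). An old entry with no
    newer replacement is kept as-is.
    """
    merged = {}
    # Old lines go in first, so a later (new) line with the same key
    # simply overwrites the old one.
    for line in list(old_lines) + list(new_lines):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 3:
            continue  # not a "hosts keytype key" line
        merged[(fields[0], fields[1])] = line
    return [merged[k] for k in sorted(merged)]
```

Keying on the key type as well as the host means a new RSA key would not throw away an old key of a different type for the same host. Whether that is the desired behaviour is a judgment call.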

Revision history for this message
In , Paul Wouters (paul-cypherpunks) wrote :

(In reply to comment #26)
> (In reply to comment #25)

> If anything, the most I would do is put together a Perl script to merge
> an old and new known_hosts file, such that new entries override old
> ones, and old ones that don't have a newer replacement are kept.

You really want to look at SSHFP DNS records protected by DNSSEC, and setting "VerifyHostKeyDNS ask" in your /etc/ssh/ssh_config.

You can use the "sshfp" tool for that, which is exactly why I was interested in this bug. sshfp can AXFR a zone, and use ssh-keyscan to connect to all A records in the zone and print the SSHFP record to add in your zones.

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

(In reply to comment #27)
>
> You really want to look at SSHFP DNS records protected by DNSSEC, and
> setting VerifyHostKeyDNS ask in your /etc/ssh/ssh_config

I would, if I were in my company's IT department :-)

(All I'm doing is generating an ssh_known_hosts file that is accessible to a handful of clients via a local fileserver. The network infrastructure beyond that is completely out of my hands.)

> you can use the "sshfp" tool for that, which is exactly why I was
> interested in this bug. sshfp can AXFR a zone, and use ssh-keyscan to
> connect to all A records in the zone and print the SSHFP record to add
> in your zones.

Hmm, that could be useful. While I couldn't do much with the SSHFP records, the AXFR->keyscan functionality would be useful. (Right now, I'm doing the AXFR via host(1), and using a Perl script to reformat that into a hosts list for ssh-keyscan(1).)
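A sketch of that reformatting step, in Python rather than Perl, might look like this. It assumes the zone listing contains lines of the form "name has address a.b.c.d", which should be checked against the output of the local host(1) build. The function name `hosts_from_axfr` is hypothetical.

```python
def hosts_from_axfr(listing):
    """Extract unique host names from a zone listing, in first-seen order.

    Assumption: address records appear as "<name> has address <ip>".
    All other lines (name servers, aliases, and so on) are ignored.
    """
    names = []
    seen = set()
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[1:3] == ["has", "address"]:
            name = parts[0].rstrip(".")  # drop a trailing root dot, if any
            if name not in seen:
                seen.add(name)
                names.append(name)
    return names
```

The resulting list can be written one name per line and handed to ssh-keyscan(1) with its -f option.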

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Comment on attachment 1961
One attempt at getting the rsa key from a remote server that was having a number of problems.

This has been resolved with attachment 2008.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

One of our `sshd' servers finally gave me sufficient problems to test the last of the patched code and, as far as I can tell, it worked.

Is there anybody out there that has any issues with the current patch? If not, I wonder if I can catch the attention of any of the OpenSSH folks. I note that this problem has yet to be assigned to anyone.

Or is there another route that I should take for attention?

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Created attachment 2016
Remove a bit of confusion from previous patch.

I guess I'm the one that has an issue with the previous patch. The hostkey alg error message always references the "other end" of the socket. On the server the message reads as if the client was the one that didn't have the necessary hostkey algorithms. The updated patch has modified verbiage for the server that attempts to distinguish the difference.

I have a general issue with this anyhow. Wouldn't it be possible to check the server algorithms BEFORE asking the server to return a key that it doesn't have? If I read the code correctly, the debug2:kex_parse_init messages indicate that the code extracts the list of algorithms that the server supports from the SSH2_MSG_KEXINIT response. Isn't that before the request? Right now both the server and the client issue the same abort message and that seems a waste of time (and log file space (;-})).

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Created attachment 2018
Add 'L' option to usage message

Another small issue. I forgot to add the new '-L' option to the usage message. Also modified some of the comments for clarity.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Created attachment 2021
Withdraw patch attachment #2018.

This missive just obsoletes (withdraws) the current variant of the patch. We just had a bad network glitch here and, because of it, ssh-keyscan called the `select()' function in the `packet_read_seqnr()' function with a NULL timeout value. Since the glitch meant the read was never going to receive any data, it occasionally did one of those hang forever thingys. The patch still works, albeit very crudely, if your network doesn't glitch like ours did.

It turns out that the original coders of ssh-keyscan missed(?) a call to the `packet_set_timeout()' function which in turn caused the above referenced NULL. I'm in the process of rewriting the patch to include a "set" call.

FWIW - bugzilla won't let me submit this without a non-null file. The new attachment is a NL.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Created attachment 2057
Fix for previous patch variant.

For all those waiting breathlessly (ha) for a correction to the ssh-keyscan patch I submitted earlier, here it is. I apologize for not getting it here sooner.

This variant adds a call to the `packet_set_timeout()' function using the time value set or defaulted to on the command line by the '-T' option. The man page actually implies that this is the case but the code to implement it was never included. Part of the new code is a trap for the timeout condition and a resetting of the remaining active socket's timeout values to compensate for the time used waiting for the slow/braindead server that caused the timeout.
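To illustrate the difference the timeout makes, here is a minimal Python sketch. It is an analogy only, not the patch itself and not OpenSSH code. The helper name `read_with_timeout` is hypothetical.

```python
import select
import socket


def read_with_timeout(sock, timeout_s, nbytes=4096):
    """Return data read from sock, or None if nothing arrives in time.

    Passing None as the timeout to select() would block until data
    arrives, which is the "hang forever" case when the peer has gone silent.
    """
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        return None  # timed out: the caller can give up on this host
    return sock.recv(nbytes)


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(read_with_timeout(a, 0.1))  # nothing has been sent yet
    b.sendall(b"SSH-2.0-example\r\n")
    print(read_with_timeout(a, 1.0))
    a.close()
    b.close()
```

A scanner that treats the `None` result as a per-host failure can log it and move on to the next host instead of stalling.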

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Forgot to change the release to 5.8p2.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Change component from "miscellaneous" to the new "ssh-keyscan".

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

Yet another failure mode...

[...]
# XXX.YYY.ZZ.8 SSH-2.0-Sun_SSH_1.1.3
# XXX.YYY.ZZ.9 SSH-2.0-OpenSSH_3.8.1p1
# XXX.YYY.ZZ.14 SSH-2.0-OpenSSH_4.3
# 10.10.1.35 SSH-2.0-RomSShell_4.62
Received disconnect from 10.10.1.35: 2: Protocol Timeout
make: *** [ssh_known_hosts.unx.new] Error 255

This is with 5.8p1 still. aab@, I'll have to give your latest patch a try.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

I haven't seen this one before. The text you included indicates that ssh-keyscan was processing a Protocol 2 key and it should be using the modified code to do it. Is there any way that you could send me a traceback when the failure occurs?

FWIW - I think the " 2: Protocol Timeout" part of the message comes from the remote "SSH-2.0-RomSShell_4.62" server because I couldn't find that text in the OpenSSH source. What is "RomSShell"?

Changed in openssh:
importance: Unknown → High
status: Unknown → Confirmed
Revision history for this message
In , Daniel Richard G. (skunk) wrote :

(In reply to comment #38)
> I haven't seen this one before. The text you included indicates that
> ssh-keyscan was processing a Protocol 2 key and it should be using the
> modified code to do it. Is there any way that you could send me a
> traceback when the failure occurs?

I'll do that, when I'm back in the office. I'll use your patch. (This was with the stock Ubuntu build; it was just a failure mode that hadn't been noted here before.)

> FWIW - I think the " 2: Protocol Timeout" part of the message comes
> from the remote "SSH-2.0-RomSShell_4.62" server because I couldn't find
> that text in the OpenSSH source. What is "RomSShell"?

It seems to be an OEM embedded implementation of SSH... this was probably a router or something.

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

Okay, I tried Ubuntu's packaging of OpenSSH (version 1:5.8p1-7ubuntu1) with your patch, and it powered through everything. Here is a list of all the error messages I received:

A.B.C.D: Connection closed by remote host
Connection closed by A.B.C.D
Connection to A.B.C.D timed out while waiting to read
Received disconnect from A.B.C.D: 10: Protocol error
Received disconnect from A.B.C.D: 10: Protocol error
Received disconnect from A.B.C.D: 11: SSH Disabled
Received disconnect from A.B.C.D: 2: Client Disconnect
Received disconnect from A.B.C.D: 2: Protocol Timeout
connect (`A.B.C.D'): Network is unreachable
no 'ssh-rsa' hostkey alg(s) for A.B.C.D
read (A.B.C.D): Connection reset by peer
read (A.B.C.D): No route to host

(This is ssh-keyscan output with /^#.*$/ filtered out, all IPs zapped, and 'sort -u'd)
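For anyone wanting to produce the same kind of summary, the filtering can be sketched in Python as below. This is an approximation of the steps just described, not the commands actually used. It only recognizes dotted IPv4 addresses, and the function name `summarize_errors` is made up.

```python
import re

# Matches dotted-quad IPv4 addresses only; host names are left untouched.
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def summarize_errors(stderr_text, placeholder="A.B.C.D"):
    """Drop '#' banner lines, mask IPs, and return unique messages, sorted."""
    messages = set()
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and banner lines
        messages.add(IPV4.sub(placeholder, line))
    return sorted(messages)
```

Stripping whitespace before deduplicating keeps otherwise identical messages from showing up twice.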

Now the question is, why hasn't this been checked in already! (Have you tried making some noise on the mailing list?)

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

(In reply to comment #40)
> Okay, I tried Ubuntu's packaging of OpenSSH (version 1:5.8p1-7ubuntu1)
> with your patch, and it powered through everything. Here is a list of
> all the error messages I received:
>
> A.B.C.D: Connection closed by remote host
> Connection closed by A.B.C.D
> Connection to A.B.C.D timed out while waiting to read
> Received disconnect from A.B.C.D: 10: Protocol error
> Received disconnect from A.B.C.D: 10: Protocol error
> Received disconnect from A.B.C.D: 11: SSH Disabled
> Received disconnect from A.B.C.D: 2: Client Disconnect
> Received disconnect from A.B.C.D: 2: Protocol Timeout
> connect (`A.B.C.D'): Network is unreachable
> no 'ssh-rsa' hostkey alg(s) for A.B.C.D
> read (A.B.C.D): Connection reset by peer
> read (A.B.C.D): No route to host
>
> (This is ssh-keyscan output with /^#.*$/ filtered out, all IPs zapped,
> and 'sort -u'd)

The number of ways that key access can be terminated keeps increasing, doesn't it?

FWIW - the message "A.B.C.D: Connection closed by remote host" has been changed to "read(A.B.C.D): Connection closed by remote host" to be more consistent with the other messages (as above) issued in the same code block.

> Now the question is, why hasn't this been checked in already! (Have you
> tried making some noise on the mailing list?)

My oops. I have had my focus redirected to other projects and, besides, I'm very lazy (;-}).

Dumb me, I thought at least a question or two would be forthcoming from the OpenSSH folks. Guess not. I saw the mailing list reference in the README and promptly forgot about it. I will send the patch there. I apologize for the slowness.

Question for you. The ssh-keyscan code currently limits the maximum number of used file descriptors to <256. The biggest problem that I've seen with that number is, if you ever have a very large number of down hosts (which we have had), the code uses the available fds and has to wait for a '-Tn' timeout on one of them to start another key access. I've made a local modification that changes that number to 512. The code seems smart enough so that, if the OS has smaller limits, nothing will break. Right now Debian defaults to 1024 fds max and (at least our) Redhat to 20480. So 512 is a modest increase. Would you have an opinion on this?

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

(In reply to comment #41)
>
> The number of ways that key access can be terminated keeps increasing,
> doesn't it?

I hope it won't be necessary to enumerate them all before this bug can be closed!

> My oops. I have had my focus redirected to other projects and,
> besides, I'm very lazy (;-}).
>
> Dumb me, I thought at least a question or two would be forthcoming from
> the OpenSSH folks. Guess not. I saw the mailing list reference in the
> README and promptly forgot about it. I will send the patch there. I
> apologize for the slowness.

Hey, it's your patch. All the fame and glory will go to you ;-)

> Question for you. The ssh-keyscan code currently limits the maximum
> number of used file descriptors to <256. The biggest problem that I've
> seen with that number is, if you ever have a very large number of down
> hosts (which we have had), the code uses the available fds and has to
> wait for a '-Tn' timeout on one of them to start another key access.
> I've made a local modification that changes that number to 512. The
> code seems smart enough so that, if the OS has smaller limits, nothing
> will break. Right now Debian defaults to 1024 fds max and (at least
> our) Redhat to 20480. So 512 is a modest increase. Would you have an
> opinion on this?

Debian has 1024 fds max per process, or across the entire system? (If a local DoS attack were really as easy as calling open() ~1000 times...)

If the limit is for the whole system, that would be a good reason to make this an option, or a recognized environment variable. If for a single process, then just call sysconf(_SC_OPEN_MAX) and go to town.
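As a sketch of the second option (in Python for brevity, not the C that ssh-keyscan would need), sysconf(_SC_OPEN_MAX) reports the per-process limit, and the scanner could cap its concurrency against it. The function name, the default of 512 and the `reserve` headroom are all assumptions for illustration. `os.sysconf` is available on Unix only.

```python
import os


def scan_fd_limit(requested=512, reserve=16):
    """Return how many descriptors a scanner might use concurrently.

    requested: the desired ceiling (512 is the figure floated above)
    reserve:   descriptors left free for stdio, log files and the like
               (an arbitrary illustrative value)
    """
    per_process = os.sysconf("SC_OPEN_MAX")  # per-process open-file limit
    if per_process <= 0:
        return requested  # limit indeterminate: fall back to the request
    return max(1, min(requested, per_process - reserve))


if __name__ == "__main__":
    print(scan_fd_limit())
```

On a system with a smaller limit the result shrinks accordingly, so raising the requested ceiling does not break anything there.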

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

And a year later, this issue still afflicts OpenSSH 6.1p1 (as packaged by Ubuntu). Aab's patch still applies, if fuzzily, and still hardens up ssh-keyscan so that it can deal with my company's network.

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Created attachment 2197
Besides comments, includes patch for openssh-6.1p1

I knew I forgot to do something. I meant to CC you but obviously forgot. I apologize for the delay.

I finally got around to submitting the patch last week via direct email to <email address hidden>. Again I apologize for this particular delay.

I retired at the end of May (2012) after 38 years at Purdue University and the last six months there were a frenzy of "close down shop" activity. I still retain minimal access to some of the Purdue computers via what they call my "career account". This missive is being submitted from one of them.

I was lucky enough to get a basic clone of my workstation to take home but I didn't get it up and running until about a month ago. Since then I've been learning a lot of new stuff since my main access is now a Windows box (I know, I know). The patch submission was the first major thing I did with it. A copy of the patch for openssh-6.1p1 is attached.

Fuzzy??? Did I bug something up or is it because the patch you're using is somewhat dated?

-- Paul//aab

Revision history for this message
In , Paul Townsend (aabatpurdue) wrote :

Oops, forgot to change the version.

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

I don't think anyone will fault you for having more momentous matters to attend to! As it is, I've gone without doing a network scan for that long anyway.

Thanks for formally submitting the patch; hopefully this issue will be put to rest soon. Best of luck with the transition to a retired life, and may you continue to make contributions of value to our community :)

(The old patch applied to 6.1p1 with fuzz, yet without rejections, only because it hadn't been updated in a while.)

Revision history for this message
Samuel (samuel-t) wrote :

8 years later, we saw this on one of our customer's systems. Looking at the status of this entry, it looks like it's still unresolved?

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

According to the upstream bug, it was fixed in version 6.8. Which Ubuntu release are you using? Ubuntu >= Xenial should be fixed.
