Ubuntu
linux package

STC860:Tuleta-L:KVM:iap03:HTX logs miscompares on multiple adapters on ubuntu 16.04 guest.

Bug #1632466 reported by bugproxy on 2016-10-11

This bug report is a duplicate of: Bug #1632462: [Trusty->Yakkety] powerpc/64: Fix incorrect return value from __copy_tofrom_user. Edit Remove

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	New	Undecided	Taco Screen team

Bug Description

== Comment: #0 - Application Cdeadmin <email address hidden> - 2016-07-26 11:20:20 ==

== Comment: #1 - Application Cdeadmin <email address hidden> - 2016-07-26 11:20:22 ==
==== State: Open by: panico on 25 July 2016 18:02:09 ====

Contact Information:
====================
Defect Originator: Michael Panico
Defect Originator <email address hidden>

Backup: Deepti Umarani
Backup ST ID: Deepti S Umarani/India/IBM

System Info:
============
Machine Type:............8284-22A
Card Type:...............FSP2_P8LE
Current Boot Side:.......T
Next Boot Side:..........T
PT_Swap:.................1
Current Side Driver:.....fips860/b0713a_1630.860

Ubuntu KVM Host:
root@iaos1:/tmp# uname -a
Linux iaos1 4.4.0-30-generic #49-Ubuntu SMP Fri Jul 1 10:00:36 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
root@iaos1:/tmp# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu Yakkety Yak (development branch)
Release: 16.10
Codename: yakkety

Ubuntu Guest:
[htx@iap03] [1m/usr/lpp/htx/bin# [0m uname -a
Linux iap03 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:05:18 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
[htx@iap03] [1m/usr/lpp/htx/bin# [0m lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial

HTX level on ubuntu 16.04 guest:
[htx@iap03] [1m/usr/lpp/htx/bin# [0m dpkg -l | grep htxubuntu
ii htxubuntu 402 ppc64el

Problem Description:
====================
Running networks (via bridging) on three different guests (other linux, ubuntu 16.10 and ubuntu 16.04) the ubuntu 16.04 guest logged a miscompare on the austin adapter and the shinerT adapter. The shinerT ran for 24 hours and the austin for 72 hours before logging miscompares.

From /tmp/htxerr:
[htx@iap03] [1m/usr/lpp/htx/bin# [0m cat /tmp/htxerr

/dev/enp0s6 Jul 23 14:29:00 2016 err=00e30000 sev=4 hxecom
R(102.1.2.20.50854) connected to W(102.1.1.20.39073)
Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 51082
Miscompare at displacement (decimal) = 23876
wbuf = ffffffffffffffffffffffffffffffffffffffff
rbuf = 55555555555555555555555555555555ffffffff

For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good reads= 70739865, good bytes read= 2868570843930

/dev/enp0s6 Jul 23 14:29:00 2016 err=00d80000 sev=4 hxecom
Shutting down testing due to error flag - 1
R(102.1.2.20.50854) connected to W(102.1.1.20.39073)
For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good reads= 0, good bytes read= 0

/dev/enp0s7 Jul 23 14:29:00 2016 err=00fb0068 sev=1 hxecom
W(102.1.1.20.39073) connected to R(102.1.2.20.50854)
Wrote 878 characters, expected 963
Write failed - Connection reset by peer.
For stanza 1, TCP connection, bufmin=3, bufmax=2869, bufinc=94
Consecutive prior good writes= 25459432, good bytes written= 35283290616

htx_messages Jul 23 14:29:01 2016 err=00000000 sev=1 hxssup
hxecom HE program for enp0s6 terminated by exit(0) call.

htx_messages Jul 23 16:29:07 2016 err=00000000 sev=4 hang_monitor
hxecom for enp0s7 is HUNG!
Max run time (set in mdt) = 7200 seconds.
Current elasped time = 7206 seconds.

htx_messages Jul 23 18:29:07 2016 err=00000000 sev=4 hang_monitor
hxecom for enp0s7 is HUNG!
Max run time (set in mdt) = 7200 seconds.
Current elasped time = 14406 seconds.

htx_messages Jul 23 20:29:07 2016 err=00000000 sev=4 hang_monitor
hxecom for enp0s7 is HUNG!
Max run time (set in mdt) = 7200 seconds.
Current elasped time = 21606 seconds.

htx_messages Jul 23 22:29:08 2016 err=00000000 sev=4 hang_monitor
hxecom for enp0s7 is HUNG!
Max run time (set in mdt) = 7200 seconds.
Current elasped time = 28807 seconds.

htx_messages Jul 24 00:29:08 2016 err=00000000 sev=4 hang_monitor
hxecom for enp0s7 is HUNG!
Max run time (set in mdt) = 7200 seconds.
Current elasped time = 36007 seconds.

/dev/enp0s4 Jul 25 04:56:13 2016 err=00e30000 sev=4 hxecom
R(103.1.2.20.36665) connected to W(103.1.1.20.40459)
Miscompare with packet 0, data size = 30000. miscompare_count=0 pakLen = 30020
Miscompare at displacement (decimal) = 17508
wbuf = ffffffffffffffffffffffffffffffffffffffff
rbuf = e7f0e11400000000333039373600ffffffffffff

For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good reads= 42699991, good bytes read= 1731422119820

/dev/enp0s4 Jul 25 04:56:13 2016 err=00d80000 sev=4 hxecom
Shutting down testing due to error flag - 1
R(103.1.2.20.36665) connected to W(103.1.1.20.40459)
For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good reads= 0, good bytes read= 0

/dev/enp0s5 Jul 25 04:56:13 2016 err=00fb0068 sev=1 hxecom
W(103.1.1.20.40459) connected to R(103.1.2.20.36665)
Wrote 13877 characters, expected 40551
Write failed - Connection reset by peer.
For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good writes= 42700201, good bytes written= 1731430540751

htx_messages Jul 25 04:56:14 2016 err=00000000 sev=1 hxssup
hxecom HE program for enp0s4 terminated by exit(0) call.

/dev/enp0s5 Jul 25 04:56:13 2016 err=00d80000 sev=4 hxecom
Shutting down testing due to error flag - 10
W(103.1.1.20.40459) connected to R(103.1.2.20.36665)
For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good writes= 0, good bytes written= 0

The bridges in the guest xml:
    <interface type='bridge'>
      <mac address='52:54:00:ab:08:11'/>
      <source bridge='brenP3p5s0f0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:96:cb:70'/>
      <source bridge='brenP3p5s0f2'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:26:6f:ec'/>
      <source bridge='brenP3p5s0f3'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:a8:b0:09'/>
      <source bridge='brenP1p3s0f0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:a1:e6:4d'/>
      <source bridge='brenP1p3s0f1'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:b5:27:2a'/>
      <source bridge='brenp1s0f0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:be:da:6c'/>
      <source bridge='brenp1s0f1'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </interface>

There is also another defect opened (SW360179) on a glacier park logging a miscompare on another ubuntu 16.04 guest, but on a ubuntu 16.04 KVM host.
==== State: Open by: avkuttan on 26 July 2016 04:49:05 ====

Summary :
=========

device : /dev/enp0s6
Miscompare offset (decimal) : 23876
Transfer size (decimal) : 51082
Miscompare length (decimal) : 16

Reader with IP (102.1.2.20.50854) is connected to Writer with (102.1.1.20.39073) on enp0s6.
This miscompare shows 16 bytes of unexpected data in the read buffer.

Analysis details:
================

/tmp/htxerr:
------------
/dev/enp0s6 Jul 23 14:29:00 2016 err=00e30000 sev=4 hxecom <<=== device (/dev/enp0s6)
R(102.1.2.20.50854) connected to W(102.1.1.20.39073) <<=== TCP Connection details : Writer(102.1.1.20.39073) and Reader (102.1.2.20.50854)
Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 51082 <<=== Transfer size (51062 + 20 for header)
Miscompare at displacement (decimal) = 23876 <<=== Miscompare offset (23876)
wbuf = ffffffffffffffffffffffffffffffffffffffff
rbuf = 55555555555555555555555555555555ffffffff

For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good reads= 70739865, good bytes read= 2868570843930

Expected data (write buffer) at miscompare location:
----------------------------------------------------
00005c80: 89d3 bb05 0000 0000 3233 3638 3000 ffff ........23680...
00005c90: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7...
00005ca0: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294......
00005cb0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005cc0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005cd0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005ce0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005cf0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005d00: 89d3 bb05 0000 0000 3233 3830 3800 ffff ........23808...
00005d10: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7...
00005d20: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294......
00005d30: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005d40: ffff ffff ffff ffff ffff ffff ffff ffff ................ <<=== start of miscompare
00005d50: ffff ffff ffff ffff ffff ffff ffff ffff ................ <<=== end of miscompare
00005d60: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005d70: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005d80: 89d3 bb05 0000 0000 3233 3933 3600 ffff ........23936...
00005d90: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7...
00005da0: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294......
00005db0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005dc0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005dd0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005de0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005df0: ffff ffff ffff ffff ffff ffff ffff ffff ................

Actual data (read buffer) at miscompare location:
-------------------------------------------------
00005c80: 89d3 bb05 0000 0000 3233 3638 3000 ffff ........23680...
00005c90: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7...
00005ca0: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294......
00005cb0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005cc0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005cd0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005ce0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005cf0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005d00: 89d3 bb05 0000 0000 3233 3830 3800 ffff ........23808...
00005d10: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7...
00005d20: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294......
00005d30: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005d40: ffff ffff 5555 5555 5555 5555 5555 5555 ....UUUUUUUUUUUU <<=== start of miscompare
00005d50: 5555 5555 ffff ffff ffff ffff ffff ffff UUUU............ <<=== end of miscompare
00005d60: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005d70: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005d80: 89d3 bb05 0000 0000 3233 3933 3600 ffff ........23936...
00005d90: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7...
00005da0: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294......
00005db0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005dc0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005dd0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005de0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00005df0: ffff ffff ffff ffff ffff ffff ffff ffff ................

Miscompare side wise:
====================
00005d30: ffff ffff ffff ffff ffff ffff ffff ffff .......... 00005d30: ffff ffff ffff ffff ffff ffff ffff ffff ..........
00005d40: ffff ffff ffff ffff ffff ffff ffff ffff .......... | 00005d40: ffff ffff 5555 5555 5555 5555 5555 5555 ....UUUUUU
00005d50: ffff ffff ffff ffff ffff ffff ffff ffff .......... | 00005d50: 5555 5555 ffff ffff ffff ffff ffff ffff UUUU......
00005d60: ffff ffff ffff ffff ffff ffff ffff ffff .......... 00005d60: ffff ffff ffff ffff ffff ffff ffff ffff ..........

Miscompare started at = 0x00005d45
Miscompare ended at = 0x00005d55
Miscompare length = x00005d55 - 0x00005d45 = 0x10 = 16 (decimal)

Buffer files are dumped into readable form as:
/tmp/htxenp0s6.wbuf_1 ==>> /tmp/wbuf_1_dump
/tmp/htxenp0s6.rbuf_1 ==>> /tmp/rbuf_1_dump

Transfer this defect to Linux team for further analysis.
==== State: Open by: kalok on 26 July 2016 06:32:03 ====

transferred to ppclinux_nonconfidential_BZilla using CMVC. updates not visible in BQ.
==== State: Open by: panico on 26 July 2016 10:11:29 ====

The same host logged another two miscompares on an ubuntu 16.10 guest (iap02.aus.stglabs.ibm.com (9.3.242.19)) on the austin adapter in slot C10 and the glacier park adapter in slot C7.

[htx@iap02] [1m/usr/lpp/htx/bin# [0m cat /tmp/htxerr

/dev/enp0s10 Jul 26 01:24:25 2016 err=00e30000 sev=4 hxecom
R(101.1.2.19.33643) connected to W(101.1.1.19.40555)
Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 51082
Miscompare at displacement (decimal) = 35954
wbuf = aaaaaaaaaaaaaaaaaaaaaaaaaaaa6b3af7020000
rbuf = fffffffffffffffffffffffffffffffff7020000

For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good reads= 32932090, good bytes read= 1335345881380

/dev/enp0s10 Jul 26 01:24:25 2016 err=00d80000 sev=4 hxecom
Shutting down testing due to error flag - 1
R(101.1.2.19.33643) connected to W(101.1.1.19.40555)
For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good reads= 0, good bytes read= 0

htx_messages Jul 26 01:24:26 2016 err=00000000 sev=1 hxssup
hxecom HE program for enp0s10 terminated by exit(0) call.

/dev/enp0s11 Jul 26 01:24:26 2016 err=00fb0068 sev=1 hxecom
W(101.1.1.19.40555) connected to R(101.1.2.19.33643)
Wrote 1980 characters, expected 51082
Write failed - Connection reset by peer.
For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good writes= 32932264, good bytes written= 1335354769648

/dev/enp0s11 Jul 26 01:24:26 2016 err=00d80000 sev=4 hxecom
Shutting down testing due to error flag - 10
W(101.1.1.19.40555) connected to R(101.1.2.19.33643)
For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good writes= 0, good bytes written= 0

/dev/enp0s11 Jul 26 01:24:26 2016 err=002a0068 sev=4 hxecom
SetUpConnect: Error Connecting to (9.3.242.19,41525) - Connection reset by peer.

/dev/enp0s11 Jul 26 01:24:39 2016 err=002a006f sev=4 hxecom
SetUpConnect: Error Connecting to (9.3.242.19,41525) - Connection refused.

/dev/enp0s4 Jul 26 01:25:14 2016 err=00e30000 sev=4 hxecom
R(103.1.2.19.43035) connected to W(103.1.1.19.53023)
Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 51082
Miscompare at displacement (decimal) = 16166
wbuf = 3433313400aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
rbuf = 0200000000323232373200aaaa01002faaaaaaaa

For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good reads= 12475849, good bytes read= 505864438618

/dev/enp0s4 Jul 26 01:25:14 2016 err=00d80000 sev=4 hxecom
Shutting down testing due to error flag - 1
R(103.1.2.19.43035) connected to W(103.1.1.19.53023)
For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good reads= 0, good bytes read= 0

htx_messages Jul 26 01:25:14 2016 err=00000000 sev=1 hxssup
hxecom HE program for enp0s4 terminated by exit(0) call.

/dev/enp0s5 Jul 26 01:25:14 2016 err=00fb0068 sev=1 hxecom
W(103.1.1.19.53023) connected to R(103.1.2.19.43035)
Wrote 10254 characters, expected 51082
Write failed - Connection reset by peer.
For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good writes= 12475989, good bytes written= 505871590098

/dev/enp0s5 Jul 26 01:25:14 2016 err=00d80000 sev=4 hxecom
Shutting down testing due to error flag - 10
W(103.1.1.19.53023) connected to R(103.1.2.19.43035)
For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
Consecutive prior good writes= 0, good bytes written= 0

/dev/enp0s5 Jul 26 01:25:14 2016 err=002a0068 sev=4 hxecom
SetUpConnect: Error Connecting to (9.3.242.19,50744) - Connection reset by peer.

/dev/enp0s5 Jul 26 01:25:26 2016 err=002a006f sev=4 hxecom
SetUpConnect: Error Connecting to (9.3.242.19,50744) - Connection refused.

== Comment: #84 - Dwip N. Banerjee <email address hidden> - 2016-10-10 11:43:26 ==
This is going to be a long post (my apologies). But it has, I believe, the
root cause analysis for this issue, based on what I have discovered till
now.

As has been mentioned this issue has been worked on, on a 24X7 including myself.

First I want to correct something I said in comment 61. I did the analysis as
mentioned there. But when looking at the data on the receiver - which had
captured the miscompare packet - I had misconstrued the position of the error.
I had mistakenly picked the one at the end - and failed to notice the actual
miscompare (the "XXXX...X" in a see of "YYYYYY"). That had mistakenly pushed
me towards considering an additional problem (of possible reordering of
packets as well). This caused a proliferation of possibilities to consider -
on which I spent quite a bit of time before I went back to drawing board again
to revalidate. I tracked the packets down through every concerned module and
realized that there was no reordering. I then went back to the old data and
discovered the missed sequences.

So that at least narrowed down the problem to one (a nasty one albeit) - a
case of data modification with a specific signature. the last 16 bytes of a
page were getting corrupted. This was visible when transmitting the packet.
The visibility/detection signature:
  -- always on the receive side
  -- always on the tun output and bridge on sender
  -- sometimes on the vhost transmit
  -- sometimes (rare) on the guest transmit.

But from the signature, there is no clue as to whether the corruption is on
  - receive side
  - send side
OR if it is on:
  - virtio/guest side
  - vhost/host side

While my hunch was that the problem was likely on the receive side (because
it was so close to the edge of a page, it almost seemed like someone had
mistakenly read past their page), it has been very hard to narrow down or to
validate. I had put in my own logs on the host and guest to capture the
descriptor flows in order to be able to validate the access of the addresses
and if the suspect address (the one NEXT to the corrupted one, in the next
page) was being accessed - especially be receivers. As it turned out sometimes
they were indeed accessed by receivers and also by transmitters. I had to put
timestamps to sort them out.

One had to consider the possibility of races in the virtio/vhost communication
-- memory barrier issues
-- race possibilities
-- cases where one side prematurely accessed descriptors before they were
advertised by the producer
-- descriptor corruption
-- other scenarios.

I did observe the following interesting points:
  - typically the virtio queues were running at close to full
     (especially for the slower Austin adapter)
  - the vhost receive queues for the MLX adapters were empty i.e. there were
     lots of skbs lying on the tun receive queues, waiting to be delivered to
     the guest.
The above does bring into the picture the possibility of possible barrier/race
causing premature exposure and leading to the corruption seen.

I did consider various scenarios in this realm, both for send and receives on
the guest/host but my logging didn't validate any of the scenarios. There is
no point in cataloging them now - suffice it to say that none of them panned
out.

Eventually, I decided to focus on the receive side since it appeared to be
the likely place for the data modification based on the signatures that were
seen. I instrumented the receive side especially looking for the signatures
we have observed. This was in addition to the logs I had encoded in the
host/guest.

This is when I started observing similar packet modification signatures as
has been observed for this problem. I dumped the packet before and after
looking for modification signatures like we are seeing.

IN EVERY CASE I SAW A DATA CORRUPTION in the transmit path, I saw the SAME
SIGNATURE WAS FLAGGED in the receive path on tun_recvmsg.

Comparing timestamps, the modification observed on the tun receive path
was always before (some 30/40ms to 100/200 ms before). Also the change
happened during the receive:
  -- before the receive when the packet was dumped it had original data
  -- after the receive, it was a corrupted packet
  -- and if I let it go through it will then hit the miscompare on the
     transmit side

The signature captured in this receive intercept was similar to the
miscompare... An example below:

---------------------------------------------
Previously (last two lines)
ffffffffffffffffffffffffffffffff <END PAGE - 16>
ffffffffffffffffffffffffffffffff <END PAGE>

current (last two lines)
ffffffffffffffffffffffffffffffff <END PAGE - 16>
cccccccccccccccccccccccccccccccc <END PAGE>
cccccccccccccccccccccccccccccccc <START new PAGE>
---------------------------------------------

When I let it run thru, this scenario will trigger a miscompare in
the transmit path like we have been seeing all along.

Thereafter, I put some instrumentation in the code to determine what
was causing this. After going through a few this is the general
sequence, which I am illustrating with an example below.

Sequence leading to the corruption
---------------------------------
A large packet is read in. Two iovecs assigned for this:

0. Buffers are posted via vhost (the sizes are important... esp
the first one ... here they are 0x7700 and 0x10000). In every
instance I observed, the first one had a buffer size of 7700.

c000001fb485dd88: 00003ffb5ad90000 0000000000007700 ...Z.?...w......
c000001fb485dd98: 00003fff40cd0000 0000000000010000 ...@.?..........

1.struct iovec {
iov_base = 0x3ffb5ad90000,
iov_len = 30464
}

2.<header> read from skb1
3.several GSO segmnents read in from the fragments of skb1
4.Once over, it reads the next skb2 from fraglist
5.In this skb, there is no linear data,
6.It reads the fragments from skb2.
7.After four complete fragments, the fifth one can not be
accommodated in this iovec. This is where it gets interesting.

8. The iov_offset is at 716e when it issues a read for 0x5a8
bytes (one MSS/1448 bytes). But it can fit only 0x592 bytes.

c000000001995a30: 00000000000005a8 000000000000716e
                                    ^^^iov_offset^^^
c000000001995a40: 0000000000000592 0000000000000000 ................
c000000001995a50: 00000000000005a8 c000001ec7937842 ........Bx......
                                    ^^^^from^^^^^^
c000000001995a60: 00003ffb5ad9716e 0000000000000592
                   ^^^to/userspc^^^ ^^^amount read^^

The iov has only 7700-716e=0x592 space left, so it can not fit
one full GSO segment.

Now from is advanced: c000001ec7937242+0x592=c000001ec79377d4
to/userspace_buf is advanced: 716e+592=7700 i.e. end of iov. So next
iovec is invoked.

struct iovec {
iov_base = 0x3fff40cd0000,
iov_len = 65536
}

Now we see the next read (into new iov)
c000000001995aa0: 00000000000005a8 0000000000007700 .........w......
                                    ^^^offset: first iov done
c000000001995ab0: 0000000000000016 0000000000000026 ........&.......
                   ^^amount to read ^^^left=returned by copy_to_user
c000000001995ac0: 00000000000005a8 c000001ec7937dd4 .........}......
                   ^^amount wanted ^^^from^^^^^^^^
c000000001995ad0: 00003fff40cd0000 fffffffffffffff0 ...@.?..........
                   ^^^to^^^^^^^^^ ^^^len=-10!!!!!^^^

So we wanted to read 0x16 bytes, but __copy_to_user_inatomic returned
0x26 bytes as the amount still remaining to be read!!!!

__copy_to_user_inatomic(00003fff40cd0000, c000001ec7937dd4, 0x16) = 0x26

THIS IS THE PROBLEM.

Now our iov_len becomes negative = 0x16-0x26=fffffffffffffff0

We then apply this negative offset and modify the 16 buytes before the
start of this page on the next read. Here is the next read:

c000000001995b50: fffffffffffffff0 0000000000000026 ........&.......
                   ^^^offset^^^^^^ ^^^ bytes to write
c000000001995b60: 0000000000000000 00000000000005a8 ................
                                    ^^^total wanted - from the initial
c000000001995b70: c000001ec7937dc4 00003fff40ccfff0 .}.........@.?..
                   ^^^from^^^^^^^^ ^^^^to- previous page
c000000001995b80: 0000000000000026 0000000000000016 &...............
                   ^^^iov_len=amount read
note that both from and to has regressed!!!
from=c000001ec7937dd4+fffffffffffffff0=c000001ec7937dc4
to=0x3fff40cd0000+fffffffffffffff0=00003fff40ccfff0
i.e. we are re-reading 16 bytes from from
and we are reading into the last 16 bytes of the previous page

our offset is -16(dec) (will read into i.e. overwrite the the previous 16
bytes

After this the reads normalize and behave the way they ought to.
But those 16 bytes were completely messed up (16 re-read and 16 read
into wrong place) - which were obviously wrong/unnecessary.

For the instances, I observed, the following were common:
  - read into first iov of size 0x7700
  - next iov starts on a new page
  - GSO segments read properly till truncation due to first iov
    running out of space. Last read=0x592. So 0x16 bytes remaining.
    At this point the "from" address seems to end in "d4" - not sure
    if that is just a coincidence.
  - next iov is start of page
  - __copy_to_user_inatomic(from, to, 0x16)=0x26 -- CULPRIT!!!
  - Now our len, offset will become negative -16(decimal) and we
    read into the last 16 bytes of the previous page
  - after this everything normalizes. Other than the 16 bytes of
    mayhem everything is fine!

I investigated the call chain. Here is the call chain in invoking
the copy_from_user routine...

Call Chain
==========
tun_put_user
  --> skb_copy_datagram_iter
        --> copy_to_iter
        --> copy_page_to_iter
            --> copy_page_to_iter_iovec
                 -->__copy_to_user_inatomic (using atomic kmaps)
                     --> __copy_tofrom_user

******************************************
__copy_to_user_inatomic pretty much passes everything to __copy_tofrom_user
So that is likely where the focus has to be.

The __copy_tofrom_user is in Linux/arch/powerpc/lib/copyuser_64.S
-----------------------------------------------------------------

_GLOBAL_TOC(__copy_tofrom_user)
BEGIN_FTR_SECTION
        nop
FTR_SECTION_ELSE
        b __copy_tofrom_user_power7
ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY)
_GLOBAL(__copy_tofrom_user_base)
        /* first check for a whole page copy on a page boundary */
        cmpldi cr1,r5,16
...
******************************************

Note that NOT all of the above corruption will be detected!! In our
case, if the page corrupted is one of the pages being CURRENTLY
transmitted (on the vhost send queue) we will detect it. But the
queue size is pretty small (256) - so probability is quite low. Or
if we have to hold on to packets due to the possibility of TCP
retransmission, they will be long-lived and will increase the
probability as well. I did observe one TCP retransmission in
one of the instances but I wasn't actively seeking those
occurrences.
The corrupted packets will be delivered to the guest - I did see
a packet of size 0x16 on the guest in one of the crashes - which
seemed bizarre (how can that be generated?) but now makes sense.
It was just thrown away.

I let the code run with some checks in copy_page_to_iter_iovec to
capture the anomalies in __copy_tofrom_user. The instances are
pretty rare and generally conform to the following pattern:

0x26=__copy_tofrom_user(to, from, 0x16)

However, it is not always 26/16. This is confirmed/validated by
prior observations of the corruption being less than 16 bytes as
well (any where from 4 bytes to 16 bytes - 16 being most common).

[ 2365.918381] error in copying ... buf=00003fff1e7e0000 from=c000001f6e03cd6c left=26 copy=16
[ 8789.412510] error in copying ... buf=00003ffeef060000 from=c000001f6ecb99f6 left=26 copy=16
[11930.325914] error in copying ... buf=00003fff23f40000 from=c000001f53041fd4 left=26 copy=16
[15400.086279] error in copying ... buf=00003fff165b0000 from=c000001f5380ab66 left=26 copy=16
[22717.969918] error in copying ... buf=00003fff199a0000 from=c000001f947a6606 left=24 copy=14 <<<<<-----NOTE
[27274.429858] error in copying ... buf=00003fff14120000 from=c000001f677c31d4 left=26 copy=16
[32103.463712] error in copying ... buf=00003ffc272e0000 from=c000001f25e7e68c left=26 copy=16

The size of the read along with the new user page (and possibly
even the from address offset) may have something to do with
this (just conjecture again).

I believe the solution to the problem lies in figuring out why
__copy_tofrom_user is returning a number greater than the read
as "left". As mentioned, the code appears to be in
linux/arch/powerpc/lib/copyuser_64.S

While it is possible to put a check in copy_page_to_iter_iovec
to test for a case where "left" is greater than "copy", the
correct approach (I believe) is to determine why it is behaving in
the way observed. Interestingly, "left" is expected to be between
0 and "copy" - no check for negative or greater than copy. So clearly
the function is expected behave in a tight range and so the
community may want to keep it that way (just conjecture).

This issue doesn't seem unique to POWER. It seems Sparc observed
something similar as well.

In linux 4.8, in copy_page_to_iter_iovec() the calls to the
atomic kmaps are enveloped with
(IS_ENABLED(CONFIG_HIGHMEM)
additionally. The reasoning given is efficiency. I don't believe
it has any impact in our scenario (both Ubuntu 16.04 and PowerKVM
don't seem to have it set)

Similar Issues
-------------
I did a little rudimentary research regarding any similar issues
reported in the past. The following URLs report similar relevant issues:

1 https://lkml.org/lkml/2016/7/8/775
2 http://www.spinics.net/lists/stable/msg94139.html
3 https://lkml.org/lkml/2016/8/3/181
4 https://patchwork.kernel.org/patch/9260733/

Note that all of them are quite recent.

Case 1 appears very interesting. I was wondering if we are having
any page fault related issues for the new user page (we will be
reading into) especially for a small read. (Case 2 is just a
little usage case).

Cases 3 and 4 seem to refer to similar issues where the
corruption was narrowed down to the commit which had recoded
the module to use copy_page_to_iter(). I don't believe that
the root cause was found (but I didn't put too much effort
into digging up the history because in our case we have a
smoking gun already, I believe).

While working on this, I observed another bug in tun_put_user()
although it didn't have any impact in our case (I verified)

tun_put_user calls
skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset);

but never checks the return code (which can be EFAULT). In that case,
it will blithely copy junk to the vhost (caller's) buffers. That
though appears to be a secondary issue for this problem.

I believe though, that we need to figure out how
__copy_tofrom_user
is returning the anomalous data described above.

While I can go through the assembly and make some progress, I believe
there are experts here who can probably rapidly figure out the issue
here from what has been described above.

== Comment: #85 - Dwip N. Banerjee <email address hidden> - 2016-10-10 13:03:24 ==
In the meanwhile, I will try the foillowing:
-- disable atomic kmaps
-- immediately switch to non-atomic kmaps in case left > copy

== Comment: #86 - Pradeep Satyanarayana <email address hidden> - 2016-10-10 15:12:51 ==
Paul, Can you please take a look at Dwip's excellent analysis and let us know your thoughts?

== Comment: #87 - Dwip N. Banerjee <email address hidden> - 2016-10-11 00:53:47 ==
I wanted to add a couple of updates and a note...

My effort to try to maintain the performance profile by keeping the atomic kmaps and switching over to the non-atomic mode in case of the weird read failed. It was fairly quick. The corruption was visible within half an hour.

So I tried to switch off the atomic completely by essentially following the code from 4.8 (using a CONFIG_HIGHMEM check although its stated introduction purpose was different). It's still running , after almost 12 hours (on this machine it reproduces fairly quickly). David also did the the same thing but by commenting out the atomic portion. I think that is probably still running as well. This may provide a helpful data point.

Finally, I wanted to mention that __copy_tofrom_user() is also called in the transmit path via __copy_from_user_inatomic().

The call sequence is:

handle_tx->tun_sendmsg->tun_get_user->skb_copy_datagram_from_iter->copy_page_from_iter->copy_page_from_iter_iovec->__copy_from_user_inatomic->__copy_tofrom_user

This is done for the non-zerocopy path. By default it is off but the vhost code is self-tuning w.r.t zero-copy buffers even when it zerocopy mode set, i.e. beyond a threshold of buffers still in use (ubufs yet to be released) the code will switch to a copy mode. That brings into the picture another invocation sequence of __copy_tofrom_user() (with the src/dst reversed w.r.t. kernel/user space).

So that will be a further possible source of mayhem...

== Comment: #88 - Paul Mackerras <email address hidden> - 2016-10-11 07:09:38 ==
Dwip, thank you for the excellent analysis. I dug into __copy_tofrom_user today to find out how it could return more than the number of bytes it was asked to copy. It turns out that there is a bug in __copy_tofrom_user_base which has been there since 2002! There is one store instruction in that function for which the exception handling goes to the wrong place, resulting in the return value being 16 greater than it should be. The fix is very simple and I will post it to the upstream mailing list and attach it to this bug.

== Comment: #89 - Paul Mackerras <email address hidden> - 2016-10-11 07:21:52 ==

Tags: