Activity log for bug #1799397

Date Who What changed Old value New value Message
2018-10-23 10:15:17 Talat Batheesh bug added bug
2018-10-31 15:59:07 Launchpad Janitor dpdk (Ubuntu): status New Confirmed
2018-10-31 16:13:15 David Coronel bug added subscriber David Coronel
2018-11-09 13:56:47 Christian Ehrhardt  dpdk (Ubuntu): importance Undecided Low
2018-11-12 05:29:41 Christian Ehrhardt  bug watch added https://bugs.dpdk.org/show_bug.cgi?id=97
2018-11-12 05:32:39 Christian Ehrhardt  bug task added gcc-7 (Ubuntu)
2018-11-12 05:32:49 Christian Ehrhardt  bug added subscriber Matthias Klose
2018-11-12 13:03:35 Christian Ehrhardt  bug task added dpdk
2019-02-26 08:47:10 Christian Ehrhardt  dpdk: status New Fix Released
2019-02-26 08:47:13 Christian Ehrhardt  gcc-7 (Ubuntu): status New Invalid
2019-02-26 08:47:17 Christian Ehrhardt  nominated for series Ubuntu Cosmic
2019-02-26 08:47:17 Christian Ehrhardt  bug task added dpdk (Ubuntu Cosmic)
2019-02-26 08:47:17 Christian Ehrhardt  bug task added gcc-7 (Ubuntu Cosmic)
2019-02-26 08:47:17 Christian Ehrhardt  nominated for series Ubuntu Bionic
2019-02-26 08:47:17 Christian Ehrhardt  bug task added dpdk (Ubuntu Bionic)
2019-02-26 08:47:17 Christian Ehrhardt  bug task added gcc-7 (Ubuntu Bionic)
2019-02-26 08:47:23 Christian Ehrhardt  dpdk (Ubuntu): status Confirmed Fix Released
2019-02-26 08:47:25 Christian Ehrhardt  dpdk (Ubuntu Bionic): status New Triaged
2019-02-26 08:47:28 Christian Ehrhardt  dpdk (Ubuntu Cosmic): status New Triaged
2019-02-26 08:47:32 Christian Ehrhardt  bug task deleted gcc-7 (Ubuntu Bionic)
2019-02-26 08:47:38 Christian Ehrhardt  bug task deleted gcc-7 (Ubuntu Cosmic)
2019-03-04 09:47:05 Christian Ehrhardt  description
Old value: the original report from Yongseok, identical to the text that follows the "---" separator in the new value.
New value:

[Impact]
 * Crashing on certain Skylake chips
 * Follow upstream disabling one of the gcc options

[Test Case]
 * Part of the MRE bug 1817675, following the MRE verification process as defined there.

[Regression Potential]
 * Rebuilds with the new code using DPDK headers will be slightly slower (not using the feature) but avoid the crash. The slowdown should be negligible for most cases and the crash avoidance outweighs this.

[Other Info]
 * n/a

---

Hi, Christian

We've recently encountered a weird issue with Ubuntu 18.04 on the Skylake server. I can always reproduce this crash and I could narrow it down. I guess it could be a GCC issue.

[1] How to reproduce
- ConnectX-4Lx/ConnectX-5 with mlx5 PMD in DPDK 18.02.1
- Ubuntu 18.04 on Intel Skylake server
- gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
- Testpmd crashes when it starts to forward traffic. Easy to reproduce.
- Only happens on the Skylake server.
- DPDK 18.05 and later don't have such issue. git-bisect gives no clue. This is because I enabled MEMPOOL_DEBUG and MLX5_DEBUG. As mempool/rte_memcpy is an inlined function, it should be affected. Now I can see the crash regardless - 18.02, 18.05 and 18.08.

[2] Failure point
The attached patch gives an insight into why it crashes. The following is the result of the patch and the GDB commands. In summary, rte_memcpy() doesn't work as expected. In __mempool_generic_put(), there's rte_memcpy() to move the array of objects to the lcore cache. If I run memcmp() right after rte_memcpy(dst, src, n), data in dst differs from data in src. And it looks like some of the data got shifted by a few bytes, as you can see below.

[GDB command]
$dst = 0x7ffff4e09ea8
$src = 0x7fffce3fb970
$n = 256
x/32gx 0x7ffff4e09ea8
x/32gx 0x7fffce3fb970
testpmd: /home/mlnxtest/dpdk/build/include/rte_mempool.h:1140: __mempool_generic_put: Assertion `0' failed.

Thread 4 "lcore-slave-1" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffce3ff700 (LWP 69913)]
(gdb) x/32gx 0x7ffff4e09ea8
0x7ffff4e09ea8: 0x00007fffaac38ec0 0x00007fffaac38500
0x7ffff4e09eb8: 0x00007fffaac37b40 0x00007fffaac37180
0x7ffff4e09ec8: 0x850000007fffaac3 0x7b4000007fffaac3
0x7ffff4e09ed8: 0x00007fffaac35440 0x00007fffaac34a80
0x7ffff4e09ee8: 0xaac3850000007fff 0xaac37b4000007fff
0x7ffff4e09ef8: 0x00007fffaac32d40 0x00007fffaac32380
0x7ffff4e09f08: 0x7fffaac385000000 0x7fffaac37b400000
0x7ffff4e09f18: 0x00007fffaac30640 0x00007fffaac2fc80
0x7ffff4e09f28: 0x00007fffaac2f2c0 0x00007fffaac2e900
0x7ffff4e09f38: 0x00007fffaac2df40 0x00007fffaac2d580
0x7ffff4e09f48: 0x00007fffaac2cbc0 0x00007fffaac2c200
0x7ffff4e09f58: 0x00007fffaac2b840 0x00007fffaac2ae80
0x7ffff4e09f68: 0x00007fffaac2a4c0 0x00007fffaac29b00
0x7ffff4e09f78: 0x00007fffaac29140 0x00007fffaac28780
0x7ffff4e09f88: 0x00007fffaac27dc0 0x00007fffaac27400
0x7ffff4e09f98: 0x00007fffaac26a40 0x00007fffaac26080
(gdb) x/32gx 0x7fffce3fb970
0x7fffce3fb970: 0x00007fffaac38ec0 0x00007fffaac38500
0x7fffce3fb980: 0x00007fffaac37b40 0x00007fffaac37180
0x7fffce3fb990: 0x00007fffaac367c0 0x00007fffaac35e00
0x7fffce3fb9a0: 0x00007fffaac35440 0x00007fffaac34a80
0x7fffce3fb9b0: 0x00007fffaac340c0 0x00007fffaac33700
0x7fffce3fb9c0: 0x00007fffaac32d40 0x00007fffaac32380
0x7fffce3fb9d0: 0x00007fffaac319c0 0x00007fffaac31000
0x7fffce3fb9e0: 0x00007fffaac30640 0x00007fffaac2fc80
0x7fffce3fb9f0: 0x00007fffaac2f2c0 0x00007fffaac2e900
0x7fffce3fba00: 0x00007fffaac2df40 0x00007fffaac2d580
0x7fffce3fba10: 0x00007fffaac2cbc0 0x00007fffaac2c200
0x7fffce3fba20: 0x00007fffaac2b840 0x00007fffaac2ae80
0x7fffce3fba30: 0x00007fffaac2a4c0 0x00007fffaac29b00
0x7fffce3fba40: 0x00007fffaac29140 0x00007fffaac28780
0x7fffce3fba50: 0x00007fffaac27dc0 0x00007fffaac27400
0x7fffce3fba60: 0x00007fffaac26a40 0x00007fffaac26080

AFAIK, AVX512F support is disabled by default in DPDK as it is still experimental (CONFIG_RTE_ENABLE_AVX512=n). But with gcc optimization, the AVX2 version of rte_memcpy() seems to be optimized with 512b instructions. If I disable it by adding EXTRA_CFLAGS="-mno-avx512f", then it works fine and doesn't crash.

Do you have any idea regarding this issue or are you already aware of it?

Thanks,
Yongseok

$ git diff
diff --git a/config/common_base b/config/common_base
index ad03cf433..f512b5a88 100644
--- a/config/common_base
+++ b/config/common_base
@@ -275,8 +275,8 @@ CONFIG_RTE_LIBRTE_MLX4_TX_MP_CACHE=8
 #
 # Compile burst-oriented Mellanox ConnectX-4 & ConnectX-5 (MLX5) PMD
 #
-CONFIG_RTE_LIBRTE_MLX5_PMD=n
-CONFIG_RTE_LIBRTE_MLX5_DEBUG=n
+CONFIG_RTE_LIBRTE_MLX5_PMD=y
+CONFIG_RTE_LIBRTE_MLX5_DEBUG=y
 CONFIG_RTE_LIBRTE_MLX5_DLOPEN_DEPS=n
 CONFIG_RTE_LIBRTE_MLX5_TX_MP_CACHE=8
 
@@ -597,7 +597,7 @@ CONFIG_RTE_RING_USE_C11_MEM_MODEL=n
 #
 CONFIG_RTE_LIBRTE_MEMPOOL=y
 CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE=512
-CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
+CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=y
 
 #
 # Compile Mempool drivers
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 8b1b7f7ed..9f48028d9 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -39,6 +39,7 @@
 #include <errno.h>
 #include <inttypes.h>
 #include <sys/queue.h>
+#include <assert.h>
 
 #include <rte_config.h>
 #include <rte_spinlock.h>
@@ -1123,6 +1124,22 @@ __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
        /* Add elements back into the cache */
        rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
 
+       if(memcmp(&cache_objs[0], obj_table, sizeof(void *) * n)) {
+               printf("[GDB command] \n"
+                               "$dst = %p\n"
+                               "$src = %p\n"
+                               "$n = %ld\n"
+                               "x/%ldgx %p\n"
+                               "x/%ldgx %p\n",
+                               (void *)&cache_objs[0],
+                               (const void *)obj_table,
+                               sizeof(void *) * n,
+                               sizeof(void *) * n / 8, (void *)&cache_objs[0],
+                               sizeof(void *) * n / 8, (const void *)obj_table
+                               );
+               assert(0);
+       }
+
        cache->len += n;
 
        if (cache->len >= cache->flushthresh) {
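Editorial sketch (not part of the original log): the copy-then-verify check from the report's patch can be tried outside DPDK with a small standalone program. The names copy_and_verify, built_with_avx512f and run_demo are hypothetical, and plain memcpy() stands in for rte_memcpy(), so this only illustrates the diagnostic technique and is not expected to reproduce the crash by itself. __AVX512F__ is the macro GCC predefines when AVX512F code generation is enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, then compare dst against src.
 * On mismatch, print GDB commands in the same style as the report's patch.
 * Returns 0 when the copy is intact, -1 otherwise. */
int copy_and_verify(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n); /* assumption: stand-in for rte_memcpy() */
    if (memcmp(dst, src, n) != 0) {
        printf("[GDB command]\n"
               "$dst = %p\n"
               "$src = %p\n"
               "$n = %zu\n"
               "x/%zugx %p\n"
               "x/%zugx %p\n",
               dst, src, n, n / 8, dst, n / 8, src);
        return -1;
    }
    return 0;
}

/* Returns 1 if the compiler was allowed to emit AVX512F instructions
 * for this translation unit, 0 otherwise (e.g. with -mno-avx512f). */
int built_with_avx512f(void)
{
#ifdef __AVX512F__
    return 1;
#else
    return 0;
#endif
}

/* Demo: a table of 32 pointers (256 bytes on a 64-bit platform, matching
 * $n in the report) is copied and checked. */
int run_demo(void)
{
    static char pool[32];
    void *src[32];
    void *dst[32];

    for (size_t i = 0; i < 32; i++)
        src[i] = &pool[i];
    memset(dst, 0, sizeof(dst));

    int rc = copy_and_verify(dst, src, sizeof(src));
    assert(rc == 0); /* mirrors the assert(0) on mismatch in the patch */
    return rc;
}
```

Calling run_demo() from a main() and building once with the default flags and once with -mno-avx512f (the workaround named in the report) lets built_with_avx512f() confirm which variant was actually compiled.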
2019-03-26 16:03:41 Brian Murray dpdk (Ubuntu Cosmic): status Triaged Fix Committed
2019-03-26 16:03:43 Brian Murray bug added subscriber Ubuntu Stable Release Updates Team
2019-03-26 16:03:46 Brian Murray bug added subscriber SRU Verification
2019-03-26 16:03:49 Brian Murray tags verification-needed verification-needed-cosmic
2019-03-27 16:19:24 Brian Murray dpdk (Ubuntu Bionic): status Triaged Fix Committed
2019-03-27 16:19:31 Brian Murray tags verification-needed verification-needed-cosmic verification-needed verification-needed-bionic verification-needed-cosmic
2019-03-27 17:12:11 David Coronel removed subscriber David Coronel
2019-03-28 06:23:57 Christian Ehrhardt  tags verification-needed verification-needed-bionic verification-needed-cosmic verification-done verification-done-bionic verification-done-cosmic
2019-04-02 17:16:58 Launchpad Janitor dpdk (Ubuntu Cosmic): status Fix Committed Fix Released
2019-04-02 17:17:20 Brian Murray removed subscriber Ubuntu Stable Release Updates Team
2019-04-04 09:35:14 Launchpad Janitor dpdk (Ubuntu Bionic): status Fix Committed Fix Released