Race condition when generating ovn.env file

Bug #2019317 reported by Martin Kalcok
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
microovn
Fix Released
High
Unassigned

Bug Description

When member joins (or leaves) microovn cluster, rest of the existing members re-generate their $SNAP_COMMON/data/ovn.env file by calling `ovn.generateEnvironment()`[0]. As part of this call, a connection strings to the OVN NB and SB databases are updated as well and this part of code contains race condition. Offending code is in `connectionString()` function[1] and looks like this:

```
<earlier lines omitted>
addresses := make([]string, len(servers))
remotes := s.Remotes().RemotesByName()
for i, server := range servers {
 remote, ok := remotes[server.Member]
 if !ok {
  continue
 }

 addresses[i] = fmt.Sprintf("tcp:%s",
  netip.AddrPortFrom(remote.Address.Addr(), uint16(port)).String())
}

return strings.Join(addresses, ","), nil
```

The final string is create from the slice `addresses` which is initiated with the length of `servers` and filled by looping over `servers` and looking up remote address of each server. However if the lookup fails (because the remote server is not properly registered in the database yet), the loop `continue`s and the `addresses` now have empty element that breaks the final string.

Expected string for cluster of 3:
"tcp:10.5.3.129:6641,tcp:10.5.3.8:6641,tcp:10.5.1.223:6641"

Broken string for cluster of 3 where one remote address lookup failed:
"tcp:10.5.3.129:6641,tcp:10.5.3.8:6641,"

This broken string then produces errors in ovn-central service like this:

May 12 11:49:02 juju-89d2ab-microovn-2 ovn-northd[417700]: ovs|00035|socket_util|ERR|10.5.3.8:6641,: bad port number "6641,"

My proposal is to add retry mechanism in case that remote server lookup fails. In addition, I think the slice `addresses` should be initialized with length 0 and capacity of `len(servers)` to prevent having empty elements in it.

[0]https://github.com/canonical/microovn/blob/f8a449753c5f00ec6e271cb430075a76850e5021/microovn/ovn/environment.go#LL58C6-L58C25
[1]https://github.com/canonical/microovn/blob/f8a449753c5f00ec6e271cb430075a76850e5021/microovn/ovn/environment.go#L26

Revision history for this message
Max Asnaashari (masnax) wrote :

The missing address happens because microcluster's `Remotes` are updated on the next heartbeat. To get the nodes at the cluster level, there is the `cluster.InternalClusterMember` type available to fetch from the dqlite database via `cluster.GetInternalClusterMembers()`

Revision history for this message
Stéphane Graber (stgraber) wrote :

Worth noting that the actual bug (empty address) has been fixed a few weeks ago, but the missing address (missing entry in Remotes) still needs fixing.

Revision history for this message
Max Asnaashari (masnax) wrote :
Frode Nordahl (fnordahl)
Changed in microovn:
status: New → Fix Committed
importance: Undecided → High
Frode Nordahl (fnordahl)
Changed in microovn:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.