Graylog waiting on REST API

Bug #1899158 reported by Márton Kiss
This bug affects 6 people
Affects: Graylog Charm
Status: Invalid
Importance: Critical
Assigned to: Joe Guo

Bug Description

The graylog unit is trying to use the local mongodb connection and is reporting a "Waiting for: REST API" status.

$ juju status graylog | grep waiting
graylog 3.0.1 waiting 3 graylog jujucharms 44 ubuntu
graylog/1* waiting idle 19 10.70.227.37 9000/tcp Waiting for: REST API

2020-10-07 15:57:52 INFO juju-log mongodb:274: Retrying REST API check...
2020-10-07 15:57:52 INFO juju-log mongodb:274: Error calling graylog api: HTTPConnectionPool(host='127.0.0.1', port=9000): Max retries exceeded with url: /api/?pretty=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7c06aa28d0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2020-10-07 15:57:53 INFO juju-log mongodb:274: Error calling graylog api: HTTPConnectionPool(host='127.0.0.1', port=9000): Max retries exceeded with url: /api/?pretty=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7c06aa2da0>: Failed to establish a new connection: [Errno 111] Connection refused',))

The mongodb relation configuration is missing from the graylog snap's server.conf:

$ juju ssh graylog/1 'sudo grep "^mongodb_uri" /var/snap/graylog/common/server.conf'
mongodb_uri = mongodb://localhost/graylog

This is a 3-unit deployment where the other units are properly settled and active, and include the correct mongodb configuration:

$ juju ssh graylog/0 'sudo grep "^mongodb_uri" /var/snap/graylog/common/server.conf'
mongodb_uri = mongodb://10.XX.XX.XX:27017/graylog


Revision history for this message
Márton Kiss (marton-kiss) wrote :

$ juju --version
2.7.8-bionic-amd64

Removing the broken unit did not resolve the issue; adding a new unit resulted in a blocked charm status due to a missing mongodb relation:

$ juju status graylog/4 | grep ^graylog
graylog 3.0.1 blocked 1 graylog jujucharms 44 ubuntu
graylog/4 blocked executing 30 10.70.227.2 9000/tcp Missing relation to: mongodb

However, the relation is present according to juju status:

$ juju status --relations | grep graylog: | grep mongodb
graylog-mongodb:database graylog:mongodb mongodb regular

Revision history for this message
John George (jog) wrote :

Solutions-QA is also seeing this issue. The bundle and logs are available by following the link at the bottom of this test run: https://solutions.qa.canonical.com/testruns/testRun/81fe8520-3fee-4e54-a096-e76ea3b76886

Revision history for this message
Joshua Genet (genet022) wrote :

Solutions-QA hit this again a number of times over the weekend on our Focal Kubernetes 1.19 AWS test runs.

You can see each run (with artifacts/logs) here:
https://solutions.qa.canonical.com/bugs/bugs/bug/1899158

Revision history for this message
Przemyslaw Hausman (phausman) wrote :

I'm seeing this on a customer deployment. Although in my case the server.conf files seem to have the correct contents; see below.

Graylog version: 3.0.1
Graylog charm revision: 45
Juju version: 2.8.7

ubuntu@infra-1:~/deploy$ juju ssh graylog/0 'sudo grep "^mongodb_uri" /var/snap/graylog/common/server.conf'
mongodb_uri = mongodb://192.168.1.193:27017,192.168.1.161:27017,192.168.1.176:27017/graylog
Connection to 192.168.1.79 closed.

ubuntu@infra-1:~/deploy$ juju ssh graylog/1 'sudo grep "^mongodb_uri" /var/snap/graylog/common/server.conf'
mongodb_uri = mongodb://192.168.1.176:27017/graylog
Connection to 192.168.1.77 closed.

ubuntu@infra-1:~/deploy$ juju ssh graylog/2 'sudo grep "^mongodb_uri" /var/snap/graylog/common/server.conf'
mongodb_uri = mongodb://192.168.1.193:27017,192.168.1.161:27017/graylog
Connection to 192.168.1.78 closed.

Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

Seeing this on a customer deployment.

ubuntu@infra-01:~$ juju ssh graylog/0 'sudo grep "^mongodb_uri" /var/snap/graylog/common/server.conf'
mongodb_uri = mongodb://172.24.70.131:27017/graylog
Connection to 172.24.70.117 closed.

ubuntu@infra-01:~$ juju ssh graylog/1 'sudo grep "^mongodb_uri" /var/snap/graylog/common/server.conf'
mongodb_uri = mongodb://172.24.70.131:27017/graylog
Connection to 172.24.70.122 closed.

$ juju ssh graylog/2 'sudo grep "^mongodb_uri" /var/snap/graylog/common/server.conf'
mongodb_uri = mongodb://172.24.70.131:27017/graylog
Connection to 172.24.70.137 closed.

ubuntu@infra-01:~$ juju status graylog
graylog/0* waiting executing 0 172.24.70.117 9000/tcp Waiting for: filebeat, REST API
graylog/1 waiting executing 22 172.24.70.122 9000/tcp Waiting for: filebeat, REST API
graylog/2 waiting executing 23 172.24.70.137 9000/tcp Waiting for: filebeat, REST API

Is there a known workaround?

Revision history for this message
Aymen Frikha (aym-frikha) wrote :

Hello,

I'm also affected by this bug in a customer environment. We hit it after rebooting the graylog machines.

Revision history for this message
Xav Paice (xavpaice) wrote :

If this is seen when there's a relation to a certificate provider, maybe LP:#1923537 is a duplicate of this?

Xav Paice (xavpaice)
Changed in charm-graylog:
status: New → In Progress
assignee: nobody → Xav Paice (xavpaice)
milestone: none → 21.07
importance: Undecided → Medium
Xav Paice (xavpaice)
Changed in charm-graylog:
assignee: Xav Paice (xavpaice) → nobody
status: In Progress → New
Revision history for this message
Michael Skalka (mskalka) wrote :

Marking this critical; we are seeing numerous release test failures due to this bug: https://solutions.qa.canonical.com/bugs/bugs/bug/1899158

Revision history for this message
Joe Guo (guoqiao) wrote :

Some facts I found while digging into the code:

- The graylog service runs on port 9000, and the graylog charm tries to reach the REST API (on the same host) via 127.0.0.1:9000 [0]; the IP 127.0.0.1 is correct here
- Both error messages, `Retrying REST API check...`[1] and `Error calling graylog api ...`[2], come from the graylog charm, not the mongodb charm
- The graylog REST API can run over https, if the `graylog.certificates.configured` flag is set. [3]
- The graylog REST API URL starts with `/api/` in v2, while in v3 it doesn't. [4]

However, in the log from the case description:

    $ juju status graylog | grep waiting
    graylog 3.0.1 waiting 3 graylog jujucharms 44 ubuntu
    graylog/1* waiting idle 19 10.70.227.37 9000/tcp Waiting for: REST API

    2020-10-07 15:57:52 INFO juju-log mongodb:274: Retrying REST API check...
    2020-10-07 15:57:52 INFO juju-log mongodb:274: Error calling graylog api: ...url: /api/?pretty=true...

2 issues:

A) This is a graylog log line, but somehow juju printed `mongodb:274` before the graylog error message. Not sure how/why this happened.
B) graylog is `3.0.1` (v3), but the URL has `/api/` in it (v2), which is wrong.

Because of A), bug reporter Márton Kiss was misled and ended up checking mongodb_uri in the graylog config. From what I can see, that could be a timing/race-condition/separate issue.
Because of B), it seems graylog is v3 but the charm is trying to access the v2 REST API with the wrong URL, which caused the timeout.

[0]: https://git.launchpad.net/charm-graylog/tree/src/lib/charms/layer/graylog/utils.py#n24
[1]: https://git.launchpad.net/charm-graylog/tree/src/reactive/graylog.py#n1117
[2]: https://git.launchpad.net/charm-graylog/tree/src/lib/charms/layer/graylog/api.py#n86
[3]: https://git.launchpad.net/charm-graylog/tree/src/lib/charms/layer/graylog/utils.py#n26
[4]: https://git.launchpad.net/charm-graylog/tree/src/lib/charms/layer/graylog/utils.py#n51
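
To illustrate point B), here is a minimal sketch (not the charm's actual code; function names and defaults are my own) of the kind of REST API probe described above. The host and port come from the unit log, and the version-dependent URL prefix is the suspected mismatch:

    # Sketch only: probe the local Graylog REST API the way the charm's check does.
    import requests

    def rest_api_url(version, host="127.0.0.1", port=9000):
        # Graylog 2.x serves the REST API under /api/, 3.x serves it at the root
        prefix = "/api/" if version.startswith("2") else "/"
        return "http://{}:{}{}".format(host, port, prefix)

    def rest_api_ready(version):
        try:
            resp = requests.get(rest_api_url(version), params={"pretty": "true"}, timeout=5)
            return resp.ok
        except requests.exceptions.RequestException:
            # e.g. the "Connection refused" errors seen in the unit log
            return False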

Revision history for this message
Joe Guo (guoqiao) wrote :

Here is how v2/v3 is determined:

    def is_v2(): # noqa: D103
        version = snap.get_installed_version(SNAP_NAME)
        # If the snap isn't installed yet, base our version off the charm config
        if not version:
            version = hookenv.config("channel")
        return True if version.startswith("2") else False
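
For illustration only (the values below are assumed examples), the `startswith("2")` test above behaves like this for typical snap versions and channel strings:

    # illustration only: assumed example values the version string could take
    for value in ("2.4.6", "3.0.1", "2/stable", "3/stable"):
        print(value, value.startswith("2"))
    # -> 2.4.6 True, 3.0.1 False, 2/stable True, 3/stable False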

I created a simple bundle in my local juju env and didn't reproduce this issue.
So the next key thing to do is, in an env with this issue, to check whether the above function is working as expected.
The following output would be helpful:

    juju run graylog/0 -- sudo snap info graylog
    juju config graylog channel

Revision history for this message
Joe Guo (guoqiao) wrote :

Also, regarding https/http, it's hard to determine which one should be used, and which was used, from the current info. That could be another possible cause of a wrong API URL and the timeout.

Revision history for this message
Michael Skalka (mskalka) wrote :

@Joe

Here is a crashdump from a juju model which is exhibiting the behavior: https://oil-jenkins.canonical.com/artifacts/983bd2cd-58de-4273-9919-60fdd1b28838/generated/generated/kubernetes/juju-crashdump-kubernetes-2021-04-24-21.30.09.tar.gz

As well as the bundle used to deploy this model: https://oil-jenkins.canonical.com/artifacts/983bd2cd-58de-4273-9919-60fdd1b28838/generated/generated/kubernetes/bundle.yaml

As you can see we have configured the charm to use '3/stable' for the channel.

The runs where we see this are in our CI, testing other products. If you follow the link I posted above you can see the full occurrences of this issue, and at the bottom of each of those linked test runs is a link to the artifacts gathered at the end of each failed run. Please let me know if there is anything else we need to add to our crashdump utility log-wise.

Revision history for this message
Joe Guo (guoqiao) wrote :

hi Michael Skalka:

Thanks for the info, appreciated.

Revision history for this message
Joe Guo (guoqiao) wrote :

With Michael Skalka's bundle, I can reproduce this issue easily.

The problem:

In juju status, graylog gets stuck in the status "Waiting for: REST API".

(Here I removed filebeat to narrow down the issue.)

What actually happened:

In the graylog unit, once any http request is sent to the graylog server, either http://127.0.0.1:9000/ or http://127.0.0.1:9000/api/, CPU usage shoots up and the graylog snap service (the java process) gets killed and restarts.

Some findings:

* This behavior is consistent no matter whether the http request is triggered via a browser, curl, or a requests call in a juju hook. So the issue is not caused by charm code.
* Even when I `snap stop graylog` and run the java command manually, the process still gets killed on request, so it's not related to the snap.
* Changing the graylog channel from 3/stable to 2/stable doesn't help; the issue is the same. (The API then runs on :9001/.)
* I am using 8G of RAM for graylog, so it's not caused by running out of memory.
* Deploying graylog alone in a model, the http service won't work (no mongo and ES), but the java process is stable, so the restart is not related to focal/snap/lxd.
* Deploying only graylog and mongodb in a model, the issue appears, but the log is all about missing Elasticsearch.
* Deploying graylog, mongodb and elasticsearch, the issue is still there, but mongodb and elasticsearch all look healthy.

I tried to run graylog in debug mode with:

    # edit the snap's server defaults file and set the debug flag:
    vim /var/snap/graylog/current/default-graylog-server
    GRAYLOG_SERVER_ARGS="-d"
    # then restart the snap to pick up the change:
    snap restart graylog

But still, no useful logs were printed.

It's most likely that the issue is caused by the elasticsearch or mongodb relation, but I didn't find any proof.

Bundle I used to debug: https://pastebin.canonical.com/p/KwdF6zGjgY/
(graylog + mongodb + elasticsearch x 3)

Revision history for this message
Joe Guo (guoqiao) wrote :

Hi All,

With Xav Paice's help, we found out the issue is caused by this:

  graylog:
    bindings:
      ? ''
      : alpha
    charm: cs:graylog
    constraints: mem=4G
    num_units: 1
    options:
      channel: 3/stable
      jvm_heap_size: 8G

jvm_heap_size is 8G but mem is 4G, which caused the graylog java process to be killed.
The JVM heap is the RAM where java objects live; it must be smaller than the machine's RAM.
After changing it back to the default 1G, the issue disappeared from my test env.

Revision history for this message
Joe Guo (guoqiao) wrote :

Hi All,

I have changed the status to `Invalid`, since it's a configuration issue.

If that's still not the root cause in your case, please feel free to report more information.

Changed in charm-graylog:
status: New → Confirmed
importance: Medium → Critical
assignee: nobody → Joe Guo (guoqiao)
status: Confirmed → Invalid
Revision history for this message
Joe Guo (guoqiao) wrote :

I realized I can add a check in the charm code to prevent this from happening again. I will make a patch for that. Status changed to "In Progress".
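
A minimal sketch of what such a check might look like (an assumption about the approach, not the actual patch; helper names are made up): parse the configured jvm_heap_size and compare it with the machine's total memory before configuring the snap.

    # Sketch only, not the actual patch: validate jvm_heap_size against the
    # machine's total memory.
    UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

    def heap_size_bytes(value):
        """Parse a size like '1G' or '512m' into bytes."""
        value = value.strip().lower()
        return int(value[:-1]) * UNITS[value[-1]]

    def machine_mem_bytes():
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) * 1024  # MemTotal is reported in kB

    def jvm_heap_size_fits(configured):
        # e.g. jvm_heap_size=8G on a 4G machine returns False,
        # so the charm could block instead of letting the java process be killed
        return heap_size_bytes(configured) < machine_mem_bytes()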

Changed in charm-graylog:
status: Invalid → In Progress
Revision history for this message
Joe Guo (guoqiao) wrote :

To make things clear, I created another bug about jvm_heap_size validation:

https://bugs.launchpad.net/charm-graylog/+bug/1927170

For this one, I will again mark it as invalid.

To those who hit this issue:

Please adjust the jvm_heap_size config in your bundle to ensure it's smaller than the machine's total memory.

The default value is "1G".
Please note this config is NOT "the bigger, the better".
When it's large, GC will be triggered less frequently, but each GC run will take longer to finish.
When it's small, GC will be frequent, but each run will be quicker.
So the default "1G" is probably a reasonable value most of the time.

Changed in charm-graylog:
status: In Progress → Invalid