[DCOS_OSS-3804] dcos-checks-poststart: run as root Created: 13/Jul/18  Updated: 15/Feb/19  Resolved: 16/Jul/18

Status: Resolved
Project: DC/OS
Component/s: dcos-checks
Affects Version/s: None
Fix Version/s: DC/OS 1.11.4, DC/OS 1.12.0

Type: Bug Priority: Medium
Reporter: Gustav Paul (Inactive) Assignee: Gustav Paul (Inactive)
Resolution: Done  
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
Epic Link: Security Team initiatives
Team: DELETE Security Team
Sprint: Security Sprint 24, Security Sprint 25
Story Points: 3

 Description   

The output of the dcos-checks-poststart.service does not get formatted properly in the journal. It looks like most of the output gets sent to the journal but isn't associated with the systemd unit.

For example, if I run `systemctl start dcos-checks-poststart.service` we see the output in the main journal but not in the journal for that specific dc/os unit:

# Nothing in the unit's logs:
$ journalctl -u dcos-checks-poststart.service
Jul 13 16:30:04 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 systemd[1]: Started DC/OS Poststart Checks: Run node-poststart checks.
Jul 13 16:30:04 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 systemd[1]: Starting DC/OS Poststart Checks: Run node-poststart checks...
Jul 13 16:30:12 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 systemd[1]: Started DC/OS Poststart Checks: Run node-poststart checks.
Jul 13 16:30:12 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 systemd[1]: Starting DC/OS Poststart Checks: Run node-poststart checks...

# But in the main journal we find
$ journalctl
...
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0,
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "checks": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "components_master": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "curl": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "exhibitor_majority_expected_masters_connected": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "exhibitor_total_expected_masters_connected": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "ifconfig": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "ip_detect_script": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "172.17.0.2 is a valid IPV4 address\n",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "journald_dir_permissions": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "directory /run/log/journal has the group owner `systemd-journal` and group permissons r-x\n",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "mesos_master_replog_synchronized": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "tar": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "unzip": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "xz": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: },
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "zookeeper_serving": {
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "output": "serving\n",
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: "status": 0
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: }
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: }
Jul 13 16:31:01 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[18389]: }
...

This is really bad as it makes it impossible to figure out what broke when.

On a hunch I tested what happens when we run the unit as the `root` user and it fixed the logs issue! We now see...

$ journalctl -u dcos-checks-poststart.service
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 systemd[1]: Starting DC/OS Poststart Checks: Run node-poststart checks...
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0,
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "checks": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "components_master": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "curl": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "exhibitor_majority_expected_masters_connected": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "exhibitor_total_expected_masters_connected": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "ifconfig": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "ip_detect_script": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "172.17.0.2 is a valid IPV4 address\n",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "journald_dir_permissions": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "directory /run/log/journal has the group owner `systemd-journal` and group permissons r-x\n",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "mesos_master_replog_synchronized": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "tar": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "unzip": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "xz": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: },
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "zookeeper_serving": {
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "output": "serving\n",
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: "status": 0
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: }
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: }
Jul 13 16:33:03 dcos-e2e-e2f6c624-32b7-423b-8ee4-7aaa4cee088e-master-0 dcos-check-runner[19322]: }

Not only does this work around what appears to be a weird systemd bug, it also makes way more sense. The dcos-checks simulate the operator, Dan, periodically checking the health of his cluster. That implies the checks should assume Dan's permissions, which implies root access. Also, it doesn't serve requests or perform any user interaction at all - there is no reason not to run the checks as root.



 Comments   
Comment by Gustav Paul (Inactive) [ 13/Jul/18 ]

PR https://github.com/dcos/dcos/pull/3083

Comment by Gustav Paul (Inactive) [ 13/Jul/18 ]

This is very clearly related to https://bugs.freedesktop.org/show_bug.cgi?id=50184.

The issue is that if a process logs and immediately exits (like dcos-checks-runner does in dcos-checks-poststart.service due to being run by a timer unit) then the process has exited by the time the journal tries to read the process metadata from /proc/<pid>.

Comment by Tim Weidner (Inactive) [ 16/Jul/18 ]

When this gets merged we can stop running checks with 'sudo' in dcos-e2e wait_for_dcos_oss/ee

Comment by Gustav Paul (Inactive) [ 16/Jul/18 ]

Removing fixVersion 1.10 as the checks aren't run periodically through a dcos-checks-poststart.service / .timer combination.

Comment by Gustav Paul (Inactive) [ 16/Jul/18 ]

This master PR has Ship it.

1.11 PR: https://github.com/dcos/dcos/pull/3100

Generated at Tue May 24 03:32:25 CDT 2022 using JIRA 7.8.4#78004-sha1:5704c55c9196a87d91490cbb295eb482fa3e65cf.