Uploaded image for project: 'DC/OS'
  1. DC/OS
  2. DCOS_OSS-2132

dcos-log does not handle the journald files rotation properly

    Details

      Description

      The problem occurs when the following is true:
      1. Client opens a connection to a streaming endpoint. (under the hood it means that dcos-log would use a golang systemd binding this binding simply opens a journald file(s) on a file system.)
      2. Client will consume the logs
        2.1 Client decides to close the journald and it will release the opened journald file descriptors so no issues here)
        2.2. Client does not close the journald and here comes issues.
      3. The linux system might be configured to rotate journald logs (this is happening on a soak cluster). When this occurs, journald would simple remove the files from the file system without any notification to client obviously
      4. Now, from client perspective there are no new logs coming, (because journald removed the files) so the client is silently waiting for new events holding the FDs opened.
      5. As the numbers of logs grow, eventually the system will be running out of disk space, and the currently connected client will not be getting any logs. Of course if the client closes the journald and opens it again it all will start working until the next rotation happens.
       
      How to fix this:
      journald has a function
      int sd_journal_wait(}}sd_journal *j, uint64_t timeout_usec{{);
       
      It should return one of the following results:
      SD_JOURNAL_NOP is returned, the journal did not change since the last invocation. SD_JOURNAL_APPEND is returned, new entries have been appended to the end of the journal. 
      SD_JOURNAL_INVALIDATE, journal files were added or removed (possibly due to rotation). 
       
      So this seems to be a trivial, we just need to call wait and if we get SD_JOURNAL_INVALIDATE then close the journald and open it again with original parameters. However...
       
      I implemented this code, but i noticed, that for some reason journald would be sending SD_JOURNAL_INVALIDATE every time when we reached the bottom of the journal and a new log entry was appended. Making my logic pretty much useless, because it'd have to close/open journal basically every time the journald is fully read and a new entry added.
       
      Then i found this bug fix and finally it was merged in upstream systemd on April 2017.
      I looked through systemd releases and it looks like version v236 has this fix in.
       
      Our soak cluster runs systemd v219
       
      [root@int-infinityagent10-soak110 dcos-log--3d8342bb425d2876539016fb6250fab021a5fac6]# systemctl --version
      systemd 219
      +PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN
       
      Our coreos image, deployed by CCM runs v231
       
      ip-10-0-7-127 core # systemctl --version
      systemd 231
      +PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS -ACL +XZ -LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN

        Attachments

          Activity

            People

            • Assignee:
              mnaboka Maksym Naboka (Inactive)
              Reporter:
              mnaboka Maksym Naboka (Inactive)
              Team:
              DELETE Cluster Ops Team
              Watchers:
              Kevin Klues (Inactive), Maksym Naboka (Inactive), Marco Monaco
            • Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Zendesk Support

                  NextupJiraPlusStatus

                  Error rendering 'slack.nextup.jira:nextup-jira-plus-status'. Please contact your JIRA administrators.