Resolving the Mystery of Zombie Node Services in Kubernetes

Mysteriously, some of our Node services were restarting spontaneously, while others, confoundingly, kept running but were dead to any incoming requests. Let's unravel this enigma, from its discovery to its resolution, in the hope that our experience can assist others in a similar bind.

Mystery Unfolds

Our Node services began displaying some eyebrow-raising behaviour:

  • A sporadic yet notable number of services started restarting without discernible cause.

  • Certain others, while ostensibly operational and listening on their assigned ports, were starkly unresponsive to any requests.

Such erratic conduct directly jeopardized our system's stability and efficiency.

Intertwined Clues and Haphazard Diagnosis

Unraveling this situation wasn't straightforward. Even as we delved deep into logs, configurations, and Kubernetes event streams, the solution remained elusive. It was during this quest that a coworker, somewhat serendipitously, noticed something odd about an affected Node application: despite being unresponsive, it was still actively listening on its designated port.

File permissions seemed to be at the heart of this issue. Piecing together disparate clues from the documentation and combining them with our own observations, we followed the trail to the npm update notifier. Its role? To periodically check for new npm versions and notify users of any updates. Yet herein lay the twist: to keep a record of when it last checked, the notifier attempted to write to a file. Given our Docker setup, where strict write permissions were enforced and images ran as non-root, this operation invariably failed.

Instead of gracefully failing or throwing a conspicuous error, the application did something rather unexpected: it clung to its port, listening but coldly ignoring any incoming requests. In many instances, this behaviour misled Kubernetes into presuming the pod was operational.
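The distinction between "the port is open" and "the service answers" can be sketched in a few lines of Node. This is a minimal illustration, not the check we actually ran: the helper name respondsInTime and the timeout values are hypothetical, and it assumes the service speaks plain HTTP. A zombie process like the one described above would accept the connection and then never reply, so the helper resolves to false.

```javascript
const http = require('node:http');

// Hypothetical helper: resolves true only if the server sends back an HTTP
// response (status 2xx or 3xx) within timeoutMs. An open port is not enough.
function respondsInTime(url, timeoutMs) {
  return new Promise((resolve) => {
    const req = http.get(url, (res) => {
      clearTimeout(timer);
      res.resume(); // drain the body so the socket can be released
      resolve(res.statusCode >= 200 && res.statusCode < 400);
    });
    // Give up when no response arrives in time; destroy() emits 'error'.
    const timer = setTimeout(() => req.destroy(new Error('timed out')), timeoutMs);
    req.on('error', () => {
      clearTimeout(timer);
      resolve(false);
    });
  });
}

// Example usage (URL and timeout are placeholders):
// respondsInTime('http://127.0.0.1:4000/', 2000).then((ok) => console.log(ok));
```

A check along these lines catches the listening-but-silent state, whereas a check that only confirms the port is open would still pass.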

To validate our growing suspicions, we tinkered with the notifier's status file. By artificially adjusting the file's creation date, a pattern emerged: the system faltered whenever this status file aged past seven days.
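The experiment can be reproduced with a short Node script. This is a sketch under stated assumptions: it uses the file's modification time as the age signal (our own tinkering adjusted the file's date by hand, and the exact timestamp the notifier consults is not confirmed here), the helper names isStale and backdate are hypothetical, and the path to the status file is left as a parameter rather than guessed.

```javascript
const fs = require('node:fs');

const DAY_MS = 24 * 60 * 60 * 1000;
const SEVEN_DAYS_MS = 7 * DAY_MS;

// True when the file's modification time lies more than maxAgeMs in the past.
function isStale(file, maxAgeMs = SEVEN_DAYS_MS, now = Date.now()) {
  return now - fs.statSync(file).mtimeMs > maxAgeMs;
}

// Push the file's access and modification times `days` days into the past,
// so the seven-day condition can be triggered without waiting a week.
function backdate(file, days) {
  const past = new Date(Date.now() - days * DAY_MS);
  fs.utimesSync(file, past, past);
}

// Example usage (statusFile is a placeholder for the notifier's status file):
// backdate(statusFile, 8);
// console.log(isStale(statusFile));
```

Backdating the file by eight days and then starting the service is enough to test whether the failure follows the file's age.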

Original Dockerfile

FROM node:16-alpine AS base

ARG NPM_TOKEN
ENV HOME=/home/node
WORKDIR $HOME/app
ENV NO_UPDATE_NOTIFIER=true
ENV NPM_CONFIG_PREFIX=$HOME/.npm-global
ENV NODE_PATH=$NPM_CONFIG_PREFIX/lib/node_modules

RUN adduser node root
RUN chgrp -R 0 $HOME/app && chmod -R g=u $HOME/app

COPY package.json package-lock.json $HOME/app/

FROM base AS dependencies
RUN apk add --update --no-cache \
  g++ make python3 \
  nss
USER node
RUN touch $HOME/app/.npmrc && \
  mkdir $HOME/.npm-global $HOME/app/node_modules
RUN npm -v
RUN npm set progress=false && npm config set depth 0
RUN npm ci

FROM base AS release
USER node
RUN mkdir $HOME/.npm-global && \
  mkdir -p $HOME/.npm && \
  chmod -R g+rwx $HOME/.npm && \
  chown -R node:root $HOME/.npm $HOME/.npm-global

COPY --from=dependencies $HOME/app/node_modules ./node_modules
COPY . .

EXPOSE 4000
CMD npm run start

Decoding Deprecated Flags

With our problem cornered, the solution seemed imminent. In past iterations, we'd silenced the update notifier by setting the NO_UPDATE_NOTIFIER environment variable. Yet npm 7 threw a wrench in our plans by no longer honoring it, a change subtly tucked away in the release notes.

Grand Solution

Our diligence led us to a lifeline: the NPM_CONFIG_UPDATE_NOTIFIER environment variable, which sets npm's update-notifier config option. Setting it to false disarmed the update notifier and its contentious disk write attempts.

Recognizing the scale of our challenge:

  • We armed our teams with guidance and context, catering to their varied Docker proficiency, so they could fix the issue in their own services.

  • We tracked adoption of the new setting across our services.
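Tracking adoption can be as simple as scanning each service's Dockerfile for the new setting. The sketch below is illustrative only and is not the tooling we used: the function names are hypothetical, it expects the Dockerfile contents as strings keyed by service name, and it assumes the variable is set on a single uppercase ENV line (line continuations and variables injected at deploy time are not handled).

```javascript
// True when some ENV line sets NPM_CONFIG_UPDATE_NOTIFIER to false, in either
// the `ENV KEY=value` form or the legacy `ENV KEY value` form.
function disablesUpdateNotifier(dockerfileText) {
  const pattern =
    /^\s*ENV\s+(?:.*\s)?NPM_CONFIG_UPDATE_NOTIFIER(?:=|\s+)"?false"?(?:\s|$)/;
  return dockerfileText.split(/\r?\n/).some((line) => pattern.test(line));
}

// Summarise which services still need the fix.
// dockerfilesByService: { [serviceName]: dockerfileText }
function adoptionReport(dockerfilesByService) {
  const names = Object.keys(dockerfilesByService);
  const pending = names.filter(
    (name) => !disablesUpdateNotifier(dockerfilesByService[name])
  );
  return { total: names.length, pending };
}

// Example usage with placeholder service names:
// adoptionReport({
//   billing: 'FROM node:20\nENV NPM_CONFIG_UPDATE_NOTIFIER=false\n',
//   search: 'FROM node:16-alpine\nENV NO_UPDATE_NOTIFIER=true\n',
// });
```

The pending list then doubles as a to-do list for the teams that have yet to roll out the change.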

Modified Dockerfile

FROM node:20 AS base

ARG NPM_TOKEN
ENV HOME=/home/node
ENV NO_UPDATE_NOTIFIER=true
ENV NPM_CONFIG_UPDATE_NOTIFIER=false
ENV NPM_CONFIG_PREFIX=$HOME/.npm-global
ENV NPM_CONFIG_SCRIPT_SHELL=/bin/bash
ENV NPM_CONFIG_DEPTH=0
ENV NODE_PATH=$NPM_CONFIG_PREFIX/lib/node_modules

WORKDIR $HOME/app

RUN adduser node root
RUN chgrp -R 0 $HOME/app && chmod -R g=u $HOME/app

FROM base AS dependencies
USER node
COPY package.json package-lock.json $HOME/app/
RUN npm set progress=false
RUN npm ci

FROM base AS release
USER node
WORKDIR $HOME/app
COPY --from=dependencies $HOME/app/node_modules ./node_modules
COPY --chown=node:root . .

EXPOSE 4000
CMD ["node", "src/index.js"]

Reflective Conclusion

Our takeaways?

  • Even subtle changes in third-party utilities can spiral into monumental challenges.

  • Release notes, no matter how mundane, deserve meticulous scrutiny.

  • Ensuring robust security, as we did with our Docker configurations, can sometimes spotlight lurking issues that might otherwise remain camouflaged.

To our peers navigating the vast seas of Node, Docker, and Kubernetes, remember the little savior: NPM_CONFIG_UPDATE_NOTIFIER. It may be diminutive, but it’s powerful enough to ward off an avalanche of issues!

Here's to fewer mysteries and more seamless coding!