Deploy 5 services with dependencies?

  • Determine the deploy order from the service dependency graph.
  • Chain the deploy scripts in that order: deploy a service, wait for its health check to pass and its status to be DONE, then deploy the next service.
  • If the 5 services run as 5 containers on the same instance: with Helm on Kubernetes, initialization order can be controlled (for example with init containers that wait for a dependency); with ECS, use dependsOn between containers in the task definition.
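
A minimal sketch of the ordering and wait-for-health steps above, assuming Python 3.9+ (for graphlib). The service names and the deploy / is_healthy callbacks are hypothetical placeholders for real deploy scripts and health checks, not part of any specific tool.

```python
import time
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Hypothetical example: each service maps to the services it depends on.
SERVICES: Dict[str, Set[str]] = {
    "db": set(),
    "cache": set(),
    "auth": {"db"},
    "api": {"db", "cache", "auth"},
    "web": {"api"},
}


def deploy_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the services so that every dependency comes before its dependents.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


def deploy_all(
    dependencies: Dict[str, Set[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
) -> List[str]:
    """Deploy one service at a time, waiting for each to become healthy."""
    deployed: List[str] = []
    for service in deploy_order(dependencies):
        deploy(service)
        deadline = time.monotonic() + timeout_s
        # Poll the health check until it passes or the timeout expires.
        while not is_healthy(service):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{service} did not become healthy in time")
            time.sleep(poll_s)
        deployed.append(service)
    return deployed
```

Failing fast on a timeout leaves the remaining services untouched, so a broken dependency does not cascade into later deploys.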

OOMKilled on prod

  • Make sure cluster autoscaling works so there is enough node capacity to schedule more pods.
  • Check the pod's memory usage and raise its memory limit if the limit is too low.
  • Memory leak: restart as a stopgap if needed, or consider rolling back.
  • Check metrics for a spike around the time of the kill.
  • Restart the affected services.
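
To confirm that a pod was really OOM-killed before raising limits, the container status can be inspected. A small sketch, assuming the input is the JSON printed by `kubectl get pod <name> -o json`; the function name is made up for illustration.

```python
from typing import Any, Dict, List


def oom_killed_containers(pod: Dict[str, Any]) -> List[str]:
    """Return the names of containers whose current or last state is OOMKilled."""
    names: List[str] = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        # "state" is the current state; "lastState" holds the previous
        # termination when the container has already been restarted.
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break
    return names
```

One way to use it is to pipe the kubectl output into a script that calls `oom_killed_containers(json.load(sys.stdin))`.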

Check issues on prod

  • Correlate metrics, logs, and traces to locate the issue.
  • On an EC2 instance, use dmesg to check whether the container was killed by the kernel OOM killer.
  • Use journalctl to read system and service logs (journalctl -k shows kernel messages).
  • Check metrics to find spikes.
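
The dmesg check can be scripted. A rough sketch that scans kernel log text for OOM-killer entries; it assumes the common "Killed process <pid> (<name>)" wording, which can differ between kernel versions, so treat the pattern as a starting point.

```python
import re
from typing import List, Tuple

# Assumed message format; adjust if the kernel on the host words it differently.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text: str) -> List[Tuple[int, str]]:
    """Return (pid, process name) for each OOM kill found in the log text."""
    kills: List[Tuple[int, str]] = []
    for line in log_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

The input can be the saved output of dmesg or journalctl -k.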

Chaos tools

  • Chaos Mesh on Kubernetes: kill a random pod to test self-healing.
  • Stress-test sidecar container: consume the pod's RAM to see how the app behaves under memory pressure.
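
A minimal sketch of what such a stress sidecar could run, assuming a plain Python process is acceptable; the function name and the 4 KiB page size are assumptions. It touches every page so the memory is actually resident rather than only reserved.

```python
import time
from typing import List

PAGE_SIZE = 4096            # assumed page size in bytes
CHUNK_BYTES = 1024 * 1024   # allocate in 1 MiB chunks


def consume_memory(target_mb: int, hold_seconds: float = 0.0) -> List[bytearray]:
    """Allocate roughly target_mb MiB, touch every page, then hold it."""
    chunks: List[bytearray] = []
    for _ in range(target_mb):
        chunk = bytearray(CHUNK_BYTES)
        # Write one byte per page so the kernel has to back it with real memory.
        for offset in range(0, CHUNK_BYTES, PAGE_SIZE):
            chunk[offset] = 1
        chunks.append(chunk)
    time.sleep(hold_seconds)
    return chunks
```

Start with a target well below the pod's memory limit and raise it gradually, keeping the returned list referenced so the memory is not freed.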

Process

  • Check whether the issue comes from the app, the DB, or the network.
  • Stop the bleeding first to keep the system up (rollback, restart).
  • Find the root cause and prepare a fix.