Thursday, April 08, 2021

Tip: error: failed to load key pair tls: failed to parse private key

 Symptom:

    When we run kubectl create secret tls ..., we hit the error below:

error: failed to load key pair tls: failed to parse private key

Reason:

    It is likely the private key file is encrypted with a passphrase.

   Use openssl to decrypt it and use the new key for kubectl:

openssl rsa -in encrypted-private.key -out unencrypted.key

 Enter pass phrase for ...... 
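Then re-create the secret with the decrypted key. A minimal sketch, assuming placeholder names (my-tls-secret, server.crt, my-namespace):

kubectl create secret tls my-tls-secret --cert=server.crt --key=unencrypted.key -n my-namespace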

 

 

Wednesday, April 07, 2021

Tip: Pods keep hitting CrashLoopBackOff

 Symptom:

 Pods are constantly in CrashLoopBackOff.

"kubectl describe pod..." does not give meaningful info, and neither does "kubectl get events".

Reason:

One likely reason is the pod security policy. In my case, the existing pod security policy did not allow Nginx or Apache to run, because it was missing:

 allowedCapabilities:

  - NET_BIND_SERVICE

  # apache or nginx need escalation to root to function well

  allowPrivilegeEscalation: true


So the pods keep crash-looping. The fix is to add the above to the pod security policy; a fuller sketch follows.
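A minimal sketch of where those fields sit in a PodSecurityPolicy, assuming a hypothetical policy name (web-psp); the remaining rules are illustrative, not my actual policy:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: web-psp                      # hypothetical name
spec:
  privileged: false
  # apache or nginx need escalation to root to function well
  allowPrivilegeEscalation: true
  allowedCapabilities:
    - NET_BIND_SERVICE
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - '*'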


Saturday, April 03, 2021

Tip: Istio TLS secrets, Gateway, VirtualService namespace scope

There is some confusion about where we should put Istio objects: in istio-system or in the users' namespaces?

Here are some tips:

For TLS/mTLS CAs, certs, and keys in Istio, the Kubernetes secrets should be created in istio-system, not in the users' namespaces.

Gateway and VirtualService need to be created in the users' namespace (see the sketch below).
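A minimal sketch, assuming placeholder names and host (my-tls-cert, my-gateway, my-app, myapp.example.com):

# the TLS secret goes into istio-system, next to the ingress gateway
kubectl create secret tls my-tls-cert --cert=server.crt --key=server.key -n istio-system

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
  namespace: my-app            # users' namespace
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: my-tls-cert   # the secret created in istio-system
    hosts:
    - "myapp.example.com"

The matching VirtualService also lives in the users' namespace and binds to this Gateway via spec.gateways.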

Tuesday, March 09, 2021

How to find which type of VM pods are running on via PromQL

Requirement:

     Users need to know which type of VM their pods are running on, e.g. to verify that pods are running on GPU VMs.

Solution:

In Prometheus, we have 2 metrics: kube_pod_info{} and kube_node_labels{}

kube_node_labels often carries a label that tells which type of VM the node is.

We can join these 2 metrics on "node" to produce a report for users:

sum( kube_pod_info{}) by(pod,node) *on(node) group_left(label_beta_kubernetes_io_instance_type) sum(kube_node_labels{}) by (node,label_beta_kubernetes_io_instance_type)

Please refer to the official PromQL docs.

Tip: call it through the Grafana API:

curl -g -k -H "Authorization: Bearer ******" https://grafana.testtest.com/api/datasources/proxy/1/api/v1/query?query=sum\(kube_pod_info{}\)by\(pod,node\)*on\(node\)group_left\(label_beta_kubernetes_io_instance_type\)sum\(kube_node_labels{}\)by\(node,label_beta_kubernetes_io_instance_type\)
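Alternatively, letting curl do the URL encoding avoids all of the backslash escaping (token and hostname are placeholders, as above):

curl -k -G -H "Authorization: Bearer ******" --data-urlencode 'query=sum(kube_pod_info{}) by (pod,node) * on(node) group_left(label_beta_kubernetes_io_instance_type) sum(kube_node_labels{}) by (node,label_beta_kubernetes_io_instance_type)' https://grafana.testtest.com/api/datasources/proxy/1/api/v1/query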

Also refer to my blog post below on how to convert PromQL into a Grafana API call.

Monday, March 08, 2021

How to convert PromQL into Grafana API call

Requirement:

     We use PromQL to fetch some metadata of a Kubernetes cluster, e.g. the existing namespaces:

sum(kube_pod_info) by (namespace)

We would like to convert it to a Grafana API call so that other apps can consume this metadata.

Solution:

  • First, we need to generate an API token. Refer to the Grafana docs.
  • Second, below is a curl example to consume it:
curl -k -H "Authorization: Bearer e*****dfwefwef0=" https://grafana-test.testtest.com/api/datasources/proxy/1/api/v1/query?query=sum\(kube_pod_info\)by\(namespace\)
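The response is standard Prometheus JSON; assuming jq is available, the namespace names can be pulled out like this:

curl -sk -G -H "Authorization: Bearer e*****dfwefwef0=" --data-urlencode 'query=sum(kube_pod_info) by (namespace)' https://grafana-test.testtest.com/api/datasources/proxy/1/api/v1/query | jq -r '.data.result[].metric.namespace'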

Thursday, February 25, 2021

Istio install against different Docker Repos

Requirement:

       istioctl ships with built-in manifests. However, those manifests reference Docker images that may not be accessible from the corporate network, or users may use a Docker repo other than docker.io. How do we install it then?

Solution:

  • istioctl manifest generate --set profile=demo > istio_generate_manifests_demo.yaml
  • find the Docker image paths in the YAML, then download and upload those images to your internal Docker repo (see the grep/sed sketch after this list)
  • edit the file so the image paths point to your internal Docker repo
  • kubectl apply -f istio_generate_manifests_demo.yaml
  • istioctl verify-install -f istio_generate_manifests_demo.yaml
  • to purge the deployment:
    • istioctl x uninstall --purge
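A quick way to find and rewrite the image references; the internal registry hostname is a placeholder, and the sed pattern should match whatever image prefix appears in your generated manifest:

grep "image:" istio_generate_manifests_demo.yaml | sort -u

sed -i 's|docker.io/istio|registry.mycorp.local/istio|g' istio_generate_manifests_demo.yaml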

Tuesday, February 16, 2021

Tip: Pod FQDN in Kubernetes

Pods from a Deployment, StatefulSet, or DaemonSet exposed by a Service:

FQDN is  pod-ip-address.svc-name.my-namespace.svc.cluster.local

e.g. 172-12-32-12.test-svc.test-namespace.svc.cluster.local

not 172.12.32.12.test-svc.test-namespace.svc.cluster.local

Isolated Pods:

FQDN is  pod-ip-address.my-namespace.pod.cluster.local

e.g. 172-12-32-12.test-namespace.pod.cluster.local
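A quick way to verify the FQDN form from inside the cluster (busybox image; the address is just the example above):

kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup 172-12-32-12.test-svc.test-namespace.svc.cluster.local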

Wednesday, February 03, 2021

Tip: Kubernetes intermittent DNS issues of pods

 Symptom:

     The pods get "unknown name" or "no such host" for an external domain name, e.g. test.testcorp.com.

The issues are intermittent.

Actions:

  • Follow the k8s DNS debugging guide and check that all DNS pods are running well.
  • One possible reason is that one or more of the nameservers in /etc/resolv.conf on the hosts cannot resolve the DNS name test.testcorp.com
    • e.g. *.testcorp.com is a corp intranet name and must be resolved by corp name servers; however, in a normal cloud VM setup we have nameserver 169.254.169.254 in /etc/resolv.conf, and 169.254.169.254 knows nothing about *.testcorp.com, thus we get intermittent issues
    • To solve this, we need to update the DHCP server so that 169.254.169.254 is removed from /etc/resolv.conf, then restart CoreDNS:
    • kubectl rollout restart deployment coredns -n kube-system
  • Another possible reason is that some nodes have network issues, so the DNS pods on them are not functioning well. Use the commands below to test the DNS pods (they only print the checks; a one-liner that actually runs them follows).

kubectl -n kube-system get po -owide|grep coredns |awk '{print $6 }' > /tmp/1.txt

cat /tmp/1.txt  | while read -r line; do echo $line | awk '{print "curl -v --connect-timeout 10 telnet://"$1":53", "\n"}'; done
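The lines above only print the curl commands; a one-liner sketch that actually runs the connectivity check against each CoreDNS pod IP:

kubectl -n kube-system get po -owide | grep coredns | awk '{print $6}' | while read -r ip; do curl -v --connect-timeout 10 "telnet://${ip}:53"; done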
  • Enable debug logging of the DNS pods per the k8s guide
  • Test the DNS and tail all DNS pods with kubectl to get the debug info:
kubectl -n kube-system logs -f deployment/coredns --all-containers=true --since=1m |grep testcorp

  • You may get logs like:

[INFO] 10.244.2.151:43653 - 48702 "AAAA IN test.testcorp.com.default.svc.cluster.local. udp 78 false 512" NXDOMAIN qr,aa,rd 171 0.000300408s

[INFO] 10.244.2.151:43653 - 64047 "A IN test.testcorp.com.default.svc.cluster.local. udp 78 false 512" NXDOMAIN qr,aa,rd 171 0.000392158s 

  • The pod's /etc/resolv.conf has "options ndots:5", which may impact external domain DNS resolution. Using a fully qualified name can mitigate the issue: test.testcorp.com --> test.testcorp.com. (there is a "." at the end)
  • Disable CoreDNS AAAA (IPv6) queries. This reduces NXDOMAIN (not found) responses and thus the failure rate returned to the DNS client.
    • Add the line below to the CoreDNS config file (refer to the coredns rewrite plugin docs; see the Corefile sketch below)
    • rewrite stop type AAAA A
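A minimal Corefile sketch showing where the rewrite line sits; the other plugins should stay as in your existing CoreDNS ConfigMap:

.:53 {
    errors
    health
    rewrite stop type AAAA A
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
}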
  • Install NodeLocal DNSCache to speed up DNS queries. Refer to the Kubernetes doc.
  • Run dig test.testcorp.com +all many times; it will show the authority section:
;; AUTHORITY SECTION:
test.com.     4878    IN      NS      dnsmaster1.test.com.
test.com.     4878    IN      NS      dnsmaster5.test.com.
    • to find out which DNS server times out
  • Add the parameters below to /etc/resolv.conf to improve DNS query performance:
    • options single-request-reopen (refer to the resolv.conf manual)
    • options single-request (refer to the resolv.conf manual)
  • Another solution is to use an external name:

    apiVersion: v1
    kind: Service
    metadata:
      name: test-stage
      namespace: default
    spec:
      externalName: test-stage.testcorp.com
      ports:
      - port: 636
        protocol: TCP
        targetPort: 636
      type: ExternalName
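With this in place, pods can use the in-cluster name and CoreDNS returns a CNAME to the external host. A quick check from inside any pod (names as in the example above):

nslookup test-stage.default.svc.cluster.local   # should return a CNAME to test-stage.testcorp.com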

Tuesday, February 02, 2021

Tip: A command to get all resources and subresources in a Kubernetes cluster

 list=($(kubectl get --raw / | jq -r '.paths[] | select(. | startswith("/api"))')); for tgt in ${list[@]}; do aruyo=$(kubectl get --raw ${tgt} | jq .resources); if [ "x${aruyo}" != "xnull" ]; then echo; echo "===${tgt}==="; kubectl get --raw ${tgt} | jq -r ".resources[] | .name,.verbs"; fi; done
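For a quicker (though less detailed) view, kubectl can list the resource types and their supported verbs directly:

kubectl api-resources -o wide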

Tip: Use oci cli to reboot a VM

oci compute instance action --action SOFTRESET --region us-ashburn-1 --instance-id  <instance id you can get from kubectl describe node>

oci compute instance get  --region us-ashburn-1 --instance-id  <instance id you can get from kubectl describe node>

Sometimes you may get a 404 error if you omit "--region us-ashburn-1".
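The instance OCID can also be pulled straight from the node object; on OKE it is embedded in the provider ID (a sketch, node name is a placeholder):

kubectl get node <node-name> -o jsonpath='{.spec.providerID}'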

Tip: Collect console serial logs in Oracle Cloud Infrastructure

oci compute console-history capture   --region us-ashburn-1 --instance-id <instance-ocid>

--> oci compute console-history get  --region us-ashburn-1 --instance-console-history-id <OCID from the command before> 

--> oci compute console-history get-content --region us-ashburn-1  --length 1000000000 --file /tmp/logfile.txt --instance-console-history-id <OCID from the command before>


Tuesday, January 05, 2021

Tip: Change default storageclass in Kubernetes

The example below is for OKE (Oracle Kubernetes Engine); the same concept applies to other Kubernetes distributions.

Change default storageclass from oci to oci-bv:

kubectl patch storageclass oci -p '{"metadata": {"annotations":{"storageclass.beta.kubernetes.io/is-default-class":"false"}}}'


kubectl patch storageclass oci-bv -p '{"metadata": {"annotations":{"storageclass.beta.kubernetes.io/is-default-class":"true"}}}'
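To verify which storage class is now the default:

kubectl get storageclass   # the default class is marked "(default)" next to its name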