Troubleshooting commands

Big picture

Use command line tools to get status and troubleshoot.

Note: The examples on this page use the calico-system namespace, which applies to operator-based installs. If you installed Calico using manifests, use the kube-system namespace instead.

See Calico architecture and components for an overview of each component.

Hosts

Verify number of nodes in a cluster

kubectl get nodes

NAME           STATUS   ROLES    AGE   VERSION
ip-10-0-0-10   Ready    master   27h   v1.18.0
ip-10-0-0-11   Ready    <none>   27h   v1.18.0
ip-10-0-0-12   Ready    <none>   27h   v1.18.0
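
If a node is not in the Ready state, describe it to check its conditions and recent events. For example, using one of the node names above:

kubectl describe node ip-10-0-0-11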

Verify calico-node pods are running on every node, and are in a healthy state

kubectl get pods -n calico-system -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP             NODE           
calico-node-77zgj           1/1     Running   0          27h   10.0.0.10      ip-10-0-0-10   
calico-node-nz8k2           1/1     Running   0          27h   10.0.0.11      ip-10-0-0-11
calico-node-7trv7           1/1     Running   0          27h   10.0.0.12      ip-10-0-0-12 
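
If a calico-node pod is not Running or not ready, inspect it and check its logs. For example, using the first pod name above:

kubectl describe pod calico-node-77zgj -n calico-system
kubectl logs calico-node-77zgj -n calico-system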

Exec into a pod for further troubleshooting

kubectl run multitool --image=praqma/network-multitool 

kubectl exec -it multitool -- bash
bash-5.0# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=97 time=6.61 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=97 time=6.64 ms
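
When you are finished troubleshooting, remove the test pod:

kubectl delete pod multitool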

Collect Calico diagnostic logs

sudo calicoctl node diags
Collecting diagnostics
Using temp dir: /tmp/calico194224816
Dumping netstat
Dumping routes (IPv4)
Dumping routes (IPv6)
Dumping interface info (IPv4)
Dumping interface info (IPv6)
Dumping iptables (IPv4)
Dumping iptables (IPv6)

Diags saved to /tmp/calico194224816/diags-20201127_010117.tar.gz
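
To inspect the collected files, extract the archive using the path printed above:

tar -xzf /tmp/calico194224816/diags-20201127_010117.tar.gz -C /tmp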

Kubernetes

Verify all pods are running

kubectl get pods -A
NAMESPACE         NAME                                       READY   STATUS    RESTARTS   AGE
kube-system       coredns-66bff467f8-dxbtl                   1/1     Running   0          27h
kube-system       coredns-66bff467f8-n95vq                   1/1     Running   0          27h
kube-system       etcd-ip-10-0-0-10                          1/1     Running   0          27h
kube-system       kube-apiserver-ip-10-0-0-10                1/1     Running   0          27h

Verify Kubernetes API server is running

kubectl cluster-info
Kubernetes master is running at https://10.0.0.10:6443
KubeDNS is running at https://10.0.0.10:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

Verify Kubernetes kube-dns is working

kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.49.0.1    <none>        443/TCP   2d2h
kubectl exec -it multitool -- bash
bash-5.0# curl -I -k https://kubernetes
HTTP/2 403 
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
content-length: 234
bash-5.0# nslookup google.com
Server:         10.49.0.10
Address:        10.49.0.10#53
Non-authoritative answer:
Name:   google.com
Address: 172.217.14.238
Name:   google.com
Address: 2607:f8b0:400a:804::200e
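
From the same pod, you can also confirm that cluster DNS resolves internal service names, not just external ones:

bash-5.0# nslookup kubernetes.default.svc.cluster.local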

Verify that kubelet is running on the node with the correct flags

systemctl status kubelet
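
To see the full command line and flags kubelet was started with, inspect the unit file and the running process:

systemctl cat kubelet
ps -ef | grep kubelet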

If there is a problem, check the journal

journalctl -u kubelet | head

Check the status of other system pods

Look especially at the coredns pods; if they are stuck without an IP address, something is likely wrong with the CNI

kubectl get pod -n kube-system -o wide

If other pods fail, it is likely a different issue; perform normal Kubernetes troubleshooting. For example:

kubectl describe pod kube-scheduler-ip-10-0-1-20.eu-west-1.compute.internal -n kube-system | tail -15

Calico components

View Calico CNI configuration on a node

cat /etc/cni/net.d/10-calico.conflist
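
You can also confirm that the Calico CNI binaries are present in the standard CNI directory on the node:

ls -l /opt/cni/bin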

Verify calicoctl matches cluster

Your calicoctl version must match your cluster's Calico version and installation type.

calicoctl version

For syntax:

calicoctl version --help

Check Tigera operator status

kubectl get tigerastatus
NAME     AVAILABLE   PROGRESSING   DEGRADED   SINCE
calico   True        False         False      27h
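
If any component reports Degraded, the detailed status usually explains why:

kubectl get tigerastatus -o yaml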

Check if operator pod is running

kubectl get pod -n tigera-operator
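
If the installation is stuck progressing, check the operator logs for reconciliation errors:

kubectl logs -n tigera-operator deployment/tigera-operator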

View calico nodes

kubectl get pod -n calico-system -o wide

View Calico installation parameters

kubectl get installation -o yaml
apiVersion: v1
items:
- apiVersion: operator.tigera.io/v1
  kind: Installation
  metadata:
    name: default
  spec:
    calicoNetwork:
      bgp: Enabled
      hostPorts: Enabled
      ipPools:
      - blockSize: 26
        cidr: 10.48.0.0/16
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
      multiInterfaceMode: None
      nodeAddressAutodetectionV4:
        firstFound: true
    cni:
      ipam:
        type: Calico
      type: Calico
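
The operator reconciles changes to this resource, so you can adjust parameters by editing it; the resource is typically named default:

kubectl edit installation default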

Run commands across multiple nodes

export THE_COMMAND_TO_RUN=date && \
for calinode in $(kubectl get pod -o wide -n calico-system | grep calico-node | awk '{print $1}'); do
  echo $calinode; echo "-----"
  kubectl exec -n calico-system $calinode -- $THE_COMMAND_TO_RUN; printf "\n"
done
calico-node-87lpx
-----
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
Thu Apr 28 13:48:06 UTC 2022

calico-node-x5fmm
-----
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
Thu Apr 28 13:48:07 UTC 2022

View pod info

kubectl describe pods <pod_name> -n <namespace>
kubectl describe pods busybox -n default
Events:
  Type    Reason     Age   From                   Message
  ----    ------     ----  ----                   -------
  Normal  Scheduled  21s   default-scheduler      Successfully assigned default/busybox to ip-10-0-0-11
  Normal  Pulling    20s   kubelet, ip-10-0-0-11  Pulling image "busybox"
  Normal  Pulled     19s   kubelet, ip-10-0-0-11  Successfully pulled image "busybox"
  Normal  Created    19s   kubelet, ip-10-0-0-11  Created container busybox
  Normal  Started    18s   kubelet, ip-10-0-0-11  Started container busybox

View logs of a pod

kubectl logs <pod_name> -n <namespace>
kubectl logs busybox -n default

View kubelet logs

journalctl -u kubelet

Routing

Verify routing table on the node

ip route
default via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.10 metric 100 
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.10 
10.0.0.1 dev eth0 proto dhcp scope link src 10.0.0.10 metric 100 
10.48.66.128/26 via 10.0.0.12 dev eth0 proto 80 onlink 
10.48.231.0/26 via 10.0.0.11 dev eth0 proto 80 onlink 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
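
Each local workload gets a /32 route pointing at its cali* interface; to list only those routes:

ip route | grep cali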

Verify BGP peer status

sudo calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+--------------+-------------------+-------+------------+-------------+
| 10.0.0.12    | node-to-node mesh | up    | 2020-11-25 | Established |
| 10.0.0.11    | node-to-node mesh | up    | 2020-11-25 | Established |
+--------------+-------------------+-------+------------+-------------+
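
If a peer is stuck in a state other than Established, verify that the node can reach the peer on the BGP port. For example, assuming netcat is installed on the host:

nc -zv 10.0.0.11 179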

Verify overlay configuration

kubectl get ippools default-ipv4-ippool -o yaml
...
spec:
  ipipMode: Always
  vxlanMode: Never
...
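
To change the encapsulation, you can patch the pool; a sketch, assuming calicoctl is installed and configured for your datastore:

calicoctl patch ippool default-ipv4-ippool -p '{"spec":{"ipipMode":"CrossSubnet"}}'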

Verify BGP-learned routes

ip r | grep bird
192.168.66.128/26 via 10.0.0.12 dev tunl0 proto bird onlink 
192.168.180.192/26 via 10.0.0.10 dev tunl0 proto bird onlink 
blackhole 192.168.231.0/26 proto bird 

Verify BIRD routing table

Note: The BIRD routing table gets pushed to node routing tables.

kubectl exec -it -n calico-system calico-node-8cfc8 -- /bin/bash
[root@ip-10-0-0-11 /]# birdcl
BIRD v0.3.3+birdv1.6.8 ready.
bird> show route
0.0.0.0/0          via 10.0.0.1 on eth0 [kernel1 18:13:33] * (10)
10.0.0.0/24        dev eth0 [direct1 18:13:32] * (240)
10.0.0.1/32        dev eth0 [kernel1 18:13:33] * (10)
10.48.231.2/32     dev calieb874a8ef0b [kernel1 18:13:41] * (10)
10.48.231.1/32     dev caliaeaa173109d [kernel1 18:13:35] * (10)
10.48.231.0/26     blackhole [static1 18:13:32] * (200)
10.48.231.0/32     dev vxlan.calico [direct1 18:13:32] * (240)
10.48.180.192/26   via 10.0.0.10 on eth0 [Mesh_10_0_0_10 18:13:34] * (100/0) [i]
                   via 10.0.0.10 on eth0 [Mesh_10_0_0_12 18:13:41 from 10.0.0.12] (100/0) [i]
                   via 10.0.0.10 on eth0 [kernel1 18:13:33] (10)
10.48.66.128/26    via 10.0.0.12 on eth0 [Mesh_10_0_0_10 18:13:36 from 10.0.0.10] * (100/0) [i]
                   via 10.0.0.12 on eth0 [Mesh_10_0_0_12 18:13:41] (100/0) [i]
                   via 10.0.0.12 on eth0 [kernel1 18:13:36] (10)
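
From the same BIRD shell, you can also check the state of each BGP session:

bird> show protocols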

Capture traffic

For example, to capture ICMP traffic on a workload's host-side cali* interface:

sudo tcpdump -i calicofac0017c3 icmp
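
To find the host-side interface for a specific pod, look at its workload endpoint; the interfaceName field contains the cali* interface name. For example, for pods in the default namespace:

calicoctl get workloadendpoints -n default -o yaml | grep -E 'pod:|interfaceName:'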

Network policy

Verify existing Kubernetes network policies

kubectl get networkpolicy --all-namespaces
NAMESPACE   NAME             POD-SELECTOR   AGE
client      allow-ui         <none>         20m
client      default-deny     <none>         4h51m
stars       allow-ui         <none>         20m
stars       backend-policy   role=backend   20m
stars       default-deny     <none>         4h51m
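
To see the full rules of a policy, describe it. For example, using the default-deny policy above:

kubectl describe networkpolicy default-deny -n client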

Verify existing Calico network policies

calicoctl get networkpolicy --all-namespaces -o wide
NAMESPACE     NAME                         ORDER   SELECTOR                                                       
calico-demo   allow-busybox                50      app == 'porter'                                                
client        knp.default.allow-ui         1000    projectcalico.org/orchestrator == 'k8s'                        
client        knp.default.default-deny     1000    projectcalico.org/orchestrator == 'k8s'                        
stars         knp.default.allow-ui         1000    projectcalico.org/orchestrator == 'k8s'                        
stars         knp.default.backend-policy   1000    projectcalico.org/orchestrator == 'k8s' 
stars         knp.default.default-deny     1000    projectcalico.org/orchestrator == 'k8s'                        

Verify existing Calico global network policies

calicoctl get globalnetworkpolicy -o wide
NAME                  ORDER   SELECTOR
default-app-policy    100
egress-lockdown       600
default-node-policy   100     has(kubernetes.io/hostname)
nodeport-policy       100     has(kubernetes.io/hostname)

Check policy selectors and order

For example,

calicoctl get np -n yaobank -o wide

If you expect a selector to match a pod, verify the pod's IP address and the node it is running on. For example,

kubectl get pod -l app=customer -n yaobank
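
To compare all of the pod's labels against the policy selectors, add --show-labels:

kubectl get pod -l app=customer -n yaobank --show-labels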