Why is networking one of the biggest headaches for almost any open-source project? Wherever you look (OpenStack, Docker, Kubernetes…), the number one issue has always been the network layer. I want to believe the reason is that networking has been one of the last layers of the stack to be fully virtualized. Until very recently, it was, literally, a bunch of physical wires connected to a black box (called a switch, hub, or router) at the top of the rack, and a simple tcpdump was more than enough to debug network issues in most cases. We can say goodbye to those good old days, because they are gone… but things are easier than you think.
Kubernetes networking model
If there is only one thing you learn from this post, it should be this: every pod has its own IP address and can communicate with any other pod or host in the cluster. With plain container deployments, you need to link the different containers and keep track of all the mappings between host ports and container services: a total nightmare once you start to scale. Orchestration technologies like Kubernetes, Docker Swarm, or Mesos make this step really simple.
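To make the contrast concrete, here is a minimal sketch (container and pod names, and the pod IP, are made up for illustration): with plain Docker you juggle host-port mappings by hand, while in Kubernetes every pod is directly reachable at its own IP.

# Plain Docker: each container service must be published on a unique host port,
# and you have to track every host:container mapping yourself.
docker run -d -p 8080:80 --name web-1 httpd
docker run -d -p 8081:80 --name web-2 httpd   # 8080 is already taken...

# Kubernetes: every pod gets its own IP, reachable from any pod or node,
# so the container just listens on its natural port (80).
kubectl get pod httpd-1234 -o jsonpath='{.status.podIP}'   # hypothetical pod name
curl http://10.244.1.12/                                   # works from any node or pod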
Our Deployment
So, with all this in mind, how does this really work internally, at the networking layer? I’m not going to go into the details of how to deploy a Kubernetes cluster, but I’ll be using my Raspberry Pi cluster (k8s RPi cluster) with HypriotOS (a Debian/Raspbian derivative distribution) + kubeadm + flannel.
The only “tricky” step in the deployment is that we need a slightly modified deployment file for flannel: use the official arm Docker images instead of the amd64 ones, and use host-gw as the default flannel backend instead of vxlan (we’ll be playing with vxlan and other backends and network plugins in future posts).
curl -sSL https://rawgit.com/coreos/flannel/v0.9.0/Documentation/kube-flannel.yml | sed "s/amd64/arm/g" | sed "s/vxlan/host-gw/" | kubectl apply -f -
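Once applied, you can check that the flannel daemon is running on every node (the app=flannel label comes from the manifest above; the exact output will vary per cluster):

# One flannel pod per node, managed by the DaemonSet from kube-flannel.yml
kubectl get pods -n kube-system -l app=flannel -o wide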
In total, we’ll have 3 hosts: 1 master and 2 nodes.
Every host will get its own address space to assign to its pods. Flannel will be the one responsible for keeping track of which host has which address space, storing that data in the etcd service or via the Kubernetes API.
$ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, podCIDR: .spec.podCIDR}'
{
  "name": "rpi-master",
  "podCIDR": "10.244.0.0/24"
}
{
  "name": "rpi-node-1",
  "podCIDR": "10.244.1.0/24"
}
{
  "name": "rpi-node-2",
  "podCIDR": "10.244.2.0/24"
}
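Those per-node /24 ranges are carved out of a cluster-wide pod CIDR. With kubeadm, that range is the one you hand to it at init time, and it has to match flannel's network (10.244.0.0/16 in kube-flannel.yml). A sketch of how this cluster would have been initialized:

# On the master: the controller-manager then allocates one /24 per node
# out of this /16, which is what shows up as .spec.podCIDR above.
kubeadm init --pod-network-cidr=10.244.0.0/16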
Finally, to add some load and pods to our cluster, an httpd service will be deployed:
kubectl run httpd --image=httpd --replicas=3 --port=80
Basic Topology
With 3 replicas, the pods will be distributed between the two nodes, and each of them will get its own unique IP address.
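You can verify both claims directly (pod names, IPs, and the trimmed output below are illustrative; they will differ on your cluster):

$ kubectl get pods -o wide
NAME                     READY   STATUS    IP           NODE
httpd-7bc8c67f9d-4kx2p   1/1     Running   10.244.1.2   rpi-node-1
httpd-7bc8c67f9d-9zvqj   1/1     Running   10.244.2.2   rpi-node-2
httpd-7bc8c67f9d-tw6r8   1/1     Running   10.244.1.3   rpi-node-1

# And pods are directly reachable by their IP from any node, no port mappings:
root@rpi-node-2:/home/pirate# curl -s http://10.244.1.2/
<html><body><h1>It works!</h1></body></html>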
With the host-gw flannel backend, the network configuration is straightforward: everything is transparent, it is all done via plain IP routes, and all the routes are managed by the flannel daemon.
root@rpi-node-1:/home/pirate# ip route
default via 192.168.0.1 dev eth0
10.244.0.0/24 via 192.168.0.101 dev eth0
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1
10.244.2.0/24 via 192.168.0.112 dev eth0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.111
For this to work, all the nodes must be on the same network. For more complex clusters, other backends like vxlan can be used to avoid that limitation, at the expense of performance and complexity. There is no overlay protocol, no NAT, nothing fancy… each node simply acts as the gateway for its own pod subnet, and plain IP routes connect the different networks.
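In other words, what flannel does on each node is conceptually equivalent to running these commands by hand (addresses taken from the route table above):

# On rpi-node-1: send pod traffic for the other nodes' subnets to those nodes,
# which act as gateways for their own pods.
ip route add 10.244.0.0/24 via 192.168.0.101 dev eth0   # rpi-master's pods
ip route add 10.244.2.0/24 via 192.168.0.112 dev eth0   # rpi-node-2's pods
# Local pods (10.244.1.0/24) are reached directly through the cni0 bridge.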
How is this managed in the code?
func (n *network) Run(ctx context.Context) {
    wg := sync.WaitGroup{}

    log.Info("Watching for new subnet leases")
    evts := make(chan []subnet.Event)
    wg.Add(1)
    go func() {
        subnet.WatchLeases(ctx, n.sm, n.lease, evts)
        wg.Done()
    }()

    // Store a list of routes, initialized to capacity of 10.
    n.rl = make([]netlink.Route, 0, 10)
    wg.Add(1)

    // Start a goroutine which periodically checks that the right routes are created
    go func() {
        n.routeCheck(ctx)
        wg.Done()
    }()

    defer wg.Wait()

    for {
        select {
        case evtBatch := <-evts:
            n.handleSubnetEvents(evtBatch)

        case <-ctx.Done():
            return
        }
    }
}
Like everything with this backend, it is all pretty straightforward in flannel. In the main execution thread, we spawn two goroutines: one that watches for network leases and changes (subnet.WatchLeases(), lines 57-60), and another that makes sure we have the desired routes on our system (n.routeCheck(), lines 67-70). Then, the execution enters a loop (lines 75-81) where we wait either for an event to occur (notified by the previously mentioned subnet.WatchLeases() goroutine, through the evts channel) or for ctx.Done(), indicating that the program needs to terminate.
func (n *network) routeCheck(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        case <-time.After(routeCheckRetries * time.Second):
            n.checkSubnetExistInRoutes()
        }
    }
}

func (n *network) checkSubnetExistInRoutes() {
    routeList, err := netlink.RouteList(nil, netlink.FAMILY_V4)
    if err == nil {
        for _, route := range n.rl {
            exist := false
            for _, r := range routeList {
                if r.Dst == nil {
                    continue
                }
                if routeEqual(r, route) {
                    exist = true
                    break
                }
            }
            if !exist {
                if err := netlink.RouteAdd(&route); err != nil {
                    if nerr, ok := err.(net.Error); !ok {
                        log.Errorf("Error recovering route to %v: %v, %v", route.Dst, route.Gw, nerr)
                    }
                    continue
                } else {
                    log.Infof("Route recovered %v : %v", route.Dst, route.Gw)
                }
            }
        }
    } else {
        log.Errorf("Error fetching route list. Will automatically retry: %v", err)
    }
}
The routeCheck goroutine is also quite simple: every 10 seconds (line 181) it calls n.checkSubnetExistInRoutes() and confirms that the desired routes are present on the host. If one is missing, it just tries to add it back (line 202). All the real magic happens in n.handleSubnetEvents(evtBatch) (line 77), so I consider this check a simple safety net in case something goes really wrong (an external program or a BOFH deleting routes, maybe).
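You can actually watch this safety net in action: delete one of the flannel-managed routes by hand, and it should reappear within the 10-second check interval (a sketch, using the routes from our cluster; the timing of the recovery may vary slightly):

root@rpi-node-1:/home/pirate# ip route del 10.244.2.0/24 via 192.168.0.112
root@rpi-node-1:/home/pirate# sleep 10; ip route | grep 10.244.2.0
10.244.2.0/24 via 192.168.0.112 dev eth0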
func (n *network) handleSubnetEvents(batch []subnet.Event) {
    ...
    case subnet.EventAdded:
        ...
            n.addToRouteList(route)
        } else if err := netlink.RouteAdd(&route); err != nil {
        ...
    case subnet.EventRemoved:
        ...
        n.removeFromRouteList(route)
        if err := netlink.RouteDel(&route); err != nil {
        ...
Full code on GitHub.
We have two different, self-explanatory types of events: EventAdded and EventRemoved. After some checking, e.g., that the route exists (or not), we either add the route to the host and to the list of expected routes (the one used by the previously mentioned routeCheck(), line 176), or remove it from both places.
FORWARD default DROP policy
With the release of Docker 1.13 (+ info), the default behavior of the FORWARD chain policy changed: it used to be ACCEPT, but now it is DROP.
root@rpi-node-1:/home/pirate# iptables -L FORWARD
Chain FORWARD (policy DROP)
With this, all the network packets that need to be forwarded by the host (like the ones matching our route 10.244.0.0/24 via 192.168.0.101 dev eth0) will be dropped by default if no other rule matches them (the most likely case). This “small” change has bitten a lot of upgrades and new deployments. The issue affects the host-gw and vxlan flannel backends, among others. Fortunately, there are several bug reports about it, and a fix is being tested and should be available soon:
https://github.com/coreos/flannel/pull/872
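Until that fix lands, a common workaround is to restore the forwarding behavior yourself, either globally or just for the pod network (a sketch; 10.244.0.0/16 is our cluster's pod CIDR, adjust it to yours, and note these rules are not persistent across reboots):

# Option 1: revert to the pre-Docker-1.13 behavior (run on every node)
iptables -P FORWARD ACCEPT

# Option 2: be more restrictive and only allow forwarding of pod traffic
iptables -A FORWARD -s 10.244.0.0/16 -j ACCEPT
iptables -A FORWARD -d 10.244.0.0/16 -j ACCEPT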
And now what…
For the next posts of this series, I plan to talk about other aspects of k8s networking internals like services, ingress, and load balancers; other flannel backends like VXLAN; alternatives to flannel and VXLAN like Geneve; and even CNI, the specification and library that allows us to switch between different networking plugins easily.
Stay tuned!