Kubernetes networking under the hood (I)

Why is networking one of the biggest headaches for almost any open-source project? Wherever you look, OpenStack, Docker, Kubernetes… the number one issue has always been the network layer. I want to believe the reason is that networking has been one of the last layers of the stack to be fully virtualized. Until very recently it was, literally, a bunch of physical wires connected to a black box (called a switch, or hub, or router) at the top of the rack, and a simple tcpdump was more than enough to debug network issues in most cases. We can say goodbye to those good old days, because they are gone… but things are easier than you think.

Kubernetes networking model

If there is only one thing you learn from this post, it should be that every pod has its own IP address and can communicate with any other pod or host in the cluster. With plain container deployments, you need to link the different containers and keep track of all the mappings between host ports and container services: a total nightmare once you start to scale. Other orchestration technologies, like Docker Swarm or Mesos, make this step equally simple.
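
To see this flat address model in action, once a couple of pods are running you can reach one from the other directly by its pod IP, with no port mapping involved. A quick sketch (pod names and addresses here are illustrative, and wget assumes the pod image ships it, as busybox does):

$ kubectl get pods -o wide
NAME      READY     STATUS    RESTARTS   AGE       IP           NODE
pod-a     1/1       Running   0          1m        10.244.1.2   rpi-node-1
pod-b     1/1       Running   0          1m        10.244.2.2   rpi-node-2
$ kubectl exec pod-a -- wget -qO- http://10.244.2.2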

Our Deployment

So, with all this in mind, how does this actually work internally, at the networking layer? I’m not going to go into the details of how to deploy a Kubernetes cluster, but I’ll be using my Raspberry Pi cluster (k8s RPi cluster) with HypriotOS (a Debian/Raspbian derivative distribution) + kubeadm + flannel.

[Photo: the Raspberry Pi cluster]

The only “tricky” step in the deployment is that we need a slightly modified deployment file for flannel: use the official arm Docker images instead of amd64, and use host-gw as the flannel backend instead of the default vxlan (we’ll be playing with vxlan and other backends and network plugins in future posts):

curl -sSL https://rawgit.com/coreos/flannel/v0.9.0/Documentation/kube-flannel.yml | sed "s/amd64/arm/g" | sed "s/vxlan/host-gw/" | kubectl apply -f -
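
If everything went well, flannel should be running as a daemonset, one pod per node (the app=flannel label comes from the upstream manifest):

$ kubectl -n kube-system get pods -l app=flannel -o wide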

In total, we’ll have 3 hosts: 1 master and 2 nodes.

Every host gets its own address space to assign to its pods. Flannel is the one responsible for keeping track of which host owns which address space, storing that data in the etcd service or through the Kubernetes API:

$ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, podCIDR: .spec.podCIDR}' 
{
 "name": "rpi-master",
 "podCIDR": "10.244.0.0/24"
}
{
 "name": "rpi-node-1",
 "podCIDR": "10.244.1.0/24"
}
{
 "name": "rpi-node-2",
 "podCIDR": "10.244.2.0/24"
}
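
Flannel also writes the lease it acquired for the local node to a file on that node; a quick way to inspect it (the values shown are illustrative for rpi-node-1):

root@rpi-node-1:/home/pirate# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.1.1/24
FLANNEL_MTU=1500
FLANNEL_IPMASQ=true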

Finally, to add some load and pods to our cluster, an httpd service will be deployed:

kubectl run httpd --image=httpd --replicas=3 --port=80

Basic Topology

With 3 replicas, the pods will be distributed between the two nodes, and each of them will get a unique IP address.

[Diagrams: pod placement across the nodes and the resulting network topology]
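
You can verify the placement and the addresses yourself. Pod names and IPs below are illustrative, but note how each pod IP falls inside the podCIDR of the node it runs on:

$ kubectl get pods -o wide
NAME                     READY     STATUS    RESTARTS   AGE       IP           NODE
httpd-1384034406-074dz   1/1       Running   0          1m        10.244.1.2   rpi-node-1
httpd-1384034406-8jq5z   1/1       Running   0          1m        10.244.2.2   rpi-node-2
httpd-1384034406-v9xjk   1/1       Running   0          1m        10.244.2.3   rpi-node-2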

With the host-gw flannel backend, the network configuration is straightforward: everything is transparent, it is all done via IP routes, and those routes are managed by the flannel daemon.

root@rpi-node-1:/home/pirate# ip route
default via 192.168.0.1 dev eth0 
10.244.0.0/24 via 192.168.0.101 dev eth0 
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1 
10.244.2.0/24 via 192.168.0.112 dev eth0 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.111

For this to work, all nodes must be on the same network, since each one needs a direct route to every other. For more complex clusters, other backends like vxlan remove that limitation, at the expense of performance and complexity. Here there is no overlay protocol, no NAT, nothing fancy… just plain IP routing between the nodes.
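
In fact, what flannel programs is nothing you couldn’t add by hand. This would be, roughly, the equivalent command on rpi-node-1 for the route covering rpi-node-2’s pod subnet:

root@rpi-node-1:/home/pirate# ip route add 10.244.2.0/24 via 192.168.0.112 dev eth0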

How is this managed in the code?

func (n *network) Run(ctx context.Context) {
	wg := sync.WaitGroup{}

	log.Info("Watching for new subnet leases")
	evts := make(chan []subnet.Event)
	wg.Add(1)
	go func() {
		subnet.WatchLeases(ctx, n.sm, n.lease, evts)
		wg.Done()
	}()

	// Store a list of routes, initialized to capacity of 10.
	n.rl = make([]netlink.Route, 0, 10)
	wg.Add(1)

	// Start a goroutine which periodically checks that the right routes are created
	go func() {
		n.routeCheck(ctx)
		wg.Done()
	}()

	defer wg.Wait()

	for {
		select {
		case evtBatch := <-evts:
			n.handleSubnetEvents(evtBatch)

		case <-ctx.Done():
			return
		}
	}
}

Like everything with this backend, the code is pretty straightforward. In the main execution thread, we start two goroutines: one that watches for new subnet leases and changes (subnet.WatchLeases()), and another one that makes sure the desired routes are present on the system (n.routeCheck()). Then the execution enters a loop (the final select) where we wait either for an event to occur (notified by the previously mentioned subnet.WatchLeases() goroutine through the evts channel) or for ctx.Done(), indicating that the program needs to terminate.

func (n *network) routeCheck(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(routeCheckRetries * time.Second):
			n.checkSubnetExistInRoutes()
		}
	}
}

func (n *network) checkSubnetExistInRoutes() {
	routeList, err := netlink.RouteList(nil, netlink.FAMILY_V4)
	if err == nil {
		for _, route := range n.rl {
			exist := false
			for _, r := range routeList {
				if r.Dst == nil {
					continue
				}
				if routeEqual(r, route) {
					exist = true
					break
				}
			}
			if !exist {
				if err := netlink.RouteAdd(&route); err != nil {
					if nerr, ok := err.(net.Error); !ok {
						log.Errorf("Error recovering route to %v: %v, %v", route.Dst, route.Gw, nerr)
					}
					continue
				} else {
					log.Infof("Route recovered %v : %v", route.Dst, route.Gw)
				}
			}
		}
	} else {
		log.Errorf("Error fetching route list. Will automatically retry: %v", err)
	}
}

The routeCheck goroutine is also quite simple. Every 10 seconds (routeCheckRetries * time.Second) it calls n.checkSubnetExistInRoutes() and confirms that the desired routes are present on the host; if one is missing, it just tries to add it back with netlink.RouteAdd(). All the real magic happens in n.handleSubnetEvents(evtBatch), so I consider this check a simple safety net just in case something goes really bad (an external program or a BOFH deleting routes, maybe).

func (n *network) handleSubnetEvents(batch []subnet.Event) {
...
    case subnet.EventAdded:
    ...
        n.addToRouteList(route)
        } else if err := netlink.RouteAdd(&route); err != nil {
    ...
    case subnet.EventRemoved:
    ...
        n.removeFromRouteList(route)
        if err := netlink.RouteDel(&route); err != nil {
    ...

Full code on Github.
We have two different types of events, which are self-explanatory: EventAdded and EventRemoved. After some checking, e.g., that the route exists (or not), we just add the route to the host and to the list of expected routes (the one used by the previously mentioned routeCheck()), or remove it from both places.

FORWARD default DROP policy

With the release of Docker 1.13, there was a change in the default behavior of the FORWARD chain policy: it used to be ACCEPT, but it has changed to DROP.

root@rpi-node-1:/home/pirate# iptables -L FORWARD
Chain FORWARD (policy DROP)

With this, all the network packets that need to be forwarded (like the ones matching our route 10.244.0.0/24 via 192.168.0.101 dev eth0) will be dropped by default if there is no other matching rule (the most likely case). This “small” change has bitten a lot of upgrades and new deployments, and the issue affects both the host-gw and vxlan flannel backends, among others. Fortunately, there are several bug reports about it, and a fix is being tested and should be available soon:

https://github.com/coreos/flannel/pull/872
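
In the meantime, a common workaround is to explicitly allow forwarding for the pod network. For this cluster that would look something like this (10.244.0.0/16 is the flannel range used here; adjust it to yours):

root@rpi-node-1:/home/pirate# iptables -A FORWARD -s 10.244.0.0/16 -j ACCEPT
root@rpi-node-1:/home/pirate# iptables -A FORWARD -d 10.244.0.0/16 -j ACCEPT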

And now what…

For the next posts of this series, I plan on talking about other aspects of k8s networking internals like services, ingress, and load balancers; other flannel backends like vxlan; alternatives to flannel and to VXLAN, like Geneve; and even CNI, the specification and library that allows us to change between different networking plugins easily.

Stay tuned!

3 thoughts on “Kubernetes networking under the hood (I)”

  1. Decent article. Might I make one suggestion…

    You should probably remove the comment about Docker networking requiring links and port mappings. That’s such an out of date comment and made me wonder if the rest of the article would also be wrong. And the fact it appeared in paragraph 2 nearly made me stop reading.

    Deploying an overlay network on Docker is a two step process.

    docker swarm init
    docker network create…

    And funnily enough, it’s the same two step process to build a k8s cluster and initialize a Pod overlay.

    I’m only mentioning this because that inaccuracy almost made me stop reading the article. Feel free to delete this comment if you remove the misleading info.

    HTH

  2. Thanks for the info, you are completely right, the comparison is not fair at all. I was thinking more of simple container deployments (just using the Docker engine, or rkt…) than of full suites like Docker Swarm or Mesos… I will reword it to make that clear. (And I promise to learn more about Swarm.)

    ps. I will never remove a comment that is not offensive. This is for learning and sharing. And today we all learn something else thanks to you.
