Kubernetes networking under the hood (I)

Why is networking one of the biggest headaches of almost any open source project? Wherever you look (OpenStack, Docker, Kubernetes…), the number one issue has always been the network layer. I want to believe the reason is that networking has been one of the last layers of the stack to be fully virtualized. Until very recently, it was, literally, a bunch of physical wires connected to a black box (called a switch, hub or router) at the top of the rack, and a simple tcpdump was more than enough to debug network issues in most cases. We can say goodbye to those good old days, because they are gone… but things are easier than you think.

Kubernetes networking model

If there is only one thing you should learn from this post, it is that every pod has its own IP address and can communicate with any other pod or host in the cluster. With plain container deployments, you need to link the different containers and keep track of all the mappings between host ports and container services: a total nightmare once you start to scale. Orchestration technologies like Kubernetes, Docker Swarm or Mesos make this step really simple.

Our Deployment

So, with all this in mind, how does this really work internally, at the networking layer? I’m not going to go into the details of how to deploy a Kubernetes cluster; I’ll just be using my Raspberry Pi cluster (k8s RPi cluster) with HypriotOS (a Debian/Raspbian derivative distribution) + kubeadm + flannel.


The only “tricky” step in the deployment is that we need a slightly modified deployment file for flannel: use the official arm Docker images instead of the amd64 ones, and use host-gw as the default flannel backend instead of vxlan (we’ll be playing with vxlan and other backends and network plugins in future posts).

curl -sSL https://rawgit.com/coreos/flannel/v0.9.0/Documentation/kube-flannel.yml | sed "s/amd64/arm/g" | sed "s/vxlan/host-gw/" | kubectl apply -f -

In total, we’ll have 3 hosts: 1 master and 2 nodes.


Every host gets its own address space to assign to its pods. Flannel is the one responsible for keeping track of which host owns which address space, by storing that data in the etcd service or through the Kubernetes API.

$ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, podCIDR: .spec.podCIDR}' 
{
 "name": "rpi-master",
 "podCIDR": "10.244.0.0/24"
}
{
 "name": "rpi-node-1",
 "podCIDR": "10.244.1.0/24"
}
{
 "name": "rpi-node-2",
 "podCIDR": "10.244.2.0/24"
}

Finally, to add some load and pods to our cluster, an httpd service will be deployed:

kubectl run httpd --image=httpd --replicas=3 --port=80

Basic Topology

With 3 replicas, the pods are distributed between the two nodes, and each of them gets a unique IP address.


With the host-gw flannel backend, the network configuration is straightforward. Everything is transparent: it is all done via IP routes, and all the routes are managed by the flannel daemon.

root@rpi-node-1:/home/pirate# ip route
default via 192.168.0.1 dev eth0 
10.244.0.0/24 via 192.168.0.101 dev eth0 
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1 
10.244.2.0/24 via 192.168.0.112 dev eth0 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.111

For this to work, all the nodes need to be on the same network. For more complex clusters, other backends like vxlan can be used to avoid that limitation, at the expense of performance and complexity. There is no overlay protocol, no NAT, nothing fancy… just plain routes sending each pod subnet to the node that owns it.

How is this managed in code

func (n *network) Run(ctx context.Context) {
	wg := sync.WaitGroup{}

	log.Info("Watching for new subnet leases")
	evts := make(chan []subnet.Event)
	wg.Add(1)
	go func() {
		subnet.WatchLeases(ctx, n.sm, n.lease, evts)
		wg.Done()
	}()

	// Store a list of routes, initialized to capacity of 10.
	n.rl = make([]netlink.Route, 0, 10)
	wg.Add(1)

	// Start a goroutine which periodically checks that the right routes are created
	go func() {
		n.routeCheck(ctx)
		wg.Done()
	}()

	defer wg.Wait()

	for {
		select {
		case evtBatch := <-evts:
			n.handleSubnetEvents(evtBatch)

		case <-ctx.Done():
			return
		}
	}
}

Like everything with this backend, things are pretty straightforward in flannel. In the main execution thread, we have two goroutines: one that watches for network leases and changes (subnet.WatchLeases()), and another that makes sure we have the desired routes in our system (n.routeCheck()). Then, the execution enters a loop where we wait either for an event to occur (notified by the previously mentioned subnet.WatchLeases() goroutine, through the evts channel) or for ctx.Done(), indicating that the program needs to terminate.

func (n *network) routeCheck(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(routeCheckRetries * time.Second):
			n.checkSubnetExistInRoutes()
		}
	}
}

func (n *network) checkSubnetExistInRoutes() {
	routeList, err := netlink.RouteList(nil, netlink.FAMILY_V4)
	if err == nil {
		for _, route := range n.rl {
			exist := false
			for _, r := range routeList {
				if r.Dst == nil {
					continue
				}
				if routeEqual(r, route) {
					exist = true
					break
				}
			}
			if !exist {
				if err := netlink.RouteAdd(&route); err != nil {
					if nerr, ok := err.(net.Error); !ok {
						log.Errorf("Error recovering route to %v: %v, %v", route.Dst, route.Gw, nerr)
					}
					continue
				} else {
					log.Infof("Route recovered %v : %v", route.Dst, route.Gw)
				}
			}
		}
	} else {
		log.Errorf("Error fetching route list. Will automatically retry: %v", err)
	}
}

The routeCheck goroutine is also quite simple. Every 10 seconds (routeCheckRetries * time.Second) it calls n.checkSubnetExistInRoutes() and confirms that the desired routes are present on the host. If any route is missing, it just tries to add it back. All the real magic happens in n.handleSubnetEvents(evtBatch), so I consider this check a simple safety net in case something goes really bad (an external program or a BOFH deleting routes, maybe).

func (n *network) handleSubnetEvents(batch []subnet.Event) {
...
    case subnet.EventAdded:
    ...
        n.addToRouteList(route)
        } else if err := netlink.RouteAdd(&route); err != nil {
    ...
    case subnet.EventRemoved:
    ...
        n.removeFromRouteList(route)
        if err := netlink.RouteDel(&route); err != nil {
    ...

Full code on Github.
We have two self-explanatory types of events: EventAdded and EventRemoved. After some checking, e.g., that the route exists (or not), we either add the route to the host and to the list of expected routes (the one used by the previously mentioned routeCheck()) or remove it from both places.

FORWARD default DROP policy

With the release of Docker 1.13, there was a change in the default behavior of the FORWARD chain policy: it used to be ACCEPT, but it has changed to DROP.

root@rpi-node-1:/home/pirate# iptables -L FORWARD
Chain FORWARD (policy DROP)

With this, all the network packets that need to be forwarded (like the ones matching our route 10.244.0.0/24 via 192.168.0.101 dev eth0) will be dropped by default if there is no other matching rule (the most likely case). This “small” change has bitten a lot of upgrades and new deployments. The issue affects both the host-gw and vxlan flannel backends, among others. Fortunately, there are several bug reports about it, and a fix is being tested and should be available soon.

https://github.com/coreos/flannel/pull/872

And now what…

In the next posts of this series, I plan to talk about other aspects of k8s networking internals like services, ingress and load balancers, other flannel backends like VXLAN, alternatives to flannel and VXLAN like Geneve, and even CNI, the specification and library that allows us to switch between different networking plugins easily.

Stay tuned!

Some extra commands to partially build kubernetes

Sometimes we are just testing small changes in a command or service and we don’t need to build the whole Kubernetes universe as we did in the previous post, so the alternative is to build only the binary we are playing with. For that, we just need to specify it with the WHAT environment variable, and we have two ways of doing it: compiling it in our own environment with make WHAT=cmd/kubelet, or building it using the Docker images provided by the k8s community. For this last case, we simply prefix any command we want to run with build/run.sh.

build/run.sh

KUBE_RUN_COPY_OUTPUT="${KUBE_RUN_COPY_OUTPUT:-y}"

kube::build::verify_prereqs
kube::build::build_image

if [[ ${KUBE_RUN_COPY_OUTPUT} =~ ^[yY]$ ]]; then
  kube::log::status "Output from this container will be rsynced out upon completion. Set KUBE_RUN_COPY_OUTPUT=n to disable."
else
  kube::log::status "Output from this container will NOT be rsynced out upon completion. Set KUBE_RUN_COPY_OUTPUT=y to enable."
fi

kube::build::run_build_command "$@"

if [[ ${KUBE_RUN_COPY_OUTPUT} =~ ^[yY]$ ]]; then
  kube::build::copy_output
fi

This script is really simple to follow and very similar to the build/release.sh we talked about in the previous post. verify_prereqs and build_image check that we have a running Docker daemon and build the Docker image (using build/build-image/Dockerfile). Next, it runs kube::build::run_build_command with the commands we want to execute inside the Docker image, and finally copies the results to the _output folder.

Simple, clean and easy to follow. We will use build/run.sh not just for building binaries, but also for anything we want to run inside the Docker environment, like testing and verifying. Again, using the WHAT environment variable, we can reduce the scope of the command to just the part of the project we are working on.

A walk through Kubernetes build process

Building Kubernetes binaries by hand can be a difficult operation, and you will probably end up with small differences on every new try. Luckily for us, the Kubernetes community provides a set of tools to make things easier and give us reproducible builds on every run, even compared to the official ones. This is possible because the official releases are built using Docker containers, which makes Docker a requirement to build k8s (usually a local installation, but it could also be a remote one).

Before going deep into the build process, it is worth noticing that the kube-build Docker image is built from build/build-image/Dockerfile and that we will have 3 different containers using that image during the build process: the “build” and “rsync” containers are used for the build action and to transfer data between the container and the host; both will be deleted after every run. The “data” container stores everything necessary to support incremental builds, so it will not be destroyed after each use. All these images and data are stored under the build/ directory, which is also used for testing purposes (out of the scope of this post).

There is some work underway to use Bazel (the open source release of the Blaze project that Google uses internally), but for now, we will be using the almighty make. And as always, everything starts with the Makefile.

Makefile

A simple make help will show us all the available options, from building targets to testing, using Bazel, verifying… (full output at make help). We will focus on make quick-release, but as you will see, almost everything is related, and we will end up decomposing the make all command.

define RELEASE_SKIP_TESTS_HELP_INFO
# Build a release, but skip tests
#
# Args:
#   KUBE_RELEASE_RUN_TESTS: Whether to run tests. Set to 'y' to run tests anyways.
#   KUBE_FASTBUILD: Whether to cross-compile for other architectures. Set to 'true' to do so.
#
# Example:
#   make release-skip-tests
#   make quick-release
endef
.PHONY: release-skip-tests quick-release
ifeq ($(PRINT_HELP),y)
release-skip-tests quick-release:
    @echo "$$RELEASE_SKIP_TESTS_HELP_INFO"
else
release-skip-tests quick-release: KUBE_RELEASE_RUN_TESTS = n
release-skip-tests quick-release: KUBE_FASTBUILD = true
release-skip-tests quick-release:
    build/release.sh
endif

As we can see from the code snippet above, the quick-release target (aka release-skip-tests) is the same as the release one, with two changes: we will not run any tests (KUBE_RELEASE_RUN_TESTS = n), and we will not cross-compile the binaries for other architectures (KUBE_FASTBUILD = true); only linux/amd64 binaries will be created.

build/release.sh

KUBE_ROOT=$(dirname "${BASH_SOURCE}")/..
source "${KUBE_ROOT}/build/common.sh"
source "${KUBE_ROOT}/build/lib/release.sh"

KUBE_RELEASE_RUN_TESTS=${KUBE_RELEASE_RUN_TESTS-y}

kube::build::verify_prereqs
kube::build::build_image
kube::build::run_build_command make cross

if [[ $KUBE_RELEASE_RUN_TESTS =~ ^[yY]$ ]]; then
  kube::build::run_build_command make test
  kube::build::run_build_command make test-integration
fi

kube::build::copy_output

kube::release::package_tarballs

All the kube::build functions are located in the file ${KUBE_ROOT}/build/common.sh. verify_prereqs and build_image will check that we have a running docker and will build the docker image (using build/build-image/Dockerfile) that we need for our purpose.

After running kube::build::run_build_command make cross, we will again skip any tests, copy the results from the container to our host using the rsync container (kube::build::copy_output), and pack the binaries into the _output dir (kube::release::package_tarballs).

build/common.sh

# Run a command in the kube-build image. This assumes that the image has
# already been built.
function kube::build::run_build_command() {
  kube::log::status "Running build command..."
  kube::build::run_build_command_ex "${KUBE_BUILD_CONTAINER_NAME}" -- "$@"
}

# Run a command in the kube-build image. This assumes that the image has
# already been built.
#
# Arguments are in the form of
# <container name> <extra docker args> -- <command>
function kube::build::run_build_command_ex() {
  [[ $# != 0 ]] || { echo "Invalid input - please specify a container name." >&2; return 4; }
  local container_name="${1}"
  shift

  local -a docker_run_opts=(
    "--name=${container_name}"
    "--user=$(id -u):$(id -g)"
    "--hostname=${HOSTNAME}"
    "${DOCKER_MOUNT_ARGS[@]}"
  )

  local detach=false

  [[ $# != 0 ]] || { echo "Invalid input - please specify docker arguments followed by --." >&2; return 4; }
  # Everything before "--" is an arg to docker
  until [ -z "${1-}" ]; do
    if [[ "$1" == "--" ]]; then
      shift
      break
    fi
    docker_run_opts+=("$1")
    if [[ "$1" == "-d" || "$1" == "--detach" ]]; then
      detach=true
    fi
    shift
  done

  # Everything after "--" is the command to run
  [[ $# != 0 ]] || { echo "Invalid input - please specify a command to run." >&2; return 4; }
  local -a cmd=()
  until [ -z "${1-}" ]; do
    cmd+=("$1")
    shift
  done
 
  docker_run_opts+=(
    --env "KUBE_FASTBUILD=${KUBE_FASTBUILD:-false}"
    --env "KUBE_BUILDER_OS=${OSTYPE:-notdetected}"
    --env "KUBE_VERBOSE=${KUBE_VERBOSE}"
    --env "GOFLAGS=${GOFLAGS:-}"
    --env "GOLDFLAGS=${GOLDFLAGS:-}"
    --env "GOGCFLAGS=${GOGCFLAGS:-}"
  )

  # If we have stdin we can run interactive. This allows things like 'shell.sh'
  # to work. However, if we run this way and don't have stdin, then it ends up
  # running in a daemon-ish mode. So if we don't have a stdin, we explicitly
  # attach stderr/stdout but don't bother asking for a tty.

  if [[ -t 0 ]]; then
    docker_run_opts+=(--interactive --tty)
  elif [[ "${detach}" == false ]]; then
    docker_run_opts+=(--attach=stdout --attach=stderr)
  fi

  local -ra docker_cmd=(
    "${DOCKER[@]}" run "${docker_run_opts[@]}" "${KUBE_BUILD_IMAGE}")

  # Clean up container from any previous run
  kube::build::destroy_container "${container_name}"

  "${docker_cmd[@]}" "${cmd[@]}"

  if [[ "${detach}" == false ]]; then
    kube::build::destroy_container "${container_name}"
  fi
}

Inside kube::build::run_build_command, we will call kube::build::run_build_command_ex "${KUBE_BUILD_CONTAINER_NAME}" -- "$@" where KUBE_BUILD_CONTAINER_NAME is the name of the container we got in the previous kube::build::verify_prereqs call and “$@” will be “make cross” in this case.

Then, we gather all the extra options that will be passed to the Docker image (docker_run_opts) and that will form the final docker run command, docker_cmd ("${DOCKER[@]}" run "${docker_run_opts[@]}" "${KUBE_BUILD_IMAGE}").

And with that, plus the cmd to run inside the container (make cross), we have the final command we will execute: "${docker_cmd[@]}" "${cmd[@]}"

From now on, everything will be executed inside the kube-build container, and we are back to the main Makefile (there is an exact copy of our Kubernetes root folder inside the container), but this time we will just execute the cross target.

make cross (inside the kube-build container)

define CROSS_HELP_INFO
# Cross-compile for all platforms
# Use the 'cross-in-a-container' target to cross build when already in
# a container vs. creating a new container to build from (build-image)
# Useful for running in GCB.
#
# Example:
#   make cross
#   make cross-in-a-container
endef
.PHONY: cross cross-in-a-container
ifeq ($(PRINT_HELP),y)
cross cross-in-a-container:
    @echo "$$CROSS_HELP_INFO"
else
cross:
    hack/make-rules/cross.sh
cross-in-a-container: KUBE_OUTPUT_SUBPATH = $(OUT_DIR)/dockerized
cross-in-a-container:
ifeq (,$(wildcard /.dockerenv))
    @echo -e "\nThe 'cross-in-a-container' target can only be used from within a docker container.\n"
else
    hack/make-rules/cross.sh
endif
endif

This is pretty straightforward, just a call to hack/make-rules/cross.sh

hack/make-rules/cross.sh

KUBE_ROOT=$(dirname "${BASH_SOURCE}")/../..
source "${KUBE_ROOT}/hack/lib/init.sh"

# NOTE: Using "${array[*]}" here is correct. [@] becomes distinct words (in
# bash parlance).

make all WHAT="${KUBE_SERVER_TARGETS[*]}" KUBE_BUILD_PLATFORMS="${KUBE_SERVER_PLATFORMS[*]}"

make all WHAT="${KUBE_NODE_TARGETS[*]}" KUBE_BUILD_PLATFORMS="${KUBE_NODE_PLATFORMS[*]}"

make all WHAT="${KUBE_CLIENT_TARGETS[*]}" KUBE_BUILD_PLATFORMS="${KUBE_CLIENT_PLATFORMS[*]}"

make all WHAT="${KUBE_TEST_TARGETS[*]}" KUBE_BUILD_PLATFORMS="${KUBE_TEST_PLATFORMS[*]}"

make all WHAT="${KUBE_TEST_SERVER_TARGETS[*]}" KUBE_BUILD_PLATFORMS="${KUBE_TEST_SERVER_PLATFORMS[*]}"

hack/lib/init.sh sources the file hack/lib/golang.sh, which is responsible for setting all the KUBE_*_TARGETS and KUBE_*_PLATFORMS variables.

...
elif [[ "${KUBE_FASTBUILD:-}" == "true" ]]; then
  readonly KUBE_SERVER_PLATFORMS=(linux/amd64)
  readonly KUBE_NODE_PLATFORMS=(linux/amd64)
...

If you remember, our first call to make quick-release set KUBE_FASTBUILD = true, so we will only compile the linux/amd64 binaries.

The targets (KUBE_SERVER_TARGETS) will also be specified by hack/lib/golang.sh

# The set of server targets that we are only building for Linux
# If you update this list, please also update build/BUILD.
kube::golang::server_targets() {
  local targets=(
    cmd/kube-proxy
    cmd/kube-apiserver
    cmd/kube-controller-manager
    cmd/cloud-controller-manager
    cmd/kubelet
    cmd/kubeadm
    cmd/hyperkube
    vendor/k8s.io/kube-aggregator
    vendor/k8s.io/apiextensions-apiserver
    plugin/cmd/kube-scheduler
  )
  echo "${targets[@]}"
}

And with all that set we call, again, the Makefile with the almighty all target. (Remember that we are now inside the kube-build container)

make all

define ALL_HELP_INFO
# Build code.
#
# Args:
#   WHAT: Directory names to build. If any of these directories has a 'main'
#     package, the build will produce executable files under $(OUT_DIR)/go/bin.
#     If not specified, "everything" will be built.
#   GOFLAGS: Extra flags to pass to 'go' when building.
#   GOLDFLAGS: Extra linking flags passed to 'go' when building.
#   GOGCFLAGS: Additional go compile flags passed to 'go' when building.
#
# Example:
#   make
#   make all
#   make all WHAT=cmd/kubelet GOFLAGS=-v
#   make all GOGCFLAGS="-N -l"
#     Note: Use the -N -l options to disable compiler optimizations and inlining.
#           Using these build options allows you to subsequently use source
#           debugging tools like delve.
endef
.PHONY: all
ifeq ($(PRINT_HELP),y)
all:
    @echo "$$ALL_HELP_INFO"
else
all: generated_files
    hack/make-rules/build.sh $(WHAT)
endif

Again, a simple make entry with the help text and a call to hack/make-rules/build.sh, with the WHAT variable being the KUBE_SERVER_TARGETS provided by hack/lib/golang.sh.

hack/make-rules/build.sh

KUBE_ROOT=$(dirname "${BASH_SOURCE}")/../..
KUBE_VERBOSE="${KUBE_VERBOSE:-1}"
source "${KUBE_ROOT}/hack/lib/init.sh"

kube::golang::build_binaries "$@"
kube::golang::place_bins

And kube::golang::build_binaries is where the Go magic happens: it sets up the environment and some Go flags, builds the kube toolchain (github.com/jteeuwen/go-bindata/go-bindata and hack/cmd/teststale), and finally builds the targets.

Conclusions

As you can see, although it looks complicated at first, everything can be reduced to a simple call to make all with a bunch of environment variables to get the output we want. Making use of a common Docker image, with the same set of tools across the whole community, also helps to eliminate differences between builds, which can be a real nightmare when we need to debug a faulty service.

Running kubernetes on a Raspberry pi cluster

Running Kubernetes on a Raspberry Pi can be a little tricky. There are so many moving parts that the easiest way to do it is to use one of the multiple installation methods available, but… what’s the fun in that?

I just wanted to get my hands dirty, learn all the internals, and craft a Kubernetes cluster all by myself. That is why I started the ansible-k8s project on GitHub [https://github.com/GheRivero/ansible-k8s]. It does not pretend to be a production-ready deployment, just an easy starting point to get a Kubernetes cluster you can play with.

At this moment, there are a couple of missing parts, like better documentation (or any documentation at all) and installing the dashboard… but it is more than enough to follow the playbooks and understand how all the pieces fit together.


On the plus side of this adventure, I just got my first patch approved [https://github.com/kubernetes/contrib/pull/2202] in the Kubernetes project, and it is not going to be the last.