Get an in-depth introduction to how the author was able to observe, debug, monitor, and perform kernel syscall traces for applications, all within a containerized, multi-tenant runtime environment.

When I first got involved in cloud development at Ericsson eons ago, virtual machines (VMs) were all the rage, and it didn't take me long to realize that the cloud would help unleash and evolve how we build, deploy, and run applications—not just telecom applications and services but just about anything that software touched.


Services in a Linux jail

Now back in my early days, about 14 years ago, I ran my own little server farm hosting my own websites and some services for family and friends. I was running several services such as DNS, DHCP, HTTP, FTP, SSH, and the like on a single machine. To isolate each process, I placed each service in a separate Linux chroot jail, and ran Tripwire on the complete file system to detect unauthorized file access. (As I said, this was a long time ago, when either servers were expensive or I was IT-frugal, so I could not afford to host each service on an individual machine.) Today, none of this is necessary: cloud has brought us a new wave of services, products, streamlined automation of component and system management, as well as open source components and solutions.

The rise of VMs and then containers

Over the last few years, we have seen the rise of VMs as a mainstay of IT cloud runtime environments. But, more recently and more prominently, containers have taken the marketplace by storm, with a huge number of companies providing solutions based around Linux containers, ranging from orchestration and networking to policy control and container deployment to IoT devices. Using these services within the container ecosystem makes life easier and lets you focus on application development, while someone else offers and manages the infrastructure for you.

Yet I wasn't initially enthusiastic about containers: they restricted the control, configuration, and management of my own complete stack. With VMs, I got a slice of physical resources as my own sandbox and could do anything I wanted on the operating system, from loading any kernel module I needed to configuring whatever I deemed necessary to run, deploy, debug, and tune my own stack. Within that environment, I had the full set of tools for my complete stack.

In other words, in the VM sandbox, I had root-privilege access to my host operating system. I needed this not just to touch my Java application, but also to see what was being executed in the kernel. I needed syscall traces to observe at runtime what my application was doing underneath the application container, so that I could squeeze as much performance as possible out of it, as well as do low-level debugging.

As the ecosystem around containers grew, I wanted to get in on the container action, but the lack of privileged access to the host was a deterrent for me. I wanted a way to be able to observe, debug, monitor, and perform kernel syscall traces for my applications, all within a containerized, multi-tenant, runtime environment. Sadly, no existing PaaS or container orchestration system provided that. I started looking around to see what could be done to remove this barrier to adoption, so I could take advantage of containers and their growing automation and plugin ecosystem.

The birth of Cloud Runtime Tools (CRT)

After some discussions with people such as Per Andersson, Jason Hoffman, Peter Hedman, Suresh Krishnan, and Anders Franzen here at Ericsson, I set off, determined to build a service with which I could deploy my application in container environments using our container orchestration platform, Apcera, and remove those barriers of adoption that prevented me from using and benefiting from container environments.

Now the goal was clear: to build a system in which we could observe and debug at runtime, see which of our deployed containers were running using the CRT Geo Map, view the application instances inside those containers, and initiate kernel syscall tracing, all within a multi-tenant environment designed for this from the ground up. We call the resulting system Cloud Runtime Tools, or CRT for short among us IT folks. As the diagram below shows, CRT gives me all of the features mentioned above, and with those features supported by the container platform, I now have the same tools and environment I had in my VM sandbox.

Perfect.

[Diagram: Ericsson hyperscale cloud runtime tools and containers]

Now that we have seen how CRT was born, in my next post we'll take a look at what CRT is, why it's needed, and why I believe lots of different IT folks—from application developers to DevOps teams, to infrastructure providers, to people who just want to learn more about what's going on below their application in a container—will get real benefit from CRT.

To get a sneak peek, you can watch this video in which I talk about Cloud Runtime Tools at the Intel Developer Forum 2016 earlier this year:

Transcript of video

Cloud Runtime Tools

I'm going to give you a brief introduction to the Cloud Runtime Tools […] Let's have a little recap on what the Cloud Runtime suite is. It's a service that we have built specifically for developers, DevOps teams, infrastructure providers, and people who want a better understanding of how their application is deployed, and to monitor it, within a datacentre container environment.

There are a couple of features that we have supported from the ground up. Some of the most important are multi-tenancy, remote debugging, monitoring, observability, and runtime insight for the application, as well as the ability to look, via system call tracing, at what is happening below the application binary and libraries, in the kernel of the host operating system.

What this typically allows us to do is have a look not just inside the application but also inside the host operating system, via syscall traces produced by doing syscall dumps. Typically this is not enabled in a lot of PaaS and container runtime environments, and we see that as a major barrier to adoption for people who want to run in containers.
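To make the idea of a syscall dump a little more concrete, here is a minimal, illustrative Go sketch of the underlying technique. It is not CRT's implementation; it assumes Linux/amd64 (where the syscall number lives in the orig_rax register) and simply launches a command under ptrace, printing the number of every system call the child makes, which is exactly the below-the-binary view described above.

```go
// tracer.go (hypothetical name): run a command under ptrace and print its syscalls.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: tracer <command> [args...]")
		return
	}

	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	// Ask the kernel to stop the child at its first instruction and report to us.
	cmd.SysProcAttr = &syscall.SysProcAttr{Ptrace: true}

	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// The first Wait returns when the child stops on its initial SIGTRAP.
	_ = cmd.Wait()

	pid := cmd.Process.Pid
	var status syscall.WaitStatus
	var regs syscall.PtraceRegs
	for {
		// Resume the child until the next syscall entry or exit
		// (each syscall is therefore reported twice).
		if err := syscall.PtraceSyscall(pid, 0); err != nil {
			break
		}
		if _, err := syscall.Wait4(pid, &status, 0, nil); err != nil || status.Exited() {
			break
		}
		if err := syscall.PtraceGetRegs(pid, &regs); err != nil {
			break
		}
		// On Linux/amd64 the syscall number is carried in orig_rax.
		fmt.Printf("syscall %d\n", regs.Orig_rax)
	}
}
```

Running something like `go run tracer.go /bin/ls` would emit a stream of syscall numbers for the traced process; a real tool would map those numbers to names and arguments, but even this raw dump shows what the application is asking the host kernel to do.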

The biggest problem here is that containers by default do not give you what is called "privileged access", which means you do not have access to the host's root filesystem. We have built a system that allows you to have privileged access in a secure environment.
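As a rough illustration of what the lack of privilege looks like from inside a container, the following sketch (my own example, not CRT code) reads /proc/self/status and reports whether the effective capability set includes CAP_SYS_PTRACE, the Linux capability needed to attach to and trace processes you do not own. In a default, unprivileged container this bit is normally absent, which is why syscall tracing is refused there.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// CAP_SYS_PTRACE is capability bit 19 on Linux; it is required to attach to
// and trace processes you do not own.
const capSysPtrace = 19

func main() {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "CapEff:") {
			continue
		}
		// The effective capability set is a hexadecimal bitmask, one bit per capability.
		hexMask := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
		mask, err := strconv.ParseUint(hexMask, 16, 64)
		if err != nil {
			panic(err)
		}
		if mask&(1<<capSysPtrace) != 0 {
			fmt.Println("CAP_SYS_PTRACE present: tracing other processes is possible")
		} else {
			fmt.Println("CAP_SYS_PTRACE missing: ptrace of other processes will be refused")
		}
		return
	}
}
```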

We have built a couple of agents that allow a developer in a single- or multi-tenant environment to initiate system call tracing, while making sure that, where other tenants or containers are present, the developer can only see the processes and system call traces that are relevant to their own applications within their own container.
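To give a feel for how that kind of per-tenant scoping can work, here is a minimal Go sketch of one possible mechanism. It is purely an assumption on my part and not a description of the CRT agents: it only lists processes that share a PID namespace with a chosen anchor process (for example, a container's init), so processes belonging to other tenants' containers are simply never offered for tracing.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// pidNamespace returns the PID-namespace identity of a process, as exposed by
// the /proc/<pid>/ns/pid symlink (a string such as "pid:[4026531836]").
func pidNamespace(pid int) (string, error) {
	return os.Readlink(fmt.Sprintf("/proc/%d/ns/pid", pid))
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: scope <pid-of-container-init>")
		return
	}
	anchor, err := strconv.Atoi(os.Args[1]) // hypothetical input: the container's init PID
	if err != nil {
		panic(err)
	}
	want, err := pidNamespace(anchor)
	if err != nil {
		panic(err)
	}

	entries, err := os.ReadDir("/proc")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a numeric entry, so not a process directory
		}
		ns, err := pidNamespace(pid)
		if err != nil {
			continue // process exited, or we lack permission to inspect it
		}
		if ns == want {
			// Only processes in the same PID namespace as the anchor are
			// exposed to the tenant; everything else stays invisible.
			fmt.Println("traceable:", pid)
		}
	}
}
```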

The first feature that bears on this is what we call CRT application topology viewing: a graphical interface to the containers we have and the applications instantiated within them. This is integrated in Apcera, and we plan to do the same in Kubernetes and other platforms like Docker, Swarm, and Mesos. This allows a DevOps engineer, an application provider, or an infrastructure provider to see all the containers that they have at runtime. We can see the relationships between those containers for a specific tenant within a specific runtime, and the applications within that container runtime.

A second feature of importance is remote debugging, which is unique among container environments. When someone enables new features in their running code, the Apcera platform takes a clone of that application container; we can then log into the cloned environment to take a look at the code, fix the bug, save the running code, redeploy the container, and in that way fix the problem.

Where the application is not the issue, and the problem is felt to lie lower in the stack, in the server infrastructure, syscall tracing can be enabled through syscall dumps. If the investigator is convinced that the application is not at fault, the syscall traces provide enough data to pinpoint whether the problem actually lies in the host operating system.

Some other features are being worked on: support for historical data and application-based logging. Today we only support the Java Runtime Environment and language, but we plan to support Go and Ruby as two other popular language runtimes. We will then grow the family as the CRT tool becomes more popular.

This is all applicable to IT cloud and also to telco infrastructures; the tool is application agnostic. Previously used tools such as perf and netrace were not designed to be used in a containerised environment. We have taken best-of-breed open source components and developed some additional modules, agents, and APIs so we can support secure tracing at runtime. [end]

Background photo by Deirdre Straughan.



Alan Kavanagh

Alan is a Cloud System Architect working in the Development Unit IT Cloud and has over 16 years' experience in fixed and mobile broadband networks and IT cloud system design. Alan now works in the IT Cloud System and Technology group, where he spends his time designing and building innovative solutions around the IT cloud space, with a focus on big data infrastructure services and cloud platforms. Previously he worked on designing and building innovative solutions in the areas of GPRS, 3G, IP stack and fixed broadband services, OpenStack, NFV, PaaS, big data, and instrumentation. He holds a BA in Computer and Electronic Engineering and a BAI in Mathematics from Trinity College Dublin.
