Accessing RStudio behind Multiple Proxies

Here at Arrikto our customers quite often present us with very interesting issues that can affect a lot of parts of the stack in a cloud native platform.

In this blog post we aim to expose the problems that a customer bumped into when accessing a web server that parses the X-Forwarded-* headers (in this particular case RStudio) when it is served behind multiple proxies.

The bugs we’ll describe are specific to RStudio, but similar bugs can occur with any server that parses the X-Forwarded headers. While I’ll be talking from the perspective of an engineer focused on Kubeflow, this could happen in any platform with multiple proxies.

Readers will get a better understanding of the X-Forwarded headers, how to debug them, and how to successfully use RStudio behind multiple proxies, in this case in a production Kubeflow cluster.

The problem

We initially noticed that suddenly we couldn’t access our RStudio servers in our cluster, with cryptic errors like the following:

The first question that came up while seeing this was, what kind of URL is this? After looking a little bit at the dev-tools in the browser we saw the following 302, which was the beginning of our debugging journey.

From the above, we immediately see that the Location header has a weird value: it contains two host values separated by a comma! This means that RStudio is picking up some piece of host-related information from the request and using it to build the redirect it sends to the client.

At this point I was scratching my head a little bit. Why would RStudio suddenly decide to give me such a redirect in my fresh new cluster? But of course, in the cloud world, nothing happens “suddenly”. The only remotely relevant mechanism I knew that could affect the generated host was the X-Forwarded-Host header. So at least there was a next step to check, even if it was a hunch.

So, I rushed and created an echoserver in my namespace just to check the headers that end up in the workloads. And voila!
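For readers who want to try something similar, here is a minimal sketch of such a header-echoing server, using only the Python standard library. It is not the echoserver image used in the cluster; the hostnames in the example request are hypothetical, and the server is started on a free local port purely for demonstration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeadersHandler(BaseHTTPRequestHandler):
    """Reply to any GET with the request headers serialized as JSON."""

    def do_GET(self):
        # Lowercase the names so lookups do not depend on header casing.
        headers = {name.lower(): value for name, value in self.headers.items()}
        body = json.dumps(headers).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the demo quiet; the default implementation logs to stderr.
        pass


def echo_once(extra_headers):
    """Start the server on a free local port, send one GET, return the echoed headers."""
    server = HTTPServer(("127.0.0.1", 0), EchoHeadersHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request("GET", "/", headers=extra_headers)
        echoed_headers = json.loads(conn.getresponse().read())
        conn.close()
        return echoed_headers
    finally:
        server.shutdown()
        server.server_close()


# Hypothetical hostnames, standing in for what a chain of proxies might send.
echoed = echo_once({"X-Forwarded-Host": "kubeflow.example.com, internal.example.svc"})
print(echoed["x-forwarded-host"])
```

Deployed behind the same proxies as the real workload, a server like this shows exactly which headers reach the Pod, which is often the fastest way to confirm or reject a hunch about header handling.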

Then I sent requests to my RStudio Pod from another Pod in the cluster, playing around with the X-Forwarded-Host header. Indeed, when I left the header unset, or set it to a single value rather than a list, the URL in the redirect was correct.

So this means 2 things:

  1. There are intermediate proxies between my browser and the RStudio Pod that handle the X-Forwarded headers.
  2. RStudio fails to properly handle X-Forwarded-Host if it contains multiple values.
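To make the second point concrete, here is a small sketch of how a server that trusts X-Forwarded-Host verbatim could end up producing a redirect with two hosts in it. This is a guess at a plausible mechanism, not RStudio's actual code; the function name and the hostnames are hypothetical.

```python
def naive_redirect_location(headers, path):
    """Build an absolute redirect URL by trusting X-Forwarded-Host verbatim.

    Hypothetical server logic for illustration only; not RStudio's code.
    """
    host = headers.get("X-Forwarded-Host", headers.get("Host", "localhost"))
    proto = headers.get("X-Forwarded-Proto", "http")
    return f"{proto}://{host}{path}"


# One proxy in front of the server: the header holds a single host.
single = {"X-Forwarded-Host": "kubeflow.example.com", "X-Forwarded-Proto": "https"}

# Several proxies, each appending the host it saw: the header holds a list.
chained = {
    "X-Forwarded-Host": "kubeflow.example.com,internal.example.svc",
    "X-Forwarded-Proto": "https",
}

print(naive_redirect_location(single, "/auth-sign-in"))
# https://kubeflow.example.com/auth-sign-in
print(naive_redirect_location(chained, "/auth-sign-in"))
# https://kubeflow.example.com,internal.example.svc/auth-sign-in
```

The second URL has the same shape as the one we saw in the browser: the whole comma-separated list is treated as if it were a single host.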

Intermediate proxies? X-Forwarded headers?

Before diving into the next steps of the journey we bumped into with RStudio, let’s first expose some more information regarding the nature of the X-Forwarded headers.

In the cloud, and in Kubernetes in particular, all of our applications are deployed behind proxies. No Pods are exposed directly to the outside world. But this means that the final Pods never get direct access to information such as:

  • The IP of the client that made the request
  • The host the client used when making the request
  • The protocol the client used when making the request

This information is most commonly needed when an app has to generate location-dependent content or links. An example is a 302 response that redirects users to authenticate. If the app doesn’t use a relative path in that redirect, it needs to know the host that the client used to reach the server.

To mitigate this, there is a set of non-standard headers that aim to preserve this information when a request is forwarded by proxies. These are the X-Forwarded headers.

While these headers are the de facto standard for relaying this information, it’s important to note that they are not part of any current specification. The standardized version is the Forwarded header, defined in RFC 7239.

RStudio is one of the apps that rely on these headers to get that client information when it is exposed behind proxies.

Where’s the catch?

So at this point we’ve identified:

  1. The information that gets lost when there are proxies between the client and the server
  2. Why this information could be useful to the server
  3. How to successfully pass this information to the server

With this we took a look at RStudio and saw that it has support for running behind proxies. So if we have all the pieces, why did RStudio fail to reconstruct the correct URL?

The catch is that these X-Forwarded headers are currently not part of a standard, which means that people can deviate a little bit in how they use them. The most common deviation, which bit us here as well, is for X-Forwarded-Host to contain a list of hosts rather than only the original host requested by the client.

The most common use case for this is being able to trace the chain of hosts used through routing. For example, an edge proxy might use a different internal host when routing the request inside the internal infrastructure.
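As a rough illustration of how such a list builds up, the sketch below simulates two proxy hops that each append the host and protocol they saw. The `forward_through_proxy` helper and the hostnames are hypothetical, and real proxies differ in whether they append, overwrite, or leave these headers alone.

```python
def forward_through_proxy(headers, seen_host, seen_proto):
    """Return a copy of the headers after one proxy hop.

    The simulated proxy appends the host and protocol it saw on the
    incoming request instead of preserving only the original values.
    """
    updated = dict(headers)
    appended = (("X-Forwarded-Host", seen_host), ("X-Forwarded-Proto", seen_proto))
    for name, value in appended:
        previous = updated.get(name)
        updated[name] = f"{previous},{value}" if previous else value
    return updated


headers = {}
# Edge proxy: sees the public host over HTTPS.
headers = forward_through_proxy(headers, "kubeflow.example.com", "https")
# Internal proxy: sees the internal host over plain HTTP.
headers = forward_through_proxy(headers, "internal.example.svc", "http")

print(headers["X-Forwarded-Host"])   # kubeflow.example.com,internal.example.svc
print(headers["X-Forwarded-Proto"])  # https,http
```

By the time the request reaches the workload, both headers carry a trail of values, and the workload has to decide which one to use.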

And this is what triggered the first bug with RStudio. I would guess that the RStudio devs expected the X-Forwarded-Host header to contain only a single value, which is very fair considering the common understanding around these non-standard headers.

Luckily there was an existing issue for this exact problem:
https://github.com/rstudio/rstudio/issues/10965

The final obstacle

At this point I wanted to verify my understanding: if I manually sent a request with a correct X-Forwarded-Host, everything should work as expected, and I could respond to my ticket in a timely manner. But deadlines exist to be missed.

So I spun up an RStudio instance in a container and hit it with the following request:

To my surprise, this returned a 302 with Location: /auth-sign-in?appUri=%2F, whereas I was expecting a URL with the correct prefix:

Location: https://localhost:8787/rstudio/kubeflow-user/kimwnasptd/auth-sign-in?appUri=%2F

At this point, being used to the nature of the X-Forwarded headers, I tried setting a single value for X-Forwarded-Proto, and it worked. So it’s the same story as with X-Forwarded-Host: the “common” understanding of these non-standard headers is that they only carry the original value used by the client. In our case, however, the proxies were appending each protocol as well, for tracing purposes, just like with X-Forwarded-Host.
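A server can defend against both variations by using only the first entry of each header when it reconstructs the external URL. The sketch below shows one way to do that; it is an assumption about a reasonable approach, not the fix that was applied in RStudio, and the helper names and hostnames are hypothetical.

```python
def first_forwarded_value(raw, default):
    """Return the first entry of a comma-separated header value.

    Falls back to `default` when the header is missing or its first
    entry is empty.
    """
    if not raw:
        return default
    first = raw.split(",")[0].strip()
    return first or default


def external_base_url(headers):
    """Reconstruct the scheme and host the client used, tolerating lists."""
    host = first_forwarded_value(
        headers.get("X-Forwarded-Host"), headers.get("Host", "localhost")
    )
    proto = first_forwarded_value(headers.get("X-Forwarded-Proto"), "http")
    return f"{proto}://{host}"


chained = {
    "Host": "rstudio.internal:8787",
    "X-Forwarded-Host": "kubeflow.example.com, internal.example.svc",
    "X-Forwarded-Proto": "https, http",
}

print(external_base_url(chained))                      # https://kubeflow.example.com
print(external_base_url({"Host": "localhost:8787"}))   # http://localhost:8787
```

Taking the first entry assumes that the edge proxy is trusted to set or sanitize these headers. If clients can inject their own values, the first entry is whatever the client sent, so which entry to trust depends on the deployment.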

So another issue:
https://github.com/rstudio/rstudio/issues/11010

The verdict

Thankfully the RStudio community was very responsive and fixed both of the issues for the 2 headers. We’ve also updated the RStudio images in Kubeflow 1.7 with the above fixes: https://github.com/kubeflow/kubeflow/pull/6890.

The goal of this post was mostly to expose readers to the world of proxying, the information that can be lost between proxies, and the tools we have to preserve this information, in this case the non-standard X-Forwarded headers.

One more lesson is that if a feature is not based on a well-defined standard, then it’s bound to be used in unexpected ways. Unfortunately, the X-Forwarded headers are such a case.

So hopefully at this point you’ll have a better understanding of when and how these headers could be used, as well as things to look at in case you bump into weird URLs returned by a server in the cloud native world.

About the author

Kimonas is a Software Engineer at Arrikto, working on storage solutions on the cloud. He loves Open Source and has been a core contributor to the Kubeflow project for more than a year. Kimonas is the owner of the platform's Jupyter infrastructure and his main goal is to improve the way users manage the lifecycle of their ML tools, like Notebooks, and data on top of Kubeflow. He is also a mentor at the Kubeflow project at Google Summer of Code 2020 providing guidance for adding seamless support for launching Tensorboard instances.

About Kubeflow

Kubeflow is an open source, cloud-native MLOps platform originally developed by Google that aims to provide all the tooling that both data scientists and machine learning engineers need to run workflows in production. Features include model development, training, serving, AutoML, monitoring and artifact management. 

Kubeflow is the open source machine learning toolkit for Kubernetes.

About Arrikto

We are a Machine Learning platform powered by Kubeflow and built for Data Scientists. We make Kubeflow easy to adopt, deploy and use, having made significant contributions since the 0.4 release and continuing to contribute across multiple areas of the project and community. Our projects/products include:

  • Enterprise Kubeflow (EKF) is a complete MLOps platform that reduces costs, while accelerating the delivery of scalable models from laptop to production.
  • Kubeflow as a Service is the easiest way to get started with Kubeflow in minutes! It comes with a Free 7-day trial (no credit card required).
  • Rok is a data management solution for Kubeflow. Rok’s built-in Kubeflow integration simplifies operations and increases performance, while enabling data versioning, packaging, and secure sharing across teams and cloud boundaries.
  • Kale, an open source workflow tool for Kubeflow, which orchestrates all of Kubeflow’s components seamlessly.

 
