Denoising CT Scans with Kubeflow, Apache Spark, and Apache Mahout – Part 2

Part 2 (of 3): The Plot Thickens

Welcome to the second installment in the series all about denoising CT scans with open source friends. In this post, we’ll explore our problem a bit more, and explain why we didn’t just use rapid tests in the first place.

In the spring of 2020, when we first started the research that is now the subject of this series, COVID-19 was a fresh cut, and researchers everywhere were throwing everything they could AND the kitchen sink at it. In the earliest days of COVID-19, the big problem was how to rapidly detect whether someone had it or not. I mention this because as you read the rest of this post you might think, “this is a lot of work, why not just use a rapid test?”

Well, the point of this post is to show our motivation for setting out on this adventure. If you’re really pressed for time and just want to know how to get Apache Spark working in Kubeflow, you can skip this post (and part one where we introduced our open source heroes) and go straight to part three (coming soon). 


Going Back in Time…The Need for Rapid Tests

Let’s do a thought exercise. Close your eyes (well, you have to keep reading, but use your imagination).

You’re in a very large and very empty parking lot. You see yourself walking up to a 1982 DMC DeLorean. You walk to the driver’s side and open the gull wing. You sit in the driver’s seat and punch into the LCD display: March 28, 2020. 

You put the DeLorean in drive and start driving. You slowly accelerate…very slowly…hopefully there is enough parking lot…. Once you hit 88 miles per hour (that’s about 141 kph) you travel through time, and next thing you know you’re driving in an almost identical parking lot, just as large, just as empty, but now you’re in the year 2020! You’ve traveled about 22 months back in time to late March 2020. 

*This is a dig at the DMC DeLorean for being infamously underpowered.

Fun…but not really. It’s March of 2020 and people are dying.


Src: CNN

In Spain, Italy, and New York City, hospitals are being overrun with suspected COVID patients. 

Note the date here: it’s hard to read, but it says March 28, 2020. 

This person is wearing plastic wraps (garbage bags?) as a cootie shield. It looks silly now, but at the time it was scary. In March of 2020, no one knew what was going on, and people were scared. If you weren’t having a meltdown, you knew someone who was.

Src: NPR

Not everyone had COVID, and it took a long time to figure out who did and who didn’t, as shown in this piece from NPR, also from March 28th.

Specifically, PCR tests took three days to return results (about the same turnaround you get today with a non-rapid test).


Src: ABCNews

People were trying to come up with rapid tests, but as we see here from March 26th, detecting only 60% of true positives was considered “promising” at the time.

Note: these weren’t even widely available yet—they were more like “coming soon” teasers. 

This compounded the problem of overrun hospitals, because not everyone who goes to the hospital thinking they have COVID actually has it. People had to sit there for three days waiting for test results, possibly catching COVID in the meantime.

This Could Work…CT Scans for Rapid Diagnosis

Here we see an article from March 16th, which concisely illustrates a point: creativity in finding new ways to rapidly detect COVID with equipment already available at the hospital was at a premium. 

This article notes that CT scans are the best way to see lung damage, but aren’t always available in emergency rooms, and recommends a way to use ultrasounds in their place. 

It closes by noting that the American College of Radiology (ACR) had recently recommended using CT scans only in advanced symptomatic cases of COVID.


Less Radiation Please

But CT scans have issues…

According to the ACR, they actually have several issues, such as the need to clean machines between patients, but the one we’re going to focus on is the radiation dose you receive when you get one.

A thoracic CT scan—thoracic being your chest region—gives you around 6 to 7 millisieverts (mSv) of radiation. Here is a chart to put that number in perspective. 

So 6 to 7 mSv of radiation exposure for a thoracic CT scan is not horrible, but it’s also really high for a diagnostic procedure. Another metric to put this into perspective: in your entire life you’re only supposed to have a maximum of 400 mSv. CT scans have also been used for a while to diagnose lung cancer, but full-dose CT scans were considered too much radiation for diagnostics in that realm as well.

So in and around the late 90s/early 00s a lower-dose method of CT scanning was developed. An entire lecture could be given on the differences between regular CT scans and “low-dose” CT scans, but for our purposes, allow me to hand-wave a bit here:

All CT scans are in essence a series of x-rays that are compiled into a ‘stack’ of 2D images, which can be rendered in three dimensions. 

Low-dose CT scans use x-rays with SIGNIFICANTLY less radiation, but the resultant scans are “noisier”. Think of noise here as static on a TV with no input. Did you ever try watching a channel when you were a kid that your parents didn’t pay for (or if you had an antenna, a channel that was just a bit too far away)? You could squint your eyes and kinda get it, but it wasn’t great.

Src: The Art Gallery of Trevor Grant

Since the advent of (really, in parallel with) low-dose CT scans, people have been coming up with ways to ‘denoise’ them. The main approach is somewhat similar to Principal Component Analysis (PCA, for those who learned statistics from the sklearn docs). The thing to remember is that even in low-dose CT scans, the dataset is large and cubic (not square).

For instance 300×500 is not great resolution, but not horrible; figure another 500 “slices” and you’ve got a dense matrix that has 75 million data points.
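To make that arithmetic concrete, here is a minimal Python sketch. The 300×500×500 shape comes from the example above; the 8-byte float64 element size is an assumption (it is NumPy's default float type), and the helper name volume_footprint is hypothetical.

```python
import numpy as np

def volume_footprint(rows, cols, slices, dtype=np.float64):
    """Return (number of voxels, size in bytes) for a dense CT volume."""
    n_voxels = rows * cols * slices
    # itemsize is the number of bytes per element for the chosen dtype
    n_bytes = n_voxels * np.dtype(dtype).itemsize
    return n_voxels, n_bytes

n_voxels, n_bytes = volume_footprint(300, 500, 500)
print(f"{n_voxels:,} data points")            # 75,000,000 data points
print(f"{n_bytes / 1e6:,.0f} MB as float64")  # 600 MB as float64
```

Under that assumption the raw volume is only about 600 MB, which suggests the real cost lies in decomposing the matrix rather than in storing it.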

PCA relies on singular value decomposition, or SVD (the matrix factorization step that does the heavy lifting in PCA).
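For a feel of how an SVD can strip noise, here is a toy sketch using NumPy's np.linalg.svd. It is plain truncated SVD on a small synthetic matrix, not the method used later in this series and not real CT data; the sizes, the noise level, and the choice of how many singular values to keep are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one "slice": a rank-2 image plus Gaussian noise.
# (Toy data for illustration only; not real CT data.)
rows, cols, rank = 60, 80, 2
clean = rng.normal(size=(rows, rank)) @ rng.normal(size=(rank, cols))
noisy = clean + 0.5 * rng.normal(size=(rows, cols))

# Thin SVD: noisy = U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)

# Keep only the k largest singular values and rebuild the image.
k = rank
denoised = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Compare each version against the known clean image (Frobenius norm).
err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
print(f"error before: {err_noisy:.2f}, after: {err_denoised:.2f}")
```

The small singular values mostly carry noise, so dropping them moves the reconstruction closer to the clean image. In this toy case we know the true rank; with real scans, choosing k is a judgment call.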

I knew I couldn’t do it locally, but I tried it just for fun with NumPy—and it threw a warning saying I would need 500GB of RAM.

You can get computers now with 500GB of RAM, but they’re pricey—even to rent in the cloud. But we are wizards of open source, so I’ll leave that as the teaser for the next post.

Landing the Plane

So to recap the blog post series up to this point:

  1. In March 2020 there was a large need for COVID rapid tests that could utilize existing hospital equipment (as opposed to having to be produced and distributed). 
  2. CT scans were known to be a good tool BUT among other things, they deliver high doses of radiation. 
  3. Low-dose CT scans, a technique that had been around for about 20 years, solved the radiation problem but produced “noisy” images.
  4. Through clever use of open source software and renting cloud time, we deliver a quickly and easily reproducible method for denoising low-dose CT scans. (The details of which we’ll get into in the next post). 

So we didn’t exactly land a plane, but we did (I hope) a pretty thorough job of setting up our problem. It may not seem like such a problem now, but at the time it was, and it still makes a compelling case for Kubeflow.


