What Does a Data Scientist in Healthcare Do?

As awareness grows about the threat posed by the latest Ebola outbreak, the world is also anticipating the arrival of another such crisis. Concerns about such pandemics are likely to keep growing as the rate of infection in Africa continues to accelerate and as research on the virus improves. Data scientists are now playing a central role in the field, using the tools of data science to estimate the disease’s spread, monitor the effects of treatment, and evaluate measures to minimize its spread.

Much like their counterparts on Wall Street, who use historical data to make better decisions about the use of capital, healthcare experts use data science to better understand the way infectious diseases spread. This information has potentially enormous effects on the lives of people in their communities. The ability to communicate this information to health systems is the key to successful prevention efforts, and a doctor’s first impulse is to use the latest data science to assist with this process.

And though it might seem simple and familiar to those in the tech sector, it is anything but. The NHS has put together an online course called PICUTECH, taught by a leading computer science lecturer, to train healthcare workers in using data science tools such as BigQuery and Tableau. It also holds an annual Data Engineering Day and a Data Science for Healthcare Summit, which aim to bring together different groups of healthcare organisations and practitioners.

Data scientists are often presented as the saviours of the data that they use for these tasks. When their work does not directly assist with patient care, though, they are more often a burden than a resource. There is no denying the massive public interest in this particular pandemic. With so many unanswered questions, and so many resources available from various agencies, healthcare workers are under great pressure. Their minds are filling with ideas not only about how they can help, but also about how they can stop the outbreak. A little over two weeks after the first case was reported on 30 March, at least 50 scientists from several UK, US, German, French and Israeli universities flew to Uganda to help: they used BigQuery to try to understand how the virus spread.

They were able to deduce where the cases were located on the map, which gave them more confidence in using cellphones to track the cases. They also found out where the disease had spread from, which enabled the scientists to plan the deployment of mobile lab testing devices. All of this was done within just a few days. Given the scale of the Ebola outbreak, and the unexpected scale of the response and analysis, it is hard to overstate how important this data is to the pandemic response. As the initial story published by the BBC yesterday points out, the US National Science Foundation gave PICUTECH $2m for the next academic year, allowing the group to expand to include additional researchers.

PICUTECH already has a series of other partnerships, including a scholarship scheme offering £100,000 to fund UK PhD students who research the data and models used. While there is a serious lack of people working in this space in the UK, there is a strong movement to fill this gap. A number of companies and organisations, such as Fintech accelerator Chain, are also placing ads in the Cambridge Technology Review and setting up the Data Science for Healthcare Summit, where they hope to raise an additional $25,000 in funding. This crisis is still ongoing, with patients’ lives in danger.

All the data from around the world points to the importance of reducing costs and improving safety in the U.S. healthcare system. But there was one data point in particular that stuck out. The COVID-19 pandemic actually involved not just one disease, but four. The four were caused by different strains of the same virus from different parts of Asia, so they were impossible to predict individually. Yet the U.S. government diagnosed COVID-19 using real-time data from two separate hospitals, applying a cocktail of early detection technologies to identify infections before they reached critical levels and within the given window of time. What happened? First of all, from the first day of the pandemic until COVID-19 went into effect in September, only one infection was confirmed and five were detected in the U.S., resulting in no deaths.

Second, COVID-19 cases were found in only one hospital in the U.S., leading to an overall lower number of infections in the entire U.S. network of 50 hospitals. The researchers note that the identification of COVID-19 in only one hospital produced results faster and allowed officials to act sooner after it was detected. The reason the pandemic became COVID-19 was that the first symptoms of the outbreak were spread throughout the entire network, whereas COVID-19 itself was detected in only one hospital. This led to a greater impact of the disease, a faster end to the outbreak, and a faster assessment of how to prevent future outbreaks. But was that a coincidence? Yes, but there’s more to the story than that. Read on to learn more.

At a hospital level, patient data is critical to identifying and treating diseases and saving lives. In the case of a pandemic outbreak like COVID-19, the patient data is where the magic happens. Being able to see which infections were seen, which were not seen, what was missed in screening, and what might have gone wrong during screening gives a hospital important information for identifying and treating patients. Though there are some notable examples of healthcare data being used in public health initiatives, most of the data collected is on patients who are outside the healthcare system, out of researchers’ sight, and beyond the potential for manipulation.

But with a pandemic, that’s where data scientists have the opportunity to add value. It’s often thought that the study of medicine doesn’t work well for two reasons. First, most doctors and scientists don’t have access to the patient data required to conduct real-time clinical studies. Second, the data is usually filtered, stripped of patients, and not aggregated for individual-level analysis. This is known as raw, unanalyzed data, which often takes a fair amount of time to process. By using the patient data to their advantage, the COVID-19 researchers were able to develop a “multi-dimensional protocol” to assess how well U.S. hospitals were treating and identifying COVID-19 patients using real-time patient data. That protocol came to be called “BOSS-ON,” meaning that it was designed to start with a patient at the beginning of the pandemic (prevalence of COVID-19, infection rate, and risk of mortality) and increase the risk as it progressed (infection rates decreased, mortality increased). This is important