Saturday, February 23, 2013

Using a Recursive CTE for Logic in VAE Protocol



Until this week, all of the recursive CTE posts I had read used an employee hierarchy example. Then my colleague pointed me to the best technical blog post I have ever read where Brad Schulz uses a sales running total example in his blog post entitled,  “This Article On Recursion Is Entitled “This Article On Recursion Is Entitled “This Article… ∞” ” Even if you are not interested in SQL you should check out this article for the great story, writing style, examples or recursion, and links to Wikipedia pages about things like fractals and mobius strips.

My blog post is about using a recursive CTE in SQL Server for part of the logic in the Center for Disease Control’s (CDC) Ventilator Associated Event (VAE) Protocol

Imagine you have a temp table called #vent_days where each row represents a patient and a day that they were on a mechanical ventilator. You have already identified which vent days meet the criteria for a VAE day of event except you haven’t yet taken into account the rule: when a VAE day of event occurs, another VAE day of event cannot occur for 14 days. You have filtered the temp table down to just those patient visits that have a VAE. Your temp table has the columns represented by the following SELECT statement.
 
SELECT
  
patient_visit_id
  
, vent_day_date      /*date of mechanical ventilation*/
  
, vent_day           /*the number of days a patient is on a mechanical ventilator as of the vent_day_date*/
  
, pos_VAE_doe_flg    /*possible VAE day of event flag. 1 indicates that this day meets the
                       criteria for a VAE day of event except the rule that a VAE day of event may not occur until 14 days after a previous VAE day of event*/
  
, pos_VAE_doe_order /*only populated where pos_VAE_doe_flg is 1. The ordered number of possible VAE days of event per patient visit.*/ 

FROM
  
#vent_days
 

Below is how you can use a recursive CTE to complete the logic by fulfilling the criteria that a VAE day of event must not occur until 14 days have passed since a previous VAE day of event.

; 
WITH VAE_doe (patient_visit_id, vent_day_date, vent_day, pos_VAE_doe_order, prev_doe, doe_flg) 
AS (
  
/*  the anchor query is the first possible VAE day of event for each patient visit. doe_flag is set to 1 because it will always be an actual day of event.*/
  
SELECT
      
v.patient_visit_id
      
, v.vent_day_date
      
, v.vent_day
      
, v.pos_VAE_doe_order
      
, 0 AS prev_doe
      
, 1 AS doe_flag
  
FROM
      
#vent_days AS v
  
WHERE
      
v.pos_VAE_doe_order = 1
      
AND v.pos_VAE_doe_flg = 1
  
UNION ALL
  
/*  second query recurses across the anchor query*/
  
SELECT
      
v.patient_visit_id
      
, v.vent_day_date
      
, v.vent_day
      
, v.pos_VAE_doe_order
      
/*  when the previous VAE date of event occurs less than 14 days early
           then do not change prev_doe, otherwise set prev_doe equal to vent_day
           and assign a 1 to doe_flag. */
      
, CASE   WHEN v.vent_day - VAE_doe.doe_vent_day >= 14
              
THEN VAE_doe.vent_day ELSE VAE_doe.prev_doe END AS prev_doe
      
, CASE   WHEN v.vent_day - VAE_doe.doe_vent_day >= 14
              
THEN 1 ELSE 0 END AS doe_flag
  
FROM
      
#vent_days AS v
  
INNER JOIN
      
VAE_doe
      
ON v.patient_visit_id = VAE_doe.patient_visit_id
      
AND v.pos_VAE_doe_order = VAE_doe.pos_VAE_doe_order + 1 /*for each visit recurse by order of possible VAE days of envet*/
  
WHERE
      
v.pos_VAE_doe_flg = 1
  

SELECT
  
v.patient_visit_id
  
, v.vent_day_date
  
, v.vent_day
  
, COALESCE(VAE_doe.doe_flg,0) AS doe_flg 

FROM
  
#vent_days AS v 

LEFT OUTER JOIN
  
VAE_doe
  
ON v.patient_visit_id = VAE_doe.patient_visit_id
  
AND v.vent_day = VAE_doe.vent_day 

ORDER BY
  
v.patient_visit_id
  
, v.vent_day 

OPTION(MAXRECURSION 25);

If I had read Brad Schulz’s post before coming up with this I would have used a WHILE loop. Recursive CTEs do not use set-based processing but process everything row-by-row using a Stack Spool. I don’t know if I will ever use a recursive CTE again. 

Besides a possible argument for convenience, does a legitimate reason for using a recursive CTE in SQL Server exist?

Thursday, February 21, 2013

Using Regular Expressions to Clean Data



Regular expressions are a very useful tool for any data professional. It is often the case in a data analytics project that the vast majority of the work is preparing the data. Among other things, Regular Expressions allow for advanced logic in filter, find, or find and replace criteria.

For example, the following regular expression is intended to only match a valid Medicare HIC number.

I found this on Regular Expression Library (regexlib.com).

 (?![A-z](\d)\1{5,})(^[A-z]{1,3}(\d{6}|\d{9})$)|(^\d{9}[A-z][0-9|A-z]?$)  

This expression says, "Except in the case when it is one letter followed by the same digit 6 or more times in a row, match on one to three letters followed by six or nine digits. Alternatively, match on 9 digits followed by a letter then a digit or a letter."

More robust validation rules could be applied. For example, there is a more specific subset of valid suffixes than a letter followed by a digit or a letter. Also, I used this in a RegexMatch function in SQL Server and found it slow. Getting more precise and using set based logic in SQL would be more effective and possibly more efficient; however, that would take time away from other work. By doing a quick Google search to find a regular expression a task like validating a HIC number can be completed in minutes.

Tuesday, February 19, 2013

Architects and Scientists



Is the “Architect” in “Analytics Architect”, “Infrastructure Architect” or “Data Architect” analogues to the “Scientist” in “Data Scientist”? Clearly the answer is no. 

“Architect” in a technical job title generally implies that responsibilities include structural design of the information system, like a traditional Architect’s responsibilities include structural design of a building or other physical structure. In these technical job titles the word “Architect” is not intended to be interpreted as a traditional Architect. Traditional Architects’ opinions on this seem to range from insult to flattery. See this article titled I'm an Architect by Amanda Kolson Hurley at architectmagazine.com.

 The “Scientist” in “Data Scientist” is intended to say that the person is an actual scientist. As I discussed in my first post to this blog, a data scientist generally does not publish their work, an important part of the scientific method; therefore, I am not convinced that the work they do is actually science and so I do not believe that they are actual scientists.

In conclusion, an Analytics Architect is not an actual architect and a Data Scientist is not an actual scientist. The title “Analytics Architect” is not intended to mean that the person is an actual architect. The title “Data Scientist” is intended to mean that the person is an actual scientist. Stopping there is not entirely fair because while an Analytics Architect’s work barely resembles the work of an actual architect, a Data Scientist’s work comes pretty close to being actual science. 

For my next post I will cover a technical topic.

Monday, February 18, 2013

What is a Data Scientist?



According to the Harvard Business Review’s October 2012 article titled, Data Scientist: The Sexiest Job of the 21st Century, a data Scientist is:

A high-ranking professional with the training and curiosity to make discoveries in the world of big data.

From my understanding, these are the skills a data scientist must have: expertise in data technology, statistics and probability, and business. A data scientist must also be a relentless problem solver driven by an entrepreneur-like passion to improve the world by understanding it.

I listened to some podcasts recently that seemed to give somewhat varying emphasis within their definitions.

Randy Zwitch, Sr. Data Scientist at Keystone Solution described the continuum between an Analyst and a Data Scientist on the company’s podcast, Podcast #2 – R U Serious?. In his explanation, Mr. Zwitch emphasizes the Data Scientist’s knowledge of technology and problem solving abilities.

In the episode titled Rebuffing "Buyer Beware"Attitude on Data Science from IBM’s The Big Data Hub podcast, the data scientist David Smith, vice president of Marketing and Community at Revolution Analytics, emphasizes statistics.

David Smith does talk about the need to know the statistical platform and language R; however, Randy Zwitch expands the technical skill requirements.

Sunday, February 17, 2013

What is a Scientist? Is a Data Scientist a true Scientist?

The past week I was thinking and reading about the definition of a scientist. The title “scientist” is not like “attorney" or “police officer” which have clear legal definitions. Therefore, anyone can really apply their own definition. For example: Christian Scientists. Another example might be someone who called themselves a homeopathic scientist. I, for one, would not consider either of these examples scientists but there is nothing stopping the word “scientist” from being used in these ways. That being said, I want to explore the working definition(s) of “scientist” that most scientifically literate people should be able to agree on.

The online dictionary entries that I found all essentially defined a scientist as a person who is an expert in or is studying one of the natural or physical sciences. This definition makes me uncomfortable because it doesn’t necessitate that the person be doing science on a regular basis. Here is a good discussion related to this subject http://www.physicsforums.com/showthread.php?t=515662.

Wikipedia basically breaks it down into two definitions:

Broad definition: person engaging in a systematic activity to acquire knowledge.

Narrow definition: person who uses the scientific method.

I have trouble with the broad definition simply because if a person is not using the scientific method, are they really doing science? At this point, I like the following definition:

Scientist: A person who does science on a daily basis.

The question then becomes: what does it mean to be doing science? The answer has to be: following the scientific method. Therefore, I have to conclude that Wikipedia’s narrower definition is the correct definition of a scientist.

I was thinking about the definition of a scientist because I was trying to answer the question: Is a data scientist a true scientist? My next post will be on the definition of a data scientist. For now, I’ll just say: no, a data scientist is not a real scientist. Please comment if you don't agree but a data scientist generally does not practice full disclosure, an important part of the scientific method. Full disclosure being the sharing of data and methodology. This is because a data scientist's work is usually corporate intellectual property.