Data and Science

Saturday, June 1, 2013

What’s So Natural About Natural Log?

natural log of x ln(x) log_e(x)

What is a logarithm?

Take the equation 10^x = 1,000,000. Solving for x in this equation is the same as solving for x in the following equation: log₁₀(1,000,000) = x. This can be stated as, “x is the logarithm of 1,000,000 to base 10.”

Logarithms changed the world because they reduced time-consuming multiplication and division of large numbers to table look-ups and addition or subtraction. This works because of the following identities:

log_b(xy) = log_b(x) + log_b(y)

and

log_b(x/y) = log_b(x) - log_b(y)

Now imagine you are an early 17th-century engineer spending a majority of your time doing tedious multiplication and division problems. Along comes John Napier with a bunch of look-up tables that give the logarithms of numbers to the base (1-1/10^7)^(10^7), which we can call b.

You have two large numbers, x and y, that you need to multiply. Thanks to John’s laborious pre-work, you can look up log_b(x) and log_b(y) in seconds. Add the two values, which is much faster than multiplying, then reverse the look-up on the sum, and you have xy.
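The look-up procedure can be sketched in Python, with math.log standing in for Napier's printed tables. The function name multiply_via_logs and the sample numbers are made up for illustration; only the base comes from the text above.

```python
import math

# Napier's effective base as given above: (1 - 1/10^7)^(10^7), roughly 1/e.
NAPIER_BASE = (1 - 1 / 10**7) ** (10**7)


def multiply_via_logs(x: float, y: float, base: float = NAPIER_BASE) -> float:
    """Multiply x and y with two log look-ups, one addition, and one reverse look-up."""
    log_x = math.log(x, base)   # look up log_b(x)
    log_y = math.log(y, base)   # look up log_b(y)
    log_xy = log_x + log_y      # the cheap addition step
    return base ** log_xy       # reverse look-up: b^(log_b(x) + log_b(y)) = xy


product = multiply_via_logs(31_415.0, 27_182.0)
print(product)  # close to 31415 * 27182, up to floating-point rounding
```

Any positive base other than 1 would work the same way; the base only changes the numbers printed in the table, not the procedure.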

It is hard to overstate the importance of this innovation. Consider the human accomplishments that relied on a version of this tool between Napier’s first tables in 1614 and the retirement of the slide rule in the 1970s: Johannes Kepler’s third law of planetary motion, commercial airliners, and everything in between.

I went off on a tangent about the history of logarithms when this article is supposed to be focused on natural log. Speaking of tangents, a tangent line is just what makes the natural log so natural. If you think about the graph of the two columns in your new look-up tables, that is y = log_b(x), it must pass through the point (1,0) because log_b(1) = 0 just as b⁰=1 for any base b. The slope of the tangent line to y = log_b(x) at (1,0) depends on the base b.

What base b do you think has a tangent line at (1,0) with a slope of 1?

Why, the answer is e, naturally.

Thank you, Leonhard Euler.
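A quick numerical check of the tangent-line claim, as a sketch rather than a proof: the helper slope_at_one (a name invented here) estimates the slope of y = log_b(x) at x = 1 with a central difference. The exact slope is 1/ln(b), which equals 1 only when b = e.

```python
import math


def slope_at_one(base: float, h: float = 1e-6) -> float:
    """Estimate the slope of y = log_b(x) at x = 1 with a central difference."""
    return (math.log(1 + h, base) - math.log(1 - h, base)) / (2 * h)


# Compare a few bases; only base e should give a slope of about 1.
for b in (2, math.e, 10):
    print(f"base {b:.5f}: slope ~ {slope_at_one(b):.6f}")
```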

For the sake of brevity, I’ve stated a lot without proof and simplified the history a bit. Here are some great links to learn more:

http://en.wikipedia.org/wiki/Logarithm#From_Napier_to_Euler

http://www.komal.hu/cikkek/2004-ang/e.e.shtml

http://www.mathsisfun.com/numbers/e-eulers-number.html

http://www-history.mcs.st-and.ac.uk/HistTopics/e.html

Monday, April 15, 2013

Logic and Probability Puzzle Response

This post is a response to a comment on my March 3rd post titled, Logic and Probability Puzzle.

JeffJo, thanks for the comment. Throughout this response I will refer to the person giving the puzzle as “the puzzler” and I will use the notation P(X) to mean the probability of event X.

These are the assumptions needed to get to your answer of ½.

Assumption 1: Each possible gender of a child is equally likely.
Assumption 2: There are two possible versions of this puzzle:

Boy version: “I have two children. One of them is a boy. What is the probability that I have two boys?”
Girl version: “I have two children. One of them is a girl. What is the probability that I have two girls?”

Assumption 3: Each possible version of the puzzle is equally likely if the puzzler is able to use either one because the puzzler has both a boy and a girl.

Using these assumptions, the following table shows the share of the universe of parents with two children for each child combination and puzzle version:

	Boy Version	Girl Version
Boy, Boy	0.25	0
Boy, Girl	0.125	0.125
Girl, Boy	0.125	0.125
Girl, Girl	0	0.25

The probability of A given B is defined as: P(A|B) = P(A ∩ B)/P(B). In other words, the probability of the intersection of A and B divided by the probability of B. In other other words, the probability of both A and B occurring divided by the probability of B occurring.

P(the puzzler has two boys) given that the puzzler asked the boy version of the puzzle is P(the puzzler has two boys and asked the boy version of the puzzle) divided by P(the puzzler asked the boy version of the puzzle). So, given the assumptions listed above, we have:

(0.25)/(0.25+0.125+0.125) = (0.25)/(0.50) = 0.50 = 50% = ½.

I can see your point that it is necessary to know why we are given the information in a puzzle like this; however, I cannot accept a couple of these assumptions.

1) Assumption 3 is not reasonable. How can we know that each possible version of the puzzle is equally likely when the puzzler is able to use either one? A plausible argument can be made that the puzzler would be more likely to ask the boy version because the boy version has been told more often in the past which could influence the puzzler. This blog doesn't get a lot of views but the fact that the Skeptic’s Guide to the Universe used this puzzle could plausibly increase the likelihood of the boy version being asked by a puzzler with both a boy and a girl.

Assumption 1, each possible gender of a child is equally likely, is a reasonable assumption. This is common knowledge and while it might not be exactly 50/50, it is close enough to assume so in the context of a logic puzzle. By contrast, the probability of a puzzler choosing the boy version over the girl version of the puzzle is not common knowledge and can only be determined by conducting a study. Such a study is likely outside the intended scope of a logic puzzle.

Let’s scratch Assumption 3 and rework the puzzle using P(bv) to mean the probability of the boy version being chosen when the puzzler is able to use either version. We then have:

	Boy Version	Girl Version
Boy, Boy	0.25	0
Boy, Girl	0.25 * P(bv)	0.25 * (1 - P(bv))
Girl, Boy	0.25 * P(bv)	0.25 * (1 - P(bv))
Girl, Girl	0	0.25

(0.25)/(0.25+(0.25*P(bv)) + (0.25*P(bv))) = (0.25)/(0.25 + 0.50*P(bv)) = 1/(1+2*P(bv)). That is as simple as we can make the answer.
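To see how the answer moves with P(bv), here is a small Python sketch that rebuilds the boy-version column of the table and applies the conditional probability definition. The function name and the use of exact fractions are my own choices for illustration.

```python
from fractions import Fraction


def p_two_boys_given_boy_version(p_bv: Fraction) -> Fraction:
    """P(two boys | boy version asked), where p_bv is the chance that a
    parent of one boy and one girl chooses the boy version."""
    quarter = Fraction(1, 4)
    # Boy-version column of the table, one entry per child combination.
    boy_version = {
        ("boy", "boy"): quarter,
        ("boy", "girl"): quarter * p_bv,
        ("girl", "boy"): quarter * p_bv,
        ("girl", "girl"): Fraction(0),
    }
    return boy_version[("boy", "boy")] / sum(boy_version.values())


print(p_two_boys_given_boy_version(Fraction(1, 2)))  # 1/2, i.e. Assumption 3
print(p_two_boys_given_boy_version(Fraction(1)))     # 1/3, boy version always chosen
```

Setting P(bv) to 1 recovers the 1/3 answer of the simple puzzle from the March 3rd post, and P(bv) = 1/2 recovers the ½ answer.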

2) Assumption 2 is not reasonable either. How can we assume that there are only two versions of the puzzle? What about, “I have two children. One of them is a boy. What is the probability that I have two girls?” or “I have two children. They are both boys. What is the probability that I have two boys?” While it might seem silly to ask a version of the puzzle with an obvious answer, there must be some chance that the puzzler would ask these versions too. This fact impacts the puzzler with two boys also. So let’s try again using P(parent of two boys will ask the boy version) = P(X) and P(parent of both a boy and a girl will ask the boy version) = P(Y). Then we have:

	Boy Version	Girl Version
Boy, Boy	0.25 * P(X)	0
Boy, Girl	0.25 * P(Y)	0.25 * (1 - P(Y))
Girl, Boy	0.25 * P(Y)	0.25 * (1 - P(Y))
Girl, Girl	0	0.25 * P(X)

(0.25*P(X))/((0.25*P(X))+(0.25*P(Y)) +(0.25*P(Y))) = (0.25*P(X))/((0.25*P(X))+(0.50*P(Y))) = P(X)/(P(X) + 2*P(Y)). That is the most simplified answer. With P(X) = 1 it reduces to the previous result, 1/(1 + 2*P(Y)).

My Conclusions:
1) Your argument is valid.
2) Your answer is incorrect because it uses unreasonable assumptions.
3) The result is a complicated answer that is not in-line with the intentions of the puzzle.
4) The puzzle should be restated to “Randomly select a person from the set of people who have two children at least one of which is a boy. What is the probability that that person has two boys?” The full version being, “Randomly select a person from the set of people who have two children at least one of which is a boy born on a Tuesday. What is the probability that the selected person has two boys?”

Thursday, March 7, 2013

The term “Big Data” will fade away, the title “Data Scientist” will lose value, but the importance of “Data Science” will continue to grow.

This post contains my predictions for “Big Data”, “Data Scientist” and “Data Science”. Rather than start each sentence with “in my opinion” or “my best guess is”, I will state everything that follows as if it were fact. Please argue with me if you disagree with any of it.

“Big Data” has been a buzzword recently and is a useful term right now when talking about advances in technologies and methods for handling and utilizing large data sets. Many businesses and other organizations are leveraging statistical analysis of large data sets for the first time. As data analytics advocates across the world test their influence while navigating budget committees and corner offices, they benefit from the memory-locking powers of talking points and buzzwords. Since I am a fan of the growth of data analytics, I support any tool that helps spread its prevalence, including buzzwords. Over time, however, as “Big Data Science” approaches the asymptote of ubiquity, the term “Big Data” will be less useful as a buzzword. Another problem for the term “Big Data” is that the size a data set must reach to qualify as “Big Data” is temporally bound, always growing as a function of time. Data storage hardware is analogous in this respect. We generally do not refer to a piece of data storage hardware as “big”. Instead, we specify its size when expressing its bigness. For these reasons the use of the term “Big Data” will decline.

The title “Data Scientist” is destined to meet a fate similar to that of the title “<fill-in-the-blank> Architect”. With no universally accepted definition, the line between an Analyst and a Data Scientist will blur. The term will succumb to market pressures from job seekers who will prefer a title that is perceived to have status and improve career prospects. There are already positions posted for entry-level “Jr. Data Scientists”.

“Data Science” itself is not going anywhere. As more and more data is available, understanding that data is increasingly necessary for organizations to succeed. Statisticians and data architects will be in increasing demand. People that can bridge those skills will be more valuable still.

Sunday, March 3, 2013

Logic and Probability Puzzle

I heard a good puzzle a few weeks ago on The Skeptics Guide to the Universe, a podcast about scientific skepticism. I strongly recommend this weekly podcast to anyone interested in learning the pitfalls of flawed reasoning while keeping up with the latest science news. Here is the puzzle:

“I have two children. One of them is a boy born on a Tuesday. What is the probability that I have two boys?”

Spoiler Alert! Stop now if you want to figure this out on your own.

The answer is thirteen twenty-sevenths. My gut reaction to the solution was to question how the day of the week that one child is born can influence the probability of the other child being a boy. That was incorrect reasoning because the information, “One of them is a boy born on a Tuesday”, is not information about one of the children. It is information about the pair.

In other words, it is not equivalent to say:

“I have two children. Child #1 is a boy born on a Tuesday. What is the probability that child #2 is a boy?”

The answer to this altered version is clearly 50%. The difference is that the original puzzle does not give certainty that either child in particular is a boy born on a Tuesday. Try thinking through the puzzle starting with the boy born on a Tuesday as a given. Then cycle through all the possible combinations of genders and days of the week for the other child. When you get to “boy born on a Tuesday” for the other child, you can no longer take “boy born on a Tuesday” as a given for the first child. This step is easy to miss when thinking through the puzzle this way. If you do miss it, then you are left with the non-equivalent altered puzzle and incorrectly conclude that the answer is 50%.

A better way to think through this is to start by splitting the universe into equally probable parts; the universe being the set of people with two children. For each child there are 14 possibilities (2 genders times 7 days of the week). For each of those possibilities there are 14 possibilities for the other child. That means there are 14 times 14 total possibilities, which equals 196 equally probable combinations. If we limit these 196 combinations to just those with at least one of the children being a boy born on a Tuesday, we are left with 27 equally probable combinations. Of those, 13 have two boys; therefore, the probability is 13/27.

If you change the puzzle to, "I have two children. One of them is a boy. What is the probability that I have two boys?", then the answer becomes 1/3. The universe of two-child families is [(boy,boy), (boy,girl), (girl,boy), (girl,girl)]. We learn that at least one child is a boy, which excludes (girl,girl). We are left with three equally probable combinations, one of which is two boys. 1/3.
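Both counts can be checked by brute force. The following Python sketch enumerates the 196 equally probable families and filters them with a condition on "at least one child"; the helper name p_two_boys and the weekday labels are illustrative choices.

```python
from fractions import Fraction
from itertools import product

GENDERS = ("boy", "girl")
WEEKDAYS = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Each child is one of 14 equally likely (gender, weekday) pairs,
# so a two-child family is one of 14 * 14 = 196 ordered pairs.
CHILDREN = list(product(GENDERS, WEEKDAYS))
FAMILIES = list(product(CHILDREN, repeat=2))


def p_two_boys(condition) -> Fraction:
    """P(two boys | at least one child satisfies `condition`)."""
    kept = [f for f in FAMILIES if any(condition(child) for child in f)]
    both_boys = [f for f in kept if all(child[0] == "boy" for child in f)]
    return Fraction(len(both_boys), len(kept))


print(p_two_boys(lambda c: c == ("boy", "Tue")))  # 13/27
print(p_two_boys(lambda c: c[0] == "boy"))        # 1/3
```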

If you want to read a robust debate and discussion of this puzzle, check out this Skeptics’ Guide to the Universe forums thread.

I'll finish with my T-SQL solution:

WITH child AS
   (
   SELECT g.gender, w.weekday_born
   FROM
       (VALUES('boy'),('girl')) AS g(gender)
   CROSS APPLY
       (VALUES('Sun'),('Mon'),('Tue'),('Wed'),('Thu'),('Fri'),('Sat')) 
       AS w(weekday_born)
   )

SELECT
   CAST(SUM(CASE 
       WHEN c1.gender = 'boy' AND c2.gender = 'boy' 
       THEN 1 ELSE 0 END) AS FLOAT)
   / COUNT(*) AS answer

FROM 
   child AS c1

CROSS JOIN
   child AS c2

WHERE 
   (c1.gender = 'boy' AND c1.weekday_born = 'Tue')
   OR
   (c2.gender = 'boy' AND c2.weekday_born = 'Tue')

/*
answer
----------------------
0.481481481481481
*/

Saturday, February 23, 2013

Using a Recursive CTE for Logic in VAE Protocol

Until this week, all of the recursive CTE posts I had read used an employee hierarchy example. Then my colleague pointed me to the best technical blog post I have ever read, in which Brad Schulz uses a sales running total example: “This Article On Recursion Is Entitled “This Article On Recursion Is Entitled “This Article… ∞” ” Even if you are not interested in SQL, you should check out this article for the great story, writing style, examples of recursion, and links to Wikipedia pages about things like fractals and Möbius strips.

My blog post is about using a recursive CTE in SQL Server for part of the logic in the Centers for Disease Control and Prevention’s (CDC) Ventilator-Associated Event (VAE) Protocol.

Imagine you have a temp table called #vent_days where each row represents a patient and a day that they were on a mechanical ventilator. You have already identified which vent days meet the criteria for a VAE day of event except you haven’t yet taken into account the rule: when a VAE day of event occurs, another VAE day of event cannot occur for 14 days. You have filtered the temp table down to just those patient visits that have a VAE. Your temp table has the columns represented by the following SELECT statement.


SELECT
   patient_visit_id 
   , vent_day_date      /*date of mechanical ventilation*/
   , vent_day           /*the number of days a patient is on a mechanical ventilator as of the vent_day_date*/
   , pos_VAE_doe_flg    /*possible VAE day of event flag. 1 indicates that this day meets the 
                       criteria for a VAE day of event except the rule that a VAE day of event may not occur until 14 days after a previous VAE day of event*/
   , pos_VAE_doe_order /*only populated where pos_VAE_doe_flg is 1. The ordered number of possible VAE days of event per patient visit.*/

FROM
   #vent_days

Below is how you can use a recursive CTE to complete the logic by fulfilling the criteria that a VAE day of event must not occur until 14 days have passed since a previous VAE day of event.

;
WITH VAE_doe (patient_visit_id, vent_day_date, vent_day, pos_VAE_doe_order, doe_vent_day, doe_flg)

AS (
   /*  the anchor query is the first possible VAE day of event for each patient visit. doe_flg is set to 1 because it will always be an actual day of event, so doe_vent_day starts as its own vent_day.*/
   SELECT
       v.patient_visit_id
       , v.vent_day_date
       , v.vent_day
       , v.pos_VAE_doe_order
       , v.vent_day AS doe_vent_day
       , 1 AS doe_flg
   FROM
       #vent_days AS v
   WHERE 
       v.pos_VAE_doe_order = 1
       AND v.pos_VAE_doe_flg = 1
   UNION ALL
   /*  the recursive query steps from each possible VAE day of event to the next one in the same patient visit*/
   SELECT
       v.patient_visit_id
       , v.vent_day_date
       , v.vent_day
       , v.pos_VAE_doe_order
       /*  doe_vent_day carries the vent_day of the most recent actual VAE day of event.
           when that day of event occurred less than 14 days earlier then do not change
           doe_vent_day, otherwise set doe_vent_day equal to the current vent_day
           and assign a 1 to doe_flg. */
       , CASE   WHEN v.vent_day - VAE_doe.doe_vent_day >= 14 
               THEN v.vent_day ELSE VAE_doe.doe_vent_day END AS doe_vent_day
       , CASE   WHEN v.vent_day - VAE_doe.doe_vent_day >= 14 
               THEN 1 ELSE 0 END AS doe_flg
   FROM
       #vent_days AS v
   INNER JOIN
       VAE_doe
       ON v.patient_visit_id = VAE_doe.patient_visit_id
       AND v.pos_VAE_doe_order = VAE_doe.pos_VAE_doe_order + 1 /*for each visit recurse by order of possible VAE days of event*/
   WHERE
       v.pos_VAE_doe_flg = 1
   )


SELECT
   v.patient_visit_id
   , v.vent_day_date
   , v.vent_day
   , COALESCE(VAE_doe.doe_flg,0) AS doe_flg

FROM
   #vent_days AS v

LEFT OUTER JOIN
   VAE_doe
   ON v.patient_visit_id = VAE_doe.patient_visit_id
   AND v.vent_day = VAE_doe.vent_day

ORDER BY
   v.patient_visit_id
   , v.vent_day

OPTION(MAXRECURSION 25);

If I had read Brad Schulz’s post before coming up with this I would have used a WHILE loop. Recursive CTEs do not use set-based processing but process everything row-by-row using a Stack Spool. I don’t know if I will ever use a recursive CTE again.
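For comparison, the 14-day rule the CTE is meant to apply reads naturally as a plain loop. This is a Python sketch of that row-by-row logic for a single patient visit, not the WHILE loop itself; the function name and the sample vent days are hypothetical, and the >= 14 threshold simply mirrors the comparison used in the CTE.

```python
def flag_days_of_event(possible_vent_days: list[int], gap: int = 14) -> dict[int, int]:
    """Map each possible VAE day of event to 1 (actual day of event) or 0.

    possible_vent_days holds the vent_day values for one patient visit
    where pos_VAE_doe_flg = 1.
    """
    flags: dict[int, int] = {}
    last_doe = None  # vent_day of the most recent actual day of event
    for vent_day in sorted(possible_vent_days):
        if last_doe is None or vent_day - last_doe >= gap:
            flags[vent_day] = 1   # first candidate, or far enough from the last event
            last_doe = vent_day
        else:
            flags[vent_day] = 0   # still inside the 14-day window
    return flags


# Hypothetical visit with possible days of event on vent days 3, 5, 16, 17 and 31.
print(flag_days_of_event([3, 5, 16, 17, 31]))
# {3: 1, 5: 0, 16: 0, 17: 1, 31: 1}
```

Note that day 16 is compared against day 3, the last actual day of event, and not against day 5, which is why the state has to be carried from row to row.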

Besides a possible argument for convenience, does a legitimate reason for using a recursive CTE in SQL Server exist?

Thursday, February 21, 2013

Using Regular Expressions to Clean Data

Regular expressions are a very useful tool for any data professional. In a data analytics project, the vast majority of the work is often preparing the data. Among other things, regular expressions allow for advanced logic in filter, find, or find-and-replace criteria.

For example, the following regular expression is intended to only match a valid Medicare HIC number.

I found this on Regular Expression Library (regexlib.com).

 (?![A-z](\d)\1{5,})(^[A-z]{1,3}(\d{6}|\d{9})$)|(^\d{9}[A-z][0-9|A-z]?$)

This expression says, "Except in the case when it is one letter followed by the same digit 6 or more times in a row, match on one to three letters followed by six or nine digits. Alternatively, match on 9 digits followed by a letter, optionally followed by one more digit or letter."
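To get a feel for what the expression accepts, here is a small Python sketch that runs it against a few made-up strings (none of them are real HIC numbers). The helper name looks_like_hic is my own, and Python's re module is used here only for illustration in place of the SQL Server RegexMatch function.

```python
import re

# The expression exactly as quoted above.
HIC_PATTERN = re.compile(
    r"(?![A-z](\d)\1{5,})(^[A-z]{1,3}(\d{6}|\d{9})$)|(^\d{9}[A-z][0-9|A-z]?$)"
)


def looks_like_hic(value: str) -> bool:
    """True if value matches the quoted HIC expression."""
    return HIC_PATTERN.search(value) is not None


# Made-up samples: letter prefix, digit prefix with suffixes, and some rejects.
for candidate in ("A123456", "123456789A", "123456789B1",
                  "A111111", "12345678A", "_123456"):
    print(candidate, looks_like_hic(candidate))
```

The first three samples match and the next two do not. The last one, "_123456", also matches, because the range [A-z] runs from "A" to "z" in ASCII order and so includes a few punctuation characters such as the underscore. Writing [A-Za-z] instead would be one way to tighten it.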

More robust validation rules could be applied. For example, the set of valid suffixes is narrower than a letter optionally followed by a digit or a letter. Also, I used this in a RegexMatch function in SQL Server and found it slow. Getting more precise and using set-based logic in SQL would be more effective and possibly more efficient; however, that would take time away from other work. With a quick Google search to find a regular expression, a task like validating a HIC number can be completed in minutes.