Thursday, March 7, 2013

The term “Big Data” will fade away, the title “Data Scientist” will lose value, but the importance of “Data Science” will continue to grow



This post contains my predictions for “Big Data”, “Data Scientist” and “Data Science”. Rather than start each sentence with “in my opinion” or “my best guess is”, I will state everything that follows as if it were fact. Please argue with me if you disagree with any of it.

“Big Data” is a recent buzzword, and right now it is a genuinely useful term for talking about advances in the technologies and methods for handling and using large data sets. Many businesses and other organizations are applying statistical analysis to large data sets for the first time. As data analytics advocates across the world test their influence while navigating budget committees and corner offices, they benefit from the memory-locking powers of talking points and buzzwords. Since I am a fan of the growth of data analytics, I support any tool that helps it spread, including buzzwords. Over time, however, as “Big Data Science” approaches the asymptote of ubiquity, the term “Big Data” will become less useful as a buzzword. Another problem for the term is that the size a data set must reach to qualify as “Big Data” is tied to its era and keeps growing over time. Data storage hardware is analogous in this respect: we generally do not refer to a piece of storage hardware as “big”; instead, we state its capacity. For these reasons, use of the term “Big Data” will decline.

The title “Data Scientist” is destined to meet the same fate as the title “<fill-in-the-blank> Architect”. With no universally accepted definition, the line between an Analyst and a Data Scientist will blur. The title will succumb to market pressure from job seekers, who will prefer a title that is perceived to carry status and improve career prospects. There are already postings for entry-level “Jr. Data Scientists”.

“Data Science” itself is not going anywhere. As more and more data becomes available, understanding that data is increasingly necessary for organizations to succeed. Statisticians and data architects will be in increasing demand. People who can bridge those skills will be more valuable still.

Sunday, March 3, 2013

Logic and Probability Puzzle

I heard a good puzzle a few weeks ago on The Skeptics' Guide to the Universe, a podcast about scientific skepticism. I strongly recommend this weekly podcast to anyone interested in learning the pitfalls of flawed reasoning while keeping up with the latest science news. Here is the puzzle:

 “I have two children. One of them is a boy born on a Tuesday. What is the probability that I have two boys?”

Spoiler Alert! Stop now if you want to figure this out on your own.

The answer is thirteen twenty-sevenths (13/27). My gut reaction to the solution was to question how the day of the week on which one child was born could influence the probability of the other child being a boy. That reasoning was incorrect because the information, “One of them is a boy born on a Tuesday”, is not information about one particular child. It is information about the pair.

In other words, it is not equivalent to say:

 “I have two children. Child #1 is a boy born on a Tuesday. What is the probability that child #2 is a boy?”

The answer to this altered version is clearly 50%. The difference is that the original puzzle does not say which child is the boy born on a Tuesday. Try thinking through the puzzle by taking the boy born on a Tuesday as a given for the first child, then cycling through all the possible combinations of gender and day of the week for the other child. When you get to “boy born on a Tuesday” for the other child, you can no longer take “boy born on a Tuesday” as a given for the first child, because the statement is already satisfied by the other child alone. This step is easy to miss when thinking through the puzzle this way. If you do miss it, you are left with the non-equivalent altered puzzle and incorrectly conclude that the answer is 50%.

A better way to think through this is to start by splitting the universe, the set of people with two children, into equally probable parts. For each child there are 14 possibilities (2 genders times 7 days of the week). For each of those possibilities there are 14 possibilities for the other child. That means there are 14 times 14 total possibilities, which equals 196 equally probable combinations. If we limit these 196 combinations to just those with at least one of the children being a boy born on a Tuesday, we are left with 27 equally probable combinations. Of those, 13 have two boys; therefore, the probability is 13/27.
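The counts 27 and 13 can be checked by hand with inclusion-exclusion. The set names below are my own notation for this sketch, not part of the original puzzle: A is the set of combinations where the first child is a boy born on a Tuesday, B is the same for the second child, and the subscript bb restricts a set to two-boy combinations.

```latex
\begin{align*}
|A \cup B| &= |A| + |B| - |A \cap B| = 14 + 14 - 1 = 27 \\
|A_{bb} \cup B_{bb}| &= |A_{bb}| + |B_{bb}| - |A_{bb} \cap B_{bb}| = 7 + 7 - 1 = 13 \\
P(\text{two boys} \mid \text{at least one boy born on a Tuesday}) &= \frac{13}{27} \approx 0.4815
\end{align*}
```

The subtracted 1 is the single combination in which both children are boys born on a Tuesday. Counting it twice is what produces the tempting 14/28 = 50%.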

If you change the puzzle to "I have two children. One of them is a boy. What is the probability that I have two boys?", then the answer becomes 1/3. The universe of two-child families is [(boy,boy), (boy,girl), (girl,boy), (girl,girl)]. We learn that at least one child is a boy, which excludes (girl,girl). We are left with three equally probable combinations, one of which is two boys, so the answer is 1/3.
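The simpler variant can also be checked by brute force. The following is only a minimal sketch in T-SQL, assuming SQL Server 2008 or later for the VALUES table constructor. The CTE name child and the column aliases two_boys and families_with_a_boy are illustrative choices.

```sql
-- Sketch: the "at least one boy" variant, ignoring day of the week.
WITH child AS
(
    -- One row per equally probable gender for a single child.
    SELECT g.gender
    FROM (VALUES ('boy'), ('girl')) AS g(gender)
)
SELECT
    -- Families where both children are boys.
    SUM(CASE WHEN c1.gender = 'boy' AND c2.gender = 'boy'
             THEN 1 ELSE 0 END) AS two_boys,
    -- Families that survive the filter (at least one boy).
    COUNT(*) AS families_with_a_boy,
    -- Cast to FLOAT to avoid integer division.
    CAST(SUM(CASE WHEN c1.gender = 'boy' AND c2.gender = 'boy'
                  THEN 1 ELSE 0 END) AS FLOAT)
        / COUNT(*) AS answer
FROM child AS c1
CROSS JOIN child AS c2          -- all 4 (first child, second child) pairs
WHERE c1.gender = 'boy'
   OR c2.gender = 'boy';        -- drops only (girl, girl)
```

The query should return two_boys = 1 and families_with_a_boy = 3, so answer is 1/3 as a floating-point value.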

If you want to read a robust debate and discussion of this puzzle, check out this Skeptics' Guide to the Universe forum thread.

I'll finish with my T-SQL solution:

WITH child AS
(
    SELECT g.gender, w.weekday_born
    FROM (VALUES('boy'),('girl')) AS g(gender)
    CROSS APPLY (VALUES('Sun'),('Mon'),('Tue'),('Wed'),('Thu'),('Fri'),('Sat')) AS w(weekday_born)
)
SELECT
    CAST(SUM(CASE
                 WHEN c1.gender = 'boy' AND c2.gender = 'boy'
                 THEN 1 ELSE 0 END) AS FLOAT)
    / COUNT(*) AS answer
FROM child AS c1
CROSS JOIN child AS c2
WHERE (c1.gender = 'boy' AND c1.weekday_born = 'Tue')
   OR (c2.gender = 'boy' AND c2.weekday_born = 'Tue')


/*
answer
----------------------
0.481481481481481
*/