Data and Science: Using Regular Expressions to Clean Data

Regular expressions are a very useful tool for any data professional. It is often the case in a data analytics project that the vast majority of the work is preparing the data. Among other things, Regular Expressions allow for advanced logic in filter, find, or find and replace criteria.

For example, the following regular expression is intended to only match a valid Medicare HIC number.

I found this on Regular Expression Library (regexlib.com).

 (?![A-z](\d)\1{5,})(^[A-z]{1,3}(\d{6}|\d{9})$)|(^\d{9}[A-z][0-9|A-z]?$)

This expression says, "Except in the case when it is one letter followed by the same digit 6 or more times in a row, match on one to three letters followed by six or nine digits. Alternatively, match on 9 digits followed by a letter then a digit or a letter."

More robust validation rules could be applied. For example, there is a more specific subset of valid suffixes than a letter followed by a digit or a letter. Also, I used this in a RegexMatch function in SQL Server and found it slow. Getting more precise and using set based logic in SQL would be more effective and possibly more efficient; however, that would take time away from other work. By doing a quick Google search to find a regular expression a task like validating a HIC number can be completed in minutes.

Data and Science

Thursday, February 21, 2013

Using Regular Expressions to Clean Data

No comments:

About Me