Wednesday, July 7, 2010

data integrity check from CSV

This check tests the data integrity of a pipe-delimited CSV file: it locates any lines that contain the delimiter inside a data field, so that the stray delimiter can be removed manually.

Regular Expression:
/(?:.*?\|){4,}.*/

This selects lines that contain four or more pipes.
(?: ) is a non-capturing group. It groups like ( ), except that it doesn't store the match for later reference.
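
To illustrate the difference between the two kinds of group, here is a small Ruby sketch. The sample line is made up for illustration; the second pattern is the one above, and the first is the same pattern with a capturing group swapped in.

```ruby
# Made-up sample line with four pipes (five fields).
line = "a|b|c|d|e"

# Same pattern twice: once with a capturing group, once non-capturing.
capturing     = line.match(/(.*?\|){4,}.*/)
non_capturing = line.match(/(?:.*?\|){4,}.*/)

# A repeated capturing group keeps only its last repetition.
puts capturing.captures.inspect      # ["d|"]
# A non-capturing group stores nothing.
puts non_capturing.captures.inspect  # []
```

Both patterns match the same lines; the non-capturing form just skips the bookkeeping, which is all this check needs.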

A stricter approach, anchored at both ends of the line, is to write something like this:
/^(?:[^|]*\|){4}[^|]*$/

This matches only lines that contain exactly four pipes, i.e. five fields.
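
As a rough sketch of how the anchored pattern could be used to report bad lines with their line numbers: the helper name suspicious_lines and the sample data are hypothetical, and the expected count of four pipes is an assumption taken from the regex above.

```ruby
# Anchored pattern from the post: exactly four pipes on the line.
EXACTLY_FOUR_PIPES = /^(?:[^|]*\|){4}[^|]*$/

# Hypothetical helper: returns [line_number, line] pairs for every line
# that does NOT have exactly four pipes (too many or too few).
def suspicious_lines(text)
  text.each_line.with_index(1)
      .reject { |line, _number| line.chomp =~ EXACTLY_FOUR_PIPES }
      .map { |line, number| [number, line.chomp] }
end

# Made-up sample: line 2 has an extra pipe, line 3 is missing fields.
sample = "a|b|c|d|e\na|b|c|d|e|f\na|b|c\n"
suspicious_lines(sample).each { |number, line| puts "#{number}: #{line}" }
```

For a real file, something like suspicious_lines(File.read("file.txt")) should work, assuming the file fits comfortably in memory.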

Test it:
http://rubular.com/r/f7Vd9O1c4k

Or use Ruby one-liners, replacing /regexp/ with one of the patterns above:

#print only lines that match a regular expression (emulates 'grep')
$ ruby -pe 'next unless $_ =~ /regexp/' < file.txt

#print only lines that DO NOT match a regular expression (emulates 'grep -v')
$ ruby -pe 'next if $_ =~ /regexp/' < file.txt