Thursday, January 23, 2014

When Talend is not enough - Sectioned CSV parsing with awk

Sectioned CSV files are commonly used for reporting purposes. It is a way to present several datasets in sections. Clearly this is a format intended to be used for reporting purposes and not really for post processing. However sometimes there is no option as "it is all the external party can give you" and you will try your best from your ETL tool.

In the case of Talend I found no component capable of doing this and since my time is limited at the moment I figured to better build something quick that could potentially be ported as a Talend component in the future like for example a tFileSectionInput which accepts a section string and a delimiter, it looks for the section and after matched it collects the first line as header and all after it as records. It inserts the section keyword in an additional header field. At least for a balanced section records file this approach should be enough. Let us illustrate with an example:
$ cat ~/sectioned.csv 
This is a sample of a typical sectioned scv file

 It contains indented fields

It also contains sections identified by a title, possible header and rows

For this proof of concept header and row are treated the same even though with awk specific rules could be applied in the future.

section1
c1,c2,c3,c4,c5
a,b,c,d,e
a,b,c,d,e

section2
c1,c2,c3,c4,c5
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5

section3
c1,c2,c3,c4,c5
A,B,C,D,E
A,B,C,D,E
A,B,C,D,E
A,B,C,D,E
Our script should be able to provide the below output:
$ cat ~/sectioned.csv | ~/extract_csv_section.sh section1 ,
c1,c2,c3,c4,c5,section
a,b,c,d,e,section1
a,b,c,d,e,section1
$ cat ~/sectioned.csv | ~/extract_csv_section.sh section2 ,
c1,c2,c3,c4,c5,section
1,2,3,4,5,section2
1,2,3,4,5,section2
1,2,3,4,5,section2
$ cat ~/sectioned.csv | ~/extract_csv_section.sh section3 ,
\c1,c2,c3,c4,c5,section
A,B,C,D,E,section3
A,B,C,D,E,section3
A,B,C,D,E,section3
We use for this a bash wrapper script:
#!/bin/bash
#/usr/local/bin/extract_csv_section.sh
# author: Nestor Urquiza
# date: 20140123
# description: bash wrapper to call awk for extracting a section of a file

dirname=`dirname $0`
awk -v section=$1 -v separator=$2 -f $dirname/extract_csv_section.awk
Which calls the awk script below. Note that there is no need in this particular implementation to use any array. Instead of declaring header and records we can just print '$0 separator "section"' for the header and '$0 separator section' for each record:
#!/usr/bin/awk
#/usr/local/bin/extract_csv_section.awk
# author: Nestor Urquiza
# date: 20140123
# description: section and separator are expected as shell variables (see extract_csv_section.sh)

BEGIN {
 section_found = 0; 
 #section_regex = "^[[:blank:]]*i"section"[[:blank:]]*$";
 section_regex = "^[[:blank:]]*"section"[[:blank:]]*$";
 header = "";
 records[0] = ""
 record_regex = "^.*"separator".*$"
 record_number = 0;
}
{
 if( match($0, section_regex) ) {
  section_found = 1;
  next;
 }
 if( section_found == 1 ) {
  if( header == "" ) {
   header = $0 separator "section";
  } else if ( match($0, record_regex) ) {
   records[record_number] = $0 separator section;
   record_number++; 
  } else {
   nextfile;
  }
 }
}
END { 
 print header;
 for( i = 0; i < length(records); i++ ) {
       print records[i];
 }
}

No comments:

Followers