In the case of Talend I found no component capable of doing this and since my time is limited at the moment I figured to better build something quick that could potentially be ported as a Talend component in the future like for example a tFileSectionInput which accepts a section string and a delimiter, it looks for the section and after matched it collects the first line as header and all after it as records. It inserts the section keyword in an additional header field. At least for a balanced section records file this approach should be enough. Let us illustrate with an example:
$ cat ~/sectioned.csv This is a sample of a typical sectioned scv file It contains indented fields It also contains sections identified by a title, possible header and rows For this proof of concept header and row are treated the same even though with awk specific rules could be applied in the future. section1 c1,c2,c3,c4,c5 a,b,c,d,e a,b,c,d,e section2 c1,c2,c3,c4,c5 1,2,3,4,5 1,2,3,4,5 1,2,3,4,5 section3 c1,c2,c3,c4,c5 A,B,C,D,E A,B,C,D,E A,B,C,D,E A,B,C,D,E
Our script should be able to provide the below output: $ cat ~/sectioned.csv | ~/extract_csv_section.sh section1 , c1,c2,c3,c4,c5,section a,b,c,d,e,section1 a,b,c,d,e,section1 $ cat ~/sectioned.csv | ~/extract_csv_section.sh section2 , c1,c2,c3,c4,c5,section 1,2,3,4,5,section2 1,2,3,4,5,section2 1,2,3,4,5,section2 $ cat ~/sectioned.csv | ~/extract_csv_section.sh section3 , \c1,c2,c3,c4,c5,section A,B,C,D,E,section3 A,B,C,D,E,section3 A,B,C,D,E,section3We use for this a bash wrapper script:
#!/bin/bash #/usr/local/bin/extract_csv_section.sh # author: Nestor Urquiza # date: 20140123 # description: bash wrapper to call awk for extracting a section of a file dirname=`dirname $0` awk -v section=$1 -v separator=$2 -f $dirname/extract_csv_section.awkWhich calls the awk script below. Note that there is no need in this particular implementation to use any array. Instead of declaring header and records we can just print '$0 separator "section"' for the header and '$0 separator section' for each record:
#!/usr/bin/awk #/usr/local/bin/extract_csv_section.awk # author: Nestor Urquiza # date: 20140123 # description: section and separator are expected as shell variables (see extract_csv_section.sh) BEGIN { section_found = 0; #section_regex = "^[[:blank:]]*i"section"[[:blank:]]*$"; section_regex = "^[[:blank:]]*"section"[[:blank:]]*$"; header = ""; records[0] = "" record_regex = "^.*"separator".*$" record_number = 0; } { if( match($0, section_regex) ) { section_found = 1; next; } if( section_found == 1 ) { if( header == "" ) { header = $0 separator "section"; } else if ( match($0, record_regex) ) { records[record_number] = $0 separator section; record_number++; } else { nextfile; } } } END { print header; for( i = 0; i < length(records); i++ ) { print records[i]; } }
No comments:
Post a Comment