Monday, August 29, 2011

Transform XML with XSLT in Talend

Talend Open Studio is an excellent ETL tool that can be used beyond the typical database and CSV manipulation. XML processing for example is today all over the places in the Enterprise.

As a consequence XML transformations are a key skill for those folks doing data transformations.

Even though there are more efficient tools XSLT is a standard which is supported in Talend through the tXSLT component. You just need to provide your XML, XSL and output files and Talend will apply the transformation for you. Talend uses Saxon at the moment so you get the benefit of clear error messages when trying to build your XSL.

Of course XSLT might be a skill that even scares some people, however XSLT is not difficult at all and the more you work with it the better you get as with any other human skill. Do not try to avoid it, if you have XML to process and your ETL tool is Talend then do your homework and learn some XSL.

As there is no better way to teach than providing an example I decided to write this quick showcase that will pivot the data resulting from running an Advent Geneva RSL report (A SOAP service) which comes in the form of key value pairs into a tabular output. I will provide two responses: XML and HTML. The first is probably what you need in Talend while the second is probably what you need if you want to provide a quick add hoc HTML report page.

Here is the XML:
<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:nsg="http://geneva.advent.com">
	<SOAP-ENV:Body id="_0">
		<reportResults xmlns="http://geneva.advent.com">
			<return xsi:type="nsg:reportResultsPortfolioStruct">
				<results xsi:type="nsg:reportResultsStruct">
					<portfolioName xsi:type="xsd:string">Fund1</portfolioName>
					<header xsi:type="nsg:reportResultsRecordStruct">
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">Head</name>
							<value xsi:type="xsd:string">Fund 1 Example</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">Head</name>
							<value xsi:type="xsd:string">DIVIDENDS RECEIVABLE AND PAYABLE</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">Head</name>
							<value xsi:type="xsd:string">FOR THE PERIOD INCEPTION TO July 31, 2011</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">Head</name>
							<value xsi:type="xsd:string"/>
						</field>
					</header>
					<record xsi:type="nsg:reportResultsRecordStruct">
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">CcyCode</name>
							<value xsi:type="xsd:string">BRL</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">Custodian</name>
							<value xsi:type="xsd:string">My Custodian</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">IDesc</name>
							<value xsi:type="xsd:string">My IDesc</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">TransID</name>
							<value xsi:type="xsd:string">10145834</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">TDate</name>
							<value xsi:type="xsd:string">March 15, 2011 12:00:00 am</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">SDate</name>
							<value xsi:type="xsd:string">December 31, 9999 11:59:59 pm</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">Qty</name>
							<value xsi:type="xsd:string">13528.013</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">PerShare</name>
							<value xsi:type="xsd:string">0.15100000</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">TaxRate</name>
							<value xsi:type="xsd:string"/>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">DivLocal</name>
							<value xsi:type="xsd:string">2042.73</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">TaxLocal</name>
							<value xsi:type="xsd:string">0.00</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">DivLocalNet</name>
							<value xsi:type="xsd:string">2042.73</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">DivBook</name>
							<value xsi:type="xsd:string">1225.76</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">TaxBook</name>
							<value xsi:type="xsd:string">0.00</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">DivBookNet</name>
							<value xsi:type="xsd:string">1225.76</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">UnrealFXGL</name>
							<value xsi:type="xsd:string">0.00</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">BookBal</name>
							<value xsi:type="xsd:string">1225.76</value>
						</field>
					</record>
					<record xsi:type="nsg:reportResultsRecordStruct">
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">CcyCode</name>
							<value xsi:type="xsd:string">USD</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">Custodian</name>
							<value xsi:type="xsd:string">My Custodian</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">IDesc</name>
							<value xsi:type="xsd:string">My IDesc</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">TransID</name>
							<value xsi:type="xsd:string">10756740</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">TDate</name>
							<value xsi:type="xsd:string">April 27, 2011 12:00:00 am</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">SDate</name>
							<value xsi:type="xsd:string">July 1, 2011 12:00:00 am</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">Qty</name>
							<value xsi:type="xsd:string">205212.046</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">PerShare</name>
							<value xsi:type="xsd:string">0.16918500</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">TaxRate</name>
							<value xsi:type="xsd:string"/>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">DivLocal</name>
							<value xsi:type="xsd:string">34718.80</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">TaxLocal</name>
							<value xsi:type="xsd:string">0.00</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">DivLocalNet</name>
							<value xsi:type="xsd:string">34718.80</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">DivBook</name>
							<value xsi:type="xsd:string">22153.39</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">TaxBook</name>
							<value xsi:type="xsd:string">0.00</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">DivBookNet</name>
							<value xsi:type="xsd:string">22153.39</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">UnrealFXGL</name>
							<value xsi:type="xsd:string">0.00</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">BookBal</name>
							<value xsi:type="xsd:string">23379.15</value>
						</field>
					</record>
					<addendumErrors xsi:type="nsg:reportResultsRecordStruct">
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">BegDesc</name>
							<value xsi:type="xsd:string"/>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">EndDesc</name>
							<value xsi:type="xsd:string">DIVIDENDS RECEIVABLE - CLOSING BALANCE</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">BegBal</name>
							<value xsi:type="xsd:string"/>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">EndBal</name>
							<value xsi:type="xsd:string">71643.79</value>
						</field>
					</addendumErrors>
					<addendumErrors xsi:type="nsg:reportResultsRecordStruct">
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">BegDesc</name>
							<value xsi:type="xsd:string"/>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">EndDesc</name>
							<value xsi:type="xsd:string">DIVIDENDS PAYABLE - CLOSING BALANCE</value>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">BegBal</name>
							<value xsi:type="xsd:string"/>
						</field>
						<field xsi:type="nsg:reportResultsVectorElement">
							<name xsi:type="xsd:string">EndBal</name>
							<value xsi:type="xsd:string">-20200.35</value>
						</field>
					</addendumErrors>
				</results>
			</return>
		</reportResults>
	</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

If you open it with Firefox and you choose "View XPath" from a right click on the body of the page (contextual menu) you could try several XPATH expresions. Together with XSLT skill it comes XPATH which is just a way to address a node, an attribute or textual content in the XML. See below how I tested one of the XPATH using this tool. Pay attention to the namespace definition, I use simple letters to abbreviate more verbose prefixes.


Now that you have a quick tool for finding nodes in the XML document let us see the desired document structure. Here is a screenshot of what we would like to see in HTML:

Here is in the XML we would like to obtain for further processing in Talend:

Here is the XSL that will output HTML:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:e="http://schemas.xmlsoap.org/soap/envelope/"
  xmlns:g="http://geneva.advent.com">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>

    <xsl:strip-space elements="*"/>

    <xsl:key name="kFieldNameByValue" match="/e:Envelope/e:Body/g:reportResults/g:return/g:results/g:record/g:field/g:name"
         use="."/>

    <xsl:variable name="vCols" select=
       "/e:Envelope/e:Body/g:reportResults/g:return/g:results/g:record/g:field/g:name
                [generate-id()
                =          
                 generate-id(key('kFieldNameByValue',.)[1])
                 ]"/>

    <xsl:template match="/">
             <table>
               <tr>
                 <xsl:apply-templates select="$vCols"/>
               </tr>

               <xsl:for-each select=
                 "/e:Envelope/e:Body/g:reportResults/g:return/g:results/g:record">                   
                 <tr>

                   <xsl:variable name="vPos" select="position()"/>

                   <xsl:for-each select="$vCols">
                     <td>
                       <xsl:value-of select=
                           "../../../g:record[$vPos]/g:field[g:name = current()]/g:value"/>
                     </td>
                   </xsl:for-each>

                 </tr>
              </xsl:for-each>
            </table>

    </xsl:template>

    <xsl:template match="/e:Envelope/e:Body/g:reportResults/g:return/g:results/g:record/g:field/g:name">
            <td>
              <xsl:value-of select="." />
            </td>   
    </xsl:template>
</xsl:stylesheet>

Here is the XSL that will output XML:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:e="http://schemas.xmlsoap.org/soap/envelope/"
  xmlns:g="http://geneva.advent.com" 
>
    <xsl:output omit-xml-declaration="no" indent="yes"/>

    <xsl:strip-space elements="*"/>

    <xsl:key name="kFieldNameByValue" match="/e:Envelope/e:Body/g:reportResults/g:return/g:results/g:record/g:field/g:name"
         use="."/>

    <xsl:variable name="vCols" select=
       "/e:Envelope/e:Body/g:reportResults/g:return/g:results/g:record/g:field/g:name
                [generate-id()
                =          
                 generate-id(key('kFieldNameByValue',.)[1])
                 ]"/>

    <xsl:template match="/">
	  	<xsl:element name="root">
               	<xsl:element name="header">
				  <xsl:apply-templates select="$vCols"/>
				</xsl:element>  

               <xsl:for-each select="/e:Envelope/e:Body/g:reportResults/g:return/g:results/g:record">                   
                 <xsl:element name="record">
                   <xsl:variable name="vPos" select="position()"/>
                   <xsl:for-each select="$vCols">
                       <xsl:element name="value"><xsl:value-of select="../../../g:record[$vPos]/g:field[g:name = current()]/g:value"/></xsl:element>
                   </xsl:for-each>
                 </xsl:element>
              </xsl:for-each>

        </xsl:element>
    </xsl:template>

    <xsl:template match="/e:Envelope/e:Body/g:reportResults/g:return/g:results/g:record/g:field/g:name">
            <xsl:element name="name"><xsl:value-of select="."/></xsl:element>
    </xsl:template>

</xsl:stylesheet>

A quick explanation of what the XSL code does:
1-6 Is just stating it is an XML document which uses a namespace prefixed "xsl" for the transaformation instructions, a soap namespace with prefix "e" and an Advent Geneva namespace with prefix "g". Note that I abreviated the original namespace suffixes for the last two.

7-9 How the final output should look like.

11-19 We use the Muenchian grouping that allows us to have a list of all possible column names.

21-37 XSLT is a functional language which works matching nodes and applying transformations to them. The logic can be affected using xsl:apply-template. We use xsl:element to create our custom nodes: First the headers which come out of the Muenchian key and later the records for which we again use the keys while addressing the correct record through an xsl:foreach nested loop.

39-41 The template responsible for generating the name nodes.

4 comments:

Macin said...

Dear sir,

I followed your tutorial, but nothing happens.
I just took a tXSLT job in talend and filled in the file parameters. When i run the job nothing happens. no errors nor file output.

log:

Starting job multiSchemaXml at 09:53 12/04/2012.

[statistics] connecting to socket on port 3815
[statistics] connected
[statistics] disconnected
Job multiSchemaXml ended at 09:53 12/04/2012. [exit code=0]

Endlog

Nestor Urquiza said...

@Macin this should work out of the box. I retested locally using the tXSLT component, the XML and the XSL that generates XML and it did work for me.

I recommend you reinstall Talend in a different directory and create a test project, drop the tXSL and point to the input files and the output file.

If you still have problems you should export your job and share it with me. I will attempt to help.

Cheers,
-Nestor

Alisa said...

Hi!
Is there a way to save the xsl under svn together with the job itself? We have problem versioning xsl and also xsd files.

Thanks

Nestor Urquiza said...

@Alisa Could you post your question in http://www.talendforge.org? I promise to at least BUMP it there so let me know the link for it. Embedding resources in talend projects should be provided IMO out of the box.

Followers