Thursday, February 14, 2013

Parsing JSON with Talend

JSON parsing has been an issue in Talend for a while. Instead of keeping here the whole history I decided to keep this old post updated with just the latest issue found in Talend tFileInputJSON component. However it is worth mentioning that the next step after the component gets to a decent shape is to measure its performance which is an important issue closed with resolution "suggestion noted".

Basically as it stands the component is still not useful for real world scenarios where JSON data is mostly unbalanced or where the data can be represented by keys that are not necessarily alphanumeric starting with a letter or when the datasets are to be extracted from different nested levels.

The common way to go around this is to code everything needed in a tJavaFlex component like:
  1. Given the json:
    {"arr":[{"id":1, "next":2},{"id":2}, {"id":3, "next":4}]}
    
  2. We build the below sample job:
  3. Using the below in the tJavaFlex
    //imports
    import java.util.Map.Entry;
    import java.util.Set;
    import org.json.simple.JSONObject;
    import org.json.simple.JSONArray;
    import org.json.simple.parser.JSONParser;
    import java.util.ArrayList;
    import java.io.ByteArrayInputStream;
    import java.io.BufferedReader;
    import java.io.ByteArrayOutputStream;
    import java.io.FileReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    
    //start
    JSONParser parser = new JSONParser();  
    JSONObject jsonObject = (JSONObject) parser.parse(new FileReader("/opt/samples/big.json"));
    JSONArray arr = (JSONArray) jsonObject.get("arr");
    for(int i = 0; i < arr.size(); i++)  {
    
    //main
        JSONObject item = (JSONObject) arr.get(i);
     Long itemId = (Long) item.get("id");
     Long next = (Long) item.get("next");
     row5.id = itemId;
     row5.next = next;
    
    //end
    }
  4. Once we run it we correctly get our list of id and next values:
    [statistics] connecting to socket on port 3472
    [statistics] connected
    1|2
    2|
    3|4
    [statistics] disconnected
    Job sample ended at 10:21 14/02/2013. [exit code=0]
    
Clearly that is a lot of code to be writing for a Data Analyst who should be concentrated in plumbing components to achieve Business Intelligence goals. The way to go here is to fix the tFileInputJSON component.

JsonPath

Using JsonPath allows for a shorter implementation though:
 //import
 ...
 import com.jayway.jsonpath.JsonPath;
 ...
 //main
 ...
 Long itemId = (Long) JsonPath.read(item, "$.id");
 Long next = (Long) JsonPath.read(item, "$.next");
 ...
Let us suppose we have a little bit more complicated JSON:
{"arr":[{"book":{"id":1, "next":2}}, {"book":{"id":2}}, {"book":{"id":3, "next":4}}]}
If you want to parse nested properties and the JSON is unbalanced like in this case (there is no "next for book.id=2) you will need to catch PathNotFoundException:
//import
 ...
 import com.jayway.jsonpath.JsonPath;
 import com.jayway.jsonpath.PathNotFoundException;
 ...
 //main
 ...
 Long itemId = "";
 Long next = "";
 //try needed only for nested path like in cases like "$.book.name"
 try {
   itemId = (Long) JsonPath.read(item, "$.book.id");
 } catch (PathNotFoundException e) {
 }
 try {
   next = (Long) JsonPath.read(item, "$.book.next");
 } catch (PathNotFoundException e) {
 }
 ...
This approach requires the below libraries which you can get from json-path project:
json-path-0.9.0.jar
slf4j-api-1.7.5.jar
json-smart.1.2.jar

External Parser

Even though JsonPath from Java should be OK you might find other surprises, who knows what ;-) In such case you might want to consider processing JSON with external tools. One of those tools is described here.

2 comments:

arawan said...

Thank you for posting this! I ran into this exact issue with the tFileInputJSON component and was looking for a workaround. This worked beautifully for me. Great writeup!

Nestor Urquiza said...

@arawan. Glad it helped. Cheers, -Nestor

Followers