The question is then how to remove the whole "tu" nodes that contain duplicated segments while leaving just one of them (again, we lose precision in the translation output, but it might be worth it because of the savings when using a free CAT tool).
The straightforward answer would be to export the TMX from the original tool using whatever options it provides to export less data, specifically ignoring context-specific translations. If that is not a possibility, we are left with building a tool to clean it up.
First we can get an idea of which segments are duplicated and how many times each:
cat input.tmx | grep '<seg>' \
  | sort | uniq -c | sort -nr \
  | grep -v '^ *1 ' > tmx-repetitions.txt
Then we can replace every repeated occurrence (all but the first) with a marker string like DUPLICATE_NODE_PLEASE_REMOVE:
cat input.tmx \
  | awk '{if($0 ~ /<seg>/ && !seen[$0]++ || $0 !~ /<seg>/) print $0; \
          else print "DUPLICATE_NODE_PLEASE_REMOVE"}' > input-with-marked-duplicates.tmx
Finally we can try removing the whole translation unit (tu) node with perl:
cat input-with-marked-duplicates.tmx \
  | perl -0pe 's#<tu(.*?)DUPLICATE(.*?)</tu>##gs'
But if the file is big enough this won't work as expected, probably because of how perl handles multiline matching in this particular command: -0 makes it read the whole file into memory as a single record, and the lazy (.*?) can still run across a </tu> boundary, taking clean units that precede a marked one along with it. This is the reason why I built the open source bash-multiline-replace project, which contains a simple bash script (multilineReplace.sh) that eliminates full blocks, from a start pattern to an end pattern, if they contain an inner pattern:
cat input-with-marked-duplicates.tmx \
  | ./multilineReplace.sh '<tu ' 'DUPLICATE' '</tu>'
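As a rough illustration of the mark-and-remove steps above, here is a minimal sketch run on a made-up three-unit sample. The file names sample.tmx, marked.tmx and deduped.tmx are hypothetical. The removal step is done with awk, buffering one tu at a time; this is a swapped-in technique for the sake of a self-contained example, not how multilineReplace.sh is implemented. It assumes every tag sits on its own line, as the grep and awk commands above already do.

```shell
#!/usr/bin/env bash
# Hypothetical sample: a real TMX carries a header and more attributes.
cat > sample.tmx <<'EOF'
<body>
<tu tuid="1">
<tuv xml:lang="en"><seg>Hello</seg></tuv>
<tuv xml:lang="es"><seg>Hola</seg></tuv>
</tu>
<tu tuid="2">
<tuv xml:lang="en"><seg>Hello</seg></tuv>
<tuv xml:lang="es"><seg>Buenas</seg></tuv>
</tu>
<tu tuid="3">
<tuv xml:lang="en"><seg>Bye</seg></tuv>
<tuv xml:lang="es"><seg>Chao</seg></tuv>
</tu>
</body>
EOF

# Step 1: keep the first occurrence of each <seg> line, mark the repeats.
awk '{ if ($0 !~ /<seg>/ || !seen[$0]++) print
       else print "DUPLICATE_NODE_PLEASE_REMOVE" }' sample.tmx > marked.tmx

# Step 2: buffer each <tu ...> ... </tu> block and print it only if
# it holds no marker. Only one unit is kept in memory at a time.
awk '
  /<tu[ >]/ { inside = 1; buf = ""; dup = 0 }   # start of a unit (not <tuv>)
  inside {
    buf = buf $0 "\n"
    if ($0 ~ /DUPLICATE/) dup = 1               # unit contains a marked segment
    if ($0 ~ /<\/tu>/) {                        # end of the unit: flush or drop
      if (!dup) printf "%s", buf
      inside = 0
    }
    next
  }
  { print }                                     # anything outside a unit
' marked.tmx > deduped.tmx

cat deduped.tmx
```

On this sample, unit 2 repeats the English segment of unit 1, so it is the only one dropped; units 1 and 3 are written out unchanged.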