Friday, April 06, 2018

Removing text blocks containing repetition with Unix or Linux Power Tools

Let us illustrate the issue with an example. In the translation industry, a TMX file is an XML representation of a translation memory (TM), a format used to exchange TMs between tools. It contains translation units (tu nodes) with properties (prop nodes) and translation unit variants (tuv nodes), whose segments (seg nodes) hold the text in the source and target languages. The Computer Aided Translation (CAT) tool often adds the same segment again and again. This helps produce more precise translations, but it becomes a burden when you try to process such a big TMX with an open source CAT tool like OmegaT. Since OmegaT runs on the client side only, processing a big TMX can be problematic. In that case you might accept less precise translations in exchange for being able to use the free tool. These repetitions are mostly related to the context stored around the specific segment (the x-context-pre and x-context-post type attributes).

The question is then how to remove every "tu" node whose segment is a duplicate, leaving just one of them (again, we lose precision in the translation output, but it might be worth it because of the savings from using a free CAT tool).

The straightforward answer would be to export the TMX from the original tool using whatever options it provides to export less data, specifically ignoring context-specific translations. If that is not a possibility, we are left with building a tool to clean it up.

First we can get an idea of which segments are duplicated and how many times each:
cat input.tmx | grep '<seg>' \
| sort | uniq -c | sort -nr \
| grep -v '^ *1 ' > tmx-repetitions.txt
Then we can replace every repeated occurrence with a marker string like DUPLICATE_NODE_PLEASE_REMOVE:
cat input.tmx \
| awk '{if($0 ~ /<seg>/ && !seen[$0]++ || $0 !~ /<seg>/) print $0; \
else print "DUPLICATE_NODE_PLEASE_REMOVE"}' > input-with-marked-duplicates.tmx
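To make the effect of that awk filter concrete, here is a minimal sketch on made-up input. The function name mark_duplicates and the sample lines are assumptions for illustration only; the condition is the same as in the one-liner, just written with the non-seg test first.

```shell
# mark_duplicates: read TMX-like text on stdin, keep the first copy of each
# <seg> line and replace every later identical copy with a marker line.
# Lines without <seg> pass through untouched.
mark_duplicates() {
  awk '{
    if ($0 !~ /<seg>/ || !seen[$0]++) print $0
    else print "DUPLICATE_NODE_PLEASE_REMOVE"
  }'
}

# Made-up input for illustration: "Hello" appears in two translation units.
printf '%s\n' \
  '<tu tuid="1">' '<seg>Hello</seg>' '</tu>' \
  '<tu tuid="2">' '<seg>Hello</seg>' '</tu>' \
  '<tu tuid="3">' '<seg>Bye</seg>' '</tu>' \
  | mark_duplicates
```

Only the second <seg>Hello</seg> line is replaced by the marker. The tu tags around it are still there, which is why a block-removal step is needed afterwards.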
Finally we can try removing the whole translation unit (tu) node with perl:
cat input-with-marked-duplicates.tmx \
| perl -0pe 's#<tu\b(?:(?!</tu>).)*?DUPLICATE.*?</tu>##gs'
But if the file is big enough this won't work as expected, probably because perl does the multiline matching for this particular command in memory. This is the reason why I built the open source bash-multiline-replace project, which contains a simple bash script that eliminates full blocks, from a start pattern to an end pattern, if they contain an inner pattern.
cat input-with-marked-duplicates.tmx \
| ./ '<tu ' 'DUPLICATE' '</tu>' 
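For readers who want to see the idea in isolation, here is a minimal sketch of the same block-removal technique written with awk. It is not the script from the bash-multiline-replace project, only an illustration; the function name remove_blocks is hypothetical, and it assumes the three patterns are plain substrings (no backslashes) and that blocks do not nest. It buffers one block at a time, so memory use is bounded by the largest translation unit rather than by the whole file.

```shell
# remove_blocks START INNER STOP: copy stdin to stdout, dropping every block
# that runs from a line containing START to the next line containing STOP
# whenever some line of that block contains INNER.
remove_blocks() {
  awk -v start="$1" -v inner="$2" -v stop="$3" '
    # Outside a block: a line containing START opens a new buffer.
    !inblock && index($0, start) { inblock = 1; buf = ""; drop = 0 }
    # Lines outside any block are printed immediately.
    !inblock { print; next }
    # Inside a block: buffer the line and remember whether INNER was seen.
    {
      buf = buf $0 "\n"
      if (index($0, inner)) drop = 1
      if (index($0, stop)) {
        if (!drop) printf "%s", buf
        inblock = 0
      }
    }
    # An unterminated block at end of input is passed through unchanged.
    END { if (inblock) printf "%s", buf }
  '
}
```

With the file from the previous step it could be called as remove_blocks '<tu ' 'DUPLICATE' '</tu>' < input-with-marked-duplicates.tmx, redirecting the output to a new file of your choice.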

Saturday, March 24, 2018

Pdf Bash Tools - Ghostscript - Watermarks, password protection, search, split, merge and beyond

There is a lot of PDF processing you can do, including searching, splitting, merging, password protection and watermarking. Yup, for free. Check out and contribute to my pdf bash tools project.

Friday, March 16, 2018

Manage HP ProCurve Switches programmatically from *nix

Just released ProCurve Commander. Repeating yourself is not fun. This is true not only when it comes to managing multiple switches but also when auditing them. The same idea can be used to manage Cisco switches and, in general, any device that is accessible via SSH but not friendly to remote command invocation.

Thursday, March 15, 2018

Hardening HP ProCurve switches

Enable SSH:
# config
(config)# crypto key generate ssh
(config)# ip ssh
(config)# show ip ssh
(config)# exit
# exit
> exit
Confirm ssh works and disable telnet:
# config
(config)# no telnet
(config)# exit
# exit
> exit
Change default users and set complex passwords:
password operator user-name 
password manager user-name 
Identify the switch:
# config
(config)# hostname "My ProCurve Switch  "

Wednesday, March 14, 2018

Java Applets in Mac OS X

Your only option is Safari, just as Internet Explorer is your only option on Windows. If the applet is insecure it won't run, but you can always add exceptions at your own risk. From Apple System Preferences click on Java | Security tab | Edit Site List | Add | Apply | OK, then restart Safari.

Parsing CSV from bash

In one word: csvtool.

To install it in Ubuntu:
sudo apt-get install csvtool
To install it in OS X:
brew install opam
opam init
eval `opam config env`
opam install csv
csvtool --help
To extract the first column from sample.csv (csvtool numbers columns starting from 1):
cat sample.csv | csvtool col 1 -
Find more from:
csvtool --help

Monday, February 26, 2018

pm2 error: unknown option `--auto-exit'

Kubernetes cluster panic!!! App down!!! Library changes cause this, especially when you do not pin specific package versions. Our Dockerfile had just the configuration below, which of course deploys the latest version of pm2:
RUN npm install pm2 -g
RUN pm2 update
CMD ["pm2-docker", "start", "--auto-exit", "process.yml"]
Not only was pm2-docker renamed to pm2-runtime (a symlink still exists), but the --auto-exit flag no longer exists. Now you need to specify --no-auto-exit if needed.

Friday, February 23, 2018

Gmail to EML Add-On Terms of Service

I have coded Gmail to EML because I wanted to avoid multiple clicks to download a Gmail message. While I expect this Gmail add-on to help others, I cannot make myself liable for any losses. Therefore: GMAIL TO EML ADD-ON (THE SOFTWARE) IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY.