Thursday, October 07, 2010

PDF extraction: split by bookmarks and merge back

Here is the task: extract from a PDF the pages bookmarked as "you name it" into a new file, keeping the original bookmarks that relate to the extracted pages.


1. Download pdfsam
2. Extract the archive and run the run-console command (it needs execute permission on Unix/Linux/OS X).
3. Run the command below to split the original file. You can split at any bookmark level; in this case we split on the first level of bookmarks. Other options are available (for more info run ./ -h split):
./ -f /Users/nestor/Downloads/pdf/20100722_dailystm.pdf -o /Users/nestor/Downloads/pdf/pdfsam_out -s BLEVEL -bl 1 -p [BOOKMARK_NAME] split
4. Now say you want to join only two of the resulting files (you can also join a whole directory: for more info run ./ -h concat). Just provide the input file paths and the output file path:
./ -f /Users/nestor/Downloads/pdf/pdfsam_out/F-111-11111.pdf -f /Users/nestor/Downloads/pdf/pdfsam_out/F-444-AA33D.pdf -o /Users/nestor/Downloads/pdf/pdfsam_out/merged.pdf concat
5. In my case I had to rebuild the project from source, because the released jar (pdfsam-console-2.3.0e.jar) failed to merge the files back correctly: the resulting files contained first-level bookmarks that all pointed to page 1.
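
If you need to join more than a couple of files in step 4, typing every -f by hand gets tedious. Below is a minimal sketch, not part of pdfsam, that builds the concat call from every PDF in a directory using only the -f, -o and concat arguments shown above. The function name concat_dir and its three parameters are my own invention, and the launcher path is whatever run-console script you extracted in step 2. pdfsam also has its own way of joining a directory (see the concat help), so treat this purely as an illustration.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with pdfsam).
# $1 = path to the console launcher, $2 = directory with the split PDFs,
# $3 = output file path
concat_dir() {
    launcher=$1
    in_dir=$2
    out_file=$3
    # Reuse the positional parameters as an argument list
    set --
    for f in "$in_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs
        [ -e "$f" ] || continue
        # Do not feed a previous merge result back in as input
        [ "$f" = "$out_file" ] && continue
        set -- "$@" -f "$f"
    done
    # Files are passed in the order the shell glob returns them (sorted by name)
    "$launcher" "$@" -o "$out_file" concat
}

# Usage (all three paths are placeholders):
# concat_dir /path/to/launcher /path/to/pdfsam_out /path/to/pdfsam_out/merged.pdf
```

Because the inputs are ordered by file name, rename the split files first if you need the merged pages in a different order.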

Below are the changes I made to the project so that I could build it.
Nestor-Urquizas-MacBook-Pro:trunk nestor$ svn diff ./ant/
Index: ant/
--- ant/ (revision 1128)
+++ ant/ (working copy)
@@ -1,8 +1,8 @@
 #where classes are compiled, jars distributed, javadocs created and release created

Nestor-Urquizas-MacBook-Pro:trunk nestor$ svn diff ./bin/
Index: bin/
--- bin/ (revision 1128)
+++ bin/ (working copy)
@@ -15,8 +15,8 @@

Below is the output of the ant command that builds the jar. Use that jar to replace the one in the lib folder of the binaries you downloaded in step 1.
Nestor-Urquizas-MacBook-Pro:trunk nestor$ ant -f ant/build.xml 
Buildfile: ant/build.xml




      [jar] Building jar: /Users/nestor/pdf/pdfsam/pdfsam-enhanced/pdfsam-console/trunk/dist/pdfsam-console/dist/pdfsam-console-2.3.0e.jar

Total time: 1 second
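
Swapping the jar is a plain file copy. Here is a minimal sketch that keeps a backup of the original jar first. The function name replace_jar and both paths are assumptions for illustration; adjust them to wherever you built the project and unpacked the binaries.

```shell
#!/bin/sh
# Hypothetical helper: copy a freshly built jar over the jar of the same name
# in the lib folder of the downloaded binaries, backing up the original once.
# $1 = path to the newly built jar, $2 = lib directory of the binaries
replace_jar() {
    new_jar=$1
    lib_dir=$2
    target="$lib_dir/$(basename "$new_jar")"
    # Keep the shipped jar as <name>.orig so the change can be undone;
    # an existing backup is never overwritten
    if [ -f "$target" ] && [ ! -e "$target.orig" ]; then
        cp "$target" "$target.orig" || return 1
    fi
    cp "$new_jar" "$target"
}

# Usage (placeholder paths):
# replace_jar /path/to/built/pdfsam-console-2.3.0e.jar /path/to/pdfsam/lib
```

To roll back, copy the .orig file over the jar again.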
