Read text in pdf files

Hi Ppl,

 

Is it possible to read text from a PDF file? We can use ActiveX controls to open and display PDF files, but these ActiveX controls don't seem to support reading text from them. Please help me out.

 

Thanks 

Message 1 of 32

Hi there

 

It seems there's a solution using .NET. Take a look at this:

 

http://www.codeproject.com/KB/string/pdf2text.aspx

 

You may call .NET from LabVIEW.

 

 

 

Best regards
chris

CL(A)Dly bending G-Force with LabVIEW

famous last words: "oh my god, it is full of stars!"
Message 2 of 32

OK, download PDFBox-0.7.3.zip from http://sourceforge.net/projects/pdfbox/files/ and open the VI attached.

 

You need to relink the .NET constructors for PDDocument and PDFTextStripper to PDFBox-0.7.3.dll (located in the \bin\ directory of PDFBox-0.7.3). See the block diagram for details.

Best regards
chris

Message 3 of 32

This is exactly what I need! I've downloaded and installed PDFBox-0.7.3. When I run "readpdf_8.6.vi" I get error code 1172 with this message:

 

Error calling method org.pdfbox.util.PDFTextStripper.getText, (System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
    Inner Exception: System.NullReferenceException: Object reference not set to an instance of an object.) <append><b>System.NullReferenceException</b> in readpdf_8.6.vi

 

The comments section in the block diagram explains this:

http://www.codeproject.com/KB/string/pdf2text.aspx


Finally: PDFBox
PDFBox is another Java PDF library. It is also ready to use with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package, it's in the bin directory).

Using PDFBox in .NET requires adding references to:

• PDFBox-0.7.2.dll
• IKVM.GNU.Classpath

and copying IKVM.Runtime.dll to the bin directory.

Using the PDFBox to parse PDFs is fairly easy:


private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
}

 

I'm running LabVIEW 2011 on Windows XP Pro, with Adobe Reader X installed. I called NI support, and they were able to run this VI on both a LabVIEW 2011 Windows 7 machine and a LabVIEW 8.6 XP machine.

 

I'm not sure what I'm missing to get this to run. Does anybody have another solution for reading the text from a PDF file?

 

Thanks!

 

Doug

Message 4 of 32

Hi there

 

I just installed PDFBox-0.7.3 on a clean Windows 7 machine with LabVIEW 8.6.1. Then I downloaded "readpdf_8.6.vi" from my original post. The VI ran without any error code.

 

Please check this:

 

- Try to relink the .NET constructors (right-click the constructor node and choose "Select Constructor"). Be aware that there are TWO constructors: "PDDocument" and "PDFTextStripper".

- Check where error code 1172 is thrown (the constructor, or the invoke node "load" or "getText").

- I observed that error code 1172 is thrown when the file name is wrong. Check that the file name is correct.

 

- Have you tried a different pdf? Can you post the pdf you are using?
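
The file-name check in the list above can be done before any .NET call runs. A minimal sketch in Python (outside LabVIEW), assuming only that a well-formed PDF carries the `%PDF-` marker near the start of the file; the function name `check_pdf_path` and its return strings are hypothetical, chosen for illustration:

```python
import os

def check_pdf_path(path):
    """Return a short diagnosis string for a candidate PDF path."""
    if not os.path.isfile(path):
        return "missing"          # wrong file name or directory
    with open(path, "rb") as f:
        head = f.read(1024)       # the header marker is expected near the start
    if b"%PDF-" not in head:
        return "not a pdf"        # file exists but lacks the PDF header marker
    return "ok"
```

A result of "ok" only rules out the path and the header; a PDF can pass this check and still fail inside the text extractor.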

 

 

Best regards
chris
Message 5 of 32

Thanks for the reply. The problem turned out to be the PDF file I was using: when I tried another PDF file, it worked. I also found a command-line utility that works, called "ptconverter": http://www.digitzone.com/download/ptconverter.zip

This app costs $35, but it read my original PDF, which readpdf_8.6.vi could not. I've used readpdf_8.6.vi with success on other files, but I need to read my PDFs reliably, so I purchased ptconverter. My PDFs are scanned documents that are OCR'd by Acrobat, so I don't have a clean text copy to begin with; the PDF I'm trying to read is OCR output.

 

Doug

Message 6 of 32

We are also trying to read some tags within a PDF file. PTConverter works fine (i.e. convert the PDF file into a text file, then use LabVIEW to read the text file), but it is a rather indirect way of getting at tags that are in the PDF file itself.

Isn't there a nice, easy solution (i.e. not PDFBox) that lets us read or search a PDF file directly in LabVIEW?

 

Chris

Message 7 of 32

The full PDF format is VERY complex, which is probably why PDFBox was choking on one of the PDF files of an earlier poster. You are of course free to implement a PDF parser in LabVIEW, but expect a project where a man-year of effort certainly won't be enough to get even close to what PDFBox can do. Then decide whether you want to give it away for free just for the good karma, or attempt to sell it with a potential of maybe one license a year. :-D

Just look at the opposite direction: creating a PDF file from within LabVIEW. There are several toolkits out there that can do that, and they already took a considerable amount of time to develop. Yet generating a small subset of PDF features in a file is several orders of magnitude easier than parsing and interpreting any existing PDF document that might have been created by tools like Adobe Acrobat, with Adobe, as the creator of PDF, potentially using all the bells and whistles they put into the PDF standard over two or more decades, including quite a few bugs that eventually got documented as features.

Rolf Kalbermatter
My Blog
Message 8 of 32

Hi Rolf,

Just to clarify what we are trying to do.

 

A medical application generates the PDF reports, and I want to read certain text fields in these reports, such as the Patient Record Number, so that I can then rename the PDF file to include this tag.

There seem to be a lot of applications out there that can extract text from a PDF file (e.g. XPDF, Aspose, PDFlib TET, iTextSharp). I was just wondering if anyone had done this in LabVIEW and could recommend the tool they found easiest to implement.
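
Of the tools listed, Xpdf includes a command-line program, `pdftotext`, that writes a PDF's text to a file or to standard output, so it can be driven from LabVIEW's System Exec VI or from any script. A minimal sketch in Python, assuming `pdftotext` is installed and on the PATH; the helper names `pdftotext_command` and `extract_text` are hypothetical:

```python
import subprocess

def pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for pdftotext, sending the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]   # request UTF-8 output
    if keep_layout:
        cmd.append("-layout")              # keep the physical page layout
    cmd += [pdf_path, "-"]                 # "-" as output file means stdout
    return cmd

def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text (raises if the tool fails)."""
    result = subprocess.run(pdftotext_command(pdf_path, keep_layout),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

The same argument list, joined with spaces, is what would go on a System Exec command line. Whether the tool copes with a given OCR'd PDF still has to be tested per file.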

 

Chris 

Message 9 of 32

Hi Chris,


That's exactly what I do with "ptconverter". I convert the PDF to a text file, look for a static keyword, parse my data out, and use it to name the file. I found this to be the easiest method. See the sample attachment to get you started.
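
The keyword-and-rename step described above might look like the following once the PDF has been converted to text. This is a minimal sketch in Python, not the attached VI; it assumes the value sits on the same line as the keyword, optionally after a colon or `#`, and the function names `find_field` and `tagged_name` are hypothetical:

```python
import os
import re

def find_field(text, keyword):
    """Return the first token following `keyword` on the same line, or None."""
    pattern = re.escape(keyword) + r"[ \t]*[:#]?[ \t]*(\S+)"
    match = re.search(pattern, text)
    return match.group(1) if match else None

def tagged_name(pdf_path, tag):
    """Build a new path with `tag` appended to the base file name."""
    folder, name = os.path.split(pdf_path)
    stem, ext = os.path.splitext(name)
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", tag)   # keep the name filesystem-safe
    return os.path.join(folder, f"{stem}_{safe}{ext}")
```

With the extracted text in `text`, `find_field(text, "Patient Record Number")` yields the tag, and `os.rename(pdf_path, tagged_name(pdf_path, tag))` applies it. The keyword and its layout depend on the report, so the pattern will likely need adjusting.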

 

Doug

Message 10 of 32