Read text in pdf files

lordsathish · ‎09-01-2009

Hi Ppl,

Is it possible to read text from a PDF file? We can use ActiveX controls to open and display PDF files, but these controls don't seem to support reading text from them. Please help me out.

Thanks

chrisger · ‎09-01-2009

Hi there

it seems there's a solution using .NET. Take a look at this:

http://www.codeproject.com/KB/string/pdf2text.aspx

You may call .NET from LabVIEW.

Best regards
chris

CL(A)Dly bending G-Force with LabVIEW

famous last words: "oh my god, it is full of stars!"

chrisger · ‎09-01-2009

OK, download PDFBox-0.7.3.zip from http://sourceforge.net/projects/pdfbox/files/ and open the VI attached.

You need to relink the .NET constructors for PDDocument and PDFTextStripper with PDFBox-0.7.3.dll (located in the \bin\ directory of PDFBox-0.7.3). See the block diagram for details.

Best regards
chris


Dhubbell · ‎12-22-2011

This is exactly what I need! I've downloaded and installed PDFBox-0.7.3. When I run "readpdf_8.6.vi" I get error code 1172 with this message:

Error calling method org.pdfbox.util.PDFTextStripper.getText, (System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
Inner Exception: System.NullReferenceException: Object reference not set to an instance of an object.) <append><b>System.NullReferenceException</b> in readpdf_8.6.vi

The comments section in the block diagram explains this:

http://www.codeproject.com/KB/string/pdf2text.aspx

Finally: PDFBox
PDFBox is another Java PDF library. It is also ready to use with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package, it's in the bin directory).

Using PDFBox in .NET requires adding references to:

• PDFBox-0.7.2.dll
• IKVM.GNU.Classpath
and copying IKVM.Runtime.dll to the bin directory.

Using the PDFBox to parse PDFs is fairly easy:

private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
}

I'm running LV2011 on XP Pro, with Adobe Reader X installed. I called NI support, and they were able to run this VI on both an LV2011 Windows 7 machine and an LV8.6 XP machine.

Not sure what I'm missing to get this to run. Does anybody have another solution for reading the Text from a PDF file?

Thanks!

Doug

chrisger · ‎12-22-2011

Hi there

I just installed PDFBox-0.7.3 on a clean Win7 machine with LabVIEW 8.6.1. Then I downloaded "readpdf_8.6.vi" from my original post. The VI ran without any error.

Please check this:

- Try to relink the .NET constructors (right-click the constructor node and select "Select Constructor"). Be aware that there are TWO constructors: "PDDocument" and "PDFTextStripper".

- Check where error code 1172 is thrown (constructor, or invoke node "load" or "getText").

- I observed that error code 1172 is thrown when the file name is wrong. Check that the file name is correct.

- Have you tried a different pdf? Can you post the pdf you are using?
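
The file-related checks above can also be done before the extractor is called at all. Below is a minimal sketch in Python (not LabVIEW, since a block diagram cannot be shown as text). The function name `preflight_pdf` is hypothetical, and looking for the `%PDF-` header anywhere in the first 1024 bytes is a deliberately lenient assumption. It only catches a wrong file name or a file that is not a PDF; it says nothing about whether PDFBox can parse the content.

```python
import os

def preflight_pdf(path):
    """Return a list of problems found before handing `path` to a PDF text extractor.

    Hypothetical helper: an empty list means the path exists and looks like a PDF.
    """
    problems = []
    if not os.path.isfile(path):
        # A wrong file name is the case reported above for error 1172.
        problems.append("file not found: " + path)
        return problems
    with open(path, "rb") as fh:
        head = fh.read(1024)
    # A PDF is expected to carry a "%PDF-" header at (or near) the start.
    if b"%PDF-" not in head:
        problems.append("no %PDF- header in the first 1024 bytes")
    return problems
```

If the returned list is empty and the extractor still fails, the PDF content itself is the more likely culprit, so trying a different PDF is the next step.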

Kind Regards

Best regards
chris


Dhubbell · ‎12-22-2011

Thanks for the reply. The problem turned out to be the PDF file I was using: when I tried another PDF file, it worked. I also found a command line utility that works, called "ptconverter". http://www.digitzone.com/download/ptconverter.zip

This app costs $35, but it read my original PDF, which readpdf_8.6.vi could not. I've used "readpdf_8.6.vi" with success on other files, but I need to read my PDFs reliably, so I purchased "ptconverter". My PDFs are scanned documents that are OCR'd by Acrobat, so I don't have a clean text copy to begin with; the PDF I'm trying to read is an OCR output.

Doug

ChrisReed · ‎04-26-2012

We are also trying to read some TAGS within a PDF file. PTConverter works fine (i.e. convert the PDF file into a text file, then use LabVIEW to read the text file), but it is a rather indirect way of getting at tags that are in the PDF file itself.

Isn't there a nice, easy solution (i.e. not PDFBox) for reading/searching a PDF file directly in LabVIEW?

Chris

rolfk · ‎04-26-2012

The full PDF format is VERY complex, which is probably why PDFBox was choking on one of the PDF files of an earlier poster. You are of course free to implement a PDF parser in LabVIEW, but expect this to be a project where a man-year of effort certainly won't be enough to even get close to what PDFBox can do. Then decide whether you want to give it away for free just for the good karma of it, or attempt to sell it with a potential of maybe one license every year.

Just look at the opposite direction: creating a PDF file from within LabVIEW. There are several toolkits out there that can do that, and they already took a considerable amount of time to develop. Yet generating a small subset of PDF features in a file is several orders of magnitude easier than parsing and interpreting any existing PDF document that might have been created by tools like Adobe Acrobat, with Adobe as the creator of PDF potentially using all the bells and whistles they put into the PDF standard over those two or more decades, including quite a few bugs that eventually got documented as features.

Rolf Kalbermatter

ChrisReed · ‎04-26-2012

Hi Rolf,

Just to clarify what we are trying to do.

A medical application is generating the PDF reports, and I want to be able to read certain text fields in these reports, such as the Patient Record Number, so that I can then rename the PDF file to include this tag.

There seem to be a lot of applications out there that can extract text from a PDF file (e.g. XPDF, Aspose, PDFlib TET, iTextSharp). I was just wondering if anyone had done this in LabVIEW and could recommend the tool they found easiest to implement.
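
As an illustration of the command-line route with one of the tools named above: Xpdf ships a `pdftotext` utility that writes a PDF's text to a file. Below is a minimal sketch in Python, assuming `pdftotext` is installed and on the PATH. The helper names are hypothetical, and `-layout` and `-enc UTF-8` are the only options used. From LabVIEW the same command line could be issued through System Exec and the resulting text file read back.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for Xpdf's pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")      # try to preserve the physical page layout
    cmd += ["-enc", "UTF-8"]       # ask for UTF-8 output so it can be read back reliably
    cmd += [pdf_path, txt_path]
    return cmd

def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text; raises if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext is not on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8", errors="replace") as fh:
        return fh.read()
```

Whether this handles a given report depends on how the PDF was produced; scanned or OCR'd documents may extract poorly.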

Chris

Dhubbell · ‎04-26-2012

Hi Chris,

That's exactly what I do with "ptconverter". I convert the PDF to a TEXT file, look for a static KEYWORD then parse my data out, and use it to name the file. This I found was the easiest method to use. See the sample attachment to get you started.
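
The attachment is not reproduced here. As a rough illustration of the keyword step only, here is a minimal sketch in Python rather than LabVIEW. The function names, the "Patient Record Number" keyword, and the assumption that the value is the first whitespace-delimited token after the keyword are all hypothetical; real reports may need a different pattern.

```python
import os
import re

def extract_field(text, keyword):
    """Return the first token after `keyword` (an optional ':' or '#' is skipped), or None."""
    pattern = re.escape(keyword) + r"\s*[:#]?\s*(\S+)"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.group(1) if match else None

def tagged_filename(original_name, tag):
    """Insert a sanitised tag before the extension, e.g. report.pdf -> report_<tag>.pdf."""
    stem, ext = os.path.splitext(original_name)
    # Replace anything that is not a letter, digit, underscore or hyphen.
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", tag)
    return f"{stem}_{safe}{ext}"
```

For text containing the line `Patient Record Number: PRN-00123`, `extract_field(text, "Patient Record Number")` gives `"PRN-00123"`, and `tagged_filename("report.pdf", "PRN-00123")` gives `"report_PRN-00123.pdf"`.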

Doug
