PDF.Extractor for .NET

PDF.Extractor for .NET is a plug-in that extracts text from PDF documents. It supports three modes of operation: pure, raw, and plain. If the developer does not set a mode, the default is 'raw'.

Buy for $99

Features:

1. Extracts text from the PDF file.

2. Supports three types of mode: pure, raw, plain.

3. The default mode is 'raw'.

4. Accepts file paths and file streams, in any combination, for both input and output.

Extract Text from PDF Document via .NET Library

PDF.Extractor for .NET supports three types of operating mode:

1. Pure extracts the text from the PDF file and applies several formatting procedures, which include taking relative positions into account and adding extra spaces to align the text to the width of the page.

2. Raw extracts text from the PDF file without formatting it.

3. Plain extracts text from the PDF file taking the relative positioning of the text fragments into account but (unlike Pure mode) without adding extra spaces.

If the developer does not set a mode, the default mode is 'Raw'.
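
To see how the three modes differ on a given document, the same file can be processed once per mode and the outputs compared. The following is a minimal sketch, not a definitive implementation: the namespace Aspose.Pdf.Plugins, the file name input.pdf, and the enum members Raw and Plain are assumptions (only the member Pure appears in the samples on this page).

```csharp
using System;
using Aspose.Pdf.Plugins; // assumed namespace; not stated on this page

class ModeComparison
{
    static void Main()
    {
        // Pure appears in the samples on this page; Raw and Plain are
        // assumed member names derived from the mode names.
        var modes = new[]
        {
            PdfTextExtractionOptions.TextFormattingMode.Pure,
            PdfTextExtractionOptions.TextFormattingMode.Raw,
            PdfTextExtractionOptions.TextFormattingMode.Plain
        };

        using (PdfExtractor extractor = new PdfExtractor())
        {
            foreach (var mode in modes)
            {
                // Fresh options per mode, so each run has a single data source.
                var options = new PdfTextExtractionOptions(mode);
                options.AddDataSource(new FileDataSource("input.pdf")); // hypothetical file name

                ResultContainer result = extractor.Process(options);

                // Print a header per mode so the outputs are easy to compare.
                Console.WriteLine("--- " + mode + " ---");
                Console.WriteLine(result.ResultCollection[0].ToString());
            }
        }
    }
}
```

On a multi-column or table-heavy page, the Pure output would be expected to contain the most padding and the Raw output the least, in line with the descriptions above.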

Sample code for extracting text with 'default' options:

using (PdfExtractor extractor = new PdfExtractor())
{
    // No mode is passed, so the default mode (Raw) is used.
    PdfTextExtractionOptions extractorOptions = new PdfTextExtractionOptions();
    extractorOptions.AddDataSource(new FileDataSource(dataDir + "input.pdf"));
    ResultContainer resultContainer = extractor.Process(extractorOptions);
    // One input was added, so the extracted text is the first result.
    string textExtracted = resultContainer.ResultCollection[0].ToString();
    Console.WriteLine(textExtracted);
}

Sample code to extract text with mode set:

using (PdfExtractor extractor = new PdfExtractor())
{
    // Pass the formatting mode to the options constructor.
    PdfTextExtractionOptions extractorOptions = new PdfTextExtractionOptions(PdfTextExtractionOptions.TextFormattingMode.Pure);
    extractorOptions.AddDataSource(new FileDataSource(dataDir + "input.pdf"));
    ResultContainer resultContainer = extractor.Process(extractorOptions);
    string textExtracted = resultContainer.ResultCollection[0].ToString();
    Console.WriteLine(textExtracted);
}

Multiple files (or streams) can be specified as input, in which case ResultCollection contains one result per input. Example:

string[] inputPaths = new string[]
{
    GetInputPath(dataDir + "text_sample1.pdf"),
    GetInputPath(dataDir + "text_sample2.pdf")
};
using (PdfExtractor extractor = new PdfExtractor())
{
    PdfTextExtractionOptions extractorOptions = new PdfTextExtractionOptions(PdfTextExtractionOptions.TextFormattingMode.Pure);
    // Register every input file as a data source.
    foreach (string inputPath in inputPaths)
    {
        extractorOptions.AddDataSource(new FileDataSource(inputPath));
    }
    ResultContainer resultContainer = extractor.Process(extractorOptions);
    // Print the extracted text of each result.
    for (int i = 0; i < resultContainer.ResultCollection.Count; i++)
    {
        string textExtracted = resultContainer.ResultCollection[i].ToString();
        Console.WriteLine(textExtracted);
    }
}
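
When several inputs are processed at once, it is often more useful to save each result than to print it. The sketch below writes every extracted text to a .txt file next to its source PDF, using only the plug-in calls shown in the samples on this page plus standard System.IO methods. It rests on several assumptions: the namespace Aspose.Pdf.Plugins, the hypothetical file names, and that results are returned in the same order in which the data sources were added.

```csharp
using System;
using System.IO;
using Aspose.Pdf.Plugins; // assumed namespace; not stated on this page

class SaveExtractedText
{
    static void Main()
    {
        // Hypothetical input files.
        string[] inputPaths = { "text_sample1.pdf", "text_sample2.pdf" };

        using (PdfExtractor extractor = new PdfExtractor())
        {
            var options = new PdfTextExtractionOptions();
            foreach (string inputPath in inputPaths)
            {
                options.AddDataSource(new FileDataSource(inputPath));
            }

            ResultContainer resultContainer = extractor.Process(options);

            // Assumption: result i corresponds to input i.
            int count = Math.Min(inputPaths.Length, resultContainer.ResultCollection.Count);
            for (int i = 0; i < count; i++)
            {
                // text_sample1.pdf -> text_sample1.txt
                string outputPath = Path.ChangeExtension(inputPaths[i], ".txt");
                File.WriteAllText(outputPath, resultContainer.ResultCollection[i].ToString());
                Console.WriteLine("Saved " + outputPath);
            }
        }
    }
}
```

If the ordering assumption does not hold for your version of the plug-in, process the files one at a time instead so that each result is unambiguous.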

How to Extract Text from PDF Document


  1. Install PDF.Extractor for .NET.
  2. Create an object of TextDevice class.
  3. Use the object of TextExtractOptions class to specify extraction options.
  4. Save the text to the output file.

System Requirements

Make sure that you have the following prerequisites:

  • Microsoft Windows or a compatible OS with .NET Framework or .NET Core
  • VBScript, Delphi, C++ via COM Interop.
  • Development environment like Microsoft Visual Studio.
  • PDF.Extractor for .NET DLL referenced in your project.