This repository was archived by the owner on Jan 20, 2021. It is now read-only.
Deprecation Note: This is the old version of the Tabula extraction engine. New projects wishing to integrate Tabula should use tabula-java (the new Java version of this extraction engine) unless you prefer to use JRuby. Users looking for the command-line version of Tabula should also use tabula-java.
Extract tables from PDF files. tabula-extractor is the table extraction engine that used to power Tabula.
If you're beginning a new project, consider using tabula-java, a pure-Java version of the extraction engine behind Tabula. If you want Ruby bindings and are okay using JRuby (or have already begun a project), you may continue to use this project. This project's JRuby backend has been replaced with the Java backend; all that remains here is a thin wrapper for Ruby compatibility. This wrapper maintains API backwards-compatibility with the old, pure-JRuby implementation that we all know and love.
Installation
tabula-extractor only works with JRuby 1.7 or newer. Install JRuby, then run:
jruby -S gem install tabula-extractor
Usage
Tabula helps you extract tables from PDFs
Usage:
tabula [options] <pdf_file>
where [options] are:
--pages, -p <s>: Comma separated list of ranges. Examples: --pages 1-3,5-7 or --pages 3. Default is --pages 1 (default: 1)
--area, -a <s>: Portion of the page to analyze (top,left,bottom,right). Example: --area 269.875,12.75,790.5,561. Default is entire page
--columns, -c <s>: X coordinates of column boundaries. Example --columns 10.1,20.2,30.3
--password, -s <s>: Password to decrypt document. Default is empty (default: )
--guess, -g: Guess the portion of the page to analyze per page.
--debug, -d: Print detected table areas instead of processing.
--format, -f <s>: Output format (CSV,TSV,HTML,JSON) (default: CSV)
--outfile, -o <s>: Write output to <file> instead of STDOUT (default: -)
--spreadsheet, -r: Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
--no-spreadsheet, -n: Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
--silent, -i: Suppress all stderr output.
--use-line-returns, -u: Use embedded line returns in cells.
--version, -v: Print version and exit
--help, -h: Show this message
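To show how the options above combine on one command line, here is a minimal sketch that assembles an argument list for the tabula command from a Ruby hash. The helper name tabula_args and its option keys are hypothetical and not part of tabula-extractor; only flags listed in the help text are used. It takes an options hash rather than keyword arguments so that it also runs under JRuby 1.7's default Ruby 1.9 mode.

```ruby
# Hypothetical helper (not part of tabula-extractor): builds the argument
# list for the `tabula` command using only flags shown in the help text.
def tabula_args(pdf_file, opts = {})
  args = []
  args += ['--pages', opts[:pages]] if opts[:pages]
  # --area expects top,left,bottom,right joined by commas
  args += ['--area', opts[:area].join(',')] if opts[:area]
  args += ['--format', opts[:format]] if opts[:format]
  args += ['--outfile', opts[:outfile]] if opts[:outfile]
  args << '--spreadsheet' if opts[:spreadsheet]
  args << pdf_file
  args
end

args = tabula_args('report.pdf',
                   :pages   => '1-3,5-7',
                   :area    => [269.875, 12.75, 790.5, 561],
                   :format  => 'CSV',
                   :outfile => 'report.csv')

# The list could then be handed to the CLI, e.g. system('tabula', *args)
```

Passing the arguments as an array, rather than a single interpolated string, avoids shell quoting problems when file names contain spaces.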
Scripting examples
tabula-extractor is a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but the tests are a good source of information.
Here's a very basic example:
require 'tabula'

pdf_file_path = "whatever.pdf"
outfilename = "whatever.csv"

out = open(outfilename, 'w')

extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, :all)
extractor.extract.each do |pdf_page|
  pdf_page.spreadsheets.each do |spreadsheet|
    out << spreadsheet.to_csv
    out << "\n\n"
  end
end
out.close
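The loop in the basic example can be factored into a small reusable function. The sketch below is one way to do that, under the assumption that pages respond to spreadsheets and each spreadsheet responds to to_csv, as in the example. The name write_spreadsheets_csv and the FakeSheet and FakePage stand-ins are hypothetical and exist only so the snippet runs without a PDF. With the real gem you would presumably pass extractor.extract as the first argument and an open file as the second.

```ruby
require 'stringio'

# Hypothetical helper (not part of tabula-extractor): writes every
# spreadsheet on every page to `io` as CSV, each followed by a blank line.
# `pages` is any enumerable whose items respond to #spreadsheets; each
# spreadsheet must respond to #to_csv. Returns the number of tables written.
def write_spreadsheets_csv(pages, io)
  count = 0
  pages.each do |page|
    page.spreadsheets.each do |spreadsheet|
      io << spreadsheet.to_csv
      io << "\n\n"
      count += 1
    end
  end
  count
end

# Stand-ins for the gem's page and spreadsheet objects, for illustration only.
FakeSheet = Struct.new(:to_csv)
FakePage  = Struct.new(:spreadsheets)

buffer = StringIO.new
pages = [FakePage.new([FakeSheet.new("a,b\n1,2\n")]), FakePage.new([])]
written = write_spreadsheets_csv(pages, buffer)
```

Returning the count makes it easy to detect a document in which no spreadsheet-style tables were found.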