tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF.
You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV or a JSON file.
Ensure you have a Java runtime installed and that the java executable is on your PATH.
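Because tabula-py depends on a Java runtime being reachable through PATH, it can help to check that before installing. The following is a minimal sketch using only the Python standard library; the helper names `find_java` and `java_version` are hypothetical and are not part of tabula-py.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version(java_path):
    """Return the version banner printed by `java -version`."""
    # Many runtimes print the banner to stderr, so capture both streams.
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("No java executable found on PATH; install a Java runtime first.")
    else:
        print(f"Found Java at {java_path}")
        print(java_version(java_path))
```

If the script reports that no executable was found, install a Java runtime or adjust PATH before running tabula-py.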
pip install tabula-py
If you want faster execution with jpype, install tabula-py with the jpype extra.
pip install tabula-py[jpype]
Example
tabula-py enables you to extract tables from a PDF into a DataFrame or JSON. It can also extract tables from a PDF and save them as a CSV, a TSV, or a JSON file.
import tabula

# Read a PDF into a list of DataFrames
dfs = tabula.read_pdf("test.pdf", pages="all")

# Read a remote PDF into a list of DataFrames
dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# Convert a PDF into a CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages="all")

# Convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format="csv", pages="all")
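Since `read_pdf` returns a list of DataFrames in the example above, a common follow-up is to write each extracted table to its own file. The sketch below is one possible way to do that; `save_tables` and `extract_pdf_tables` are hypothetical helper names, not tabula-py functions, and the output file naming is an assumption made for illustration.

```python
from pathlib import Path


def save_tables(tables, out_dir, prefix="table"):
    """Write each table to its own CSV file and return the paths written.

    `tables` is any iterable of objects with a pandas-style `to_csv` method,
    such as the list of DataFrames returned by `tabula.read_pdf`.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(tables, start=1):
        target = out_path / f"{prefix}_{index}.csv"
        table.to_csv(target, index=False)
        written.append(target)
    return written


def extract_pdf_tables(pdf_path, out_dir):
    """Read every table in `pdf_path` and save each one as a separate CSV."""
    # Imported lazily so that save_tables can be used without tabula-py or Java.
    import tabula

    dfs = tabula.read_pdf(pdf_path, pages="all")
    return save_tables(dfs, out_dir)
```

Calling `extract_pdf_tables("test.pdf", "tables")` would then produce files such as `tables/table_1.csv`, one per table found. When a single output file is enough, `tabula.convert_into` as shown above is the more direct route.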