Releases · Unstructured-IO/unstructured · GitHub
Release timestamps, in page order: 18 Jul 17:31; 16 Jul 22:48; 15 Jul 20:59; 15 Jul 19:08; 08 Jul 08:13; 05 Jul 19:34; 01 Jul 23:42; 24 Jun 23:52; 13 Jun 02:43; 20 Mar 16:52
0.18.10
a040483
This commit was created on GitHub.com and signed with GitHub’s verified signature.
Enhancements
Features
- Add OCR_AGENT_CACHE_SIZE environment variable: Added a configurable cache size for OCR agents to control memory usage.
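As a rough illustration of how an environment variable can bound a cache, here is a minimal sketch. It is not the library's implementation: `make_agent_cache`, `get_agent`, and the fallback value of 1 are assumptions made for this example. Only the variable name `OCR_AGENT_CACHE_SIZE` comes from the release note.

```python
import os
from functools import lru_cache

# Read the cache size from the environment. The fallback of 1 is an
# assumption for this sketch, not a documented default.
CACHE_SIZE = int(os.environ.get("OCR_AGENT_CACHE_SIZE", "1"))


def make_agent_cache(maxsize: int):
    """Return a loader that keeps at most `maxsize` agents in memory."""

    @lru_cache(maxsize=maxsize)
    def get_agent(module_name: str) -> dict:
        # Hypothetical stand-in for constructing an expensive OCR agent.
        return {"module": module_name}

    return get_agent


get_agent = make_agent_cache(CACHE_SIZE)
```

A smaller size evicts older agents sooner and saves memory. A larger size avoids rebuilding agents when several are used in rotation.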
0.18.9
909716f
Enhancements
Features
- Convert elements to markdown for output: Added a function to convert elements to markdown format for easy viewing.
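The idea of rendering partitioned elements as markdown can be sketched as follows. This is a hypothetical, self-contained illustration and not the function added in this release: the `Element` class, the category names, and the mapping rules are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    category: str  # e.g. "Title", "NarrativeText", "ListItem"
    text: str


def element_to_markdown(element: Element) -> str:
    """Map one element to a markdown line based on its category."""
    if element.category == "Title":
        return f"# {element.text}"
    if element.category == "ListItem":
        return f"- {element.text}"
    # Fall back to plain paragraph text for every other category.
    return element.text


def elements_to_markdown(elements: list[Element]) -> str:
    """Join the converted elements with blank lines between them."""
    return "\n\n".join(element_to_markdown(e) for e in elements)
```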
Fixes
- Language detection nit: Handle empty text.
- Properly handle password-protected xlsx: detect password protection on XLSX files and raise appropriate
0.18.7
344202f
Enhancements
Features
- Add language detection for PDFs: Add document-level and element-level language detection to PDFs.
Fixes
0.18.6
2ffaf6f
Enhancements
Features
Fixes
- Improved epub partition errors: EPUB partition will now produce a new type of error on unprocessable files.
- Fix type for serialized TableChunks: Use `TableChunk` for the string value of the field `type` when serializing elements of type `TableChunk`, rather than using the value `Table`.
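One way such a mismatch can arise is when a subclass inherits a class-level label from its parent and the serializer reads that label. The sketch below is an assumption about the general pattern, not the library's code: the `category` attribute and the `to_dict` method shown here are hypothetical.

```python
class Table:
    """Hypothetical element class; `category` is a class-level label."""

    category = "Table"

    def __init__(self, text: str) -> None:
        self.text = text

    def to_dict(self) -> dict:
        # Using type(self).__name__ keeps subclasses distinguishable
        # even when they inherit the parent's `category`.
        return {"type": type(self).__name__, "text": self.text}


class TableChunk(Table):
    """A chunk of a table; inherits `category = "Table"` unchanged."""
```

Serializing from `category` would label a `TableChunk` as `Table`. Serializing from the concrete class name does not.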
0.18.4
f078cd9
What's Changed
- fix(partition, csv): increase csv field limit by @ds-filipknefel in #4046
Full Changelog: 0.18.3...0.18.4
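The PR title for 0.18.4 does not say which limit was raised. Assuming it refers to the field size limit of Python's standard `csv` module, the following sketch shows how that limit can be raised for one parse and then restored. The helper name `read_rows` and its `field_limit` parameter are made up for this example.

```python
import csv
import io


def read_rows(text: str, field_limit: int) -> list[list[str]]:
    """Parse CSV text with a temporarily raised field size limit."""
    # csv.field_size_limit(new) sets the limit and returns the old one.
    previous = csv.field_size_limit(field_limit)
    try:
        return list(csv.reader(io.StringIO(text)))
    finally:
        # The limit is global to the csv module, so restore it.
        csv.field_size_limit(previous)
```

A field longer than the active limit makes the reader raise `csv.Error`, which is why very large cells need a higher limit.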
0.18.3
8a9abdd
What's Changed
Full Changelog: 0.18.2...0.18.3
0.18.2
d7dfda9
What's Changed
- fix [NEX-49] : Fix TypeError for empty HTML content by @yuming-long in #4032
- fix: add try/except wrap over row.cells to failproof tc grid_offset by @Klaijan in #4033
- fix: xml processing not escaped by @jiajun-unstructured in #4034
- fix: update md to reads umlauts on non-utf-8 files by @Klaijan in #4037
- bump version by @Klaijan in #4038
- fix: fix header and footer not parsed as Header/Footer types by @badGarnet in #4041
- bump version to make a release by @badGarnet in #4042
Full Changelog: 0.18.1...0.18.2
0.18.1
3f87946
Enhancements
Features
- Add DocumentData element type: This is helpful in scenarios where there is large data that does not make sense to represent across each element in the document.
Fixes
- The `encoding` property of the `_CsvPartitioningContext` is now properly used.
0.17.11-dev1
5e43e36
What's Changed
- Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names by @srisudarsan in #3959
- manual trigger of workflows to publish new image and new vers tag in … by @luke-kucing in #3965
- chore: deprecate stage_for_label_studio by @qued in #3968
- build: remove test and dev deps from docker image by @qued in #3969
- feat: convenience unstructured-get-json.sh update by @cragwolfe in #3971
- chore: allow changing default output dir for unstructured-get-json.sh by @cragwolfe in #3973
- chore: add html path to ingest-test-fixtures-update-pr by @cragwolfe in #3977
- fix: hi_res PDF parsing: only uncategorized text for extracted elements by @cragwolfe in #3975
- Fix sort_page_element. ensures that sorting is stable and not random. by @pprados in #3978
- Update pdfminer_utils.py by @Nathan-GoSupply in #3974
- fix cve by @potter-potter in #3989
- fix: Add missing diffstat command to test_json_to_html CI job by @mpolomdeepsense in #3992
- fix: failing build by @mpolomdeepsense in #3993
- fix: properly handle the case when an element's text is None by @badGarnet in #3995
- fix: Fix for Pillow error when extracting PNG images by @awalker4 in #3998
- fix: throw validation error when json is passed with invalid unstructured json by @jordan-homan in #4002
- Replace Serverless API to Platform announcement on README page by @ron-unstructured in #4003
- fix: resolve warnings of logger library by @emmanuel-ferdman in #3999
- chore: script to verify unstructured image outbound connectivity by @cragwolfe in #4008
- resolve CVEs and HF issue by @luke-kucing in #4009
- Feat/bump inference by @badGarnet in #4013
- Bump requests to address CVEs by @PastelStorm in #4015
- Drop Python 3.9 support due to dependency conflicts by @PastelStorm in #4017
- Remove IDs from HTML code by @plutasnyy in #4012
- fix chucking text None type has no attribute stripe by @yuming-long in #4018
- recompile on arm64 to get minimum reqs by @badGarnet in #4020
New Contributors
- @srisudarsan made their first contribution in #3959
- @Nathan-GoSupply made their first contribution in #3974
- @jordan-homan made their first contribution in #4002
- @emmanuel-ferdman made their first contribution in #3999
- @PastelStorm made their first contribution in #4015
Full Changelog: 0.17.2...0.17.11-dev1
0.17.2
0fa5174
Enhancements
- Add image_url of images in html partitioner: `<img>` tags with non-data content include a new image_url metadata field with the content of the src attribute.
- Use `lxml` instead of `bs4` to parse hOCR data: `lxml` is much faster than `bs4` given the hOCR data format is regular (guaranteed because it is programmatically generated).
- Bump `numpy` to `>2`, and upgrade `paddlepaddle`, `unstructured-paddleocr`, and `onnx` so they are compatible with `numpy>2`.
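The image_url behaviour described above can be approximated with the standard library alone. This sketch is not the partitioner's code: the `ImageUrlCollector` class is hypothetical, and treating any `src` that starts with `data:` as "data content" is an assumption about what the note means.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Collect the src of every <img> tag that is not an inline data URI."""

    def __init__(self) -> None:
        super().__init__()
        self.image_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Skip images without a src and images embedded as data URIs.
        if src and not src.startswith("data:"):
            self.image_urls.append(src)
```

Feeding HTML to the collector with `feed()` fills `image_urls` in document order.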
Fixes
- Fix Image in a tag is "UncategorizedText" with no .text
What's Changed
- feat: support extracting image url in html by @ryannikolaidis in #3955
- feat: use lxml instead of bs4 to parse hOCR data by @badGarnet in #3960
- Feat/bump numpy to 2 by @badGarnet in #3961
- Image within div or span with no text is annotated as Image by @ajjimeno in #3962
Full Changelog: 0.17.0...0.17.2