Tags · Unstructured-IO/unstructured · GitHub
Tags: Unstructured-IO/unstructured
0.18.10
Add OCR_AGENT_CACHE_SIZE environment variable (#4066)

## Problem

OCR agents used unlimited caching, causing excessive memory usage. The memory footprint varies from one cached OCR agent to another, but a single agent can easily consume ~800MB.

## Solution

Add the `OCR_AGENT_CACHE_SIZE` environment variable to limit the number of cached OCR agents per process.

- **Default**: 1 cached agent
- **Configurable**: Set to 0 to disable caching, or higher for more languages
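The bounded cache described above can be sketched with the standard library alone. This is a minimal illustration, not the code from the PR: only the `OCR_AGENT_CACHE_SIZE` variable name and its default of 1 come from the commit message, while `read_cache_size`, `make_agent_getter`, and `FakeOCRAgent` are hypothetical names introduced here. How the real implementation handles invalid values is also an assumption.

```python
import os
from functools import lru_cache


def read_cache_size(default: int = 1) -> int:
    """Read OCR_AGENT_CACHE_SIZE, falling back to `default` (assumed behaviour)."""
    raw = os.environ.get("OCR_AGENT_CACHE_SIZE", "")
    try:
        return max(0, int(raw))  # negative values are clamped to 0 in this sketch
    except ValueError:
        return default  # unset or non-numeric value


class FakeOCRAgent:
    """Hypothetical stand-in for an OCR agent that is expensive to keep in memory."""

    def __init__(self, language: str) -> None:
        self.language = language


def make_agent_getter(cache_size: int):
    """Return a loader that keeps at most `cache_size` agents cached."""

    @lru_cache(maxsize=cache_size)  # maxsize=0 disables caching entirely
    def get_agent(language: str) -> FakeOCRAgent:
        return FakeOCRAgent(language)

    return get_agent


get_agent = make_agent_getter(read_cache_size())
```

With a cache size of 1, requesting a second language evicts the first agent, which trades reload time for a lower memory ceiling. Raising the limit only pays off when a process regularly alternates between several languages.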
0.18.9
feat: keep input tag's class attr in table (#4064)

This change affects HTML partitioning. Previously, when the HTML contained a table, any tags inside the table were stripped of their `class` and `id` attributes, except for the `class` attribute of `img` tags. This change also preserves the `class` attribute of `input` tags inside a table. The change is reflected in a table element's `metadata.text_as_html` attribute.
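The cleaning rule can be illustrated with a small standard-library sketch. It is not the library's implementation: `clean_table_attributes` and `KEEP_CLASS_FOR` are hypothetical names, the input is assumed to be well-formed markup that `xml.etree.ElementTree` can parse, and treating the `<table>` tag itself like its descendants is also an assumption.

```python
import xml.etree.ElementTree as ET

# Tags whose `class` attribute survives cleaning, per the commit message.
KEEP_CLASS_FOR = {"img", "input"}


def clean_table_attributes(table_html: str) -> str:
    """Drop `class` and `id` within a table, keeping `class` on img and input tags."""
    table = ET.fromstring(table_html)  # assumes well-formed (XML-compatible) markup
    for node in table.iter():  # visits the table element and all descendants
        node.attrib.pop("id", None)
        if node.tag not in KEEP_CLASS_FOR:
            node.attrib.pop("class", None)
    return ET.tostring(table, encoding="unicode")


sample = (
    '<table class="grid"><tr><td id="c1">'
    '<img class="logo" src="a.png"/>'
    '<input class="checked" type="checkbox"/>'
    "</td></tr></table>"
)
cleaned = clean_table_attributes(sample)
```

In `cleaned`, the table and cell lose their `class` and `id`, while the `img` and `input` tags keep `class`, so downstream consumers of the table HTML can still distinguish, say, a checked checkbox from an unchecked one.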
0.18.7
feat: detect language for PDFs (#4051)

The `@apply_metadata` decorator already contains logic to detect the language of the element text (at either the document or the element level). Update PDF partitioning, and later image partitioning, to use this decorator so that each element reports an accurate language.

Test

```
from unstructured.partition.auto import partition


def test_partition_pdf():
    pdf_path = "example-docs/language-docs/fr_olap.pdf"
    elements = partition(pdf_path)  # optionally set `detect_language_per_element=True`
    print(f"Number of elements partitioned: {len(elements)}")

    # Check that elements are returned
    assert len(elements) > 0, "No elements were partitioned from the PDF."

    # Print the detected languages for each element
    for element in elements:
        print(element)
        print(element.metadata.languages)
        print("-------------------------------")


test_partition_pdf()
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
0.18.6
fix: type for serialized TableChunks (#4056)

#### To test, simply serialize a TableChunk element with and without the changes in the PR

____

**Without the changes:**

```
In [1]: from unstructured.documents.elements import TableChunk

In [2]: TableChunk("hi")
Out[2]: <unstructured.documents.elements.TableChunk at 0x110113410>

In [3]: TableChunk("hi").to_dict()
Out[3]:
{'type': 'Table',
 'element_id': '6267e99a-46d8-4f2d-a206-51c691469c72',
 'text': 'hi',
 'metadata': {}}
```

____

**With the changes:**

```
In [1]: from unstructured.documents.elements import TableChunk

In [2]: TableChunk("hi")
Out[2]: <unstructured.documents.elements.TableChunk at 0x10367f050>

In [3]: TableChunk("hi").to_dict()
Out[3]:
{'type': 'TableChunk',
 'element_id': 'f91af3ac-0dea-4dc4-8a6a-69c28cfcca3b',
 'text': 'hi',
 'metadata': {}}
```

____
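The before/after difference can be reproduced with toy classes. The sketch below shows one way such a mismatch can arise, namely a subclass inheriting its parent's type label. Whether this is the actual cause fixed in the PR is an assumption, and `SketchElement`, `SketchTable`, `SketchTableChunkOld`, and `SketchTableChunkNew` are hypothetical stand-ins, not the library's classes.

```python
class SketchElement:
    """Hypothetical base element that serializes a class-level type label."""

    category = "UncategorizedText"

    def __init__(self, text: str) -> None:
        self.text = text

    def to_dict(self) -> dict:
        # The label is looked up on the instance's class, so subclasses
        # inherit the parent's value unless they override it.
        return {"type": self.category, "text": self.text, "metadata": {}}


class SketchTable(SketchElement):
    category = "Table"


class SketchTableChunkOld(SketchTable):
    """No override: serializes as its parent type, like the 'without' output."""


class SketchTableChunkNew(SketchTable):
    """Override added: serializes as its own type, like the 'with' output."""

    category = "TableChunk"


before = SketchTableChunkOld("hi").to_dict()["type"]
after = SketchTableChunkNew("hi").to_dict()["type"]
```

Here `before` is `'Table'` and `after` is `'TableChunk'`, mirroring the two transcripts. A regression test for the real library would assert on `TableChunk("hi").to_dict()["type"]` in the same way.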
0.18.5
feat: keep img tag's class attr (#4050)

This change affects HTML partitioning. Previously, when the HTML contained a table, any tags inside the table were stripped of their `class` and `id` attributes. However, a table sometimes contains images (`img` tags) whose `class` attribute carries important information about the image. This change preserves the `class` attribute of `img` tags inside a table. The change is reflected in a table element's `metadata.text_as_html` attribute.
0.18.1
feat: add DocumentData type (#4031)

When a large amount of data describes the document as a whole rather than individual elements, it may be preferable to store it in a single location instead of duplicating it across all elements (as is done for smaller metadata such as filename or filetype). This PR adds a `DocumentData` element type that can be used to capture this data once per document.