The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the unstructured ecosystem within LangChain.

Installation and setup

Use the following steps to get unstructured and its dependencies running, whether you partition remotely through the API or locally.
  • For the smallest installation footprint, and to use features not available in the open-source unstructured package, install the Python SDK with pip install unstructured-client and the LangChain integration with pip install langchain-unstructured. You can then use the UnstructuredLoader to partition remotely against the Unstructured API. This loader lives in a LangChain partner repo rather than the langchain-community repo, and it requires an api_key, which you can generate for free on the Unstructured API key page.
  • To run everything locally, install the open-source Python package with pip install unstructured along with pip install langchain-community, and use the same UnstructuredLoader as mentioned above.
    • You can install document specific dependencies with extras, e.g. pip install "unstructured[docx]". Learn more about extras in the full installation documentation.
    • To install the dependencies for all document types, use pip install "unstructured[all-docs]".
  • Install the following system dependencies if they are not already available on your system, for example with brew install on macOS. Depending on the document types you’re parsing, you may not need all of them.
    • libmagic-dev (filetype detection)
    • poppler-utils (images and PDFs)
    • tesseract-ocr (images and PDFs)
    • qpdf (PDFs)
    • libreoffice (MS Office docs)
    • pandoc (EPUBs)
  • When running locally, Unstructured also recommends using Docker by following this guide to ensure all system dependencies are installed correctly.
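As a rough way to see which of the system dependencies listed above are already present, the sketch below looks up one executable per package on the PATH. The package-to-executable mapping is an assumption (names vary by platform and package manager), and libmagic is left out because it is a shared library rather than a command-line tool.

```python
import shutil

# Assumed mapping from each system package to an executable it typically
# installs. These names are illustrative and may differ on your platform.
ASSUMED_EXECUTABLES = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "qpdf": "qpdf",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def find_system_dependencies(executables=ASSUMED_EXECUTABLES):
    """Return {package: resolved path, or None if not found on PATH}."""
    return {pkg: shutil.which(exe) for pkg, exe in executables.items()}


# Print a short report of what was found.
for pkg, path in find_system_dependencies().items():
    print(f"{pkg}: {path or 'not found on PATH'}")
```

A missing entry only matters if you parse the document types that need it, so treat the report as a hint rather than a hard requirement.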
The Unstructured API requires an API key to make requests. You can request a key and start using it right away; the Unstructured API README explains how to start making API calls. If you’d like to self-host the Unstructured API or run it locally, see the Docker self-hosting instructions. We’d love to hear your feedback, so let us know how it goes in our community Slack, and stay tuned for improvements to both quality and performance.

Data loaders

The primary usage of Unstructured is in data loaders.

UnstructuredLoader

See the usage example for how to use this loader to partition both locally and remotely with the serverless Unstructured API.
from langchain_unstructured import UnstructuredLoader
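Building on the import above, here is a minimal sketch of choosing between local and remote partitioning. It assumes the loader accepts file_path, api_key, and partition_via_api keyword arguments; the helper names build_loader_kwargs and load_documents are hypothetical and not part of any library, so check the loader's API reference before relying on this.

```python
def build_loader_kwargs(file_path, api_key=None):
    """Build keyword arguments for the loader (hypothetical helper).

    With an API key, request remote partitioning; without one, leave the
    loader on its default local partitioning.
    """
    kwargs = {"file_path": file_path}
    if api_key:
        kwargs["api_key"] = api_key
        kwargs["partition_via_api"] = True  # assumed flag for remote partitioning
    return kwargs


def load_documents(file_path, api_key=None):
    """Load a file into LangChain documents (hypothetical helper)."""
    # Imported lazily so build_loader_kwargs works without the package installed.
    from langchain_unstructured import UnstructuredLoader

    loader = UnstructuredLoader(**build_loader_kwargs(file_path, api_key))
    return loader.load()
```

For example, load_documents("example.pdf") would partition a placeholder file locally, while passing api_key="..." would send it to the Unstructured API instead.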