DocumentCloud Add-Ons

Add-Ons make it easy for anyone to add and share additional features within the DocumentCloud platform, ranging from automating repetitive tasks to integrating machine learning and data visualization techniques.

Overview

For end users, using an Add-On is as simple as selecting documents or executing a search, picking which feature they’d like to apply to the results, and then submitting.

On the backend, Add-Ons execute Python scripts organized in a standard way, hosted and processed right on GitHub.

Add-Ons can take advantage of the full DocumentCloud API, call third-party services, and use a few Add-On-specific functions, such as storing arbitrary files, sending emails to the user, tracking the progress of an Add-On's execution, and displaying messages to the user.

In addition to running in the DocumentCloud user interface, Add-Ons are designed to run on your local computer: clone the repository to your local device, install the DocumentCloud Python wrapper, and then invoke the Add-On's main.py file. If the Add-On requires authentication, invocation requires your DocumentCloud username and password, which are used to fetch a refresh token and an access token. They can be passed in as command line arguments (--username and --password) or as environment variables (DC_USERNAME and DC_PASSWORD).

You can also pass in a list of document IDs (--documents), a search query (--query), and JSON parameters for your Add-On (--data). Be sure to properly quote your JSON at the command line.

Example invocation:

python main.py --documents 123 --data '{"name": "World"}'
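
Quoting the JSON by hand is easy to get wrong, so it can help to generate the command line instead. The following is a minimal sketch using only the Python standard library; build_invocation is a hypothetical helper, not part of the DocumentCloud wrapper, and it assumes that multiple document IDs are passed space-separated after --documents.

```python
import json
import shlex


def build_invocation(documents=None, query=None, data=None):
    """Return a shell-safe command line for running an Add-On locally."""
    argv = ["python", "main.py"]
    if documents:
        # Assumption: multiple IDs are passed space-separated after --documents.
        argv.append("--documents")
        argv.extend(str(doc_id) for doc_id in documents)
    if query:
        argv.extend(["--query", query])
    if data is not None:
        # json.dumps produces valid JSON; shlex.join adds the shell quoting.
        argv.extend(["--data", json.dumps(data)])
    return shlex.join(argv)


print(build_invocation(documents=[123], data={"name": "World"}))
# prints: python main.py --documents 123 --data '{"name": "World"}'
```

The quoting produced here targets POSIX shells; other shells such as Windows cmd.exe use different quoting rules.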

We have an Add-On template hosted on GitHub that demonstrates basic features, as well as a variety of other example Add-Ons that might serve as a useful base for your own work.

DocumentCloud Premium and AI Credits

Some Add-Ons require AI credits to run because they rely on paid services for operations such as OCR, document translation, or AI-based analysis.
DocumentCloud Premium comes with AI credits for both professional and organizational accounts on MuckRock. You can upgrade your account or organization on the MuckRock Select Plan page. You can also upgrade your plan by clicking on the drop-down menu named "Premium" once you log in. It will link you to the same upgrade plan page.

To check your AI credit balance, you can:

  1. Click on the name of your organization in the top navigation bar and your monthly allowance will appear there. If you are a freelancer, click on the second drop-down menu (next to your account name) and your balance should appear there as well.
  2. Click on a premium Add-On from the Add-On run menu. Your allowance will appear in the run menu.
  3. When uploading a document, your AI credit balance also appears as text in the upload menu where you can select which OCR engine you want to use.

AI Credit usage is logged. If you want to know how your AI credits were used, contact us.

At this time, there are eight premium Add-Ons:

  • Azure Document Intelligence OCR uses Azure's Document Intelligence system to OCR documents. This Add-On requires AI credits.
  • Azure Table Extractor uses Azure Document Intelligence Form Analyzer to detect tables and extract them to a CSV or JSON file for you.
  • Google Cloud Vision OCR uses Google Cloud Vision OCR engine to OCR documents. This Add-On requires AI credits.
  • GPT 3.5 Turbo PlayGround: Use GPT 3.5 Turbo to help analyze your documents, right within DocumentCloud. Give this Add-On a prompt as well as an optional key for a key/value pair to add information as a tag. This Add-On requires AI credits.
  • GPT 4 Vision Table Extractor: Use GPT 4 Vision to extract tables from documents in CSV or JSON format.
  • Textract OCR uses Amazon's Textract OCR engine to OCR documents. This Add-On requires AI credits.
  • Textract Table Extractor uses Amazon Textract to detect tables and extract them to a CSV or Excel file for you.
  • Translate Documents: Uses the Google Translate API to translate documents, which are then automatically uploaded to DocumentCloud. This Add-On requires AI credits.

Types of Add-Ons

There are several types of Add-Ons: ones that use AI, perform bulk operations, specialize in data extraction, calculate DocumentCloud statistics, export documents or the data they contain, monitor websites for changes or newly uploaded documents, or transform file types DocumentCloud doesn't natively support into more readily analyzable documents.

AI-Based Add-Ons

  • GPT-3.5 Turbo PlayGround: Use GPT 3.5 Turbo to help analyze your documents, right within DocumentCloud. Give this Add-On a prompt as well as an optional key for a key/value pair to add information as a tag. This Add-On requires AI credits.
  • GPT 4 Vision Table Extractor: Use GPT 4 Vision to extract tables from documents in CSV or JSON format.

Bulk Operations Add-Ons

Data Extraction & Analysis Add-Ons

  • Bad Redactions: Building on the excellent X-Ray library from Free Law Project, Bad Redactions looks for failed redactions that leave the underlying data intact. This is useful both for investigating whether a document contains more information than meets the eye and for making sure you have properly and fully deleted information from your own uploads. Note that DocumentCloud automatically flattens pages and deletes underlying data when you use our redaction tools or force OCR. We recommend trying it on the infamous Manafort filing, in which the Add-On flagged and highlighted 25 redaction errors during our test. You can have the Add-On leave a private annotation around the mis-redacted information or have it go ahead and properly redact it for you.
  • Regex Extractor: Lets you define a regex string to pull specified text matches from a selection of documents into a spreadsheet.
  • Multiple Regex Extractor: Lets you define multiple regex patterns to search across the document selection and returns a CSV file of all the strings that match the given patterns.
  • PII Detector: Detects PII in a document, annotates where it appears, and, if you choose, automatically emails you when sensitive PII is detected. It supports detecting addresses, zip codes, SSNs, emails, phone numbers, and credit card numbers.
  • Tabula Spreadsheet Extraction: Runs the open source Tabula library against a selected PDF and tries to identify and extract any tables. You can provide a Google Drive or Dropbox URL to a Tabula template you have already generated to run against the documents. If no template is provided, Tabula will try to guess the boundaries of the tables within the document.
  • Azure Document Intelligence OCR uses Azure's Document Intelligence system to OCR documents. This Add-On requires AI credits.
  • Google Cloud Vision OCR uses Google Cloud Vision OCR engine to OCR documents. This Add-On requires AI credits.
  • docTR OCR uses the docTR OCR library to OCR documents.
  • Azure Table Extractor uses Azure Document Intelligence Form Analyzer to detect tables and extract them to a CSV or JSON file for you.
  • Textract Table Extractor uses Amazon Textract to detect tables and extract them to a CSV or Excel file for you.

Export Add-Ons

File Transformation Add-Ons

  • Transcribe Audio: This Add-On transcribes audio/video files using OpenAI's Whisper. You may upload audio files from any publicly accessible URL. You may also use share links from Google Drive, Dropbox, MediaFire, WeTransfer and YouTube. If you use a share link for a folder, it will process all files in that folder.
  • Translate Documents: Uses the Google Translate API to translate documents, which are then automatically uploaded to DocumentCloud. This Add-On requires AI credits.
  • Email Conversion Add-On: Converts EML & MSG files to PDFs and uploads them to DocumentCloud. It can also optionally extract attachments, which are presented to the user for download.
  • PDF Compression Add-On: Uses Ghostscript to compress large PDFs for upload to DocumentCloud.
  • PDF Re-Flow Add-On: Resizes a document using k2pdfopt to optimize it for reading on smaller screens, such as e-readers and phones.
  • Document Splitter: Splits a document in two at a specified page, using pdftk in the background.

Site Monitoring Add-Ons

  • Scraper: This Add-On will monitor a given site for documents and upload them to your DocumentCloud account, alerting you to any documents that meet given keyword criteria.
  • Klaxon Site Monitor: This Add-On will monitor a given site for changes based on a specified CSS selector and email you when there are changes on the site. It will additionally archive the newly seen page using The Wayback Machine provided by the Internet Archive.

Statistical Add-Ons

  • N-Gram Graphs: Feel like you're seeing a term pop up more and more often? Now it's easier to validate your hunch: this Add-On charts how often the words you enter occur over time and compares them to each other across a given search.
  • Page Stats: Gives you basic statistics about the total length of a selection of documents, the longest document, shortest document and average pages per document.
  • User upload frequency graph: Curious whether you're more productive during some months than others? Want to see the progress of your sharing with the public? Use this Add-On to graph your uploads over time. Tip: Put your username in as it appears in the search field (e.g., michael-morisy-658).

New Add-Ons are being added all the time. Under the Add-Ons dialog, click "Browse All Add-Ons" to explore and activate or deactivate Add-Ons. Register for the DocumentCloud newsletter to get updates on additional features and other announcements.

Run Your Add-On in DocumentCloud

If you write your own Add-On, you can run it from within DocumentCloud's user interface through a few simple steps.

First, install the GitHub DocumentCloud App. Note that for this to work properly, you must have your primary GitHub and MuckRock accounts set to use the same email address. You can set your primary MuckRock account email here.

As you add the GitHub DocumentCloud App, give it access to only those repositories in your GitHub account that are Add-Ons you want to run. You can modify this from this page once you have the app installed in GitHub.

A screenshot of the above linked webpage, showing a single repository linked to the DocumentCloud GitHub app.

Then your Add-On will appear for you under "Browse All Add-Ons" and you can activate it there.

Permissions and Security

Currently, the DocumentCloud team reviews and vets each Add-On that is integrated directly within the site (i.e., the ones you see in the Add-On dropdown). Add-Ons that a user downloads and runs locally, however, are not necessarily vetted or reviewed by the DocumentCloud team and you should only run Add-Ons that are published by individuals you trust.

Currently, Add-Ons are essentially given full access to your user account, and can do anything you can while logged in, including reading all of your documents, deleting or modifying them, sharing documents with other users, and much more.

For Add-Ons run through the site, they do not see your account credentials, just a unique token granted to that Add-On. For Add-Ons run through a GitHub Action or run locally, there is the potential for a maliciously written Add-On to obtain your credentials so it is particularly important to understand and trust the source of the Add-On before you run it.

As we open up Add-Ons to additional third-party contributions, we'll begin to offer more limited access tokens that constrain permissions to just the documents and actions explicitly granted to them, as well as defining certain time scopes for that access.

Calling Existing Add-Ons Programmatically

Add-Ons can be called using the API. To call an Add-On, you need to know its Add-On ID. To find the ID, query for the Add-On by name first and the ID will be returned in the response.

For example, I can query for the Azure Document Intelligence OCR Add-On like so:
GET /api/addons/?query=Azure+Document+Intelligence+OCR

With the full API URL, the request looks like so:
https://api.www.documentcloud.org/api/addons/?query=Azure+Document+Intelligence+OCR
The response for this call includes the ID of the Azure Document Intelligence OCR Add-On, which is 544.

{
    "next": null,
    "previous": null,
    "results": [
        {
            "id": 544,
            "user": 102112,
            "organization": 125,
            "access": "public",
            "name": "Azure Document Intelligence OCR",
            "repository": "MuckRock/documentcloud-azure-document-intelligence-ocr-addon",
            "parameters": {
                "cost": {
                    "unit": "page",
                    "price": 1,
                    "amount": 1
                },
                "type": "object",
                "title": "Azure Document Intelligence OCR",
                "documents": [
                    "selected"
                ],
                "categories": [
                    "extraction",
                    "premium"
                ],
                "properties": {
                    "to_tag": {
                        "type": "boolean",
                        "title": "Tag OCR'd documents?",
                        "description": "If selected, the key/value pair of ocr_engine:azure will be added to the document as metadata"
                    }
                },
                "description": "<p>This Add-On uses Azure&rsquo;s Document Intelligence API to OCR documents.</p>",
                "instructions": ""
            },
            "created_at": "2023-09-13T15:00:43.149354Z",
            "updated_at": "2024-07-30T18:02:37.812200Z",
            "active": false,
            "default": false,
            "featured": false
        }
    ]
}
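
The lookup above can be sketched in Python with the standard library alone. The helper names (addon_search_url, find_addon_id, fetch_addon_id) are hypothetical, and the sketch assumes that the search works without authentication for public Add-Ons; if it does not for your account, the request would also need an access token.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.www.documentcloud.org/api/"


def addon_search_url(name):
    """Build the Add-On search URL for a given Add-On name."""
    return API_BASE + "addons/?" + urllib.parse.urlencode({"query": name})


def find_addon_id(response, name):
    """Return the ID of the result whose name matches exactly, or None."""
    for addon in response.get("results", []):
        if addon.get("name") == name:
            return addon["id"]
    return None


def fetch_addon_id(name):
    """Query the API and return the matching Add-On ID (network call)."""
    # Assumption: public Add-Ons can be searched without authentication.
    with urllib.request.urlopen(addon_search_url(name)) as resp:
        return find_addon_id(json.load(resp), name)
```

Matching on the exact name rather than taking the first result guards against a search that returns several similarly named Add-Ons.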

We can then invoke this Add-On by using the endpoint:
POST /api/addon_runs/

You must pass JSON in the request body that describes the Add-On run, including the Add-On ID found earlier. You must also include any parameters the Add-On requires in order to run successfully.
The required parameters are explicitly specified in the Add-On's config.yaml file in its GitHub repository. If the Add-On performs actions on documents, you must also pass the IDs of the documents you want to run the Add-On on.
If you are invoking many simultaneous or sequential Add-On runs, we recommend setting dismissed to true. Setting dismissed to true hides the Add-On run's progress bar in your account when you are logged in.
For reference, the OCR Scheduler Add-On includes sample Python code showing how to call the OCR Add-Ons.

{
    "addon": 544,
    "parameters": {
        "to_tag": true
    },
    "documents": [
        doc_id_1,
        doc_id_2
    ],
    "dismissed": true
}
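
As a rough sketch, the request body above can be assembled and sent from Python with the standard library. The helper names are hypothetical, and the sketch assumes the API accepts the access token in a Bearer Authorization header; the document IDs and the token are placeholders you supply.

```python
import json
import urllib.request

RUNS_URL = "https://api.www.documentcloud.org/api/addon_runs/"


def build_run_request(addon_id, document_ids, parameters, access_token,
                      dismissed=True):
    """Build (but do not send) the POST request that starts an Add-On run."""
    payload = {
        "addon": addon_id,
        "parameters": parameters,
        "documents": list(document_ids),
        "dismissed": dismissed,
    }
    return urllib.request.Request(
        RUNS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the access token is sent as a Bearer token.
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )


def start_run(request):
    """Send the request and return the decoded JSON response (network call)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Separating request construction from sending makes it possible to inspect or log the payload before any Add-On run is actually started.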

Once you have called an Add-On, you can poll the Add-On run occasionally to see if it has succeeded or failed. We recommend starting your Add-On runs in batches and polling for success before starting more, to avoid rate limits and performance issues.

To do so, access the following endpoint with the UUID of the Add-On run, which is provided in the response when you initially invoke the Add-On.
GET /api/addon_runs/<uuid>/

This will return a status for each Add-On run which you can use to monitor and either retry, debug or invoke new Add-On runs.
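
One way to structure the polling is a small loop that stops at a finished status or after a fixed number of attempts. This is a sketch under stated assumptions: fetch_run is a hypothetical callable you provide that performs GET /api/addon_runs/<uuid>/ and returns the decoded JSON, and the status strings treated as finished ("success", "failure", "cancelled") are assumptions to verify against the responses you actually receive.

```python
import time


def wait_for_run(fetch_run, uuid,
                 terminal=("success", "failure", "cancelled"),
                 interval=10.0, max_polls=30, sleep=time.sleep):
    """Poll an Add-On run until it reaches a terminal status.

    fetch_run(uuid) must return the run as a dict with a "status" key.
    Raises TimeoutError if the run is still unfinished after max_polls.
    """
    for attempt in range(max_polls):
        run = fetch_run(uuid)
        # Assumption: the status strings in `terminal` mark a finished run.
        if run.get("status") in terminal:
            return run
        if attempt < max_polls - 1:
            sleep(interval)  # wait between polls to stay clear of rate limits
    raise TimeoutError(
        f"Add-On run {uuid} did not finish after {max_polls} polls"
    )
```

Passing the fetch and sleep functions in as arguments keeps the loop independent of any HTTP library and lets you exercise it without network access.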

Document Selection

When you run an Add-On via the DocumentCloud web interface, it will take one of four options for what documents to act on:

  • Selected: When you run the Add-On, it will try to act on the documents that are currently selected with a check mark.
  • Query: When you run the Add-On, it will try to act on all of the documents that are currently listed in the search results, including documents that are not in the current view. Note that large numbers of search results, or search results that include documents you don't have permission to access, are more likely to cause errors.
  • Both: Some Add-Ons will let you select between the two options above. If you don't currently have any documents selected, it will default to acting on the documents in the search results while letting you know that you may select documents instead. To do so, cancel the Add-On, select the documents, and pick the Add-On again.
  • Neither: Some Add-Ons don't actually take any documents as input, such as an Add-On that imports documents from a link or scrapes a webpage for document links.

Note that currently, these options determine what specific document IDs are sent to the Add-On, but the Add-On still has permissions to your entire document collection. In the future, as we better understand Add-On use cases, we plan to restrict access permissions to only the subset of documents that an Add-On requires to successfully run.

Deep Linking Add-Ons

DocumentCloud Add-Ons have deep linking enabled, meaning you can easily share a link to a useful Add-On with others. You will notice that clicking on an Add-On pulls up its configuration menu and changes the URL as well. For example, clicking on the PII Detector Add-On allows me to link to the Add-On directly like so: https://www.documentcloud.org/app?q=%2B#add-ons/MuckRock/PII-Detector

Add-Ons can also be shared with parameters pre-filled by modifying the URL. For example, to share a URL to the PII Detector Add-On with the Detect SSNs field pre-selected, one can do so like this: https://www.documentcloud.org/app?q=%2B&ssn=true#add-ons/MuckRock/PII-Detector

Properties defined in the Add-On's config.yaml can be chained one after another and deep linked, like this link that specifies both the site to monitor and the * (all) CSS selector for Klaxon: https://www.documentcloud.org/app?q=%2B&site=https://muckrock.com&selector=*#add-ons/MuckRock/Klaxon

Hourly, daily, or weekly event options for scheduled Add-Ons (like Klaxon and Scraper) can be passed as parameters as well. https://www.documentcloud.org/app?q=%2B&site=https://muckrock.com&selector=*&event=hourly#add-ons/MuckRock/Klaxon
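
Links like the ones above can also be generated in code. The sketch below uses a hypothetical helper, build_deep_link, and mirrors the q=%2B prefix seen in the examples. Note that urlencode percent-encodes more characters than the hand-written examples do (for instance the slashes in a site URL), so it assumes the app decodes such values the same way.

```python
import urllib.parse

APP_URL = "https://www.documentcloud.org/app"


def build_deep_link(repository, **properties):
    """Build a deep link to an Add-On with pre-filled properties.

    repository is the "owner/name" part after #add-ons/ in the URL.
    properties are config.yaml property names mapped to string values.
    """
    params = {"q": "+"}  # encoded as q=%2B, as in the example links
    params.update(properties)
    query = urllib.parse.urlencode(params)
    return f"{APP_URL}?{query}#add-ons/{repository}"


print(build_deep_link("MuckRock/PII-Detector", ssn="true"))
# prints: https://www.documentcloud.org/app?q=%2B&ssn=true#add-ons/MuckRock/PII-Detector
```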

Submit an Add-On Suggestion

You can submit your Add-On for review to share with all DocumentCloud users by filling out this form.

If you have other questions, suggestions, or feedback, please email info@documentcloud.org — we’re excited to see what you do with Add-Ons!