Extracting text for analysis and machine learning
How to extract text from all portions of a document, for use in analysis or machine learning.
In this section
Using the Table of Contents to extract text recursively from an entire document
Extracting text only from the lowest hierarchical element
Extracting text from different portions of a document
Sometimes it's useful to break a piece of legislation down into its constituent portions (sections, paragraphs, subsections, etc.), extract the text from each portion, and work with the portions individually rather than with the document as a whole.
Example use cases include:
Semantic search or document similarity: using language embedding models to calculate embeddings for different parts of the document.
Keyword and taxonomy tagging: using machine learning models to automatically apply taxonomy tags to different parts of a document.
Recursive text extraction using the Table of Contents
In this section, we'll recursively extract text from the different portions of the document, following the hierarchy defined by the Table of Contents.
Fetch the Table of Contents (TOC)
The Laws.Africa Content API can provide the Table of Contents (TOC) hierarchy of a document in JSON format. This is often easier than trying to build it yourself.
Let's fetch it from the API for our example Cape Town Liquor By-law, using the same API token we used in Basics of text extraction.
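A minimal sketch of this request, using only the Python standard library. The base URL, the by-law's expression URI, the toc.json path, the "toc" response key and the "Token" header format are all assumptions made for illustration; check them against the Content API documentation before relying on them.

```python
import json
from urllib.request import Request, urlopen

# Assumed values for illustration; confirm them in the Content API docs.
API_BASE = "https://api.laws.africa/v3"
EXPRESSION_URI = "/akn/za-cpt/act/by-law/2014/control-undertakings-liquor/eng"


def build_toc_request(expression_uri, token):
    """Build an authenticated request for a document's TOC as JSON."""
    url = f"{API_BASE}{expression_uri}/toc.json"
    return Request(url, headers={"Authorization": f"Token {token}"})


def fetch_toc(expression_uri, token, opener=urlopen):
    """Fetch the TOC and return the list of top-level entries.

    Assumes the response body is a JSON object with a "toc" key.
    """
    with opener(build_toc_request(expression_uri, token)) as response:
        return json.load(response)["toc"]


# Usage (needs network access and a real token):
# toc = fetch_toc(EXPRESSION_URI, token="YOUR_API_TOKEN")
```

The opener parameter is there only so the function can be exercised without network access; in normal use the default urlopen is enough.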
The toc variable now contains an array of TOC entries. Each entry has key details such as a type, a title, and an id. The id matches the eId of the corresponding XML element.
A TOC item can also have nested items in its children attribute.
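To make the shape of the data concrete, here is a small sketch that walks a TOC and builds an indented outline. The sample entries are hypothetical and only mimic the attributes described above (type, title, id, children); real API output may carry additional fields.

```python
# Hypothetical TOC entries, shaped as described above; not real API output.
sample_toc = [
    {
        "type": "chapter",
        "title": "Chapter 1 - General",
        "id": "chp_1",
        "children": [
            {
                "type": "section",
                "title": "1. Display of licence",
                "id": "chp_1__sec_1",
                "children": [],
            },
        ],
    },
]


def outline(entries, depth=0):
    """Return one line per TOC entry, indented by nesting depth."""
    lines = []
    for entry in entries:
        indent = "  " * depth
        lines.append(f"{indent}{entry['type']} {entry['id']}: {entry['title']}")
        # Entries without nested items may have an empty or missing list.
        lines.extend(outline(entry.get("children") or [], depth + 1))
    return lines


print("\n".join(outline(sample_toc)))
```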
Extract text from TOC entries
Let's create a function that recursively extracts the text from each entry in the TOC, as follows:
iterate over each TOC entry, and its children
for each entry, get the corresponding XML element from the XML document tree
extract the text from the element, and store it on a new text attribute on the TOC entry
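The steps above can be sketched as follows. This is an illustration under stated assumptions, not necessarily the helper used elsewhere in this guide: it uses the standard library's xml.etree.ElementTree, looks elements up through a dictionary keyed by eId, and runs on a tiny made-up XML fragment and TOC rather than the real by-law.

```python
import xml.etree.ElementTree as ET

# Made-up Akoma Ntoso fragment for illustration only.
SAMPLE_XML = """
<body xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <chapter eId="chp_1">
    <heading>General</heading>
    <section eId="chp_1__sec_1">
      <heading>Display of licence</heading>
      <content><p>A licence holder must display the licence.</p></content>
    </section>
  </chapter>
</body>
"""


def element_text(element):
    """Join all text inside an element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def add_text_to_toc(entries, elements_by_eid):
    """Recursively store each entry's text on a new "text" attribute."""
    for entry in entries:
        element = elements_by_eid.get(entry["id"])
        # Entries with no matching element get an empty string.
        entry["text"] = element_text(element) if element is not None else ""
        add_text_to_toc(entry.get("children") or [], elements_by_eid)


root = ET.fromstring(SAMPLE_XML)
# Index every element that has an eId, so each lookup is a dict access.
elements_by_eid = {el.get("eId"): el for el in root.iter() if el.get("eId")}

# Hypothetical TOC matching the fragment above.
toc = [
    {"type": "chapter", "title": "General", "id": "chp_1", "children": [
        {"type": "section", "title": "Display of licence",
         "id": "chp_1__sec_1", "children": []},
    ]},
]
add_text_to_toc(toc, elements_by_eid)
```

Note that in this sketch a parent entry's text includes the text of all its children, so the chapter's text contains the section's text as well.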
We now have the text of the entire document broken down into individual portions.
With this information we can:
Calculate a text embedding for the text of each portion of the document, and index it into a vector database along with the heading and id of the portion. We can then run a semantic search query, and return the text, heading and id of matching portions.
Apply taxonomy tags to the text of each portion of the document, using a tool like PoolParty. We can then store the resulting tags together with the id of the portion, and use them to enrich the document as shown in Advanced enrichments.
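Both uses start from a flat list of portions. As a hedged sketch, the generator below flattens a nested TOC whose entries already carry a text attribute into records holding the id, heading and text of each portion. The record layout and the sample data are assumptions for illustration; adapt them to whatever your embedding model, vector database or tagging tool expects.

```python
def iter_portions(entries):
    """Yield one flat record per TOC entry, parents before children."""
    for entry in entries:
        yield {
            "id": entry["id"],
            "heading": entry.get("title"),
            "text": entry.get("text", ""),
        }
        yield from iter_portions(entry.get("children") or [])


# Hypothetical TOC with text already attached to each entry.
toc_with_text = [
    {"id": "chp_1", "title": "General", "text": "General Display of licence",
     "children": [
         {"id": "chp_1__sec_1", "title": "Display of licence",
          "text": "Display of licence", "children": []},
     ]},
]

records = list(iter_portions(toc_with_text))
for record in records:
    print(record["id"], "|", record["heading"])
```

Each record can then be passed to an embedding or tagging step, keeping the id so that results can be linked back to the portion they came from.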