Basics of text extraction
Basic methods of extracting text from an Akoma Ntoso XML document.
In this section
parsing Akoma Ntoso XML using Python's lxml library
extracting all text in a naive way
ignoring editorial remarks
XML vs HTML
Extracting text from Akoma Ntoso XML is the preferred process. You can also extract text from the HTML version of an AKN document, but it's a little different because HTML is not as explicitly structured as XML.
We'll be using the Cape Town Liquor Trading by-law for these examples.
Parsing Akoma Ntoso XML
Let's fetch the raw XML from the Laws.Africa API.
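A minimal sketch of the fetch step, using only the standard library. The base URL, the `.xml` suffix, the `Token` authorization scheme, and the helper names `build_request` and `fetch_raw_xml` are assumptions for illustration; check the Laws.Africa API documentation for the exact URL of the by-law and for how to obtain a token.

```python
import urllib.request

# Assumed base URL of the Laws.Africa content API; verify against the API docs.
API_BASE = "https://api.laws.africa/v3"


def build_request(frbr_uri: str, token: str) -> urllib.request.Request:
    """Build an authenticated request for the XML form of a work.

    frbr_uri is the path-like identifier of the document. The caller
    supplies it; no particular identifier is assumed here.
    """
    url = f"{API_BASE}{frbr_uri}.xml"
    # Assumed token-based auth header.
    return urllib.request.Request(url, headers={"Authorization": f"Token {token}"})


def fetch_raw_xml(frbr_uri: str, token: str, timeout: float = 30.0) -> bytes:
    """Return the raw XML bytes of the document (performs a network call)."""
    with urllib.request.urlopen(build_request(frbr_uri, token), timeout=timeout) as resp:
        return resp.read()


# Usage (needs network access and a valid token):
# raw_xml = fetch_raw_xml("<frbr-uri-of-the-by-law>", "<your-api-token>")
```

Keeping the result as bytes, rather than decoding it to a string, lets the XML parser honour any encoding declaration in the document.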
Now, parse the bytes in raw_xml into an XML tree with lxml.
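As a sketch of the parsing step, the snippet below uses a tiny made-up document in place of the real API response so that it runs offline. The sample content is an assumption and is not schema-valid Akoma Ntoso; only the namespace is taken from the standard.

```python
from lxml import etree

# Hypothetical stand-in for the bytes returned by the API.
raw_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="act">
    <body>
      <section eId="sec_1">
        <num>1.</num>
        <heading>Trading hours</heading>
        <content>
          <p>A licensee may sell liquor during the permitted hours.</p>
        </content>
      </section>
    </body>
  </act>
</akomaNtoso>"""

# fromstring() accepts bytes and returns the root element of the tree.
root = etree.fromstring(raw_xml)

# Element tags are namespace-qualified: "{namespace}localname".
print(root.tag)
print(etree.QName(root).localname)
```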
Extracting all text content (naive approach)
If you look at the XML file in a browser (which formats it), you'll see that the XML contains a mixture of metadata, structure and text content.
The simplest way of extracting text is to ignore the structure and just extract all text nodes.
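One way to sketch this naive approach is to select every text node with the XPath `//text()` and join the results. The sample document is made up for illustration, and the whitespace normalisation is an optional extra rather than part of the technique.

```python
from lxml import etree

# Made-up sample document (not schema-valid), including an editorial remark.
SAMPLE = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="act">
    <body>
      <section eId="sec_1">
        <num>1.</num>
        <heading>Trading hours</heading>
        <content>
          <p>A licensee may sell liquor during the permitted hours.</p>
          <p><remark status="editorial">[section 1 amended by a later by-law]</remark></p>
        </content>
      </section>
    </body>
  </act>
</akomaNtoso>"""

root = etree.fromstring(SAMPLE)

# "//text()" selects every text node in the document, in document order,
# regardless of which element contains it.
nodes = root.xpath("//text()")

# Join the nodes and collapse the indentation whitespace between elements.
text = " ".join(" ".join(nodes).split())
print(text)
# 1. Trading hours A licensee may sell liquor during the permitted hours. [section 1 amended by a later by-law]
```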
This is quick and easy, but includes all text nodes, which may not be what we want:
It includes text from all structural elements including headings, numbers, editorial comments, footnotes, quoted and embedded elements, tables, etc.
It includes all portions of the document, including attachments such as schedules and appendixes.
Ignoring editorial remarks
When legislation is edited or amended, sometimes editorial remarks are added. For example:
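As an illustrative sketch, the made-up fragment below shows how such a remark might appear inside a paragraph and how lxml exposes it. The remark wording and the `status="editorial"` attribute are assumptions, not copied from the by-law.

```python
from lxml import etree

AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"

# Hypothetical paragraph containing an editorial remark.
fragment = (
    b'<p xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">'
    b"The licensee must display the licence. "
    b'<remark status="editorial">[paragraph amended by a later by-law]</remark>'
    b"</p>"
)

p = etree.fromstring(fragment)

# Find the <remark> children of the paragraph, using the qualified tag name.
remarks = p.findall(f"{{{AKN_NS}}}remark")
remark_texts = [r.text for r in remarks]
print(remark_texts)
# ['[paragraph amended by a later by-law]']
```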
These are not substantive and not officially part of the legal text of the document. We usually want to exclude these remarks when extracting text for full-text search purposes.
To do this, we can use XPath, a powerful mechanism for querying elements in XML documents.
XPath is rich and expressive and we will only cover the basics for this example. There are two key ideas we need:
A namespace, which lets us ignore parts of the document that aren't part of the Akoma Ntoso XML standard. The official Akoma Ntoso XML namespace is: http://docs.oasis-open.org/legaldocml/ns/akn/3.0
A query, which describes the elements we want from the XML document. In this case, our query will mean "text nodes that aren't part of <remark> elements".
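Putting the two ideas together, a query along these lines could be used. This is a sketch: the prefix `a`, the use of the `ancestor::` axis, and the sample document are assumptions chosen to express "text nodes that aren't part of `<remark>` elements".

```python
from lxml import etree

# Made-up sample document (not schema-valid), including an editorial remark.
SAMPLE = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="act">
    <body>
      <section eId="sec_1">
        <num>1.</num>
        <heading>Trading hours</heading>
        <content>
          <p>A licensee may sell liquor during the permitted hours.</p>
          <p><remark status="editorial">[section 1 amended by a later by-law]</remark></p>
        </content>
      </section>
    </body>
  </act>
</akomaNtoso>"""

# Map a short prefix (our choice) to the official Akoma Ntoso namespace.
ns = {"a": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

root = etree.fromstring(SAMPLE)

# All text nodes that do not have an <a:remark> element among their ancestors.
nodes = root.xpath("//text()[not(ancestor::a:remark)]", namespaces=ns)

# Join the nodes and collapse the indentation whitespace between elements.
text = " ".join(" ".join(nodes).split())
print(text)
# 1. Trading hours A licensee may sell liquor during the permitted hours.
```

Text that follows a remark inside the same paragraph is still kept, because its parent is the paragraph rather than the remark.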
Here's a brief explanation of what the components of the XPath query mean:
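| XPath | Meaning |
|---|---|
| `//` | Any node at any point in the tree. Alternatively, a single `/` selects from the root of the tree only. |
| `text()` | This matches text nodes. |
| `[...]` | This applies additional conditions to the text nodes being matched. A condition must evaluate to true for a node to be included in the resulting node set. |
| `not(...)` | This negates the condition inside the brackets. |
| `ancestor::a:remark` | The `ancestor::` axis looks at the elements that contain the text node. This condition is true when one of them is a `<remark>` element in the Akoma Ntoso namespace (written here with the prefix `a`). |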
Ignoring other content
Other types of text content you may want to ignore, depending on your use case:
Text of headings and sub-headings: <heading>, <subheading>, <crossHeading>
The numbers of chapters, sections, etc.: <num>
Quoted and embedded content: <quotedStructure>, <embeddedStructure>
Depending on your needs, you can adjust the XPath to ignore these additional elements as well, by adding them to the not(...) clause and separating them with or.
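As a sketch of that adjustment, the helper below builds the query from a list of element names. The function `build_query`, the prefix `a`, the `ancestor::` conditions, and the sample document are assumptions for illustration.

```python
from lxml import etree

# Made-up sample document (not schema-valid).
SAMPLE = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="act">
    <body>
      <section eId="sec_1">
        <num>1.</num>
        <heading>Trading hours</heading>
        <content>
          <p>A licensee may sell liquor during the permitted hours.</p>
          <p><remark status="editorial">[section 1 amended by a later by-law]</remark></p>
        </content>
      </section>
    </body>
  </act>
</akomaNtoso>"""

ns = {"a": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}


def build_query(excluded, prefix="a"):
    """Return an XPath selecting text nodes outside the named elements."""
    if not excluded:
        # An empty not() would be invalid XPath, so fall back to all text nodes.
        return "//text()"
    condition = " or ".join(f"ancestor::{prefix}:{name}" for name in excluded)
    return f"//text()[not({condition})]"


query = build_query(["remark", "heading", "subheading", "crossHeading", "num"])
print(query)

root = etree.fromstring(SAMPLE)
nodes = root.xpath(query, namespaces=ns)
text = " ".join(" ".join(nodes).split())
print(text)
# A licensee may sell liquor during the permitted hours.
```

Adding `quotedStructure` or `embeddedStructure` to the list would exclude quoted and embedded content in the same way.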
In the next section, we'll explore how to extract text only for certain portions of the document, such as a particular section or chapter, or only table elements.