Basics of text extraction

Basic methods of extracting text from an Akoma Ntoso XML document.

In this section

  • parsing Akoma Ntoso XML using Python's LXML library

  • extracting all text in a naive way

  • ignoring editorial remarks

XML vs HTML

Extracting text from Akoma Ntoso XML is the preferred approach. You can also extract text from the HTML version of an AKN document, but the process differs slightly because HTML is not as explicitly structured as XML.

We'll be using the Cape Town Liquor Trading by-law for these examples.

Parsing Akoma Ntoso XML

Let's fetch the raw XML from the Laws.Africa API.

from urllib.request import Request, urlopen

TOKEN = "your-auth-token"
url = "https://api.laws.africa/v3/akn/za-cpt/act/by-law/2014/control-undertakings-liquor/eng/.xml"
request = Request(url, headers={"Authorization": f"Token {TOKEN}"})
# raw_xml is a bytes object, not a str
raw_xml = urlopen(request).read()
print(raw_xml[:100])
# b'<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"><act ...

Now, parse the bytes in raw_xml into an XML tree with lxml.

from lxml import etree
# parse the raw XML bytes (even though the method name is "fromstring")
root = etree.fromstring(raw_xml)
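
If you want to try the parsing step without an API token, the same call works on a small inline document. The fragment below is hypothetical and heavily trimmed (it is not the real by-law); it only assumes the AKN 3.0 namespace shown in the output above.

```python
from lxml import etree

# Hypothetical, trimmed-down AKN fragment for illustration (not the real by-law).
sample_xml = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="by-law">
    <body>
      <section eId="sec_1">
        <num>1.</num>
        <heading>Definitions</heading>
        <content><p eId="sec_1__p_1">In this by-law, "liquor" has its ordinary meaning.</p></content>
      </section>
    </body>
  </act>
</akomaNtoso>"""

sample_root = etree.fromstring(sample_xml)

# lxml reports element tags in "{namespace}localname" form
print(sample_root.tag)
# {http://docs.oasis-open.org/legaldocml/ns/akn/3.0}akomaNtoso

# the default (un-prefixed) namespace is stored under the None key
print(sample_root.nsmap)
# {None: 'http://docs.oasis-open.org/legaldocml/ns/akn/3.0'}
```

Because every element carries the namespace in its tag, later queries need to refer to that namespace explicitly.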

Extracting all text content (naive approach)

If you look at the XML file in a browser (which formats it), you'll see that the XML contains a mixture of metadata, structure and text content.

The simplest way of extracting text is to ignore the structure and just extract all text nodes.

# itertext() iterates over all text nodes in the entire document
text = ' '.join(root.itertext())
print(text[:100])
# To provide for the control of undertakings selling liquor to the public including the control of tra

This is quick and easy, but includes all text nodes, which may not be what we want:

  • It includes text from all structural elements including headings, numbers, editorial comments, footnotes, quoted and embedded elements, tables, etc.

  • It includes all portions of the document, including attachments such as schedules and appendixes.
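
To see this on a small scale, the sketch below runs the same itertext() call on a hypothetical fragment (invented for illustration, not taken from the by-law). It also strips whitespace-only nodes, which is an extra step added here purely to make the result easier to read.

```python
from lxml import etree

# Hypothetical fragment with a number, a heading, an editorial remark and a paragraph.
sample_xml = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="by-law">
    <body>
      <section eId="sec_6">
        <num>6.</num>
        <heading>Trading hours</heading>
        <subsection eId="sec_6__subsec_4">
          <content><p><remark status="editorial">[subsection (4) deleted]</remark></p></content>
        </subsection>
        <subsection eId="sec_6__subsec_5">
          <content><p>No person may sell liquor outside trading hours.</p></content>
        </subsection>
      </section>
    </body>
  </act>
</akomaNtoso>"""

sample_root = etree.fromstring(sample_xml)

# itertext() yields every text node, including indentation between elements,
# so drop the whitespace-only ones and trim the rest
chunks = [t.strip() for t in sample_root.itertext() if t.strip()]
print(chunks)
# ['6.', 'Trading hours', '[subsection (4) deleted]', 'No person may sell liquor outside trading hours.']
```

The number, the heading and the editorial remark all appear alongside the substantive paragraph.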

Ignoring editorial remarks

When legislation is edited or amended, sometimes editorial remarks are added. For example:

<p eId="sec_6__subsec_4__p_1">
  <remark status="editorial">
    [subsection (4) deleted by section 1 of the <ref href="/akn/za-cpt/act/by-law/2014/control-undertakings-liquor-amendment" eId="sec_6__subsec_4__p_1__ref_1">Amendment By-law, 2014</ref>]
  </remark>
</p>

These remarks are not substantive and are not officially part of the legal text of the document. We usually want to exclude them when extracting text for full-text search purposes.

To do this, we can use XPath, a powerful mechanism for querying elements in XML documents.

XPath is rich and expressive and we will only cover the basics for this example. There are two key ideas we need:

  • A namespace lets us ignore parts of the document that aren't part of the Akoma Ntoso XML standard. The official Akoma Ntoso XML namespace is http://docs.oasis-open.org/legaldocml/ns/akn/3.0

  • A query, which describes the elements we want from the XML document. In this case, our query will mean "text nodes that aren't part of <remark> elements".

# The AKN namespace is the default one for this XML document,
# which is given by the None entry in the namespace map (nsmap).
# This is the same as ns = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"
ns = root.nsmap[None]

# This tells xpath that when we use the "a" alias, we mean the AKN namespace.
# It saves us from writing the full namespace in the xpath query.
nsmap = {"a": ns}

# query all the text nodes that don't have <remark> as an ancestor
text = ' '.join(root.xpath("//text()[not(ancestor::a:remark)]", namespaces=nsmap))
print(text[:100])
# To provide for the control of undertakings selling liquor to the public including the control of tra
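
The effect of the query is easier to check on a small document. The sketch below applies it to a hypothetical fragment (invented for illustration; the href is a placeholder) in which the remark contains a nested <ref>, and strips whitespace-only nodes for readability.

```python
from lxml import etree

# Hypothetical fragment; the remark contains a nested <ref> element.
sample_xml = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="by-law">
    <body>
      <section eId="sec_6">
        <num>6.</num>
        <heading>Trading hours</heading>
        <subsection eId="sec_6__subsec_4">
          <content><p><remark status="editorial">[subsection (4) deleted by section 1 of the <ref href="/akn/example">Amendment By-law, 2014</ref>]</remark></p></content>
        </subsection>
        <subsection eId="sec_6__subsec_5">
          <content><p>No person may sell liquor outside trading hours.</p></content>
        </subsection>
      </section>
    </body>
  </act>
</akomaNtoso>"""

sample_root = etree.fromstring(sample_xml)
sample_nsmap = {"a": sample_root.nsmap[None]}

# keep only text nodes that have no <remark> ancestor at any level
nodes = sample_root.xpath("//text()[not(ancestor::a:remark)]", namespaces=sample_nsmap)
kept = [t.strip() for t in nodes if t.strip()]
print(kept)
# ['6.', 'Trading hours', 'No person may sell liquor outside trading hours.']
```

Because the test uses the ancestor axis rather than the parent, the text inside the nested <ref> is dropped along with the rest of the remark.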

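Here's a brief explanation of what the components of the XPath query mean:

  • //text() selects every text node at any depth in the document.

  • [...] is a predicate: a filter that keeps only the nodes for which the condition inside it is true.

  • ancestor::a:remark selects any <remark> element in the AKN namespace (via the "a" alias) that encloses the text node, at any level above it.

  • not(...) inverts the condition, so a text node is kept only if it has no <remark> ancestor.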

Ignoring other content

Other types of text content you may want to ignore, depending on your use case:

  • Text of headings and sub-headings: <heading>, <subheading>, <crossHeading>

  • The numbers of chapters, sections, etc.: <num>

  • Quoted and embedded content: <quotedStructure>, <embeddedStructure>

Depending on your needs, you can adjust the XPath to exclude these additional elements by adding them to the not(...) clause, separated by or.

# get text that isn't in a remark, heading or num
text = ' '.join(root.xpath(
  "//text()[not(ancestor::a:remark or ancestor::a:num or ancestor::a:heading)]",
  namespaces=nsmap))
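
If the list of excluded elements changes often, the query can be assembled from a list of tag names. The helper below is a minimal sketch, not part of lxml or the Laws.Africa API: the name text_excluding and the sample fragment are hypothetical. Tag names are interpolated straight into the query, so it assumes they come from trusted code rather than user input.

```python
from lxml import etree

AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"


def text_excluding(root, excluded_tags):
    """Return non-empty text chunks that are not inside any of the given AKN elements.

    Hypothetical helper for illustration; tag names must be trusted values.
    """
    # ["remark", "num"] -> "ancestor::a:remark or ancestor::a:num"
    condition = " or ".join(f"ancestor::a:{tag}" for tag in excluded_tags)
    query = f"//text()[not({condition})]" if excluded_tags else "//text()"
    nodes = root.xpath(query, namespaces={"a": AKN_NS})
    # drop whitespace-only nodes and trim the rest
    return [t.strip() for t in nodes if t.strip()]


# Hypothetical fragment (not the real by-law)
sample_xml = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="by-law">
    <body>
      <section eId="sec_6">
        <num>6.</num>
        <heading>Trading hours</heading>
        <content>
          <p><remark status="editorial">[subsection (4) deleted]</remark></p>
          <p>No person may sell liquor outside trading hours.</p>
        </content>
      </section>
    </body>
  </act>
</akomaNtoso>"""

sample_root = etree.fromstring(sample_xml)

body_only = text_excluding(sample_root, ["remark", "num", "heading"])
print(body_only)
# ['No person may sell liquor outside trading hours.']
```

Passing an empty list returns every non-empty text chunk, which matches the naive approach.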

In the next section, we'll explore how to extract text only for certain portions of the document, such as a particular section or chapter, or only table elements.
