Laws.Africa Developer Guide

Basics of text extraction

Basic methods of extracting text from an Akoma Ntoso XML document.


Last updated 1 year ago

In this section

  • parsing Akoma Ntoso XML using Python's lxml library

  • extracting all text in a naive way

  • ignoring editorial remarks

XML vs HTML

Extracting text from Akoma Ntoso XML is the preferred process. You can also extract text from the HTML version of an AKN document, but it's a little different because HTML is not as explicitly structured as XML.

We'll be using the Cape Town Liquor Trading by-law for these examples.

Parsing Akoma Ntoso XML

Let's fetch the raw XML from the Laws.Africa API.

from urllib.request import Request, urlopen

TOKEN = "your-auth-token"
url = "https://api.laws.africa/v3/akn/za-cpt/act/by-law/2014/control-undertakings-liquor/eng/.xml"
request = Request(url, headers={"Authorization": f"Token {TOKEN}"})
# raw_xml is a bytes object, not a str
raw_xml = urlopen(request).read()
print(raw_xml[:100])
# b'<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"><act ...

Now, parse the bytes in raw_xml into an XML tree with lxml.

from lxml import etree
# parse the raw XML bytes (even though the method name is "fromstring")
root = etree.fromstring(raw_xml)
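
To experiment without an API token, the same call can be tried on a small inline document. The sketch below assumes a hypothetical, heavily trimmed sample (SAMPLE is not the real by-law) and checks the root element's name and namespace. It also shows what happens when the input is not well-formed.

```python
from lxml import etree

# Hypothetical, heavily trimmed AKN-style document, for illustration only.
SAMPLE = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="by-law">
    <body>
      <section eId="sec_1">
        <num>1.</num>
        <heading>Definitions</heading>
        <content><p eId="sec_1__p_1">In this By-law, words have their ordinary meaning.</p></content>
      </section>
    </body>
  </act>
</akomaNtoso>"""

root = etree.fromstring(SAMPLE)

# QName splits the "{namespace}tag" form that lxml uses for element names.
qname = etree.QName(root)
print(qname.localname)
print(qname.namespace)

# Malformed input raises XMLSyntaxError rather than returning a partial tree.
try:
    etree.fromstring(b"<akomaNtoso><act></akomaNtoso>")
except etree.XMLSyntaxError as exc:
    print("could not parse:", exc)
```

Checking the root element before going further is a cheap way to catch an error page or an unexpected document type early.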

Extracting all text content (naive approach)

If you look at the XML file in a browser (which formats it), you'll see that the XML contains a mixture of metadata, structure and text content.

The simplest way of extracting text is to ignore the structure and just extract all text nodes.

# itertext() iterates over all text nodes in the entire document
text = ' '.join(root.itertext())
print(text[:100])
# To provide for the control of undertakings selling liquor to the public including the control of tra

This is quick and easy, but includes all text nodes, which may not be what we want:

  • It includes text from all structural elements including headings, numbers, editorial comments, footnotes, quoted and embedded elements, tables, etc.

  • It includes all portions of the document, including attachments such as schedules and appendixes.
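
To make these drawbacks concrete, here is a minimal sketch that runs the naive approach on a hypothetical, trimmed sample (SAMPLE is an assumption for illustration, not the real by-law). It also collapses the indentation whitespace that sits between elements, which the naive join would otherwise keep.

```python
from lxml import etree

# Hypothetical, trimmed AKN-style section, for illustration only.
SAMPLE = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="by-law">
    <body>
      <section eId="sec_6">
        <num>6.</num>
        <heading>Trading hours</heading>
        <subsection eId="sec_6__subsec_4">
          <num>(4)</num>
          <content>
            <p eId="sec_6__subsec_4__p_1"><remark status="editorial">[subsection (4) deleted]</remark></p>
          </content>
        </subsection>
        <subsection eId="sec_6__subsec_5">
          <num>(5)</num>
          <content><p eId="sec_6__subsec_5__p_1">A licensee must display the trading hours.</p></content>
        </subsection>
      </section>
    </body>
  </act>
</akomaNtoso>"""

root = etree.fromstring(SAMPLE)

# Joining raw text nodes keeps the indentation between elements,
# so collapse runs of whitespace with split() and a second join.
naive = " ".join(" ".join(root.itertext()).split())
print(naive)
# 6. Trading hours (4) [subsection (4) deleted] (5) A licensee must display the trading hours.
```

The numbers, the heading and the editorial remark all end up in the output alongside the one sentence of substantive text.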

Ignoring editorial remarks

When legislation is edited or amended, sometimes editorial remarks are added. For example:

<p eId="sec_6__subsec_4__p_1">
  <remark status="editorial">
    [subsection (4) deleted by section 1 of the <ref href="/akn/za-cpt/act/by-law/2014/control-undertakings-liquor-amendment" eId="sec_6__subsec_4__p_1__ref_1">Amendment By-law, 2014</ref>]
  </remark>
</p>

These are not substantive and not officially part of the legal text of the document. We usually want to exclude these remarks when extracting text for full-text search purposes.

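To do this, we can use XPath, a powerful mechanism for querying elements in XML documents.

Learn more about XPath at these resources:

  • Zyte's XPath tutorial for web scraping

  • Using lxml and XPath to extract text in Python

XPath is rich and expressive and we will only cover the basics for this example. There are two key ideas we need:

  • The query, which describes the elements we want from the XML document. In this case, our query will mean "text nodes that aren't part of <remark> elements".

  • The namespace, which lets us ignore parts of the document that aren't part of the Akoma Ntoso XML standard. The official Akoma Ntoso XML namespace is http://docs.oasis-open.org/legaldocml/ns/akn/3.0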

# The AKN namespace is the default one for this XML document,
# which is given by the None entry in the namespace map (nsmap).
# This is the same as ns = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"
ns = root.nsmap[None]

# This tells xpath that when we use the "a" alias, we mean the AKN namespace.
# It saves us from writing the full namespace in the xpath query.
nsmap = {"a": ns}

# query all the text nodes that don't have <remark> as an ancestor
text = ' '.join(root.xpath("//text()[not(ancestor::a:remark)]", namespaces=nsmap))
print(text[:100])
# To provide for the control of undertakings selling liquor to the public including the control of tra
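
The effect of the query is easier to see on a small document where the remark is known in advance. The following is a sketch on a hypothetical, trimmed sample (SAMPLE and the collapse helper are assumptions for illustration) that compares the naive output with the filtered one.

```python
from lxml import etree

AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"

# Hypothetical, trimmed AKN-style section containing one editorial remark.
SAMPLE = f"""<akomaNtoso xmlns="{AKN_NS}">
  <act><body><section eId="sec_6">
    <num>6.</num>
    <heading>Trading hours</heading>
    <content>
      <p eId="sec_6__p_1">A licensee must display the trading hours.</p>
      <p eId="sec_6__p_2"><remark status="editorial">[subsection (4) deleted by the <ref href="/akn/za-cpt/act/by-law/2014/control-undertakings-liquor-amendment">Amendment By-law, 2014</ref>]</remark></p>
    </content>
  </section></body></act>
</akomaNtoso>""".encode("utf-8")

root = etree.fromstring(SAMPLE)
nsmap = {"a": root.nsmap[None]}


def collapse(parts):
    """Join text nodes and collapse runs of whitespace into single spaces."""
    return " ".join(" ".join(parts).split())


naive = collapse(root.itertext())
filtered = collapse(root.xpath("//text()[not(ancestor::a:remark)]", namespaces=nsmap))

print(naive)     # includes the remark and the text of the <ref> inside it
print(filtered)  # 6. Trading hours A licensee must display the trading hours.
```

Note that the text of the nested <ref> is dropped too, because the <remark> is one of its ancestors even though it is not its direct parent.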

Here's a brief explanation of what the components of the XPath query mean:

| XPath | Meaning |
| --- | --- |
| // | Any node at any point in the tree. Alternatively, ./ means nodes relative to the current element (which is root in this case), and / means the root of the entire document. |
| text() | Matches text nodes. |
| [...] | Applies additional conditions to the text nodes being matched. A text node is included in the result only if the conditions evaluate to true. |
| not(...) | Negates the condition inside the brackets. |
| ancestor::a:remark | ancestor:: means any ancestor element of the text node, and a:remark limits the ancestors to those that are <remark> elements in the Akoma Ntoso namespace. |
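
The difference between // and .// matters once a query is run on an element other than the root. Here is a small sketch on a hypothetical two-section sample (SAMPLE is an assumption for illustration) showing that // still searches the whole document, while .// stays inside the element the query is called on.

```python
from lxml import etree

# Hypothetical two-section document, for illustration only.
SAMPLE = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act><body>
    <section eId="sec_1"><content><p>First section.</p></content></section>
    <section eId="sec_2"><content><p>Second section.</p></content></section>
  </body></act>
</akomaNtoso>"""

root = etree.fromstring(SAMPLE)
nsmap = {"a": root.nsmap[None]}

# Pick the second section as the "current element".
section = root.xpath("//a:section[@eId='sec_2']", namespaces=nsmap)[0]

# "//" always starts from the top of the document, even when called on a section.
everywhere = section.xpath("//a:p/text()", namespaces=nsmap)
# ".//" starts from the element the query is called on.
within = section.xpath(".//a:p/text()", namespaces=nsmap)

print(everywhere)  # ['First section.', 'Second section.']
print(within)      # ['Second section.']
```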

Ignoring other content

Other types of text content you may want to ignore, depending on your use case:

  • Text of headings and sub-headings: <heading>, <subheading>, <crossHeading>

  • The numbers of chapters, sections, etc.: <num>

  • Quoted and embedded content: <quotedStructure>, <embeddedStructure>

Depending on your needs, you can adjust the XPath to exclude these additional elements by adding them to the not(...) clause, separated by or.

# get text that isn't in a remark, heading or num
text = ' '.join(root.xpath(
  "//text()[not(ancestor::a:remark or ancestor::a:num or ancestor::a:heading)]",
  namespaces=nsmap))
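
If the set of excluded elements changes between use cases, the query can be built from a list of element names instead of being written by hand. The helper below is a hypothetical sketch, not part of lxml or the Laws.Africa API: text_excluding, AKN_NS and SAMPLE are names chosen for illustration, and the tag names are assumed to come from your own code rather than from user input.

```python
from lxml import etree

AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"


def text_excluding(root, tags):
    """Return the document text, skipping text inside any of the given AKN elements.

    Hypothetical helper: `tags` is a list of local element names such as
    ["remark", "num"]. An empty list returns all text.
    """
    query = "//text()"
    if tags:
        # Builds e.g. "ancestor::a:remark or ancestor::a:num"
        condition = " or ".join(f"ancestor::a:{tag}" for tag in tags)
        query += f"[not({condition})]"
    parts = root.xpath(query, namespaces={"a": AKN_NS})
    # Collapse the indentation whitespace between elements.
    return " ".join(" ".join(parts).split())


# Hypothetical, trimmed AKN-style section, for illustration only.
SAMPLE = f"""<akomaNtoso xmlns="{AKN_NS}">
  <act><body><section eId="sec_6">
    <num>6.</num>
    <heading>Trading hours</heading>
    <content>
      <p>A licensee must display the trading hours.</p>
      <p><remark status="editorial">[subsection (4) deleted]</remark></p>
    </content>
  </section></body></act>
</akomaNtoso>""".encode("utf-8")

root = etree.fromstring(SAMPLE)
print(text_excluding(root, ["remark", "num", "heading"]))
# A licensee must display the trading hours.
```

A search index might exclude only remarks, while a pipeline that compares provisions might also drop numbers and headings, so keeping the list as a parameter avoids maintaining several near-identical queries.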

In the next section, we'll explore how to extract text only for certain portions of the document, such as a particular section or chapter, or only table elements.
