Basics of text extraction
Basic methods of extracting text from an Akoma Ntoso XML document.
In this section
XML vs HTML
Parsing Akoma Ntoso XML
from urllib.request import Request, urlopen
TOKEN = "your-auth-token"
url = "https://api.laws.africa/v3/akn/za-cpt/act/by-law/2014/control-undertakings-liquor/eng/.xml"
request = Request(url, headers={"Authorization": f"Token {TOKEN}"})
# raw_xml is a bytes object, not a str
raw_xml = urlopen(request).read()
print(raw_xml[:100])
# b'<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"><act ...
Extracting all text content (naive approach)
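A minimal sketch of the naive approach: parse the bytes and join every text node in the tree. It uses the standard library's `xml.etree.ElementTree` rather than any particular third-party parser, and `SAMPLE` is a hypothetical, heavily trimmed stand-in for `raw_xml` so that the example runs without an API token.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample standing in for the raw_xml bytes fetched above.
SAMPLE = b"""<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act>
    <body>
      <section eId="sec_1">
        <num>1.</num>
        <heading>Definitions</heading>
        <content>
          <p>In this By-law, unless the context indicates otherwise.</p>
        </content>
      </section>
    </body>
  </act>
</akomaNtoso>"""


def extract_all_text(xml_bytes):
    """Return every text node in the document, joined by single spaces."""
    root = ET.fromstring(xml_bytes)  # accepts bytes as well as str
    # itertext() walks the whole tree in document order, including
    # whitespace-only nodes, so strip each chunk and drop the empty ones.
    chunks = (chunk.strip() for chunk in root.itertext())
    return " ".join(chunk for chunk in chunks if chunk)


naive_text = extract_all_text(SAMPLE)
print(naive_text)
```

Because this walks every element, anything that carries text ends up in the result, including content that is not part of the law itself.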

Ignoring editorial remarks
XPath
Meaning
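As a rough illustration of skipping editorial remarks without XPath, the sketch below walks the tree by hand and leaves out anything inside a `remark` element. It assumes the standard library's `xml.etree.ElementTree`; the sample document and the function name `text_without_remarks` are hypothetical.

```python
import xml.etree.ElementTree as ET

AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"

# Hypothetical sample: one paragraph with an editorial remark in the middle.
SAMPLE = (
    f'<akomaNtoso xmlns="{AKN_NS}"><act><body><section eId="sec_1">'
    "<content><p>No person may sell liquor without a licence."
    '<remark status="editorial">[hypothetical editorial note]</remark>'
    " This applies to all premises.</p></content>"
    "</section></body></act></akomaNtoso>"
)


def text_without_remarks(elem):
    """Yield the text under elem, skipping the content of remark elements."""
    if elem.tag == f"{{{AKN_NS}}}remark":
        return
    if elem.text:
        yield elem.text
    for child in elem:
        yield from text_without_remarks(child)
        # The tail follows the child's closing tag, so it belongs to the
        # parent and is kept even when the child itself was skipped.
        if child.tail:
            yield child.tail


root = ET.fromstring(SAMPLE)
clean_text = "".join(text_without_remarks(root))
print(clean_text)
```

Handling the tail separately matters: removing a `remark` element outright with ElementTree would also discard the text that follows it inside the same paragraph.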
Ignoring other content
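The same idea extends to other elements. The sketch below takes a set of local tag names to skip; the particular set used here (`remark` and `authorialNote`) is an assumption chosen for illustration, as are the sample document and the helper names.

```python
import xml.etree.ElementTree as ET

AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"

# Assumption: which elements to ignore depends on the use case.
SKIP_TAGS = frozenset({"remark", "authorialNote"})

# Hypothetical sample containing a footnote and an editorial remark.
SAMPLE = (
    f'<akomaNtoso xmlns="{AKN_NS}"><act><body><section eId="sec_2"><content>'
    "<p>The licence holder"
    '<authorialNote marker="1" placement="bottom">'
    "<p>See the definitions in section 1.</p></authorialNote>"
    " must display the licence."
    '<remark status="editorial">[hypothetical editorial note]</remark></p>'
    "</content></section></body></act></akomaNtoso>"
)


def local_name(elem):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return elem.tag.rsplit("}", 1)[-1]


def iter_text(elem, skip=SKIP_TAGS):
    """Yield text under elem, ignoring elements whose local name is in skip."""
    if local_name(elem) in skip:
        return
    if elem.text:
        yield elem.text
    for child in elem:
        yield from iter_text(child, skip)
        if child.tail:  # text after the child still belongs to the parent
            yield child.tail


root = ET.fromstring(SAMPLE)
body_text = "".join(iter_text(root))
print(body_text)
```

Passing a different `skip` set, for example one that also contains `num` or `heading`, changes what is left out without touching the traversal.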