Advanced text extraction
How to extract text from particular elements or portions of a document.
Last updated
Extracting text from a specific section (or other portion) of a document
Extracting text from specific elements, such as headings
Why separating text with spaces is important
Sometimes it's useful to extract text only for specific portions of a document, such as a particular chapter or section.
Let's extract the text from Section 3 of the Cape Town Liquor By-law.
Section 3's XML is below. Even a short section that looks quite simple can have complex XML.
We can use the eId attribute to find Section 3.
The eId attribute is a unique identifier that appears on (almost) all elements in an Akoma Ntoso XML document. You can read more about how eIds are generated in the Akoma Ntoso XML specification.
Let's use a new XPath query to find the <section> element that has an eId of "sec_3".
The XPath query is the equivalent of the CSS selector section[eId="sec_3"] and means:
//a:section: find all AKN section elements anywhere in the document.
[@eId="sec_3"]: filter the section elements to return only those whose eId attribute is "sec_3".
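The lookup described above can be sketched as follows. This is a minimal example that assumes the lxml library; the XML is a hypothetical, heavily simplified stand-in for the by-law, and the names NS, XML and sec_3 are chosen here for illustration only.

```python
from lxml import etree

# The Akoma Ntoso 3.0 namespace, bound to the "a" prefix used in the queries.
NS = {"a": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

# Hypothetical sample document (assumption), not the real by-law XML.
XML = (
    '<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"><act><body>'
    '<section eId="sec_2"><num>2.</num><heading>Sample heading two</heading></section>'
    '<section eId="sec_3"><num>3.</num><heading>Sample heading three</heading></section>'
    '</body></act></akomaNtoso>'
)
root = etree.fromstring(XML)

# xpath() always returns a list, even when only one element can match.
matches = root.xpath('//a:section[@eId="sec_3"]', namespaces=NS)
sec_3 = matches[0]
print(sec_3.get("eId"))
```

Because eIds are unique within a document, the list should contain exactly one element, which is why taking the first item is reasonable here.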
Once you have a reference to Section 3, it is simple to extract just the text for the section using the techniques from the previous section of this module.
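A minimal sketch of this step, again assuming lxml and a hypothetical sample document (the names NS, XML, sec_3 and everything are illustrative). It also evaluates the query without the leading dot so the two results can be compared.

```python
from lxml import etree

NS = {"a": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

# Hypothetical sample document (assumption), written without whitespace
# between elements so that no whitespace-only text nodes appear.
XML = (
    '<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"><act><body>'
    '<section eId="sec_2"><num>2.</num><heading>First heading</heading></section>'
    '<section eId="sec_3"><num>3.</num><heading>Second heading</heading>'
    '<content><p>Text of the section.</p></content></section>'
    '</body></act></akomaNtoso>'
)
root = etree.fromstring(XML)
sec_3 = root.xpath('//a:section[@eId="sec_3"]', namespaces=NS)[0]

# ".//" is relative: only text nodes inside sec_3 are returned.
text = ' '.join(sec_3.xpath('.//text()'))

# "//" is absolute: every text node in the whole document is returned,
# even though the query is run on sec_3.
everything = ' '.join(sec_3.xpath('//text()'))

print(text)
print(everything)
```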
Note that the XPath query now starts with a dot (.//). This indicates that we want all text nodes starting at the current node (which is sec_3) and not at the root of the document. Try removing the . from .// above and see what the value of text is afterwards.
Sometimes it's useful to extract text only from specific elements, for example only from tables, chapters, headings or sub-paragraphs.
We can do this using a new XPath query. Let's extract text from only heading elements:
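One way this might look, assuming lxml and a hypothetical two-section sample document (NS and XML are illustrative names):

```python
from lxml import etree

NS = {"a": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

# Hypothetical sample document (assumption), not the real by-law XML.
XML = (
    '<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"><act><body>'
    '<section eId="sec_1"><num>1.</num><heading>Definitions</heading>'
    '<content><p>Body text.</p></content></section>'
    '<section eId="sec_2"><num>2.</num><heading>Application</heading>'
    '<content><p>More body text.</p></content></section>'
    '</body></act></akomaNtoso>'
)
root = etree.fromstring(XML)

# Collect every text node that sits anywhere inside a heading element.
text = ' '.join(root.xpath('//a:heading//text()', namespaces=NS))
print(text)
```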
Note that we used a:heading//text() and not a:heading/text(). If we used a:heading/text() then we would not get text inside elements that are nested inside the heading, such as superscripts and subscripts inside <sup> and <sub>.
For example, consider the heading <heading>Release of atmospheric CO<sub>2</sub>.</heading> and the output of each XPath query.
In HTML, this would be shown as: Release of atmospheric CO₂.
Query: //a:heading/text()
Result: Release of atmospheric CO.
Only text nodes that are immediate children of heading are included.
Query: //a:heading//text()
Result: Release of atmospheric CO2.
All text nodes that are descendants of heading are included.
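The comparison above can be reproduced with a short sketch, assuming lxml. The one-heading document is a hypothetical sample, and the text nodes are concatenated without a separator here purely to mirror the results shown above.

```python
from lxml import etree

NS = {"a": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

# Hypothetical sample (assumption): a heading with a nested <sub> element.
XML = (
    '<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">'
    '<heading>Release of atmospheric CO<sub>2</sub>.</heading>'
    '</akomaNtoso>'
)
root = etree.fromstring(XML)

# "/" selects only text nodes that are direct children of the heading.
direct = root.xpath('//a:heading/text()', namespaces=NS)

# "//" selects text nodes at any depth below the heading.
nested = root.xpath('//a:heading//text()', namespaces=NS)

print(direct)           # the "2" inside <sub> is missing
print(nested)           # the "2" inside <sub> is included
print(''.join(direct))
print(''.join(nested))
```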
Finally, let's extract the text from all headings, subheadings and cross-headings. There are two equivalent ways of doing this.
The first option joins separate XPath queries with the union operator (|).
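A sketch of the first option, assuming lxml and a hypothetical sample document that contains one of each heading type (NS, XML and nodes are illustrative names):

```python
from lxml import etree

NS = {"a": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

# Hypothetical sample document (assumption), not the real by-law XML.
XML = (
    '<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"><act><body>'
    '<chapter eId="chp_1"><heading>Chapter heading</heading>'
    '<subheading>Chapter subheading</subheading>'
    '<hcontainer eId="chp_1__hcontainer_1"><crossHeading>Cross heading</crossHeading></hcontainer>'
    '<section eId="chp_1__sec_1"><heading>Section heading</heading>'
    '<content><p>Body text.</p></content></section>'
    '</chapter></body></act></akomaNtoso>'
)
root = etree.fromstring(XML)

# Three complete queries, combined into one result with "|".
query = (
    '//a:heading//text()'
    ' | //a:subheading//text()'
    ' | //a:crossHeading//text()'
)
nodes = root.xpath(query, namespaces=NS)
text = ' '.join(nodes)
print(text)
```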
The second option uses a single XPath query with conditions to match multiple types of elements.
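A sketch of the second option, assuming lxml and the same kind of hypothetical sample document. The conditions are written here with the XPath self:: axis and the or operator, which is one possible way to express them and not necessarily the form the original example used.

```python
from lxml import etree

NS = {"a": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

# Hypothetical sample document (assumption), not the real by-law XML.
XML = (
    '<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"><act><body>'
    '<chapter eId="chp_1"><heading>Chapter heading</heading>'
    '<subheading>Chapter subheading</subheading>'
    '<hcontainer eId="chp_1__hcontainer_1"><crossHeading>Cross heading</crossHeading></hcontainer>'
    '<section eId="chp_1__sec_1"><heading>Section heading</heading>'
    '<content><p>Body text.</p></content></section>'
    '</chapter></body></act></akomaNtoso>'
)
root = etree.fromstring(XML)

# One query: match any element, then keep it only if it is one of the
# three heading types, then take all text nodes below it.
query = '//*[self::a:heading or self::a:subheading or self::a:crossHeading]//text()'
nodes = root.xpath(query, namespaces=NS)
text = ' '.join(nodes)
print(text)

# For comparison, the union form selects the same text nodes.
union_nodes = root.xpath(
    '//a:heading//text() | //a:subheading//text() | //a:crossHeading//text()',
    namespaces=NS,
)
```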
In all the examples so far, we have used text = ' '.join(...). What is the join and why is it important?
The ' '.join(items) call takes all the elements in items and joins them together with a single space. It's the equivalent of the JavaScript items.join(' ').
It ensures that all the text nodes are separated with a space.
Why do we need these spaces? Look what happens if we extract the text from all headings but don't join them with a space.
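The effect can be seen with a small sketch, assuming lxml and a hypothetical two-heading sample document (the names no_spaces and with_spaces are illustrative):

```python
from lxml import etree

NS = {"a": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

# Hypothetical sample document (assumption), not the real by-law XML.
XML = (
    '<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"><act><body>'
    '<section eId="sec_1"><heading>Definitions</heading></section>'
    '<section eId="sec_2"><heading>Application of this By-law</heading></section>'
    '</body></act></akomaNtoso>'
)
root = etree.fromstring(XML)
headings = root.xpath('//a:heading//text()', namespaces=NS)

# Joining with an empty string runs the headings together.
no_spaces = ''.join(headings)

# Joining with a single space keeps the words apart.
with_spaces = ' '.join(headings)

print(no_spaces)
print(with_spaces)
```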
Notice that there are spaces within the headings, but not between them.
A text node includes the spaces within its own text, but nothing adds spaces between one text node and the next. In the XML document, there are no spaces outside of the text nodes; all the XML is actually on one line. Only when the browser displays the document as HTML are the headings placed on separate lines and visual spacing added. We must therefore add the spaces ourselves.
It's important to separate text nodes with spaces so that we don't combine words together accidentally. This would negatively affect analysis and full-text search.