Laws.Africa Developer Guide

Why extracting text is important

Why would you want to extract text from an Akoma Ntoso XML document?


Last updated 1 year ago

Akoma Ntoso XML documents and the resulting HTML documents contain rich, structured information. While this structure is important for displaying, formatting and analysing the document, sometimes it's important to work with just the text content.

Extracting text content from a structured Akoma Ntoso XML or HTML document is useful for use cases such as:

  • Indexing into Elasticsearch or other search engines to support full-text search

  • Calculating embeddings for use with machine learning models such as Large Language Models (LLMs)

  • Natural-language processing and analysis (NLP)

The text content we need depends on our use case. We may need to ignore certain text content such as metadata, editorial comments, embedded text, headings and portion numbers.

To extract text content from a document, we process the XML and use XML-based queries to extract just the text we're interested in.
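As a rough illustration of this idea, here is a minimal sketch using only Python's standard library. The sample document, the `extract_text` helper and the set of skipped elements (`num`, `heading`, `remark`) are assumptions made for this example, not part of the Laws.Africa API. Which elements to skip depends on your use case.

```python
import xml.etree.ElementTree as ET

# Akoma Ntoso 3.0 namespace
AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"
NS = {"a": AKN_NS}

# Hypothetical, heavily trimmed document used only for illustration
SAMPLE_XML = f"""<akomaNtoso xmlns="{AKN_NS}">
  <act name="act">
    <meta>
      <identification source="#example">
        <FRBRWork><FRBRalias value="Example Act" name="title"/></FRBRWork>
      </identification>
    </meta>
    <body>
      <section eId="sec_1">
        <num>1.</num>
        <heading>Short title</heading>
        <content>
          <p>This Act may be cited as the <i>Example Act</i>.</p>
        </content>
      </section>
    </body>
  </act>
</akomaNtoso>"""

# Assumption: element names whose text we want to leave out
SKIP = {"num", "heading", "remark"}


def local_name(element):
    """Return the tag name without its {namespace} prefix."""
    return element.tag.rsplit("}", 1)[-1]


def collect_text(element, parts):
    """Append the text of element and its descendants, skipping SKIP elements."""
    if local_name(element) in SKIP:
        return
    if element.text:
        parts.append(element.text)
    for child in element:
        collect_text(child, parts)
        # Text that follows a child belongs to the parent, so keep it
        # even when the child itself was skipped.
        if child.tail:
            parts.append(child.tail)


def extract_text(xml):
    """Extract plain text from the body of an Akoma Ntoso document."""
    root = ET.fromstring(xml)
    # Query for the body only, which leaves out everything under <meta>
    body = root.find(".//a:body", NS)
    parts = []
    if body is not None:
        collect_text(body, parts)
    # Collapse runs of whitespace left over from the XML indentation
    return " ".join("".join(parts).split())


print(extract_text(SAMPLE_XML))
```

The query restricts extraction to the document body, so metadata is ignored, and the recursive walk drops the section number and heading while keeping inline text such as the italicised title.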

In this module we'll explore how to extract text content for different use cases.
