Laws.Africa Developer Guide

Module 3: Text extraction for search and analysis

How to extract text from Akoma Ntoso XML documents for use in full-text search indexing and machine learning analysis.


Last updated 1 year ago

In this module we'll cover the following:

  • Why extracting text is important

  • The basics of extracting text

  • Extracting text for specific document portions or provisions (e.g. chapters, sections)

We will use:

  • Python and the lxml XML library

The full code for this module is available on Google Colab:

https://colab.research.google.com/drive/1BEPG5abvOKoCB5zmHWDUBNNPIof1cE0M
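
As a rough preview of the topics listed above, here is a minimal sketch of extracting text from an Akoma Ntoso document with lxml, first for the whole document and then per section. It is not the code from the Colab notebook. The sample XML, the `extract_text` helper and the variable names are made up for illustration, and the sketch assumes the document uses the Akoma Ntoso 3.0 namespace.

```python
from lxml import etree

# Assumption: the document uses the Akoma Ntoso 3.0 namespace.
AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"

# Hypothetical sample document, invented for this illustration.
SAMPLE_XML = f"""<akomaNtoso xmlns="{AKN_NS}">
  <act>
    <body>
      <section eId="sec_1">
        <num>1.</num>
        <heading>Short title</heading>
        <content><p>This Act may be cited as the Example Act.</p></content>
      </section>
      <section eId="sec_2">
        <num>2.</num>
        <heading>Commencement</heading>
        <content><p>This Act commences on publication.</p></content>
      </section>
    </body>
  </act>
</akomaNtoso>"""


def extract_text(element):
    """Join every text node under element and collapse runs of whitespace."""
    return " ".join(" ".join(element.itertext()).split())


root = etree.fromstring(SAMPLE_XML.encode("utf-8"))

# Text of the whole document, e.g. for a full-text search index.
full_text = extract_text(root)

# Text of each section, keyed by its eId attribute.
sections = {
    sec.get("eId"): extract_text(sec)
    for sec in root.iter(f"{{{AKN_NS}}}section")
}

print(full_text)
for eid, text in sections.items():
    print(eid, "->", text)
```

Joining text nodes with a space keeps adjacent elements such as a number and a heading from running together, at the cost of occasionally adding a space inside inline markup. The later pages of this module look at text extraction in more detail.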