Parsing

Overview

langumo supports various corpus formats by using Parsers.

Base class

class langumo.building.parsing.Parser

Abstract base class for parsing a raw-formatted corpus.

extract(raw)

Extract documents from the corpus file.

Note

This method must be implemented.

Parameters

raw (AuxiliaryFile) – input raw-formatted corpus file.

Yields

raw-formatted documents extracted from the file.

Return type

Iterable[str]

parse(text)

Parse a raw-formatted document into plain text.

To improve parsing performance, this method is called in parallel across multiple processes. If parsing requires prior information about the corpus, collect it in prepare.

Note

This method must be implemented.

Parameters

text (str) – a raw-formatted document yielded by extract.

Return type

str

Returns

parsed plain text.
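The division of labour between extract and parse can be illustrated with a minimal sketch. It does not import langumo: Parser below is a local stand-in that mirrors the documented interface, and InMemoryFile is a hypothetical substitute for AuxiliaryFile that assumes only an open() method, which this page does not confirm. BlankLineParser is likewise an invented example.

```python
import io
from typing import Iterable


class Parser:
    """Local stand-in mirroring the documented interface (not langumo's class)."""

    def prepare(self, raw) -> None:
        pass

    def extract(self, raw) -> Iterable[str]:
        raise NotImplementedError

    def parse(self, text: str) -> str:
        raise NotImplementedError


class InMemoryFile:
    """Hypothetical stand-in for the raw corpus file; only open() is assumed."""

    def __init__(self, content: str):
        self._content = content

    def open(self, mode: str = "r"):
        return io.StringIO(self._content)


class BlankLineParser(Parser):
    """Treats blocks separated by blank lines as individual documents."""

    def extract(self, raw) -> Iterable[str]:
        with raw.open("r") as fp:
            block = []
            for line in fp:
                if line.strip():
                    block.append(line.rstrip("\n"))
                elif block:
                    # A blank line closes the current document.
                    yield "\n".join(block)
                    block = []
            if block:
                yield "\n".join(block)

    def parse(self, text: str) -> str:
        # Collapse all whitespace so each document becomes a single line.
        return " ".join(text.split())


raw = InMemoryFile("first  doc\nline two\n\nsecond doc\n")
parser = BlankLineParser()
documents = [parser.parse(doc) for doc in parser.extract(raw)]
print(documents)
```

Here extract only splits the file into raw documents, while parse holds the per-document work, which is the part that benefits from running in parallel.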

prepare(raw)

Read information before extracting and parsing.

Because parse runs in separate processes, it cannot obtain information about the corpus file from extract. This method is called before those processes are created, so any information it collects is copied to the parse processes and is available during parsing.

Parameters

raw (AuxiliaryFile) – input raw-formatted corpus file.
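The role of prepare can be sketched as follows. This is an illustration under assumptions, not langumo's implementation: InMemoryFile is a hypothetical stand-in for AuxiliaryFile (only an open() method is assumed), HeaderAwareParser and its header format are invented for the example, and copy.deepcopy merely simulates the copy of the parser that each worker process would receive.

```python
import copy
import io


class InMemoryFile:
    """Hypothetical stand-in for the raw corpus file; only open() is assumed."""

    def __init__(self, content: str):
        self._content = content

    def open(self, mode: str = "r"):
        return io.StringIO(self._content)


class HeaderAwareParser:
    """Invented parser whose first corpus line declares a field delimiter."""

    def __init__(self):
        self.delimiter = None

    def prepare(self, raw) -> None:
        # Runs once, before any worker exists: read the header line.
        with raw.open("r") as fp:
            header = fp.readline().strip()
        self.delimiter = header.split("=", 1)[1]

    def extract(self, raw):
        with raw.open("r") as fp:
            fp.readline()  # skip the header line
            for line in fp:
                line = line.rstrip("\n")
                if line:
                    yield line

    def parse(self, text: str) -> str:
        # Relies on state collected in prepare, not on anything from extract.
        return " ".join(part.strip() for part in text.split(self.delimiter))


raw = InMemoryFile("delimiter=|\nalpha | beta\ngamma|delta\n")
parser = HeaderAwareParser()
parser.prepare(raw)

# Simulate the copy a worker process would get after prepare has run.
worker_copy = copy.deepcopy(parser)
results = [worker_copy.parse(doc) for doc in parser.extract(raw)]
print(results)
```

The copy still knows the delimiter because it was stored on the parser before the copy was made, which is the reason corpus-wide state belongs in prepare.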

Built-in Parsers

langumo provides a few built-in Parsers so that popular datasets can be used directly, without writing new parsers.

Wikipedia

class langumo.parsers.wikipedia.WikipediaParser

Bases: langumo.building.parsing.Parser

Wikipedia dump file parser.

This parser uses the mwparserfromhell library to parse MediaWiki content. To normalize the content, the following are removed:

  • Wikilinks

  • Templates

  • HTML tags (reference, table)

  • Text wrapped in parentheses

  • Paragraphs that do not end with punctuation

Moreover, all irregular quotes are replaced with normal ones (‘’ and “”).
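Two of the rules listed above, dropping parenthesized text and dropping paragraphs that do not end with punctuation, can be approximated with the standard library alone. The sketch below is not WikipediaParser's code and does not use mwparserfromhell. The function name, the regular expression and the choice of ".", "!" and "?" as end punctuation are all assumptions made for illustration.

```python
import re

# Matches an innermost parenthesized group plus any whitespace before it.
PAREN = re.compile(r"\s*\([^()]*\)")
# Assumed set of sentence-ending punctuation marks.
END_PUNCT = (".", "!", "?")


def clean_paragraphs(text: str) -> str:
    kept = []
    for paragraph in text.split("\n"):
        # Remove innermost groups repeatedly so nested parentheses vanish too.
        previous = None
        while previous != paragraph:
            previous = paragraph
            paragraph = PAREN.sub("", paragraph)
        paragraph = paragraph.strip()
        # Keep only paragraphs that end like a complete sentence.
        if paragraph.endswith(END_PUNCT):
            kept.append(paragraph)
    return "\n".join(kept)


sample = (
    "Python (a language (since 1991)) is popular.\n"
    "See also\n"
    "It runs everywhere!"
)
cleaned = clean_paragraphs(sample)
print(cleaned)
```

Filtering on end punctuation is a cheap way to discard headings, list fragments and captions, which rarely end like a sentence.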

Escaped JSON-Style String

In the json package, json.encoder.encode_basestring() encodes a Python string into a JSON-style representation. In particular, line-break characters are escaped to \n, so each document occupies a single line even if it consists of multi-line paragraphs. This makes it possible to separate documents with a line-break delimiter.
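As a small illustration of this encoding, the sketch below writes two documents as one escaped string per line and reads them back. The sample documents and variable names are made up, and using json.loads to decode is a choice made for this example rather than something this page prescribes.

```python
import json
from json.encoder import encode_basestring

documents = [
    "First paragraph.\nSecond paragraph.",
    'A "quoted" line.',
]

# Each document becomes one physical line: real line breaks inside a
# document are escaped, so "\n" can safely delimit documents.
corpus = "\n".join(encode_basestring(doc) for doc in documents)

lines = corpus.split("\n")
restored = [json.loads(line) for line in lines]

print(corpus)
```

Note that encode_basestring returns the string wrapped in double quotes, so each line is a valid JSON string literal.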

class langumo.parsers.jsonstring.EscapedStringParser

Bases: langumo.building.parsing.Parser

Escaped JSON-Style String Parser.

This parser normalizes the content by removing duplicated spaces and empty lines and by replacing irregular quotes with normal ones (‘’ and “”).
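The whitespace part of this normalization might look roughly like the sketch below. It is not EscapedStringParser's implementation: normalize_escaped is a hypothetical helper, it assumes each input line is a quoted JSON string literal, and it leaves quote replacement out.

```python
import json
import re


def normalize_escaped(line: str) -> str:
    # Decode the escaped line back into a multi-line document.
    text = json.loads(line)
    cleaned = []
    for paragraph in text.split("\n"):
        # Collapse runs of spaces and tabs, then trim both ends.
        paragraph = re.sub(r"[ \t]+", " ", paragraph).strip()
        # Skip paragraphs that are empty after trimming.
        if paragraph:
            cleaned.append(paragraph)
    return "\n".join(cleaned)


line = json.dumps("Hello   world.\n\n\n  Second   line.  ")
result = normalize_escaped(line)
print(repr(result))
```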