Parsing
Overview
langumo supports various corpus formats by using Parsers.
Base class
class langumo.building.parsing.Parser
Abstract base class for parsing a raw-formatted corpus.

extract(raw)
Extract documents from the corpus file.
Note
This method must be implemented.
- Parameters
raw (AuxiliaryFile) – input raw-formatted corpus file.
- Yields
raw-formatted documents extracted from the file.
- Return type

parse(text)
Parse a raw-formatted document to plain text.
To improve parsing performance, this method is called in parallel (across multiple processes). If prior information about the corpus is required, collect it in prepare.
Note
This method must be implemented.

prepare(raw)
Read information before extracting and parsing.
Because parse calls run in separate processes, they cannot obtain information about the corpus file from extract. This method is called before the processes are created, so the information it collects is copied to the parse processes, which can then use it while parsing.
- Parameters
raw (AuxiliaryFile) – input raw-formatted corpus file.
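As a rough illustration of how prepare, extract and parse fit together, the sketch below defines a hypothetical DelimitedParser. To stay self-contained it does not import langumo: the Parser and AuxiliaryFile classes are minimal stand-ins that only mirror the interface described above (the open() method on AuxiliaryFile is an assumption), and run() is a sequential substitute for the real build step, which calls parse in parallel. The file layout (a header line declaring a separator) is invented for the example.

```python
import os
import tempfile
from typing import Iterable


class AuxiliaryFile:
    """Stand-in for langumo's AuxiliaryFile (assumption: wraps a path, offers open())."""

    def __init__(self, name: str):
        self.name = name

    def open(self, mode: str = "r"):
        return open(self.name, mode, encoding="utf-8")


class Parser:
    """Stand-in mirroring the documented Parser interface."""

    def prepare(self, raw: AuxiliaryFile):
        pass

    def extract(self, raw: AuxiliaryFile) -> Iterable[str]:
        raise NotImplementedError

    def parse(self, text: str) -> str:
        raise NotImplementedError


class DelimitedParser(Parser):
    """Hypothetical parser: the first line of the file declares the field separator."""

    def prepare(self, raw):
        # Runs once, before any worker process exists: learn the separator.
        with raw.open("r") as fp:
            header = fp.readline().rstrip("\n")
        self.sep = header.split("=", 1)[1]

    def extract(self, raw):
        # Yield one raw document per non-empty line, skipping the header.
        with raw.open("r") as fp:
            next(fp)
            for line in fp:
                if line.strip():
                    yield line.rstrip("\n")

    def parse(self, text):
        # Relies only on state collected in prepare, never on the corpus file.
        _title, body = text.split(self.sep, 1)
        return body.strip()


def run(parser, raw):
    # Sequential substitute for the real driver, which would parallelize parse.
    parser.prepare(raw)
    return [parser.parse(doc) for doc in parser.extract(raw)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("#sep=|\ntitle one|First body.\ntitle two|Second body.\n")
    documents = run(DelimitedParser(), AuxiliaryFile(path))

print(documents)
```

Keeping the separator on the parser instance is what makes it available to parse: anything set in prepare travels with the parser object, whereas parse itself never touches the corpus file.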
Built-in Parsers
langumo provides a few built-in Parsers so that popular datasets can be used directly, without writing new parsers.
Wikipedia
class langumo.parsers.wikipedia.WikipediaParser
Bases: langumo.building.parsing.Parser
Wikipedia dump file parser.
This parser uses the mwparserfromhell library to parse MediaWiki content. To normalize the content, the following are removed:
Wikilinks
Templates
HTML tags (reference, table)
Text wrapped in parentheses
Paragraphs that do not end with punctuation
Moreover, all irregular quotes are replaced with normal ones (‘’ and “”).
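Two of the listed steps, dropping parenthesized text and dropping paragraphs that do not end with punctuation, can be sketched on plain text with the standard library alone. This is not WikipediaParser's code: the real parser works on MediaWiki markup through mwparserfromhell, and the function names, the regular expression and the set of accepted end-of-paragraph characters below are all assumptions made for illustration.

```python
import re

# Innermost parenthesized group, plus any spaces or tabs directly before it.
_PAREN = re.compile(r"[ \t]*\([^()]*\)")
# Assumed set of sentence-ending punctuation marks.
_END_PUNCT = (".", "!", "?")


def strip_parentheses(text):
    # Remove innermost groups repeatedly so nested parentheses disappear too.
    previous = None
    while previous != text:
        previous = text
        text = _PAREN.sub("", text)
    return text


def keep_complete_paragraphs(text):
    # Treat each line as a paragraph and keep only those ending in punctuation.
    paragraphs = [p.strip() for p in text.split("\n")]
    return "\n".join(p for p in paragraphs if p.endswith(_END_PUNCT))


sample = (
    "Python (a language (interpreted)) is popular.\n"
    "See also\n"
    "It has a large standard library."
)
cleaned = keep_complete_paragraphs(strip_parentheses(sample))
print(cleaned)
```

The heading-like line "See also" is dropped because it does not end with punctuation, which is the kind of non-sentence fragment this filter targets.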
Escaped JSON-Style String
In the json package, json.encoder.encode_basestring() encodes a Python string into a JSON-style representation. In particular, line-break characters are escaped to \n. Each document then occupies a single line, so documents that consist of multi-line paragraphs can be separated by a line-break delimiter.
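A small round trip shows the effect of this escaping. The sketch below encodes two sample documents (invented for the example) with json.encoder.encode_basestring, joins them with line breaks, and recovers them with json.loads. Reading the lines back with json.loads is a choice made for this illustration and is not necessarily how EscapedStringParser decodes them.

```python
import json
from json.encoder import encode_basestring

documents = ["First paragraph.\nSecond paragraph.", 'A "quoted" line.']

# Each document becomes one physical line: inner line breaks turn into \n escapes.
lines = [encode_basestring(doc) for doc in documents]
corpus = "\n".join(lines)

# Splitting on real line breaks now yields exactly one entry per document.
restored = [json.loads(line) for line in corpus.split("\n")]

print(corpus)
```

Because the escaped form is a valid JSON string literal, including its surrounding double quotes, any JSON decoder can restore the original text.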

class langumo.parsers.jsonstring.EscapedStringParser
Bases: langumo.building.parsing.Parser
Escaped JSON-Style String Parser.
This parser normalizes the content by removing duplicated spaces and empty lines, and by replacing irregular quotes with normal ones (‘’ and “”).
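The whitespace part of this normalization might look roughly like the following. It is a minimal sketch rather than the parser's actual code: the normalize function is hypothetical, it treats tabs like spaces as an assumption, and it leaves quote replacement out.

```python
import re


def normalize(text):
    # Collapse runs of spaces or tabs, trim each line, then drop empty lines.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


sample = "Hello   world.\n\n\n  Second    line.  \n"
normalized = normalize(sample)
print(normalized)
```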