Building

Overview

langumo is a unified corpus building environment. More precisely, langumo is an integrated build pipeline which consists of micro-building layers. The layers are built independently and only use the given input auxiliary files. Internally, langumo builds a corpus dataset by composing these builders into an integrated pipeline.

Base class

class langumo.building.base.Builder

Abstract base class of a build layer.

build(afm, *inputs)

Build something with input files.

Note

This method must be implemented.

Parameters
Return type

Union[None, AuxiliaryFile, Tuple[AuxiliaryFile, …]]

Returns

Build output auxiliary files.

run(parent)

Execute the builder.

All builders can be executed directly and independently, without any input auxiliary files. We recommend executing builders together with the miscellaneous ones (e.g. ImportFrom and ExportTo) so that build inputs are passed correctly.

Parameters

parent (str) – parent workspace directory which will be used for containing all auxiliary files.
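The contract described above can be illustrated with a small self-contained sketch. It does not import langumo: `ToyBuilder` and `Uppercase` are hypothetical stand-ins, and plain strings take the place of AuxiliaryFile objects. Only the documented shape of `build(afm, *inputs)` and its return type is mirrored here.

```python
from abc import ABC, abstractmethod
from typing import Tuple, Union


class ToyBuilder(ABC):
    """Hypothetical stand-in for a build layer (not langumo's Builder)."""

    @abstractmethod
    def build(self, afm, *inputs: str) -> Union[None, str, Tuple[str, ...]]:
        """Consume input "files" and return zero, one, or several outputs."""


class Uppercase(ToyBuilder):
    """Toy layer: returns one upper-cased output per input."""

    def build(self, afm, *inputs: str):
        outputs = tuple(text.upper() for text in inputs)
        # Mirror the documented return type: a single item or a tuple.
        return outputs[0] if len(outputs) == 1 else outputs


# The file manager argument is unused in this toy, so None is passed.
single = Uppercase().build(None, "hello")
several = Uppercase().build(None, "a", "b")
```

Because `build` is abstract, instantiating `ToyBuilder` directly raises `TypeError`, which matches the note that the method must be implemented.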

Implementations

Parse corpus files

Every corpus has its own format for storing data in files. langumo only needs the contents of the files to build a unified corpus dataset, so the raw formats must be parsed and plain text extracted from the files.

class langumo.building.parsing.ParseRawFile(parser, lang, min_len, max_len, newline='[NEWLINE]', num_workers=1)

A builder for parsing raw-formatted corpus files.

Parameters
  • parser (Parser) – an implementation of a raw-formatted corpus parser.

  • lang (str) – language code of the target corpus dataset.

  • min_len (int) – minimum length of each document.

  • max_len (int) – maximum length of each document.

  • newline (str) – newline token which is used for replacing the line-break characters.

  • num_workers (int) – number of worker processes which run the parsing.
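The roles of `min_len`, `max_len` and `newline` can be sketched as follows. This is an illustration under stated assumptions, not langumo's implementation: `normalize_documents` is a hypothetical helper, length is assumed to be counted in characters, and out-of-range documents are assumed to be dropped rather than truncated.

```python
from typing import Iterable, Iterator


def normalize_documents(
    documents: Iterable[str],
    min_len: int,
    max_len: int,
    newline: str = "[NEWLINE]",
) -> Iterator[str]:
    """Yield one-line documents whose length is within [min_len, max_len]."""
    for doc in documents:
        doc = doc.strip()
        # Assumption: length is measured in characters of the raw document.
        if not (min_len <= len(doc) <= max_len):
            continue
        # Replace line breaks so each document fits on one physical line.
        yield doc.replace("\r\n", "\n").replace("\n", newline)


docs = ["too short", "first line\nsecond line", "x" * 500]
kept = list(normalize_documents(docs, min_len=10, max_len=100))
```

Keeping every document on a single line is what allows later stages to treat one line as one document.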

Mergence

All parsed plain text files should be merged into a single file so that they can be handled as one unified large corpus.

class langumo.building.mergence.MergeFiles

Merge files into a single one.

Note

All documents are separated by a new-line character (\n), and this builder automatically appends a new-line character to each file so that the last document of one file is not joined to the first document of the next.
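The note above describes a simple rule that can be sketched with the standard library alone. `merge_files` is a hypothetical helper written for this illustration; it is not the code behind MergeFiles.

```python
import os
import tempfile
from typing import Sequence


def merge_files(paths: Sequence[str], output_path: str) -> None:
    """Concatenate text files, ending every file's content with a line break."""
    with open(output_path, "w", encoding="utf-8") as dst:
        for path in paths:
            with open(path, "r", encoding="utf-8") as src:
                last_char = "\n"
                for line in src:
                    dst.write(line)
                    last_char = line[-1]
                # Append the separator only if the file did not end with one.
                if last_char != "\n":
                    dst.write("\n")


with tempfile.TemporaryDirectory() as workdir:
    first = os.path.join(workdir, "a.txt")
    second = os.path.join(workdir, "b.txt")
    merged = os.path.join(workdir, "merged.txt")

    with open(first, "w", encoding="utf-8") as fp:
        fp.write("doc 1\ndoc 2")  # no trailing line break
    with open(second, "w", encoding="utf-8") as fp:
        fp.write("doc 3\n")

    merge_files([first, second], merged)
    with open(merged, encoding="utf-8") as fp:
        merged_lines = fp.read().splitlines()
```

Without the appended line break, "doc 2" and "doc 3" would end up on the same line and be treated as a single document.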

Shuffling text file

Deep learning models are commonly trained with mini-batches drawn from the whole dataset. In theory, the mini-batches should be sampled from the data distribution, but in practice they are usually fetched sequentially from the dataset, so the randomness of the dataset order is important. Therefore, after collecting plain texts from the corpora, it is necessary to shuffle the documents to ensure the randomness of the mini-batches.

class langumo.building.shuffling.ShuffleLines(best_seek_cnt=100000, max_buckets=512)

Shuffle the lines of a text file approximately.

Common shuffling algorithms provide perfect randomness, but they consume a lot of memory when shuffling large files, so shuffling extremely large corpora perfectly is almost impossible. This builder is designed to shuffle immensely large files with lower memory usage while ensuring almost-perfect randomness by approximating the shuffle.

Parameters
  • best_seek_cnt (int) – maximum number of seeks.

  • max_buckets (int) – maximum number of temporary bucket files.
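To show how temporary bucket files keep memory usage low, here is a sketch of a generic two-pass bucket shuffle. It is a common technique swapped in for illustration; the algorithm ShuffleLines actually uses, and the way it applies `best_seek_cnt` and `max_buckets`, may differ. `bucket_shuffle`, `num_buckets` and `seed` are hypothetical names.

```python
import os
import random
import tempfile


def bucket_shuffle(src_path: str, dst_path: str, num_buckets: int = 4,
                   seed: int = 0) -> None:
    """Shuffle the lines of a file using temporary bucket files."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmpdir:
        bucket_paths = [os.path.join(tmpdir, f"bucket{i}.txt")
                        for i in range(num_buckets)]
        buckets = [open(p, "w", encoding="utf-8") for p in bucket_paths]
        try:
            # Pass 1: scatter every line into a randomly chosen bucket.
            with open(src_path, encoding="utf-8") as src:
                for line in src:
                    if not line.endswith("\n"):
                        line += "\n"
                    rng.choice(buckets).write(line)
        finally:
            for bucket in buckets:
                bucket.close()

        # Pass 2: shuffle one bucket at a time in memory and append it.
        with open(dst_path, "w", encoding="utf-8") as dst:
            for path in bucket_paths:
                with open(path, encoding="utf-8") as bucket:
                    lines = bucket.readlines()
                rng.shuffle(lines)
                dst.writelines(lines)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "corpus.txt")
    target = os.path.join(workdir, "shuffled.txt")
    original = [f"document {i}\n" for i in range(100)]
    with open(source, "w", encoding="utf-8") as fp:
        fp.writelines(original)

    bucket_shuffle(source, target)
    with open(target, encoding="utf-8") as fp:
        shuffled = fp.readlines()
```

Only one bucket is held in memory at a time, so peak memory is roughly the size of the largest bucket rather than the size of the whole file.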

Tokenize into subwords

Recent NLP models use subword tokenization to embed text into vectors. One-hot encoding needs too many embedding vectors in the lookup table, and character-level embedding gives worse performance. langumo trains a subword tokenizer and encodes the corpora into subword tokens.

class langumo.building.tokenization.TrainTokenizer(vocab_size=32000, subset_size=512000000, limit_alphabet=6000, unk_token='[UNK]', special_tokens=[])

Train a WordPiece tokenizer.

Parameters
  • vocab_size (int) – number of subwords in vocabulary.

  • subset_size (int) – size of the subset of the dataset which is used for training the tokenizer.

  • limit_alphabet (int) – maximum number of different characters to keep in the alphabet.

  • unk_token (str) – unknown token name.

  • special_tokens (List[str]) – list of special token names.

class langumo.building.tokenization.TokenizeSentences(unk_token, special_tokens=[], batch_size=10000)

Tokenize sentences into subwords with the trained WordPiece tokenizer.

Parameters
  • unk_token (str) – unknown token name.

  • special_tokens (List[str]) – list of special token names.

  • batch_size (int) – encode batch size to tokenize at once.
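For readers unfamiliar with WordPiece, the sketch below shows the greedy longest-match-first lookup commonly used to split a word with an already trained vocabulary. It is a plain-Python illustration, not the tokenizer langumo uses; the toy vocabulary and the `##` continuation prefix (a BERT convention) are assumptions.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str],
                       unk_token: str = "[UNK]") -> List[str]:
    """Split one word into subwords by greedy longest-match-first lookup."""
    tokens: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
pieces = wordpiece_tokenize("unaffable", toy_vocab)
unknown = wordpiece_tokenize("xyz", toy_vocab)
```

Here `pieces` is `["un", "##aff", "##able"]`, while a word with no matching pieces falls back to `unk_token`, which is why the same unknown token name has to be given to both builders.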

Splitting

Separate an evaluation dataset from the whole corpus dataset. A model must be evaluated by predicting unseen data, so langumo creates an isolated dataset for evaluation.

class langumo.building.splitting.SplitValidation(val_ratio=0.1)

Split a text file into training and evaluation datasets.

Parameters

val_ratio (float) – ratio of the evaluation dataset to the training dataset.
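A ratio-based split can be sketched in a few lines. The sketch makes two assumptions that the documentation does not confirm: the ratio is applied to the total number of lines, and the evaluation lines are taken from the start of the file. `split_validation` is a hypothetical helper.

```python
from typing import List, Tuple


def split_validation(lines: List[str],
                     val_ratio: float = 0.1) -> Tuple[List[str], List[str]]:
    """Return (train, validation); validation holds val_ratio of the lines."""
    # Assumption: the ratio is applied to the total number of lines.
    num_val = int(len(lines) * val_ratio)
    return lines[num_val:], lines[:num_val]


corpus = [f"document {i}" for i in range(20)]
train, val = split_validation(corpus, val_ratio=0.25)
```

Taking a contiguous slice is only unbiased when the lines were shuffled beforehand, which is one reason to place the shuffling layer earlier in the pipeline.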

Miscellaneous Builders

langumo provides a few miscellaneous builders that make it simpler to construct the build pipeline. They are not the main implementation of corpus building, but they are necessary to compose the pipeline.

class langumo.building.miscellaneous.Sequential(*builders)

A sequential container of builders.

The builders in this container have the same auxiliary level. The build outputs of each builder are passed to the next build layer.

class langumo.building.miscellaneous.ImportFrom(*paths)
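The chaining behaviour can be sketched with plain functions standing in for builders. `run_sequential` and the three toy layers are hypothetical; the sketch only shows how the outputs of one layer become the positional inputs of the next.

```python
def run_sequential(layers, *inputs):
    """Feed the outputs of each layer to the next one as positional inputs."""
    outputs = inputs
    for layer in layers:
        outputs = layer(*outputs)
        # Normalize None or a single output into a tuple for the next layer.
        if outputs is None:
            outputs = ()
        elif not isinstance(outputs, tuple):
            outputs = (outputs,)
    return outputs


def load():
    """Toy source layer: takes no inputs and returns two "files"."""
    return ("b a", "d c")


def merge(*texts):
    """Toy layer: joins all inputs into one output."""
    return " ".join(texts)


def upper(text):
    """Toy layer: transforms a single input."""
    return text.upper()


result = run_sequential([load, merge, upper])
```

The first layer receives no inputs, which corresponds to running a pipeline that starts with an importing layer.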

Import external files to auxiliary environment.

This builder imports external files into the auxiliary environment so that other builders can use them. It builds nothing; it simply wraps the external files with AuxiliaryFile and returns them so they can be passed to the next layers.

Note

The imported auxiliary files are not managed by AuxiliaryFileManager, so that they are not removed by the automatic clean-up.

Parameters

paths (str) – import file paths.

class langumo.building.miscellaneous.ExportTo(*paths)

Export auxiliary files to an external workspace.

After a build, the output auxiliary files would be removed by AuxiliaryFileManager's clean-up. This builder exports the output files to the given external paths and returns the given auxiliary files unchanged.

Note

The auxiliary files will be copied to the given export file paths.

Parameters

paths (str) – export file paths.

class langumo.building.miscellaneous.Residual(*builders)

Concatenate the inputs with the outputs from sequential layers.

The given sequential layers are wrapped with Sequential internally. For this reason, the auxiliary level in the builders may be increased. This builder returns the given inputs followed by the outputs from the sequential layers.

class langumo.building.miscellaneous.StackOutputs(builder_group)

Stack the outputs from the build layers.

While Sequential runs builders by chaining their inputs and outputs, this builder runs them in parallel (each builder takes the same input files that are given to this builder) and returns the stack of their outputs.

Parameters

builder_group (Iterable[Union[Builder, Tuple[Builder, …]]]) – an iterator of builders or builder sequences.
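The difference between Residual and StackOutputs is easiest to see side by side. The sketch below uses plain functions as stand-in layers; `chain`, `residual` and `stack_outputs` are hypothetical helpers that only mimic the input and output behaviour described above, and the branches run one after another here rather than concurrently.

```python
def as_tuple(value):
    """Normalize None, a single item, or a tuple into a tuple."""
    if value is None:
        return ()
    return value if isinstance(value, tuple) else (value,)


def chain(*layers):
    """Sequential-like composition: outputs feed the next layer."""
    def run(*inputs):
        outputs = inputs
        for layer in layers:
            outputs = as_tuple(layer(*outputs))
        return outputs
    return run


def residual(*layers):
    """Return the original inputs followed by the chained outputs."""
    inner = chain(*layers)

    def run(*inputs):
        return inputs + inner(*inputs)
    return run


def stack_outputs(group):
    """Give every branch the same inputs and concatenate their outputs."""
    # Each item is either one layer or a tuple of layers to be chained.
    branches = [chain(*as_tuple(item)) for item in group]

    def run(*inputs):
        outputs = ()
        for branch in branches:
            outputs += branch(*inputs)
        return outputs
    return run


def upper(text):
    return text.upper()


def reverse(text):
    return text[::-1]


kept_and_built = residual(upper)("abc")
stacked = stack_outputs([upper, (upper, reverse)])("abc")
```

`kept_and_built` is `("abc", "ABC")`: the input survives next to the new output. `stacked` is `("ABC", "CBA")`: both branches start from the same input, and the second branch is itself a two-layer sequence.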