Building

Overview

langumo is a unified corpus-building environment. More precisely, it is an integrated build pipeline composed of small build layers (builders). Each layer builds independently and uses only the input auxiliary files it is given. Internally, langumo builds a corpus dataset by composing these builders into a single pipeline.

Base class

class langumo.building.base.Builder

Abstract base class for build layers.

build(afm, *inputs)

Build something with input files.

Note

Subclasses must implement this method.

Parameters

afm (AuxiliaryFileManager) – auxiliary file manager for the current context and layer.
inputs (AuxiliaryFile) – input auxiliary files for the build.

Return type

Union[None, AuxiliaryFile, Tuple[AuxiliaryFile, …]]

Returns

build output auxiliary files.

run(parent)

Execute the builder.

Any builder can be executed directly and independently, without input auxiliary files. To pass build inputs correctly, we recommend combining builders with the miscellaneous builders (e.g. ImportFrom and ExportTo).

Parameters: parent (str) – parent workspace directory that will contain all auxiliary files.
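The contract above can be illustrated with a minimal, self-contained sketch. It deliberately does not import langumo: `ToyFileManager`, `ToyBuilder`, and `UppercaseLines` are hypothetical stand-ins, and plain file paths stand in for AuxiliaryFile objects. Only the shape of `build(afm, *inputs)` is taken from the description above.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class ToyFileManager:
    """Hypothetical stand-in for AuxiliaryFileManager: hands out file paths."""

    def __init__(self, parent: str):
        self.parent = parent
        self.count = 0

    def create(self) -> str:
        # Each call returns a fresh path inside the parent workspace directory.
        self.count += 1
        return os.path.join(self.parent, f"aux-{self.count}.txt")


class ToyBuilder(ABC):
    """Hypothetical stand-in for the abstract Builder base class."""

    @abstractmethod
    def build(self, afm: ToyFileManager, *inputs: str):
        """Consume input files and return None, one file, or a tuple of files."""


class UppercaseLines(ToyBuilder):
    """Example layer: writes an upper-cased copy of its single input file."""

    def build(self, afm: ToyFileManager, source: str) -> str:
        output = afm.create()
        with open(source, encoding="utf-8") as src, \
                open(output, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(line.upper())
        return output


def run_uppercase_demo() -> str:
    """Build inside a temporary workspace and return the produced text."""
    with tempfile.TemporaryDirectory() as workspace:
        source = os.path.join(workspace, "input.txt")
        with open(source, "w", encoding="utf-8") as fp:
            fp.write("hello\nworld\n")
        result = UppercaseLines().build(ToyFileManager(workspace), source)
        with open(result, encoding="utf-8") as fp:
            return fp.read()
```

Because `build` is abstract, the stand-in base class cannot be instantiated directly; a layer only becomes usable once it implements `build`.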

Implementations

Parse corpus files

Every corpus stores its data in its own file format. langumo needs only the text content of those files to build a unified corpus dataset, so the raw formats must be parsed and the plain text extracted.

class langumo.building.parsing.ParseRawFile(parser, lang, min_len, max_len, newline='[NEWLINE]', num_workers=1)

A builder for parsing raw-formatted corpus files.

Parameters

parser (Parser) – an implementation of raw-formatted corpus parser.
lang (str) – language code of the target corpus dataset.
min_len (int) – minimum length of each document.
max_len (int) – maximum length of each document.
newline (str) – newline token that replaces line-break characters.
num_workers (int) – number of worker processes used for parsing.
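As a rough illustration of how `min_len`, `max_len`, and `newline` could interact, the sketch below filters documents by length and flattens each one onto a single line. It is not langumo's implementation: `normalize_documents` is a hypothetical helper, and measuring length in characters of the stripped text is an assumption.

```python
from typing import Iterable, Iterator


def normalize_documents(
    documents: Iterable[str],
    min_len: int,
    max_len: int,
    newline: str = "[NEWLINE]",
) -> Iterator[str]:
    """Yield one single-line string per document that passes the length filter."""
    for document in documents:
        text = document.strip()
        # Assumption: length is counted in characters of the stripped text.
        if not (min_len <= len(text) <= max_len):
            continue
        # Replace every line break so that each document occupies exactly one line.
        yield newline.join(text.splitlines())
```

Keeping each document on one line is what allows later layers to treat a line as a document.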

Mergence

All parsed plain-text files should be merged into a single file so that they can be handled as one large unified corpus.

class langumo.building.mergence.MergeFiles

Merge files into a single one.

Note

Documents are separated by the newline character (\n). This builder automatically appends a newline character so that the last document of one file is not joined with the first document of the next.
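The note above can be made concrete with a small sketch. `merge_files` and `merge_demo` are hypothetical helpers, and appending the newline only when a file lacks a trailing one is an assumption rather than a statement about MergeFiles itself.

```python
import os
import tempfile
from typing import Sequence


def merge_files(paths: Sequence[str], output: str) -> None:
    """Concatenate text files while keeping one document per line."""
    with open(output, "w", encoding="utf-8") as dst:
        for path in paths:
            last = ""
            with open(path, encoding="utf-8") as src:
                for last in src:
                    dst.write(last)
            # Assumption: a newline is added only when the file lacks a trailing one.
            if last and not last.endswith("\n"):
                dst.write("\n")


def merge_demo() -> str:
    """Merge two small files, the first of which has no trailing newline."""
    with tempfile.TemporaryDirectory() as workspace:
        first = os.path.join(workspace, "a.txt")
        second = os.path.join(workspace, "b.txt")
        merged = os.path.join(workspace, "merged.txt")
        with open(first, "w", encoding="utf-8") as fp:
            fp.write("doc1\ndoc2")
        with open(second, "w", encoding="utf-8") as fp:
            fp.write("doc3\n")
        merge_files([first, second], merged)
        with open(merged, encoding="utf-8") as fp:
            return fp.read()
```

Without the extra newline, `doc2` and `doc3` would end up on the same line and be treated as a single document.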

Shuffling text file

Deep learning models are commonly trained on mini-batches drawn from the whole dataset. In theory, mini-batches should be sampled from the data distribution; in practice, they are usually fetched sequentially from the dataset, so the order of the dataset matters. After collecting plain texts from the corpora, it is therefore necessary to shuffle the documents to ensure that mini-batches are random.

class langumo.building.shuffling.ShuffleLines(best_seek_cnt=100000, max_buckets=512)

Approximately shuffle the lines of a text file.

Common shuffling algorithms produce a perfectly random order but consume a lot of memory on large files, so perfectly shuffling an extremely large corpus is rarely practical. This builder approximates a shuffle so that very large files can be shuffled with low memory usage while keeping almost-perfect randomness.

Parameters

best_seek_cnt (int) – maximum seek count.
max_buckets (int) – maximum number of temporary bucket files.
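One common way to shuffle a file that does not fit in memory is to scatter its lines into temporary bucket files and then shuffle each bucket separately. The sketch below shows that general technique only; it is not the algorithm ShuffleLines uses, it has no counterpart to `best_seek_cnt`, and `bucket_shuffle`, `num_buckets`, and `seed` are hypothetical names.

```python
import os
import random
import tempfile


def bucket_shuffle(source: str, output: str, num_buckets: int = 4, seed: int = 0) -> None:
    """Scatter lines into bucket files, then shuffle each bucket in memory."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmp:
        bucket_paths = [os.path.join(tmp, f"bucket-{i}.txt") for i in range(num_buckets)]
        buckets = [open(path, "w", encoding="utf-8") for path in bucket_paths]
        try:
            with open(source, encoding="utf-8") as src:
                for line in src:
                    # Normalize the line ending so the last line cannot merge with another.
                    rng.choice(buckets).write(line.rstrip("\n") + "\n")
        finally:
            for bucket in buckets:
                bucket.close()

        with open(output, "w", encoding="utf-8") as dst:
            for path in bucket_paths:
                # Only one bucket is held in memory at a time.
                with open(path, encoding="utf-8") as bucket:
                    lines = bucket.readlines()
                rng.shuffle(lines)
                dst.writelines(lines)


def shuffle_demo(num_lines: int = 100):
    """Shuffle a small generated file and return (original, shuffled) lines."""
    with tempfile.TemporaryDirectory() as workspace:
        source = os.path.join(workspace, "corpus.txt")
        output = os.path.join(workspace, "shuffled.txt")
        original = [f"document {i}\n" for i in range(num_lines)]
        with open(source, "w", encoding="utf-8") as fp:
            fp.writelines(original)
        bucket_shuffle(source, output)
        with open(output, encoding="utf-8") as fp:
            return original, fp.readlines()
```

Peak memory in this sketch is bounded by the largest bucket rather than by the whole file, which is the trade-off a parameter such as `max_buckets` would control.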

Tokenize into subwords

Recent NLP models use subword tokenization to embed text as vectors. Word-level one-hot encoding needs too many embedding vectors in the lookup table, and character-level embedding performs worse. langumo therefore trains a subword tokenizer and encodes the corpora into subword tokens.

class langumo.building.tokenization.TrainTokenizer(vocab_size=32000, subset_size=512000000, limit_alphabet=6000, unk_token='[UNK]', special_tokens=[])

Train a WordPiece tokenizer.

Parameters

vocab_size (int) – number of subwords in vocabulary.
subset_size (int) – size of the subset of the dataset used for training the tokenizer.
limit_alphabet (int) – maximum number of distinct characters to keep in the alphabet.
unk_token (str) – unknown token name.
special_tokens (List[str]) – list of special token names.

class langumo.building.tokenization.TokenizeSentences(unk_token, special_tokens=[], batch_size=10000)

Tokenize sentences into subwords with a trained WordPiece tokenizer.

Parameters

unk_token (str) – unknown token name.
special_tokens (List[str]) – list of special token names.
batch_size (int) – batch size used when encoding.
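To show what encoding with a WordPiece vocabulary looks like, here is a minimal sketch of greedy longest-match-first lookup for a single word. The `##` continuation prefix follows the convention popularized by BERT; whether langumo's trained tokenizer uses exactly this convention is an assumption, and `wordpiece_encode` and the toy vocabulary are hypothetical.

```python
from typing import List, Set


def wordpiece_encode(word: str, vocab: Set[str], unk_token: str = "[UNK]") -> List[str]:
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink it until it matches.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


TOY_VOCAB = {"un", "##aff", "##able", "token", "##ize", "##r"}
```

With this toy vocabulary, `wordpiece_encode("tokenizer", TOY_VOCAB)` yields `["token", "##ize", "##r"]`, while a word with no matching pieces falls back to `unk_token`.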

Splitting

Separate an evaluation dataset from the whole corpus dataset. A model must be evaluated on data it has not seen during training, so langumo creates an isolated evaluation dataset.

class langumo.building.splitting.SplitValidation(val_ratio=0.1)

Split a text file into training and evaluation datasets.

Parameters: val_ratio (float) – ratio of the evaluation dataset to the training dataset.
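A split by ratio can be sketched as a single cut near the end of an already shuffled sequence of lines. The helper below is hypothetical, and it assumes that `val_ratio` is the fraction of all lines reserved for evaluation, which is one possible reading of the parameter description.

```python
from typing import List, Sequence, Tuple


def split_validation(
    lines: Sequence[str], val_ratio: float = 0.1
) -> Tuple[List[str], List[str]]:
    """Return (train, validation) by cutting the sequence once near its end."""
    if not 0.0 <= val_ratio <= 1.0:
        raise ValueError("val_ratio must be between 0 and 1")
    # Assumption: val_ratio is the fraction of all lines reserved for evaluation.
    num_val = int(len(lines) * val_ratio)
    cut = len(lines) - num_val
    return list(lines[:cut]), list(lines[cut:])
```

Because the cut is positional, the evaluation set is only representative if the lines were shuffled beforehand.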

Miscellaneous Builders

langumo provides a few miscellaneous builders that make it simpler to construct the build pipeline. They are not the core of corpus building, but they are necessary for composing the pipeline.

class langumo.building.miscellaneous.Sequential(*builders)

A sequential container of builders.

The builders in this container share the same auxiliary level. The build outputs of each builder are passed to the next build layer.
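The chaining behaviour can be sketched with plain functions standing in for builders and strings standing in for auxiliary files. `as_tuple`, `run_sequential`, `concatenate`, and `split_halves` are hypothetical; the normalization step mirrors the fact that a build may return nothing, one file, or a tuple of files.

```python
from typing import Any, Callable, Sequence, Tuple


def as_tuple(result: Any) -> Tuple[Any, ...]:
    """Normalize a build result (None, one file, or a tuple) into a tuple."""
    if result is None:
        return ()
    if isinstance(result, tuple):
        return result
    return (result,)


def run_sequential(layers: Sequence[Callable[..., Any]], *inputs: Any) -> Tuple[Any, ...]:
    """Feed the outputs of each layer to the next one as positional inputs."""
    outputs = inputs
    for layer in layers:
        outputs = as_tuple(layer(*outputs))
    return outputs


def concatenate(*texts: str) -> str:
    # Many inputs, a single output.
    return "".join(texts)


def split_halves(text: str) -> Tuple[str, str]:
    # One input, two outputs.
    middle = len(text) // 2
    return text[:middle], text[middle:]
```

In this sketch, the number of outputs of one layer must match the number of inputs the next layer accepts, which is why normalizing results to a tuple keeps the loop uniform.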

class langumo.building.miscellaneous.ImportFrom(*paths)

Import external files into the auxiliary environment.

This builder imports external files into the auxiliary environment so that other builders can use them. It builds nothing; it simply wraps the external files in AuxiliaryFile and returns them so that they can be passed to the next layers.

Note

The imported auxiliary files are not managed by AuxiliaryFileManager, so automatic clean-up does not remove them.

Parameters: paths (str) – import file paths.

class langumo.building.miscellaneous.ExportTo(*paths)

Export auxiliary files to an external workspace.

After a build, the output auxiliary files would be removed by AuxiliaryFileManager's clean-up. This builder exports the output files to the given external paths and returns the given auxiliary files unchanged.

Note

The auxiliary files will be copied to the given export file paths.

Parameters: paths (str) – export file paths.

class langumo.building.miscellaneous.Residual(*builders)

Concatenate the inputs with the outputs of sequential layers.

The given layers are wrapped in Sequential internally, so the auxiliary level of the builders may increase. This builder returns the given inputs together with the outputs of the sequential layers.

class langumo.building.miscellaneous.StackOutputs(builder_group)

Stack the outputs from the build layers.

Whereas Sequential chains builders by feeding each builder's outputs to the next, this builder runs them in parallel: each builder receives the same input files that were given to this builder, and the outputs are returned as a stack.

Parameters: builder_group (Iterable[Union[Builder, Tuple[Builder, …]]]) – an iterable of builders or builder sequences.
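The difference between Residual and StackOutputs is easiest to see side by side. The sketch below uses plain functions as layers and strings as files; every name in it is hypothetical, the output ordering (inputs first for the residual case, group order for the stacked case) is an assumption, and the groups are run one after another, so "parallel" here describes the data flow rather than concurrency.

```python
from typing import Callable, Sequence, Tuple

Layer = Callable[..., Tuple[str, ...]]


def chain(layers: Sequence[Layer], *inputs: str) -> Tuple[str, ...]:
    """Sequential-style composition: each layer consumes the previous outputs."""
    outputs = inputs
    for layer in layers:
        outputs = layer(*outputs)
    return outputs


def residual(layers: Sequence[Layer], *inputs: str) -> Tuple[str, ...]:
    """Residual-style composition: the inputs followed by the chained outputs."""
    return inputs + chain(layers, *inputs)


def stack_outputs(groups: Sequence[Sequence[Layer]], *inputs: str) -> Tuple[str, ...]:
    """StackOutputs-style composition: every group receives the same inputs."""
    stacked: Tuple[str, ...] = ()
    for group in groups:
        stacked += chain(group, *inputs)
    return stacked


def upper(text: str) -> Tuple[str, ...]:
    return (text.upper(),)


def reverse(text: str) -> Tuple[str, ...]:
    return (text[::-1],)
```

For example, `residual([upper, reverse], "abc")` keeps the original `"abc"` alongside the transformed result, while `stack_outputs([[upper], [reverse]], "abc")` produces one output per group from the same input.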