Building¶
Overview¶
langumo is a unified corpus-building environment. More precisely, langumo
is an integrated build pipeline composed of small build layers. Each layer
builds independently and uses only the input auxiliary files it is given.
Internally, langumo builds the corpus dataset by composing these builders
into a single integrated pipeline.
Base class¶
-
class
langumo.building.base.Builder[source]¶ Abstract base class of build layer.
-
build(afm, *inputs)[source]¶ Build something with input files.
Note
This method must be implemented.
- Parameters
afm (AuxiliaryFileManager) – auxiliary file manager in current context and layer.
inputs (AuxiliaryFile) – input auxiliary files for building.
- Return type
Union[None, AuxiliaryFile, Tuple[AuxiliaryFile, …]]
- Returns
build output auxiliary files.
-
run(parent)[source]¶ Execute the builder.
All builders can be executed directly and independently, without any input auxiliary files. We recommend executing builders together with the miscellaneous ones (e.g. ImportFrom and ExportTo) so that build inputs are passed correctly.
- Parameters
parent (str) – parent workspace directory which will contain all auxiliary files.
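The contract of a build layer can be illustrated with a small stand-in. The sketch below does not import langumo: the Builder class here is a hypothetical minimal replacement for langumo.building.base.Builder, the CountLines layer is invented for illustration, and plain strings take the place of AuxiliaryFile objects.

```python
from abc import ABC, abstractmethod


class Builder(ABC):
    """Hypothetical stand-in for langumo.building.base.Builder."""

    @abstractmethod
    def build(self, afm, *inputs):
        """Build something from the inputs and return the outputs."""


class CountLines(Builder):
    """Toy layer (not part of langumo): counts the lines of each input."""

    def build(self, afm, *inputs):
        # `afm` is unused in this toy; a real layer would presumably use it
        # to create its output auxiliary files.
        return tuple(len(text.splitlines()) for text in inputs)


counts = CountLines().build(None, "a\nb\nc", "single line")
```

Because build is abstract, the base class itself cannot be instantiated; only subclasses that implement it can.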
-
Implementations¶
Parse corpus files¶
Each corpus stores its data in its own file format. langumo needs only
the contents of those files to build a unified corpus dataset, so the raw
formats must be parsed and the plain text extracted from the files.
-
class
langumo.building.parsing.ParseRawFile(parser, lang, min_len, max_len, newline='[NEWLINE]', num_workers=1)[source]¶ A builder for parsing raw-formatted corpus files.
- Parameters
parser (Parser) – an implementation of raw-formatted corpus parser.
lang (str) – language code of the target corpus dataset.
min_len (int) – minimum length of each document.
max_len (int) – maximum length of each document.
newline (str) – newline token which is used for replacing the line-break characters.
num_workers (int) – number of worker processes which run parse
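As a rough illustration of what min_len, max_len and newline control, the sketch below normalizes one parsed document into a single line. It is not langumo's implementation: the function name is hypothetical, and it assumes that length is counted in characters and that out-of-range documents are dropped rather than truncated.

```python
def normalize_document(text, min_len, max_len, newline="[NEWLINE]"):
    """Return a one-line document, or None if its length is out of range."""
    text = text.strip()
    # Assumption: length is measured in characters, before replacement.
    if not (min_len <= len(text) <= max_len):
        return None
    # Replace every line break with the newline token.
    return newline.join(text.splitlines())


documents = ["Hello\nworld", "Hi", "x" * 50]
normalized = [normalize_document(doc, min_len=5, max_len=20) for doc in documents]
```

Keeping each document on one line is what lets later layers treat the corpus as a plain file with one document per line.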
Mergence¶
All parsed plain-text files should be merged into a single file so that they can be handled as one large unified corpus.
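A minimal sketch of such a merge, using only the standard library, is shown here. The function name merge_files is hypothetical and is not a langumo API; the only design point it illustrates is that a file without a trailing line break must not glue its last document onto the first document of the next file.

```python
import os
import tempfile


def merge_files(paths, output_path):
    """Concatenate text files, making sure every line ends with a break."""
    with open(output_path, "w", encoding="utf-8") as dst:
        for path in paths:
            with open(path, "r", encoding="utf-8") as src:
                for line in src:
                    dst.write(line if line.endswith("\n") else line + "\n")


workdir = tempfile.mkdtemp()
first = os.path.join(workdir, "first.txt")
second = os.path.join(workdir, "second.txt")
with open(first, "w", encoding="utf-8") as fp:
    fp.write("a\nb")  # no trailing line break
with open(second, "w", encoding="utf-8") as fp:
    fp.write("c\n")

merged_path = os.path.join(workdir, "merged.txt")
merge_files([first, second], merged_path)
with open(merged_path, "r", encoding="utf-8") as fp:
    merged = fp.read()
```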
Shuffling text file¶
Deep learning models are commonly trained on mini-batches drawn from the whole dataset. In theory, each mini-batch should be sampled from the data distribution. In practice, mini-batches are usually fetched sequentially from the dataset, so the order of the dataset matters. After collecting plain texts from the corpora, it is therefore necessary to shuffle the documents to ensure that the mini-batches are random.
-
class
langumo.building.shuffling.ShuffleLines(best_seek_cnt=100000, max_buckets=512)[source]¶ Shuffle lines in text file approximately.
Common shuffling algorithms provide perfect randomness, but they consume a lot of memory when shuffling large files, so shuffling extremely large corpora perfectly is almost impossible. This builder is designed to shuffle very large files with lower memory usage while keeping almost-perfect randomness by approximating the shuffle.
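One common way to shuffle with bounded memory is to scatter lines into buckets and then shuffle one bucket at a time. The sketch below illustrates that general idea in memory; it is not claimed to be the algorithm ShuffleLines uses. The max_buckets name is borrowed from the signature above, best_seek_cnt is not modelled, and the seed parameter is added for reproducibility.

```python
import random


def approximate_shuffle(lines, max_buckets=4, seed=0):
    """Yield the lines in shuffled order using a two-pass bucket scheme."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(max_buckets)]
    # Pass 1: scatter each line into a randomly chosen bucket.
    for line in lines:
        buckets[rng.randrange(max_buckets)].append(line)
    # Pass 2: shuffle each bucket on its own and emit the buckets in order.
    for bucket in buckets:
        rng.shuffle(bucket)
        yield from bucket


lines = [f"doc {i}" for i in range(100)]
shuffled = list(approximate_shuffle(lines))
```

In an on-disk version the buckets would be temporary files, so only one bucket needs to be held in memory during the second pass.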
Tokenize into subwords¶
Recent NLP models use subword tokenization to embed text into vectors.
One-hot encoding of whole words needs too many embedding vectors in the
lookup table, and character-level embedding performs worse. langumo trains a
subword tokenizer and encodes the corpora into subword tokens.
-
class
langumo.building.tokenization.TrainTokenizer(vocab_size=32000, subset_size=512000000, limit_alphabet=6000, unk_token='[UNK]', special_tokens=[])[source]¶ Train WordPiece tokenizer.
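To show what a trained WordPiece vocabulary is used for, the sketch below applies the usual greedy longest-match-first segmentation to single words. It covers encoding only, not training, and does not call langumo. The toy vocabulary and the "##" continuation prefix (the convention popularized by BERT) are assumptions for illustration.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


vocab = {"un", "##aff", "##able", "token", "##s"}
encoded = [wordpiece_encode(w, vocab) for w in ("unaffable", "tokens", "xyz")]
```

With a vocabulary of a few tens of thousands of pieces, most words split into one or a few known pieces, which is why the lookup table stays small.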
Splitting¶
An evaluation dataset is split off from the whole corpus dataset, because a
model must be evaluated by predicting data it has not seen. langumo creates
an isolated evaluation dataset for this purpose.
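Once the corpus has been shuffled, holding out a fixed number of lines is enough to obtain a random evaluation set. The helper below is a hypothetical sketch of that step and not langumo's splitting builder; taking the evaluation lines from the head of the file is an assumption.

```python
def split_lines(lines, val_size):
    """Hold out the first `val_size` lines for evaluation."""
    lines = list(lines)
    return lines[val_size:], lines[:val_size]


corpus = [f"doc {i}" for i in range(10)]
train, val = split_lines(corpus, val_size=3)
```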
Miscellaneous Builders¶
langumo provides a few miscellaneous builders that make it simpler to
construct the build pipeline. They are not the main implementation of corpus
building, but they are necessary to compose the pipeline.
-
class
langumo.building.miscellaneous.Sequential(*builders)[source]¶ A sequential container of builders.
The builders in this container share the same auxiliary level. The build outputs of each builder are passed to the next build layer.
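The chaining behaviour can be sketched with plain callables standing in for builders. This is an illustration of the data flow only, under the assumption that a layer's result (None, a single value, or a tuple, as in the return type of build) becomes the positional inputs of the next layer; run_sequential is a hypothetical name, not a langumo function.

```python
def run_sequential(layers, *inputs):
    """Feed the outputs of each layer into the next one."""
    outputs = inputs
    for layer in layers:
        result = layer(*outputs)
        # Normalize None / single value / tuple into a tuple of inputs.
        if result is None:
            outputs = ()
        elif isinstance(result, tuple):
            outputs = result
        else:
            outputs = (result,)
    return outputs


layers = [lambda a, b: a + b, lambda s: s.upper()]
chained = run_sequential(layers, "foo", "bar")
```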
-
class
langumo.building.miscellaneous.ImportFrom(*paths)[source]¶ Import external files to auxiliary environment.
This builder imports external files into the auxiliary environment for use in other builders. It builds nothing; it simply wraps the external files with AuxiliaryFile and returns them to be passed to the next layers.
Note
The imported auxiliary files are not managed by AuxiliaryFileManager, so that they are not removed by its automatic clean-up.
- Parameters
paths (str) – import file paths.
-
class
langumo.building.miscellaneous.ExportTo(*paths)[source]¶ Export auxiliary files to external workspace.
After a build, the output auxiliary files would be removed by AuxiliaryFileManager's clean-up. This builder exports the output files to the given external paths and returns the given auxiliary files unchanged.
Note
The auxiliary files will be copied to the given export file paths.
- Parameters
paths (str) – export file paths.
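The copy-and-pass-through behaviour described above can be sketched with the standard library. The export_to function is a hypothetical stand-in that works on plain file paths rather than AuxiliaryFile objects, and pairing inputs with export paths by position is an assumption.

```python
import os
import shutil
import tempfile


def export_to(export_paths, *inputs):
    """Copy each input file to its export path and return the inputs."""
    for src, dst in zip(inputs, export_paths):
        shutil.copyfile(src, dst)
    return inputs


workdir = tempfile.mkdtemp()
aux_path = os.path.join(workdir, "aux.tmp")
with open(aux_path, "w", encoding="utf-8") as fp:
    fp.write("built corpus\n")

exported_path = os.path.join(workdir, "corpus.txt")
returned = export_to([exported_path], aux_path)
```

Because the inputs are returned unchanged, an export step can sit in the middle of a pipeline without interrupting the flow of files to later layers.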
-
class
langumo.building.miscellaneous.Residual(*builders)[source]¶ Concatenate the inputs with outputs from sequential layers.
The given sequential layers are wrapped with Sequential internally. For this reason, the auxiliary level in the builders may be increased. This builder returns the given inputs together with the outputs from the sequential layers.
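The data flow of a residual layer can be sketched with plain callables in place of builders. In this illustration every layer returns a tuple, run_residual is a hypothetical name, and placing the original inputs before the outputs is an assumption about ordering.

```python
def run_residual(layers, *inputs):
    """Return the original inputs followed by the chained layer outputs."""
    outputs = inputs
    for layer in layers:
        outputs = layer(*outputs)  # each layer returns a tuple
    return inputs + outputs


layers = [lambda a, b: (a + b,), lambda s: (s.upper(),)]
combined = run_residual(layers, "x", "y")
```

This is useful when a later layer needs both the untouched files and something derived from them, such as a corpus together with the tokenizer trained on it.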
-
class
langumo.building.miscellaneous.StackOutputs(builder_group)[source]¶ Stack the outputs from the build layers.
While Sequential runs builders by chaining the inputs and their outputs, this builder runs them in parallel (each builder takes the same input files that were given to this builder) and returns the stack of the outputs.
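The contrast with chaining can be sketched with plain callables in place of builders. Here "parallel" refers to the data flow only: every layer receives the same inputs, although this toy runs them one after another. The name run_stacked is hypothetical, and each layer is assumed to return a tuple.

```python
def run_stacked(layers, *inputs):
    """Give every layer the same inputs and concatenate their outputs."""
    stacked = ()
    for layer in layers:
        stacked += layer(*inputs)  # each layer returns a tuple
    return stacked


layers = [lambda t: (t.lower(),), lambda t: (t.upper(), len(t))]
stacked = run_stacked(layers, "Abc")
```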