Getting Started¶
Introduction¶
langumo is an unified corpus building environment for Language Models.
langumo provides pipelines for building text-based datasets. Constructing
datasets requires complicated pipelines (e.g.
parsing,
shuffling and
tokenization). Moreover, if corpora are
collected from different sources, it would be a problem to extract data from
various formats. langumo helps to build a dataset with the diverse formats
simply at once.
Main features¶
Easy to build, simple to add new corpus format.
Fast building through performance optimizations (even written in Python).
Supports multi-processing in parsing corpora.
Extremely less memory usage.
All-in-one environment. Never mind the internal procedures!
Does not need to write codes for new corpus. Instead, add to the build configuration simply.
Dependencies¶
nltk
colorama
pyyaml>=5.3.1
tqdm>=4.46.0
tokenizers==0.8.1
mwparserfromhell>=0.5.4
kss==1.3.1
Installation¶
From source¶
You can install langumo from source by cloning the repository and running:
$ git clone https://github.com/affjljoo3581/langumo.git
$ cd langumo
$ python setup.py install
Command-line usage¶
usage: langumo [-h] [config]
The unified corpus building environment for Language Models.
positional arguments:
config langumo build configuration
optional arguments:
-h, --help show this help message and exit