Getting Started

Introduction

langumo is an unified corpus building environment for Language Models. langumo provides pipelines for building text-based datasets. Constructing datasets requires complicated pipelines (e.g. parsing, shuffling and tokenization). Moreover, if corpora are collected from different sources, it would be a problem to extract data from various formats. langumo helps to build a dataset with the diverse formats simply at once.

Main features

  • Easy to build, simple to add new corpus format.

  • Fast building through performance optimizations (even written in Python).

  • Supports multi-processing in parsing corpora.

  • Extremely less memory usage.

  • All-in-one environment. Never mind the internal procedures!

  • Does not need to write codes for new corpus. Instead, add to the build configuration simply.

Dependencies

  • nltk

  • colorama

  • pyyaml>=5.3.1

  • tqdm>=4.46.0

  • tokenizers==0.8.1

  • mwparserfromhell>=0.5.4

  • kss==1.3.1

Installation

With pip

langumo can be installed using pip as follows:

$ pip install langumo

From source

You can install langumo from source by cloning the repository and running:

$ git clone https://github.com/affjljoo3581/langumo.git
$ cd langumo
$ python setup.py install

Command-line usage

usage: langumo [-h] [config]

The unified corpus building environment for Language Models.

positional arguments:
  config      langumo build configuration

optional arguments:
  -h, --help  show this help message and exit