Tutorials

Build your first datset

Let’s build a Wikipedia dataset. First, install langumo in your virtual enviornment.

$ pip install langumo

After installing langumo, create a workspace to use in build.

$ mkdir workspace
$ cd workspace

Before creating the dataset, we need a Wikipedia dump file (which is a source of the dataset). You can get various versions of Wikipedia dump files from here. In this tutorial, we will use a part of Wikipedia dump file. Download the file with your browser and move to workspace/src. Or, use wget to get the file in terminal simply:

$ wget -P src https://dumps.wikimedia.org/enwiki/20200901/enwiki-20200901-pages-articles1.xml-p1p30303.bz2

langumo needs a build configuration file which contains the details of dataset. Create build.yml file to workspace and write belows:

langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser

  build:
    parsing:
      num-workers: 8 # The number of CPU cores you have.

    tokenization:
      vocab-size: 32000 # The vocabulary size.

Now we are ready to create our first dataset. Run langumo!

$ langumo

Then you can see the below outputs:

[*] import file from src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
[*] parse raw-formatted corpus file with WikipediaParser
[*] merge 1 files into one
[*] shuffle raw corpus file: 100%|██████████████████████████████| 118042/118042 [00:01<00:00, 96965.15it/s]
[00:00:10] Reading files (256 Mo)                   ███████████████████████████████████                 100
[00:00:00] Tokenize words                           ███████████████████████████████████ 418863   /   418863
[00:00:01] Count pairs                              ███████████████████████████████████ 418863   /   418863
[00:00:02] Compute merges                           ███████████████████████████████████ 28942    /    28942
[*] export the processed file to build/vocab.txt
[*] tokenize sentences with WordPiece model: 100%|███████████████| 236084/236084 [00:23<00:00, 9846.67it/s]
[*] split validation corpus - 23609  of 236084 lines
[*] export the processed file to build/corpus.train.txt
[*] export the processed file to build/corpus.eval.txt

After building the dataset, workspace would contain the below files:

workspace
├── build
│   ├── corpus.eval.txt
│   ├── corpus.train.txt
│   └── vocab.txt
├── src
│   └── enwiki-20200901-pages-articles1.xml-p1p30303.bz2
└── build.yml

Write a custom Parser

langumo supports custom Parser s to use various formats in building. In this tutorial, we are going to see how to build Amazon Review Data (2018) dataset in langumo.

The basic form of Parser class is as below:

class AmazonReviewDataParser(langumo.building.Parser):
    def extract(self, raw: langumo.utils.AuxiliaryFile) -> Iterable[str]:
        pass

    def parse(self, text: str) -> str:
        pass

extract method yields articles or documents from raw-formatted file and parse method returns the parsed contents from extracted raw articles.

To implement the parser, let’s analyse Amazon Review Data (2018) dataset. The data format of Amazon Review Data (2018) is one-review-per-line in json (or, JSON Lines). That is, each line is a json-formatted review data. Here is an example:

{
    "image": ["https://images-na.ssl-images-amazon.com/images/I/71eG75FTJJL._SY88.jpg"],
    "overall": 5.0,
    "vote": "2",
    "verified": true,
    "reviewTime": "01 1, 2018",
    "reviewerID": "AUI6WTTT0QZYS",
    "asin": "5120053084",
    "style": {
        "Size:": "Large",
        "Color:": "Charcoal"
    },
    "reviewerName": "Abbey",
    "reviewText": "I now have 4 of the 5 available colors of this shirt... ",
    "summary": "Comfy, flattering, discreet--highly recommended!",
    "unixReviewTime": 1514764800
}

We only need the contents in reviewText of the reviews. So the parser should only take reviewText from the json objects (extracted from extract method).

def parse(self, text: str) -> str:
    return json.loads(text)['reviewText']

Meanwhile, as mentioned above, reviews are separated by new-line delimiter. So extract method should yield each line in the file. Note that the raw files are deflated with gunzip format.

def extract(self, raw: langumo.utils.AuxiliaryFile) -> Iterable[str]:
    with gzip.open(raw.name, 'r') as fp:
        yield from fp

That’s all! You’ve just implemented a parser for Amazon Review Data (2018). Now you can use the parser in build configuration. Let the parser class is in myexample.parsers package. Here is an example of build configuration.

langumo:
  inputs:
  - path: src/AMAZON_FASHION_5.json.gz
    parser: myexample.parsers.AmazonReviewDataParser

  # other configurations...