Auxiliary File System

Overview

Building a corpus dataset requires each procedure to store temporary files on disk rather than in memory. As corpora grow, these files become enormous. With small corpora, it is fine to keep every temporary file and remove them all after the build. With extremely large corpora, however, keeping every file carelessly can cause a low-disk-space error.

Hence, we designed the Auxiliary File System, which manages temporary (or literally, auxiliary) files simply and automatically. A manager tracks every auxiliary file it creates. It records each file's auxiliary-scope level to determine which files are currently unused, and removes those files during clean-up. Every Builder uses AuxiliaryFiles during the build.

Reference

class langumo.utils.auxiliary.AuxiliaryFile(name)[source]

An auxiliary file object.

Note

Creating this class directly, without AuxiliaryFileManager, is not recommended.

Parameters

name (str) – auxiliary file name.

lock()[source]

Lock the file to prevent it from being deleted.

open(mode='r')[source]

Open the auxiliary file.

Return type

IO

static opens(files, mode='r')[source]

Open multiple auxiliary files at once.

Return type

AbstractContextManager[List[IO]]
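
The idea of opening several files under one context manager can be sketched with the standard library alone. The helper `open_all` below is hypothetical and works on plain paths; it is an illustration of the pattern, not the code behind `opens`.

```python
import os
import tempfile
from contextlib import ExitStack, contextmanager


@contextmanager
def open_all(paths, mode="r"):
    """Open every path and close them all on exit (hypothetical helper)."""
    with ExitStack() as stack:
        # enter_context registers each handle as it is opened, so handles
        # opened before a failure are still closed.
        yield [stack.enter_context(open(path, mode)) for path in paths]


def demo():
    with tempfile.TemporaryDirectory() as workspace:
        paths = [os.path.join(workspace, f"part{i}.txt") for i in range(3)]

        # Write one chunk per file through a single `with` statement.
        with open_all(paths, "w") as handles:
            for i, handle in enumerate(handles):
                handle.write(f"chunk {i}\n")

        # Read everything back; all handles are closed after the block.
        with open_all(paths) as handles:
            contents = [handle.read() for handle in handles]

        return contents, all(handle.closed for handle in handles)
```

A single `with` statement keeps the number of nested blocks constant no matter how many files a build step merges or splits.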

class langumo.utils.auxiliary.AuxiliaryFileManager(parent)[source]

Auxiliary file manager.

Parameters

parent (str) – parent workspace directory that will contain the auxiliary files.

auxiliary_scope()[source]

Returns a context manager which increases the auxiliary level.

clear()[source]

Remove unused auxiliary files.

AuxiliaryFileManager automatically tracks unused auxiliary files and removes them to conserve disk space. The manager treats a file as unused and unnecessary when it is not locked and has a lower auxiliary-scope level (that is, it was not created in the current scope). To preserve a file, use lock or synchronize.
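
The removal rule can be illustrated with a toy model. `ToyAuxFile` and `ToyAuxManager` below are hypothetical stand-ins, not langumo's classes. The sketch assumes that the level is a nesting depth that grows by one inside each scope, and that a file counts as unused when it was created deeper than the current depth and is not locked; the real bookkeeping may differ.

```python
import os
import tempfile
from contextlib import contextmanager


class ToyAuxFile:
    """Hypothetical stand-in for an auxiliary file."""

    def __init__(self, name):
        self.name = name
        self.locked = False

    def lock(self):
        self.locked = True


class ToyAuxManager:
    """Hypothetical manager that tracks (depth, file) pairs."""

    def __init__(self, parent):
        self.parent = parent
        self.depth = 0  # assumption: depth grows inside nested scopes
        self.files = []

    def create(self):
        fd, name = tempfile.mkstemp(dir=self.parent)
        os.close(fd)
        file = ToyAuxFile(name)
        self.files.append((self.depth, file))
        return file

    @contextmanager
    def auxiliary_scope(self):
        self.depth += 1
        try:
            yield
        finally:
            self.depth -= 1

    def clear(self):
        kept = []
        for depth, file in self.files:
            # Assumed rule: created in a deeper scope and not locked.
            if depth > self.depth and not file.locked:
                os.remove(file.name)
            else:
                kept.append((depth, file))
        self.files = kept


def demo():
    with tempfile.TemporaryDirectory() as workspace:
        manager = ToyAuxManager(workspace)
        with manager.auxiliary_scope():
            scratch = manager.create()  # intermediate data, not needed later
            result = manager.create()   # output that must survive
            result.lock()
        manager.clear()
        return os.path.exists(scratch.name), os.path.exists(result.name)
```

In this model the unlocked scratch file disappears once its scope has ended and `clear` runs, while the locked file stays on disk.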

close()[source]

Close the auxiliary manager and clean up the workspace directory.

create()[source]

Create a new auxiliary file.

The auxiliary file is usually used as a temporary file. It is created in the parent directory and assigned the current auxiliary level.

Return type

AuxiliaryFile

Returns

new auxiliary file object.

synchronize(files)[source]

Synchronize auxiliary levels to the current level.

Some files created in a lower auxiliary_scope need to be handled as higher-scope ones. This method sets the auxiliary levels of the given files to the current scope level.
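
The effect of synchronizing levels can be shown with an in-memory sketch that tracks names only and touches no files. `LevelBook` is a hypothetical class. It assumes that the level is a nesting depth and that entries recorded deeper than the current depth are the ones a later clean-up discards.

```python
class LevelBook:
    """Hypothetical in-memory record of auxiliary levels."""

    def __init__(self):
        self.depth = 0
        self.levels = {}  # file name -> depth at which it is tracked

    def create(self, name):
        self.levels[name] = self.depth

    def synchronize(self, names):
        # Re-register the given files at the current depth.
        for name in names:
            self.levels[name] = self.depth

    def clear(self):
        # Assumed rule: drop entries recorded deeper than the current depth.
        removed = sorted(n for n, d in self.levels.items() if d > self.depth)
        for name in removed:
            del self.levels[name]
        return removed


def demo():
    book = LevelBook()
    book.depth += 1                   # enter a nested scope
    book.create("shards.tmp")
    book.create("merged.tmp")
    book.depth -= 1                   # leave the scope
    book.synchronize(["merged.tmp"])  # adopt the result at the current level
    return book.clear(), sorted(book.levels)
```

Here only `merged.tmp` survives the clean-up, because its recorded level was raised to that of the enclosing scope before `clear` ran.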