In this lesson we will look at how to make a corpus to use with Annif.
The Merriam-Webster Dictionary defines a corpus as "all the writings or works of a particular kind or on a particular subject" or a "collection or body of knowledge or evidence". Various corpora are widely used in language and literature research. In this case we use the term to refer to a data collection used to train or evaluate Annif models.
Annif supports two kinds of document corpus formats for training projects: formats intended for full-text or longer documents, and formats intended for metadata or very short texts. Annif corpora are usually divided into three subsets: train, validate and test. The train and test sets are used as their names suggest, while the validation set can be used for fine-tuning model hyperparameters.
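The three-way division can be sketched in a few lines of Python. This is only an illustration, not part of Annif: the function name `split_corpus`, the 10 % fractions and the fixed seed are assumptions chosen for the example.

```python
import random


def split_corpus(documents, validate_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle documents and return (train, validate, test) lists.

    The fractions and the seed are illustrative defaults, not Annif settings.
    """
    shuffled = list(documents)
    # A seeded generator makes the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_validate = int(len(shuffled) * validate_fraction)
    n_test = int(len(shuffled) * test_fraction)
    validate = shuffled[:n_validate]
    test = shuffled[n_validate:n_validate + n_test]
    train = shuffled[n_validate + n_test:]
    return train, validate, test


docs = [f"doc{i}" for i in range(100)]
train, validate, test = split_corpus(docs)
print(len(train), len(validate), len(test))
```

Keeping the three subsets disjoint matters: a document that appears in both the train and test sets would make evaluation results look better than they are.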
All Annif projects need a subject vocabulary, which defines the subjects available for automated indexing or classification. Typically this will be a thesaurus, a classification or a list of subject headings. Annif doesn't care much about the internal structure of a subject vocabulary; it just needs to know the URI and preferred label (a.k.a. term or descriptor) of each subject/class/concept. If the vocabulary also includes notation codes, as e.g. UDC does, they can be given as well.
The simple TSV subject vocabulary format only specifies URIs and labels for concepts and only supports one language. The vocabulary file is a UTF-8 encoded TSV (tab-separated values) file with the file extension .tsv, where the first column contains a subject URI, the second column its label and the optional third column its notation code. The format is the same as the extended subject file format for documents, specified below. For example:
<http://example.org/thesaurus/subj1> computer network
<http://example.org/thesaurus/subj2> computer science
<http://example.org/thesaurus/subj3> Internet Protocol 42.42
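To make the column layout concrete, here is a minimal Python sketch that parses lines of this format. The helper `parse_vocab_line` is hypothetical and only mirrors the description above; it is not how Annif itself loads vocabularies.

```python
def parse_vocab_line(line):
    """Parse one TSV vocabulary line into a (uri, label, notation) tuple."""
    columns = line.rstrip("\n").split("\t")
    uri = columns[0].strip()
    # Drop the surrounding angle brackets if they are present
    if uri.startswith("<") and uri.endswith(">"):
        uri = uri[1:-1]
    label = columns[1].strip()
    # The third column (notation code) is optional
    has_notation = len(columns) > 2 and columns[2].strip()
    notation = columns[2].strip() if has_notation else None
    return uri, label, notation


lines = [
    "<http://example.org/thesaurus/subj1>\tcomputer network",
    "<http://example.org/thesaurus/subj3>\tInternet Protocol\t42.42",
]
vocab = [parse_vocab_line(line) for line in lines]
for entry in vocab:
    print(entry)
```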
The CSV subject vocabulary format only specifies URIs and labels for concepts, but labels can be given in several languages. The vocabulary file is a UTF-8 encoded CSV (comma-separated values) file with the file extension .csv. The first row is a header which defines the meaning of the columns in subsequent rows. The header must contain a column called uri and one or more columns called label_XX, where XX is a BCP47 language tag such as en or fr. There may also be a notation column for notation codes. Example:
uri,label_en,label_fr,notation
http://example.org/thesaurus/subj1,computer network,réseau informatique,
http://example.org/thesaurus/subj2,computer science,informatique,
http://example.org/thesaurus/subj3,Internet Protocol,Internet Protocol,42.42
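The header rules can be checked with Python's standard `csv` module. The sketch below is an assumption-laden illustration: `read_csv_vocab` and the dictionary layout it returns are made up for this example and do not reflect Annif internals.

```python
import csv
import io

CSV_TEXT = """uri,label_en,label_fr,notation
http://example.org/thesaurus/subj1,computer network,réseau informatique,
http://example.org/thesaurus/subj3,Internet Protocol,Internet Protocol,42.42
"""


def read_csv_vocab(stream):
    """Read a CSV vocabulary and return a list of subject dictionaries."""
    reader = csv.DictReader(stream)
    fieldnames = reader.fieldnames or []
    label_columns = [name for name in fieldnames if name.startswith("label_")]
    # The header must contain uri and at least one label_XX column
    if "uri" not in fieldnames or not label_columns:
        raise ValueError("header needs a uri column and a label_XX column")
    subjects = []
    for row in reader:
        # Map language tag (the XX part) to the label in that language
        labels = {
            column[len("label_"):]: row[column]
            for column in label_columns
            if row[column]
        }
        subjects.append({
            "uri": row["uri"],
            "labels": labels,
            "notation": row.get("notation") or None,
        })
    return subjects


subjects = read_csv_vocab(io.StringIO(CSV_TEXT))
for subject in subjects:
    print(subject)
```

When reading from a real file instead of `io.StringIO`, open it with `encoding="utf-8"` and `newline=""`, as the `csv` module documentation recommends.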
A subject vocabulary can also be given as a SKOS/RDF file. All common RDF serializations (i.e. those supported by rdflib) are supported, including RDF/XML, Turtle and N-Triples.
A full-text corpus is a directory containing (UTF-8 encoded) document and subject files. The document files should have the extension .txt, and for each document file there should be a subject file with the same (base) name, but with the extension .tsv or .key. The main distinction between the two is whether or not subject URIs are included.
A .key subject file simply lists subject labels one per line. For example:
networking
computer science
Internet Protocol
Note that the labels must exactly match the preferred labels of concepts in the subject vocabulary.
A .tsv subject file is otherwise similar, but it has two columns separated by a tab: the first column contains the subject URIs within angle brackets <> and the second column their labels. For example:
<http://example.org/thesaurus/subj1> networking
<http://example.org/thesaurus/subj2> computer science
<http://example.org/thesaurus/subj3> Internet Protocol
Any additional columns beyond the first two are ignored.
When using this format, subject comparison is performed based on URIs, not the labels. Since URIs are (or should be) more persistent than labels, this ensures that subjects can be matched even if the labels have changed in the subject vocabulary.
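The directory layout described above can be explored with a short Python sketch that pairs each document file with its subject file. The function `find_document_pairs` is hypothetical, and preferring .tsv over .key when both exist is an assumption made for this example only.

```python
import tempfile
from pathlib import Path


def find_document_pairs(corpus_dir):
    """Return a list of (document_path, subject_path) pairs in a corpus directory."""
    pairs = []
    for doc_path in sorted(Path(corpus_dir).glob("*.txt")):
        # Assumption: prefer the URI-based .tsv file if both kinds exist
        for extension in (".tsv", ".key"):
            subject_path = doc_path.with_suffix(extension)
            if subject_path.exists():
                pairs.append((doc_path, subject_path))
                break
    return pairs


# Build a tiny throwaway corpus to demonstrate the pairing
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_dir = Path(tmp_dir)
    for name in ("doc1.txt", "doc1.tsv", "doc2.txt", "doc2.key", "doc3.txt"):
        (corpus_dir / name).write_text("placeholder\n", encoding="utf-8")
    pair_names = [
        (doc.name, subj.name) for doc, subj in find_document_pairs(corpus_dir)
    ]

print(pair_names)
```

In this sketch doc3.txt has no subject file and is therefore left out of the result, which is a quick way to spot documents that lack subjects.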
In this tutorial full-text corpora (e.g. the train set data-sets/stw-zbw/docs/train, after downloading PDFs and converting them to text files) are used for evaluating models and for training MLLM and NN ensemble models.
The JSON corpus format is a directory of files with the extension .json, each representing one document. See the Annif documentation about the JSON full-text corpus format for details.
A document corpus can be given in a single UTF-8 encoded TSV file. This format is especially useful for metadata about documents, when only titles are known, or for very short documents. The first column contains the text of the document (e.g. title or title + abstract) while the second column contains a whitespace-separated list of subject URIs (again within angle brackets) for that document. For example:
RFC 791: Internet Protocol <http://example.org/thesaurus/subj1> <http://example.org/thesaurus/subj3>
RFC 1925: The Twelve Networking Truths <http://example.org/thesaurus/subj1> <http://example.org/thesaurus/subj2>
Go To Statement Considered Harmful <http://example.org/thesaurus/subj2>
Note that it is also possible to separate the subjects with tabs, thus creating a variable number of columns. The TSV file may be compressed using gzip compression. The compressed file must have the extension .gz.
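As an illustration of this format, the following Python sketch reads a short-text TSV corpus, optionally gzip-compressed, and accepts both whitespace-separated and tab-separated subject URIs. The helper names and the sample file name are assumptions for the example; this is not Annif's own parser.

```python
import gzip
import os
import re
import tempfile

# Matches a URI enclosed in angle brackets and captures the URI itself
URI_PATTERN = re.compile(r"<([^<>\s]+)>")


def parse_short_text_line(line):
    """Split one corpus line into the document text and a list of subject URIs."""
    text, _, rest = line.rstrip("\n").partition("\t")
    # Everything after the first tab holds URIs, separated by spaces or tabs
    return text, URI_PATTERN.findall(rest)


def read_short_text_corpus(path):
    """Read a TSV corpus file, transparently handling the .gz extension."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as corpus_file:
        return [parse_short_text_line(line) for line in corpus_file if line.strip()]


sample_lines = [
    "RFC 791: Internet Protocol\t<http://example.org/thesaurus/subj1> "
    "<http://example.org/thesaurus/subj3>\n",
    "RFC 1925: The Twelve Networking Truths\t<http://example.org/thesaurus/subj1>"
    "\t<http://example.org/thesaurus/subj2>\n",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "sample-corpus.tsv.gz")
    with gzip.open(corpus_path, "wt", encoding="utf-8") as corpus_file:
        corpus_file.writelines(sample_lines)
    documents = read_short_text_corpus(corpus_path)

for text, uris in documents:
    print(text, uris)
```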
In this tutorial short-text corpora (e.g. /data-sets/stw-zbw/stw-econbiz.tsv.gz) are used for training associative models.
The short-text CSV document corpus format is similar to the TSV format, but supports extended metadata. See the Annif documentation about the CSV short-text corpus format for details.
The short-text JSON Lines document corpus format is another option for representing short text documents. See the Annif documentation about the JSON Lines short-text corpus format for details.
In this exercise you create a subject vocabulary and a document corpus from metadata (titles and abstracts) of scientific articles. The full articles in question are deposited in the arXiv archive and the metadata set is distributed by Kaggle. The Jupyter Notebook (data-sets/arxiv/create-arxiv-corpus.ipynb) demonstrates how to construct a subject vocabulary and a short-text document corpus from the JSON-formatted metadata file.
The Jupyter Notebook software is included in the VirtualBox and Docker images for this tutorial, and you can start Jupyter Server by typing jupyter notebook on the command line. You may need to click the URL provided by the Server to open the Jupyter navigator view in your internet browser (if you go straight to the address http://localhost:8888 you need to give the token displayed in the terminal). From the main view you can navigate to the notebook data-sets/arxiv/create-arxiv-corpus.ipynb.
Run the notebook cell-by-cell by e.g. pressing shift+enter in every interactive block. The notebook comments explain the steps.
Try setting up a project for the arXiv corpus and training and evaluating it.
When training you will get many warnings about unknown URIs, because many documents in the training set have been assigned to categories that are now deprecated and thus are not in the subject vocabulary. Some of these cases could possibly be handled by using details of the taxonomy, such as information about subsumed archives, but that is beyond the scope of this exercise.
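If you want to see which subjects cause such warnings before training, you can compare the URIs in the corpus with those in the vocabulary. The sketch below uses made-up example.org URIs, since the actual URI scheme is defined in the notebook, and the function `split_known_unknown` is hypothetical.

```python
def split_known_unknown(document_uris, vocabulary_uris):
    """Separate a document's subject URIs into known and unknown ones."""
    known = [uri for uri in document_uris if uri in vocabulary_uris]
    unknown = [uri for uri in document_uris if uri not in vocabulary_uris]
    return known, unknown


# Hypothetical URIs: the second document subject stands for a deprecated category
vocabulary_uris = {
    "http://example.org/category/current-a",
    "http://example.org/category/current-b",
}
document_uris = [
    "http://example.org/category/current-a",
    "http://example.org/category/deprecated-x",
]

known, unknown = split_known_unknown(document_uris, vocabulary_uris)
print("known:", known)
print("unknown:", unknown)
```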
There is a multi-label text classification task on Kaggle which uses the same dataset and which you can take a look at. That task aims at predicting the general categories, not the detailed categories, so if you want to try the Kaggle task, you need to modify the notebook accordingly.
See details on how to use document metadata and the `select` transform
Some of the document corpus formats support additional metadata fields for documents. The metadata is a collection of keys (e.g. title, abstract) and corresponding values representing aspects or parts of the document that could be useful for subject indexing. This allows setting up projects that operate on the document metadata fields instead of, or in addition to, the main text.
A project can be configured to operate on specific metadata fields using the select transform with configuration like this:
[myproject]
transform=select(title,description,text)
In some cases, it has been demonstrated that subject indexing quality increases when e.g. titles and fulltext are processed separately and the predictions merged for the final result. The document metadata support and the select transform can be used to create an ensemble project that combines predictions from two separate projects: one that uses titles and another that uses fulltext. Here is such an example configuration:
[gnd-ensemble-en]
vocab=gnd
language=en
backend=ensemble
sources=gnd-omikuji-title-en,gnd-omikuji-text-en
[gnd-omikuji-title-en]
vocab=gnd
language=en
backend=omikuji
analyzer=snowball(english)
transform=select(title)
[gnd-omikuji-text-en]
vocab=gnd
language=en
backend=omikuji
analyzer=snowball(english)
transform=select(text) # not really necessary since this is the default...
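The idea of merging predictions from a title-based and a text-based project can be illustrated with a small Python sketch. It assumes an unweighted mean of the scores from each source; that merging rule, the function `merge_predictions` and the example scores are assumptions for illustration, not a description of how the ensemble backend is implemented.

```python
def merge_predictions(*source_predictions):
    """Average subject scores over several sources (unweighted mean).

    Each argument maps subject URIs to scores. A subject missing from a
    source counts as a score of 0.0 for that source.
    """
    all_uris = set().union(*source_predictions)
    merged = {
        uri: sum(scores.get(uri, 0.0) for scores in source_predictions)
        / len(source_predictions)
        for uri in all_uris
    }
    # Sort by score, best suggestion first
    return dict(sorted(merged.items(), key=lambda item: item[1], reverse=True))


# Made-up scores standing in for the output of the two source projects
title_scores = {"http://example.org/thesaurus/subj1": 0.8,
                "http://example.org/thesaurus/subj2": 0.2}
text_scores = {"http://example.org/thesaurus/subj1": 0.4,
               "http://example.org/thesaurus/subj3": 0.6}

merged = merge_predictions(title_scores, text_scores)
for uri, score in merged.items():
    print(f"{score:.2f} {uri}")
```

A subject suggested by both sources tends to rank higher than one suggested by only a single source, which is the intuition behind combining title and full-text predictions.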
For more information, see the documentation in the Annif wiki: