This page is a user guide that explains how to create a multimodal dataset and what is required for it.

[[_TOC_]]

The source code documentation is in the repository under "multimodalDatasetBuilder/docs".

## Prerequisites
### Required Packages

Most of the required packages can be installed with `pip` using the `requirements.txt` file from the repository:
```
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
```
More information about CLIP can be found [here](https://github.com/openai/CLIP).

The POS-tagger from nltk can be installed via the interactive mode of Python:

```
>>> import nltk
>>> nltk.download("averaged_perceptron_tagger")
```

### Required Files

#### Text Documents
Documents are required to create a multimodal dataset. Every document (e.g., a book or a news article) must be in a separate file with a ".txt" extension. The first line of a text file is interpreted as its title.
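
For illustration, the content of such a text file could look like this (a made-up example):

```
The Title of the Document
This is the first sentence of the document. This is another sentence of it.
```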

For example, over 50,000 Simple English Wikipedia articles can be found [here](https://github.com/LGDoor/Dump-of-Simple-English-Wiki). The repository contains the script "scripts/simpleWikiArticles.py", which separates the Simple Wikipedia articles into individual files. For that, the "corpus.txt" file must be placed next to the script, and a directory "articles" must be created next to these files.
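
For the actual splitting, the repository script should be used. Purely as an illustration of the idea, here is a minimal sketch; it assumes (without having verified the actual dump format) that the articles in "corpus.txt" are separated by blank lines and that the first line of each article is its title:

```
# Minimal sketch of splitting corpus.txt into one file per article.
# ASSUMPTIONS (not verified against the actual dump): articles are
# separated by blank lines, and the first line of an article is its title.
from pathlib import Path

corpus = Path("corpus.txt").read_text(encoding="utf-8")
articles = [a.strip() for a in corpus.split("\n\n") if a.strip()]

out_dir = Path("articles")  # must already exist next to the script
for i, article in enumerate(articles):
    # The title stays on the first line because the dataset builder
    # interprets the first line of a text file as the document title.
    (out_dir / f"article_{i:05d}.txt").write_text(article + "\n", encoding="utf-8")
```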

#### Images

Images are used to enrich the sentences of the documents. They must be saved as ".jpg" files.

The image dataset, the concreteness values, and the cached image features belong together: if the image dataset changes, the other two must be regenerated as well.

For example, the [MS COCO dataset](http://images.cocodataset.org/zips/train2014.zip) can be used as an image dataset.
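
The cached image features themselves are created by "main.py" (see the `-f` parameter below). Purely to illustrate what is being cached, the following sketch computes a CLIP feature for every ".jpg" image in a directory, following the usage shown in the CLIP repository; the directory name, the output file name, and the choice of the "ViT-B/32" model are assumptions:

```
# Sketch: computing CLIP features for all ".jpg" images in a directory
# and caching them with torch.save. The paths and the "ViT-B/32" model
# are placeholders for illustration only.
from pathlib import Path

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

features = {}
for img_path in Path("images").glob("*.jpg"):
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        features[img_path.name] = model.encode_image(image).cpu()

torch.save(features, "cachedImageFeatures.pt")  # placeholder file name
```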

#### Concreteness Values

The concreteness values of words are used as a filtering step during the multimodal dataset creation: they make it easier to estimate whether the given image dataset contains an image that represents a given word and can therefore be highlighted according to that word.

A JSON file containing a dictionary of words and their concreteness values is required for the multimodal dataset creation. The repository provides such a file, "data/concreteness/concretenessValuesMscoco.json", which contains the concreteness values for the "MS COCO train2014 dataset".
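
Assuming the file simply maps each word to a numeric score (e.g., `{"dog": 25.3, ...}` with made-up values), the filtering described under `--concreteness_threshold` below boils down to this sketch:

```
# Sketch: loading the concreteness dictionary and keeping only the words
# whose score reaches the threshold (default 20, see the
# --concreteness_threshold parameter below). The JSON structure is assumed.
import json

with open("data/concreteness/concretenessValuesMscoco.json", encoding="utf-8") as f:
    concreteness = json.load(f)  # assumed structure: {word: score}

threshold = 20
concrete_words = {w: s for w, s in concreteness.items() if s >= threshold}
```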

That concreteness file can be created with the [implementation](https://github.com/victorssilva/concreteness) of [Visual Concreteness](https://arxiv.org/abs/1804.06786).

## Parameter Explanation

The "main.py" file has a number of parameters, which are explained here.

These are the required parameters. They only have to be set if their default values do not fit:

* `-d, --documents` - the path to the directory containing the documents
* `-i, --images` - the path to the directory containing the images
* `-m, --mcimages` - the path to the directory where the highlighted images from miniCLIP will be saved
* `-c, --concreteness` - the path to the JSON file containing the concreteness values. The file must be changed if the image dataset changes
* `-f, --cached_image_features` - the path to the file containing the cached image features. If a file with that name exists, the cached features will be loaded; otherwise, such a file will be created under the specified name. The image features must be changed if the image dataset changes

The following parameters are optional:

* `--resized_main_images` - by default, the size of the main images is not changed. If a directory is specified, the main images of the sentences will be resized and saved there. The resized images have the same size as the highlighted images, which is 224x224 pixels
* `-t, --concreteness_threshold` - defaults to 20. Filters out the words from the concreteness file whose score is lower than this threshold
* `--cwi` - defaults to "on". Specifies whether the CWI is "on" or "off"
* `--multimodal_sents_only` - defaults to "off". Specifies whether only the multimodal sentences of a document will be saved
* `--max_docs` - defaults to None. Specifies the maximum number of documents that will be saved
* `--max_sents` - defaults to None. Specifies the maximum number of sentences a document may have in order to be processed further
* `--rnd_seed` - defaults to None. Controls the shuffling of the documents before they are processed
* `--candidate_imgs` - defaults to 5. Specifies how many "good" images that are "similar" to a given sentence there must be for the sentence to be processed further
* `--sent_img_similarity` - defaults to 0.225. The threshold that defines the minimum required similarity between a sentence and an image
* `--focus_word_img_similarity` - defaults to 0.25. The threshold that defines the minimum required similarity between a focus word and a candidate image
* `--db_name` - defaults to "multimodalDB". The name of the database

## Example Run
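
A run of "main.py" could then look like the following command; all directory and file names are placeholders and have to be adapted to one's own setup:

```
python main.py -d articles -i images -m miniclipImages \
    -c data/concreteness/concretenessValuesMscoco.json \
    -f cachedImageFeatures.pt --max_docs 100 --rnd_seed 42
```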