This page is a user guide which explains how a multimodal dataset can be created and what is required for it.

The source code documentation created with Sphinx is in the repository `multimodalDatasetBuilder/docs`.
The main image of the sentence is on the left and its highlighted version according to a focus word is on the right:



At the end of the pipeline, every document with at least one multimodal sentence is saved in a MongoDB database. A document is an object that consists of one or more sentences, and every sentence belongs to a document. The database contains two collections. The first collection stores the documents and information such as the title of a document, its sentences, the SHA-512 value of each sentence, and a boolean indicating whether a sentence is multimodal and, if so, which focus words it has. The second collection contains information about the multimodal sentences: for each sentence, the path to the main image is stored together with a dictionary mapping its focus words to the paths of their highlighted images. This information can be accessed through the [API](./api).
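
To give a rough idea of what is stored, entries in the two collections could look like this. The field and collection layout below is only illustrative, not the exact schema written by the pipeline:

```
# Illustrative sketch only: field names and nesting are assumptions,
# not the exact schema used by the pipeline.

# Entry in the first collection (one document with its sentences):
document_entry = {
    "title": "Autonomous communities of Spain",
    "sentences": [
        {
            "text": "Spain is divided in 17 parts called autonomous communities.",
            "sha512": "d1f0...",          # SHA-512 value of the sentence
            "is_multimodal": True,        # whether the sentence is multimodal
            "focus_words": ["communities"],
        },
    ],
}

# Entry in the second collection (one multimodal sentence):
sentence_entry = {
    "sentence_sha512": "d1f0...",
    "main_image": "../data/train2014/COCO_train2014_000000000009.jpg",
    "highlighted_images": {
        "communities": "../data/miniclip/COCO_train2014_000000000009_communities.jpg",
    },
}
```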
## Prerequisites
The required packages can be installed with pip using the `requirements.txt` file:

```
pip install -r requirements.txt
```
The POS-tagger from nltk can be installed via the nltk downloader:
```
python -m nltk.downloader averaged_perceptron_tagger
```
### Required Files

#### Text Documents

Documents are required to create a multimodal dataset. Every document (e.g., a book, a news article, ...) must be in a separate file with a ".txt" ending. The first line of a text file will be interpreted as its title.

For example, over 50,000 Simple English Wikipedia articles can be found [here](https://github.com/LGDoor/Dump-of-Simple-English-Wiki). In the repository, there is a script `scripts/simpleWikiArticles.py` which preprocesses the dataset by separating the articles into individual files. For that, the `corpus.txt` file must be placed next to the script and a directory `articles` must be created next to these files.

Example article:

<pre>
Autonomous communities of Spain
Spain is divided in 17 parts called autonomous communities.
"Autonomous" means that each of these autonomous communities has its own Executive Power, its own Legislative Power and its own Judicial Power.
These are similar, but NOT the same, to states in the United States of America, for example. Spain has fifty smaller parts called provinces.
In 1978 these parts came together, making the autonomous communities.
Before then, some of these provinces were together but were broken.
The groups that were together once before are called "historic communities": Galicia, País Vasco and Cataluña.
These communities have 2 official languages: Spanish and their own language (gallego or eusquera or catalán).
Spain also has two cities on the north coast of Africa: Ceuta and Melilla. They are called "autonomous cities".
</pre>

#### Images

Images are used to enrich the sentences of the documents. They must be saved in jpg format.

The image dataset that is used must be consistent with the concreteness values and the cached image features: if the image dataset changes, the other two must also be changed.

For example, the [MS COCO dataset](https://cocodataset.org) can be used as an image dataset. A direct link to the train 2014 images can be found [here](http://images.cocodataset.org/zips/train2014.zip).



#### Concreteness Values
The concreteness values of words are used as a filtering step in the multimodal dataset creation. With these values, it is easier to estimate whether there is an image in the given dataset that represents a given word and can therefore be highlighted according to that word.

A json file that contains a dictionary of words and their concreteness values is required for the multimodal dataset creation. In the repository, there is such a file, `data/concreteness/concretenessValuesMscoco.json`, containing the concreteness values for the MS COCO train 2014 dataset.
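
As an illustration, the file is a flat dictionary that maps words to scores. The words and values below are made up and only show the expected structure, together with the filtering idea behind the `--concreteness_threshold` parameter described later:

```
import json

# Made-up words and values that only illustrate the expected structure
# of the concreteness file (a flat word -> score dictionary).
concreteness = {"dog": 61.5, "beach": 48.0, "freedom": 7.2}
with open("concreteness_example.json", "w") as f:
    json.dump(concreteness, f)

# Sketch of the filtering idea behind --concreteness_threshold (default 20):
# words whose score is lower than the threshold are dropped.
threshold = 20
with open("concreteness_example.json") as f:
    loaded = json.load(f)
concrete_words = {word: score for word, score in loaded.items() if score >= threshold}
print(concrete_words)  # {'dog': 61.5, 'beach': 48.0}
```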
That concreteness file can be created with this [implementation](https://github.com/victorssilva/concreteness) of [Visual Concreteness](https://arxiv.org/abs/1804.06786).
### MongoDB

A MongoDB database is used to store the multimodal documents.
## Parameter Explanation
The `main.py` file has a lot of parameters, which are explained here.

These are the required parameters. They only have to be set explicitly if their default values do not fit your setup.

* `-d, --documents` - path to the directory containing the documents
* `-i, --images` - path to the directory containing the images
* `-m, --mcimages` - path to the directory where the highlighted images from miniCLIP will be saved
* `-c, --concreteness` - path to the json file containing the concreteness values. The file must be changed if the image dataset changes
* `-f, --cached_image_features` - path to the file containing the cached image features. If a file with that name exists, the cached features will be loaded from it; otherwise, such a file will be created with the specified name. The image features must be changed if the image dataset changes

The following parameters are optional.

* `--resized_main_images` - by default, the size of the main images will not change. If a directory is specified, the main images of the sentences will be resized and saved there. The resized images will have the same size as the highlighted images, which is 224x224 pixels
* `-t, --concreteness_threshold` - defaults to 20. Filters out the words from the concreteness file which have a score lower than this threshold
* `--cwi` - defaults to "on". Specifies whether the complex word identification (CWI) is "on" or "off"
* `--multimodal_sents_only` - defaults to "off". Specifies whether only the multimodal sentences of the documents will be saved
* `--max_docs` - defaults to None. Specifies the maximum number of documents that will be saved
* `--max_sents` - defaults to None. Specifies the maximum number of sentences a document is allowed to have in order to be processed further
* `--rnd_seed` - defaults to None. Controls the shuffling of the documents before they are processed
* `--candidate_imgs` - defaults to 5. Specifies how many "good" images that are "similar" to a given sentence there must be for the sentence to be processed further
* `--sent_img_similarity` - defaults to 0.225. The threshold that defines the minimum required similarity between a sentence and an image
* `--focus_word_img_similarity` - defaults to 0.25. The threshold that defines the minimum required similarity between a focus word and a candidate image
* `--db_name` - defaults to "multimodalDB". The name of the database
## Example Run
Let's assume we have the

* documents as ".txt" files in `../data/documents/`
* "MS COCO train 2014" images as ".jpg" files in `../data/train2014/`
* concreteness values file (which is related to the images) in `../data/concreteness/concretenessValuesMscoco.json`

Further, we want to save the highlighted images in `../data/miniclip/`. Since we run the multimodal dataset creation program for the first time (or use the image set for the first time), we specify that the cached image features are saved in `../data/cache/mscoco_features.pkl`.
Then, running `python main.py` is sufficient, since the values of the parameters are the same as the default values.

Running the program for the first time may take a while because the images have to be preprocessed and encoded by CLIP. A second run with the same image set that uses the cached features file is much faster.
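
As a rough sketch of what this caching avoids (this is not the actual implementation in `main.py` and assumes the openai `clip` package), the image features could be computed once and pickled like this:

```
import pickle
from pathlib import Path

import clip
import torch
from PIL import Image

# Rough sketch only: encode every image once with CLIP and store the features
# in a pickle file, so that later runs can load the file instead of re-encoding.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

features = {}
for path in Path("../data/train2014").glob("*.jpg"):
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        features[path.name] = model.encode_image(image).cpu()

with open("../data/cache/mscoco_features.pkl", "wb") as f:
    pickle.dump(features, f)
```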
Now, every document in `../data/documents/` that has at least one multimodal sentence is saved in the MongoDB database called `multimodalDB`. With
```
mongo localhost:27017
```

you can check the documents/books and the images of the multimodal sentences.
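
Alternatively, the database can be inspected from Python with pymongo. The collection names below are only assumptions, so list the real ones first:

```
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["multimodalDB"]

# The real collection names depend on the implementation, so list them first.
print(db.list_collection_names())

# Assuming collections named "documents" and "multimodal_sentences" (hypothetical),
# inspect one stored document and one multimodal sentence entry.
print(db["documents"].find_one())
print(db["multimodal_sentences"].find_one())
```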
For the further examples, let's assume that the required parameters have the same values as their default values.

Let's say we want the main images of the sentences to have the same size (224x224 pixels) as the highlighted images. Then, we can resize the main images which are used and save them in `../data/resizedImages` with
```
python main.py --resized_main_images ../data/resizedImages
```

If we want to define a word from the concreteness values file to be concrete/depictable only if it has a score of at least 50, we can run

```
python main.py --concreteness_threshold 50
```
The image retrieval with CLIP can be influenced with the parameters `--candidate_imgs`, `--sent_img_similarity` and `--focus_word_img_similarity`. The choice of the first two parameters is based on this [paper](https://www.inf.uni-hamburg.de/en/inst/ab/lt/publications/2022-wangetal-lrec.pdf). The last parameter is then based on the second one. In particular, increasing the last two might result in more suitable images but fewer multimodal sentences:
```
python main.py --candidate_imgs 10 --sent_img_similarity 0.25 --focus_word_img_similarity 0.275
```