In the multimodal dataset creation, sentences of documents are enriched by images.
|
|
|
|
|
At the end of the pipeline, every document with at least one multimodal sentence is saved in a MongoDB database. The database contains two collections. The first collection stores the documents; for each sentence of a document, it also stores an ID (the SHA-512 hash of the sentence) and a flag indicating whether the sentence is multimodal. The second collection contains the multimodal sentences; for each sentence, it stores the path to the main image and a dictionary that maps the (multimodal) focus words to the paths of their highlighted images. This information can be accessed through the [API](./api).
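As a rough sketch of how such entries could look (the field names here are illustrative and may differ from the actual implementation):

```python
# Hypothetical shape of the stored entries; all field names are illustrative only.
document_entry = {
    "title": "Example article",
    "sentences": [
        {
            "text": "A dog plays with a ball in the park.",
            "id": "<SHA-512 hash of the sentence>",
            "multimodal": True,
        },
    ],
}

multimodal_sentence_entry = {
    "id": "<SHA-512 hash of the sentence>",
    "image": "images/main_image.jpg",              # path to the main image
    "focus_words": {                               # focus word -> highlighted image
        "dog": "images/highlighted/<id>_dog.jpg",
        "ball": "images/highlighted/<id>_ball.jpg",
    },
}
```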
|
|
|
|
|
|
|
|
|
## Prerequisites
|
|
|
|
|
|
### Required Packages
|
|
|
|
|
|
Most of the required packages can be installed with pip using the "requirements.txt" file from the repository:
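```
pip install -r requirements.txt
```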
|
|
|
|
The POS tagger from nltk can be installed via the interactive mode of Python:
|
|
```
>>> import nltk
>>> nltk.download("averaged_perceptron_tagger")
|
|
|
```
|
|
|
|
|
|
|
|
|
### Required Files
|
|
|
|
|
|
|
|
|
#### Text Documents
|
|
|
|
|
|
Documents are required to create a multimodal dataset. Every document (e.g., a book or a news article) must be stored in a separate file with a ".txt" extension. The first line of a text file is interpreted as its title.
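For illustration, a document file (hypothetical content) could look like this, with the title on the first line:

```
Albert Einstein
Albert Einstein was a German-born theoretical physicist. He is best known for developing the theory of relativity.
```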
|
|
|
|
|
|
For example, over 50,000 Simple English Wikipedia articles can be found [here](https://github.com/LGDoor/Dump-of-Simple-English-Wiki). In the repository, there is a script, "scripts/simpleWikiArticles.py", which separates the Simple Wikipedia articles into individual files. For that, the "corpus.txt" file must be placed next to the script, and a directory named "articles" must be created next to these files.
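As a rough illustration of such a splitting step, here is a minimal sketch; it assumes that the articles in "corpus.txt" are separated by blank lines and begin with their title, which may differ from the actual corpus format and from what "scripts/simpleWikiArticles.py" does.

```python
# Minimal sketch: split a corpus into one ".txt" file per article.
# Assumption: articles are separated by blank lines and begin with a title line;
# the real corpus format and scripts/simpleWikiArticles.py may differ.
from pathlib import Path

corpus = Path("corpus.txt").read_text(encoding="utf-8")
out_dir = Path("articles")
out_dir.mkdir(exist_ok=True)

for i, article in enumerate(corpus.split("\n\n")):
    article = article.strip()
    if article:
        # the first line of each file is later interpreted as the title
        (out_dir / f"article_{i}.txt").write_text(article, encoding="utf-8")
```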
|
|
|
|
|
|
|
|
|
#### Images
|
|
|
|
|
|
Images are used to enrich the sentences of the documents. They must be saved as "jpg" files.
|
|
|
|
The image dataset that is used must be related to the concreteness values.
|
|
|
|
|
For example, the [MS COCO dataset](http://images.cocodataset.org/zips/train2014.zip) can be used as an image dataset.
|
|
|
|
|
|
|
|
|
#### Concreteness Values
|
|
|
|
|
|
The concreteness values of words are used as a filtering step in the multimodal dataset creation. With these values, it is easier to estimate whether the given image dataset contains an image that represents a given word and can therefore be highlighted for that word.
|
|
|
|
A JSON file that contains a dictionary of words and their concreteness values is required.
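A minimal sketch of how such a file could be structured and loaded (the file name "concreteness.json" and the example values are assumptions, not taken from the pipeline):

```python
import json

# hypothetical file content: {"dog": 4.85, "idea": 1.61, "bicycle": 4.90}
with open("concreteness.json", encoding="utf-8") as f:
    concreteness = json.load(f)  # dict mapping word -> concreteness value

print(concreteness.get("dog"))
```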
|
|
|
|
|
That concreteness file can be created with the [implementation](https://github.com/victorssilva/concreteness) of [Visual Concreteness](https://arxiv.org/abs/1804.06786).
|
|
|
|
|
|
|
|
|
### Setup Database
|
|
|
|
|
|
A MongoDB database is used to store the multimodal documents. [Here](https://www.mongodb.com/docs/manual/installation/) are several tutorials for installing MongoDB on different systems.
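Once MongoDB is running, connecting to it from Python with pymongo typically looks like the following; the database and collection names here are placeholders and not necessarily the names used by the pipeline:

```python
from pymongo import MongoClient

# connect to a locally running MongoDB instance (default port 27017)
client = MongoClient("mongodb://localhost:27017/")
db = client["multimodal_dataset"]        # placeholder database name
documents = db["documents"]              # placeholder collection for the documents
sentences = db["multimodal_sentences"]   # placeholder collection for the multimodal sentences

print(db.list_collection_names())
```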
|
|
|
|