Skip to content

File formats

The following file formats are supported:

Tabular data

type suffix example
csv .csv Titanic dataset csv
Excel .xls, .xlsx, .xlsm, .xlsb, .odf, .ods Titanic dataset excel
  • The first row contains the names of the columns.
  • Only the first sheet of the Excel file is used.

csv format

All common delimiters, including commas, colons, semicolons and tabs, are supported. For floating point numbers, please use dots as the decimal separator. Below is an example of a valid CSV file:

name,number,double
"Ford, Henry",1.4,2.8
"Benz, Carl",7.5,15.0

Load additional files (e.g. images)

To load additional files such as images, create a column in the tabular dataset (csv or excel) with the file names. Create a zip file with the tabular file and the additional files. Pack the files directly into the zip file, do not compress the folder containing the files. The zip file should contain only one csv file. The system will automatically detect that the column contains file names, and if they are present in the zip file, they will be loaded into your databag.

type suffix example
zip file .zip Mnist zip

Example structue of a zip file:

mnist.zip
├─ 10.jpg
├─ 20.jpg
├─ 24.jpg
└─ mnist.csv

Content of the mnist.csv:

image,digit
10.jpg,3
20.jpg,4
24.jpg,1
  • Supported image files are: .jpg, .jpeg, .png, .tiff
  • Currently only image files are supported, but in the future other files will be supported as well

Scripts

type suffix example
python script .py Python script

You can also upload python scripts that create a pandas dataframe. The script should save the dataframe as a csv without index in the location specified by the --output arg.

Examples for common Machine Learning Problems

We will give some examples of how you can model your machine learning problem using the multimodal format.

Image Classification

Upload a zip file containing your images and a csv file with the image names and labels. The csv should look like this:

image,digit
10.jpg,3
20.jpg,4
24.jpg,1

Image Captioning and Text2Image

image,description
tree.jpg,this is a tree in front of a house
ball.jpg,a blue and white ball
flowers.jpg,some flowers on a balcony

If you have a dataset with images and corresponding descriptions, you can train an image captioning model, by specifying the description as the output column. If you choose the image as the output column you can train a text2image model. However, this is not yet supported by our solvers, but will be a future feature.

Text classification

text,sentiment
he is a good boy,positive
he is rude, negative

Using a databag with a text column and a category column, you can model text classification tasks such as sentiment analysis.

Neural Machine Translation

english,deutsch
boy,Junge
italian,italienisch

If you have a databag with translations from one language to another, you can train a neural machine translation model by simply selecting the desired language as the output column.