{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dataset handling\n",
    "\n",
    "Scikit-multilearn provides methods to load, save and manipulate multi-label data sets in two formats:\n",
    "\n",
    "- a scikit-multilearn pickle of data set in scipy sparse format\n",
    "- the traditional ARFF file format\n",
    "\n",
    "The functionality is provided in the :mod:`skmultilearn.dataset` module.\n",
    "\n",
    "Scikit-multilearn also provides a repository of most popular benchmark data sets in the scipy sparse format and convienience functions to access them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## scikit-multilearn format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skmultilearn.dataset import load_dataset_dump, save_dataset_dump"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Loading scikit-multilearn data format is easier as it stores more information than the ARFF file, all you need to do is specify the path to the data set file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "X, y, feature_names, label_names = load_dataset_dump('_static/example.pkl.bz2')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(<65x19 sparse matrix of type '<class 'numpy.float64'>'\n",
       " \twith 491 stored elements in LInked List format>,\n",
       " <65x7 sparse matrix of type '<class 'numpy.int64'>'\n",
       " \twith 217 stored elements in LInked List format>,\n",
       " [('landmass', ['1', '2', '3', '4', '5', '6']),\n",
       "  ('zone', ['1', '2', '3', '4']),\n",
       "  ('area', 'NUMERIC')],\n",
       " [('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X, y, feature_names[:3], label_names[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'X': <10x4 sparse matrix of type '<class 'numpy.float64'>'\n",
       " \twith 27 stored elements in LInked List format>,\n",
       " 'y': <10x3 sparse matrix of type '<class 'numpy.int64'>'\n",
       " \twith 16 stored elements in LInked List format>,\n",
       " 'features': [('landmass', ['1', '2', '3', '4', '5', '6']),\n",
       "  ('zone', ['1', '2', '3', '4']),\n",
       "  ('area', 'NUMERIC'),\n",
       "  ('population', 'NUMERIC')],\n",
       " 'labels': [('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])]}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "save_dataset_dump(X[:10,:4], y[:10, :3], feature_names[:4], label_names[:3], filename=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the ``filename`` argument is not ``None`` this dictionary is saved as a bzip2 compressed pickle and the function does not return anything."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# scikit-multilearn repository"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skmultilearn.dataset import available_data_sets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following benchmark data sets, originally provided in the [MULAN data repository](http://mulan.sourceforge.net/datasets-mlc.html) are provided in ``train``, ``test``, and ``undivided`` variants. The undivided variant contains the complete data set, before the train/test split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Corel5k',\n",
       " 'bibtex',\n",
       " 'birds',\n",
       " 'delicious',\n",
       " 'emotions',\n",
       " 'enron',\n",
       " 'genbase',\n",
       " 'mediamill',\n",
       " 'medical',\n",
       " 'rcv1subset1',\n",
       " 'rcv1subset2',\n",
       " 'rcv1subset3',\n",
       " 'rcv1subset4',\n",
       " 'rcv1subset5',\n",
       " 'scene',\n",
       " 'tmc2007_500',\n",
       " 'yeast'}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "set([x[0] for x in available_data_sets().keys()])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Variants:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'test', 'train', 'undivided'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "set([x[1] for x in available_data_sets().keys()])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scikit-multilearn can automatically download the data sets for you, similar to [scikit-learn's data set API](http://scikit-learn.org/stable/datasets/index.html).\n",
    "\n",
    "The data is stored by default in the subfolder ``scikit_ml_learn_data`` of your ``SCIKIT_ML_LEARN_DATA`` environment variable. If the variable is not set, the data is stored in ``~/scikit_ml_learn_data``.\n",
    "\n",
    "To download a data set use the :meth:`load_dataset` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skmultilearn.dataset import load_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "scene - exists, not redownloading\n"
     ]
    }
   ],
   "source": [
    "X, y, feature_names, label_names = load_dataset('scene', 'train')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(<1211x294 sparse matrix of type '<class 'numpy.float64'>'\n",
       " \twith 351805 stored elements in LInked List format>,\n",
       " <1211x6 sparse matrix of type '<class 'numpy.int64'>'\n",
       " \twith 1286 stored elements in LInked List format>,\n",
       " [('Att1', 'NUMERIC'), ('Att2', 'NUMERIC'), ('Att3', 'NUMERIC')],\n",
       " [('Beach', ['0', '1']), ('Sunset', ['0', '1']), ('FallFoliage', ['0', '1'])])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X, y, feature_names[:3], label_names[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ARFF files\n",
    "\n",
    "The most common way for storing multi-label data is the [ARFF file format](https://www.cs.waikato.ac.nz/ml/weka/arff.html) created by the [WEKA](https://www.cs.waikato.ac.nz/ml/weka/) library. You can find many benchmark data sets in ARFF format on the [MULAN data repository](http://mulan.sourceforge.net/datasets-mlc.html).\n",
    "\n",
    "Loading both dense and sparse ARFF files is simple in scikit-multilearn, just use :func:`load_from_arff`, like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skmultilearn.dataset import load_from_arff"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Loading multi-label ARFF files requires additional information as the number or placement of labels, is not indicated in the format directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "path_to_arff_file = '_static/example.arff'\n",
    "label_count = 7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Different software expects labels in different parts of the ARFF file:\n",
    "\n",
    "- MEKA expects labels to appear at the beginning of the file\n",
    "- MULAN expects labels to appear at the end of the file\n",
    "\n",
    "As the `example.arff` comes from MULAN, we set the label location to ``end``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "label_location=\"end\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are two ways to save ARFF data:\n",
    "- dense, where the file contains a complete dump of the data set row by row, including places where the value is 0\n",
    "- sparse, as a dictionary of keys, where for each row the non-zero elements are listed with their index\n",
    "\n",
    "The example file is not sparse, that's why we set the ``load_sparse`` argument to ``False``"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "arff_file_is_sparse = False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "X, y = load_from_arff(\n",
    "    path_to_arff_file, \n",
    "    label_count=label_count, \n",
    "    label_location=label_location,\n",
    "    load_sparse=arff_file_is_sparse\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or if you also want the metadata: feature and label names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "X, y, feature_names, label_names = load_from_arff(\n",
    "    path_to_arff_file, \n",
    "    label_count=label_count, \n",
    "    label_location=label_location,\n",
    "    load_sparse=arff_file_is_sparse,\n",
    "    return_attribute_definitions=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see scikit-multilearn encodes nominal types by default as integers, and converts the input space to floats, while the output space to binary indicators 0/1 represented as integers. To change this behavior specify your own params to `load_from_arff` as described in the API documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(<65x19 sparse matrix of type '<class 'numpy.float64'>'\n",
       " \twith 491 stored elements in LInked List format>,\n",
       " <65x7 sparse matrix of type '<class 'numpy.int64'>'\n",
       " \twith 217 stored elements in LInked List format>,\n",
       " [('landmass', ['1', '2', '3', '4', '5', '6']),\n",
       "  ('zone', ['1', '2', '3', '4']),\n",
       "  ('area', 'NUMERIC')],\n",
       " [('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X, y, feature_names[:3], label_names[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to save ARFF files, you can use the :meth:`save_arff` function, which can both return a string containing an ARFF dump of the data set, or save it to a provided file when the `filename` argument is passed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skmultilearn.dataset import save_to_arff"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's say we want to save a subset of the data in a sparse format and with labels at the begining of the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "% traindata\n",
      "@RELATION \"traindata: -C 3\"\n",
      "\n",
      "@ATTRIBUTE y0 {0, 1}\n",
      "@ATTRIBUTE y1 {0, 1}\n",
      "@ATTRIBUTE y2 {0, 1}\n",
      "@ATTRIBUTE X0 NUMERIC\n",
      "@ATTRIBUTE X1 NUMERIC\n",
      "@ATTRIBUTE X2 NUMERIC\n",
      "@ATTRIBUTE X3 NUMERIC\n",
      "\n",
      "@DATA\n",
      "{ 0 1,3 3.0,5 1001.0,6 47.0 }\n",
      "{ 2 1,3 1.0,4 2.0,5 178.0,6 3.0 }\n",
      "{ 0 1,2 1,3 1.0,4 3.0,5 76.0,6 2.0 }\n",
      "{ 0 1,2 1,3 5.0,4 1.0 }\n",
      "{ 0 1,3 4.0,5 47.0,6 1.0 }\n",
      "{ 2 1,4 3.0 }\n",
      "{ 0 1,2 1,3 4.0,5 121.0,6 18.0 }\n",
      "{ 0 1,1 1,3 2.0,5 301.0,6 57.0 }\n",
      "{ 0 1,1 1,3 4.0 }\n",
      "{ 0 1,1 1,3 3.0,5 2388.0,6 20.0 }\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(save_to_arff(X[:10,:4], y[:10, :3], label_location='start', save_sparse=True))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
