{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Developer documentation\n",
    "\n",
    "Scikit-multilearn development team is an open international community that welcomes contributions and new developers. This document is for you if you want to implement a new:\n",
    "\n",
    "- classifier\n",
    "- relationship graph builder\n",
    "- label space clusterer\n",
    "\n",
    "Before we can go into development details, we need to discuss how to setup a comfortable development environment and what is the best way to contribute.\n",
    "\n",
    "\n",
    "### Working with the repository\n",
    "\n",
    "Scikit-learn is developed on github using git for code version management. To get the current codebase you need to checkout the scikit-multilearn repository\n",
    "\n",
    "```\n",
    "git clone git@github.com:scikit-multilearn/scikit-multilearn.git\n",
    "```\n",
    "\n",
    "To make a contribution to the repository your should fork the repository, clone your fork, and start development based on the `master` branch. Once you're done, push your commits to your repository and submit a pull request for review. \n",
    "\n",
    "The review usually includes:\n",
    "- making sure that your code works, i.e. it has enough unit tests and tests pass\n",
    "- reading your code's documentation, it should follow the numpydoc standard\n",
    "- checking whether your code works properly on sparse matrix input\n",
    "- your class should not store more data in memory than neccessary\n",
    "\n",
    "Once your contributions adhere to reviewer comments, your code will be included in the next release.\n",
    "\n",
    "### Development Docker image\n",
    "\n",
    "To ease development and testing we provide a docker image containing all libraries needed to test all of scikit-multilearn codebase. It is an ubuntu based docker image with libraries that are very costly to compile such as python-graphtool. This docker image can be easily integrated with your PyCharm environment.\n",
    "\n",
    "To pull the [scikit-multilearn docker image](https://github.com/scikit-multilearn/development-docker) just use:\n",
    "\n",
    "```bash\n",
    "$ docker pull niedakh/scikit-multilearn-dev:latest\n",
    "```\n",
    "\n",
    "After cloning the scikit-multilearn repository, run the following command:\n",
    "\n",
    "\n",
    "This docker contains two python environments set for scikit-multilearn: 2.7 and 3.x, to use the first one run `python2` and `pip2`, the second is available via `python3` and `pip3`.\n",
    "\n",
    "You can pull the latest version from Docker hub using:\n",
    "```bash\n",
    "$ docker pull niedakh/scikit-multilearn-dev:latest\n",
    "```\n",
    "\n",
    "You can start it via:\n",
    "```bash\n",
    "$ docker run -e \"MEKA_CLASSPATH=/opt/meka/lib\" -v \"YOUR_CLONE_DIR:/home/python-dev/repo\" --name scikit_multilearn_dev_test_docker -p 8888:8888 -d niedakh/scikit-multilearn-dev:latest\n",
    "```\n",
    "\n",
    "To run the tests under the python 2.7 environment use:\n",
    "```bash\n",
    "$ docker exec -it scikit_multilearn_dev_test_docker python3 -m pytest /home/python-dev/repo\n",
    "```\n",
    "\n",
    "or for python 3.x use:\n",
    "```bash\n",
    "$ docker exec -it scikit_multilearn_dev_test_docker python2 -m pytest /home/python-dev/repo\n",
    "```\n",
    "\n",
    "To play around just login with:\n",
    "```bash\n",
    "$ docker exec -it scikit_multilearn_dev_test_docker bash\n",
    "```\n",
    "\n",
    "To start jupyter notebook run:\n",
    "\n",
    "```bash\n",
    "$ docker exec -it scikit_multilearn_dev_test_docker bash -c \"cd /home/python-dev/repo && jupyter notebook\"\n",
    "```\n",
    "\n",
    "### Building documentation\n",
    "\n",
    "In order to build HTML documentation just run:\n",
    "\n",
    "```bash\n",
    "$ docker exec -it scikit_multilearn_dev_test_docker bash -c \"cd /home/python-dev/repo/docs && make html\"\n",
    "```\n",
    "\n",
    "\n",
    "### Development\n",
    "\n",
    "One of the most comfortable ways to work on the library is to use [Pycharm](https://www.jetbrains.com/pycharm/) and its [support for docker-contained interpreters](https://www.jetbrains.com/help/pycharm/using-docker-as-a-remote-interpreter.html), just configure access to the docker server, set it up in Pycharm, use `niedakh/scikit-multilearn-dev:latest` as the image name and set up relevant path mappings, voila - you can now use this environment for development, debugging and running tests within the IDE. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing code\n",
    "\n",
    "At the very list you should make sure that your code:\n",
    "\n",
    "- works on Python 2 and Python 3 on Windows 10/Linux/OSX using travis/appveyor \n",
    "\n",
    "- PEP8 coding guidelines\n",
    "\n",
    "- follows scikit-learn interfaces if relevant interfaces exist\n",
    "\n",
    "- is documented in the [numpydocs fashion](http://numpydoc.readthedocs.io/en/latest/format.html), especially that all public API is documented, including attributes and an example use case, see existing code for inspiration\n",
    "\n",
    "- has tests written, you can find relevant tests in ``skmultilearn.cluster.tests`` and ``skmultilearn.problem_transform.tests``."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing a label space clusterer\n",
    "\n",
    "One of the approaches to multi-label classification is to cluster the label space into subspaces and perform classification in smaller subproblems to reduce the risk of under/overfitting.\n",
    "\n",
    "In order to create your own label space clusterer you need to inherit :class:`LabelSpaceClustererBase` and implement the ``fit_predict(X, y)`` class method. Expect ``X`` and ``y`` to be sparse matrices, you and also use :func:`skmultilearn.utils.get_matrix_in_format` to convert to a desired matrix format. ``fit_predict(X, y)`` should return an array-like (preferably ``ndarray`` or at least a ``list``) of ``n_clusters`` subarrays which contain lists of labels present in a given cluster. An example of a correct partition of five labels is: ``np.array([[0,1], [2,3,4]])`` and of overlapping clusters: ``np.array([[0,1,2], [2,3,4]])``.\n",
    "\n",
    "\n",
    "### Example Clusterer\n",
    "\n",
    "Let us look at a toy example, where a clusterer divides the label space based on how a given label's ordinal divides modulo a given number of clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skmultilearn.dataset import load_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "emotions:train - exists, not redownloading\n",
      "emotions:test - exists, not redownloading\n"
     ]
    }
   ],
   "source": [
    "X_train, y_train, _, _ = load_dataset('emotions', 'train')\n",
    "X_test, y_test, _, _ = load_dataset('emotions', 'test')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skmultilearn.ensemble import LabelSpacePartitioningClassifier\n",
    "from skmultilearn.cluster.base import LabelSpaceClustererBase\n",
    "\n",
    "\n",
    "class ModuloClusterer(LabelSpaceClustererBase):\n",
    "    \"\"\"Initializes the clusterer\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    n_clusters: int\n",
    "        number of clusters to partition into\n",
    "        \n",
    "    Returns\n",
    "    --------\n",
    "    array-like of array-like, (n_clusters,)\n",
    "        list of lists label indexes, each sublist represents labels\n",
    "        that are in that community\n",
    "    \"\"\"\n",
    "    def __init__(self, n_clusters = None):\n",
    "        \n",
    "        super(ModuloClusterer, self).__init__()\n",
    "        self.n_clusters = n_clusters\n",
    "\n",
    "    def fit_predict(self, X, y):\n",
    "        n_labels = y.shape[1]\n",
    "        partition_list = [[] for _ in range(self.n_clusters)]\n",
    "        for label in range(n_labels):\n",
    "            partition_list[label % self.n_clusters].append(label)\n",
    "        return np.array(partition_list)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 3],\n",
       "       [1, 4],\n",
       "       [2, 5]])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clusterer = ModuloClusterer(n_clusters=3)\n",
    "clusterer.fit_predict(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using the example Clusterer\n",
    "Such a clusterer can then be used with an ensemble classifier such as the ``LabelSpacePartitioningClassifier``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skmultilearn.ensemble import LabelSpacePartitioningClassifier\n",
    "from skmultilearn.problem_transform import LabelPowerset\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.metrics import accuracy_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LabelSpacePartitioningClassifier(classifier=LabelPowerset(classifier=GaussianNB(priors=None), require_dense=[True, True]),\n",
       "                 clusterer=ModuloClusterer(n_clusters=3),\n",
       "                 require_dense=[False, False])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf = LabelSpacePartitioningClassifier(\n",
    "    classifier = LabelPowerset(classifier=GaussianNB()),\n",
    "    clusterer = clusterer\n",
    ")\n",
    "clf "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.23762376237623761"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.fit(X_train, y_train)\n",
    "prediction = clf.predict(X_test)\n",
    "accuracy_score(y_test, prediction)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing a Graph Builder\n",
    "\n",
    "Scikit-multilearn implements clusterers that are capable of infering label space clusters (in network science the word communities is used more often) from a graph/network depicting label relationships. These clusterers are further described in [Label relations](labelrelations.ipynb) chapter of the user guide.\n",
    "\n",
    "To implement your own graph builder you need to subclass `GraphBuilderBase` and implement the `transform` function which should return a weighted (or not) adjacency matrix in the form of a dictionary, with keys ``(label1, label2)`` and values representing a weight.\n",
    "\n",
    "\n",
    "### Example GraphBuilder\n",
    "\n",
    "Let's implement a simple graph builder which returns the correlations between labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy import stats\n",
    "from skmultilearn.cluster import GraphBuilderBase\n",
    "from skmultilearn.utils import get_matrix_in_format\n",
    "\n",
    "class LabelCorrelationGraphBuilder(GraphBuilderBase):\n",
    "    \"\"\"Builds a graph with label correlations on edge weights\"\"\"\n",
    "\n",
    "    def transform(self, y):\n",
    "        \"\"\"Generate weighted adjacency matrix from label matrix\n",
    "\n",
    "        This function generates a weighted label correlation\n",
    "        graph based on input binary label vectors\n",
    "\n",
    "        Parameters\n",
    "        ----------\n",
    "        y : numpy.ndarray or scipy.sparse\n",
    "            dense or sparse binary matrix with shape \n",
    "            ``(n_samples, n_labels)``\n",
    "\n",
    "        Returns\n",
    "        -------\n",
    "        dict\n",
    "            weight map with a tuple of ints as keys\n",
    "            and a float value ``{ (int, int) : float }``\n",
    "        \"\"\"\n",
    "        label_data = get_matrix_in_format(y, 'csc')\n",
    "        labels = range(label_data.shape[1])\n",
    "        \n",
    "        self.is_weighted = True\n",
    "        \n",
    "        edge_map = {}\n",
    "        \n",
    "        for label_1 in labels:\n",
    "            for label_2 in range(0, label_1+1):\n",
    "                # calculate pearson R correlation coefficient for label pairs\n",
    "                # we only include the edges above diagonal as it is an undirected graph\n",
    "                pearson_r, _ = stats.pearsonr(label_data[:,label_2].todense(), label_data[:,label_1].todense())\n",
    "                edge_map[(label_2, label_1)] = pearson_r[0]\n",
    "        \n",
    "        return edge_map\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph_builder = LabelCorrelationGraphBuilder()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{(0, 0): 1.0,\n",
       " (0, 1): 0.0054205072520802679,\n",
       " (0, 2): -0.4730507042031965,\n",
       " (0, 3): -0.35907118960632034,\n",
       " (0, 4): -0.32287762681546733,\n",
       " (0, 5): 0.24883125852376733,\n",
       " (1, 1): 1.0,\n",
       " (1, 2): 0.1393556218283642,\n",
       " (1, 3): -0.25112700233108359,\n",
       " (1, 4): -0.3343594619173676,\n",
       " (1, 5): -0.36277277605002756,\n",
       " (2, 2): 1.0,\n",
       " (2, 3): 0.34204580629202336,\n",
       " (2, 4): 0.23107157941324433,\n",
       " (2, 5): -0.56137098197912705,\n",
       " (3, 3): 1.0,\n",
       " (3, 4): 0.48890609122000817,\n",
       " (3, 5): -0.35949125643829821,\n",
       " (4, 4): 1.0,\n",
       " (4, 5): -0.28842101609587079,\n",
       " (5, 5): 1.0}"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "graph_builder.transform(y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This adjacency matrix can be then used by a Label Graph clusterer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using the example GraphBuilder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 5], [1], [2], [3, 4]], dtype=object)"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from skmultilearn.cluster import NetworkXLabelGraphClusterer\n",
    "clusterer = NetworkXLabelGraphClusterer(graph_builder=graph_builder)\n",
    "clusterer.fit_predict(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The clusterer can be then used with the LabelSpacePartitioning classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.13861386138613863"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from skmultilearn.ensemble import LabelSpacePartitioningClassifier\n",
    "from skmultilearn.problem_transform import LabelPowerset\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "clf = LabelSpacePartitioningClassifier(\n",
    "    classifier = LabelPowerset(classifier=GaussianNB()),\n",
    "    clusterer = clusterer\n",
    ")\n",
    "clf.fit(X_train, y_train)\n",
    "prediction = clf.predict(X_test)\n",
    "accuracy_score(y_test, prediction)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing a classifier\n",
    "\n",
    "To implement a multi-label classifier you need to subclass a classifier base class. Currently, you can select of a few classifier base classes depending on which approach to multi-label classification you follow.\n",
    "\n",
    "Scikit-multilearn inheritance tree for the classifier is shown on the figure below.\n",
    "\n",
    "![Classifier inheritance diagram][inheritance]\n",
    "\n",
    "[inheritance]: inheritance.png\n",
    "\n",
    "\n",
    "To implement a scikit-learn's ecosystem compatible classifier, we need to subclass two classes from sklearn.base: BaseEstimator and ClassifierMixin. For that we provide :class:`skmultilearn.base.MLClassifierBase` base class. We further extend this class with properties specific to the problem transformation approach in multi-label classification in :class:`skmultilearn.base.ProblemTransformationBase`.\n",
    "\n",
    "To implement a scikit-learn's ecosystem compatible classifier, we need to subclass two classes from sklearn.base: BaseEstimator and ClassifierMixin. For that we provide :class:`skmultilearn.base.MLClassifierBase` base class. We further extend this class with properties specific to the problem transformation approach in multi-label classification in :class:`skmultilearn.base.ProblemTransformationBase`.\n",
    "\n",
    "### Scikit-learn base classses\n",
    "\n",
    "#### BaseEstimator\n",
    "\n",
    "The base estimator class from scikit is responsible for providing the ability of cloning classifiers, for example when multiple instances of the same classifier are needed for cross-validation performed using the CrossValidation class.\n",
    "\n",
    "The class provides two functions responsible for that: ``get_params``, which fetches parameters from a classifier object and ``set_params``, which sets params of the target clone. The params should also be acceptable by the constructor.\n",
    "\n",
    "#### ClassifierMixin\n",
    "\n",
    "This is an interface with a non-important method that allows different classes in scikit to detect that our classifier behaves as a classifier (i.e. implements ``fit``/``predict`` etc.) and provides certain kind of outputs.\n",
    "\n",
    "\n",
    "### MLClassifierBase\n",
    "\n",
    "The base multi-label classifier in scikit-multilearn is :class:`skmultilearn.base.MLClassifierBase`. It provides two abstract methods: fit(X, y) to train the classifier and predict(X) to predict labels for a set of samples. These functions are expected from every classifier. It also provides a default implementation of get_params/set_params that works for multi-label classifiers.\n",
    "\n",
    "All you need to do in your classifier is: \n",
    "\n",
    "1. subclass ``MLClassifierBase`` or a derivative class\n",
    "2. set ``self.copyable_attrs`` in your class's constructor to a list of fields (as strings), that should be cloned (usually it is equal to the list of constructor's arguments)\n",
    "3. implement the ``fit`` method that trains your classifier\n",
    "4. implement the ``predict`` method that predicts results\n",
    "\n",
    "#### Copyable fields\n",
    "\n",
    "One of the most important concepts in scikit-learn's ``BaseEstimator``, is the concept of cloning. Scikit-learn provides a plethora of experiment performing methods, among others, cross-validation, which require the ability to clone a classifier. Scikit-multilearn's base multi-label class - ``MLClassifierBase`` - provides infrastructure for automatic cloning support.\n",
    "\n",
    "\n",
    "An example of this would be: \n",
    "\n",
    "```python\n",
    "from skmultilearn.base import MLClassifierBase\n",
    "\n",
    "class AssignKBestLabels(MLClassifierBase):\n",
    "    \"\"\"Assigns k most frequent labels\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    k : int\n",
    "        number of most frequent labels to assign\n",
    "        \n",
    "    Example\n",
    "    -------\n",
    "    An example use case for AssignKBestLabels:\n",
    "\n",
    "    .. code-block:: python\n",
    "\n",
    "        from skmultilearn.<YOUR_CLASSIFIER_MODULE> import AssignKBestLabels\n",
    "        \n",
    "        # initialize LabelPowerset multi-label classifier with a RandomForest\n",
    "        classifier = AssignKBestLabels(\n",
    "            k = 3\n",
    "        )\n",
    "\n",
    "        # train\n",
    "        classifier.fit(X_train, y_train)\n",
    "\n",
    "        # predict\n",
    "        predictions = classifier.predict(X_test) \n",
    "    \"\"\"\n",
    "\n",
    "\n",
    "    def __init__(self, k = None):\n",
    "        super(AssignKBestLabels, self).__init__()\n",
    "        self.k = k\n",
    "        self.copyable_attrs = ['k']\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The fit method\n",
    "\n",
    "The ``fit(self, X, y)`` expects classifier training data:\n",
    "\n",
    "- ``X`` should be a sparse matrix of shape: ``(n_samples, n_features)``, although for compatibility reasons array of arrays and a dense matrix are supported. \n",
    "\n",
    "- ``y`` should be a sparse, binary indicator, matrix of shape: ``(n_samples, n_labels)`` with 1 in a position ``i,j`` when ``i``-th sample  is labelled with label no. ``j``\n",
    "\n",
    "It should return ``self`` after the classifier has been fitted to training data. It is customary that ``fit`` should remember ``n_labels`` in a way. In practice we store ``n_labels`` as ``self.label_count`` in scikit-multilearn classifiers.\n",
    "\n",
    "Let's make our classifier trainable:\n",
    "```python\n",
    "    def fit(self, X, y):\n",
    "        \"\"\"Fits classifier to training data\n",
    "\n",
    "        Parameters\n",
    "        ----------\n",
    "        X : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix, shape=(n_samples, n_features)\n",
    "            input feature matrix\n",
    "        y : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix of `{0, 1}`, shape=(n_samples, n_labels)\n",
    "            binary indicator matrix with label assignments\n",
    "\n",
    "        Returns\n",
    "        -------\n",
    "        self\n",
    "            fitted instance of self\n",
    "        \"\"\"\n",
    "        frequencies = (y_train.sum(axis=0)/float(y_train.sum().sum())).A.tolist()[0]\n",
    "        labels_sorted_by_frequency = sorted(range(y_train.shape[1]), key = lambda i: frequencies[i])\n",
    "        self.labels_to_assign = labels_sorted_by_frequency[:self.k]\n",
    "        \n",
    "        return self\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The predict and predict_proba method\n",
    "\n",
    "The ``predict(self, X)`` returns a prediction of labels for the samples from ``X``:\n",
    "\n",
    "- ``X`` should be a sparse matrix of shape: ``(n_samples, n_features)``, although for compatibility reasons array of arrays and a dense matrix are supported. \n",
    "\n",
    "The returned value is similar to ``y`` in ``fit``. It should be a sparse binary indicator matrix of the shape ``(n_samples, n_labels)``.\n",
    "\n",
    "In some cases, while scikit continues to progress towards a complete switch to sparse matrices, it might be needed to convert the sparse matrix to a `dense matrix` or even `array-like of array-likes`. Such is the case for some scoring functions in scikit. This problem should go away in the future versions of scikit.\n",
    "\n",
    "The ``predict_proba(self, X)`` functions similarly but returns the likelihood of the label being correctly assigned to samples from ``X``.\n",
    "\n",
    "Let's add the prediction functionality to our classifier and see how it works:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.10396039603960396"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from skmultilearn.base import MLClassifierBase\n",
    "from scipy.sparse import lil_matrix\n",
    "\n",
    "class AssignKBestLabels(MLClassifierBase):\n",
    "    \"\"\"Assigns k most frequent labels\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    k : int\n",
    "        number of most frequent labels to assign\n",
    "        \n",
    "    Example\n",
    "    -------\n",
    "    An example use case for AssignKBestLabels:\n",
    "\n",
    "    .. code-block:: python\n",
    "\n",
    "        from skmultilearn.<YOUR_CLASSIFIER_MODULE> import AssignKBestLabels\n",
    "        \n",
    "        # initialize LabelPowerset multi-label classifier with a RandomForest\n",
    "        classifier = AssignKBestLabels(\n",
    "            k = 3\n",
    "        )\n",
    "\n",
    "        # train\n",
    "        classifier.fit(X_train, y_train)\n",
    "\n",
    "        # predict\n",
    "        predictions = classifier.predict(X_test) \n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, k = None):\n",
    "        super(AssignKBestLabels, self).__init__()\n",
    "        self.k = k\n",
    "        self.copyable_attrs = ['k']\n",
    "        \n",
    "    def fit(self, X, y):\n",
    "        \"\"\"Fits classifier to training data\n",
    "\n",
    "        Parameters\n",
    "        ----------\n",
    "        X : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix, shape=(n_samples, n_features)\n",
    "            input feature matrix\n",
    "        y : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix of `{0, 1}`, shape=(n_samples, n_labels)\n",
    "            binary indicator matrix with label assignments\n",
    "\n",
    "        Returns\n",
    "        -------\n",
    "        self\n",
    "            fitted instance of self\n",
    "        \"\"\"\n",
    "        self.n_labels = y.shape[1]\n",
    "        frequencies = (y.sum(axis=0)/float(y.sum().sum())).A.tolist()[0]\n",
    "        labels_sorted_by_frequency = sorted(range(y.shape[1]), key = lambda i: frequencies[i])\n",
    "        self.labels_to_assign = labels_sorted_by_frequency[:self.k]\n",
    "        \n",
    "        return self\n",
    "    \n",
    "    def predict(self, X):\n",
    "        \"\"\"Predict labels for X\n",
    "\n",
    "        Parameters\n",
    "        ----------\n",
    "        X : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix, shape=(n_samples, n_features)\n",
    "            input feature matrix\n",
    "\n",
    "        Returns\n",
    "        -------\n",
    "        :mod:`scipy.sparse` matrix of `{0, 1}`, shape=(n_samples, n_labels)\n",
    "            binary indicator matrix with label assignments\n",
    "        \"\"\"\n",
    "        \n",
    "        prediction = lil_matrix(np.zeros(shape=(X.shape[0], self.n_labels), dtype=int))\n",
    "        prediction[:,self.labels_to_assign] = 1\n",
    "        \n",
    "        return prediction\n",
    "\n",
    "    def predict_proba(self, X):\n",
    "        \"\"\"Predict probabilities of label assignments for X\n",
    "\n",
    "        Parameters\n",
    "        ----------\n",
    "        X : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix, shape=(n_samples, n_features)\n",
    "            input feature matrix\n",
    "\n",
    "        Returns\n",
    "        -------\n",
    "        :mod:`scipy.sparse` matrix of `float in [0.0, 1.0]`, shape=(n_samples, n_labels)\n",
    "            matrix with label assignment probabilities\n",
    "        \"\"\"\n",
    "        \n",
    "        probabilities = lil_matrix(np.zeros(shape=(X.shape[0], self.n_labels), dtype=float))\n",
    "        probabilities[:,self.labels_to_assign] = 1.0\n",
    "        \n",
    "        return probabilities\n",
    "\n",
    "clf = AssignKBestLabels(k=2)\n",
    "clf.fit(X_train, y_train)\n",
    "prediction = clf.predict(X_test)\n",
    "accuracy_score(y_test, prediction)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Selecting the base class\n",
    "\n",
    "Madjarov et al. divide approach to multi-label classification into three categories, you should select a scikit-multilearn base class according to the philosophy behind your classifier:\n",
    "\n",
    "- algorithm adaptation, when a single-label algorithm is directly adapted to the multi-label case, ex. Decision Trees can be adapted by taking multiple labels into consideration in decision functions, for now the base function for this approach is ``MLClassifierBase``\n",
    "\n",
    "- problem transformation, when the multi-label problem is transformed to a set of single-label problems, solved there and converted to a multi-label solution afterwards - for this approach we provide a comfortable ``ProblemTransformationBase`` base class\n",
    "\n",
    "- ensemble classification, when the multi-label classification is performed by an ensemble of multi-label classifiers to improve performance, overcome overfitting etc. In the case when your classifier concentrates on clustering the label space, you should use :class:`LabelSpacePartitioningClassifier` - which partitions a label space using a cluster class that implements the :class:`LabelSpaceClustererBase` interface.\n",
    "\n",
    "\n",
    "#### Problem transformation\n",
    "\n",
    "Problem transformation approach is centred around the idea of converting a multi-label problem into one or more single-label problems, which are usually solved by single- or multi-class classifiers. Scikit-learn is the de facto standard source of Python implementations of single-label classifiers.\n",
    "\n",
    "To perform the transformation, every problem transformation classifier needs a base classifier. As all classifiers that follow scikit-s BaseEstimator a clonable, scikit-multilearn's base class for problem transformation classifiers requires an instance of a base classifier in initialization. Such an instance can be cloned if needed, and its parameters can be set up comfortably.\n",
    "\n",
    "The biggest problem with joining single-label scikit classifiers with multi-label classifiers is that there exists no way to learn whether a given scikit classifier accepts sparse matrices as input for ``fit``/``predict`` functions. For this reason ``ProblemTransformationBase`` requires another parameter - ``require_dense`` : ``[ bool, bool ]`` - a list/tuple of two boolean values. If the first one is true, that means the base classifier expects a dense (scikit-compatible array-like of array-likes) representation of the sample feature space ``X``. If the second one is true - the target space ``y`` is passed to the base classifier as an array like of numbers. In case any of these are false - the arguments are passed as a sparse matrix.\n",
    "\n",
    "If the ``required_dense`` argument is not passed, it is set to ``[false, false]`` if a classifier inherits ::class::``MLClassifierBase`` and to ``[true, true]`` as a fallback otherwise. In short, it assumes dense representation is required for base classifier if the base classifier is not a scikit-multilearn classifier.\n",
    "\n",
    "\n",
    "\n",
    "### Ensemble classification\n",
    "\n",
    "Ensemble classification is an approach of transforming a multi-label classification problem into a family (an ensemble) of multi-label subproblems. \n",
    "\n",
    "\n",
    "\n",
    "### Unit testing classifiers\n",
    "\n",
    "Scikit-multilearn provides a base unit test class for testing classifiers. Please check ``skmultilearn.tests.classifier_basetest`` for a general framework for testing the multi-label classifier.\n",
    "\n",
    "Currently tests test three capabilities of the classifier:\n",
    "- whether the classifier works with dense/sparse input data :func:`ClassifierBaseTest.assertClassifierWorksWithSparsity`\n",
    "- whether the classifier predicts probabilities using ``predict_proba`` for dense/sparse input data :func:`ClassifierBaseTest.assertClassifierPredictsProbabilities`\n",
    "- whether it is clonable and works with scikit-learn's cross-validation classes :func:`ClassifierBaseTest.assertClassifierWorksWithCV`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
