Metadata-Version: 1.1
Name: pboost
Version: 0.1.2
Summary: Parallel Implementation of Boosting Algorithms with MPI.
Home-page: https://github.com/mbasbug/pboost
Author: Mehmet Basbug
Author-email: mbasbug@princeton.edu
License: UNKNOWN
Description: *******************************************************************************
        PBOOST : Parallel Implementation of Boosting Algorithms with MPI
        
        Mehmet Basbug
        PhD Candidate
        Princeton University
        
        February 2013
        *******************************************************************************
        
        A. QUICK INSTALL
        -------------------------------------------------------------------------------
        First make sure that you have OpenMPI and HDF5 installed. Then run the following command to install pboost with all the dependencies:
        
        pip install pboost
        
        OR
        
        easy_install pboost
        
        This will also create a directory named 'pboost' in your home folder.
        
        This should work for many users; however, you may be missing some required libraries. On Debian/Ubuntu, run the following command first (add sudo if necessary):
        
        apt-get install build-essential python-dev python-setuptools python-pip python-numpy python-scipy python-matplotlib openmpi-bin openmpi-doc libopenmpi-dev libhdf5-serial-dev
        
        
        B. RUNNING THE DEMO
        -------------------------------------------------------------------------------
        To run the demo, execute the following command
        
        mpirun -np 2 python ~/pboost/run.py 0
         
        You should see something similar to the following output
        
        Info : Confusion matrix for combined validation error with zero threshold :
        {'FP': 11, 'TN': 89, 'FN': 10, 'TP': 90}
        Info : Confusion matrix for testing error with zero threshold :
        {'FP': 11, 'TN': 89, 'FN': 12, 'TP': 88}
        Info : Training Error of the final classifier : 0.0
        Info : Validation Error of the final classifier : 0.105
        Info : Testing Error of the final classifier : 0.115
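        
        The reported error rates follow from the confusion matrices: the error is the fraction of misclassified examples, (FP + FN) / (FP + TN + FN + TP). The short sketch below recomputes the validation and testing errors from the sample output above. The function name error_rate is illustrative and is not part of pboost.
        
        ```python
        def error_rate(confusion):
            """Return the fraction of misclassified examples in a confusion matrix."""
            wrong = confusion['FP'] + confusion['FN']
            total = sum(confusion.values())
            return float(wrong) / total
        
        
        validation = {'FP': 11, 'TN': 89, 'FN': 10, 'TP': 90}
        testing = {'FP': 11, 'TN': 89, 'FN': 12, 'TP': 88}
        
        print(error_rate(validation))  # 0.105
        print(error_rate(testing))     # 0.115
        ```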
        
        Note: if you installed pboost with sudo privileges, you may need to change the ownership of the pboost folder and the files in it. On Linux, run the following command:
        
        sudo chown -R {YOUR_USERNAME} ~/pboost
        
        
        C. CREATING YOUR CONFIGURATION FILE
        -------------------------------------------------------------------------------
        The recommended way is to append your new configurations to the configurations.cfg file in the pboost folder. Alternatively, you can create an empty text file with the .cfg extension. A short explanation of each option can be found in configurations.cfg. Detailed explanations follow:
        
        # train_file    = [HDF5 File]
        #                 Raw file of examples and their attributes for training 
        #                 (Must be specified)
        #
        # test_file     = [HDF5 File]
        #                 Raw file of examples and their attributes for testing
        #                 (Optional)
        
        Data files (train_file and test_file) should be in HDF5 format. The raw data dataset must be named 'data' and the label dataset must be named 'label'. Alternatively, one can use the last column of the 'data' dataset as the label.
        
        If a separate 'label' dataset is not given, the program will assume the last column as labels. At this point only binary labeling is supported; therefore, all the labels should be zero or one.
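        
        The last-column convention can be pictured with the sketch below. It works on plain Python lists rather than on an HDF5 file, and split_labels is an illustrative helper, not a pboost function.
        
        ```python
        def split_labels(rows):
            """Split rows into (features, labels), taking the last column as the label."""
            features = [row[:-1] for row in rows]
            labels = [row[-1] for row in rows]
            # Only binary labels are supported, so reject anything other than 0 or 1.
            bad = [label for label in labels if label not in (0, 1)]
            if bad:
                raise ValueError('labels must be 0 or 1, found: %r' % bad)
            return features, labels
        
        
        rows = [
            [5.1, 3.5, 1],
            [4.9, 3.0, 0],
            [6.2, 2.9, 1],
        ]
        features, labels = split_labels(rows)
        print(features)  # [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]]
        print(labels)    # [1, 0, 1]
        ```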
        
        If there is no separate test dataset, please leave the test_file option blank.
        
        # factory_files = [.py File(s), separated by comma] 
        #                 File(s) containing the concrete implementations of 
        #                 the BaseFeatureFactory class
        #                 (Leaving blank will include default implementation)
        
        The factory_files option refers to the files containing user-defined feature factories. See diabetes_feature_factory.py in the demo folder for examples of Feature Factory classes.
        
        In order to use each column of your raw data as a feature, leave this option blank or use the keyword default.
        
        More than one option can be specified; for instance, the following is perfectly okay:
        	factory_files = diabetes_feature_factory.py,default
        
        Each user-defined Feature Factory class should inherit from BaseFeatureFactory and override the blueprint and produce methods. The blueprint method defines the functionality of a feature. The produce method supplies the arguments that actually create the features; it should call the make method once for every feature it creates. All Feature Factory files should be in the working directory.
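        
        The division of labor between blueprint, produce and make can be sketched as follows. This is only an illustration: the BaseFeatureFactory below is a simplified stand-in written for this example, so its constructor and method signatures are assumptions and may differ from the real pboost class, and ColumnRatioFactory is a hypothetical factory. It assumes the factory overrides blueprint and produce, and calls make from within produce. Refer to diabetes_feature_factory.py for a working factory.
        
        ```python
        class BaseFeatureFactory(object):
            """Simplified stand-in for pboost's base class (signatures are assumed)."""
        
            def __init__(self, data):
                self.data = data        # raw data: one row per example
                self.features = []      # one list of values per created feature
        
            def make(self, *args):
                # Create one feature by applying the blueprint to the given arguments.
                self.features.append(self.blueprint(*args))
        
            def blueprint(self, *args):
                raise NotImplementedError
        
            def produce(self):
                raise NotImplementedError
        
        
        class ColumnRatioFactory(BaseFeatureFactory):
            """Hypothetical factory: one feature per ordered pair of columns."""
        
            def blueprint(self, i, j):
                # Functionality of a feature: the ratio of column i to column j.
                return [row[i] / float(row[j]) for row in self.data]
        
            def produce(self):
                # Supply the arguments, calling make once per feature.
                n_columns = len(self.data[0])
                for i in range(n_columns):
                    for j in range(n_columns):
                        if i != j:
                            self.make(i, j)
        
        
        factory = ColumnRatioFactory([[1.0, 2.0], [4.0, 8.0]])
        factory.produce()
        print(factory.features)  # [[0.5, 0.5], [2.0, 2.0]]
        ```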
        
        # working_dir   = [string]
        #                 Path to a shared disk location containing data files, factory files 
        #                 (Leaving blank will set it to the same directory as this file)
        
        Working directory should be on a shared disk space. Data files and feature factory files should be put in that directory.
        
        # algorithm     = [string]
        #                 Boosting algorithm to use. 
        #                 Current options are "conf-rated" and "adaboost"
        #                 (Leaving blank will set this field to "conf-rated")
        
        Currently only AdaBoost ("adaboost") and Confidence-Rated Boosting ("conf-rated") are supported. The latter is recommended, as it is likely to converge faster.
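        
        For readers unfamiliar with boosting, the sketch below shows one round of the textbook AdaBoost weight update. It is not pboost's implementation: it uses labels in {-1, +1} (the usual formulation) rather than the 0/1 labels of the data files, it assumes the weighted error is strictly between 0 and 1, and the function name adaboost_round is illustrative.
        
        ```python
        import math
        
        
        def adaboost_round(weights, labels, predictions):
            """One AdaBoost round: return (alpha, new_weights) for labels in {-1, +1}."""
            # Weighted error of the weak hypothesis under the current distribution.
            error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
            alpha = 0.5 * math.log((1.0 - error) / error)
            # Increase the weight of mistakes, decrease the weight of correct examples.
            updated = [w * math.exp(-alpha * y * h)
                       for w, y, h in zip(weights, labels, predictions)]
            total = sum(updated)
            return alpha, [w / total for w in updated]
        
        
        weights = [0.25, 0.25, 0.25, 0.25]
        labels = [1, 1, -1, -1]
        predictions = [1, 1, -1, 1]   # the last example is misclassified
        
        alpha, weights = adaboost_round(weights, labels, predictions)
        print(alpha)
        print(weights)
        ```
        
        After the update the misclassified example carries half of the total weight, which forces the next weak hypothesis to pay attention to it.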
        
        # rounds        = [integer]
        #                 Number of rounds that the boosting algorithm should iterate for 
        #                 (Leaving blank will set this number to 100)
        
        The runtime of the program is directly proportional to the number of rounds. A very large number is not recommended: the training error of boosting usually converges within a moderate number of rounds, and a large number might cause overfitting.
        
        # xval_no       = [integer]
        #                 Number of cross-validation folds
        #                 (Leaving blank will set it to 1, which disables cross-validation)
        
        To use k-fold cross validation, set this number to k. Cross validation is useful when you do not have a separate testing dataset; in that case the validation error can serve as an estimate of the testing error. A large number should be avoided, as it increases the runtime: for k-fold cross validation pboost trains k+1 classifiers, so a run takes roughly k times longer than one without cross validation.
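        
        As a reminder of what k-fold cross validation does, the sketch below partitions example indices into k folds so that every example is used for validation exactly once. The interleaved assignment of examples to folds is an assumption made for illustration; pboost may partition the data differently, and kfold_indices is not a pboost function.
        
        ```python
        def kfold_indices(n_examples, k):
            """Return k (train, validation) index pairs; every example is validated once."""
            folds = [list(range(i, n_examples, k)) for i in range(k)]
            splits = []
            for i in range(k):
                validation = folds[i]
                train = sorted(idx for j in range(k) if j != i for idx in folds[j])
                splits.append((train, validation))
            return splits
        
        
        for train, validation in kfold_indices(6, 3):
            print(train, validation)
        # [1, 2, 4, 5] [0, 3]
        # [0, 2, 3, 5] [1, 4]
        # [0, 1, 3, 4] [2, 5]
        ```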
        
        # max_memory    = [integer/float]
        #                 Amount of available RAM to each core, in GigaBytes 
        #                 (Leaving blank will set this number to 2, that is 2GB per core)
        
        Since MPI does not treat cores on the same node any differently from cores on other nodes, this option should be specified on a per-core basis.
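        
        One way to arrive at a per-core value is to divide the RAM of a node by the number of MPI processes placed on it, optionally keeping some memory in reserve. The helper below is a sketch of that arithmetic; memory_per_core and its parameters are illustrative names, and the numbers are made up.
        
        ```python
        def memory_per_core(node_ram_gb, processes_per_node, reserve_gb=0.0):
            """Return the RAM budget, in gigabytes, for each MPI process on a node."""
            if processes_per_node < 1:
                raise ValueError('processes_per_node must be at least 1')
            return (node_ram_gb - reserve_gb) / float(processes_per_node)
        
        
        # A node with 64 GB of RAM running 16 MPI processes, keeping 8 GB free
        # for the operating system.
        print(memory_per_core(64, 16, reserve_gb=8))  # 3.5
        ```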
        
        # show_plots    = [y/n]
        #                 Enable to show the plots before the program terminates
        #                 (Leaving blank will set this field to y)
        
        
        All the necessary data is always stored in the output folder, so plots can also be generated later on.
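        
        The options above can be put together as in the sketch below, which builds one configuration with Python's standard configparser module and prints it in .cfg syntax. The section name 'Configuration 1' and the file name my_train.h5 are assumptions made for illustration; copy the section naming actually used in the configurations.cfg file shipped with pboost.
        
        ```python
        import configparser
        import io
        
        config = configparser.ConfigParser()
        # The section name is an assumption; follow the naming in configurations.cfg.
        config['Configuration 1'] = {
            'train_file': 'my_train.h5',      # hypothetical file name
            'test_file': '',                  # blank: no separate test dataset
            'factory_files': 'default',       # use each raw column as a feature
            'working_dir': '',                # blank: same directory as the cfg file
            'algorithm': 'conf-rated',
            'rounds': '100',
            'xval_no': '5',                   # 5-fold cross validation
            'max_memory': '2',                # gigabytes per core
            'show_plots': 'n',
        }
        
        stream = io.StringIO()
        config.write(stream)
        text = stream.getvalue()
        print(text)
        ```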
        
        D. RUNNING YOUR PROGRAM
        -------------------------------------------------------------------------------
        run.py takes two arguments, cn and cp:
          cn : Configuration numbers to process
          cp (optional) : Path to the configuration file
          
        A typical command to run pboost looks like 
         
        mpirun -np NUMBER_OF_PROCESSORS python run.py -cp CONF_PATH CONF_NUM_1 CONF_NUM_2
        
        The default setting for cp is '~/pboost/configurations.cfg'
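        
        The argument layout described above (an optional -cp path followed by one or more configuration numbers) can be mimicked with argparse as in the sketch below. This is not the parser used by run.py, only an illustration of how such a command line is interpreted; the file name my_experiments.cfg is made up.
        
        ```python
        import argparse
        
        
        def build_parser():
            """Build a parser mirroring the cn and cp arguments described above."""
            parser = argparse.ArgumentParser(prog='run.py')
            parser.add_argument('-cp', default='~/pboost/configurations.cfg',
                                help='path to the configuration file')
            parser.add_argument('cn', nargs='+', type=int,
                                help='configuration numbers to process')
            return parser
        
        
        args = build_parser().parse_args(['-cp', 'my_experiments.cfg', '1', '2'])
        print(args.cp)  # my_experiments.cfg
        print(args.cn)  # [1, 2]
        ```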
        
        The output of the program is written to the folder out_{CONF_NUM} in the working directory. The output folder contains the following files:
        
        feature factory files : All Feature Factory files needed to make the features
        final_classifier.py : A script to classify any given raw data
        hypotheses.npy : Data file containing hypotheses found by pboost
        dump.npz : Data file containing predictions and other information about the job
        
        One can generate plots by running plot.py with dump.npz as its input.
        
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Framework :: Buildout
Classifier: Intended Audience :: Information Technology
Classifier: License :: Free for non-commercial use
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
