Metadata-Version: 2.1
Name: zipfanalysis
Version: 0.5
Summary: Tools for analysing Zipf's law from text samples
Home-page: https://github.com/chasmani/zipfanalysis
Author: Charlie Pilgrim
Author-email: pilgrimcharlie2@gmail.com
License: UNKNOWN
Description: ============
        zipfanalysis
        ============
        
        Python tools for analysing Zipf's law in text samples.
        
        Install the package from PyPI using the terminal command:
        ::
        
        	pip3 install zipfanalysis
        
        -----
        Usage
        -----
        
        The package can be used from within Python scripts to estimate Zipf exponents, assuming a simple power law model for 
        word frequencies and ranks. To use the package, import it:
        ::
        
        	import zipfanalysis
        
        -------------
        Simple Method
        -------------
        
        The easiest way to analyse a book or text file is to pass its path to one of the estimators:
        ::
        
        	alpha_clauset = zipfanalysis.clauset("path_to_book.txt")
        
        	alpha_pdf = zipfanalysis.ols_pdf("path_to_book.txt", min_frequency=3)
        
        	alpha_cdf = zipfanalysis.ols_cdf("path_to_book.txt", min_frequency=3)
        
        	alpha_abc = zipfanalysis.abc("path_to_book.txt")
        
        ---------------
        In Depth Method
        ---------------
        
        Convert a book or text file to word frequencies, ranked from highest to lowest:
        ::
        
        	word_counts = zipfanalysis.preprocessing.preprocessing.get_rank_frequency_from_text("path_to_book.txt")
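        
        As a rough illustration of what a rank-frequency conversion involves, the sketch below counts the words in a string and sorts the counts from highest to lowest. It is not the package's implementation: the function name ``rank_frequency_sketch`` and the lowercase-and-regex tokenisation are assumptions made for this example.
        
        ```python
        import re
        from collections import Counter
        
        
        def rank_frequency_sketch(text):
            """Return word counts sorted from most to least frequent.
        
            Hypothetical helper for illustration only. Tokenisation here is
            deliberately naive: lowercase the text, then keep runs of letters
            and apostrophes.
            """
            words = re.findall(r"[a-z']+", text.lower())
            counts = Counter(words)
            # Position in the returned list (starting at 1) is the word's rank.
            return sorted(counts.values(), reverse=True)
        
        
        frequencies = rank_frequency_sketch("the cat and the dog and the bird")
        print(frequencies)  # [3, 2, 1, 1, 1]
        ```
        
        Real texts need more careful tokenisation (punctuation, hyphenation, numbers), which this sketch ignores.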
        	
        
        Carry out different types of analysis to fit a power law to the data:
        ::
        
        	# Clauset et al estimator
        	alpha_clauset = zipfanalysis.estimators.clauset.clauset_estimator(word_counts)
        
        	# Ordinary Least Squares regression on log(rank) ~ log(frequency) 
        	# Optional low frequency cut-off
        	alpha_pdf = zipfanalysis.estimators.ols_regression_pdf.ols_regression_pdf_estimator(word_counts, min_frequency=2)
        
        	# Ordinary least squares regression on the complementary cumulative distribution function of ranks
        	# OLS on log(P(R>rank)) ~ log(rank) 
        	# Optional low frequency cut-off 
        	alpha_cdf = zipfanalysis.estimators.ols_regression_cdf.ols_regression_cdf_estimator(word_counts)
        
        	# Approximate Bayesian computation (regression method)
        	# Assumes model of p(rank) = C prob_rank^(-alpha)
        	# prob_rank is a word's rank in an underlying probability distribution
        	alpha_abc = zipfanalysis.estimators.approximate_bayesian_computation.abc_estimator(word_counts)
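        
        To make the regression idea concrete, here is a minimal sketch of an ordinary least squares fit on log-transformed ranks and frequencies, with an optional low frequency cut-off. It assumes the textbook form, regressing log(frequency) on log(rank) under frequency = C rank^(-alpha); the package's own estimators may be set up differently. The name ``ols_exponent_sketch`` is hypothetical.
        
        ```python
        import math
        
        
        def ols_exponent_sketch(frequencies, min_frequency=1):
            """Estimate alpha by least squares on log(frequency) vs log(rank).
        
            Hypothetical helper for illustration only. `frequencies` must be
            sorted from highest to lowest, so that the position in the list
            (starting at 1) is the rank. Ranks whose frequency falls below
            `min_frequency` are left out of the fit.
            """
            points = [
                (math.log(rank), math.log(freq))
                for rank, freq in enumerate(frequencies, start=1)
                if freq >= min_frequency
            ]
            if len(points) < 2:
                raise ValueError("need at least two ranks above the cut-off")
        
            mean_x = sum(x for x, _ in points) / len(points)
            mean_y = sum(y for _, y in points) / len(points)
            sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
            sxx = sum((x - mean_x) ** 2 for x, _ in points)
        
            # Under frequency = C * rank^(-alpha) the fitted slope is -alpha.
            return -sxy / sxx
        
        
        # Synthetic frequencies that follow an exact power law with alpha = 1.
        exact = [1000.0 / rank for rank in range(1, 51)]
        alpha = ols_exponent_sketch(exact)
        ```
        
        On exact power law data the fit recovers the exponent up to floating point error. On real word counts, the many tied low frequencies in the tail affect the fit, which is one reason for the low frequency cut-off.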
        
        ------------------------
        Development - Next Steps
        ------------------------
        
        1. Speed up abc. The current bottleneck is sampling from an infinite power law. This could be sped up because only the frequency vector of ranks is needed, not the whole sample. For example, sample from a uniform distribution, then drop the values into integer-ranked buckets using the inverse CDF.
        
        2. Build in frequency-rank analysis. Convert to a frequency-counts representation, then carry out the fit on that.
        
        3. Add significance testing
        
        4. Add the ability to calculate x_min and fit truncated power laws.
        
        5. Speed up OLS on the cdf
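        
        The bucketing idea in step 1 could look something like the sketch below. It draws uniform random numbers and uses a binary search on a precomputed cumulative table to place each draw into an integer rank bucket, returning only the frequency vector. Two assumptions are worth flagging: the power law is truncated at ``max_rank`` so that the table is finite (the item above concerns an infinite power law), and the function name and parameters are hypothetical rather than part of the package.
        
        ```python
        import bisect
        import itertools
        import random
        
        
        def sample_rank_counts(alpha, n_samples, max_rank, seed=None):
            """Return counts where counts[r - 1] is the number of draws of rank r.
        
            Hypothetical helper for illustration only. Ranks follow
            p(r) proportional to r^(-alpha) for r = 1..max_rank (a truncated
            power law, assumed here to keep the cumulative table finite).
            """
            rng = random.Random(seed)
            weights = [rank ** -alpha for rank in range(1, max_rank + 1)]
            # Unnormalised cumulative distribution over ranks 1..max_rank.
            cumulative = list(itertools.accumulate(weights))
            total = cumulative[-1]
        
            counts = [0] * max_rank
            for _ in range(n_samples):
                u = rng.random() * total
                # Inverse CDF lookup: first bucket whose cumulative weight exceeds u.
                index = bisect.bisect_right(cumulative, u)
                # Guard against u rounding up to the total in floating point.
                index = min(index, max_rank - 1)
                counts[index] += 1
            return counts
        
        
        counts = sample_rank_counts(alpha=1.5, n_samples=10000, max_rank=1000, seed=1)
        ```
        
        Each draw costs one binary search over the table, and the sample itself is never stored, only the counts per rank.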
        
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
