Metadata-Version: 2.1
Name: certstream-analytics
Version: 0.1.4
Summary: certstream + analytics
Home-page: https://github.com/huydhn/certstream-analytics
Author: Huy Do
Author-email: huydhn@gmail.com
License: MIT
Description: # Certstream + Analytics
        
        [![Build Status](https://travis-ci.org/huydhn/certstream-analytics.svg?branch=master)](https://travis-ci.org/huydhn/certstream-analytics)
        [![codecov.io](https://codecov.io/gh/huydhn/certstream-analytics/master.svg)](http://codecov.io/gh/huydhn/certstream-analytics?branch=master)
        
        
        # Installation
        
        The package can be installed from
        [PyPI](https://pypi.org/project/certstream-analytics):
        
        ```
        pip install certstream-analytics
        ```
        
        # Usage
        
        ```python
        import time
        
        from certstream_analytics.analysers import WordSegmentation
        from certstream_analytics.analysers import IDNADecoder
        from certstream_analytics.analysers import HomoglyphsDecoder
        
        from certstream_analytics.transformers import CertstreamTransformer
        from certstream_analytics.storages import ElasticsearchStorage
        from certstream_analytics.stream import CertstreamAnalytics
        
        done = False
        
        # These analysers are run in the order in which they are listed
        analyser = [
            IDNADecoder(),
            HomoglyphsDecoder(),
            WordSegmentation(),
        ]
        
        # The following fields are filtered out and indexed:
        # - String: domain
        # - List: SAN
        # - List: Trust chain
        # - Timestamp: Not before
        # - Timestamp: Not after
        # - Timestamp: Seen
        transformer = CertstreamTransformer()
        
        # Index the data in Elasticsearch
        storage = ElasticsearchStorage(hosts=['localhost:9200'])
        
        consumer = CertstreamAnalytics(transformer=transformer,
                                       storage=storage,
                                       analyser=analyser)
        # The consumer is run in another thread so this function is non-blocking
        consumer.start()
        
        while not done:
            time.sleep(1)
        
        consumer.stop()
        ```
        
        ## IDNA decoder
        This analyser decodes IDNA domain names into Unicode for further
        processing downstream.  Normally, it is the very first analyser to be
        run.  If the analyser encounters a malformed IDNA domain string, it
        keeps the domain as it is.
        
        ```python
        from certstream_analytics.analysers import IDNADecoder
        
        decoder = IDNADecoder()
        
        # Just an example dummy record
        record = {
            'all_domains': [
                'xn--f1ahbgpekke1h.xn--p1ai',
            ]
        }
        
        # The domain name will now become 'укрэмпужск.рф'
        print(decoder.run(record))
        ```
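        
        The decoding step can be approximated with Python's built-in `idna` codec.
        This is only a minimal sketch and not necessarily how `IDNADecoder` is
        implemented; the helper name `decode_domain` is hypothetical.
        
        ```python
        def decode_domain(domain):
            """Decode an IDNA (punycode) domain into Unicode when possible."""
            try:
                # The built-in 'idna' codec decodes each dot-separated label
                return domain.encode('ascii').decode('idna')
            except UnicodeError:
                # Malformed input: keep the domain as it is
                return domain
        
        
        print(decode_domain('xn--mnchen-3ya.de'))  # münchen.de
        print(decode_domain('xn--!!!.com'))        # malformed, left unchanged
        ```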
        
        ## Homoglyphs decoder
        Many phishing websites use [homoglyphs](https://en.wikipedia.org/wiki/Homoglyph)
        to lure victims.  Some common examples include 'l' and 'i' or the
        Unicode character RHO '𝞀' and 'p'.  The homoglyphs decoder uses the excellent
        [confusable_homoglyphs](https://github.com/vhf/confusable_homoglyphs) package to
        generate all potential alternative domain names in ASCII.
        
        ```python
        from certstream_analytics.analysers import HomoglyphsDecoder
        
        # If the greedy flag is set, all alternative domains will be returned
        decoder = HomoglyphsDecoder(greedy=False)
        
        # Just an example dummy record
        record = {
            'all_domains': [
                # MATHEMATICAL MONOSPACE SMALL P
                '*.𝗉aypal.com',
        
                # MATHEMATICAL SANS-SERIF BOLD SMALL RHO
                '*.𝗉ay𝞀al.com',
            ]
        }
        
        # The domain name will now be converted to '*.paypal.com' with the ASCII
        # character p
        print(decoder.run(record))
        ```
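        
        To illustrate the idea without the confusable_homoglyphs data set, the
        sketch below swaps in a different technique: Unicode NFKC normalisation
        from the standard library plus a tiny hand-written table.  The table
        `CONFUSABLES` and the function `ascii_alternatives` are hypothetical and
        cover only a few characters; the real decoder may work differently.
        
        ```python
        import itertools
        import unicodedata
        
        # Hypothetical, hand-picked table; a real confusables list is far larger
        CONFUSABLES = {
            '\u03c1': ['p'],  # GREEK SMALL LETTER RHO
            '\u0430': ['a'],  # CYRILLIC SMALL LETTER A
            '\u043e': ['o'],  # CYRILLIC SMALL LETTER O
        }
        
        
        def ascii_alternatives(domain):
            """Return every candidate spelling of a domain after substitution."""
            # NFKC folds styled letters (e.g. mathematical alphabets) to plain ones
            normalised = unicodedata.normalize('NFKC', domain)
            # Each character maps to its look-alikes, or to itself if unknown
            choices = [CONFUSABLES.get(char, [char]) for char in normalised]
            return [''.join(candidate) for candidate in itertools.product(*choices)]
        
        
        # MATHEMATICAL SANS-SERIF SMALL P and MATHEMATICAL SANS-SERIF BOLD SMALL RHO
        print(ascii_alternatives('*.\U0001d5c9ay\U0001d780al.com'))
        ```
        
        A character with several look-alikes would simply list more than one
        entry in the table, and the product then yields one domain per
        combination.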
        
        ## Aho-Corasick
        A domain and its SAN from Certstream are compared against a list of
        the most popular [domains](https://github.com/opendns/public-domain-lists)
        (from OpenDNS) using the Aho-Corasick algorithm.  This is a simple check to
        filter out some of the most obvious phishing domains.  For example,
        *www.facebook.com.msg40.site* matches *facebook* because *facebook* is in
        the above list of most popular domains (I wonder how long it is going to last).
        
        ```python
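        import time
        
        from certstream_analytics.analysers import AhoCorasickDomainMatching
        from certstream_analytics.reporter import FileReporter
        from certstream_analytics.stream import CertstreamAnalytics
        from certstream_analytics.transformers import CertstreamTransformer
        
        done = False
        
        transformer = CertstreamTransformer()
        
        # Print the list of matching domains
        reporter = FileReporter('matching-results.txt')
        
        # Load the popular domains, one per line
        with open('opendns-top-domains.txt') as fhandle:
            domains = [line.rstrip() for line in fhandle]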
        
        # The list of domains to match against
        domain_matching_analyser = AhoCorasickDomainMatching(domains)
        
        consumer = CertstreamAnalytics(transformer=transformer,
                                       analyser=domain_matching_analyser,
                                       reporter=reporter)
        
        # Need to think about what to do with the matching result
        consumer.start()
        
        while not done:
            time.sleep(1)
        
        consumer.stop()
        ```
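        
        For readers unfamiliar with the algorithm, here is a compact pure-Python
        sketch of Aho-Corasick matching.  It is not the implementation used by
        `AhoCorasickDomainMatching`; the functions `build_automaton` and
        `find_matches` are hypothetical names chosen for this illustration.
        
        ```python
        from collections import deque
        
        
        def build_automaton(keywords):
            """Build the trie, failure links and outputs for a set of keywords."""
            goto, fail, output = [{}], [0], [[]]
            for keyword in keywords:
                node = 0
                for char in keyword:
                    if char not in goto[node]:
                        goto.append({})
                        fail.append(0)
                        output.append([])
                        goto[node][char] = len(goto) - 1
                    node = goto[node][char]
                output[node].append(keyword)
        
            # Breadth-first pass: depth-1 nodes keep the root as failure link
            queue = deque(goto[0].values())
            while queue:
                node = queue.popleft()
                for char, child in goto[node].items():
                    queue.append(child)
                    fallback = fail[node]
                    while fallback and char not in goto[fallback]:
                        fallback = fail[fallback]
                    fail[child] = goto[fallback].get(char, 0)
                    # Inherit keywords that end at the failure target
                    output[child] = output[child] + output[fail[child]]
            return goto, fail, output
        
        
        def find_matches(domain, automaton):
            """Return every keyword that occurs as a substring of the domain."""
            goto, fail, output = automaton
            node, matches = 0, []
            for char in domain:
                while node and char not in goto[node]:
                    node = fail[node]
                node = goto[node].get(char, 0)
                matches.extend(output[node])
            return matches
        
        
        automaton = build_automaton(['facebook', 'google'])
        print(find_matches('www.facebook.com.msg40.site', automaton))
        ```
        
        The automaton is built once from the keyword list, after which each
        domain is scanned in a single pass regardless of how many keywords
        there are.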
        
        ## Word segmentation
        In order to improve the accuracy of the matching algorithm, we segment
        the domains into English words using
        [wordsegment](https://github.com/grantjenks/python-wordsegment).
        
        ```python
        from certstream_analytics.analysers import WordSegmentation
        
        wordsegmentation = WordSegmentation()
        
        # Just an example dummy record
        record = {
            'all_domains': [
                'login-appleid.apple.com.managesupport.co',
            ]
        }
        
        # The returned output is as follows:
        #
        # {
        #   'analyser': 'WordSegmentation',
        #   'output': {
        #     'login-appleid.apple.com.managesupport.co': [
        #       'login',
        #       'apple',
        #       'id',
        #       'apple',
        #       'com',
        #       'manage',
        #       'support',
        #       'co'
        #     ],
        #   },
        # }
        #
        print(wordsegmentation.run(record))
        ```
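        
        The segmentation idea can be sketched with a small fixed vocabulary and
        dynamic programming.  This is a simplified stand-in: wordsegment itself
        relies on word-frequency data rather than a fixed word list, and the
        names `VOCABULARY`, `segment` and `segment_domain` are hypothetical.
        
        ```python
        import re
        from functools import lru_cache
        
        # Hypothetical toy vocabulary, just large enough for the example
        VOCABULARY = {'login', 'apple', 'id', 'com', 'manage', 'support', 'co'}
        
        
        def segment(text):
            """Split text into the fewest vocabulary words, or keep it whole."""
            @lru_cache(maxsize=None)
            def best(start):
                if start == len(text):
                    return ()
                candidates = []
                for end in range(start + 1, len(text) + 1):
                    word = text[start:end]
                    if word in VOCABULARY:
                        rest = best(end)
                        if rest is not None:
                            candidates.append((word,) + rest)
                return min(candidates, key=len) if candidates else None
        
            result = best(0)
            return list(result) if result is not None else [text]
        
        
        def segment_domain(domain):
            """Segment every dot- or hyphen-separated field of a domain."""
            words = []
            for field in re.split(r'[^a-z0-9]+', domain.lower()):
                if field:
                    words.extend(segment(field))
            return words
        
        
        print(segment_domain('login-appleid.apple.com.managesupport.co'))
        ```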
        
        ## Features generator
        A list of features for each domain will also be generated so that they
        can be used for classification jobs further downstream.  The list
        includes:
        
        - The number of dot-separated fields in the domain, for example, www.google.com has 3.
        - The overall length of the domain in characters.
        - The length of the longest dot-separated field.
        - The length of the TLD, e.g. .online (6) or .download (8) is longer than .com (3).
        - The randomness level of the domain.  The [Nostril](https://github.com/casics/nostril)
          package is used to check how many of the words returned by the
          WordSegmentation analyser are nonsense.
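        
        The first four features are simple to compute by hand.  The sketch below
        is only an illustration: the function `domain_features` and its
        dictionary keys are hypothetical, and the Nostril-based randomness
        feature is left out.
        
        ```python
        def domain_features(domain):
            """Compute basic length-based features for a domain name."""
            fields = domain.split('.')
            return {
                # Number of dot-separated fields
                'fields': len(fields),
                # Overall length of the domain in characters
                'length': len(domain),
                # Length of the longest dot-separated field
                'longest_field': max(len(field) for field in fields),
                # Length of the TLD, i.e. the last field
                'tld_length': len(fields[-1]),
            }
        
        
        print(domain_features('www.google.com'))
        ```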
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
