Metadata-Version: 2.1
Name: extruct
Version: 0.12.0
Summary: Extract embedded metadata from HTML markup
Home-page: https://github.com/scrapinghub/extruct
Author: Scrapinghub
Author-email: info@scrapinghub.com
Maintainer: Scrapinghub
Maintainer-email: info@scrapinghub.com
License: UNKNOWN
Description: =======
        extruct
        =======
        
        .. image:: https://github.com/scrapinghub/extruct/workflows/build/badge.svg?branch=master
            :target: https://github.com/scrapinghub/extruct/actions
            :alt: Build Status
        
        .. image:: https://img.shields.io/codecov/c/github/scrapinghub/extruct/master.svg?maxAge=2592000
            :target: https://codecov.io/gh/scrapinghub/extruct
            :alt: Coverage report
        
        .. image:: https://img.shields.io/pypi/v/extruct.svg
           :target: https://pypi.python.org/pypi/extruct
           :alt: PyPI Version
        
        
        *extruct* is a library for extracting embedded metadata from HTML markup.
        
        Currently, *extruct* supports:
        
        - `W3C's HTML Microdata`_
        - `embedded JSON-LD`_
        - `Microformat`_ via `mf2py`_
        - `Facebook's Open Graph`_
        - (experimental) `RDFa`_ via `rdflib`_
        - `Dublin Core Metadata (DC-HTML-2003)`_
        
        .. _W3C's HTML Microdata: http://www.w3.org/TR/microdata/
        .. _embedded JSON-LD: http://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents
        .. _RDFa: https://www.w3.org/TR/html-rdfa/
        .. _rdflib: https://pypi.python.org/pypi/rdflib/
        .. _Microformat: http://microformats.org/wiki/Main_Page
        .. _mf2py: https://github.com/microformats/mf2py
        .. _Facebook's Open Graph: http://ogp.me/
        .. _Dublin Core Metadata (DC-HTML-2003): https://www.dublincore.org/specifications/dublin-core/dcq-html/2003-11-30/
        
        The microdata algorithm revisits `this Scrapinghub blog post`_, which shows how to use EXSLT extensions.
        
        .. _this Scrapinghub blog post: http://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath/
        
        
        Installation
        ------------
        
        ::
        
            pip install extruct
        
        
        Usage
        -----
        
        All-in-one extraction
        +++++++++++++++++++++
        
        The simplest way to use extruct is to call
        ``extruct.extract(htmlstring, base_url=base_url)``
        with an HTML string and an optional base URL.
        
        Let's try this on a webpage that uses all the syntaxes supported (RDFa with `ogp`_).
        
        First fetch the HTML using python-requests and then feed the response body to ``extruct``::
        
          >>> import extruct
          >>> import requests
          >>> import pprint
          >>> from w3lib.html import get_base_url
          >>>
          >>> pp = pprint.PrettyPrinter(indent=2)
          >>> r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')
          >>> base_url = get_base_url(r.text, r.url)
          >>> data = extruct.extract(r.text, base_url=base_url)
          >>>
          >>> pp.pprint(data)
          { 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',
                                                'content': 'What is Open Graph Protocol '
                                                           'and why you need it? Learn to '
                                                           'implement Open Graph Protocol '
                                                           'for Facebook on your website. '
                                                           'Open Graph Protocol Meta Tags.',
                                                'name': 'description'}],
                                'namespaces': {},
                                'terms': []}],
            'json-ld': [ { '@context': 'https://schema.org',
                           '@id': '#organization',
                           '@type': 'Organization',
                           'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',
                           'name': 'Optimize Smart',
                           'sameAs': [ 'https://www.facebook.com/optimizesmart/',
                                       'https://uk.linkedin.com/in/analyticsnerd',
                                       'https://www.youtube.com/user/optimizesmart',
                                       'https://twitter.com/analyticsnerd'],
                           'url': 'https://www.optimizesmart.com/'}],
            'microdata': [ { 'properties': {'headline': ''},
                             'type': 'http://schema.org/WPHeader'}],
            'microformat': [ { 'children': [ { 'properties': { 'category': [ 'specialized-tracking'],
                                                               'name': [ 'Open Graph '
                                                                         'Protocol for '
                                                                         'Facebook '
                                                                         'explained with '
                                                                         'examples\n'
                                                                         '\n'
                                                                         'Specialized '
                                                                         'Tracking\n'
                                                                         '\n'
                                                                         '\n'
                                                                         (...)
                                                                         'Follow '
                                                                         '@analyticsnerd\n'
                                                                         '!function(d,s,id){var '
                                                                         "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                                                         "'script', "
                                                                         "'twitter-wjs');"]},
                                               'type': ['h-entry']}],
                               'properties': { 'name': [ 'Open Graph Protocol for '
                                                         'Facebook explained with '
                                                         'examples\n'
                                                         (...)
                                                         'Follow @analyticsnerd\n'
                                                         '!function(d,s,id){var '
                                                         "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                                         "'script', 'twitter-wjs');"]},
                               'type': ['h-feed']}],
            'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},
                             'properties': [ ('og:locale', 'en_US'),
                                             ('og:type', 'article'),
                                             ( 'og:title',
                                               'Open Graph Protocol for Facebook '
                                               'explained with examples'),
                                             ( 'og:description',
                                               'What is Open Graph Protocol and why you '
                                               'need it? Learn to implement Open Graph '
                                               'Protocol for Facebook on your website. '
                                               'Open Graph Protocol Meta Tags.'),
                                             ( 'og:url',
                                               'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),
                                             ('og:site_name', 'Optimize Smart'),
                                             ( 'og:updated_time',
                                               '2018-03-09T16:26:35+00:00'),
                                             ( 'og:image',
                                               'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),
                                             ( 'og:image:secure_url',
                                               'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],
            'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',
                        'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},
                      { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',
                        'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
                        'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],
                        'article:publisher': [ { '@value': 'https://www.facebook.com/optimizesmart/'}],
                        'article:section': [{'@value': 'Specialized Tracking'}],
                        'http://ogp.me/ns#description': [ { '@value': 'What is Open '
                                                                      'Graph Protocol '
                                                                      'and why you need '
                                                                      'it? Learn to '
                                                                      'implement Open '
                                                                      'Graph Protocol '
                                                                      'for Facebook on '
                                                                      'your website. '
                                                                      'Open Graph '
                                                                      'Protocol Meta '
                                                                      'Tags.'}],
                        'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
                        'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
                        'http://ogp.me/ns#locale': [{'@value': 'en_US'}],
                        'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],
                        'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '
                                                                'Facebook explained with '
                                                                'examples'}],
                        'http://ogp.me/ns#type': [{'@value': 'article'}],
                        'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
                        'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],
                        'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}
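        
        The result is a plain dictionary keyed by syntax name, and each value is a list of
        extracted items, so it can be inspected with ordinary Python. The following is a
        minimal sketch that works on a hand-written ``data`` dictionary shaped like the
        output above; the helper name ``count_items`` is hypothetical and not part of extruct.

        ```python
        def count_items(data):
            """Return how many items each syntax produced, skipping empty syntaxes."""
            return {syntax: len(items) for syntax, items in data.items() if items}


        # Hand-written stand-in for the dictionary returned by extruct.extract().
        data = {
            "json-ld": [{"@type": "Organization", "name": "Optimize Smart"}],
            "microdata": [],
            "opengraph": [
                {
                    "namespace": {"og": "http://ogp.me/ns#"},
                    "properties": [("og:type", "article")],
                }
            ],
        }

        counts = count_items(data)
        print(counts)
        ```

        A summary like this can help decide which syntaxes are worth requesting for a given site.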
        
        Select syntaxes
        +++++++++++++++
        To extract only some syntaxes, pass a list of their names as the ``syntaxes`` argument. Valid values are ``'microdata'``, ``'json-ld'``, ``'opengraph'``, ``'microformat'``, ``'rdfa'`` and ``'dublincore'``. If no list is passed, all syntaxes are extracted and returned::
        
          >>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
          >>> base_url = get_base_url(r.text, r.url)
          >>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
          >>>
          >>> pp.pprint(data)
          { 'microdata': [],
            'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                            'fb': 'http://www.facebook.com/2008/fbml',
                                            'og': 'http://ogp.me/ns#'},
                             'properties': [ ('fb:app_id', '308540029359'),
                                             ('og:site_name', 'Songkick'),
                                             ('og:type', 'songkick-concerts:artist'),
                                             ('og:title', 'Elysian Fields'),
                                             ( 'og:description',
                                               'Find out when Elysian Fields is next '
                                               'playing live near you. List of all '
                                               'Elysian Fields tour dates and concerts.'),
                                             ( 'og:url',
                                               'https://www.songkick.com/artists/236156-elysian-fields'),
                                             ( 'og:image',
                                               'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
            'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
                        'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
                        'al:ios:app_store_id': [{'@value': '438690886'}],
                        'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
                        'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                                      'Elysian Fields is '
                                                                      'next playing live '
                                                                      'near you. List of '
                                                                      'all Elysian '
                                                                      'Fields tour dates '
                                                                      'and concerts.'}],
                        'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
                        'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
                        'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
                        'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
                        'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
                        'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}
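        
        In the output above, the Open Graph ``properties`` value is a list of ``(name, value)``
        pairs rather than a dictionary. If lookups by name are more convenient, the pairs can be
        grouped by hand. This is only a sketch: ``og_properties_to_dict`` is a hypothetical
        helper, not an extruct function, and the sample item (including its ``example.com``
        image URLs) is made up to mimic the output shape.

        ```python
        from collections import defaultdict


        def og_properties_to_dict(item):
            """Group the (name, value) pairs of one Open Graph item by property name."""
            grouped = defaultdict(list)
            for name, value in item["properties"]:
                # A property such as og:image may appear several times, so keep a list.
                grouped[name].append(value)
            return dict(grouped)


        # Hand-written item shaped like one entry of data['opengraph'].
        item = {
            "namespace": {"og": "http://ogp.me/ns#"},
            "properties": [
                ("og:site_name", "Songkick"),
                ("og:title", "Elysian Fields"),
                ("og:image", "http://example.com/a.jpg"),
                ("og:image", "http://example.com/b.jpg"),
            ],
        }

        props = og_properties_to_dict(item)
        print(props["og:title"])
        ```

        Keeping every value in a list avoids silently dropping repeated properties.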
        
        
        Uniform
        +++++++
        Another option is to make the output of the microformat, opengraph, microdata, dublincore and json-ld syntaxes uniform, using the following structure: ::
        
            {'@context': 'http://example.com',
             '@type': 'example_type',
             /* all other properties as keys here */
             }
        
        To do so, set ``uniform=True`` when calling ``extract``; it is ``False`` by default for backward compatibility. Here is the same example as before, but with ``uniform`` set to ``True``: ::
        
          >>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
          >>> base_url = get_base_url(r.text, r.url)
          >>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)
          >>>
          >>> pp.pprint(data)
          { 'microdata': [],
            'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                         'fb': 'http://www.facebook.com/2008/fbml',
                                         'og': 'http://ogp.me/ns#'},
                           '@type': 'songkick-concerts:artist',
                           'fb:app_id': '308540029359',
                           'og:description': 'Find out when Elysian Fields is next '
                                             'playing live near you. List of all '
                                             'Elysian Fields tour dates and concerts.',
                           'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
                           'og:site_name': 'Songkick',
                           'og:title': 'Elysian Fields',
                           'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
            'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
                        'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
                        'al:ios:app_store_id': [{'@value': '438690886'}],
                        'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
                        'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                                      'Elysian Fields is '
                                                                      'next playing live '
                                                                      'near you. List of '
                                                                      'all Elysian '
                                                                      'Fields tour dates '
                                                                      'and concerts.'}],
                        'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
                        'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
                        'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
                        'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
                        'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
                        'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}
        
        Note: the RDFa output is not made uniform yet.
        
        Returning HTML node
        +++++++++++++++++++
        
        It is also possible to get a reference to the HTML node of every extracted metadata item.
        This feature is supported only by the microdata syntax.
        
        To use it, set the ``return_html_node`` option of ``extract`` to ``True``.
        As a result, an additional key ``htmlNode`` is included in the result for every
        item. Each node is an ``lxml.etree.Element``: ::
        
          >>> r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
          >>> base_url = get_base_url(r.text, r.url)
          >>> data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)
          >>>
          >>> pp.pprint(data)
          { 'microdata': [ { 'htmlNode': <Element div at 0x7f10f8e6d3b8>,
                             'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'
                                                            'Not your thin sticky pad, '
                                                            'No-Muv is truly the best!',
                                             'image': ['', ''],
                                             'name': ['No-Muv', 'No-Muv'],
                                             'offers': [ { 'htmlNode': <Element div at 0x7f10f8e6d138>,
                                                           'properties': { 'availability': 'http://schema.org/InStock',
                                                                           'price': 'Price:  '
                                                                                    '$45'},
                                                           'type': 'http://schema.org/Offer'},
                                                         { 'htmlNode': <Element div at 0x7f10f8e60f48>,
                                                           'properties': { 'availability': 'http://schema.org/InStock',
                                                                           'price': '(Select '
                                                                                    'Size/Shape '
                                                                                    'for '
                                                                                    'Pricing)'},
                                                           'type': 'http://schema.org/Offer'}],
                                             'ratingValue': ['5.00', '5.00']},
                             'type': 'http://schema.org/Product'}]}
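        
        Element objects cannot be serialized by the standard ``json`` module, so a result that
        contains ``htmlNode`` entries has to be cleaned before it is dumped to JSON. The sketch
        below removes those keys recursively. ``strip_html_nodes`` is a hypothetical helper, and
        ``object()`` stands in for the lxml elements so that the example runs without extruct.

        ```python
        import json


        def strip_html_nodes(value):
            """Return a copy of ``value`` with every 'htmlNode' key removed."""
            if isinstance(value, dict):
                return {k: strip_html_nodes(v) for k, v in value.items() if k != "htmlNode"}
            if isinstance(value, list):
                return [strip_html_nodes(v) for v in value]
            return value


        # Hand-written stand-in for a result obtained with return_html_node=True.
        data = {
            "microdata": [
                {
                    "htmlNode": object(),  # placeholder for an lxml element
                    "type": "http://schema.org/Product",
                    "properties": {
                        "name": "No-Muv",
                        "offers": [
                            {
                                "htmlNode": object(),  # placeholder for an lxml element
                                "type": "http://schema.org/Offer",
                                "properties": {"availability": "http://schema.org/InStock"},
                            }
                        ],
                    },
                }
            ]
        }

        clean = strip_html_nodes(data)
        serialized = json.dumps(clean, sort_keys=True)
        ```

        The original ``data`` is left untouched, so the nodes remain available for further processing.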
        
        Single extractors
        -----------------
        
        You can also use each extractor individually. See below.
        
        Microdata extraction
        ++++++++++++++++++++
        ::
        
          >>> import pprint
          >>> pp = pprint.PrettyPrinter(indent=2)
          >>>
          >>> from extruct.w3cmicrodata import MicrodataExtractor
          >>>
          >>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
          >>> html = """<!DOCTYPE HTML>
          ... <html>
          ...  <head>
          ...   <title>Photo gallery</title>
          ...  </head>
          ...  <body>
          ...   <h1>My photos</h1>
          ...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
          ...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
          ...    <figcaption itemprop="title">The house I found.</figcaption>
          ...   </figure>
          ...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
          ...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
          ...    <figcaption itemprop="title">The mailbox.</figcaption>
          ...   </figure>
          ...   <footer>
          ...    <p id="licenses">All images licensed under the <a itemprop="license"
          ...    href="http://www.opensource.org/licenses/mit-license.php">MIT
          ...    license</a>.</p>
          ...   </footer>
          ...  </body>
          ... </html>"""
          >>>
          >>> mde = MicrodataExtractor()
          >>> data = mde.extract(html, base_url='http://www.example.com/')
          >>> pp.pprint(data)
          [{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                           'title': 'The house I found.',
                           'work': 'http://www.example.com/images/house.jpeg'},
            'type': 'http://n.whatwg.org/work'},
           {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                           'title': 'The mailbox.',
                           'work': 'http://www.example.com/images/mailbox.jpeg'},
            'type': 'http://n.whatwg.org/work'}]
        
        JSON-LD extraction
        ++++++++++++++++++
        ::
        
          >>> import pprint
          >>> pp = pprint.PrettyPrinter(indent=2)
          >>>
          >>> from extruct.jsonld import JsonLdExtractor
          >>>
          >>> html = """<!DOCTYPE HTML>
          ... <html>
          ...  <head>
          ...   <title>Some Person Page</title>
          ...  </head>
          ...  <body>
          ...   <h1>This guy</h1>
          ...     <script type="application/ld+json">
          ...     {
          ...       "@context": "http://schema.org",
          ...       "@type": "Person",
          ...       "name": "John Doe",
          ...       "jobTitle": "Graduate research assistant",
          ...       "affiliation": "University of Dreams",
          ...       "additionalName": "Johnny",
          ...       "url": "http://www.example.com",
          ...       "address": {
          ...         "@type": "PostalAddress",
          ...         "streetAddress": "1234 Peach Drive",
          ...         "addressLocality": "Wonderland",
          ...         "addressRegion": "Georgia"
          ...       }
          ...     }
          ...     </script>
          ...  </body>
          ... </html>"""
          >>>
          >>> jslde = JsonLdExtractor()
          >>>
          >>> data = jslde.extract(html)
          >>> pp.pprint(data)
          [{'@context': 'http://schema.org',
            '@type': 'Person',
            'additionalName': 'Johnny',
            'address': {'@type': 'PostalAddress',
                        'addressLocality': 'Wonderland',
                        'addressRegion': 'Georgia',
                        'streetAddress': '1234 Peach Drive'},
            'affiliation': 'University of Dreams',
            'jobTitle': 'Graduate research assistant',
            'name': 'John Doe',
            'url': 'http://www.example.com'}]
        
        
        RDFa extraction (experimental)
        ++++++++++++++++++++++++++++++
        
        ::
        
          >>> import pprint
          >>> pp = pprint.PrettyPrinter(indent=2)
          >>> from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available
          INFO:rdflib:RDFLib Version: 4.2.1
          /home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
            'parsers will not be available.')
          >>>
          >>> html = """<html>
          ...  <head>
          ...    ...
          ...  </head>
          ...  <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
          ...    <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
          ...       <h2 property="dc:title">The trouble with Bob</h2>
          ...       ...
          ...       <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
          ...       <div property="schema:articleBody">
          ...         <p>The trouble with Bob is that he takes much better photos than I do:</p>
          ...       </div>
          ...      ...
          ...    </div>
          ...  </body>
          ... </html>
          ... """
          >>>
          >>> rdfae = RDFaExtractor()
          >>> pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))
          [{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
            '@type': ['http://schema.org/BlogPosting'],
            'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
            'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
            'http://schema.org/articleBody': [{'@value': '\n'
                                                         '        The trouble with Bob '
                                                         'is that he takes much better '
                                                         'photos than I do:\n'
                                                         '      '}],
            'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]
        
        You'll get a list of expanded JSON-LD nodes.
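        
        In this expanded form, each predicate maps to a list of entries that carry either an
        ``@value`` (a literal) or an ``@id`` (a reference to another node). A small helper can
        collect the literals for one predicate. This is a sketch only: ``literal_values`` is a
        hypothetical name, and the node below is copied by hand from the output above.

        ```python
        def literal_values(node, predicate):
            """Return the '@value' entries stored under ``predicate`` in one expanded node."""
            return [entry["@value"] for entry in node.get(predicate, []) if "@value" in entry]


        # Hand-written node shaped like one item returned by RDFaExtractor.extract().
        node = {
            "@id": "http://www.example.com/alice/posts/trouble_with_bob",
            "@type": ["http://schema.org/BlogPosting"],
            "http://purl.org/dc/terms/title": [{"@value": "The trouble with Bob"}],
            "http://schema.org/creator": [{"@id": "http://www.example.com/index.html#me"}],
        }

        titles = literal_values(node, "http://purl.org/dc/terms/title")
        creators = literal_values(node, "http://schema.org/creator")
        ```

        Here ``creators`` is empty because the creator is given as an ``@id`` reference, not as a literal.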
        
        
        Open Graph extraction
        +++++++++++++++++++++
        
        ::
        
          >>> import pprint
          >>> pp = pprint.PrettyPrinter(indent=2)
          >>>
          >>> from extruct.opengraph import OpenGraphExtractor
          >>>
          >>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
          ... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
          ...  <head>
          ...   <title>Himanshu's Open Graph Protocol</title>
          ...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
          ...   <meta http-equiv="Content-Language" content="en-us" />
          ...   <link rel="stylesheet" type="text/css" href="event-education.css" />
          ...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
          ...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
          ...   <meta property="og:type" content="article"/>
          ...   <meta property="og:url" content="https://www.eventeducation.com/test.php"/>
          ...   <meta property="og:image" content="https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"/>
          ...   <meta property="fb:admins" content="himanshu160"/>
          ...   <meta property="og:site_name" content="Event Education"/>
          ...   <meta property="og:description" content="Event Education provides free courses on event planning and management to event professionals worldwide."/>
          ...  </head>
          ...  <body>
          ...   <div id="fb-root"></div>
          ...   <script>(function(d, s, id) {
          ...               var js, fjs = d.getElementsByTagName(s)[0];
          ...               if (d.getElementById(id)) return;
          ...                  js = d.createElement(s); js.id = id;
          ...                  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=501839739845103";
          ...                  fjs.parentNode.insertBefore(js, fjs);
          ...                  }(document, 'script', 'facebook-jssdk'));</script>
          ...  </body>
          ... </html>"""
          >>>
          >>> opengraphe = OpenGraphExtractor()
          >>> pp.pprint(opengraphe.extract(html))
          [{"namespace": {
                "og": "http://ogp.me/ns#"
            },
            "properties": [
                [
                    "og:title",
                    "Himanshu's Open Graph Protocol"
                ],
                [
                    "og:type",
                    "article"
                ],
                [
                    "og:url",
                    "https://www.eventeducation.com/test.php"
                ],
                [
                    "og:image",
                    "https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"
                ],
                [
                    "og:site_name",
                    "Event Education"
                ],
                [
                    "og:description",
                    "Event Education provides free courses on event planning and management to event professionals worldwide."
                ]
              ]
           }]
        
        
        Microformat extraction
        ++++++++++++++++++++++
        
        ::
        
          >>> import pprint
          >>> pp = pprint.PrettyPrinter(indent=2)
          >>>
          >>> from extruct.microformat import MicroformatExtractor
          >>>
          >>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
          ... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
          ...  <head>
          ...   <title>Himanshu's Open Graph Protocol</title>
          ...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
          ...   <meta http-equiv="Content-Language" content="en-us" />
          ...   <link rel="stylesheet" type="text/css" href="event-education.css" />
          ...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
          ...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
          ...   <article class="h-entry">
          ...    <h1 class="p-name">Microformats are amazing</h1>
          ...    <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
          ...       on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time></p>
          ...    <p class="p-summary">In which I extoll the virtues of using microformats.</p>
          ...    <div class="e-content">
          ...     <p>Blah blah blah</p>
          ...    </div>
          ...   </article>
          ...  </head>
          ...  <body></body>
          ... </html>"""
          >>>
          >>> microformate = MicroformatExtractor()
          >>> data = microformate.extract(html)
          >>> pp.pprint(data)
          [{"type": [
                "h-entry"
            ],
            "properties": {
                "name": [
                    "Microformats are amazing"
                ],
                "author": [
                    {
                        "type": [
                            "h-card"
                        ],
                        "properties": {
                            "name": [
                                "W. Developer"
                            ],
                            "url": [
                                "http://example.com"
                            ]
                        },
                        "value": "W. Developer"
                    }
                ],
                "published": [
                    "2013-06-13 12:00:00"
                ],
                "summary": [
                    "In which I extoll the virtues of using microformats."
                ],
                "content": [
                    {
                        "html": "\n<p>Blah blah blah</p>\n",
                        "value": "\nBlah blah blah\n"
                    }
                ]
              }
           }]
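        
        In the output above every property maps to a list, and a value is either a
        plain string or a dictionary with a ``value`` key (the nested ``h-card`` and
        the ``e-content`` entry). A minimal sketch of reading the first plain-text
        value of a property; ``first_value`` is a hypothetical helper, not an
        *extruct* function:

        ```python
        def first_value(item, prop, default=None):
            """Return the first plain-text value of a microformat property."""
            values = item.get("properties", {}).get(prop, [])
            if not values:
                return default
            value = values[0]
            # Nested items and e-* properties are dicts carrying a "value" key.
            if isinstance(value, dict):
                return value.get("value", default)
            return value


        # Shortened copy of the extractor output shown above.
        entry = {
            "type": ["h-entry"],
            "properties": {
                "name": ["Microformats are amazing"],
                "author": [{
                    "type": ["h-card"],
                    "properties": {
                        "name": ["W. Developer"],
                        "url": ["http://example.com"],
                    },
                    "value": "W. Developer",
                }],
                "published": ["2013-06-13 12:00:00"],
            },
        }

        print(first_value(entry, "name"))
        print(first_value(entry, "author"))
        ```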
        
        Dublin Core extraction
        ++++++++++++++++++++++++++++++
        
        ::
        
            >>> import pprint
            >>> pp = pprint.PrettyPrinter(indent=2)
            >>> from extruct.dublincore import DublinCoreExtractor
            >>> html = '''<head profile="http://dublincore.org/documents/dcq-html/">
            ... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
            ... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
            ... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
            ...
            ...
            ... <meta name="DC.title" lang="en" content="Expressing Dublin Core
            ... in HTML/XHTML meta and link elements" />
            ... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
            ... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
            ... <meta name="DC.identifier" scheme="DCTERMS.URI"
            ... content="http://dublincore.org/documents/dcq-html/" />
            ... <link rel="DCTERMS.replaces" hreflang="en"
            ... href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
            ... <meta name="DCTERMS.abstract" content="This document describes how
            ... qualified Dublin Core metadata can be encoded
            ... in HTML/XHTML &lt;meta&gt; elements" />
            ... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
            ... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
            ... <meta name="DC.Date.modified" content="2001-07-18" />
            ... <meta name="DCTERMS.modified" content="2001-07-18" />'''
            >>> dublincoree = DublinCoreExtractor()
            >>> data = dublincoree.extract(html)
            >>> pp.pprint(data)
            [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',
                                'content': 'Expressing Dublin Core\n'
                                           'in HTML/XHTML meta and link elements',
                                'lang': 'en',
                                'name': 'DC.title'},
                              { 'URI': 'http://purl.org/dc/elements/1.1/creator',
                                'content': 'Andy Powell, UKOLN, University of Bath',
                                'name': 'DC.creator'},
                              { 'URI': 'http://purl.org/dc/elements/1.1/identifier',
                                'content': 'http://dublincore.org/documents/dcq-html/',
                                'name': 'DC.identifier',
                                'scheme': 'DCTERMS.URI'},
                              { 'URI': 'http://purl.org/dc/elements/1.1/format',
                                'content': 'text/html',
                                'name': 'DC.format',
                                'scheme': 'DCTERMS.IMT'},
                              { 'URI': 'http://purl.org/dc/elements/1.1/type',
                                'content': 'Text',
                                'name': 'DC.type',
                                'scheme': 'DCTERMS.DCMIType'}],
                'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',
                                'DCTERMS': 'http://purl.org/dc/terms/'},
                'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',
                             'content': '2003-11-01',
                             'name': 'DCTERMS.issued',
                             'scheme': 'DCTERMS.W3CDTF'},
                           { 'URI': 'http://purl.org/dc/terms/abstract',
                             'content': 'This document describes how\n'
                                        'qualified Dublin Core metadata can be encoded\n'
                                        'in HTML/XHTML <meta> elements',
                             'name': 'DCTERMS.abstract'},
                           { 'URI': 'http://purl.org/dc/terms/modified',
                             'content': '2001-07-18',
                             'name': 'DC.Date.modified'},
                           { 'URI': 'http://purl.org/dc/terms/modified',
                             'content': '2001-07-18',
                             'name': 'DCTERMS.modified'},
                           { 'URI': 'http://purl.org/dc/terms/replaces',
                             'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
                             'hreflang': 'en',
                             'rel': 'DCTERMS.replaces'}]}]
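        
        The result above splits the metadata into ``elements`` and ``terms``. Entries
        that come from ``<meta>`` tags carry ``name`` and ``content``, while entries
        that come from ``<link>`` tags carry ``rel`` and ``href``. A minimal sketch of
        flattening both lists into one lookup table; ``dublin_core_by_name`` is a
        hypothetical helper, not part of *extruct*:

        ```python
        def dublin_core_by_name(item):
            """Map each Dublin Core name (or link rel) to its list of values."""
            by_name = {}
            for entry in item.get("elements", []) + item.get("terms", []):
                # <meta> entries use name/content, <link> entries use rel/href.
                key = entry.get("name") or entry.get("rel")
                value = entry.get("content", entry.get("href"))
                by_name.setdefault(key, []).append(value)
            return by_name


        # Shortened copy of the extractor output shown above.
        sample = [{
            "namespaces": {
                "DC": "http://purl.org/dc/elements/1.1/",
                "DCTERMS": "http://purl.org/dc/terms/",
            },
            "elements": [
                {"URI": "http://purl.org/dc/elements/1.1/creator",
                 "content": "Andy Powell, UKOLN, University of Bath",
                 "name": "DC.creator"},
            ],
            "terms": [
                {"URI": "http://purl.org/dc/terms/issued",
                 "content": "2003-11-01",
                 "name": "DCTERMS.issued",
                 "scheme": "DCTERMS.W3CDTF"},
                {"URI": "http://purl.org/dc/terms/replaces",
                 "href": "http://dublincore.org/documents/2000/08/15/dcq-html/",
                 "hreflang": "en",
                 "rel": "DCTERMS.replaces"},
            ],
        }]

        dc = dublin_core_by_name(sample[0])
        print(dc["DCTERMS.issued"])
        ```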
        
        
        
        Command Line Tool
        -----------------
        
        *extruct* provides a command line tool that fetches a page and extracts
        its metadata.
        
        Dependencies
        ++++++++++++
        
        The command line tool depends on ``requests``, which is not installed by default
        when you install *extruct*. To use the command line tool, install *extruct*
        with the ``cli`` extra::
        
            pip install extruct[cli]
        
        
        Usage
        +++++
        
        ::
        
            extruct "http://example.com"
        
        This downloads ``http://example.com`` and writes the Microdata, JSON-LD, RDFa,
        Open Graph and Microformat metadata it finds to ``stdout``.
        
        Supported Parameters
        ++++++++++++++++++++
        
        By default, the command line tool tries to extract all the supported
        metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph
        and Microformat). To restrict the output to one format or a subset of them,
        pass their names to the ``--syntaxes`` option.
        
        For example, this command extracts only Microdata and JSON-LD metadata from
        "http://example.com"::
        
            extruct "http://example.com" --syntaxes microdata json-ld
        
        Note: the names passed to ``--syntaxes`` must be among ``microdata``,
        ``json-ld``, ``rdfa``, ``opengraph`` and ``microformat``.
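        
        An option that accepts one or more names from a fixed set can be declared with
        ``argparse`` roughly as follows. This is only a sketch of the behaviour
        described above, not necessarily how *extruct* implements it; the program name
        ``extruct-sketch``, the ``SYNTAXES`` constant and ``build_parser`` are
        assumptions made for illustration:

        ```python
        import argparse

        # Names taken from the list above; the constant itself is hypothetical.
        SYNTAXES = ["microdata", "json-ld", "rdfa", "opengraph", "microformat"]


        def build_parser():
            """Build a parser with a positional URL and a --syntaxes option."""
            parser = argparse.ArgumentParser(prog="extruct-sketch")
            parser.add_argument("url", help="page to download")
            parser.add_argument(
                "--syntaxes",
                nargs="+",          # one or more names after the flag
                choices=SYNTAXES,   # any other name is rejected
                default=SYNTAXES,   # no flag means every syntax
                help="syntaxes to extract (default: all)",
            )
            return parser


        args = build_parser().parse_args(
            ["http://example.com", "--syntaxes", "microdata", "json-ld"]
        )
        print(args.syntaxes)
        ```

        With ``choices`` set, a name outside the list makes the parser exit with a
        usage error instead of being silently ignored.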
        
        Development version
        -------------------
        
        ::
        
            mkvirtualenv extruct
            pip install -r requirements-dev.txt
        
        
        Tests
        -----
        
        Run the tests in the current environment::
        
            py.test tests
        
        
        Use tox_ to run tests with different Python versions::
        
            tox
        
        
        .. _tox: https://testrun.org/tox/latest/
        .. _ogp: https://ogp.me/
Keywords: extruct
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Description-Content-Type: text/x-rst
Provides-Extra: cli
