Metadata-Version: 2.1
Name: contentai-metadata-flatten
Version: 1.4.1
Summary: ContentAI Metadata Flattening Service
Home-page: https://gitlab.research.att.com/turnercode/metadata-flatten-extractor
Author: Eric Zavesky
License: Apache
Platform: UNKNOWN
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
Requires-Dist: pandas
Requires-Dist: numexpr
Requires-Dist: pytimeparse
Requires-Dist: contentaiextractor (>=1.0.4)

metadata-flatten-extractor
==========================

A method to flatten generated JSON data into timed CSV events in support
of analytic workflows within the `ContentAI
Platform <https://www.contentai.io>`__, published as the extractor
``dsai_metadata_flatten``. A
`pypi package <https://pypi.org/project/contentai-metadata-flatten/>`__
is also published for easy incorporation into other projects.

1. `Getting Started <#getting-started>`__
2. `Execution <#execution-and-deployment>`__
3. `Testing <#testing>`__
4. `Future Development <#future-development>`__
5. `Changes <#changes>`__

Getting Started
===============

| This library is used as a `single-run executable <#contentai-standalone>`__.
| Runtime parameters can be passed for processing that configure the
  returned results and can be examined in more detail in the
  `main <main.py>`__ script.

**NOTE: Not all flattening functions will respect/obey properties
defined here.**

-  ``force_overwrite`` - *(bool)* - force existing files to be
   overwritten (*default=False*)
-  ``compressed`` - *(bool)* - compress output CSVs instead of raw write
   (*default=True*, e.g. append ‘.gz’)
-  ``all_frames`` - *(bool)* - for video-based events, log all instances
   in box or just the center (*default=False*)
-  ``time_offset`` - *(int)* - time offset in seconds applied when merging
   events for an asset split into multiple parts (*default=0*); negative
   numbers will cause a truncation (skip) of events happening before the
   zero time mark *(added v0.7.1)*
-  ``time_offset_source`` - *(str)* - path to a one-line file containing the
   offset in seconds, applied according to the ``time_offset`` rules
   *(added v1.4.0)*
-  ``verbose`` - *(bool)* - verbose input/output configuration printing
   (*default=False*)
-  ``extractor`` - *(string)* - specify one extractor to flatten,
   skipping nested module import (*default=all*, e.g. ``dsai_metadata``)
-  ``generator`` - *(string)* - specify one generator for output,
   skipping nested module import (``*``=all, empty=none; e.g. ``flattened_csv``)
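
As a minimal sketch (not taken from the package), the parameters above can be
collected in a plain dictionary and serialized to JSON, which is the form
expected by the ``EXTRACTOR_METADATA`` variable. The helper name
``build_metadata`` is hypothetical, and the defaults are simply copied from the
list above on the assumption that the package applies the same values when a
key is omitted.

.. code:: python

   import json

   # Defaults copied from the parameter list above (assumption: the package
   # falls back to these values when a key is not supplied).
   DEFAULTS = {
       "force_overwrite": False,
       "compressed": True,
       "all_frames": False,
       "time_offset": 0,
       "verbose": False,
   }

   # Parameters from the list above that have no simple default value here.
   OPTIONAL_KEYS = {"extractor", "generator", "time_offset_source"}


   def build_metadata(**overrides):
       """Hypothetical helper: merge overrides into the defaults, return JSON."""
       unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_KEYS
       if unknown:
           raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
       config = {**DEFAULTS, **overrides}
       return json.dumps(config, sort_keys=True)


   # Three-hour offset, restricted to a single extractor.
   metadata_json = build_metadata(time_offset=10800, extractor="dsai_metadata")
   print(metadata_json)

Rejecting unknown keys is a design choice of this sketch; it catches typos in
parameter names before a run is started.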


Generators
==========

CSV Schema (CSV)
----------------

One output of this flattening is a set of CSV files, produced when ``flattened_csv``
is enabled as a generator.  One file is created for each discovered input parser/extractor.
The standard schema for these CSV files has the following fields.

-  ``time_begin`` = time in seconds of event start
-  ``time_end`` = time in seconds of event end (may be equal to ``time_begin`` if
   instantaneous)
-  ``time_event`` = exact time in seconds (may be equal to ``time_begin`` if
   instantaneous)
-  ``source_event`` = source media for event to add granularity for
   event impact (e.g. face, video, audio, speech, image, ocr, script)
-  ``tag`` = simple text word or phrase
-  ``tag_type`` = descriptor for type of tag; e.g. tag=concept/label/emotion, keyword=special word,
   shot=segment, transcript=text, moderation=moderation, word=text/speech word, person=face or skeleton,
   phrase=long utterance, face=face emotion/properties, identity=face or speaker recognition, 
   scene=semantic scenes/commercials/commercial_lead, brand=product or logo mention, emotion=visual or audio sentiment/emotion
-  ``score`` = confidence/probability
-  ``details`` = possible bounding box or other long-form (JSON-encoded)
   details
-  ``extractor`` = name of extractor for insight
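
To illustrate the schema, the sketch below loads two hand-written rows with
pandas and filters them by ``tag_type`` and ``score``. The rows, the ``0.5``
threshold, and the derived ``duration`` column are illustrative assumptions,
not output of a real extractor. For a generated file, ``pandas.read_csv`` can
be pointed at the ``.csv.gz`` path directly, since it infers gzip compression
from the extension.

.. code:: python

   import io

   import pandas as pd

   # Two illustrative rows that follow the schema above (values are made up).
   csv_text = '''time_begin,time_end,time_event,source_event,tag,tag_type,score,details,extractor
   0.0,1.0,0.0,audio,Clock,tag,0.08157,"{""model"": ""/m/01x3z""}",example_extractor
   2.5,4.0,2.5,face,Jane Doe,identity,0.93,{},example_extractor
   '''
   # Strip the indentation that the documentation layout adds to each row.
   csv_text = "\n".join(line.strip() for line in csv_text.splitlines())

   df = pd.read_csv(io.StringIO(csv_text))

   # Keep confident identity events and compute the duration of each event.
   identities = df[(df["tag_type"] == "identity") & (df["score"] >= 0.5)].copy()
   identities["duration"] = identities["time_end"] - identities["time_begin"]
   print(identities[["tag", "time_begin", "duration"]])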


Example Programmatic Use
------------------------

While this library is primarily used as an extractor in ContentAI, it can 
be programmatically called within another extractor to simplify incoming 
data into a simple list for analysis.  Several of these examples are available
as code examples in the testing scripts.

1. This dictionary-based call example will parse output of the `azure_videoindexer` 
   and return it as a dictionary only (do not generate CSV or JSON output).

.. code:: python

   from contentai_metadata_flatten.main import flatten

   dict_result = flatten({"path_content": "content/jobs", "extractor": "azure_videoindexer",
                          "generator": "", "verbose": True, "path_result": "."}, args=[])


2. This argument call example will parse all extractor outputs and generate a CSV.

.. code:: python

   from contentai_metadata_flatten.main import flatten

   dict_result = flatten(args=["--path_content", "content/jobs/example.mp4",
                               "--generator", "flattened_csv", "--path_result", "content/flattened"])

3. This low-level access to a parser allows more control over which file or directory
   is parsed by the library and no generators are called.  This call example is the same as
   the first example except that it returns a `DataFrame` instead of a dictionary and may 
   be slightly faster.

.. code:: python

   from contentai_metadata_flatten import parsers
   import logging

   logger = logging.getLogger()
   logger.setLevel(logging.INFO)

   list_parser = parsers.get_by_name("azure_videoindexer")
   parser_instance = list_parser[0]['obj']("content/jobs", logger=logger)
   config_default = parser_instance.default_config()
   result_df = parser_instance.parse(config_default)

4. Another low-level access to parsers for only certain tag types.  This call example allows
   the parsing of only certain tag types (below only those of type `identity` and `face`).

.. code:: python

   from contentai_metadata_flatten import parsers
   import logging

   logger = logging.getLogger()
   logger.setLevel(logging.INFO)

   list_parser = parsers.get_by_type(["face", "identity"])
   for parser_obj in list_parser:
      parser_instance = parser_obj['obj']("content/jobs", logger=logger)
      config_default = parser_instance.default_config()
      result_df = parser_instance.parse(config_default)


Return Value
------------

The main function `main.py::flatten` returns a richer dictionary as of *v1.3.0*.
For programmatic callers of the function, the dictionary object contains a
`data` property (all of the flattened data as a list) and a `generated` property,
which contains a list of nested dictionaries indicating generated output (if enabled).
The example output below shows the flattened results as well as two enabled generators.

.. code:: python

   {'data': [
      {'tag': 'Clock', 'time_begin': 0, 'time_end': 1, 'time_event': 0, 'score': 0.08157, 'details': '{"model": "/m/01x3z"}', 'source_event': 'audio', 'tag_type': 'tag', 'extractor': 'example_extractor'}, 
      {'tag': 'Sine wave', 'time_begin': 0, 'time_end': 1, 'time_event': 0, 'score': 0.07586, 'details': '{"model": "/m/01v_m0"}', 'source_event': 'audio', 'tag_type': 'tag', 'extractor': 'example_extractor'}, 
      {'tag': 'Tick-tock', 'time_begin': 0, 'time_end': 1, 'time_event': 0, 'score': 0.07297, 'details': '{"model": "/m/07qjznl"}', 'source_event': 'audio', 'tag_type': 'tag', 'extractor': 'example_extractor'}, 
      ... ],
   'generated': [
      {'generator': 'flattened_csv', 'path': 'testme/example_extractor.csv.gz'}, 
      {'generator': 'wbTimeTaggedMetadata', 'path': 'testme/wbTimeTaggedMetadata.json.gz'}] 
   }
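
A caller might consume this dictionary as sketched below. To keep the example
self-contained, ``dict_result`` is a hand-built stand-in whose values are
copied from the example output above; in real use it would be the value
returned by ``flatten``. The variable names ``top`` and ``paths`` are
illustrative only.

.. code:: python

   import pandas as pd

   # Stand-in for the return value of flatten(), copied from the example above.
   dict_result = {
       "data": [
           {"tag": "Clock", "time_begin": 0, "time_end": 1, "time_event": 0,
            "score": 0.08157, "details": '{"model": "/m/01x3z"}',
            "source_event": "audio", "tag_type": "tag",
            "extractor": "example_extractor"},
           {"tag": "Sine wave", "time_begin": 0, "time_end": 1, "time_event": 0,
            "score": 0.07586, "details": '{"model": "/m/01v_m0"}',
            "source_event": "audio", "tag_type": "tag",
            "extractor": "example_extractor"},
       ],
       "generated": [
           {"generator": "flattened_csv",
            "path": "testme/example_extractor.csv.gz"},
           {"generator": "wbTimeTaggedMetadata",
            "path": "testme/wbTimeTaggedMetadata.json.gz"},
       ],
   }

   # The flattened events are a list of dictionaries, so they load directly
   # into a DataFrame with one row per event.
   df = pd.DataFrame(dict_result["data"])
   top = df.sort_values("score", ascending=False).iloc[0]

   # Map each enabled generator to the file it wrote.
   paths = {item["generator"]: item["path"] for item in dict_result["generated"]}

   print(top["tag"], paths)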

Execution and Deployment
========================

This package is meant to be run as a one-off processing tool that
aggregates the insights of other extractors.

command-line standalone
-----------------------

Run the code as if it is an extractor. In this mode, configure a few
environment variables to let the code know where to look for content.

One can also run the command line with a single argument as input and
optionally add runtime configuration (see `runtime
variables <#getting-started>`__) as JSON in the ``EXTRACTOR_METADATA``
variable.

.. code:: shell

   EXTRACTOR_METADATA='{"compressed":true}'

Locally Run on Results
~~~~~~~~~~~~~~~~~~~~~~

For utility, the above line has been wrapped in the bash script
``run_local.sh``.

.. code:: shell

   EXTRACTOR_METADATA="$3" EXTRACTOR_NAME=metadata-flatten EXTRACTOR_JOB_ID=1 \
       EXTRACTOR_CONTENT_PATH=$1 EXTRACTOR_CONTENT_URL=file://$1 EXTRACTOR_RESULT_PATH=`pwd`/results/$2 \
       python -u main.py

This allows a simplified command-line specification of a run
configuration, which also allows the passage of metadata into a
configuration.

*Normal result generation into compressed CSVs (with overwrite).*

.. code:: shell

   ./run_local.sh data/wHaT3ver1t1s results/

*Result generation with environment variables and integration of results
from a file that was split at an offset of three hours.*

.. code:: shell

   ./run_local.sh results/1XMDAz9w8T1JFEKHRuNunQhRWL1/ results/ '{"force_overwrite":false,"time_offset":10800}'

*Result generation from a single extractor, with its nested directory
explicitly specified. (added v0.6.1)*

.. code:: shell

   ./run_local.sh results/dsai_metadata results/ '{"extractor":"dsai_metadata"}'

Local Runs with Timing Offsets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The script ``run_local.sh`` also searches for a text file called
``timing.txt`` in each source directory. If found, it will offset all
results by the specified number of seconds before saving them to disk.
Also, negative numbers will cause a truncation (skip) of events
happening before the zero time mark. *(added v0.7.1)*

This capability may be useful if you have to manually split a file into
multiple smaller files at a pre-determined time offset (e.g. three hours
-> 10800 in ``timing.txt``). *(added v0.5.2)*

.. code:: shell

   echo "10800" > results/1XMDAz9w8T1JFEKHRuNunQhRWL1/timing.txt
   ./run_local.sh results/1XMDAz9w8T1JFEKHRuNunQhRWL1/ results/

Afterwards, new results can be added arbitrarily and the script can be
rerun in the same directory to accommodate different timing offsets.
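
The offset and truncation rules described above can be sketched with pandas.
This illustrates the documented behavior and is not the package's
implementation: ``read_timing_file`` and ``apply_time_offset`` are hypothetical
names, and dropping events whose shifted ``time_begin`` falls below zero is an
assumption about how "before the zero time mark" is decided.

.. code:: python

   import tempfile
   from pathlib import Path

   import pandas as pd

   TIME_COLUMNS = ["time_begin", "time_end", "time_event"]


   def read_timing_file(directory):
       """Return the offset in seconds from timing.txt, or 0 if it is absent."""
       path = Path(directory) / "timing.txt"
       if not path.exists():
           return 0
       return int(path.read_text().strip())


   def apply_time_offset(events, offset):
       """Shift all time columns; for negative offsets drop events before zero."""
       shifted = events.copy()
       shifted[TIME_COLUMNS] = shifted[TIME_COLUMNS] + offset
       if offset < 0:
           shifted = shifted[shifted["time_begin"] >= 0]
       return shifted.reset_index(drop=True)


   events = pd.DataFrame({
       "tag": ["a", "b"],
       "time_begin": [5.0, 60.0],
       "time_end": [6.0, 61.0],
       "time_event": [5.0, 60.0],
   })

   # Write and read a one-line timing file, as in the shell example above.
   with tempfile.TemporaryDirectory() as tmp:
       Path(tmp, "timing.txt").write_text("10800\n")
       offset = read_timing_file(tmp)

   later = apply_time_offset(events, offset)    # second part of a split asset
   trimmed = apply_time_offset(events, -30)     # skip the first 30 seconds
   print(later["time_begin"].tolist(), trimmed["tag"].tolist())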

*Example demonstrating integration of multiple output directories.*

.. code:: shell

   find results -mindepth 1 -maxdepth 1 -type d | xargs -I {} ./run_local.sh {} results/

ContentAI
---------

Deployment
~~~~~~~~~~

Deployment is easy and follows standard ContentAI steps.

.. code:: shell

   contentai deploy --cpu 256 --memory 512 metadata-flatten
   Deploying...
   writing workflow.dot
   done

Alternatively, you can pass an image name to reduce rebuilding a docker
instance.

.. code:: shell

   docker build -t metadata-deploy .
   contentai deploy metadata-flatten --cpu 256 --memory 512 -i metadata-deploy

Locally Downloading Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can locally download data from a specific job for this extractor to
directly analyze.

.. code:: shell

   contentai data wHaT3ver1t1s --dir data

Run as an Extractor
~~~~~~~~~~~~~~~~~~~

.. code:: shell

   contentai run https://bucket/video.mp4  -w 'digraph { aws_rekognition_video_celebs -> metadata_flatten}'

   JOB ID:     1Tfb1vPPqTQ0lVD1JDPUilB8QNr
   CONTENT:    s3://bucket/video.mp4
   STATE:      complete
   START:      Fri Feb 15 04:38:05 PM (6 minutes ago)
   UPDATED:    1 minute ago
   END:        Fri Feb 15 04:43:04 PM (1 minute ago)
   DURATION:   4 minutes 

   EXTRACTORS

   my_extractor

   TASK      STATE      START           DURATION
   724a493   complete   5 minutes ago   1 minute 

Or run it via the docker image…

::

   docker run --rm  -v `pwd`/:/x -e EXTRACTOR_CONTENT_PATH=/x/file.mp3 -e EXTRACTOR_RESULT_PATH=/x/result2 <docker_image>

View Extractor Logs (stdout)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: shell

   contentai logs -f <my_extractor>
   my_extractor Fri Nov 15 04:39:22 PM writing some data
   Job complete in 4m58.265737799s

Testing
=======

Testing is included via tox.  To launch testing for the entire package, run `tox` at the command line.
Testing can also be run for a specific file within the package by setting the environment variable `TOX_ARGS`.

.. code:: shell

   TOX_ARG=test_basic.py tox 



Future Development
==================

-  the remaining known extractors: ``openpose``, ``dsai_tmstext_classifier_extractor``,
   ``dsai_vinyl_sound_ai``, ``dsai_name_entity_extractor``,
   ``aws_rekognition_video_segments``
-  integration of viewership insights
-  creation of sentiment and mood-based insights (which tags most
   co-occur here?)

Changes
=======


1.4
---

1.4.1
~~~~~
- add new `dsai_ads_detector` parser for predictive ad locations

1.4.0
~~~~~
- fix for timing offsets; don't overwrite any output if timing offset indicator


1.3
---

1.3.3
~~~~~
- minor fix for `azure_videoindexer` parsing; the first video shot may now lack a keyframe


1.3.2
~~~~~
- minor fix for `gcp_videointelligence_text_detection` parsing

1.3.1
~~~~~
- fix for no-output generators
- fix complete output for returned dictionary of data
- add richer documentation for library/api usage

1.3.0
~~~~~
- update output of main parse function to return a dict instead of file listing
- modify generator specification to allow ALL (``*`` **default**) or NONE for outputs


1.2
---

1.2.2
~~~~~
- add parsers for `gcp_videointelligence_text_detection`, `comskip_json`, `ibm_max_audio_classifier`, 
   `gcp_videointelligence_object_tracking`, `gcp_videointelligence_people_detection`
- improve testing to iterate over known set of data in testing dir
- fix generator/parser retrieve for whole name matches, not partials
- add documentation for new types, explicitly call out `person` tag_type
- update the `dsai_activity_emotions` parser to return tag type `emotion` (matching that of other AWS, Azure parsers)

1.2.1
~~~~~
- update `azure_videoindexer` for `tag_type` in detected brands (was speech, now video)

1.2.0
~~~~~
- add unit-testing to package build
- add command-line / parser input as complement to contentai-driven ENV variables
- fix bugs around specification of result path or specific generator

1.1
---

1.1.8
~~~~~
- fix issue about constant reference
- fix `run_local.sh` script for extra run param config
- fix querying for local files in non-contentai environments (regression since 1.1.0)

1.1.7
~~~~~
- inclusion of other constants for compatibility with other packages
- refactor/rename of parser classes to mandate a filename output prefix (e.g. ``flatten_``)
- add ``dsai_activity_emotions`` parser (a clone of ``dsai_activity_classifier``)

1.1.6
~~~~~
- remove applications, fork to new `metadata-database` source, to be posted
  at a `pypi database package <https://pypi.org/project/contentai-metadata-database>`__

1.1.4
~~~~~
- name update for ``dsai_moderation_image`` extractor

1.1.3
~~~~~
- hotfix for build distribution
- fix for content creation in streamlit/browsing app

1.1.2
~~~~~
- deployed extractor (docker fix) for updated namespace


1.1.1
~~~~~
- docs update, testing fixes, version bump for publication

1.1.0
~~~~~
- rename to ``contentai-metadata-flatten`` and publish to pypi as a package!


1.0
---

1.0.2
~~~~~
- update documentation for `Metadata Browser <app_browser>`__ and `Inventory Discovery <app_inventory>`__ app

1.0.1
~~~~~
- add ability to parse input CSVs but not segment into shot
- move to a single NLP library (spacy) for applications, using large model (with vectors)

1.0.0
~~~~~
- add new `dash/plotly <https://dash.plotly.com/>`__ driven quality check application

0.9
---

0.9.9
~~~~~
- update to optimize the pull of asset keys

0.9.7
~~~~~

- upgrade to use new `contentai extractor package <https://pypi.org/project/contentaiextractor/>`__
- update parser logic for safer key and data retrieval


0.9.6
~~~~~
- small tweaks/normalization of rounding factor for extractors
- correct emotion source type for azure
- refactor app location for primary streamlit browser

  - fix mode discovery for modules with specific UX interface

- update file listing to show data bundle files as well
- refactor utilities script for reuse in other apps


0.9.5
~~~~~

- update to parse new version of `dsai_places`
- add new parser for `detectron2` extractor

0.9.4
~~~~~

- add static file serving to streamlit app, inspired by this `streamlit issue discussion <https://github.com/streamlit/streamlit/issues/400>`_
- modify some pages to point to downloadable tables (with button click)
- create new download page/mode that lists the generated and source files
- minor refactor of app's docker image for better caching in local creation and testing


0.9.3
~~~~~

- add ``dsai_moderation_text`` parser, update ``dsai_moderation`` parser for version robustness

  - add min threshold (*0.05*) to both moderation detectors


0.9.2
~~~~~

- add recursion to file-based discovery method for processed assets

  - unify read of JSON and text files with internalized function call in extractor base class

- fix some extractors to use single name reference ``self.EXTRACTOR``

0.9.1
~~~~~

- fix transcript parsing in ``azure_videoindexer`` component
- add speaker differentiation as an identity block in ``azure_videoindexer`` (similar to ``aws_transcribe``)


0.9.0
~~~~~

- add timeline viewing to the ``event_table`` mode of streamlit app



0.8
---

0.8.9
~~~~~

- fixes to main streamlit app for partial extractors (e.g. missing identity, sparse brand)

0.8.8
~~~~~

- add parser for ``dsai_moderation``


0.8.7
~~~~~

- add parser for ``dsai_activity_classifier``
- fix bug for faulty rejection of ``flatten_aws_transcribe`` results

0.8.6
~~~~~

- add parsers for ``pyscenedetect``, ``dsai_sceneboundary``, ``aws_transcribe``, ``yolo3``, ``aws_rekognition_video_text_detect``
- add speaker identity (from speech) to ``gcp_videointelligence_speech_transcription``
- add ``type`` field (maps to ``tag_type``) to output generated by ``wbTimeTaggedMetadata`` generator

  - add hashing against data (e.g. ``box``) within JSON metadata generator


0.8.5
~~~~~

- add parsers for ``dsai_yt8m`` (youtube8M or mediapipe)


0.8.4
~~~~~

- add parsers for ``dsai_activity_slowfast`` (activity) and ``dsai_places`` (scene/settings)
- add *source_type* sub-field to ``event_table`` browsing mode


0.8.3
~~~~~

- add ``manifest`` option to application for multiple assets
- fix app docker file for placement/generation of code with a specific user ID
- fix CI/CD integration for auto launch
- fix app explorer bugs (derive 'words' from transcript/keywords if none)


0.8.2
~~~~~

- hotfix for missing data in ``dsai_metadata`` parser
- slight refactor of how parsers are discovered, to allow search by name or type (for use as package)
- fix package import for contentai local file
- switch *tag_type* of ``ocr`` to ``transcript`` and ``ocr`` for *source_type* (``azure_videoindexer``)


0.8.1
~~~~~

- adding music parser ``dsai_musicnn`` for different audio regions


0.8.0
~~~~~

- convert to package for other modules to install
- switch document to RST from MD
- add primitive testing capabilities (to be filled)


0.7
---

0.7.1
~~~~~

-  added truncation/trim of events before zero mark if time offset is
   negative
-  re-brand extractor as ``dsai_metadata_flatten`` for ownership
   consistency

0.7.0
~~~~~

-  create new set of generator class objects for varying output
   generator
-  add new ``generator`` input for limiting output to a single type


0.6
---

0.6.2
~~~~~

-  rename ``rekognition_face_collection`` to
   ``aws_rekognition_face_collection`` for consistency


0.6.1
~~~~~

-  split documentation and changes
-  add new ``cae_metadata`` type of parser
-  modify ``source_type`` of detected faces in ``azure_videoindexer`` to
   ``face``
-  modify to add new ``extractor`` input for limit to scanning (skips
   sub-dir check)

0.6.0
~~~~~

-  adding CI/CD script for `gitlab <https://gitlab.com>`__
-  validate usage as a flattening service
-  modify ``source_type`` for ``aws_rekognition_video_celebs`` to
   ``face``

0.5
---


0.5.4
~~~~~

-  adding ``face_attributes`` visualization mode for exploration of face
   data
-  fix face processing to split out to ``tag_type`` as ``face`` with
   richer subtags

0.5.3
~~~~~

-  add labeling component to application (for video/image inspection)
-  fix shot duration computation in application (do not overwrite
   original event duration)
-  add text-search for scanning named entities, words from transcript


0.5.2
~~~~~

-  fix bugs in ``gcp_videointelligence_logo_recognition`` (timing) and
   ``aws_rekognition_video_faces`` (face emotions)
-  add new detection of ``timing.txt`` for integration of multiple
   results and their potential time offsets
-  added ``verbose`` flag to input of main parser
-  rename ``rekognition_face_collection`` for consistency with other
   parsers


0.5.1
~~~~~

-  split app modules into different visualization modes (``overview``,
   ``event_table``, ``brand_expansion``)

   -  ``brand_expansion`` uses kNN search to expand from shots with
      brands to similar shots and returns those brands
   -  ``event_table`` allows specific exploration of identity
      (e.g. celebrities) and brands with image/video playback
   -  **NOTE** The new application requires ``scikit-learn`` to perform
      live indexing of features

-  dramatically improved frame targeting (time offset) for event
   instances (video) in application


0.5.0
~~~~~

-  split main function into separate auto-discovered modules
-  add new user collection detection parser
   ``rekognition_face_collection`` (custom face collections)


0.4
---


0.4.5
~~~~~

-  fixes for gcp moderation flattening
-  fixes for app rendering (switch most graphs to scatter plot)
-  make all charts interactive again
-  fix for time zone/browser challenge in rendering


0.4.4
~~~~~

-  fixes for ``azure_videoindexer`` parser
-  add sentiment and emotion summary
-  rework graph generation and add brand/entity search capability


0.4.3
~~~~~

-  add new ``azure_videoindexer`` parser
-  switch flattened reference from ``logo`` to ``brand``; ``explicit``
   to ``moderation``
-  add parsing library ``pytimeparse`` for simpler ingest
-  fix bug to delete old data bundle if reference files are available


0.4.2
~~~~~

-  add new ``time_offset`` parameter to environment/run configuration
-  fix bug for reusing/rewriting existing files
-  add output prefix ``flatten_`` to all generated CSVs to avoid
   collision with other extractor input


0.4.1
~~~~~

-  fix docker image for nlp tasks, fix stop word aggregation


0.4.0
~~~~~

-  adding video playback (and image preview) via inline command-line
   execution of ffmpeg in application
-  create new Dockerfile.app for all-in-one explorer app creation


0.3
---


0.3.2
~~~~~

-  argument input capabilities for exploration app
-  sort histograms in exploration app by count not alphabet


0.3.1
~~~~~

-  browsing bugfixes for exploration application


0.3.0
~~~~~

-  added new `streamlit <https://www.streamlit.io/>`__ code for `data
   explorer interface <app>`__

   -  be sure to install extra packages if using this app and starting
      from scratch (e.g. new flattened files)
   -  if you’re working from a cached model, you can also drop it in
      from a friend


0.2
---


0.2.1
~~~~~

-  schema change for verb/action consistency ``time_start`` ->
   ``time_begin``
-  add additional row field ``tag_type`` to describe type of tag (see
   `generated-insights <#generated-insights>`__)
-  add processing type ``gcp_videointelligence_logo_recognition``
-  allow compression as a requirement/input for generated files
   (``compressed`` as input)

0.2.0
~~~~~

-  add initial package, requirements, docker image
-  add basic readme for usage example
-  processes types ``gcp_videointelligence_label``,
   ``gcp_videointelligence_shot_change``,
   ``gcp_videointelligence_explicit_content``,
   ``gcp_videointelligence_speech_transcription``,
   ``aws_rekognition_video_content_moderation``,
   ``aws_rekognition_video_celebs``, ``aws_rekognition_video_labels``,
   ``aws_rekognition_video_faces``,
   ``aws_rekognition_video_person_tracking``




