Metadata-Version: 2.1
Name: contentai-metadata-flatten
Version: 1.1.7
Summary: ContentAI Metadata Flattening Service
Home-page: https://gitlab.research.att.com/turnercode/metadata-flatten-extractor
Author: Eric Zavesky
License: Apache
Platform: UNKNOWN
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
Requires-Dist: pandas
Requires-Dist: numexpr
Requires-Dist: pytimeparse
Requires-Dist: contentaiextractor (>=1.0.4)

metadata-flatten-extractor
==========================

A method to flatten generated JSON data into timed CSV events in support
of analytic workflows within the `ContentAI
Platform <https://www.contentai.io>`__, published as the extractor
``dsai_metadata_flatten``. A
`PyPI package <https://pypi.org/project/contentai-metadata-flatten/>`__
is also published for easy incorporation into other projects.

1. `Getting Started <#getting-started>`__
2. `Execution <#execution-and-deployment>`__
3. `Testing <#testing>`__
4. `Future Development <#future-development>`__
5. `Changes <#changes>`__

Getting Started
===============

| This library is used as a `single-run executable <#command-line-standalone>`__.
| Runtime parameters can be passed for processing that configure the
  returned results and can be examined in more detail in the
  `main <main.py>`__ script.

**NOTE: Not all flattening functions will respect/obey properties
defined here.**

-  ``force_overwrite`` - *(bool)* - force existing files to be
   overwritten (*default=False*)
-  ``compressed`` - *(bool)* - compress output CSVs instead of raw write
   (*default=True*, e.g. append ‘.gz’)
-  ``all_frames`` - *(bool)* - for video-based events, log all instances
   in box or just the center (*default=False*)
-  ``time_offset`` - *(int)* - when merging events for an asset split
   into multiple parts, time in seconds (*default=0*); negative numbers
   will cause a truncation (skip) of events happening before the zero
   time mark *(added v0.7.1)*
-  ``verbose`` - *(bool)* - verbose input/output configuration printing
   (*default=False*)
-  ``extractor`` - *(string)* - specify one extractor to flatten,
   skipping nested module import (*default=all*, e.g. ``dsai_metadata``)
-  ``generator`` - *(string)* - specify one generator for output,
   skipping nested module import (*default=all*, e.g. ``flattened_csv``)
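
As a minimal sketch of how these parameters might be supplied, the
snippet below merges a JSON string (as it would appear in the
``EXTRACTOR_METADATA`` environment variable) over the defaults listed
above. The ``DEFAULTS`` dictionary and the ``parse_runtime_config``
helper are hypothetical names used for illustration and are not part of
the package API; representing ``default=all`` as ``None`` is also an
assumption.

.. code:: python

   import json
   import os

   # Defaults copied from the parameter list above.
   # Assumption: "all" extractors/generators is represented here as None.
   DEFAULTS = {
       "force_overwrite": False,
       "compressed": True,
       "all_frames": False,
       "time_offset": 0,
       "verbose": False,
       "extractor": None,
       "generator": None,
   }


   def parse_runtime_config(raw=None):
       """Merge JSON overrides over the documented defaults (hypothetical helper)."""
       if raw is None:
           # Fall back to the environment variable used by the extractor runtime.
           raw = os.environ.get("EXTRACTOR_METADATA", "{}")
       overrides = json.loads(raw) if raw else {}
       unknown = set(overrides) - set(DEFAULTS)
       if unknown:
           raise KeyError("unknown runtime parameters: %s" % sorted(unknown))
       merged = dict(DEFAULTS)
       merged.update(overrides)
       return merged


   config = parse_runtime_config('{"force_overwrite": true, "time_offset": 10800}')
   print(config["force_overwrite"], config["time_offset"], config["compressed"])

Rejecting unknown keys is a design choice of this sketch; the real
script may simply ignore them.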

generated schema
----------------

The output of this flattening will be a set of CSV files, one for each
extractor. The standard schema for these CSV files has the following
fields.

-  ``time_begin`` = time in seconds of event start
-  ``time_end`` = time in seconds of event end (may be equal to
   ``time_begin`` if instantaneous)
-  ``time_event`` = exact time in seconds (may be equal to
   ``time_begin`` if instantaneous)
-  ``source_event`` = source media for the event, to add granularity for
   event impact (e.g. face, video, audio, speech, image, ocr, script)
-  ``tag`` = simple text word or phrase
-  ``tag_type`` = descriptor for type of tag; e.g. tag=concept/label/emotion, keyword=special word,
   shot=segment, transcript=text, moderation=moderation, word=text/speech word,
   phrase=long utterance, face=face emotion/properties, identity=face or speaker
   recognition, scene=semantic scenes, brand=product or logo mention
-  ``score`` = confidence/probability
-  ``details`` = possible bounding box or other long-form (JSON-encoded)
   details
-  ``extractor`` = name of extractor for insight
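
To illustrate the schema, the following sketch parses flattened rows
with Python's standard ``csv`` module and keeps only events above a
score threshold. The sample rows, the ``example_extractor`` name, and
the ``load_events`` helper are invented for illustration and do not come
from real extractor output.

.. code:: python

   import csv
   import io
   import json

   # Two made-up rows that follow the schema above (values are illustrative only).
   SAMPLE = "\n".join([
       "time_begin,time_end,time_event,source_event,tag,tag_type,score,details,extractor",
       "1.0,3.5,1.0,video,beach,tag,0.91,{},example_extractor",
       '4.2,4.2,4.2,face,Jane Doe,identity,0.77,"{""box"": [0.1, 0.2, 0.3, 0.4]}",example_extractor',
   ])

   NUMERIC_FIELDS = ("time_begin", "time_end", "time_event", "score")


   def load_events(handle, min_score=0.0):
       """Read flattened events from a text handle, decoding numbers and JSON details."""
       events = []
       for row in csv.DictReader(handle):
           for field in NUMERIC_FIELDS:
               row[field] = float(row[field])
           # The details column is JSON-encoded; treat an empty cell as no details.
           row["details"] = json.loads(row["details"]) if row["details"] else {}
           if row["score"] >= min_score:
               events.append(row)
       return events


   confident = load_events(io.StringIO(SAMPLE), min_score=0.8)
   print([event["tag"] for event in confident])

When output is written with ``compressed`` enabled, a handle from
``gzip.open(path, "rt")`` can be passed to the same function.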

dependencies
------------

To install package dependencies on a fresh system, the recommended
technique is a set of vanilla pip packages. The current requirements
are listed in the ``requirements.txt`` file and can be installed as
follows.

.. code:: shell

   pip install --no-cache-dir -r requirements.txt 

Execution and Deployment
========================

This package is meant to be run as a one-off processing tool that
aggregates the insights of other extractors.

command-line standalone
-----------------------

Run the code as if it is an extractor. In this mode, configure a few
environment variables to let the code know where to look for content.

One can also run the command-line with a single argument as input and
optionally add runtime configuration (see `runtime
variables <#getting-started>`__) as part of the ``EXTRACTOR_METADATA``
variable as JSON.

.. code:: shell

   EXTRACTOR_METADATA='{"compressed":true}'

Locally Run on Results
~~~~~~~~~~~~~~~~~~~~~~

For utility, the above line has been wrapped in the bash script
``run_local.sh``.

.. code:: shell

   EXTRACTOR_METADATA="$3" EXTRACTOR_NAME=metadata-flatten EXTRACTOR_JOB_ID=1 \
       EXTRACTOR_CONTENT_PATH=$1 EXTRACTOR_CONTENT_URL=file://$1 EXTRACTOR_RESULT_PATH=`pwd`/results/$2 \
       python -u main.py

This allows a simplified command-line specification of a run
configuration, which also allows the passage of metadata into a
configuration.
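
For callers who prefer to launch a run from Python, the wrapper can be
approximated as below. This is a sketch that mirrors the environment
variables set by ``run_local.sh``; the ``build_env`` and ``run_local``
functions are hypothetical helpers, and ``run_local`` assumes
``main.py`` is in the current working directory, as the shell script
does.

.. code:: python

   import os
   import subprocess
   import sys


   def build_env(content_path, result_name, metadata="{}"):
       """Return a copy of the environment with the variables run_local.sh sets."""
       env = dict(os.environ)
       env.update({
           "EXTRACTOR_METADATA": metadata,
           "EXTRACTOR_NAME": "metadata-flatten",
           "EXTRACTOR_JOB_ID": "1",
           "EXTRACTOR_CONTENT_PATH": content_path,
           "EXTRACTOR_CONTENT_URL": "file://" + content_path,
           "EXTRACTOR_RESULT_PATH": os.path.join(os.getcwd(), "results", result_name),
       })
       return env


   def run_local(content_path, result_name, metadata="{}"):
       """Invoke main.py with the prepared environment (not executed in this sketch)."""
       return subprocess.run(
           [sys.executable, "-u", "main.py"],
           env=build_env(content_path, result_name, metadata),
           check=True,
       )


   env = build_env("data/job", "out", '{"compressed":true}')
   print(env["EXTRACTOR_CONTENT_URL"])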

*Normal result generation into compressed CSVs (with overwrite).*

.. code:: shell

   ./run_local.sh data/wHaT3ver1t1s results/

*Result generation with environment variables and integration of results
from a file that was split at an offset of three hours.*

.. code:: shell

   ./run_local.sh results/1XMDAz9w8T1JFEKHRuNunQhRWL1/ results/ '{"force_overwrite":false,"time_offset":10800}'

*Result generation from a single extractor, with its nested directory
explicitly specified. (added v0.6.1)*

.. code:: shell

   ./run_local.sh results/dsai_metadata results/ '{"extractor":"dsai_metadata"}'

Local Runs with Timing Offsets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The script ``run_local.sh`` also searches for a text file called
``timing.txt`` in each source directory. If found, it will offset all
results by the specified number of seconds before saving them to disk.
Also, negative numbers will cause a truncation (skip) of events
happening before the zero time mark. *(added v0.7.1)*

This capability may be useful if you have to manually split a file into
multiple smaller files at a pre-determined time offset (e.g. three hours
-> 10800 in ``timing.txt``). *(added v0.5.2)*

.. code:: shell

   echo "10800" > 1XMDAz9w8T1JFEKHRuNunQhRWL1/timing.txt
   ./run_local.sh results/1XMDAz9w8T1JFEKHRuNunQhRWL1/ results/

Afterwards, new results can be added arbitrarily and the script can be
rerun in the same directory to accommodate different timing offsets.
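
The offset and truncation behavior described in this section can be
sketched as follows. The ``apply_time_offset`` function and the sample
events are hypothetical; in particular, deciding truncation by the
shifted ``time_begin`` is an assumption, since the exact rule used by
the package is not stated here.

.. code:: python

   TIME_FIELDS = ("time_begin", "time_end", "time_event")


   def apply_time_offset(events, offset):
       """Shift event times by offset seconds, skipping events that land before zero."""
       shifted = []
       for event in events:
           moved = dict(event)  # copy so the caller's events are left untouched
           for field in TIME_FIELDS:
               moved[field] = event[field] + offset
           # Assumption: an event is skipped when its shifted start is negative.
           if moved["time_begin"] < 0:
               continue
           shifted.append(moved)
       return shifted


   events = [
       {"time_begin": 5.0, "time_end": 8.0, "time_event": 5.0, "tag": "intro"},
       {"time_begin": 120.0, "time_end": 125.0, "time_event": 120.0, "tag": "logo"},
   ]

   # Second part of an asset split at three hours: every event moves forward.
   print(apply_time_offset(events, 10800)[0]["time_begin"])

   # Negative offset: the first event now starts before zero and is skipped.
   print([event["tag"] for event in apply_time_offset(events, -60)])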

*Example demonstrating integration of multiple output directories.*

.. code:: shell

   find results -mindepth 1 -maxdepth 1 -type d | xargs -I {} ./run_local.sh {} results/

ContentAI
---------

Deployment
~~~~~~~~~~

Deployment is easy and follows standard ContentAI steps.

.. code:: shell

   contentai deploy --cpu 256 --memory 512 metadata-flatten
   Deploying...
   writing workflow.dot
   done

Alternatively, you can pass an image name to reduce rebuilding a docker
instance.

.. code:: shell

   docker build -t metadata-deploy .
   contentai deploy metadata-flatten --cpu 256 --memory 512 -i metadata-deploy

Locally Downloading Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can locally download data from a specific job for this extractor to
directly analyze.

.. code:: shell

   contentai data wHaT3ver1t1s --dir data

Run as an Extractor
~~~~~~~~~~~~~~~~~~~

.. code:: shell

   contentai run https://bucket/video.mp4  -w 'digraph { aws_rekognition_video_celebs -> metadata_flatten}'

   JOB ID:     1Tfb1vPPqTQ0lVD1JDPUilB8QNr
   CONTENT:    s3://bucket/video.mp4
   STATE:      complete
   START:      Fri Feb 15 04:38:05 PM (6 minutes ago)
   UPDATED:    1 minute ago
   END:        Fri Feb 15 04:43:04 PM (1 minute ago)
   DURATION:   4 minutes 

   EXTRACTORS

   my_extractor

   TASK      STATE      START           DURATION
   724a493   complete   5 minutes ago   1 minute 

Or run it via the docker image…

.. code:: shell

   docker run --rm  -v `pwd`/:/x -e EXTRACTOR_CONTENT_PATH=/x/file.mp3 -e EXTRACTOR_RESULT_PATH=/x/result2 <docker_image>

View Extractor Logs (stdout)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: shell

   contentai logs -f <my_extractor>
   my_extractor Fri Nov 15 04:39:22 PM writing some data
   Job complete in 4m58.265737799s

Testing
=======

Testing is included via tox. To launch testing for the entire package, just run ``tox`` at the command line.
Testing can also be run for a specific file within the package by setting the environment variable ``TOX_ARGS``.

.. code:: shell

   TOX_ARGS=test_basic.py tox



Future Development
==================

-  the remaining known extractors: ``openpose``, ``dsai_tmstext_classifier_extractor``,
   ``dsai_vinyl_sound_ai``, ``dsai_name_entity_extractor``, ``gcp_videointelligence_text_detection``,
   ``aws_rekognition_video_segments``
-  integration of viewership insights
-  creation of sentiment and mood-based insights (which tags most
   co-occur here?)

Changes
=======


1.1
---

1.1.7
~~~~~
- inclusion of other constants for compatibility with other packages
- refactor/rename of parser classes to mandate a filename output prefix (e.g. ``flatten_``)
- add ``dsai_activity_emotions`` parser (a clone of ``dsai_activity_classifier``)

1.1.6
~~~~~
- remove applications, fork to new ``metadata-database`` source, to be posted
  as a `PyPI database package <https://pypi.org/project/contentai-metadata-database>`__

1.1.4
~~~~~
- name update for ``dsai_moderation_image`` extractor

1.1.3
~~~~~
- hotfix for build distribution
- fix for content creation in streamlit/browsing app

1.1.2
~~~~~
- deployed extractor (docker fix) for updated namespace


1.1.1
~~~~~
- docs update, testing fixes, version bump for publication

1.1.0
~~~~~
- rename to ``contentai-metadata-flatten`` and publish to pypi as a package!


1.0
---

1.0.2
~~~~~
- update documentation for `Metadata Browser <app_browser>`__ and `Inventory Discovery <app_inventory>`__ app

1.0.1
~~~~~
- add ability to parse input CSVs but not segment them into shots
- move to a single NLP library (spacy) for applications, using large model (with vectors)

1.0.0
~~~~~
- add new `dash/plotly <https://dash.plotly.com/>`__ driven quality check application

0.9
---

0.9.9
~~~~~
- update to optimize the pull of asset keys

0.9.7
~~~~~

- upgrade to use new `contentai extractor package <https://pypi.org/project/contentaiextractor/>`__
- update parser logic for safer key and data retrieval



0.9.6
~~~~~
- small tweaks/normalization of rounding factor for extractors
- correct emotion source type for azure
- refactor app location for primary streamlit browser

  - fix mode discovery for modules with specific UX interface

- update file listing to show data bundle files as well
- refactor utilities script for reuse in other apps


0.9.5
~~~~~

- update to parse new version of ``dsai_places``
- add new parser for ``detectron2`` extractor

0.9.4
~~~~~

- add static file serving to streamlit app, inspired by this `streamlit issue discussion <https://github.com/streamlit/streamlit/issues/400>`_
- modify some pages to point to downloadable tables (with button click)
- create new download page/mode that lists the generated and source files
- minor refactor of app's docker image for better caching in local creation and testing


0.9.3
~~~~~

- add ``dsai_moderation_text`` parser, update ``dsai_moderation`` parser for version robustness

  - add min threshold (*0.05*) to both moderation detectors


0.9.2
~~~~~

- add recursion to file-based discovery method for processed assets

  - unify read of JSON and text files with internalized function call in extractor base class

- fix some extractors to use single name reference ``self.EXTRACTOR``

0.9.1
~~~~~

- fix transcript parsing in ``azure_videoindexer`` component
- add speaker differentiation as an identity block in ``azure_videoindexer`` (similar to ``aws_transcribe``)


0.9.0
~~~~~

- add timeline viewing to the ``event_table`` mode of streamlit app



0.8
---

0.8.9
~~~~~

- fixes to main streamlit app for partial extractors (e.g. missing identity, sparse brand)

0.8.8
~~~~~

- add parser for ``dsai_moderation``


0.8.7
~~~~~

- add parser for ``dsai_activity_classifier``
- fix bug for faulty rejection of ``flatten_aws_transcribe`` results

0.8.6
~~~~~

- add parsers for ``pyscenedetect``, ``dsai_sceneboundary``, ``aws_transcribe``, ``yolo3``, ``aws_rekognition_video_text_detect``
- add speaker identity (from speech) to ``gcp_videointelligence_speech_transcription``
- add ``type`` field (maps to ``tag_type``) to output generated by ``wbTimeTaggedTmetadata`` generator

  - add hashing against data (e.g. ``box``) within JSON metadata generator


0.8.5
~~~~~

- add parsers for ``dsai_yt8m`` (youtube8M or mediapipe)


0.8.4
~~~~~

- add parsers for ``dsai_activity_slowfast`` (activity) and ``dsai_places`` (scene/settings)
- add *source_type* sub-field to ``event_table`` browsing mode


0.8.3
~~~~~

- add ``manifest`` option to application for multiple assets
- fix app docker file for placement/generation of code with a specific user ID
- fix CI/CD integration for auto launch
- fix app explorer bugs (derive 'words' from transcript/keywords if none)


0.8.2
~~~~~

- hotfix for missing data in ``dsai_metadata`` parser


0.8.2
~~~~~

- slight refactor of how parsers are discovered, to allow search by name or type (for use as package)
- fix package import for contentai local file
- switch *tag_type* of ``ocr`` to ``transcript`` and ``ocr`` for *source_type* (``azure_videoindexer``)


0.8.1
~~~~~

- adding music parser ``dsai_musicnn`` for different audio regions


0.8.0
~~~~~

- convert to package for other modules to install
- switch document to RST from MD
- add primitive testing capabilities (to be filled)


0.7
---

0.7.1
~~~~~

-  added truncation/trim of events before zero mark if time offset is
   negative
-  re-brand extractor as ``dsai_metadata_flatten`` for ownership
   consistency

0.7.0
~~~~~

-  create new set of generator class objects for varying output
   generator
-  add new ``generator`` input for limiting output to a single type


0.6
---

0.6.2
~~~~~

-  rename ``rekognition_face_collection`` to
   ``aws_rekognition_face_collection`` for consistency


0.6.1
~~~~~

-  split documentation and changes
-  add new ``cae_metadata`` type of parser
-  modify ``source_type`` of detected faces in ``azure_videoindexer`` to
   ``face``
-  modify to add new ``extractor`` input for limit to scanning (skips
   sub-dir check)

0.6.0
~~~~~

-  adding CI/CD script for `gitlab <https://gitlab.com>`__
-  validate usage as a flattening service
-  modify ``source_type`` for ``aws_rekognition_video_celebs`` to
   ``face``

0.5
---


0.5.4
~~~~~

-  adding ``face_attributes`` visualization mode for exploration of face
   data
-  fix face processing to split out to ``tag_type`` as ``face`` with
   richer subtags

0.5.3
~~~~~

-  add labeling component to application (for video/image inspection)
-  fix shot duration computation in application (do not overwrite
   original event duration)
-  add text-search for scanning named entities, words from transcript


0.5.2
~~~~~

-  fix bugs in ``gcp_videointelligence_logo_recognition`` (timing) and
   ``aws_rekognition_video_faces`` (face emotions)
-  add new detection of ``timing.txt`` for integration of multiple
   results and their potential time offsets
-  added ``verbose`` flag to input of main parser
-  rename ``rekognition_face_collection`` for consistency with other
   parsers


0.5.1
~~~~~

-  split app modules into different visualization modes (``overview``,
   ``event_table``, ``brand_expansion``)

   -  ``brand_expansion`` uses kNN search to expand from shots with
      brands to similar shots and returns those brands
   -  ``event_table`` allows specific exploration of identity
      (e.g. celebrities) and brands with image/video playback
   -  **NOTE** The new application requires ``scikit-learn`` to perform
      live indexing of features

-  dramatically improved frame targeting (time offset) for event
   instances (video) in application


0.5.0
~~~~~

-  split main function into separate auto-discovered modules
-  add new user collection detection parser
   ``rekognition_face_collection`` (custom face collections)


0.4
---


0.4.5
~~~~~

-  fixes for gcp moderation flattening
-  fixes for app rendering (switch most graphs to scatter plot)
-  make all charts interactive again
-  fix for time zone/browser challenge in rendering


0.4.4
~~~~~

-  fixes for ``azure_videoindexer`` parser
-  add sentiment and emotion summary
-  rework graph generation and add brand/entity search capability


0.4.3
~~~~~

-  add new ``azure_videoindexer`` parser
-  switch flattened reference from ``logo`` to ``brand``; ``explicit``
   to ``moderation``
-  add parsing library ``pytimeparse`` for simpler ingest
-  fix bug to delete old data bundle if reference files are available


0.4.2
~~~~~

-  add new ``time_offset`` parameter to environment/run configuration
-  fix bug for reusing/rewriting existing files
-  add output prefix ``flatten_`` to all generated CSVs to avoid
   collision with other extractor input


0.4.1
~~~~~

-  fix docker image for nlp tasks, fix stop word aggregation


0.4.0
~~~~~

-  adding video playback (and image preview) via inline command-line
   execution of ffmpeg in application
-  create new Dockerfile.app for all-in-one explorer app creation


0.3
---


0.3.2
~~~~~

-  argument input capabilities for exploration app
-  sort histograms in exploration app by count not alphabet


0.3.1
~~~~~

-  browsing bugfixes for exploration application


0.3.0
~~~~~

-  added new `streamlit <https://www.streamlit.io/>`__ code for `data
   explorer interface <app>`__

   -  be sure to install extra packages if using this app and starting
      from scratch (e.g. new flattened files)
   -  if you’re working from a cached model, you can also drop it in
      from a friend


0.2
---


0.2.1
~~~~~

-  schema change for verb/action consistency ``time_start`` ->
   ``time_begin``
-  add additional row field ``tag_type`` to describe type of tag (see
   `generated schema <#generated-schema>`__)
-  add processing type ``gcp_videointelligence_logo_recognition``
-  allow compression as a requirement/input for generated files
   (``compressed`` as input)

0.2.0
~~~~~

-  add initial package, requirements, docker image
-  add basic readme for usage example
-  add processing types ``gcp_videointelligence_label``,
   ``gcp_videointelligence_shot_change``,
   ``gcp_videointelligence_explicit_content``,
   ``gcp_videointelligence_speech_transcription``,
   ``aws_rekognition_video_content_moderation``,
   ``aws_rekognition_video_celebs``, ``aws_rekognition_video_labels``,
   ``aws_rekognition_video_faces``,
   ``aws_rekognition_video_person_tracking``




