%We then discuss design goals and introduce proposals for their
% implementation.
%Details of the proposals are discussed in individual sections.
%In this chapter we discuss the challenge of building a software package to
% maintain primary and derivative data of images and annotations.
% CHALLENGES:
%(\eg{} image URIs, GPS data, annotation bounding boxes, \ldots),
%(\eg{} SIFT descriptors, annotation matches, detection models, \ldots),
% Abstracted away caching w/ cache invalidation.
% Ease of experimentation and comparison between algorithms
% Hyperparameter optimization
%
% are examples of primary data and
% are examples of derived data
% This is the problem we intend to address in this chapter --- we want to
% develop a framework that abstracts algorithm inputs and dependencies away
% from the developer and allows the developer to interactively write
% algorithms to extract features, train models, and perform comparisons
% between objects while also testing different hyperparameter settings.
\newcommand{\DependencyCache}{Dependency Cache}

\chapter{System architecture}\label{chap:system}

In this chapter we describe a system architecture designed to aid in the
development and deployment of the computer vision algorithms used by
IBEIS{}.

It is common to use directed acyclic graphs to implement pipelines.
Related work includes GNU Make, Luigi, Oozie, Azkaban, and Airflow.
Ours differs because instead of specifying general task dependencies we
specify object dependencies.
Our dependency cache is designed to efficiently compute multiple levels of
hierarchical derivative information from a single root object.
This is well suited to computer vision applications because there are often
many choices of algorithms and features that are repeatedly applied to many
instances of the same base object (\eg{} an image).

In addition to computing dependencies of a single object, multiple objects
can be specified as input to a function.
In this way a function can train a model (\eg{} an SVM, a neighbor indexer,
a DCNN, \ldots).
This multi-input model can then be referenced either by the input that
created it, or it can be detached from that input and referenced by a
unique UUID{}.
This allows for efficient and organized training, testing, and distribution
of learned models.
The provenance information describing how a model was created is always
maintained, even if the original inputs cease to exist.
Furthermore, algorithms that make comparisons, perform classifications, or
simply output results can also be implemented here.
These algorithms are testable and comparable through the test harness,
which is available in an IPython notebook interface.

The main differences between our system and Luigi are:
\begin{itemize}
\item We have less boilerplate code because we implement a less general
    design.
\item We support a more functional design.
    Functions can be executed and tested completely outside the scope of
    the depcache.
\item We are focused on quickly changing parameter settings and searching
    for good parameter values, so configuration objects are central to our
    design.
    The test harness is a good example of this.
\end{itemize}
What we do not do: Hadoop integration; everything currently runs locally.
However, Luigi could be used to implement the back end of our data
registration interface.
The main novelty of this interface is not to be yet another DAG workflow
manager.
We believe the main contribution of this interface is a low barrier to
entry for implementing and testing computer vision algorithms.

TODO:{} abstract away row-id inputs.
Row-id input is nice because of its efficiency.
Every object is represented by an integer akin to a C++ pointer.
However, this is less friendly to a new user.
There should be an alternative mode in which the inputs to registered
functions are the native types of the variables that are needed.
The depcache should do any loading that is needed behind the scenes.
% https://github.com/spotify/luigi
% https://media.readthedocs.org/pdf/luigi/latest/luigi.pdf
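The contrast between these two input modes can be sketched with a small,
self-contained toy example.
The data structures and function names below are purely illustrative and do
not reflect the actual depcache API.
\begin{verbatim}
# Toy sketch of the two input modes (illustrative names only).
import numpy as np

IMAGES = {1: np.zeros((480, 640, 3), dtype=np.uint8)}  # row-id -> image
BBOXES = {1: (10, 20, 100, 150)}                        # row-id -> (x, y, w, h)

def compute_chip_by_rowid(aid, config):
    """Current style: the function receives an integer row-id and loads
    its own inputs from the underlying tables."""
    img = IMAGES[aid]
    x, y, w, h = BBOXES[aid]
    return img[y:y + h, x:x + w]

def compute_chip_by_value(img, bbox, config):
    """Proposed style: the depcache loads the parents behind the scenes
    and hands the function native types."""
    x, y, w, h = bbox
    return img[y:y + h, x:x + w]

# Both produce the same chip; only the calling convention differs.
assert np.array_equal(compute_chip_by_rowid(1, {}),
                      compute_chip_by_value(IMAGES[1], BBOXES[1], {}))
\end{verbatim}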
Between the raw images and all of the derived information, there is a large
amount of data that needs to be maintained by the IBEIS image analysis
system.
In particular, the system must support:
\begin{itemize}
\item developing algorithms,
\item testing and comparing multiple configurations on different datasets,
\item maintaining large datasets,
\item deriving information under multiple configurations, and
\item training models on multiple pieces of information.
\end{itemize}
We maintain a common structure that handles:
\begin{itemize}
\item precomputed features,
\item models (\eg{} a nearest neighbor indexer),
\item algorithms, and
\item user interactions.
\end{itemize}
This common structure is the system architecture of the IBEIS image
analysis module, which allows for the development, testing, and maintenance
of computer vision algorithms.

The image analysis component of IBEIS fundamentally keeps track of two
primary objects: images and annotations.
However, between all of the computer vision algorithms and their different
settings, the system can end up generating a large amount of derived data.
This introduces a problem that is often dealt with using memoization or
caching; however, when the bounding box of an annotation is changed, many
of these derived properties become invalid.
To handle these problems we define the \glossterm{\DependencyCache{}}
class.
This is a directed acyclic multi-graph that is able to store the results of
feature extraction algorithms and set-to-set comparisons.
This cache is primarily built for handling data that can be
deterministically derived given its input and configuration parameters.
This cache is extensible and allows new algorithms to be incorporated into
Image Analysis as plugins.
The user simply writes the algorithm as a function which accepts input
annotations and a configuration.
If any features not provided by the system are required, those are written
as standalone functions which again just take in a parent object (like an
annotation) and return a result.
These functions or classes are registered as plugins using Python
decorators.
Optionally, a configuration object can be specified.
The system can use these configurations to grid search for an optimal set
of parameters.

\section{Dependency cache}\label{sec:depc}

\subsection{Root nodes}
These are the non-recomputable roots of the graph.
A root is a first-class object like an annotation or an image.
It is possible for there to be more than one root node.
For instance, image and annotation could be in the same graph, with the
image node being an ancestor of the annotation node.
However, because an annotation is not automatically recomputable, there
must be a user interaction that separates any two root nodes in the graph.
Any non-recomputable node in the system can be viewed as the root of its
own \DependencyCache{}.

\subsection{Derived single-input nodes}
For a configuration and a single set of parents (at most one parent from
each ancestor node), a single-input function computes a dependent property
and returns the result.
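The registration pattern described above can be sketched for such a
single-input node as follows.
The decorator, registry, and table names here are hypothetical stand-ins,
not the exact depcache interface.
\begin{verbatim}
# Hypothetical sketch of the registration pattern; names are illustrative
# and do not reflect the exact depcache API.
REGISTRY = {}

def register_property(tablename, parents, config_defaults=None):
    """Register a derived-property function with its parent tables and
    default configuration."""
    def _wrap(func):
        REGISTRY[tablename] = {
            'func': func,
            'parents': parents,
            'config': dict(config_defaults or {}),
        }
        return func
    return _wrap

@register_property('chip', parents=['annotation'],
                   config_defaults={'dim_size': 450})
def compute_chip(annot_image, annot_bbox, config):
    # A one-to-one property: one parent annotation in, one chip out.
    x, y, w, h = annot_bbox
    return annot_image[y:y + h, x:x + w]

# The decorated function is still a plain Python function, so it can be
# executed and tested completely outside the cache.
\end{verbatim}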
\subsection{Derived multi-input nodes}
An algorithm might also take in multiple sets of parents of the same type.
For instance, an identification algorithm might take in a set of query
annotations and a set of database annotations, while a neighbor indexer
might take in just a single set of annotations.
There is a special case when the two input sets are drawn from the same
table; this is a Cartesian-product-input node.

\subsection{Non-derived nodes}
The system also allows for non-derived nodes.
These nodes store the result of some user interaction and, as such, are not
deterministic.
If a parent of a non-derived node is altered, by default the non-derived
node must be deleted.
However, if the manual interaction is a segmentation mask and the bounding
box of the parent annotation was simply translated by a few pixels, it
would be best if the system were able to modify the original input.
Thus the user can specify a modification function for a node that can alter
a user input given a change to a parent.

\subsection{Implicit edges}
The computation of a node is allowed to use another node in its computation
function.
This implicitly adds an edge between these two nodes that is not expressed
in the DAG visualization.
This is acceptable if the configuration for the accessed node is specified
in the configuration of the first node.
It is even acceptable to access nodes in a cyclic fashion as long as there
are no cycles in configuration space
(\ie{} a chip with config1 can ask for the keypoints computed on a chip
with config2).

\subsection{Configuration}
Configurations are combined over paths in the DAG{}.

\subsection{External storage}
Data in the cache can be stored either internally or externally.

\subsection{Getters}
A getter in the dependency cache is asked to return the properties of a
table at rows that correspond to a root under a configuration.
If a row does not exist there are three options: return None, recompute it,
or get the input from an external source (such as Wildbook).
In the case of a manual interaction, recomputation is marked as requested,
and execution is blocked until the interaction is complete or an error is
thrown.

\subsection{Setters}
Setters are used either to modify a manual node or to set information from
an external source.

\subsection{Test harness}\label{sec:testharn}
The test harness makes use of the configurations.
An algorithm is simply a node in the DAG, and the test harness iterates
over different configuration settings.

\subsection{Graph modification}\label{sec:modification}
Any modification made to the headers appropriately modifies the structure,
and steps are taken to rectify the database.
Recomputation occurs as necessary.
Algorithm modifications can be made either by adding a config parameter
while preserving old behavior, or by bumping an algorithm version flag,
which indicates that all previous results must be invalidated.
%\section{Testing}\label{sec:testing}
%utool doctests
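Before moving on, the following self-contained toy summarizes the getter
behavior described earlier in this section, including the three policies
for missing rows.
The class and function names are illustrative only; this is not the
\DependencyCache{} implementation.
\begin{verbatim}
# Toy illustration of the three missing-row policies for a getter
# (illustrative only; not the real implementation).
class ToyTable:
    def __init__(self, compute_func):
        self.compute_func = compute_func   # recomputes one row on demand
        self.cache = {}                    # (rowid, config_key) -> value

    def get(self, rowids, config=None, on_missing='recompute'):
        config = dict(config or {})
        key = tuple(sorted(config.items()))
        results = []
        for rowid in rowids:
            value = self.cache.get((rowid, key))
            if value is None:
                if on_missing == 'none':
                    results.append(None)                     # 1) return None
                    continue
                elif on_missing == 'recompute':
                    value = self.compute_func(rowid, config) # 2) recompute
                else:
                    value = fetch_external(rowid)            # 3) external source
                self.cache[(rowid, key)] = value
            results.append(value)
        return results

def fetch_external(rowid):
    """Stand-in for pulling a value from an external source such as Wildbook."""
    raise NotImplementedError

# Example: a table whose single property is the squared row-id.
table = ToyTable(lambda rowid, config: rowid ** 2)
print(table.get([1, 2, 3]))                # [1, 4, 9], computed and cached
print(table.get([4], on_missing='none'))   # [None]; nothing is computed
\end{verbatim}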
\section{IBEIS-IA definition}\label{sec:ibeisdepc}
This section enumerates the exact way the dependency cache is used to
create the IBEIS image analysis system.

\subsection{IBEIS DAG / DAM}
\begin{comment}
python -m ibeis.control.IBEISControl --test-show_depc_graph --save figures5/digraph.png --dpath ~/latex/crall-candidacy-2015/ --diskshow --clipwhite --reduced --dpi=120
python -m ibeis.control.IBEISControl --test-show_depc_graph --save figures5/digraph.png --dpath ~/latex/crall-candidacy-2015/
\end{comment}
\ImageCommand{figures5/digraph.png}{\textwidth}{
% ---
dependency digraph
% ---
}{digraph}
\digraph{}

\subsection{Using the system to handle photobombs}
\keywords{heuristic, turking, segmentation\\}
\relatedto{annotation property}
It is a challenging task to segment two overlapping animals with similar
textures.
As this only provides a minor barrier to identification, we choose to
manually segment photobombs when they become a problem.
This will allow the system to build up training data that can eventually be
used to learn how to segment out a foreground animal in an annotation.
To address the issue of photobombing in the \GGR{}, I will manually paint
over the photobombed sections of images using a matplotlib \ucite{mpl}
interaction whenever a photobomb occurs in matching.
Ideally we would have trained a state-of-the-art segmentation technique ---
like a CNN --- to segment out the primary animal in each annotation.
However, the implementation, training, testing, and evaluation of such
techniques is outside the scope of this work.

\begin{comment}
./mass_tex_fixes.py --outline --fpaths chapter5-systemchapter.tex --numlines=1
\end{comment}

\section{Dependency Cache Design}\label{sec:depc}
\paragraph{Related Work}
It is common to use directed acyclic graphs to implement pipelines.
Other related work includes GNU Make, Luigi, Oozie, Azkaban, Airflow, and
Dryad.
Of these, Luigi --- a Python-based pipeline manager --- is the most similar
to our system.
Ours differs because instead of specifying general task dependencies we
specify object dependencies.

\subsection{Assumptions}
We distinguish ourselves from previous work by targeting a more niche
application: dependency caches for computer vision algorithm development in
Python.
\begin{itemize}
\item It is assumed that there is one root object and that multiple
    instances of this object will be created.
\end{itemize}

\subsection{Types of derived data}
We identify and distinguish between three types of derived data:
(1) features, (2) comparisons, and (3) models.
%
Features are properties directly derived from an object and correspond with
an object in a one-to-one manner (\eg{} an annotation and its chip, or an
image and a set of raw detections from a bounding box detector).
%
Comparisons determine a notion of similarity or difference between a
fixed-size set of objects; typically comparisons are pairwise (\eg{} SIFT
descriptors from two annotations are compared to find correspondences).
%
Models are often learned from a set of multiple objects and are described
with a many-to-one relationship (\eg{} a kd-tree indexer indexes the
features of many annotations, and a neural network may be trained to detect
bounding boxes using many images).
Models are often used to generate features or make comparisons.

This introduces a problem that is often dealt with using memoization or
caching.
However, when certain properties of a primary object (such as the bounding
box of an annotation) are changed, many of these derived properties become
invalid.
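The three types of derived data identified above differ mainly in the shape
of the functions that produce them.
The toy functions below use illustrative names (with numpy assumed) and are
not the actual IBEIS algorithms; they only sketch the one-to-one, pairwise,
and many-to-one calling patterns.
\begin{verbatim}
# Illustrative signature shapes for the three types of derived data.
import numpy as np

def extract_feature(chip, config):
    """Feature: one parent object in, one property out (one-to-one)."""
    return np.histogram(chip, bins=config.get('bins', 16))[0]

def compare(feat1, feat2, config):
    """Comparison: a fixed-size set of objects (here a pair) in, a score out."""
    denom = np.linalg.norm(feat1) * np.linalg.norm(feat2) + 1e-9
    return float(np.dot(feat1, feat2) / denom)

def train_model(feat_list, config):
    """Model: many objects in, one learned structure out (many-to-one)."""
    return np.stack(feat_list).mean(axis=0)   # e.g. a centroid "model"
\end{verbatim}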
\section{Dependency Cache Details}\label{sec:depcdetails}
For each dependency cache there is a single root object.
Multiple instances of this object can be created.
A table is made for each derived property in the \depcache{}, and a single
table may store multiple properties (columns).
All properties of a table are computed at the same time for some number of
root instances.
Results of a property are stored either internally in an SQL database or
externally in a file.
Filenames are derived automatically from dependency information.
Multiple roots can be used to specify models.
When a model is a dependency of a computation, the model is specified
either by the inputs that created it or by its UUID{}.

\subsection{Determining root-most inputs}
Inputs to a table are determined by finding the ``root-most'' tables in an
``expanded input graph''.
The expanded input graph is constructed from the table node and its
ancestors.

\paragraph{Expanded input graph}
An expanded input graph is computed between the root node and the table
node.
First, all paths from the root to the table are enumerated as lists of
edges (recall the \depcache{} is a multi-digraph, so an edge is specified
as a 3-tuple $(u, v, k)$).
Each edge is associated with a ``local input-id'' which specifies whether
the edge is one-to-one or many-to-one as well as the edge key $k$.
Different inputs will be distinguished by accumulated local input-ids.
Each path is then traversed from the table to the root.
The local input-ids along the way are accumulated in a list and appended to
the node label at each step (duplicate consecutive input-ids are removed).
This results in a graph where the table node is the only sink and there are
multiple source nodes, each of which specifies a different set of root
objects.

\paragraph{Root-most tables}
To specify the inputs to a table it is convenient to specify only the ids
of the root objects and a configuration object.
However, when some dependencies themselves take multiple inputs (such as
model-based classifications), it is more convenient to specify those inputs
directly.
A specifiable node is one that is either a source node or a node with
multiple inputs.
The ``root-most'' tables are a subset of the specifiable nodes such that on
every path from a source node to the sink node there is exactly one
specifiable node identified as ``root-most''.
(Removing all the child edges of the root-most nodes would disconnect every
source from the sink.)
These represent the sets of inputs that must be specified by a user.
In the case where all paths are one-to-one properties, the user specifies
just a single root object.
In the case where two root objects are compared, the list of root pairs is
specified.
If a model is involved, the id of the model must be specified as well (for
convenience a default value can be specified in the configuration).

\paragraph{Computing input order}
Given the set of inputs, we must walk down the dependency graph and look up
or compute intermediate values between the specified nodes and the target
node.
To do this we topologically sort the nodes in the expanded input graph and
compute them in this order.
For each branch in the tree (a set of paths from a source to the sink) we
compute a branch id.
\expandedinput{}

\section{Test Harness}\label{sec:testharn}
We have developed a test harness to explore the configuration space of our
algorithms.
Currently the test harness is only capable of performing grid search and
interactive search (a grid-search sketch follows the list below).
\paragraph{Major goals}
\begin{itemize}
\item Reproducibility of experiments.
\item Ease of algorithm comparisons.
\item Reporting of results.
\end{itemize}
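As a minimal sketch of what a grid-search driven harness looks like, the
function and the stand-in scoring function below are illustrative only and
are not the actual test harness.
\begin{verbatim}
# Minimal grid-search harness sketch (illustrative; not the real harness).
import itertools

def grid_search(run_algorithm, score_results, param_grid):
    """Run the algorithm once per configuration in the Cartesian product of
    the parameter grid and return (config, score) pairs, best first."""
    keys = sorted(param_grid)
    report = []
    for values in itertools.product(*(param_grid[k] for k in keys)):
        config = dict(zip(keys, values))
        report.append((config, score_results(run_algorithm(config))))
    return sorted(report, key=lambda item: item[1], reverse=True)

# Stand-in algorithm and score whose optimum is K=4, ratio_thresh=0.7.
report = grid_search(
    run_algorithm=lambda config: config,
    score_results=lambda result: -abs(result['K'] - 4)
                                 - abs(result['ratio_thresh'] - 0.7),
    param_grid={'K': [1, 2, 4, 8], 'ratio_thresh': [0.6, 0.7, 0.8]},
)
print(report[0][0])   # best configuration: {'K': 4, 'ratio_thresh': 0.7}
\end{verbatim}
Random search or Bayesian optimization would replace only the loop over the
Cartesian product, which is one reason configurations are treated as
first-class objects in our design.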
There are many related works on hyper-parameter optimization that are
applicable here.
Extensions would be to incorporate random search, A/B testing, and Bayesian
hyper-parameter optimization methods.
Random search has been shown to be superior to grid search for
hyper-parameter optimization of neural networks~\cite{bergstra_random_2012}.
%pass \cite{hutter_sequential_2011}

\paragraph{Bulleted lists (for draft document only)}
The design goals of the image analysis module are:
\begin{itemize}
\item Dependent property storage and retrieval with automatic dependency
    pipeline execution.
\item Reproducible experiments.
\item Distribution through ``publication'' of trained models.
\item Extensible for new algorithm integration.
\item Automatic hyper-parameter optimization.
\item Maintain data provenance / data lineage (the history of data
    ownership: where did this data come from?).
\end{itemize}

The features of the \depcache{} are summarized as:
\begin{itemize}
\item New algorithms are integrated into the system by registering the
    algorithm, its output types, its dependencies, and its configurations.
\item Algorithms can be implemented to accept single inputs for feature
    extraction, multiple inputs for model training, and pairwise inputs for
    primary object comparisons.
\item Results are cached based on a unique hash of their input and
    configuration dependencies.
    This allows results to be implicitly invalidated if the parent object
    changes (a hashing sketch follows these lists).
\item Registered configurations define the hyper-parameters of models.
    A result comparison function can be registered in order to
    automatically optimize hyper-parameters through grid search, random
    search, or Bayesian techniques.
\item Registered configurations also allow experiments to be reproduced.
    Experiments are tagged with a hash and a manifest file indicating how
    the experiment was run.
\item Trained models that are ``publishable'' are not invalidated if a
    source of parent data is removed.
    These models can be downloaded to other instances of image analysis and
    used without the parent data.
    However, at least a fingerprint of the parent data is always
    maintained.
\end{itemize}

These challenges are summarized as:
\begin{itemize}
\item Adding, removing, and modifying primary data.
\item Algorithm configurations.
\item Dependent computations.
\end{itemize}

\paragraph{The system architecture should}
\begin{itemize}
\item Integrate algorithms from multiple sources across multiple languages.
\item Correctly compute and cache algorithm results based on all relevant
    inputs.
\item Allow trained models to be distributed and used by other instances of
    the image analysis software.
\item Invalidate cached results if the source data changes (note that there
    is an exception for ``publishable'' results such as trained quality
    classifiers).
\item Have the capacity to augment models without completely rebuilding
    their structure (\eg{} adding points to a nearest neighbor kd-tree
    index).
\item Optimize hyper-parameters of an algorithm given a method to compare
    results.
\end{itemize}
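One way to satisfy the caching and invalidation requirements above is to
derive every cache key from a hash of the parent identifiers together with
the configuration, so that a change to either one produces a new key.
The sketch below is illustrative only and is not the exact scheme used by
the \depcache{}.
\begin{verbatim}
# Sketch of input+config hashing for cache keys (illustrative only).
import hashlib
import json

def cache_key(parent_uuids, config):
    """Hash the parent identifiers together with the configuration; a
    change to either implicitly invalidates the old cached result."""
    payload = json.dumps({'parents': sorted(map(str, parent_uuids)),
                          'config': config}, sort_keys=True)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

key_a = cache_key(['annot-001'], {'dim_size': 450})
key_b = cache_key(['annot-001'], {'dim_size': 500})  # config change -> new key
key_c = cache_key(['annot-002'], {'dim_size': 450})  # parent change -> new key
assert len({key_a, key_b, key_c}) == 3
\end{verbatim}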
\paragraph{Goals}
\begin{itemize}
\item \textbf{Abstract the need for caching away from the developer}: We
    aim to have all caches created and invalidated based on the algorithm
    dependency structure and the requested configurations.
\item \textbf{Lazy and configurable computation of hierarchical derivative
    data}: Make derived data such as features a top-level attribute of a
    root object (annotation) such that it is evaluated as needed and stored
    as specified.
\item \textbf{Efficient integration of new algorithms}: Functional design
    with minimal boilerplate.
    To integrate a new table into the \depcache{}, all that is needed is a
    function that computes the requested outputs given the appropriate
    inputs.
    This function is registered with the \depcache{} using a decorator.
    This makes it simple to integrate independently developed algorithms.
\item \textbf{Train models in a reproducible manner}: Models can be simple,
    such as a kd-tree indexer, or complex, such as a neural network
    classifier.
    A model should contain at least a fingerprint of the data and
    configuration that was used to compute it.
    Models should also be distributable so they can be used in systems they
    were not necessarily trained on.
\item \textbf{Modular testing}: Functions registered with the \depcache{}
    should be testable without heavy reliance on the \depcache{} itself.
    However, the \depcache{} should make testing easier by having the
    ability to quickly supply test input data.
    Any property of the \depcache{} should be recomputable on-the-fly
    without affecting the state of the cache, for both testing and timing
    purposes.
    The \depcache{} should be able to automatically test and time each
    table.
\end{itemize}

\paragraph{Image Analysis Module Design}
\begin{itemize}
\item We designed the module to have a low barrier to entry.
    We aim to clearly point out entry points into program execution.
    We aim to have a small startup time, with both low module import
    overhead and low database open overhead.
\item Doc-strings are used as documentation, examples, and unit tests.
    Most functions show an example that loads the necessary data to step
    through, understand, debug, or extend the function (a doctest sketch
    follows these lists).
\item \Depcache{} marshals data around with getters (lazy computation) and
    setters.
\item Module functionality is split between different standalone packages.
    Helpers and general utilities are in \utool{}, computer vision
    algorithms are in \vtool{}, plotting functions are in \plottool{}, and
    GUI functions are in \guitool{}.
\end{itemize}

\paragraph{Features}
\begin{itemize}
\item One-to-one properties that belong to each root object.
\item One-vs-one comparisons between objects.
\item One-vs-many comparisons between an object and a set of objects.
\item Training or building of many-to-one models.
\item Properties and models computed by the \depcache{} are highly
    configurable.
\end{itemize}
% Sources to cite:
% http://www.mitre.org/sites/default/files/publications/pr-15-1254-architectural-model-mitre-research-blueridge.pdf
% http://www.computer.org/csdl/proceedings/ssdbm/2004/2146/00/21460423.pdf
% http://www.alexanderpokluda.ca/coursework/cs848/CS848%20Paper%20Presentation%20-%20Alexander%20Pokluda.pdf
% http://uazone.org/demch/papers/bddac2013-bigdata-infrastructure-v06.pdf
% http://www.rosebt.com/uploads/8/1/8/1/8181762/big_data_the_management_revolution.pdf
% http://www.sciencedirect.com/science/article/pii/S0167739X14002015 - Pegasus, has transformation from abstract to executable workflow
%We propose extensions to this architecture to support both result caching
% and hyperparameter optimization as well as reproducible model training
% and distribution.
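As a small illustration of the doc-string-as-example-and-test pattern
mentioned above, the function below is a made-up example rather than part
of the module.
\begin{verbatim}
# Sketch of the docstring-as-example-and-test pattern (illustrative function).
def bbox_area(bbox):
    """Return the area of an (x, y, w, h) bounding box.

    Example:
        >>> bbox = (10, 20, 100, 50)
        >>> bbox_area(bbox)
        5000
    """
    x, y, w, h = bbox
    return w * h

if __name__ == '__main__':
    import doctest
    doctest.testmod()   # the example doubles as a unit test
\end{verbatim}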
Two further design goals are:
\begin{itemize}
\item \textbf{Maintain data provenance}: Keep a record of where each piece
    of derived data came from and how it was computed.
\item \textbf{Interactivity}: Development should be an interactive
    experience.
    Computer vision algorithms tend to have more empirical than theoretical
    justification.
    This process of trial and error should be reflected in the algorithm
    development process.
    \begin{itemize}
    \item \textbf{Data and algorithm interaction}: The developer should
        have the ability to explore the data interactively and manually
        specify tags or properties (such as causes of failure cases) that
        may be relevant to a specific task.
        The developer should be able to browse root objects, inspect
        dependent properties, and run algorithms from an interactive
        interface.
        Algorithms should have an intuitive ``entry point'' which
        demonstrates how to acquire the necessary data to start stepping
        through the algorithm logic.
    \item \textbf{Discoverability}: Ideally the learning curve should be
        low by making the system have a high level of discoverability ---
        that is, a new developer should be able to sit down at the system
        and intuitively ``play with'' and manipulate objects to see
        immediate, tangible results.
        We envision that this design goal will be met using IPython and
        autocompletion.
        It is then our challenge to develop a discoverable API{}.
    \end{itemize}
\end{itemize}

Of these goals, interactivity is the least important for addressing the
challenges.
Maintaining data provenance is essential.
%There are several factors that cause the design of such an
% organization scheme to be challenging.
%Developers must be able to run feasibility studies to test if the system
%is able to identify a new species of animal. The system is demoed on a
%regular basis and is regularly updated for live use in the field. New
%algorithms are integrated. Existing algorithms are updated and modified.
%Algorithms are often run with different configurations.
%We envision a framework which addresses these challenges by
% satisfying several design goals.