\begin{comment}
fixtex --fpaths appendix.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_appendix.md
\end{comment}

\appendix % This command is used only once!

\addcontentsline{toc}{chapter}{APPENDICES} %toc entry or:
%\addtocontents{toc}{\parindent0pt\vskip12pt APPENDICES} %toc entry, no page #
%\chapter{Appendix}

\crefalias{section}{appsec}
\crefalias{chapter}{appsec}

\chapter{OCCURRENCES}\label{app:occurgroup}

In the identification workflow we separate groups of images into \glossterm{occurrences}.
The Darwin Core defines an occurrence as a collection of evidence that shows an organism exists within a specific location and span of time~\cite{wieczorek_darwin_2012}.
For our purposes this amounts to a cluster of images localized in space and time.
We outline an occurrence grouping algorithm that performs agglomerative clustering on the GPS coordinates and timestamps specified in the image metadata.
We first describe the space-time measure of distance between images and then describe the clustering algorithm.

\section{SPACE-TIME IMAGE DISTANCE}

To compute occurrences we define a space-time feature $\g_i$ for each image $i$, and a pairwise distance, $\Delta(\g_i, \g_j)$, between these features.
This feature is a two-dimensional tuple,
%
$\g_i = \paren{\time_i, \gps_i}$,
where the first component is the POSIX timestamp $\time_i$, and the second component is a GPS coordinate
%
$\gps_i = \brak{\lat_i, \lon_i}^{T}$,
where the angles of latitude and longitude are measured in radians.
To compute the distance between two images $\g_i$ and $\g_j$ we first compute the distance in each component of the feature tuple.
The difference in time is the absolute value of the time delta,
%
$\Delta_t(\g_i, \g_j) = \abs{\time_i - \time_j}$,
which is measured in seconds.
% DISTANCE BETWEEN TWO IMAGES (space and final)
Next, the distance in space is computed by approximating the Earth as a sphere.
In general, the distance between two points on a sphere with radius $r$ is a function of the inverse haversine, and is expressed as:
\begin{equation}\label{eqn:geodistance}
d(\gps_i, \gps_j, r) = 2 r \asin{\sqrt{
    \haversine{\lat_i - \lat_j} +
    \cos\paren{\lat_i} \cos\paren{\lat_j} \haversine{\lon_i - \lon_j}}}
\end{equation}
In the previous equation, $\haversine{\theta} = \haversineFULL{\theta}$ is the haversine (half versed sine) function.
Thus, we arrive at the spatial distance between two images by estimating the radius of the Earth to be $r=6367$ kilometers.
\begin{equation}
\Delta_s(\g_i, \g_j) = d(\gps_i, \gps_j, 6367).
\end{equation}
This results in a distance in seconds and a distance in kilometers, which are incompatible units.
To combine these distances we convert kilometers to seconds by heuristically estimating the walking speed, $S$, of an animal (for zebras we use $S=2\sciE{-3}$ kilometers per second).
This allows us to cancel kilometers from the expression and express the GPS distance as a unit of time: $\frac{\Delta_s(\g_i, \g_j)}{S}$.
This distance can be interpreted as the total amount of time it would take an animal to walk between the two points.
The total distance between two images is the sum of these two components.
\begin{equation}\label{eqn:imgdist}
\Delta(\g_i, \g_j) = \Delta_t(\g_i, \g_j) + \frac{\Delta_s(\g_i, \g_j)}{S}
\end{equation}
Notice that if there is no difference in GPS location, then this measure reduces to a distance in time.
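For concreteness, the combined measure can be written as a short function.
The following Python sketch implements~\cref{eqn:imgdist} directly; the names \texttt{hav} and \texttt{space\_time\_distance} are illustrative rather than taken from the actual implementation.
\begin{verbatim}
import math

EARTH_RADIUS_KM = 6367.0  # spherical-Earth approximation from the text
WALK_SPEED_KMPS = 2e-3    # heuristic walking speed S for zebras (km/s)

def hav(theta):
    """Haversine (half versed sine) of an angle in radians."""
    return math.sin(theta / 2.0) ** 2

def space_time_distance(g_i, g_j, speed=WALK_SPEED_KMPS):
    """Combined space-time distance between two images, in seconds.

    Each feature is a tuple (posix_time, (lat, lon)), lat/lon in radians.
    """
    t_i, (lat_i, lon_i) = g_i
    t_j, (lat_j, lon_j) = g_j
    # Time component: absolute difference of POSIX timestamps (seconds).
    delta_t = abs(t_i - t_j)
    # Space component: great-circle (haversine) distance in kilometers.
    h = hav(lat_i - lat_j) + math.cos(lat_i) * math.cos(lat_j) * hav(lon_i - lon_j)
    delta_s = 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))
    # Convert kilometers to seconds using the walking-speed heuristic.
    return delta_t + delta_s / speed
\end{verbatim}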
\section{CLUSTERING PROCEDURE}

Having defined a pairwise distance between two images, we use the agglomerative clustering procedures implemented in SciPy~\cite{eric_jones_scipy_2001} to group the images.
There are two inputs to the agglomerative clustering algorithm: (1) the matrix of pairwise distances between images, and (2) the minimum distance threshold between two images.
The matrix of distances is computed using~\cref{eqn:imgdist}, and we set the distance threshold to $30$ minutes.
Any pair of images within this threshold is connected via a linkage matrix.
Connected components in this matrix form the final clusters that we use as occurrences.
To improve the efficiency of the algorithm, we separate it into disjoint parts by sorting the images by timestamp and splitting the data wherever the difference in consecutive timestamps exceeds the threshold.
Images that are missing either a timestamp or a GPS location are grouped by what data they do have and clustered separately.

%\section{DISCUSSION OF OCCURRENCES}
%These computed occurrences are valuable measurements for multiple components of the IBEIS software.
%At its core an occurrence describes \wquest{when} a group of animals was seen and \wquest{where} that group was seen.
%However, to answer questions like \wquest{how many} animals there were, \wquest{who} an animal is, \wquest{who else} an animal is with, and \wquest{where else} these animals have been seen, the \annots{} in the occurrence must be grouped into individual \encounters{} and then matched against the \masterdatabase{}.
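One way to realize this procedure is with SciPy's hierarchical clustering tools.
The sketch below is illustrative rather than the exact implementation: it assumes single linkage (whose flat clusters cut at a threshold are exactly the connected components of pairs within that threshold) and reuses the hypothetical \texttt{space\_time\_distance} function sketched in the previous section.
\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

THRESH_SECONDS = 30 * 60  # 30-minute distance threshold

def group_occurrences(features, thresh=THRESH_SECONDS):
    """Label each image feature with an occurrence id (requires n >= 2)."""
    n = len(features)
    # Condensed vector of pairwise space-time distances (upper triangle).
    dists = np.array([space_time_distance(features[i], features[j])
                      for i in range(n) for j in range(i + 1, n)])
    # With single linkage, cutting the dendrogram at the threshold yields
    # the connected components of pairs within the threshold.
    Z = linkage(dists, method='single')
    return fcluster(Z, t=thresh, criterion='distance')
\end{verbatim}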
\begin{comment}
\chapter{Converting existing datasets to decision graphs}\label{sec:rename}

In this section we briefly discuss the problem of applying graph identification from~\cref{chap:graphid} to existing databases.
Most datasets used in practice ignore detailed connectivity information and simply associate a name label with a database of cropped (or sometimes un-cropped) images.
Because graph identification relies on this detailed connectivity, we must reconstruct it before new images can be added.
To apply graph identification to a previously existing dataset where annotations have been assigned name labels and connectivity between the annotations is unknown, we follow this process.
First we compute the pairwise probabilities between each pair of annotations labeled with the same name.
Then, we automatically classify any edge above a threshold as positive, negative, or incomparable.
For any set of nodes originally labeled with the same name, we compute an edge augmentation to connect the PCC as detailed in~\cref{subsec:augredun} and insert these edges into the graph, labeling them as positive but assigning them the confidence of guessing.
Note that any edge labeled as negative in the classification step will result in an inconsistency because it will be a negative edge inside a PCC.
It is common for such datasets to contain errors.
We resolve any inconsistent PCCs using the algorithm from~\cref{sec:incon}, and then we search for additional split cases using the pairwise classifier.
The main idea is to re-review all edges where the pairwise classifier prediction disagrees with the assigned match-state.
Edges are sorted by the magnitude of the disagreement, but any edge with a confidence of absolutely-sure is ignored.
This will present edges labeled as guessing for the user to re-review.
At this point the dataset is in a legal state, where the name labels correspond to PCCs.
The final step is to execute a normal graph review in order to find any merge cases and explicitly label hard negative edges.
In the case that a pairwise classifier does not exist, one can be trained using the temporary edges defined by the maximum spanning trees of the PCCs.

It is sometimes desirable for new PCCs to keep the old name labels from the original database (\eg{} sometimes ecologists encoded information in these names).
This is a simple matter when the original database contained no mistakes, but when the original database contains errors care must be taken.
We address this problem by seeking to minimize the number of annotations that have their name label changed from the original dataset.
This can be computed by finding a maximum linear sum assignment using the Munkres algorithm implemented in SciPy~\cite{eric_jones_scipy_2001}.
We create a matrix where each row represents a group of annotations in the same PCC and each column represents an original name.
If there are more PCCs than original names, then the columns are padded with extra values.
The matrix is first initialized to negative infinity, representing impossible assignments.
Then, for each column representing a padded name, we set its value to $1$, indicating that each new name could be assigned to a padded name for some small profit.
Finally, we encode the profit of assigning a PCC to an original name: let $f_{rc}$ be the number of annotations in row $r$ with an original name of $c$, and set matrix value
%
$(r, c)$ to $f_{rc} + 1$ if $f_{rc} > 0$.
The extra one ensures that original names are always preferred over padded names.
The maximum linear sum assignment of this matrix results in the optimal consistent assignment of PCCs to original name labels.
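A minimal sketch of this assignment step, assuming \texttt{scipy.optimize.linear\_sum\_assignment} with \texttt{maximize=True}; the data structures (a list of PCCs as lists of annotation ids, and a mapping from annotation id to original name) are illustrative, and a large negative constant stands in for negative infinity:
\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_names(pccs, orig_names, annot_name):
    """Map each PCC (row) to an original name (column), maximizing the
    number of annotations that keep their original label."""
    n_pad = max(len(pccs) - len(orig_names), 0)
    n_cols = len(orig_names) + n_pad
    # Impossible assignments get a large negative profit (stand-in for -inf).
    profit = np.full((len(pccs), n_cols), -1e9)
    # Padded columns: any PCC may take a fresh name for a small profit of 1.
    profit[:, len(orig_names):] = 1
    # Real columns: profit f_rc + 1, so original names beat padded names.
    for r, pcc in enumerate(pccs):
        for c, name in enumerate(orig_names):
            f_rc = sum(1 for aid in pcc if annot_name[aid] == name)
            if f_rc > 0:
                profit[r, c] = f_rc + 1
    rows, cols = linear_sum_assignment(profit, maximize=True)
    # None means the PCC receives a fresh (padded) name.
    return {r: (orig_names[c] if c < len(orig_names) else None)
            for r, c in zip(rows, cols)}
\end{verbatim}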
\end{comment}

\begin{comment}
\chapter{Sight-resight analysis with incomparability}\label{app:markrecapincomp}

Given a completed decision graph, each PCC corresponds to an individual, and each pair of PCCs is either known to be different or known to be incomparable.
If all PCCs are known to be different, then sight-resight analysis is simple.
Each PCC can be grouped into a set of encounters.
The chronologically first encounter is a sighting, and the subsequent encounters are re-sightings.
However, if it cannot be determined that some pairs of PCCs are different, then only one PCC from each such pair can be used in our analysis, and the other must be discarded.
To find the largest set of PCCs that can be used in sight-resight statistics, we must find the largest set of PCCs that are all comparable to each other.
This problem can be solved in the following steps:
\begin{enumln}
\item Find all pairs of PCCs that are incomparable.
\item Consider the meta-graph where each of these incomparable PCCs is a node and there is an edge between each incomparable pair.
\item Find the largest independent set in this meta-graph (note that this is NP-hard).
\item Remove from the decision graph all nodes corresponding to the PCCs in the meta-graph that were not in the independent set.
\end{enumln}
Now, in the resulting graph there is no PCC that is incomparable with any other; otherwise there would have been two nodes in the independent set with an edge between them, which is a contradiction.
Even though the largest independent set would use the most data, any independent set will do.
Thus, sight-resight statistics can now be performed on this graph.
\end{comment}

%\chapter{Properties of the graph algorithm}\label{app:graphprop}
%\begin{enumln}
%\item Finding the largest set of comparable PCCs is a generalization of the independent set problem.
%Using the meta-graph where PCCs are nodes, and an edge is drawn between any PCCs known to have no comparable pairs of annotations, any independent set of PCCs will all be comparable to one another.
%The largest independent set will be the largest set of comparable PCCs.
%\item The graph identification procedure is agnostic to the underlying computer vision algorithms.
%Any ranking and verification algorithm can be substituted in.
%\item The graph identification procedure can be used without computer vision algorithms to facilitate a more efficient brute-force search, which can be useful for labeling small datasets.
%This is done by using all edges as candidate edges, where each edge is given a priority of $0$.
%The refresh criterion is not used, and the algorithm stops when the queue is empty.
%The positive and negative redundancy criteria will remove edges from the queue, so fewer than $|V|^2$ manual reviews will have to be done.
%\item When $k=2$ and edges are only added to the graph exactly to the specifications of the graph algorithm, it is impossible for a PCC to contain more than one negative edge at any time.
%\item When $k=1$, reviewing edges in order of positive match probability minimizes the expected number of total reviews.
%\item At the end of the graph algorithm every PCC will be $k$-positive-redundant.
%\item In recovery mode, consider that the hypothesis algorithm has generated a set of edges.
%The user has reviewed some of these edges and agreed with each hypothesis so far.
%In this case the remaining edges are necessarily a minimum cut.
%This is because the edges reviewed so far have had their labels changed, meaning that they no longer connect terminal nodes.
%In this case, if hypothesis generation is recomputed, then the new hypothesis will be the same as the set of remaining edges if the weights in the graph are unique.
%If the weights in the graph are not unique, then the implementation of min-cut can be chosen to enforce that the new set is the same as the remaining set.
%\item A review can have one of the following mutually exclusive effects on the PCCs of the graph:
%(1) merge exactly two PCCs into one PCC,
%(2) split a single PCC into exactly two PCCs, or
%(3) do nothing to the PCCs, but potentially influence redundancy.
%\end{enumln}
%\end{appendices}