\begin{comment}
./texfix.py --fpaths chapter3-matching.tex --outline --asmarkdown --numlines=999 -w
./texfix.py --fpaths chapter3-matching.tex --reformat
# http://jaxedit.com/mark/
fixtex --fpaths chapter3-matching.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_chapter3-matching.md
\end{comment}

\chapter{IDENTIFICATION USING A RANKING ALGORITHM}\label{chap:ranking}

This chapter addresses the problem of computer-assisted animal identification, where an algorithm suggests likely possibilities, but a human reviewer always makes the final decision. Given a single annotation depicting an unknown animal and a database of previously identified annotations, the task is to determine a ranking of database individuals most likely to match the unknown animal. A human reviewer manually determines which --- if any --- of the top ranked results are correct. An \glossterm{annotation} is a rectangular region of interest around a specific animal within an image. Each annotation is given a \name{} label denoting its individual identity. A \glossterm{\name{}} refers to a group of annotations known to be the same individual. The identification process assigns a \name{} label to an unknown annotation either as (1) the \name{} label of a matched database annotation or (2) a new \name{} label if no matches are found.

The ranking algorithm is based on the feature correspondences between a query annotation and a set of database annotations. In each annotation a set of patch-based features is detected at keypoint locations. Then the visual appearance of each patch is described using SIFT~\cite{lowe_distinctive_2004}. A nearest neighbor algorithm establishes a set of feature correspondences between a query and the annotations in the database. A scoring mechanism based on local \naive{} Bayes nearest neighbor (LNBNN)~\cite{mccann_local_2012} produces a score for each feature correspondence. These scores are then aggregated into a single score for each \name{} in the database, producing a ranked list of \names{}. If the top ranked \name{} has a ``high'' score, then it is likely to be the same individual depicted in the query annotation. If the top ranked \name{} has a ``low'' score, then it is likely that the query individual is not depicted in any database annotation. An example ranked list returned by the algorithm is illustrated in~\cref{fig:rankedmatches}. For each result in the ranked list we decide if it is a correct or incorrect match. In the baseline algorithm this identification decision is left to a user. However, in the next chapter we will consider using an algorithm to automatically verify results returned in this ranked list.

The outline of this chapter is as follows: \Cref{sec:annotrepr} discusses the initial processing of an annotation, which involves image normalization, feature extraction, and feature weighting. \Cref{sec:baselineranking,sec:sver} describe the baseline ranking and scoring algorithm. The first of these sections focuses on establishing feature correspondences, and the second focuses on filtering the correspondences. \Cref{sec:exempselect} describes the process for selecting \exemplars{}. \Cref{sec:rankexpt} provides an experimental evaluation of the ranking algorithm. \Cref{sec:staticsum} summarizes this chapter.
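Before describing each stage in detail, the following toy sketch illustrates the core of the ranking computation in Python: brute-force nearest neighbor correspondences, LNBNN-style scoring against a normalizing neighbor, and accumulation of correspondence scores into per-\name{} totals. It is only a minimal illustration --- random unit vectors stand in for SIFT descriptors, the simple $\K+1$ normalizer rule is used, and chip extraction, foregroundness weighting, exemplar selection, and spatial verification are omitted; all names and values are placeholders.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    # L2-normalize descriptor rows to unit length.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy database: each name maps to one annotation's descriptors
# (random non-negative unit vectors stand in for SIFT descriptors).
database = {name: unit(rng.random((50, 128))) for name in ["NAME_A", "NAME_B"]}
query = unit(rng.random((50, 128)))

K = 2                    # correspondences kept per query feature
Z = np.sqrt(2)           # max distance between non-negative unit vectors
stacked = np.vstack(list(database.values()))
owners = np.repeat(list(database.keys()),
                   [len(d) for d in database.values()])

name_scores = {name: 0.0 for name in database}
for d_i in query:
    dists = np.linalg.norm(stacked - d_i, axis=1)
    order = np.argsort(dists)                  # brute-force nearest neighbors
    matches, normalizer = order[:K], order[K]  # K matches + 1 normalizer
    for j in matches:
        # LNBNN-style score: how much closer the match is than the normalizer.
        score = (dists[normalizer] - dists[j]) / Z
        name_scores[owners[j]] += score

ranked = sorted(name_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
\end{verbatim}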
\rankedmatches{}

\section{ANNOTATION REPRESENTATION}\label{sec:annotrepr}

For each annotation in the database we (1) normalize the image geometry and intensity, (2) compute features, and (3) weight the features.
% Chip Extract
Image normalization rotates, crops, and resizes an annotation from its image. This helps to remove background clutter and roughly align the annotations in pose and scale. The extracted and normalized region is referred to as a \glossterm{chip}.
% Feat Detect
Then, a set of features --- keypoint and descriptor pairs --- is computed. Keypoints are detected at multiple locations and scales within the chip, and a texture-based descriptor vector is extracted at each keypoint.
% Featweight
Finally, each feature is assigned a probabilistic weight using a foregroundness classifier. This helps remove the influence of background features.

\subsection{Chip extraction}
% Bounding box + orientation
Each annotation has a bounding box and an orientation specified in a previous detection step. For zebras and giraffes, the orientation of the chip is chosen such that the top of the bounding box is roughly parallel to the back of the animal. A chip is extracted by jointly rotating, scaling, and cropping an annotation's parent image using Lanczos resampling~\cite{lanczos_applied_1988}. The scaling resizes the image such that the cropped chip contains approximately $450^2$ pixels while maintaining the aspect ratio of the bounding box. If specified in the pipeline configuration, adaptive histogram equalization~\cite{pizer_adaptive_1987} is applied to the chip; however, this is not used in the experimental evaluation presented later in this chapter.

\subsection{Keypoint detection and description}

Keypoints are detected within each annotation's chip using a modified implementation of the Hessian detector described in~\cite{perdoch_efficient_2009} and reviewed in~\cref{sec:featuredetect}. This produces a set of elliptical features localized in space, scale, shape, and orientation. Each keypoint is described using the SIFT~\cite{lowe_distinctive_2004} descriptor that was reviewed in~\cref{sec:featuredescribe}. The resulting unordered set of keypoint-descriptor pairs is the annotation's set of features. Further details about the keypoint structure are given in~\cref{sec:kpstructure}.

We choose a baseline feature detection algorithm that produces affine invariant keypoints with the gravity vector. Affine invariant (\ie{} shape adapted) keypoints detect elliptical patches instead of circular ones. We choose affine invariant keypoints because the animals we identify will be seen from many viewpoints. Because all chips have been rotated into an upright position, we assign all keypoints a constant orientation --- this is the gravity vector assumption~\cite{perdoch_efficient_2009}. However, these baseline settings may not be appropriate for all species. It is important to select the appropriate level of invariance for each species we identify.
% Shape
In our experiments in~\cref{sub:exptinvar}, we will vary several parameters related to invariance in keypoint detection. To determine if affine invariance is appropriate for animal identification we experiment with both circular and elliptical keypoints.
% Orientation
We also experiment with different levels of orientation invariance. The gravity vector assumption holds in the case of rigid non-poseable objects (\eg{} buildings), if the image is upright. Clearly, for highly poseable animals, this assumption is more questionable.
However, full rotation invariance (using dominant gradient orientations) has intuitive problems. Patterns (like ``V'' and ``$\Lambda$'') that might contribute distinguishing evidence that two annotations match would always appear identical under full rotation invariance. Ideally, orientation selection would be based on the pose of the animal. We introduce a simple orientation heuristic to help match keypoints from the same animal in the presence of small pose variations. Instead of extracting a single keypoint in the direction of gravity or multiple keypoints in the directions of the dominant gradient orientations, we extract $3$ descriptors at every keypoint: one in the direction of gravity, and the other two offset at $\pm15\degrees$ from the gravity direction. This provides a middle ground between rotation invariance and strict use of the gravity vector. Using this heuristic, it is more likely that similar descriptors will be extracted from two annotations of the same animal seen from slightly different poses.

\subsection{Feature weighting}

In animal identification, there will often be many annotations containing the same background. Photographers may take many photos in a single place and camera traps will contribute many images with the same background. Without accurate background masking, regions of annotations from different images containing the same background may strongly match and outscore matches to the correct individuals. An example illustrating two different individuals seen in front of the same distinctive background is shown in~\cref{fig:SceneryMatch}. To account for this, each feature is given a weight based on its probability of belonging to the foreground --- its ``foregroundness''. This weight is used to indicate the importance of a feature in scoring and spatial verification.

Foregroundness is derived from a species detection algorithm developed by Parham~\cite{parham_photographic_2015}. The input to the species detection algorithm is the annotation's chip, and the output is an intensity image. Each pixel in the intensity image represents the likelihood that it is part of a foreground object. A foregroundness weight is computed for each keypoint in an annotation as follows: The region around the keypoint in the intensity image is warped into a normalized reference frame. Each pixel in the normalized intensity patch is weighted using a Gaussian falloff based on the pixel's distance from the center of the patch. The sum of these weighted intensities is the feature's foregroundness weight. The steps of feature weight computation are illustrated in~\cref{fig:fgweight}.

\SceneryMatch{}
\fgweight{}

\subsection{Keypoint structure overview}\label{sec:kpstructure}
%Before we discuss the computation of the we review the structure of
% a keypoint.
The keypoint of a feature is represented as $\kp\tighteq(\pt, \vmat, \ori)$.
%
The vector $\pt\tighteq\ptcolvec$ is the feature's $xy$-location. The scalar $\theta$ is the keypoint orientation. The lower triangular matrix $\vmat\tighteq\VMatII$ encodes the keypoint's shape and scale. This matrix skews and scales a keypoint's elliptical shape into a unit circle. A keypoint is circular when $a\tighteq{}d$ and $c\tighteq0$.
%If $c\tighteq0$, there is no skew and if $a\tighteq{}d$ the keypoint
% is circular.
The keypoint scale is related to the determinant of this matrix and can be extracted as
%
$\sigma = \frac{1}{\sqrt{\detfn{\vmat}}} = \frac{1}{\sqrt{ad}}$.
All of this information can be encoded in a single affine matrix.
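As a concrete illustration of this parameterization, the short sketch below builds the lower triangular shape matrix $\vmat$ from its entries $(a, c, d)$ and recovers the scale as $\sigma\tighteq1/\sqrt{ad}$; the numeric values are placeholders chosen only for illustration.
\begin{verbatim}
import numpy as np

# A keypoint: location (x, y), lower triangular shape matrix V = [[a, 0], [c, d]],
# and orientation theta (constant under the gravity vector assumption).
x, y, theta = 20.0, 35.0, 0.0
a, c, d = 0.5, 0.1, 0.4

V = np.array([[a, 0.0],
              [c, d]])   # maps the elliptical keypoint onto the unit circle

scale = 1.0 / np.sqrt(np.linalg.det(V))     # equivalently 1 / sqrt(a * d)
is_circular = np.isclose(a, d) and np.isclose(c, 0.0)

print(scale, is_circular)
\end{verbatim}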
\subsubsection{Encoding keypoint parameters in an affine matrix} It will be useful to construct two transformations that encode all keypoint information in a single matrix. The first, $\rvmat$, maps a keypoint in an annotation into a normalized reference frame --- the unit circle. The second transformation, $\inv{\rvmat}$ is the inverse, which warps the normalized reference frame back onto the keypoint. To construct $\rvmat$, the keypoint is centered at the origin $(0, 0)$ using translation matrix, $\mat{T}$. Then $\vmat$ is used to skew and scale the keypoint into a unit circle. Finally, the keypoint's orientation is normalized by rotating $-\theta$ radians using a rotation matrix $\mat{R}$. \begin{equation}\label{eqn:RVTConstruct} %\rvmat=\paren{\invrotMatIII{\paren{-\ori}} \VMatIII \transMATIII{-x}{-y}} \rvmat=\mat{R} \vmat \mat{T} = \rotBigMatIII{\paren{-\ori}} \VBigMatIII \transBigMatIII{-x}{-y} \end{equation} The construction of $\inv{\rvmat}$ is performed similarly. % see vtool.keypoint.get_invVR_mats_oris \begin{equation}\label{eqn:invTVRConstruct} \inv{\rvmat} = \inv{\mat{T}} \inv{\vmat} \inv{\mat{R}} = \transBigMatIII{x}{y} \BIGMAT{ \frac{1}{a} & 0 & 0\\ -\frac{c}{a d} & \frac{1}{d} & 0\\ 0 & 0 & 1 } \rotBigMatIII{\paren{\ori}} \end{equation} \subsubsection{Extracting keypoint parameters from an affine matrix} During the spatial verification step, described in~\cref{sec:sver}, keypoints are warped from one image into the space of another. It will be useful to extract the keypoint parameters from an arbitrary keypoint matrix. This will allow us to directly compare properties of corresponding features. Given an arbitrary affine matrix $\inv{\rvmat}$ representing keypoint $\kp$, we show how the individual parameters $(\pt, \scale, \ori)$ can be extracted. First consider the components of $\inv{\rvmat}$ by simplifying the right side of~\cref{eqn:invTVRConstruct}. \begin{equation}\label{eqn:ArbInvRVTMat} \inv{\rvmat} = \BIGMAT{ e & f & x\\ g & h & y\\ 0 & 0 & 1 } = \BIGMAT{ \frac{1}{a} \cos{(\theta )} & -\frac{1}{a} \sin{(\theta )} & x\\ \frac{1}{d} \sin{(\theta )} - \frac{c}{a d} \cos{(\theta )} & \frac{1}{d} \cos{(\theta )} + \frac{c}{a d} \sin{(\theta )} & y\\ 0 & 0 & 1 } \end{equation} %%--------- The position, scale, and orientation can be extracted from an arbitrary affine keypoint shape matrix $\invvrmat$ as follows: \begin{equation}\label{eqn:affinewarp} \begin{aligned} \pt &= \VEC{x\\y} \\ \scale &= \sqrt{\detfn{\invvrmat}}\\ \ori &= \modfn{\paren{-\atantwo{f, e}}}{\TAU} \end{aligned} \end{equation} %\newcommand{\annotscoreop}{\opname{annot\_score}} %\newcommand{\namescoreop}{\opname{name\_score}} \newcommand{\annotscoreop}{\opname{K_{\tt annot}}} \newcommand{\amechscoreop}{\opname{K_{\csum}}} \newcommand{\fmechscoreop}{\opname{K_{\nsum}}} \section{MATCHING AGAINST A DATABASE OF INDIVIDUAL ANIMALS}\label{sec:baselineranking} To identify a query annotation, it is matched against a database of known \names{}. \Aan{\name{}} is a set of annotations known to depict the same animal. The basic matching pipeline can be summarized in \three{} steps: (1) establish feature correspondences, % (2) score feature correspondences, and % (3) aggregate feature correspondence scores across the \names{}. Correspondences between a query annotation's features and \emph{all} database annotation features are established using an approximate nearest neighbor algorithm. This step also establishes a normalizing feature which is used to measure the distinctiveness of a query feature. 
Each feature correspondence is scored based on the feature weights established in the previous section and a measure of the distinctiveness of the query feature. The feature correspondence scores are then aggregated into a \glossterm{\namescore{}} for each \name{} in the database. The \namescores{} induce a ranking on \names{} in the database where database \names{} with higher ranks are more likely to be correct matches. \subsection{Establishing initial feature correspondence}\label{sub:featmatch} \subsubsection{Offline indexing} Before feature correspondences can be established, an offline algorithm indexes descriptors from all database annotations for fast approximate nearest neighbor search. All database descriptor vectors are stacked into a single array of vectors, % $\AnyDB$, % should this be removed? % and these descriptors are indexed by an inverted index. The inverted index maps each descriptor in the stacked array back to its original annotation and feature. This database array is indexed for nearest neighbor search using a forest of kd-trees~\cite{silpa_anan_optimised_2008} using the FLANN library~\cite{muja_fast_2009}, both of which were reviewed in~\cref{sec:ann}. This allows for the efficient implementation of a \codeobj{neighbor index function} % $\NN(\AnyDB, \desc, \K)$ % %$\NN(\desc, \K)$ % that returns the indices in $\AnyDB$ of the $\K$ approximate nearest neighbors of a query feature's descriptor $\desc$. %that returns the database indices of the $\K$ approximate % nearest neighbors of a query feature's descriptor %$\desc$. \subsubsection{Approximate nearest neighbor search} Matching begins by establishing multiple feature correspondences between each query feature and several visually similar database features. For each query descriptor vector $\desc_i \in \X$ the $\K + \Knorm$ approximate nearest neighbors are found using the \coderef{neighbor index function}. These neighbors sorted by ascending distance are: \begin{equation} \NN(\AnyDB, \desc_i, \K + \Knorm) \eqv \dotarrIII{j}{\K}{\K + \Knorm} %\NN(\desc_i, \K + \Knorm) \eqv \dotarrIII{j}{\K}{\K + \Knorm} \end{equation} The $\K$ nearest neighbors, $\dotsubarr{\desc}{{j_1}}{{j_\K}}$, are the initial feature correspondences to the $i\th{}$ query feature. The remaining $\Knorm$ neighbors,% $\dotsubarr{\desc}{{j_{\K + 1}}}{j_{\Knorm}}$, are candidate normalizers for use in LNBNN scoring. \subsubsection{Normalizer selection} A single descriptor $\descnorm_{i}$ is selected from the $\Knorm{}$ candidate normalizers and used in computing the LNBNN score for all (up to $\K$) of the $i\th{}$ query descriptor's correspondences in the database. The purpose of a normalizing descriptor is to estimate the local density of descriptor space, which can be interpreted as a measure of the query descriptor's distinctiveness \wrt{} the database. The normalizing descriptor is chosen as the most visually similar descriptor to the query that is not a correct match. In other words, the query descriptor's normalizer should be from an individual different from the query. The intuition is there will not be any features in the database that are close to distinctive features in the query except for the features that belong to the correct match. The selection process described in the original formulation of LNBNN is to simply choose the % $\K + 1\th{}$ nearest neighbor, which amounts to setting $\Knorm=1$. The authors of LNBNN find that there is no benefit to using a higher value of $\Knorm$~\cite{mccann_local_2012}. 
However, this does not account for the case when the $\K + 1\th{}$ nearest neighbor belongs to the same class as one or more of the nearest $\K$ neighbors. Therefore, we employ a slightly different selection process. To motivate our selection process, consider the case when there are more than $\K$ images of the same individual from the same viewpoint in the database and a distinctive feature from a new annotation of that individual is being scored. In this case $\K$ correspondences will be correctly established between the distinctive query feature and $\K$ database features. However, if the normalizer is chosen as the $\K + 1$ neighbor, then these correspondences will be inappropriately downweighted.

To gain an intuition of the LNBNN scoring mechanism, consider the example in~\cref{fig:knorm}. The figure shows the case where $\K=3$ and $\Knorm=1$. In this case there are two examples (\cref{sub:knormb,sub:knormc}) of the query individual in the database. Even though there is an incorrect match, the LNBNN scores of the correct matches are an order of magnitude higher than the score for the incorrect match. Now, consider the case where the number of correct matches in the database is greater than $\K$ by setting $\K=1$. In this case the normalizing descriptor is the ``same'' feature as the query feature and the score of the nearest match drops from $0.066$ to $0.007$.

\knorm{}

To avoid this case, a normalizing feature is carefully chosen to reduce the possibility that it belongs to a potentially correct match. More formally, the normalizing descriptor is chosen to be the descriptor with the smallest distance to the query descriptor that is not from the same \name{} as any of the chosen correspondences. Let $\nid_j$ be the \name{} associated with the annotation containing descriptor $\desc_j$. Let
%
$\multiset{N}_i \eqv \curly{\nid_j \where j \in \NN(\AnyDB, \desc_i, \K)}$ be the set of \names{} matched by the $i\th{}$ query feature. The descriptor that normalizes all matches of query descriptor $\desc_i$ is:
\begin{equation}
    \descnorm_{i} \eqv \argmin{\desc_j \in \dotsubarr{\desc}{{j_{\K + 1}}}{j_{\Knorm}}} \elltwosqrd{\desc_j - \desc_i} \where \nid_j \notin \multiset{N}_i
\end{equation}

\subsection{Feature correspondence scoring}

Each feature correspondence is given a score representing how likely it is to be a correct match. While the L2-distance between query and database descriptors is useful for ranking feature correspondences based on visual similarity, the distinctiveness of the match is more useful for ranking the query annotation's similarity to a database annotation~\cite{lowe_distinctive_2004, arandjelovic_dislocation_2014, mccann_local_2012}. However, highly distinctive matches from other objects --- like background matches --- do not provide relevant information about a query annotation's identity and should not contribute to the final score. Therefore, each feature correspondence is scored using a mechanism that combines both distinctiveness and the likelihood that the feature belongs to the foreground. For each feature correspondence $m = (i, j)$ with query descriptor $\desc_i$ and matching database descriptor $\desc_j$, several scores are computed which are then combined into a single feature correspondence score $\fs_{i,j}$.

\subsubsection{LNBNN score}\label{sec:lnbnnscore}

The LNBNN scoring mechanism measures the difference in similarity between the query descriptor and (1) its match and (2) its normalizer $\descnorm_{i}$.
This serves as an estimate of local feature density and measures the distinctiveness of the feature correspondence. A match is distinctive when the query-to-match distance is much smaller than the query-to-normalizer distance, \ie{} the local density of descriptor space around the query is sparse. The LNBNN score of a feature match is computed as follows:
\begin{equation}\label{eqn:lnbnn}
    \fs_{\LNBNN} \eqv \frac{\elltwo{\desc_i - \descnorm_{i}} - \elltwo{\desc_i - \desc_j}}{Z}
\end{equation}
All descriptors used in this calculation are L2-normalized to unit length --- \ie{} to sit on the surface of a unit hypersphere. The $Z$ term normalizes the score to ensure that it is in the range $\rangeinin{0, 1}$. If descriptor vectors have only non-negative components (as in the case of SIFT~\cite{lowe_distinctive_2004}), then the maximum distance between any two L2-normalized descriptors is $Z\tighteq\sqrt{2}$. If descriptor vectors have negative components (like those that might be extracted from a deep convolutional neural network~\cite{zagoruyko_learning_2015}), then the maximum distance between them is $Z\tighteq2$.

\subsubsection{Foregroundness score}

To reduce the influence of background matches, each feature correspondence is assigned a score based on the foregroundness of both the query and database features. The geometric mean of the foregroundness of the query feature, $w_i$, and the database feature, $w_j$, drives the score to $0$ if either is certain to be background.
% show python -m ibeis --db testdb3 --query 325 -y
\begin{equation}
    \fs_{{\tt fg}} \eqv \sqrt{w_i w_j}
\end{equation}

\subsubsection{Final feature correspondence score}

The final score of the correspondence $(i, j)$ captures both the distinctiveness of the match and the likelihood that the match is part of the foreground.
\begin{equation}\label{eqn:featscore}
    \fs_{i,j} \eqv \fs_{{\tt fg}} \fs_{\LNBNN}
\end{equation}

\subsection{Feature score aggregation}\label{subsec:namescore}

So far, each feature in a query annotation has been matched to several features in the database and a score has been assigned to each of these correspondences based on its distinctiveness and foregroundness. The next step in the identification process is to aggregate the scores from these patch-based correspondences into a single \glossterm{\namescore} for each \name{} in the database. Note that this \name-based definition of scoring is a key difference between animal identification and image retrieval, where a score is assigned to each image in the database. In animal identification the analogous concept is an \glossterm{\annotscore} --- a score assigned to each annotation in the database~\cite{philbin_object_2007}. This distinction between annotation-to-annotation scores and annotation-to-\name{} scores is important because the goal of the application is to classify a new query annotation as either a known \name{} or a new \name{}, not to determine which annotations are most similar.

This subsection presents two mechanisms to compute \namescores{}. The first mechanism is \csumprefix{} and computes a \namescore{} in two steps: it aggregates feature correspondence scores into an \annotscore{} for each annotation in the database, and then aggregates the \annotscores{} into a score for each \name{} in the database. The second mechanism is \nsumprefix{}; it aggregates feature correspondence scores matching multiple database annotations directly into a \namescore{}.
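To make the contrast concrete, the sketch below computes both \namescores{} for a small toy set of scored correspondences; each tuple records the query feature location, the matched database annotation, that annotation's \name{}, and a correspondence score as in \cref{eqn:featscore}. The data and structure are illustrative only.
\begin{verbatim}
from collections import defaultdict

# (query_feature_xy, database_annotation, database_name, correspondence_score)
matches = [
    ((10, 12), "annot1", "NAME_A", 0.06),
    ((10, 12), "annot2", "NAME_A", 0.04),   # same query feature, same name
    ((40, 55), "annot2", "NAME_A", 0.03),
    ((40, 55), "annot3", "NAME_B", 0.05),
]

# csum: sum scores per annotation, then take the max annotation score per name.
annot_scores, annot_name = defaultdict(float), {}
for xy, annot, name, score in matches:
    annot_scores[annot] += score
    annot_name[annot] = name
csum = defaultdict(float)
for annot, score in annot_scores.items():
    csum[annot_name[annot]] = max(csum[annot_name[annot]], score)

# nsum: each query feature location votes for a name at most once
# (keep only its best correspondence to that name), then sum per name.
best = {}
for xy, annot, name, score in matches:
    best[(xy, name)] = max(best.get((xy, name), 0.0), score)
nsum = defaultdict(float)
for (xy, name), score in best.items():
    nsum[name] += score

print(dict(csum), dict(nsum))   # NAME_A outranks NAME_B under both mechanisms
\end{verbatim}
Under \csum{} a single strongly matching annotation determines the \namescore{}, while under \nsum{} different query features that match different annotations of the same \name{} can reinforce each other, although any single query feature location still votes for a \name{} at most once.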
These mechanisms are respectively similar to the image-to-image distance and the image-to-class distance described in~\cite{boiman_defense_2008}.

\subsubsection{The set of all feature correspondences}

All scoring mechanisms presented in this subsection are based on aggregating scores from feature correspondences. The set of all feature correspondences for a query annotation $\X$ is expressed as:
\begin{equation}
    \Matches \eqv \{(i, j) \where \desc_i \in \X \AND j \in \NN(\AnyDB, \desc_i, \K)\}
\end{equation}

% SAME AS IMAGE-TO-IMAGE
\subsubsection{Annotation scoring}

An \annotscore{} is a measure of similarity between two annotations. An \annotscore{} between a query annotation and a database annotation is defined as the sum of the scores of the feature correspondences matching features in that database annotation. Let $\daid_j$ be the database annotation containing feature $j$. Let
%
$\Matches_{\daid} \eqv \curly{(i, j) \in \Matches \where \daid_j = \daid}$
%
denote all the correspondences to a particular database annotation. The \annotscore{} between the query annotation $\qaid$ and database annotation $\daid$ is:
\begin{equation}
    \annotscoreop(\qaid, \daid) \eqv \sum_{(i, j) \in \Matches_{\daid}} \fs_{i, j}
\end{equation}

% SAME AS IMAGE-TO-CLASS
\subsubsection{Name scoring (1) --- \csumprefix{}}
%
The \cscoring{} mechanism aggregates \annotscores{} into \namescores{} by simply taking the maximum \annotscore{} over all annotations belonging to a \name{}. In our experiments we refer to this version of \namescoring{} as \csum{}. Let $\nid$ be the set of database annotations with the same \name{}. The \cscore{} between a query annotation and a database \name{} is:
%the maximum over all annotations scores in that \name{}:
\begin{equation}
    \amechscoreop(\qaid, \nid) \eqv \max_{\daid \in \nid} \paren{ \annotscoreop\paren{\qaid, \daid} }
\end{equation}

\subsubsection{Name scoring (2) --- \nsumprefix{}}
%
The \cscoring{} mechanism accounts for the fact that animals will be seen multiple times, but it does not take advantage of the complementary information available when \aan{\name{}} has multiple annotations. The following aggregation mechanism combines scores at the feature level to correct for this. It allows each query feature at a specific location to vote for a given \name{} at most once. Thus, when a query feature (or multiple query features at the same location) corresponds to database features from multiple views of the same animal, only the best correspondence for that feature will contribute to the score. In our experiments we refer to this version of \namescoring{} as \nsum{}.

The first step in computing \aan{\namescore{}} for a specific \name{} is grouping the feature correspondences. Two feature correspondences are in the same group if the query features have the same location and the database features belong to the same \name{}. The next step is to choose the highest scoring correspondence within each group. The sum of the chosen scores is the score for \aan{\name{}}. This procedure is illustrated in~\cref{fig:namematch}.

\newcommand{\MatchesGroup}{\Matches^{G}}

Formally, consider two feature correspondences $\mI\tighteq(\iI, \jI)$ and $\mII\tighteq(\iII, \jII)$. Let $\pt_{\iI}$ and $\pt_{\iII}$ be the $xy$-locations of the query features in each correspondence. Let $\nid_{\jI}$ and $\nid_{\jII}$ be the \names{} of the database annotations containing the matched features.
The group that contains feature correspondence $\mI$ is defined as:
\begin{equation}
    \MatchesGroup_{\mI} \eqv \curly{\mII \in \Matches \where \paren{ \paren{\pt_{\iI} \eq \pt_{\iII}} \AND \paren{\nid_{\jI} \eq \nid_{\jII}} } }
\end{equation}
The correspondence with the highest score in each group is flagged as chosen. Ties are broken arbitrarily.
\begin{equation}
    \ischosen(\mI) \eqv \bincase{ \paren{ \fs_{\mI} > \fs_{\mII} \quad \forall \mII \in \MatchesGroup_{\mI} } \OR \card{\MatchesGroup_{\mI}} \eq 1 }
\end{equation}
Let $\Matches_{\nid} \eqv \{(i, j) \in \Matches \where \nid_j = \nid\}$ denote all the correspondences to a particular \name{}. The \nscore{} of \aan{\name{}} is:
\begin{equation}
    \fmechscoreop(\qaid, \nid) \eqv \sum_{m \in \Matches_{\nid}} \ischosen(m) \; \fs_m
\end{equation}

\namematch{}
\FloatBarrier{}

\section{SPATIAL VERIFICATION}\label{sec:sver}

The basic ranking algorithm treats each annotation as an orderless set of feature descriptors (with a small exception in name scoring, which uses a small amount of spatial information). This means that many of the initial feature correspondences will be spatially inconsistent. Spatial verification removes these spatially inconsistent feature correspondences. Determining which correspondences are inconsistent is done by first estimating an affine transform between the two annotations. Then a projective transform is estimated using the inliers to the affine transform. Finally, any correspondences that do not agree with the projective transformation are removed~\cite{fischler_random_1981, philbin_object_2007}. This process is illustrated in~\cref{fig:sver}. We have reviewed related work in spatial verification in~\cref{subsec:sverreview}.
%In our problem, the animals are seen in a wide variety of poses,
% and projective transforms may not always be sufficient to capture
% all correctly corresponding features.
%Yet, without strong spatial constraints on matching, many
% background features will be spatially verified.
%For now, we proceed with standard techniques for spatial
% verification and evaluate if more sophisticated methods are needed.

\sver{}

\subsection{Shortlist selection}

Standard methods for spatial verification are defined on the feature correspondences between two annotations (images). Normally, a shortlist of the top ranked annotations is passed on to spatial verification. However, in our application we rank \names{}, which may have multiple annotations. In our baseline approach we simply apply spatial verification to the top $N_{\tt{nameSL}}=40$ \names{} and the top $N_{\tt{annotSL}}=3$ annotations within those \names{}.

\subsection{Affine hypothesis estimation}

Here, we compute an affine transformation that removes a majority of the spatially inconsistent feature correspondences. Instead of randomly sampling feature correspondences as in the original RANSAC algorithm~\cite{hartley_multiple_2003}, we estimate affine hypotheses using a deterministic method similar to~\cite{philbin_object_2007, chum_homography_2012}. Given a set of matching features between annotations $\annotI$ and $\annotII$, the shape, scale, orientation, and position of each pair of matching keypoints are used to estimate a hypothesis affine transformation. Each hypothesis transformation warps keypoints from annotation $\annotI$ into the space of $\annotII$. Inliers are estimated using the error in position, scale, and orientation between each warped keypoint and its correspondence.
Each inlier is weighted by its feature correspondence score. The final affine transform is chosen to be the one that produces the maximum weight set of inliers.

% vmat = V here
% V maps from ellipse to u-circle
\newcommand{\AffMat}{\mat{A}}
\newcommand{\HypothSet}{\set{A}}
\newcommand{\AffMatij}{\mat{A}_{i, j}}
\newcommand{\HypothAffMat}{\hat{\mat{A}}}

\subsubsection{Enumeration of affine hypotheses}
%The deterministic set of hypothesis transformations mapping from
% query annotation $\annotI$ to database annotation $\annotII$ is
% computed for each feature correspondence in a match from to an
% annotation.
Let $\Matches_{\annotII}$ be the set of all correspondences between features from query annotation $\annotI$ and database annotation $\annotII$. For each feature correspondence $(i, j) \in \Matches_{\annotII}$, we construct a hypothesis transformation $\AffMatij$ using the matrices $\rvmat_{i}$ and $\inv{\rvmat_{j}}$, which were defined in~\cref{eqn:RVTConstruct,eqn:invTVRConstruct}. The first transformation, $\rvmat_{i}$, warps points from $\annotI$-space into a normalized reference frame. The second transformation, $\inv{\rvmat_{j}}$, warps points in the normalized reference frame into $\annotII$-space. Formally, the hypothesis transformation is defined as $\AffMatij \eqv \inv{\rvmat_{j}}\rvmat_{i}$, and the set of hypothesis transformations is:
\begin{equation}
    \HypothSet \eqv \curly{ \AffMatij \where (i, j) \in \Matches_{\annotII} }
\end{equation}

\subsubsection{Measuring the affine transformation error}

The transformation $\AffMatij$ perfectly aligns the corresponding $i\th{}$ query feature with the $j\th{}$ database feature in the space of the database annotation ($\annotII$). If the correspondence is indeed correct, then we can expect that other corresponding features will be well aligned by the transformation. The next step is to determine how close the other transformed features from the query annotation ($\annotI$) are to their corresponding features in the database annotation ($\annotII$). This can be measured using the error in distance, scale, and orientation for every correspondence.

The following procedure is repeated for each hypothesis transform
%
$\AffMatij \in \HypothSet$.
Note that the following description is in the context of the correspondence between the $i\th{}$ query feature and the $j\th{}$ database feature, and the $i,j$ suffix is omitted for clarity. In this context, the suffixes $\idxI$ and $\idxII$ will be used to index into the feature correspondences. Let $\set{B}_{\idxI} = \curly{\invvrmatI \where (\idxI, \idxII) \in \Matches_{\annotII}}$ be the set of keypoint matrices in the query annotation with correspondences to database annotation $\annotII$. Given a hypothesis transform $\AffMat$, each query keypoint in the set of matches,
%(mapping from the normalized reference frame to feature space)
$\invvrmatI \in \set{B}_{\idxI}$, is warped into $\annotII$-space:
\begin{equation}
    \warp{\invvrmatI} = \AffMat \invvrmatI
\end{equation}
%---
The warped position $\warp{\ptI}$, scale $\warp{\scaleI}$, and orientation $\warp{\oriI}$ can be extracted from $\warp{\invvrmatI}$ using~\cref{eqn:affinewarp}. The warped query keypoint properties are now in $\annotII$-space and can be directly compared to the keypoint properties of their database correspondences.
%Each warped point is checked for consistency with its
% correspondence's $\ptII$, scale $\scaleII$, and orientation
% $\oriII$, in $\annotII$.
The absolute differences in position, scale, and orientation between the $\idxI\th{}$ query feature and the $\idxII\th{}$ database feature with respect to hypothesis transformation $\AffMat$ are measured as follows:
\begin{equation}\label{eqn:inlierdelta}
\begin{aligned}
    \Delta \pt_{\idxI, \idxII}    & \eqv \elltwo{\warp{\ptI} - \ptII}\\
    \Delta \scale_{\idxI, \idxII} & \eqv \max( \frac{\warp{\scaleI}}{\scaleII}, \frac{\scaleII}{\warp{\scaleI}}) \\
    \Delta \ori_{\idxI, \idxII}   & \eqv \min( \modfn{\abs{\warp{\oriI} - \oriII}}{\TAU},\quad \TAU - \modfn{\abs{\warp{\oriI} - \oriII}}{\TAU})
\end{aligned}
\end{equation}

\subsubsection{Selecting affine inliers}

Any keypoint correspondence $(\idxI, \idxII) \in \Matches_{\annotII}$ is considered an inlier \wrt{} $\AffMat$ if its absolute differences in position, scale, and orientation are all within a spatial distance threshold $\xythresh$, scale threshold $\scalethresh$, and orientation threshold $\orithresh$. This is expressed using the function $\isinlierop$, which determines if a match is an inlier:
\begin{equation}\label{eqn:inlierchecks}
\begin{gathered}
    \isinlierop((\idxI, \idxII), \AffMat) \eqv \Delta \pt_{\idxI, \idxII} < \xythresh \AND \Delta \scale_{\idxI, \idxII} < \scalethresh \AND \Delta \ori_{\idxI, \idxII} < \orithresh
\end{gathered}
\end{equation}
The set of inlier matches for a hypothesis transformation $\AffMat$ can then be written as:
\begin{equation}\label{eqn:affinliers}
    \Matches_{\AffMat} \eqv \curly{m \in \Matches_{\annotII} \where \isinlierop(m, \AffMat)}
\end{equation}
The best affine hypothesis transformation, $\HypothAffMat$, maximizes the weighted sum of inlier scores:
\begin{equation}
    \HypothAffMat \eqv \argmax{\AffMat \in \HypothSet} \sum_{(\idxI, \idxII) \in \Matches_{\AffMat}} \fs_{\idxI, \idxII}
\end{equation}

\subsection{Homography refinement}

Feature correspondences that are inliers to the best hypothesis affine transformation, $\HypothAffMat$, are used in the least squares refinement step. This step is only executed if there are at least $4$ inliers to $\HypothAffMat$; otherwise all correspondences from features in query annotation $\annotI$ to features in database annotation $\annotII$ are removed. The refinement step estimates a projective transform from $\annotI$ to $\annotII$. To avoid numerical errors, the $xy$-locations of the correspondences are normalized to have a mean of $0$ and a standard deviation of $1$ prior to estimation. A more comprehensive explanation of estimating projective transformations using point correspondences can be found in~\cite[311--320]{szeliski_computer_2010}. Unlike in the affine hypothesis estimation, where many transformations are generated, only one homography transformation is computed.
Given the set of inliers to the affine hypothesis transform, $\Matches_{\HypothAffMat}$, the least squares estimate of the projective homography transform is:
\begin{equation}
    \HmgMatBest \eqv \argmin{\HmgMat} \sum_{(i, j) \in \Matches_{\HypothAffMat}} \elltwosqrd{\HmgMat \pt_{i} - \pt_{j}}
\end{equation}
Similar to affine error estimation, we identify the subset of inlier feature correspondences $\Matches_{\HmgMatBest} \subseteq \Matches_{\annotII}$. A correspondence is an inlier if the query feature is transformed to within a spatial distance threshold $\xythresh$, orientation threshold $\orithresh$, and scale threshold $\scalethresh$ of its corresponding database feature. For convenience, let $\tohmg{\cdot}$ and $\unhmg{\cdot}$ transform points into and out of homogeneous form (recall that homogeneous form augments an $xy$-point with a $z$ coordinate). For each feature correspondence $(\idxI, \idxII) \in \Matches_{\annotII}$, the query feature position, $\ptI$, is warped from $\annotI$-space into $\annotII$-space.
\begin{equation}
    \warp{\ptI} = \unhmg{\HmgMatBest \tohmg{\ptI}}
\end{equation}
However, because projective transformations are not guaranteed to preserve the structure of the affine keypoints, warped scales and orientations cannot be estimated using~\cref{eqn:affinewarp}. Therefore, we estimate values that serve a similar purpose relative to a reference point.

\subsubsection{Estimation of warped shape parameters}

Because we cannot warp the shape of an affine keypoint using a projective transformation, we instead estimate the warped scale and orientation of the $\idxI\th{}$ query feature using a reference point. Given a single feature correspondence $(\idxI, \idxII) \in \Matches_{\annotII}$, we associate a reference point $\refptI$ with the query location $\ptI$, scale $\scaleI$, and orientation $\oriI$. The reference point is defined to be $\scaleI$ distance away from the feature center at an angle of $\oriI$ radians in $\annotI$-space.
\begin{equation}
    \refptI = \ptI + \scaleI \BVEC{\sin{\oriI} \\ -\cos{\oriI}}
\end{equation}
To estimate the warped scale and orientation, first the reference point is warped from $\annotI$-space into $\annotII$-space.
\begin{equation}
    \warp{\refptI} = \unhmg{\HmgMatBest \tohmg{\refptI}}
\end{equation}
%-----------
Then we compute the residual vector $\ptres$ between the warped point and the warped reference point:
\begin{equation}
    %\Delta \warp{\refptI} = \BVEC{\Delta \warp{\inI{x}} \\ \Delta \warp{\inI{y}}} = \warp{\ptI}- \warp{\refptI}.
    \ptres = \BVEC{\xres \\ \yres} = \warp{\ptI} - \warp{\refptI}.
\end{equation}
The warped scale and orientation are estimated using the length and angle of the residual vector. Recall that the warped location can be computed exactly. In summary, the warped location, scale, and orientation of the $\idxI\th{}$ query feature are:
\begin{equation}\label{eqn:homogwarp}
\begin{aligned}
    \warp{\ptI}    &\eqv \unhmg{\HmgMatBest \tohmg{\ptI}} \\
    \warp{\scaleI} &\eqv \elltwo{\ptres}\\
    %\warp{\oriI} &= \atantwo{\yres, \xres} - \frac{\TAU}{4}.
    %\warp{\oriI} &= \atantwo{\yres, \xres} - \frac{\pi}{2}.
    % is this adjustment right?
    \warp{\oriI}   &\eqv \atantwo{\yres, \xres}
\end{aligned}
\end{equation}

\subsubsection{Homography inliers}

The rest of homography inlier estimation is no different from affine inlier estimation.
\Cref{eqn:inlierdelta} is used to compute the errors $(\Delta \pt_{\idxI, \idxII}, \Delta \scale_{\idxI, \idxII}, \Delta \ori_{\idxI, \idxII})$ between the warped query location, scale, and orientation, $(\warp{\ptI}, \warp{\scaleI}, \warp{\oriI})$, and the corresponding database location, scale, and orientation, $({\ptII}, {\scaleII}, {\oriII})$. The final set of homography inliers is given as:
\begin{equation}\label{eqn:homoginliers}
    \Matches_{\HmgMatBest} \eqv \curly{m \in \Matches_{\annotII} \where \isinlierop(m, \HmgMatBest)}
\end{equation}
Spatial verification results in a reduced set of inlier feature correspondences from the query annotation to the database annotations. The \namescoring{} mechanism from~\cref{subsec:namescore} is then applied to these inlier feature correspondences. This final per-name score is the output of the identification algorithm and is used to form a ranked list that is presented to a user for review.

\FloatBarrier{}

\section{EXEMPLAR SELECTION}\label{sec:exempselect}

To scale one-vs-many matching to larger databases and to allow the LNBNN mechanism to find appropriate normalizers, we restrict the number of examples of each individual in the database to a set of exemplars. Exemplars that represent a wide range of viewpoints and poses are automatically chosen using a modified version of the technique presented in~\cite{oddone_mobile_2016}. The idea is to treat exemplar selection as a maximum weight set cover problem. For each individual, the input is a set of annotations. A similarity score is computed between pairs of annotations. To compute covering sets we first choose a threshold; each annotation's covering set is then itself together with the other annotations it matches with a similarity score above that threshold. The maximum number of exemplars is restricted by setting a maximum weight. Searching for the optimal set cover is NP-hard; therefore, we use the greedy
%
$(1 - \frac{1}{e})$-approximation algorithm~\cite{michael_guide_1979}.
The algorithm is run for several iterations in order to find a good threshold that minimizes the difference between the weight of the set cover and the maximum weight limit. The similarity score between annotations can be computed using the LNBNN scores, but a better choice is the algorithm we will later describe in~\cref{chap:pairclf}, which produces the probability that a pair of annotations correctly matches.

\section{EXPERIMENTS}\label{sec:rankexpt}

This section presents an experimental evaluation of the identification algorithm using annotated images of plains zebras, Grévy's zebras, Masai giraffes, and humpback whales. The input to each experiment is (1) a dataset, (2) a subset of query and database annotations from the dataset, and (3) a pipeline configuration. The datasets are described in~\cref{sub:datasets}. The subsets of query and database annotations are carefully chosen to measure the accuracy of the algorithm under different conditions and to control for time, quality, and viewpoint. The pipeline configuration is a set of parameters --- \eg{} the level of feature invariance, the number of nearest neighbors, and the \namescoring{} mechanism --- given to the identification algorithm. We will vary these pipeline parameters in order to measure their effect on the accuracy of the ranking algorithm. For each query annotation, the identification algorithm returns a ranked list of \names{} with a score for each name.
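The experiments below summarize these ranked lists with rank-based accuracy. The following sketch shows how such values --- the fraction of queries whose correct \name{} appears at or above a given rank --- can be computed from toy ranked lists and \groundtruth{} labels; all data here is illustrative.
\begin{verbatim}
# Toy ranked results: for each query, the ranked list of database names
# and the query's true name.
ranked_lists = {
    "query1": ["NAME_A", "NAME_C", "NAME_B"],
    "query2": ["NAME_B", "NAME_A", "NAME_C"],
    "query3": ["NAME_C", "NAME_B", "NAME_A"],
}
true_names = {"query1": "NAME_A", "query2": "NAME_A", "query3": "NAME_C"}

def rank_accuracy(ranked_lists, true_names, max_rank=3):
    """Fraction of queries whose correct name appears at rank <= r."""
    hits = [0] * max_rank
    for query, ranking in ranked_lists.items():
        if true_names[query] in ranking:
            rank = ranking.index(true_names[query])   # 0-based rank
            for r in range(rank, max_rank):
                hits[r] += 1
    return [h / len(ranked_lists) for h in hits]

print(rank_accuracy(ranked_lists, true_names))  # first value is rank-1 accuracy
\end{verbatim}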
The accuracy of identification is measured using the cumulative match characteristic~\cite{decann_relating_2013}, which can be understood as the probability that a query correctly finds a match at a specified rank under the assumption that a correct match exists in the database. We are primarily concerned with the first point in this curve --- the fraction of queries with a correct result at rank $1$ --- because often a user of the system will only review the first result of a query.

The outline of this section is as follows. First, \cref{sub:datasets} introduces and describes each dataset. Our first experiment in~\cref{sub:exptbase} establishes the accuracy of our ranking algorithm on several datasets using a default pipeline configuration. We then compare our approach to an alternative SMK approach in~\cref{sub:exptsmk}. The next subsections perform in-depth experiments on the parameter settings of our algorithm. \Cref{sub:exptfg} tests the effect of the foregroundness weight on identification accuracy. \Cref{sub:exptinvar} investigates the effect of the level of feature invariance and viewpoint. \Cref{sub:exptscoremech} compares the \csumprefix{} and the \nsumprefix{} \namescoring{} mechanisms. \Cref{sub:exptk} varies the $\K$ parameter (the number of nearest neighbors used in establishing feature correspondences) and investigates the relationship between $\K$ and database size in terms of both the number of annotations and the number of exemplars per name. \Cref{sub:exptfail} discusses the failure cases of our ranking algorithm. Finally,~\cref{sub:exptsum} summarizes this section.

\subsection{Datasets}\label{sub:datasets}

All the images in the datasets used in these experiments were taken by photographers in the field. Each dataset is labeled with \groundtruth{} in the form of annotations with name labels. Annotations (bounding boxes) have been drawn to localize animals within the image. A unique \name{} label has been assigned to all annotations with the same identity. Some of this \groundtruth{} labeling was generated independently. However, large portions of the datasets were labeled with assistance from the ranking algorithm. While this may introduce some bias in the results, there was no alternative because the amount of time needed to independently and completely label a large dataset is prohibitive, and fully manual labeling is error prone.

There are two important things to note before we describe each dataset. First, in order to control for challenging factors in the images such as quality and viewpoint, some experiments sample subsets of the datasets we describe here. Second, labeling errors exist in some datasets.

\DatabaseInfo{}
\timedist{}

The number of names, annotations, and their distribution within each database are summarized in the following tables. In these tables we distinguish between \glossterm{singleton} and \glossterm{resighted} names. Singleton names are individuals sighted only once, \ie{} they contain only a single encounter. Resighted names contain more than one encounter. We make this distinction because resighted names have correct matches across a significant time delta. Note that singleton names may still have more than one annotation, but those annotations are all from the same encounter. We have pre-filtered each database to remove annotations that are unidentified, are missing timestamps, or are labeled as ``junk'' quality.
\Cref{tbl:DatabaseStatistics} summarizes the number of annotations and individuals in each database as well as the number of times (distinct encounters) each individual was sighted. \Cref{tbl:AnnotationsPerQuality} summarizes the quality labels of the annotations. \Cref{tbl:AnnotationsPerViewpoint} summarizes the viewpoint labels of the annotations. Distributions of when the images in each dataset were taken are illustrated in~\cref{fig:timedist}. The name and a short description of each dataset are given in the following list.

\FloatBarrier{}
\begin{itemln}
\item \textbf{Plains zebras}.
Our plains zebras dataset is an aggregate of several smaller datasets. There is variation in how the data was collected and preprocessed. Some images are cropped to the flank of the animal, while others are cropped to encompass the entire body. The constituent datasets were collected in Kenya at several locations including Nairobi National Park, Sweetwaters, and Ol Pejeta. More than $90\percent$ of the \groundtruth{} generated for this dataset was assisted by the matching algorithm. This dataset contains many imaging challenges including occlusion, viewpoint, pose, quality, and time variation. There are some annotations in this dataset without quality or viewpoint labels, and some images contain undetected animals. This data was collected between $2006$ and $2015$, but the majority of the data was collected in $2012$--$2015$.

\item \textbf{Grévy's zebras}.
This is another aggregated dataset. The original \groundtruth{} for this dataset was generated independently of the ranking algorithm; however, the ranking algorithm revealed several \groundtruth{} errors that have since been corrected. The Grévy's dataset was collected in Mpala, Kenya. Most of the annotations in this database have been cropped to the animal's flank. This dataset contains a moderate degree of pose and viewpoint variation and occlusion. This data was collected between $2003$ and $2012$, but the majority was collected in $2011$ and $2012$.

\item \textbf{Masai giraffes}.
These images of Masai giraffes were all taken in Nairobi National Park during the \GZC{} between February $20$, $2015$ and March $2$, $2015$. All \groundtruth{} was established using the ranking algorithm followed by manual verification. This dataset contains a high degree of pose and viewpoint variation and occlusion. Because of their long necks, it is difficult to ensure that only a single giraffe appears in each annotation. This results in many \glossterm{photobombs} --- pairs of annotations where a background animal in one annotation matches the foreground animal in the other --- when matching.

\item \textbf{Humpbacks}.
The humpback dataset was collected by FlukeBook~\cite{levenson_flukebook_2015} over nearly $15$ years. Images were contributed by both marine and citizen scientists. The original \groundtruth{} was established using both manual and automated methods that are disjoint from the techniques considered here; however, our software was used to correct mistakes. The annotations in this dataset have not been manually reviewed. Some are cropped to the fluke while others encompass the entire image. Quality and viewpoint labels do not exist for this dataset.
\end{itemln}
\FloatBarrier{}

\subsection{Baseline experiment}\label{sub:exptbase}

This first experiment determines the accuracy of the identification algorithm using the baseline pipeline configuration.
The baseline pipeline configuration uses affine invariant features oriented using the gravity vector, $\K\tighteq4$ as the number of feature correspondences assigned to each query feature, and \nscoring{} (\nsum{}). In this test we control for several biases that may be introduced by carefully selecting a subset of our datasets. We only use annotations that (1) are known (\ie{} have been assigned a name), (2) are comparable to the species' primary viewpoint (\eg{} left, front-left, and back-left for plains zebras), and (3) have not been assigned a quality of ``junk''. Furthermore, to account for the fact that some \names{} contain more annotations than others, we constrain our data selection such that there is only one correct exemplar in the database for each query annotation. The selected annotations are grouped into encounters, and from each encounter we sample the single annotation with the highest quality. Names with only one encounter are added to the database as distractors. For the other names, we randomly sample two encounters --- regardless of quality --- one for the database and one to use as a query. This defines a set of query and database annotations that are separated by time, testing the ability of our system to match animals across gaps of time using only a single image per individual.

The CMC curves for this baseline test are illustrated in~\cref{fig:BaselineExpt}. The results of this baseline experiment demonstrate that our algorithm is able to reliably find matching annotations in a database with many other images. The accuracy is over $60\percent$ for all species considered. Subsequent experiments will restrict our focus to Grévy's and plains zebras in order to investigate detailed parameter choices of the ranking algorithm and alternative ranking algorithms.

\BaselineExpt{}
\FloatBarrier{}

\subsection{SMK as an alternative}\label{sub:exptsmk}

Before we investigate the parameter choices of the LNBNN ranking algorithm, we briefly evaluate the performance of an alternative ranking algorithm, namely VLAD-flavored SMK. The SMK algorithm is a vocabulary-based algorithm that is representative of more traditional approaches to instance recognition problems. In contrast to the raw descriptors used in LNBNN, SMK assigns each descriptor to a visual word and builds a weighted histogram of visual words (accumulated residual vectors in the case of VLAD) to represent each annotation as a sparse fixed-length vector. We have experimented with several configurations of VLAD and report our best results here.

In our SMK implementation, we pre-trained an $8000$ word vocabulary using mini-batch k-means on the stacked descriptors from all database annotations. Note that typically the vocabulary is trained using a disjoint external dataset in order to prevent overfitting. However, we \naively{} train using the database annotations to be indexed, understanding that this will inflate the accuracy measurements. Each word in the vocabulary is weighted with its inverse document frequency. We use the vocabulary to compute an inverted index that maps each visual word to the database annotations containing that word. Initial feature correspondences for a descriptor are computed by assigning the descriptor to a single visual word and then creating a correspondence to every feature in that word's inverted index. We use spatial verification to filter spatially invalid correspondences and re-score the remaining matches. The results of the SMK experiment are illustrated in~\cref{fig:SMKExpt}.
The query and database annotations are the same in each experiment. Despite the bias in the SMK vocabulary, our measurements show that LNBNN provides more accurate rankings. For plains zebras there is a difference of $8\percent$ in the number of correct matches at rank $1$, and for Grévy's zebras the difference is $6\percent$.

\SMKExpt{}
\FloatBarrier{}

\subsection{Foregroundness experiment}\label{sub:exptfg}

In this experiment we test the effect of our foregroundness weights --- weighting the score of each feature correspondence with a foregroundness weight --- on identification accuracy. When foregroundness is enabled (\pvar{fg=T}), each feature correspondence is weighted using a foregroundness measure learned using a deep convolutional neural network~\cite{parham_photographic_2015}. When disabled (\pvar{fg=F}), the weight of every correspondence effectively becomes $1$.

Running this experiment using the query / database sample outlined in the baseline experiment does not result in a noticeable difference in scores because the purpose of the foregroundness measure is to downweight matches between scenery objects (\eg{} trees, grass, bushes) that appear in multiple annotations. The baseline database sample contains only a single image from each encounter and only two encounters per individual. This means that it is unlikely for an annotation in the query set and another annotation in the database set to have a similar background. To more clearly illustrate the effect of the foregroundness measure we use a different sampling strategy. We group all encounters by which occurrence they belong to. Annotations within the same occurrence are more likely to share background. We sample query and database annotations from within occurrences to simulate matching annotations within an encounter. We do not limit the number of exemplars in this test, to ensure that annotation pairs that share common scenery exist. We perform this test over multiple occurrences and aggregate the results. Therefore, the reported database size will be an average, and the query size is the sum of all unique query annotations.

The accuracy of the foregroundness experiment is illustrated in~\cref{fig:FGIntraExpt}. The results show that using foregroundness weights improves the number of correct results at rank $1$ by a significant margin for both species. At higher ranks the \pvar{fg=T} line occasionally dips below the \pvar{fg=F} line because the foregroundness mask sometimes covers distinguishing keypoints, but this is neither significant nor common. Therefore, we find it beneficial to always include foregroundness when a trained estimator is available.

\FGIntraExpt{}
\FloatBarrier{}

\subsection{Invariance experiment}\label{sub:exptinvar}
%
In this experiment we vary the feature invariance configuration. This influences the location, shape, and orientation of keypoints detected in each annotation, which in turn influences which regions in each annotation are matchable using the SIFT descriptors extracted at each keypoint. The best invariance settings will depend on properties of the data. In our experiments we test different settings by enabling (denoted as T) or disabling (denoted as F) two parameters: affine invariance (AI) and our query-side rotation heuristic (QRH). Initially we also tested rotation invariance, but found that it provided the poorest results for all datasets by a significant margin, likely because the gravity vector assumption is mostly satisfied in all images.
Therefore, we exclude rotation invariance from our experiments.
In configurations where \pvar{AI=F}, each keypoint is circular with a radius defined by the scale at which it was detected.
When \pvar{AI=T}, the keypoint shape is adapted into an ellipse to normalize for small viewpoint changes.
When \pvar{QRH=F}, each keypoint is assigned its orientation as normal, but when \pvar{QRH=T}, each keypoint in a query annotation is replaced by three keypoints: the original keypoint, one rotated slightly to the left, and one rotated slightly to the right.
The four specific configurations that we test are outlined in the following list:

\FloatBarrier{}
\begin{itemln}
\item \NoInvar{} (\pvar{AI=F,QRH=F}): %
This configuration uses circular keypoints and assumes the gravity vector.
\item \AIAlone{} (\pvar{AI=T,QRH=F}): %
This is the baseline setting that assumes the gravity vector and where each feature's shape is skewed from a circle into an ellipse.
\item \QRHCirc{} (\pvar{AI=F,QRH=T}): %
This is a novel invariance heuristic where each {database} feature assumes the gravity vector, but each {query} feature is assigned $3$ orientations: the gravity vector and two other orientations at $\pm15\degrees$ from the gravity vector.
Ideally, this will allow feature correspondences to be established between features seen from slightly different orientations.
\item \QRHEll{} (\pvar{AI=T,QRH=T}): %
This is the combination of \QRHCirc{} and \AIAlone{}.
\end{itemln}
\FloatBarrier{}

% Invar Conclusions
The example in~\cref{fig:kptstype} illustrates the difference between \AIAlone{} and \QRHCirc{} features for plains and Grévy's zebras.
The accuracy of the invariance experiment is shown in~\cref{fig:InvarExpt}.
For plains zebras, the \QRHCirc{} scores are significantly better than all other invariance settings.
Interestingly, affine invariance results in worse performance when the QRH is on, but when the QRH is off affine invariance improves accuracy.
This suggests that, for the coarse patterns seen on plains zebras, the QRH handles pose and viewpoint variations better than affine invariance, which tends to adapt itself around non-distinctive diagonal stripes.
Even though affine keypoints provide more precise localization, the area they describe is often smaller than that of a circular keypoint, so they do not describe coarse features as well as circular keypoints do.
%
The results for Grévy's zebras demonstrate similar levels of accuracy for \AIAlone{} and \QRHEll{}.
Affine invariance seems to be the most important setting for matching Grévy's zebras.
The distinctive details on Grévy's zebras are finer than those on plains zebras and are well captured by affine keypoints.
While the QRH does improve accuracy for Grévy's zebras, the density of distinctive keypoints makes it less important because it is more likely that two annotations will have at least one distinctive region aligned and in common.

\InvarExpt{}
\kptstype{}
\FloatBarrier{}

\subsection{Scoring mechanism experiment}\label{sub:exptscoremech}
% Database setup for name scoring
The purpose of the scoring mechanism is to aggregate scores of individual feature correspondences across multiple annotations into a single score for each name --- \ie{} an annotation-to-name similarity, which is analogous to the image-to-class distance used in~\cite{boiman_defense_2008}.
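To make this aggregation step concrete, the following Python sketch contrasts two plausible strategies: one that takes the best single-annotation total for each name, and one that lets each query feature contribute only its best correspondence to any annotation of a name.
The precise \csum{} and \nsum{} definitions are those given in~\cref{subsec:namescore}; the functions and data layout shown here are illustrative assumptions only.
\begin{verbatim}
# Hedged sketch of two name-scoring strategies (illustrative only).
from collections import defaultdict

def annotation_scores(correspondences):
    """Sum correspondence scores per database annotation.

    `correspondences` is a list of (query_feat_id, db_annot_id, score).
    """
    totals = defaultdict(float)
    for _, aid, score in correspondences:
        totals[aid] += score
    return totals

def name_score_annot_max(correspondences, annot_to_name):
    """Name score = best single-annotation total (csum-like)."""
    annot_totals = annotation_scores(correspondences)
    name_totals = defaultdict(float)
    for aid, total in annot_totals.items():
        nid = annot_to_name[aid]
        name_totals[nid] = max(name_totals[nid], total)
    return name_totals

def name_score_combined(correspondences, annot_to_name):
    """Name score = sum over query features of each feature's best
    correspondence to any annotation of that name (nsum-like)."""
    best = {}  # (query_feat_id, name_id) -> best correspondence score
    for qfid, aid, score in correspondences:
        key = (qfid, annot_to_name[aid])
        best[key] = max(best.get(key, 0.0), score)
    name_totals = defaultdict(float)
    for (_, nid), score in best.items():
        name_totals[nid] += score
    return name_totals
\end{verbatim}
The second strategy can pool evidence from several exemplars of the same individual, which is the behavior the following experiment is designed to probe.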
We test the identification accuracy of the two name scoring mechanisms that were described earlier in~\cref{subsec:namescore}: (1) \cscoring{} (denoted as \csum{}) and (2) \nscoring{} (denoted as \nsum{}).
Because the scoring mechanism is meant to take advantage of multiple database annotations, we vary the number of \exemplars{} per database \name{} (\pvar{dpername}) between $1$, $2$, and $3$.
Varying the number of \exemplars{} will cause each database to contain a different number of annotations.
To normalize differences in database size we include additional confuser annotations (annotations that do not match any query) in the smaller databases to maintain a constant database size across experiments.
Each \exemplar{} is chosen from a separate encounter.

The accuracy of the scoring mechanism experiment is shown in~\cref{fig:NScoreExpt}.
The results of this test suggest that \nsum{} produces slightly more accurate rankings, but the overall difference in accuracy is relatively small (about $1$--$3\percent$).
Intuitively, \nsum{} scoring should produce better ranks because it can combine scores from multiple correspondences to different correct annotations.
Note that when $\pvar{dpername}=1$, the \csum{} and \nsum{} scores may still differ because multiple correspondences to the same name can be generated when $\K > 1$ or when \pvar{QRH=T}.

Perhaps the more interesting result of this experiment is the effect of increasing the number of exemplars in the database from $1$ to $2$.
There is a drastic improvement in ranking accuracy for both species.
The accuracy for plains zebras increases by $10\percent$, and for Grévy's zebras the gain is almost $20\percent$.
It makes sense that this should be the case.
If there are more examples of an individual in the database, then the probability that the query is similar to at least one of them should increase as long as there is sufficient variation.
This suggests that even if a new query annotation initially fails to rank the correct result, subsequent annotations of the same individual added to the system will be more likely to correctly match a previous annotation.
As more annotations of that individual are added, the likelihood that the ranking algorithm will make a connection between all instances of that individual increases.

\NScoreExpt{}
\FloatBarrier{}

\subsection{K experiment}\label{sub:exptk}
% Introduce varied parameters
In this experiment we investigate the effect of $\K$ (the number of nearest neighbors used in establishing feature correspondences, which was discussed in~\cref{sub:featmatch}) on identification accuracy.
We vary $\K$ over the values $1$, $2$, $4$, and $6$.
In all of these experiments we set the number of normalizing neighbors to $\Knorm=1$.
Two database factors that may influence the best choice of $\K$ are the number of annotations in the database and the number of annotations per name in the database.
If there are more correct matches for a query annotation, it would be beneficial to allow it to match more annotations.
Likewise, if there are more annotations in the database overall, then it might be beneficial to search deeper into the database descriptors to find the correct matches.
Therefore, in addition to varying $\K$, we also vary the number of exemplars per name (\pvar{dpername}) and the overall number of annotations in the database (\pvar{dsize}).
We use a protocol similar to the one used in the scoring mechanism experiment to sample databases.
The difference is that we use the extra confuser annotations to vary the total number of annotations in the database.
However, controlling for these factors constrains the number of annotations we can use.
For Grévy's zebras, we can vary the total database size between $476$ and $774$.
For plains zebras, we have more confuser annotations, allowing us to test database sizes of $578$ and $1650$.
We vary the number of \exemplars{} per name between $1$ and $2$.

The results of this experiment are illustrated in~\cref{fig:KExptA,fig:KExptB}.
Similar to the previous experiment, the number of exemplars per name is the most significant variable impacting accuracy.
Furthermore, when there are more exemplars in the database, the choice of $\K$ becomes less significant.
The results also show that accuracy does decrease slightly when the database becomes larger, but the magnitude of the decrease is between $1\percent$ and $3\percent$.
Interestingly, the optimal choice of $\K$ is not consistent between species when there is only one exemplar per name.
For Grévy's zebras a lower $\K$ produces better results, but for plains zebras there is a significant loss when $\K=1$ and the database size is large.
This is likely due to the nature of the distinguishing patterns on the different zebras.
When matching the detailed patterns of Grévy's zebras, it is better to use a low $\K$ to reduce noise, but for the coarser patterns of plains zebras a low $\K$ might not find a correct match immediately.
Thus, the choice of $\K$ is a trade-off between precision and recall that depends on the type of texture patterns being matched.
Overall, the experiments on the setting of $\K$ do not yield a definitive choice for this parameter.
However, it appears that $\K$ has only a small influence on identification accuracy.
This section does show that the number of exemplars per name has a significant impact on identification accuracy.

\KExptA{}
\KExptB{}
\FloatBarrier{}

\subsection{Failure cases}\label{sub:exptfail}

We now investigate the causes of identification failure and consider example failure cases.
When investigating the cause of a failure case we consider both (1) the matches between the query annotation and the incorrect \name{} at rank $1$ and (2) the matches between the query annotation and the correct \name{} that appears further down the ranked list.
We identify the three main failure cases as: (1) unaligned annotations, (2) quality factors, and (3) non-primary correspondences.
The remainder of this subsection defines, discusses, and provides examples of these failure cases.

\FloatBarrier{}
\subsubsection{Alignment}

When two annotations are not aligned (ignoring translation and small scale differences), there can be significant differences in appearance that cause inconsistency in feature localization and description.
There are two major causes of alignment error: (1) viewpoint variations, which cause out-of-plane rotations, and (2) pose variations, which can cause local non-rigid, non-linear transformations of distinguishing features.
These issues cause variations in feature description that render the approximate nearest neighbor algorithm unable to establish the correct correspondences.
Furthermore, non-projective transformations between annotations can cause homography-based spatial verification to discard correctly established correspondences.
Failing to establish correspondences and incorrectly discarding them ultimately results in identification failure.
The example in~\cref{fig:FailViewpoint} illustrates a failure case due to a difference in viewpoint and pose.
To address matching across different viewpoints and poses it helps to choose an appropriate level of feature invariance (such as affine invariance and the query-side rotation heuristic), but these only work up to a point.
However, in practice the animal identification problem is not a one-shot identification challenge.
Given multiple annotations of an individual, we expect that the matching algorithm will be able to overcome viewpoint and pose differences by matching annotations with intermediate poses.

\FailViewpoint{}
\FloatBarrier{}

\subsubsection{Quality factors}

Factors such as low resolution, blurring, poor exposure, lighting (shadows / non-uniform illumination), and occlusion can significantly reduce the density of distinctive features on an annotation.
Note that lighting and occlusion (scene quality factors) should be distinguished from the other factors (capture quality factors) because they are related to the scene itself rather than a poor capturing of that scene.
Annotations with low capture quality tend to generate fewer, larger, less distinct, and distorted features.
Annotations with low scene quality tend to have their distinguishing features distorted or masked by grass, tree branches, shadows, and other animals.
This is a significant problem for species with relatively few distinctive features, such as plains zebras.
The example in~\cref{fig:FailOcclusion} illustrates a failure case due to occlusion, and \cref{fig:FailQuality} illustrates a failure case due to low resolution.
In some cases low quality annotations can still be matched, but in the worst case all distinctive features are missing and there is no way to visually identify the individual.
Therefore, the best way to handle these annotations is either to ignore them entirely or to first attempt to match them and then discard them if they cannot be matched.

\FailOcclusion{}
\FailQuality{}
\FloatBarrier{}

\subsubsection{Non-primary correspondences}

Sometimes an annotation bounding box cannot be placed tightly around an animal (this happens often for some species, like giraffes), which means that other objects will appear in the background.
Similarly, objects that occlude the animal will be in the foreground.
Ideally, the primary animal in each annotation would be segmented, but when simply matching raw annotations non-primary correspondences may be formed.
This results in the photobomb and scenery-match failure cases.
Photobombs are caused by correct correspondences to a non-primary animal (seen either in the foreground or background) in an annotation.
Likewise, scenery matches are caused by matches in the background landscape.
Both cases are most commonly caused by pairs of annotations from the same occurrence, but photobombs can occur over larger time deltas.
The example in~\cref{fig:FailPhotobomb} illustrates a failure case due to a photobombing background animal, and \cref{fig:FailScenery} illustrates a scenery match.
For plains and Grévy's zebras, most scenery matches are eliminated using the foregroundness measure, but the problem remains in databases without a trained foregroundness estimator.
Accounting for photobombs is a more challenging problem because a simple patch-based classifier cannot distinguish a primary feature from a secondary feature without information about the animal's identity.
However, photobomb matches exhibit certain patterns that we seek to take advantage of later in~\cref{sec:learnpb}.

\FailScenery{}
\FailPhotobomb{}
\FloatBarrier{}

\subsection{Experimental conclusions}\label{sub:exptsum}

In this section we have evaluated our ranking algorithm on multiple species, compared it to an alternative ranking algorithm, and evaluated detailed parameter choices.
Our experiments were performed under restrictive conditions to control for the effect of time, database size, and number of exemplars.
Based on the results of these experiments we are able to make several observations and conclusions.

Our experiments comparing the SMK and LNBNN ranking algorithms demonstrated that LNBNN achieves better ranking accuracy.
LNBNN does not quantize descriptors and therefore is able to distinguish more subtle descriptor details.
Because LNBNN does not require an expensive pre-training phase, it is well suited to ranking databases at the scales considered in this thesis.
However, we note that SMK is more efficient at larger scales, and it may be necessary to consider it when databases become very large.

% NUM EXEMPLARS MATTERS
In most experiments we evaluated our ranking algorithm as if it were addressing a single-shot identification problem.
This was to establish the performance of the algorithm when an individual has only been seen once before.
However, in practice this will not be the norm.
The name scoring and $\K$ experiments demonstrated that the ranking accuracy significantly increases with the number of exemplars per database name.
We will use this observation to address the challenges of matching through viewpoint and occlusion by taking advantage of multiple images of an individual in~\cref{chap:graphid}.

%\item \textbf{Invariance settings are data dependent}:
We also saw that the best choice for feature invariance is data dependent.
The invariance experiment demonstrated that affine invariance produces better results for Grévy's zebras, whereas circular keypoints lead to more accurate results for plains zebras.
This experiment also showed that the query-side rotation heuristic improves accuracy by adding a small amount of orientation invariance to feature localization.
Likewise, the $\K$ experiment showed that identification accuracy is not significantly influenced by the choice of $\K$ for plains zebras, but for Grévy's zebras the most accurate results were obtained with $\K\tighteq1$.
This is likely because the features from plains zebras are less distinguishing than features from Grévy's zebras, hence the correct match of a plains zebra feature is less likely to be its closest neighbor.
Both the choice of $\K$ and the invariance settings should be evaluated on a per-dataset basis, and there is likely a benefit to performing identification using multiple parameter choices.

\section{RANK-BASED IDENTIFICATION SUMMARY}\label{sec:staticsum}

In this chapter we have addressed the problem of animal identification using a computer-assisted algorithm based on LNBNN that ranks a labeled database of \names{} by their similarity to a single query annotation.
This algorithm begins by extracting local patch-based features from cropped and normalized chips.
Features from database annotations are indexed for fast nearest neighbor search using a kd-tree.
A mechanism based on LNBNN is used to compute a matching score for each database annotation.
A shortlist of the top-scoring results then has its feature correspondences spatially verified and re-scored.
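For concreteness, the following Python sketch outlines the pipeline just summarized: database descriptors are stacked into a kd-tree, each query descriptor retrieves $\K$ candidate matches plus a normalizing neighbor, correspondences are scored in the LNBNN style (distance to the normalizer minus distance to the match), and annotation scores are aggregated into a ranked list of names.
This is a simplified, illustrative stand-in for the implementation described in this chapter; spatial verification is left as an abstract hook, and all names are assumptions.
\begin{verbatim}
# Simplified sketch of the LNBNN ranking pipeline (illustrative only).
import numpy as np
from collections import defaultdict
from scipy.spatial import cKDTree

def build_index(db_descriptors, db_feat_annot_ids):
    """Stack database descriptors into a kd-tree for fast NN search."""
    return cKDTree(db_descriptors), np.asarray(db_feat_annot_ids)

def lnbnn_annot_scores(query_descriptors, tree, feat_annot_ids, K=4, Knorm=1):
    """Score each database annotation by aggregating LNBNN-style
    correspondence scores: dist(normalizer) - dist(match)."""
    dists, idxs = tree.query(query_descriptors, k=K + Knorm)
    scores = defaultdict(float)
    for drow, irow in zip(dists, idxs):
        norm_dist = drow[-1]                    # normalizing neighbor
        for d, i in zip(drow[:K], irow[:K]):    # K candidate matches
            scores[feat_annot_ids[i]] += max(norm_dist - d, 0.0)
    return scores

def rank_names(annot_scores, annot_to_name, verify=None):
    """Aggregate annotation scores into name scores and sort.  The
    optional `verify` hook stands in for spatial verification and
    re-scoring of a shortlist."""
    if verify is not None:
        annot_scores = verify(annot_scores)
    name_scores = defaultdict(float)
    for aid, score in annot_scores.items():
        name_scores[annot_to_name[aid]] += score
    return sorted(name_scores.items(), key=lambda kv: kv[1], reverse=True)
\end{verbatim}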
We have shown how this algorithm can be applied to individual animal identification and demonstrated that in a majority of cases the correct match is ranked first by our algorithm for plains zebras, Grévy's zebras, Masai giraffes, and humpback whales.
Because we have used the algorithm to curate the \groundtruth{}, we do not claim the reported accuracies in our experiments to be quantitatively absolute.
However, qualitative evidence for the algorithm's overall success is provided by the facts that (1) we were able to use the algorithm to identify a significant number of individuals from different species and (2) our approach provides more accurate rankings than standard instance recognition techniques.
We therefore conclude that the algorithm is effective at identifying medium- to high-quality images of animals with distinguishing patterns when the images are taken from similar viewpoints.

While we have demonstrated that the ranking algorithm accurately ranks correct matches when they exist, there are several limitations to this approach.
\begin{enumln}
\item All results must be manually verified, which can be time-consuming.
\item There is no mechanism for recovering from errors once they occur.
\item There is no mechanism to determine when identification is complete.
\end{enumln}
In the following chapters we seek to address these issues.
In \Cref{chap:pairclf} we introduce an algorithm that makes automatic decisions based on the results of the ranking algorithm, and in \cref{chap:graphid} we introduce a graph-based framework that determines identification confidence and introduces error recovery mechanisms.