\begin{comment} ./texfix.py --fixcap --dryrun ./texfix.py --findcite --unused-important ./texfix.py --findcite --close-keys fixtex --fpaths chapter2-related-work.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_chapter2-related-work.md \end{comment} \chapter{RELATED WORK}\label{chap:relatedwork} To address the individual animal identification problem we draw upon related research in % feature detection~\cite{mikolajczyk_comparison_2005,tuytelaars_local_2008, perdoch_efficient_2009}, % feature description~\cite{lowe_distinctive_2004, mikolajczyk_performance_2005,simonyan_learning_2014, winder_picking_2009,zagoruyko_learning_2015,han_matchnet_2015}, % approximate nearest neighbor search~\cite{silpa_anan_optimised_2008, muja_fast_2009}, % instance recognition~\cite{sivic_efficient_2009,nister_scalable_2006, philbin_object_2007,jegou_hamming_2008,bo_efficient_2009, jegou_aggregating_2012, tolias_aggregate_2013}, % face verification~\cite{chopra_learning_2005,huang_labeled_2007,berg_tom_vs_pete_2012, chen_blessing_2013,taigman_deepface_2014} % fine-grained recognition~\cite{parkhi_cats_2012,berg_poof_2013, gavves_local_2014}, % category recognition~\cite{lazebnik_beyond_2006,zhang_local_2006, mccann_local_2012,boiman_defense_2008,sanchez_compressed_2013}, % and convolutional neural networks~\cite{krizhevsky_imagenet_2012, razavian_cnn_2014, zagoruyko_learning_2015,han_matchnet_2015,arandjelovic_netvlad_2016}. At a high level, the main questions addressed by the aforementioned research can be summarized as: How should image features be detected? How should detected image features be represented? How much invariance should features have? Should image features be quantized? How should image features be matched? How should feature matches be aggregated? How should feature matches be scored? How should all of this be done accurately? How should all of this be done efficiently? Answers to these questions address many of the challenges to animal identification previously introduced in~\cref{sec:challenges}. This chapter summarizes literature relevant to addressing these questions in the context of animal identification. The outline of this chapter is as follows: \Cref{sec:featuredetect} will discuss feature detection. \Cref{sec:featuredescribe} will discuss feature description. \Cref{sec:ann} will discuss approximate nearest neighbor algorithms. \Cref{sec:ir,sec:cr,sec:fgr}, will discuss approaches to recognition. \Cref{sec:dcnn} will discuss convolutional neural networks. \section{IMAGE FEATURE DETECTION}\label{sec:featuredetect} Before an image can be analyzed, it must be broken down into smaller components. An image's visual appearance can be captured using a combination of local image patterns --- \glossterm{patch-based features}. The most informative patch-based features are typically centered on simple image structures such as junctions, corners, edges, and blobs~\cite{tuytelaars_local_2008}. If these features can be reliably detected, localized, and described then image matching can be posed as a problem in matching sets of features. This section describes work related to detecting features in an image, and~\cref{sec:featuredescribe} will discuss how detected features are then described. The region where a feature is detected is called a \glossterm{keypoint}. The simplest definition of a keypoint is just an $xy$-location in an image. However, images contain information at multiple scales; therefore a keypoint is typically associated with a scale. 
The scale of a keypoint is a non-negative real number that defines the level of
detail at which to interpret the underlying image information.
A keypoint with a scale can be thought of as a circular region with a radius
that is the scale multiplied by some constant
%
(\eg{} $3\sqrt{3}$ is the constant used to determine a keypoint's radius
in~\cite{perdoch_efficient_2009}).
To account for changes in viewpoint and pose, it is also common to augment
features with an orientation and shape.
Adding these properties is said to add invariance to the feature.
Invariant features can provide similar descriptions of the same semantic image
region under different viewing conditions.
However, adding invariance can cause features to lose distinguishing
information~\cite{mikolajczyk_comparison_2005, tuytelaars_local_2008,
perdoch_efficient_2009, lowe_distinctive_2004}.

Many detectors have been developed to detect patch-based feature
keypoints~\cite{mikolajczyk_comparison_2005, tuytelaars_local_2008}.
Algorithms such as Harris, SUSAN, and FAST detect
corners~\cite{harris_combined_1988, mikolajczyk_indexing_2001,
smith_susannew_1997, rosten_machine_2006}.
Blobs and corners can be detected with
Hessian~\cite{beaudet_rotationally_1978, lindeberg_shape_adapted_1994} or
difference of Gaussians~\cite{gaussier_neural_1992, lowe_distinctive_2004}
detectors.
There are also region-based detectors: maximally stable extremal
regions~\cite{matas_robust_2004}, saliency-based
methods~\cite{buoncompagni_saliency_based_2015}, and superpixel-based
methods~\cite{ren_learning_2003, mori_recovering_2004}.
Some applications choose to skip keypoint detection and use a uniform grid of
dense features~\cite{liu_sift_2008, revaud_deep_2015, iscen_comparison_2015}.
Other applications, such as face recognition, use specialized keypoint
detectors~\cite{dantone_real_time_2012, berg_tom_vs_pete_2012}.
There currently exists no principled method for selecting the appropriate
feature detector; different feature detectors perform differently depending on
the application~\cite{tuytelaars_local_2008}.

This section describes the representation of an image over multiple scales,
the detection of features to sub-pixel and sub-scale accuracy, and the
adaptation of features to specify orientation and shape.
We focus on the Hessian-based keypoint detector because it has been
experimentally shown to be a reliable choice for instance
recognition~\cite{tuytelaars_local_2008}.

\subsection{Scale-space}

Scale-space theory describes image features as existing at multiple
scales~\cite{lindeberg_scale_space_1993}.
The same point on an object appears quite different when seen up close
compared to when it is seen at a distance.
For example, a zebra in the distance may appear to have two stripes that are
connected, but when the animal is seen up close it becomes clear that the
stripes are actually disconnected.
%This problem is addressed by detecting features at multiple scales.
Multi-scale detection is formalized by the theory of scale-space, which
parameterizes a continuous signal, $f$, with a scale, $\scale$.
The original signal is said to exist at scale $0$.
Convolving the original signal with a Gaussian kernel produces coarser scales.
Let $f$ be a continuous $2$-dimensional signal that defines an image.
Let vector $\pt=\ptcolvec$ be a location in the image.
The function $g(\scale)$
%= \frac{1}{\TAU\scale^2} \exp{-(\vec{i} \cdot \vec{i})/2\scale^2}$
is the isotropic 2D Gaussian kernel.
The scale-space representation of a continuous image (for any non-zero scale)
takes the form: $\img(\pt, \scale) = g(\scale) \conv f(\pt)$, where $\conv$ is
the convolution operator.
However, we do not have access to a continuous representation of an image.
Therefore, in practice, the continuous Gaussian kernel is replaced with the
discrete Gaussian kernel.
This can be efficiently implemented as a discrete convolution with the
$1$-dimensional discrete Gaussian kernel in the $x$-direction and then in the
$y$-direction, because the discrete Gaussian kernel is separable in orthogonal
directions~\cite{lindeberg_scale_space_1993}.
Using this definition of an image at a single scale, the next step is to
represent an image at multiple scales.

\subsubsection{Gaussian pyramid}
\newcommand{\downsamp}[2]{#1[::\tightpad#2,::\tightpad#2]}
% See page 39 of Scale-space theory.
%The continuous image signal is defined to be the zeroth scale $\img(\pt, 0) = \rawimg(\pt)$.

The discrete scale-space representation of an image is efficiently implemented
using a Gaussian pyramid.
A scale-space pyramid consists of $L$ levels, and each level covers an octave.
Starting from the base of a level with scale parameter $\sigma$, the next
octave is reached when $\sigma$ doubles.
There are $s$ intervals represented within each octave.
A Gaussian pyramid is illustrated in~\cref{fig:ScaleSpaceFigure}.
\ScaleSpaceFigure{}

The pyramid's base,
%
$\img(\pt, 1) = g( 1 ) \conv \rawimg(\pt)$,
%
is the $\ell=0\th{}$ level of the pyramid, and is computed by blurring the
original image (which sometimes has a small initial blur) with $\sigma=1$.
Subsequent levels of the pyramid are produced by doubling $\sigma$; thus, the
$\ell\th{}$ level of the pyramid is $\img(\pt, 2^\ell)$.
A property of discrete scale-space is that, after appropriate smoothing,
downsampling the image by half is equivalent to doubling $\sigma$.
Let
% ---
%$\img_\ell(\pt) = \downsamp{\rawimg}{2^{\ell}}({\pt} / {2^{\ell}})$
$\img_\ell(\pt)$
%= \downsamp{\rawimg}{2^{\ell}}({\pt} / {2^{\ell}})$
% ---
denote the raw image downsampled by a factor of $2^{\ell}$ using Lanczos
resampling.
Now, each level of the pyramid can be written as
%
%$\img(\pt, 2^\ell) = g( 1 ) \conv \rawimg^\ell$.
$\img(\pt, 2^\ell) = g( 1 ) \conv \img_\ell(\pt)$.
Given the raw image at level $\ell$, the scale corresponding to $\sigma$ can be
written as a relative scale
% ---
$\sigma_\ell = \sigma / 2^\ell$.
% ---
Thus, a discrete image at any scale can be efficiently computed as:
\begin{equation}
\img(\pt, \sigma) = g(\sigma_\ell) \conv \img_\ell(\pt)
\end{equation}
Discrete convolution is applied using a window of size
$\floor{6\sigma_\ell + 1} + (1 - (\modfn{\floor{6\sigma_\ell + 1}}{2}))$.
Interpolation between discrete values of $\pt$ is used to sample intensity at
sub-pixel accuracy.

A scale between two levels of the pyramid is called an interval.
Typically, $s$ intervals --- with relative scales
$2^{0/s}, 2^{1/s}, \ldots, 2^{s/s}$ --- are computed to represent the octave
between levels $\ell$ and $\ell + 1$.
If differences between scales are needed, then the scales $2^{-1/s}$ and
$2^{1 + 1/s}$ are also computed~\cite{lowe_distinctive_2004}.
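
To make the pyramid construction above concrete, the following sketch (a
minimal illustration under simplifying assumptions, not the implementation
used in this work) computes an image at an arbitrary scale $\sigma$ by
selecting a pyramid level $\ell$, downsampling, and smoothing with the
relative scale $\sigma_\ell = \sigma / 2^\ell$; plain striding stands in for
Lanczos resampling, and \texttt{numpy} and \texttt{scipy} are assumed to be
available.
\begin{verbatim}
# Minimal Gaussian-pyramid sketch: image at scale sigma via downsampling
# by 2**ell and smoothing with the relative scale sigma / 2**ell.
import numpy as np
from scipy.ndimage import gaussian_filter

def image_at_scale(raw, sigma):
    """Approximate I(p, sigma) using the pyramid relation described above."""
    ell = max(int(np.floor(np.log2(sigma))), 0)   # pyramid level
    downsampled = raw[::2**ell, ::2**ell]         # stand-in for Lanczos
    sigma_rel = sigma / (2.0 ** ell)              # relative scale in level
    return gaussian_filter(downsampled, sigma_rel)

if __name__ == "__main__":
    img = np.random.rand(256, 256)
    print(image_at_scale(img, 4.0).shape)         # (64, 64) at level ell=2
\end{verbatim}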
\subsection{Hessian keypoint detection}

Hessian-based keypoint detection searches for extrema of the Hessian operator
in both space and scale~\cite{beaudet_rotationally_1978,
lindeberg_shape_adapted_1994}.
The Hessian detector can qualitatively be viewed as a blob detector, but it
also detects corners, which may appear as blobs in
scale-space~\cite{tuytelaars_local_2008}.
The Hessian keypoint detector computes a response value for each point in
scale-space indicating how blob-like that point is.
The extrema of this response define a set of Hessian keypoints.
Post-processing removes non-robust keypoints and localizes the remaining
keypoints to sub-pixel and sub-scale accuracy.

\subsubsection{Hessian response}

Let subscripts denote the partial derivatives of the image intensity
(\eg{} $\img_{x}$ is the first $x$ derivative, $\img_{xx}$ is the second $x$
derivative, and $\img_{xy}$ is the first derivative in both $x$ and $y$).
The Hessian is a matrix of second-order partial derivatives and is defined at
each point in scale-space.
\begin{equation}\label{eqn:hessianmatrix}
\hessMAT = \BIGMAT{
\img_{xx}(\pt, \scale) & \img_{xy}(\pt, \scale) \\
\img_{xy}(\pt, \scale) & \img_{yy}(\pt, \scale) }
\end{equation}
%(derivatives are computed in scale-space over an integration scale
%$\scale_I$).
The initial response of the detector at each point is the determinant of the
Hessian matrix.
This response is computed for every level and every pixel in the scale-space
pyramid.
At coarser scales the Hessian response weakens, so to ensure that responses
between scales are comparable, the initial response is scale-normalized by
multiplying by $\sigma^2$.
(See~\cite{lindeberg_feature_1998} for more details about the choice of this
normalization factor.)
The extrema of this response define a set of candidate keypoints, $\kpts'$.
\begin{equation}
\kpts' = \argextrema{\pt,\scale} \paren{\scale^2 \detfn{\hessMAT}}
\end{equation}
%\devcomment{Is this 2D or 3D extrema detection. What would need to change in my math if it was 3D?}
A point in this 3D space is a maximum/minimum if its scale-normalized value is
greater/less than the scale-normalized values of all its neighbors in the
pyramid --- \ie{} the $8$ neighbors in its current interval, its $9$ neighbors
in the next interval, and its $9$ neighbors in the previous interval.

\subsubsection{Edge filtering}

Edge responses are not robust --- \ie{} the same point cannot be localized
reliably in two views of the same scene --- due to their elongated nature.
Because of this, extrema that appear too edge-like are filtered using a
threshold $t_{\tt{edge}}$, which is compared to the ratio of the Hessian's
squared trace to its determinant.
\begin{equation}
%\kpts = \{(\pt, \scale) \in \kpts' \where
%   \frac{\trfn{\hessMAT}^2}{\detfn{\hessMAT}} > t_{\tt{edge}}\}
\kpts = \left\{(\pt, \scale) \in \kpts' \bigwhere
\frac{\trfn{\hessMAT}^2}{\detfn{\hessMAT}} < t_{\tt{edge}}\right\}
\end{equation}

\subsubsection{Sub-pixel and sub-scale localization}

To compensate for the discrete nature of pixel images, each keypoint detection
is localized to sub-pixel and sub-scale accuracy.
The importance of feature localization is demonstrated
in~\cite{ke_pca_sift_2004}, where descriptors were computed on the normalized
vectors of patch gradients using only principal component analysis
(PCA)~\cite{jolliffe_principal_1986}.
Despite the simplicity of the descriptors, the authors were still able to
effectively match two images due to the robust localization of the features.

Sub-pixel and sub-scale localization transforms a keypoint $\kp_0$ into
$\kp^*$ using an iterative process.
At each iteration $i$, a $2\nd{}$ order Taylor expansion, centered at
%
$\kp_i = (\pt_i, \scale_i)$,
approximates the scale-normalized Hessian response:
%
$T_i(\pt, \scale) \approx \scale^2 \detfn{\hessMAT}$.
The keypoint is updated to the position of the maximum response of the Taylor
expansion: $\kp_{i + 1} = \argmax{\kp} T_i(\kp)$.
This process iterates until convergence.
If the process does not converge before a threshold number of iterations, the
keypoint is deemed not robust and thrown out.
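
The following sketch (a simplified, single-scale illustration, not the
detector used in this work) shows how the scale-normalized
determinant-of-Hessian response and the trace-based edge filter described
above might be computed with \texttt{numpy} and \texttt{scipy}; a full
detector would also search across scales and apply the sub-pixel and
sub-scale refinement.
\begin{verbatim}
# Single-scale sketch of the Hessian response and edge filtering.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def hessian_candidates(img, sigma=2.0, t_edge=10.0):
    smooth = gaussian_filter(img, sigma)
    g0, g1 = np.gradient(smooth)            # derivatives along image axes
    g00, g01 = np.gradient(g0)
    _,   g11 = np.gradient(g1)
    det = g00 * g11 - g01 ** 2
    trace = g00 + g11
    response = (sigma ** 2) * det           # scale-normalized response
    # Keep spatial local maxima of the response (scale search omitted).
    is_peak = (response == maximum_filter(response, size=3)) & (response > 0)
    # Edge filter: discard elongated extrema with a large trace^2/det ratio.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = trace ** 2 / det
    return np.argwhere(is_peak & (ratio < t_edge))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(hessian_candidates(rng.random((128, 128))).shape)
\end{verbatim}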
\subsection{Affine adaptation}

So far, the keypoints we have described correspond to circular regions where
the pixel radius is some multiple of the scale.
To account for small affine changes seen in non-planar objects (like zebras),
the shape of each circular keypoint is adapted into an ellipse.
An affine shape $\vmat=\vMATRIX$ is estimated (as a lower triangular matrix)
for each keypoint using an iterative technique involving the second moment
matrix~\cite{lindeberg_shape_adapted_1997,baumberg_reliable_2000,mikolajczyk_comparison_2005}.
The affine shape matrix transforms an ellipse into a unit circle.
Note that because the matrix is lower triangular, one of its eigenvectors
points in the downward direction.
Thus, the shape has no influence on the orientation of the keypoint.
For each point in scale-space, the second moment matrix is evaluated as:
\begin{equation}\label{eqn:secondmoment}
\momentmat \tighteq \MAT{
\img_x^2(\pt, \scale) & \img_x(\pt, \scale) \img_y(\pt, \scale) \\
\img_x(\pt, \scale) \img_y(\pt, \scale) & \img_y^2(\pt, \scale) }
\end{equation}

The goal is to ``stabilize'' each keypoint shape by searching for the
transformation, $\vmat^*$, that causes the second moment matrix to have equal
eigenvalues.
For each keypoint, its elliptical shape is initialized as a circle
$\vmat_0=\eyetwo$.
For each iteration $i$:
\begin{enumln}
\item Compute the second moment matrix, $\warpedmomentmat{\vmat_i}$, at the
warped image patch.
\item Check if the keypoint shape is stable.
A keypoint shape is stable if the eigenvalue ratio of the second moment matrix
is close to $1$.
If the keypoint has been stable for two consecutive iterations, then accept
$\vmat^* \leftarrow \vmat_{i}$ and stop iterating.
Otherwise, if the number of iterations, $i$, is greater than some threshold,
then stop and discard the keypoint.
\item Update the affine shape using the rule
% ---
$\vmat_{i + 1} = \sqrtm{\warpedmomentmat{\vmat_i}} \vmat_i$.
\end{enumln}

The matrix $\vmat$ only defines the transformation from an ellipse to a
circle.
The standard representation of an ellipse is a conic of the form
$\mat{E} = \vmat^T\vmat$.
This means that $\vmat$ is only defined up to an arbitrary
rotation~\cite{mikolajczyk_comparison_2005,perdoch_efficient_2009}.
Thus, we can freely rotate $\vmat$ into a lower triangular matrix.
This ensures that one of its eigenvectors points downwards --- \ie{} in the
direction of the ``gravity vector''~\cite{perdoch_efficient_2009}.
Making use of the gravity vector removes a dimension of invariance.
To allow for the specification of keypoint orientation, the keypoint
representation can be extended with a parameter $\theta$ that defaults to $0$.

\subsection{Orientation adaptation}

The keypoint orientation is defined using the parameter $\theta$.
By default, the orientation of a keypoint can be assumed to be aligned with
the ``gravity vector'' --- \ie{} $\theta=0$~\cite{perdoch_efficient_2009}.
Otherwise, an orientation must be computed.
A common method for determining a keypoint's orientation is to use the
dominant gradient orientation.
In theory, adapting the orientation to match the dominant gradient makes the
computed keypoint description invariant to rotation.
To compute a keypoint's dominant orientation, the pixels around the keypoint
vote into a fine-binned orientation histogram~\cite{lowe_distinctive_2004}.
A pixel's vote is weighted by its gradient magnitude multiplied by a Gaussian
weighting of its distance to the keypoint center.
The dominant orientation
%
$\ori \in \rangeinex{0,\TAU}$
is chosen as the peak of this histogram.
If there is more than a single peak, it is common to create a copy of the
keypoint for each maximum in this histogram.
This process is illustrated in~\cref{fig:testfindkpdirection}.
\testfindkpdirection{}
\FloatBarrier{}

\subsection{Discussion --- detector and invariance choices}

To identify individual animals, features must be detected in distinguishing
areas of an animal.
For a feature to be useful, it must be detected in multiple images of the same
individual despite variations in viewpoint, pose, lighting, and quality.
In our baseline algorithm we choose to use a Hessian-based
detector~\cite{perdoch_efficient_2009, lindeberg_feature_1998} because it
generally produces a large number of features and has been experimentally
shown to be repeatable, accurate, and adaptable to multiple degrees of
invariance~\cite{tuytelaars_local_2008}.

Once a keypoint is detected, it is described using a keypoint description
algorithm.
It is desirable for a keypoint description to be invariant to small changes in
viewpoint, pose, and lighting.
Accurate localization of a keypoint in scale and space helps to ensure that
similar images contain similar features.
Sometimes, it is beneficial to further localize a keypoint in shape and
orientation, thus adding invariance to the feature.
However, if too much invariance is used, it may not be possible to distinguish
between semantically different features.
It is a challenge to choose the correct level of invariance when computing
features.
Often an application chooses one of two extremes.
Consider the computation of keypoint orientation.
Standard methods for orientation invariance assume patches can freely rotate,
when in fact they may be constrained to be consistent with the orientation of
surrounding patches~\cite{lowe_distinctive_2004}.
At the other extreme is the ``gravity vector''~\cite{perdoch_efficient_2009},
which globally enforces all keypoints to have a downward orientation.
This may be a safe assumption when working with features from images of rigid
objects taken in an upright position.
It may not be correct when dealing with non-rigid objects like zebras.
%Orientation invariance assumes that orientation is a local property
% of a patch, but the orientation of a patch is usually not
% independent of its surrounding patches on an object.
In our experiments in~\cref{sub:exptinvar}, we test different degrees of
invariance.
This test includes a novel method that achieves a middle ground between full
orientation invariance and the gravity vector.

\section{IMAGE FEATURE DESCRIPTION}\label{sec:featuredescribe}

Once each feature has been localized, its visual appearance must be described
before it can be matched.
The goal of feature description is to encode raw image data into a vector ---
\ie{} a \glossterm{descriptor}.
To represent the visual appearance of a keypoint, a descriptor vector should
have the following properties: (1) two visually similar patches produce
vectors separated by a small metric distance, and (2) two visually dissimilar
patches produce vectors separated by a large distance.
Constructing such a descriptor vector has been a core problem throughout the
history of computer vision.
The first texture descriptor robust to small image transformations was the
scale invariant feature transform (SIFT) descriptor, first published in
1999~\cite{lowe_object_1999, lowe_distinctive_2004}.
Since then, other hand-crafted algorithms have been proposed.
However, their results have been at best comparable to those of SIFT, and SIFT
remains an effective and widely used hand-designed
descriptor~\cite{mikolajczyk_performance_2005, calonder_brief_2010,
bay_surf_2006, leutenegger_brisk_2011, alahi_freak_2012,
jegou_triangulation_2014}.
A promising direction for outperforming the SIFT descriptor is descriptor
learning~\cite{simonyan_descriptor_2012, simonyan_learning_2014,
winder_picking_2009}; specifically, descriptor learning using deep neural
networks~\cite{razavian_cnn_2014, bengio_representation_2013,
russakovsky_imagenet_2015}.
This section first describes the basic SIFT algorithm and then provides an
overview of alternatives to SIFT{} that have been proposed.
Work related to learning descriptor vectors using deep neural networks is
discussed later in~\cref{sec:dcnn}.

\subsection{SIFT}

The {SIFT descriptor} is a $128$-dimensional vector that summarizes the
spatial distribution of the gradient orientations in an image
patch~\cite{lowe_distinctive_2004}.
To describe a keypoint with a SIFT descriptor, the keypoint's image data is
warped, using the keypoint's affine transform, from the scale-space gradient
image into a normalized reference frame (typically $41 \times 41$ pixels).
For a descriptor to be useful in matching, it is important that the keypoint
is properly localized before the descriptor is
computed~\cite{ke_pca_sift_2004}.
Because it is not always possible to perfectly localize a keypoint, the SIFT
descriptor aggregates information into a soft histogram.
Allowing data to contribute to multiple bins helps the SIFT descriptor to be
robust to small localization errors and viewpoint variations.
Distance between two SIFT descriptors is typically computed using the
Euclidean distance.
The SIFT descriptor of a patch is visualized in~\cref{fig:vizfeatrow}.

The structure of a SIFT descriptor is as follows:
A $4\times4$ regular grid is superimposed over the normalized patch.
Each of the $16$ spatial grid cells contains an orientation histogram
discretized into $8$ bins.
The SIFT descriptor is the concatenation of all orientation histograms,
resulting in a single
%
$16 \times 8 = 128$ dimensional vector.

The patch information populates the SIFT descriptor as follows:
For every pixel, the patch gradient (the derivative in the $x$ and $y$
direction) is computed.
Next, the gradient magnitude and orientation of each pixel are computed.
Each pixel then casts a weighted vote.
The bin that a pixel votes into is computed from its $xy$-location and
gradient orientation.
The weight of a pixel's vote is based on its gradient magnitude and
Gaussian-weighted distance to the patch center.
To be robust to small localization errors, a pixel's vote is split via
trilinear interpolation ($x$-location, $y$-location, and orientation) into the
orientation histograms of the pixel's nearest grid cells as well as
neighboring orientation bins in each grid cell's orientation histogram.

Once voting is completed, the SIFT descriptor is normalized to account for
lighting differences between images.
First, the vector is L2-normalized to unit length, which makes the descriptor
invariant to linear changes in intensity.
Then, a heuristic --- that truncates each dimension to a maximum value of
$0.2$ --- is applied to increase robustness to non-linear changes in
illumination.
Finally, the vector is renormalized.
For storage considerations, the resulting $512$-byte floating-point (float32)
descriptor is typically cast as an array of unsigned $8$-bit integers (uint8),
resulting in a $128$-byte descriptor.
To reduce the impact of this quantization, a trick is to multiply by $512$
instead of $255$ and then truncate values to $255$ before converting from a
float to a uint8.
Even though each component is $8$ bits and therefore can only store a maximum
value of $255$, value overflow is not likely to occur because of truncation,
L2-normalization, and properties of natural images.
\vizfeatrow{}
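
The normalization and quantization heuristics just described can be
illustrated with a short sketch (a minimal example assuming a raw $128$-bin
histogram is already available; it is not Lowe's reference implementation):
\begin{verbatim}
# Sketch of SIFT descriptor normalization and uint8 quantization.
import numpy as np

def normalize_sift(vec, clip=0.2):
    vec = vec / max(np.linalg.norm(vec), 1e-12)   # L2-normalize
    vec = np.minimum(vec, clip)                   # truncate large bins
    return vec / max(np.linalg.norm(vec), 1e-12)  # renormalize

def quantize_sift(vec):
    # Multiply by 512 (rather than 255) and truncate before casting.
    return np.minimum(512.0 * vec, 255.0).astype(np.uint8)

if __name__ == "__main__":
    raw = np.abs(np.random.randn(128)).astype(np.float32)
    desc = quantize_sift(normalize_sift(raw))
    print(desc.dtype, desc.shape)                 # uint8 (128,)
\end{verbatim}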
\subsection{Other descriptors and SIFT extensions}

Even more than a decade after its original publication, SIFT remains a popular
descriptor for patch-based matching because it is versatile, unsupervised,
widely available, and easy to use.
The principles used to guide the construction of the SIFT descriptor ---
particularly the use of aggregated gradients --- have inspired many variants,
extensions, and new techniques~\cite{mikolajczyk_performance_2005,
dalal_histograms_2005, bay_surf_2006}.
Hand-crafted alternatives to SIFT have been developed that are faster to
compute and more efficient to store, but these alternatives do not
significantly outperform SIFT's matching accuracy on general
data~\cite{lowe_distinctive_2004, mikolajczyk_performance_2005,
alahi_freak_2012}.
This subsection provides a brief overview of these alternatives.

% ALTERNATIVES FOR DETECTION
The use of aggregated gradient information in SIFT has been adapted for use in
other computer vision problems such as detection and scene classification.
% GIST
The GIST descriptor is a low-dimensional descriptor used for scene
classification that coarsely summarizes the appearance of an entire
image~\cite{oliva_modeling_2001, douze_evaluation_2009}.
% HOG
The histogram of oriented gradients (HoG) descriptor is a high-dimensional
descriptor used in detection.
The HoG descriptor describes the shapes of objects in an
image~\cite{dalal_histograms_2005}.
Like the SIFT descriptor, the HoG descriptor illustrates the value of
gradient-based image descriptions and has inspired extensions such as the
discriminatively trained parts model~\cite{felzenszwalb_object_2010}.

As a general single-scale patch-based descriptor, SIFT has not been
significantly outperformed in matching accuracy on general datasets.
% GLOH
One attempt at an improved general descriptor is the gradient
location-orientation histogram (GLOH)
descriptor~\cite{mikolajczyk_performance_2005}.
GLOH uses a similar structure to SIFT but replaces the rectangular bins with
log-polar bins.
GLOH did achieve higher matching accuracy on some datasets, but not by a
significant margin.

Despite the lack of generic success, hand-crafted SIFT variants have been
successful when applied to specific tasks.
% COLORED SIFT
Colored SIFT variants such as opponent-SIFT are valuable in category
recognition tasks, where a color difference could be the distinguishing factor
between categories~\cite{van_de_sande_evaluating_2010}.
% SCALE-LESS SIFT
Combining multiple SIFT descriptors over different scales has also shown
moderate improvements.
The scale-less SIFT descriptor combines SIFT descriptors computed at multiple
scales into a single descriptor.
It has been shown to produce more accurate dense correspondences than
representing each scale with an individual
descriptor~\cite{hassner_sifts_2012}.
%Despite
%The Rotation Invariant Feature Transform (RIFT) descriptor~\cite{lazebnik_sparse_2005} uses concentric
% circles to make a similar modification.
%The RIFT descriptor are used in texture classification~\cite{lazebnik_sparse_2005}.

% SURF
Efficiency is one area where SIFT has been significantly outperformed.
Speeded up robust features (SURF) is a fast approximation to SIFT, based on
integral images, that achieves similar accuracy using a smaller
$64$-dimensional descriptor~\cite{bay_surf_2006}.
% DAISY
The DAISY descriptor uses a similar binning structure to GLOH, but uses
convolutions with Gaussian kernels to quickly aggregate gradient
histograms~\cite{tola_fast_2008}.
% BINARY PATTERNS
Binary descriptors such as local binary patterns
(LBP)~\cite{ojala_comparative_1996, zhang_local_2010}, local derivative
patterns~\cite{heikkila_description_2009}, and their variants such as
BRIEF~\cite{calonder_brief_2010}, BRISK~\cite{leutenegger_brisk_2011}, and
FREAK~\cite{alahi_freak_2012} also quickly compute compact, distinctive
descriptors.
Binary descriptors are built using multiple pairwise comparisons of average
image intensity at predetermined locations.
This results in a small descriptor that effectively represents aggregated
gradient information.

Machine learning is able to outperform the matching accuracy of SIFT; however,
these techniques require training data to adapt to each new problem domain.
Learned descriptors make use of the same aggregated gradient information used
in the construction of SIFT descriptors.
The Liberty, Yosemite, and Notre Dame buildings datasets are standard datasets
for descriptor learning~\cite{brown_discriminative_2011}.
Error on these datasets is measured using the false positive rate at
$95\percent$ recall (FPR95).
The baseline SIFT error on these datasets is $27.02\percent$.
The configuration of a DAISY descriptor is learned
in~\cite{winder_picking_2009} and achieves an error of $15.16\percent$ on the
buildings datasets.
In~\cite{simonyan_learning_2014}, large-scale non-convex optimization is used
to learn a spatial pooling configuration of log-polar bins, a dimensionality
reduction matrix, and a distance metric to further reduce the FPR95 error to
$10.98\percent$.
The current state-of-the-art error of $4.56\percent$ on the buildings datasets
is achieved using a convolutional neural
network~\cite{zagoruyko_learning_2015}.

\subsection{Discussion --- descriptor choices}

In our application, we use SIFT~\cite{lowe_distinctive_2004} as our baseline
descriptor because it is one of the most widely used and well-known
descriptors.
SIFT describes image patches in such a way that small localization errors do
not significantly impact the resulting representation.
Exploration of alternative convolutional descriptors is discussed later
in~\cref{sec:dcnn}.

\section{APPROXIMATE NEAREST NEIGHBOR SEARCH}\label{sec:ann}

In computer vision applications it is often necessary to search a database of
high-dimensional vectors~\cite{shakhnarovich_nearest_neighbor_2006,
datar_locality_sensitive_2004, muja_fast_2009, kulis_kernelized_2012,
weiss_spectral_2009}.
Patch descriptor vectors like SIFT are constructed such that the distance
(under some metric) between vectors is small for matching patches and large
for non-matching patches.
Thus, finding matching descriptor vectors is often framed as a nearest neighbor search problem~\cite{lowe_distinctive_2004}. It becomes prohibitively expensive to perform exact nearest neighbor search as the size of the database increases. Therefore, approximate algorithms --- which can trade off a small amount of accuracy for substantial speed-ups --- can be used instead. \subsection{Kd-tree} A \glossterm{kd-tree} is a data structure used to index high dimensional vectors for fast approximate nearest neighbor search~\cite{bentley_multidimensional_1975}. A kd-tree is an extension of a binary tree to multiple dimensions. Each non-leaf node of the tree is assigned a dimension and threshold value. The node splits data vectors between the left and right children by comparing the value of the data vector at the assigned dimension to the assigned threshold. \subsubsection{Building a kd-tree index} Indexing a set of vectors involves first choosing a dimension and threshold to split the data into two partitions. Then this procedure is recursively applied to each of the partitions. A common measure for choosing the dimension is to choose the dimension with the greatest variance in the data. The threshold is then selected as the median value of the chosen dimension. \subsubsection{Augmenting a kd-tree index} It is possible to augment an existing kd-tree by adding and removing vectors. Addition of a vector to a kd-tree is performed by appending the point to its assigned leaf. Removal of points from a kd-tree is done using lazy deletion --- \ie{} by masking the removed data. To avoid tree imbalance, a kd-tree is re-indexed after the number of points added or removed passes a threshold. Any masked point is deleted whenever the tree is re-indexed. \subsubsection{Searching a kd-tree index} Searching for a query point's exact nearest neighbor in a kd-tree has been shown to take expected logarithmic time for low ($k < 16$) dimensional data~\cite{friedman_algorithm_1977}. However, for higher dimensional data this same method takes nearly linear time~\cite{sproull_refinements_1991}. This is because a query point and its nearest neighbor might be on opposite sides of a partition. Therefore, searching for nearest neighbors is typically done by approximate search using a priority queue~\cite{beis_shape_1997}. A priority queue orders nodes to further search based on their distance to the query vector. The search returns the best result after a threshold number of checks have been made. % $ ~\cite{beis_shape_1997}. Search accuracy is improved by using multiple randomized kd-trees~\cite{silpa_anan_optimised_2008}. If a single kd-tree has a probability of failure $p$, then $m$ independently constructed trees have a $p^m$ probability of failure. For each kd-tree a random Householder matrix is used to efficiently rotate the data. Using a random rotation preserves distances between rotated vectors but does not preserve the dimension of maximum variance. This means that each of the $m$ kd-trees yields a different partitioning of the data, although it is not guaranteed to be independent. When searching multiple random kd-trees, a single priority queue keeps track of the next nearest bin boundaries to search over all the trees. \subsection{Hierarchical k-means} Another tree-based method for approximate nearest neighbor search is the hierarchical k-means. Each level in the hierarchical k-means tree partitions the data using the k-means algorithm~\cite{lloyd_least_1982} with a small value of $k$ (\eg{} 3). 
To query a new point, the tree is traversed from the root, descending at each
level into the bin of the closest centroid, until a leaf node is reached.
Hierarchical k-means was one of the first techniques used to define a visual
vocabulary~\cite{nister_scalable_2006} --- a structure used for indexing and
quantizing large numbers of descriptors.

\subsection{Locality sensitive hashing}

A hashing-based method for approximate nearest neighbor search is
locality-sensitive hashing (LSH).
This method is able to search a dataset of vectors for approximate nearest
neighbors in sub-linear time~\cite{charikar_similarity_2002,
datar_locality_sensitive_2004, kulis_fast_2009, kulis_kernelized_2012,
tao_locality_2013}.
LSH trades off a small amount of accuracy for a large query speed-up.
A database is indexed using $M$ hash tables.
Each hash table uses its own randomly selected hash function.
For each hash table, the query vector's hash is computed, and the database
vectors it collides with are added to a shortlist.
The shortlist is sorted by distance and returned as the approximate nearest
neighbors.

\subsection{FLANN}

The fast library for approximate nearest neighbors (FLANN) is a software
package built to quickly index and search high-dimensional
vectors~\cite{muja_fast_2009}.
The FLANN package implements efficient algorithms for hierarchical k-means,
kd-trees, and LSH{}.
It also implements a hybrid between the k-means tree and the kd-tree, as well
as configuration optimization, to select the combination of algorithms that
best reaches the desired speed/accuracy trade-off for a given dataset.
Configuration optimization is performed using the Nelder-Mead downhill simplex
method~\cite{nelder_simplex_1965} with cross-validation.

\subsection{Product quantization}

Product quantization is a method for speeding up approximate nearest neighbor
search of a set of high-dimensional
vectors~\cite{jegou_product_2011,ge_optimized_2013}.
Each vector is split up into a set of sub-vector components.
For each component, the sub-vectors are separately quantized using a
codebook/dictionary/vocabulary.
The pairwise squared distances between centroids in the vocabulary are stored
in a lookup table.
To compare two vectors, first each vector is split into sub-vectors, next the
sub-vectors are quantized, and then the squared distances between quantized
sub-vectors are read from the lookup table.
The approximated squared distance between the two vectors is the sum of the
squared distances between their quantized sub-vectors.

\subsection{Discussion --- choice of approximate nearest neighbor algorithm}

In our single annotation identification algorithm, a query descriptor searches
for its nearest neighbor in a database containing all descriptors from all
\exemplars{}.
Each annotation contains \OnTheOrderOf{4} features, which are described with
$128$-component SIFT descriptors.
Exact nearest neighbor search becomes prohibitively expensive when hundreds or
thousands of images are searched.
Thus, we turn towards approximate nearest neighbor algorithms.
In this \thesis{} all of our approximate nearest neighbors are found using the
multiple kd-tree implementation in the FLANN package~\cite{muja_fast_2009}.
Using the configuration optimization built into the FLANN package, we have
found that multiple kd-trees provide more accurate feature matches for our
datasets than those computed by hierarchical k-means trees or LSH{}.
In addition to being fast and accurate, multiple kd-trees support efficient
addition and removal of points, which is needed in a dynamic
setting~\cite{silpa_anan_optimised_2008}.
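
As an illustration of the product quantization scheme described above, the
following sketch (a toy example with randomly generated codebooks, not the
configuration used in this work) approximates a squared distance by summing
precomputed centroid-to-centroid distances over the quantized sub-vectors:
\begin{verbatim}
# Symmetric product-quantization distance sketch.
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 128, 8, 16            # dims, sub-vectors, centroids per codebook
d = D // M                      # dimensions per sub-vector

codebooks = rng.normal(size=(M, K, d))   # one (random) codebook per component
# Lookup tables of pairwise squared distances between codebook centroids.
tables = np.array([((c[:, None, :] - c[None, :, :]) ** 2).sum(-1)
                   for c in codebooks])  # shape (M, K, K)

def quantize(vec):
    """Index of the nearest centroid for each sub-vector."""
    subs = vec.reshape(M, d)
    return np.array([np.argmin(((c - s) ** 2).sum(-1))
                     for c, s in zip(codebooks, subs)])

def pq_sqdist(vec_a, vec_b):
    codes_a, codes_b = quantize(vec_a), quantize(vec_b)
    return sum(tables[m, codes_a[m], codes_b[m]] for m in range(M))

x, y = rng.normal(size=D), rng.normal(size=D)
print(pq_sqdist(x, y), ((x - y) ** 2).sum())   # approximate vs. exact
\end{verbatim}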
\section{INSTANCE RECOGNITION}\label{sec:ir}

There are many variations on the problem of visual recognition, such as
specific object recognition (\eg{} CD-covers)~\cite{lowe_distinctive_2004,
sivic_efficient_2009, nister_scalable_2006},
% --
location recognition~\cite{jegou_hamming_2008,jegou_aggregating_2012,tolias_aggregate_2013},
% --
person re-identification~\cite{shi_embedding_2016,karanam_person_2015,wu_viewpoint_2015},
% --
face verification/recognition~\cite{chopra_learning_2005, huang_labeled_2007,
berg_tom_vs_pete_2012, chen_blessing_2013, taigman_deepface_2014,
schroff_facenet_2015},
% --
category recognition~\cite{lazebnik_beyond_2006,zhang_local_2006,mccann_local_2012,boiman_defense_2008},
% --
and fine-grained recognition~\cite{parkhi_cats_2012,berg_poof_2013,
gavves_local_2014}.
% --
The different types of recognition problems lie on a spectrum of specificity
\wrt{} the objects they attempt to recognize.
On one end of the spectrum, \glossterm{instance recognition} techniques ---
like scene recognition or face verification --- search for matches of the same
exact object.
On the other end of the spectrum, category recognition algorithms --- like
car, bird, dog, and plane detectors --- look for objects of the same type.
Other problems --- like fine-grained recognition, where the goal might be to
recognize a specific breed of dog (\eg{} German shepherd, golden retriever,
boxer, beagle, \ldots{}) --- sit somewhere in the middle.
Animal identification is closest to the instance recognition side of the
spectrum, but the proposed solution draws upon techniques from other forms of
recognition.
The discussion in this section focuses on instance recognition.
The next two sections will discuss category recognition and fine-grained
recognition.

\subsection{Spatial verification}\label{subsec:sverreview}

Before discussing specific techniques in instance recognition, we describe
work related to spatial verification.
Most instance recognition techniques initially match local image features
without using any spatial information~\cite{lowe_distinctive_2004,
sivic_efficient_2009, philbin_object_2007, tolias_image_2015}.
This results in pairs of images with spatially inconsistent feature
correspondences.
Spatially inconsistent matches are illustrated in~\cref{fig:figSVInlier}.
Inconsistent features are removed using \glossterm{spatial verification}, a
process based on the random sample consensus (RANSAC)
algorithm~\cite{fischler_random_1981}.
RANSAC has come to refer to a family of iterative techniques that sample
inliers (data points consistent with some model) from a noisy
dataset~\cite{fischler_random_1981, hartley_multiple_2003, chum_locally_2003,
raguram_usac_2013}.
In the context of spatial verification the model is an affine transformation
matrix, and the dataset is a set of feature
correspondences~\cite{lowe_distinctive_2004, sivic_video_2003,
philbin_object_2007, chum_total_2011, arandjelovic_three_2012}.
At each iteration of RANSAC a small subset of points is sampled from the
original dataset and used to fit a hypothesis model.
All other data points are tested for consistency with the hypothesis model.
A score is assigned to the hypothesis model based on how well the
out-of-sample data fit the model (\eg{} the number of transformed points that
are within a threshold distance of their corresponding feature).
After a certain number of iterations the process stops and returns the
hypothesized model with the highest score as well as the inliers to that
model.
When RANSAC returns a large enough set of inliers (\wrt{} some threshold), the
hypothesis model is generally considered to be a ``good fit''.
In this case a more complex model --- that may be more sensitive to outliers
--- can be fit.
In spatial verification, it is common to use the RANSAC inliers to estimate a
homography transformation~\cite[311--320]{szeliski_computer_2010}.
The homography is then used to estimate a new set of refined inliers, and
these are returned as the spatially verified feature correspondences.
\figSVInlier{}
\FloatBarrier{}
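
A minimal sketch of the RANSAC loop used for spatial verification is given
below (illustrative only: the affine hypothesis is fit by least squares, the
iteration count and inlier threshold are arbitrary, and the homography
refinement described above is omitted):
\begin{verbatim}
# RANSAC sketch: hypothesize affine transforms from 3 correspondences
# and keep the hypothesis with the most inliers.
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine transform mapping src points to dst points."""
    X = np.hstack([src, np.ones((len(src), 1))])
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A.T                                    # shape (2, 3)

def apply_affine(A, pts):
    return pts @ A[:, :2].T + A[:, 2]

def ransac_affine(src, dst, iters=500, thresh=5.0, seed=0):
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        A = fit_affine(src[idx], dst[idx])
        err = np.linalg.norm(apply_affine(A, src) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    src = rng.uniform(0, 100, size=(50, 2))
    A_true = np.array([[1.1, 0.1, 5.0], [-0.1, 0.9, -3.0]])
    dst = apply_affine(A_true, src) + rng.normal(0, 0.5, size=src.shape)
    dst[:10] = rng.uniform(0, 100, size=(10, 2))  # corrupt some matches
    print(ransac_affine(src, dst).sum(), "inliers of", len(src))
\end{verbatim}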
\subsection{Lowe's object recognition}

Lowe's introduction of SIFT descriptors includes an algorithm for recognizing
objects in a training database and serves as an instance recognition
baseline~\cite{lowe_distinctive_2004}.
A single kd-tree indexes all database image descriptors.
Approximate nearest neighbor search of the kd-tree is performed using the
best-bin-first algorithm~\cite{beis_shape_1997}.
For a query image, each keypoint is assigned to its nearest neighbor as a
match.
The next nearest neighbor (belonging to a different object) is used as a
normalizer --- a feature used to measure the distinctiveness of a match.
Any match for which the ratio of the distance to the match over the distance
to the normalizer is greater than a threshold $t_{\tt{ratio}} \teq 0.8$ is
filtered as not distinctive.
Features likely to belong to the same object are clustered using a Hough
transform, and then clusters of features are spatially verified with a RANSAC
approach~\cite{fischler_random_1981}.

\subsection{Bag-of-words instance recognition}\label{subsec:bow}

One of the most well-known techniques in instance recognition is the
\glossterm{bag-of-words} model introduced to computer vision by Sivic and
Zisserman~\cite{sivic_video_2003, sivic_efficient_2009}.
The bag-of-words model addresses instance recognition using techniques from
text retrieval.
An image is cast as a text document where the image patches (detected at
keypoints and described with SIFT) are the words.
The concept of a visual word is formalized using a visual vocabulary.
A \glossterm{visual vocabulary} is traditionally constructed by clustering
feature descriptors with k-means~\cite{lloyd_least_1982}, although more recent
methods have learned vocabularies using neural
networks~\cite{arandjelovic_netvlad_2016}.
The centroids of the clusters represent the \glossterm{visual words} in the
vocabulary.
These centroids are used to quantize descriptor space.
A feature in an image is assigned (quantized) to the visual word that is the
feature's approximate nearest neighbor in the vocabulary.
This means that each descriptor vector can be represented using just a single
number --- \ie{} its index in the vocabulary.
Vocabulary indices are used to construct an inverted index, which allows
multiple feature correspondences to be made using a single lookup.

Given a visual vocabulary, the bag-of-words algorithm consists of two
high-level steps: (1) an offline indexing step and (2) an online search step.
The offline step indexes a database of images for fast search.
First, each descriptor in each database image is assigned to its nearest
visual word.
An inverted index is constructed to map each visual word to the set of
database features assigned to that word.
Each database feature is assigned a weight based on its term frequency (tf).
Finally, each word in the vocabulary is assigned a weight based on its inverse
document frequency (idf).
The online step searches for the images in the database that are visually
similar to the query image.
First, each descriptor in the query image is assigned to its visual word, and
the term frequency of each visual word in the query image is computed.
Then, the inverted index is used to build a list of all images that share a
visual word with the query.
For each matching image, the sum of the tf-idf scores of the corresponding
features is used as the image score.
Finally, the ranked list of images is returned.
These steps are now described in further detail.

\subsubsection{The inverted index}

The visual vocabulary allows for a constant-length image representation.
An image is represented as a histogram of visual words called a bag-of-words.
A bag-of-words histogram is sparse because each image contains only a handful
of words from a vocabulary.
The sparsity of these vectors allows for efficient indexing using an inverted
index.
An inverted index maps each word to the database images that contain the word.
Therefore, when a feature in a new query image is quantized, the inverted
index looks up all the database features that it matches.
A new feature correspondence is created for each database feature the inverted
index maps to.
For each correspondence a feature matching score is computed.
Because all the word assignments and feature correspondences are known, the
scores of all matching images can be efficiently computed by summing the
scores of their respective feature correspondences.

\subsubsection{Vocabulary tf-idf weighting}

Each word in the database is weighted by its inverse document frequency (idf),
and each individual descriptor is weighted by its term frequency
(tf)~\cite{sivic_efficient_2009}.
The idea behind the idf weight is that words appearing infrequently in the
database are discriminative and should receive higher weight.
The idea behind the tf weight is that words occurring more than once in the
same image are more important.

\subsubsection{Formal bag-of-words scoring}

Let $\X$ be the set of descriptor vectors in an image.
We also use $\X$ to refer to the image in general.
Descriptor space is quantized using a visual vocabulary where $\C$ is the set
of word centroids and $w_\c$ is the weight of a specific word.
Let $\X_\c \subset \X$ be the set of descriptors in an image assigned to
visual word $\c$.
Let $q(\x)$ be the function that maps a vector to a visual word.
We overload notation to also let $q(\X)$ map a set of descriptors into a set
of visual words.

The tf-idf weighting of a single word $\c$ in the vocabulary is computed as
follows:
Let $N$ be the number of images in the database.
Let $N_\c$ be the number of images in the database that contain word $\c$.
$\card{\X}$ is the number of descriptors in an image, and $\card{\X_\c}$ is
the number of descriptors quantized to word $\c$ in that image.
The idf weighting of word $\c$ is:
\begin{equation}
w_\c = \opname{idf}(\c) = \ln{N/N_\c}
\end{equation}
The tf weighting of a word $\c$ in an image $\X$ is:
\begin{equation}
\opname{tf}(\X, \c) = \frac{\card{\X_\c}}{\card{\X}}
\end{equation}
Similarity between bag-of-words vectors is computed using the weighted cosine
similarity.
It is only necessary to sum the scores of matching features because the weight
of a word that is not in both a query and database image is $0$.
The tf-idf similarity between two images can be written as
\begin{equation}
\opname{sim}(\X, \Y) = \sum_{\c \in q(\X) \isect q(\Y)}
\opname{tf}(\X, \c) \opname{tf}(\Y, \c) \opname{idf}(\c)
\end{equation}
or equivalently
\begin{equation}
\opname{sim}(\X, \Y) = \frac{1}{\card{\X}\card{\Y}}\sum_{\c \in \C}
w_\c \sum_{\xdesc \in \X_\c} \sum_{\ydesc \in \Y_\c} 1
\end{equation}
The second formulation unifies the bag-of-words model with other
vocabulary-based methods in the SMK framework, which will be discussed later
in~\cref{sec:smk}.
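
The following sketch (a toy example with already-quantized word ids and a
made-up database, not the implementation evaluated in this work) scores
database images against a query by accumulating tf-idf products over an
inverted index, following the first formulation above:
\begin{verbatim}
# Toy inverted-index bag-of-words scoring with tf-idf weighting.
import numpy as np
from collections import Counter, defaultdict

# Quantized visual word ids per database image (quantization omitted).
database = {"img_a": [0, 0, 3, 7], "img_b": [1, 3, 3, 9], "img_c": [2, 7, 7, 8]}
N = len(database)

inverted = defaultdict(set)          # word id -> images containing it
for name, words in database.items():
    for w in words:
        inverted[w].add(name)
idf = {w: np.log(N / len(imgs)) for w, imgs in inverted.items()}

def tf(words):
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def score(query_words):
    q_tf = tf(query_words)
    scores = defaultdict(float)
    for w, q_weight in q_tf.items():
        for name in inverted.get(w, ()):          # one lookup per word
            scores[name] += q_weight * tf(database[name])[w] * idf[w]
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score([0, 3, 3, 5]))           # ranked list of (image, similarity)
\end{verbatim}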
\subsubsection{Extensions to bag-of-words}

The main strength and the main weakness of vocabulary-based matching is its
use of quantization.
Quantization allows large databases of images to be searched very
rapidly~\cite{nister_scalable_2006}.
However, quantization causes raw descriptors to lose much of their
discriminative information~\cite{philbin_lost_2008, boiman_defense_2008}.
When a high-dimensional feature vector is quantized, it encodes only the
presence of a word as a single number.
This number is only as descriptive as the partitioning of descriptor space,
which is quite coarse in $128$ dimensions, even with a large vocabulary.

Several methods have been developed to help reduce errors caused by
quantization.
Soft-assignment helps avoid quantization errors by mapping each raw descriptor
to multiple words~\cite{philbin_lost_2008}.
Another way to reduce quantization error is to use a finer partitioning of
descriptor space~\cite{philbin_object_2007}.
Approximate hierarchical clustering and approximate k-means have been used to
build vocabularies with up to $1.6 \times 10^7$
words~\cite{nister_scalable_2006, philbin_object_2007, mikulik_learning_2010}.
Alternative similarity measures for descriptor quantization are also explored
in~\cite{mikulik_learning_2010}.
A projection matrix for SIFT descriptors is learned
in~\cite{philbin_descriptor_2010} to preserve information that would be lost
in quantization.

Because the tf-idf weighting was originally designed for text retrieval, it
does not take into account challenges that occur in image recognition such as
bursty features --- a single feature that appears in an image with a higher
than expected frequency (\eg{} bricks on a wall or vertical stripes on a
zebra).
Strategies for accounting for burstiness involve penalizing frequently
occurring features by removing multiple matches to the same feature, using
inter-image normalization, and using intra-image
normalization~\cite{jegou_burstiness_2009}.

Query expansion is a way to increase the recall of retrieval techniques and
recover from tf-idf failure~\cite{chum_total_2007, chum_total_2011,
arandjelovic_three_2012, tolias_visual_2014}.
After an initial query, all spatially verified feature correspondences are
back-projected onto the query image.
Then the query is re-issued.
A model of ``confusing features'' --- features more likely to belong to the
background --- can be used to filter out matches that should not be
back-projected onto the query image.
Query expansion enriches the query with intermediate information that may help
retrieve other viewpoints of the query image.
However, because this technique requires at least one correct result in the
ranked list, it only improves recall for queries that already have high
accuracy.

One method to improve the performance of bag-of-words search is to remove
non-useful features.
It is found that as few as $4\percent$ of the features can be used in location recognition without loss in accuracy~\cite{turcot_better_2009}. This related work defines a useful feature as one that is robust enough to be matched with a corresponding feature and stable enough to exist in multiple viewpoints. Thus, these useful features are computed as those that produced a spatially verified match to a correct image. Any feature that does not produce at least one spatially verified match is removed. Removing non-robust features both saves space and improves matching accuracy. \subsection{Min hash} Min-hashing is the analog of locality-sensitive hashing for sets. Min-hashing has been applied as an instance recognition technique for near-duplicate image detection~\cite{chum_near_2008}, logo recognition~\cite{romberg_bundle_2013}, large scale image search~\cite{wang_semi_supervised_2012}, scene recognition~\cite{zhang_image_2011}, and unsupervised object discovery~\cite{chum_geometric_2009, chum_large_scale_2010}. The basic idea is to represent an image as a set of hashes based on permutations of a visual vocabulary. Recognition is accomplished by performing a lookup for each hash. Collisions are returned as the recognition results. Like LSH, the primary advantage of using min hash for instance recognition is its speed. \subsection{Hamming embedding} Hamming embedding is an extension of the bag-of-words framework that reduces the information lost in quantization by assigning each descriptor a small binary vector~\cite{jegou_hamming_2008, jegou_burstiness_2009, jegou_improving_2010}. Each visual word $\c$, is assigned a $d_b \times d$ random orthogonal projection matrix $\mat{P}_\c$, where $d$ is the number of descriptor dimensions and $d_b$ is the length of the binary code. A set of $d_b$ thresholds, $\vec{t}_\c \in \Real^{d_b}$, is pre-computed for each word using the descriptors used to form the visual word cluster. These descriptors are projected using the word's random orthogonal matrix, and the median value of each dimension is chosen as that dimension's threshold. When any descriptor, $\desc$, is assigned to a word $\c$ it is also assigned a binary Hamming code, $\vec{b}$. To compute the binary Hamming code the descriptor is projected using the word's orthogonal matrix, $\vec{b}' = \mat{P}_\c \desc$, and then each dimension is binarized based on a threshold, % $b_i = (b'_i > t_{\c{}i})$. When a query descriptor, $\desc$, is assigned to a word, $\c$, it is matched to all database descriptors belonging to that word. Each match is then assigned a score. First, the Hamming distance, $h_d$, is computed between the binary signature of the query and database descriptors. If the Hamming distance of the match is not within threshold, $h_t$, the score of the match is $0$ and does not contribute to bag-of-words scoring. Otherwise, the score is the word's squared idf weight multiplied by a Gaussian falloff based on the Hamming distance. Using the inverted index, each image is scored by summing the scores of the descriptors that matched that image. The image scores are used to define a ranked list of results. %In~\cite{jegou_hamming_2008} only the idf weight is used %In~\cite{jegou_burstiness_2009} squared idf and the Gaussian falloff %In~\cite{jegou_improving_2010} squared idf and the probability having a Hamming distance lower than or equal to a. 
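
A minimal sketch of the Hamming embedding signature and match scoring
described above is given below (illustrative only: the projection, thresholds,
idf weight, and the exact form of the Gaussian falloff are stand-ins rather
than values trained per visual word):
\begin{verbatim}
# Hamming embedding sketch: binarize projected descriptors and score
# matches with a Gaussian falloff on the Hamming distance.
import numpy as np

rng = np.random.default_rng(0)
d, d_b = 128, 64
P, _ = np.linalg.qr(rng.normal(size=(d, d)))
P = P[:d_b]                          # d_b x d random orthogonal projection
thresholds = rng.normal(size=d_b)    # per-word medians in a real system

def signature(desc):
    return (P @ desc) > thresholds   # binary code b

def match_score(b_query, b_db, idf_w, h_t=24, sigma=16.0):
    h = np.count_nonzero(b_query != b_db)      # Hamming distance
    if h > h_t:
        return 0.0                             # does not contribute
    return (idf_w ** 2) * np.exp(-(h ** 2) / (2 * sigma ** 2))

x, y = rng.normal(size=d), rng.normal(size=d)
print(match_score(signature(x), signature(y), idf_w=1.7))
\end{verbatim}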
%\begin{equation}
%w_d(h_d) = -\log_2\paren{2^{-d_b} \sum_{i = 0}^{h_d} \binom{i}{d_b}}
%\end{equation}

\subsection{Fisher vector}

A Fisher vector is an alternative to a
bag-of-words~\cite{perronnin_large_scale_2010_1, jegou_aggregating_2010}.
Like bag-of-words, Fisher vector representations have been used in both
instance and category recognition~\cite{perronnin_fisher_2007,
cinbis_image_2012, sun_large_scale_2013, sanchez_image_2013,
juneja_blocks_2013, douze_combining_2011, ma_local_2012,
murray_generalized_2014, gosselin_revisiting_2014}.
Instead of training a discrete visual vocabulary using the cluster centers of
k-means, a Fisher vector encoding uses a continuous Gaussian mixture model
(GMM).
The number of Gaussian components in the GMM is analogous to the number of
words in a vocabulary.
An image is encoded using the GMM by computing the likelihood of each feature
with respect to the GMM{}.
Likelihoods for different components of the GMM are aggregated using a
soft-max function.
Often, each component of this vector $\vec{v}$ is then power-law normalized
with a fixed constant $0 \leq \beta < 1$.
Power-law normalization is a simple post-processing method written as
$v_i = \txt{sign}\paren{v_i}\abs{v_i}^\beta$~\cite{jegou_aggregating_2012}.
Fisher vectors produce a much richer representation than a normal
bag-of-words vector because each descriptor is assigned to a continuous
mixture of words rather than a single word.
It is noted in~\cite{perronnin_large_scale_2010_1} that using Fisher vectors
for instance recognition is similar to tf-idf weighting.
Normalized Fisher vectors down-weight frequently occurring GMM components ---
\ie{} words with low idf weights.
Furthermore, Fisher vector representations are well suited for compression,
which allows scaling to large image collections.

\subsection{VLAD --- vector of locally aggregated descriptors}

A vector of locally aggregated descriptors (VLAD) is similar to a Fisher
vector descriptor --- in fact it is a discrete analog of a Fisher
vector~\cite{jegou_aggregating_2010, jegou_aggregating_2012}.
Like Fisher vectors, VLAD has been used in the context of both instance and
category recognition~\cite{jegou_negative_2012, delhumeau_revisiting_2013,
arandjelovic_all_2013}.
VLAD still computes a visual vocabulary and assigns each feature to its
nearest word, but instead of only recording the presence or absence of a word,
each feature computes the residual vector from the centroid of its assigned
word.
The residual vectors are summed to produce one constant-length vector per
word.
All summed residuals are concatenated to produce a constant-length image
representation.
Aggregation of the residual vectors allows an accuracy similar to that of
bag-of-words methods to be obtained using a smaller vocabulary
($\OnTheOrderOf{1} - \OnTheOrderOf{2}$ words).
Like Fisher vectors, VLAD descriptors are also power-law
normalized~\cite{jegou_aggregating_2012}.

There have been many extensions of the VLAD descriptor.
The value of PCA, whitening, and negative evidence was shown
in~\cite{jegou_negative_2012}.
The MultiVLAD scheme is inspired by~\cite{torii_visual_2011}, and allows for
retrieval of smaller objects that appear in larger
images~\cite{arandjelovic_all_2013}.
The basic idea is that VLAD descriptors are tiled in $3 \times 3$ grids.
An integral image~\cite{viola_robust_2004} of unnormalized VLAD descriptors is
used to represent many possible tiles.
A vocabulary adaptation scheme is also introduced
in~\cite{arandjelovic_all_2013}.
The vocabulary is updated when a new image is added to the VLAD inverted
index.
This is performed by updating any word centroid $\c$ to $\c'$, where $\c'$ is
the average of all the descriptors currently assigned to that word.
The residuals of the affected words are recomputed and re-aggregated into
updated VLAD descriptors.

Recently, NetVLAD --- a convolutional variant of the VLAD descriptor --- has
been introduced~\cite{arandjelovic_netvlad_2016,radenovic_cnn_2016}.
NetVLAD uses deep learning with a triplet loss function to simultaneously
learn both the patch-based descriptors and the vocabulary.
This convolutional approach shows large improvements (a $19\percent$
improvement on Oxford 5k) over previous state-of-the-art image retrieval
techniques.
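
To make the aggregation step concrete, the following sketch (a minimal
illustration with a random vocabulary, not a tuned implementation) computes an
unnormalized VLAD vector by summing residuals per word and then applies
power-law and L2 normalization:
\begin{verbatim}
# Minimal VLAD aggregation: sum residuals to assigned word centroids,
# concatenate, then power-law and L2 normalize.
import numpy as np

def vlad(descriptors, vocabulary, beta=0.5):
    K, d = vocabulary.shape
    # Assign each descriptor to its nearest visual word.
    dists = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = dists.argmin(axis=1)
    # Sum residual vectors per word and concatenate.
    agg = np.zeros((K, d))
    for desc, w in zip(descriptors, words):
        agg[w] += desc - vocabulary[w]
    v = agg.ravel()
    v = np.sign(v) * np.abs(v) ** beta            # power-law normalization
    return v / max(np.linalg.norm(v), 1e-12)      # L2 normalization

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab = rng.normal(size=(16, 128))            # K = 16 visual words
    descs = rng.normal(size=(300, 128))           # one image's descriptors
    print(vlad(descs, vocab).shape)               # (2048,)
\end{verbatim}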
The vocabulary is updated when a new image is added to the VLAD inverted index. This is performed by updating any word centroid $\c$ to $\c'$, where $\c'$ is the average of all the descriptors currently assigned to that word. The residuals of the affected words are recomputed and re-aggregated into updated VLAD descriptors.

Recently, NetVLAD --- a convolutional variant of the VLAD descriptor --- has been introduced~\cite{arandjelovic_netvlad_2016,radenovic_cnn_2016}. NetVLAD uses deep learning with a triplet loss function to simultaneously learn both the patch-based descriptors and the vocabulary. This convolutional approach shows large improvements over previous state-of-the-art image retrieval techniques (\eg{} a $19\percent$ improvement on Oxford 5k).

\subsection{SMK --- the selective match kernel}\label{sec:smk}
The selective match kernel (SMK) encapsulates vocabulary-based techniques such as bag-of-words, Hamming embedding, VLAD, and Fisher vectors into a unified framework~\cite{bo_efficient_2009, tolias_aggregate_2013, tolias_image_2015, jegou_triangulation_2014}. SMK ``bridges the gap'' between matching-based approaches (here a match refers to a feature correspondence) and aggregation-based approaches. The scores of matching-based approaches such as Hamming embedding and bag-of-words are based on establishing individual feature correspondences. In contrast, the scores of aggregation-based approaches such as VLAD and Fisher vectors are computed from compressed image representations, where the individual features are not considered. An advantage of a matching-based approach like Hamming embedding is that it can define a selectivity function, which down-weights individual feature correspondences with low descriptor similarity. Aggregation schemes have been shown to have their own advantages. Aggregated approaches like VLAD allow matching applications to scale to a large number of images because each image is indexed with a compressed representation. Furthermore, aggregation-based approaches have been shown to provide better matching results on many datasets because they implicitly down-weight bursty features~\cite{tolias_aggregate_2013, tolias_image_2015}.

In the SMK framework, a matching function and a selectivity function are chosen. Different selections of these functions can implement and blend desirable attributes of the aforementioned frameworks. The matching function assigns correspondences between query and database descriptors, and its choice determines whether the resulting kernel is aggregated or non-aggregated. The selectivity function weights a correspondence's contribution to image similarity. It can also apply either power-law-like normalization or hard thresholding in order to down-weight correspondences with low visual similarity. One advantage of the SMK framework is that the selectivity function can be used in aggregated matching; in this case, the selectivity function is applied to all correspondences assigned to a particular word.

\subsection{Face recognition and verification}
Face recognition is a specific form of instance recognition with the goal of recognizing individual human faces~\cite{zhao_face_2003, huang_labeled_2007}. Related to face recognition is the problem of face verification. In contrast to face recognition, face verification takes two unlabeled images and decides if they show the same face or different faces~\cite{taigman_deepface_2014}.
Clearly, these techniques are complementary because highly ranked results from a face recognition algorithm can be verified as true or false by a face verification algorithm. Due to the specific nature of this problem, specialized feature detectors are often used. Facial feature detectors localize facial landmarks such as the eye, mouth, and nose center and corner locations~\cite{dantone_real_time_2012, berg_tom_vs_pete_2012}. Local texture-based descriptors such as Gabor filters~\cite{liu_gabor_2002, zhang_histogram_2007, shen_review_2006} and local binary patterns (LBP)~\cite{ahonen_face_2006, chen_blessing_2013} are extracted at detected facial regions~\cite{belhumeur_localizing_2011}. Facial recognition researchers have also developed global descriptors --- such as eigenfaces~\cite{turk_eigenfaces_1991}, Fisherfaces~\cite{belhumeur_eigenfaces_1997}, and neural-network-based descriptions~\cite{lawrence_face_1997, taigman_deepface_2014} --- that represent the entire face. Recently, algorithms using both local and global representations computed with deep convolutional neural networks have shown state-of-the-art performance on both machine and human verification and recognition benchmarks~\cite{taigman_deepface_2014}.

In face recognition, each face image is encoded into a single vector. A function is trained to classify an unseen test image as an individual from the database of known faces. Many techniques are used in the literature to retrieve or classify a face. Examples of these techniques are: neural networks~\cite{turk_eigenfaces_1991, taigman_deepface_2014}, sparse coding~\cite{wright_robust_2009, jiang_label_2013}, principal component analysis (PCA)~\cite{craw_face_1992}, Fisher linear discriminant (FLD)~\cite{liu_robust_2000}, linear discriminant analysis (LDA)~\cite{lu_face_2003}, and support vector machines (SVMs)~\cite{phillips_support_1999, levy_svm_minus_2013}. Before the neural network revolution~\cite{krizhevsky_imagenet_2012}, sparse coding was one of the most popular techniques to retrieve faces~\cite{aharon_k_svd_2006, wright_robust_2009, zhang_sparse_2011, jiang_label_2013}. Sparse coding attempts to reconstruct unlabeled test vectors by searching for a linear combination of basis vectors from an over-complete labeled training database. Coding-based techniques are very similar to vocabulary-based methods: a codebook, a dictionary, and a vocabulary are all used to build image-level vector representations by quantizing raw features.

Another interesting technique is the Tom-vs-Pete classifier~\cite{berg_tom_vs_pete_2012}. Given a set of $N$ individuals (classes), a set of Tom-vs-Pete classifiers is used for both verification and indexing. At each facial landmark, $k$ Tom-vs-Pete classifiers are computed. A single Tom-vs-Pete classifier is a linear SVM trained on a single corresponding feature for a single pair of classes. \Eg{}, all the nose descriptors from class $T$ and class $P$ make up the SVM training data, and the learned SVM classifies a new nose feature as $T$-ish or $P$-ish. A descriptor vector for a single face is made by selecting $5000$ out of the total $k\binom{N}{2}$ classifiers and concatenating the signed distances from all the classifiers' separating hyperplanes. This descriptor facilitates both search and verification. A pair of face descriptor vectors can be verified as either a correct or incorrect match by constructing a new vector, formed by concatenating the element-wise product and difference of the two descriptor vectors.
This new vector is then classified using a radial basis function SVM{}.

One of the most recent advances in face verification and recognition is the DeepFace system~\cite{taigman_deepface_2014}. DeepFace implements face verification using the following pipeline: (1) detect, (2) align, (3) represent, and (4) classify. Specialized facial point detectors and a 3D face model are used to register a 3D affine camera to an RGB-image. The image is then warped into a ``frontalized'' view using a piecewise affine transform. A face is represented as the $4096$-dimensional output of a deep $7$-layer convolutional neural network that exploits the aligned nature of the input images. An $8$\th layer is used during supervised training, where each output unit corresponds to a specific individual. At test time, the L2-normalized output of the network is used as the feature representation. In a supervised setting, a $\chi^2$-SVM is trained to recognize the individuals in a training dataset using the descriptor vectors produced by the network. In an unsupervised setting, an ensemble of classifiers is used. The ensemble is composed of the output of a Siamese network~\cite{chopra_learning_2005} and several non-linear SVM classifiers with different inputs. The inputs are deep representations --- the activations of a deep neural network's output layer --- of the 3D aligned RGB-image, the 2D aligned RGB-image (generated using a simpler model based on similarity transforms), and an image comprised of intensity, magnitude, and orientation channels. Each input is fed through four deep networks, each with a different initialization seed. DeepFace achieves an accuracy of $0.9735$ on the Labeled Faces in the Wild dataset~\cite{huang_labeled_2007}, which is comparable to the human performance measured at $0.975$. When using unaligned faces, the ROC score drops to $0.879$, which demonstrates that alignment is very important for handling the problem of viewpoint in face verification.

\subsection{Person re-identification}
%Radke dictionary learning ICCV~\cite{karanam_person_2015}
%Radke pose priors TPAMI~\cite{wu_viewpoint_2015}
%Deep model of person re-id~\cite{shi_embedding_2016}.
The person re-identification problem is typically posed in the context of locating the same person within a few minutes or hours from low-resolution surveillance video~\cite{hirzer_relaxed_2012,karanam_person_2015,wu_viewpoint_2015,shi_embedding_2016}. Common approaches transform images into a fixed-length mid-level vector representation, and a learned distance metric is used to compare these representations. Mid-level representations can be built from color and texture histograms or extracted using a convolutional neural network. The distance metric is commonly learned as a Mahalanobis distance using linear discriminant analysis~\cite{hirzer_relaxed_2012}. However, alternative approaches using dictionary learning~\cite{karanam_person_2015} have also been shown to work well. Improvements over these baselines can be achieved by conditioning person descriptors on viewpoint and pose~\cite{wu_viewpoint_2015}. Recently, both the features and the distance metric have been learned using neural networks~\cite{shi_embedding_2016}.
%The data for person re-identification typically is composed of
% low-resolution image captured by surveillance cameras.
%The goal is often to identify images of people taken within minutes or
% hours of each other.
%and
% therefore keypoint algorithms have typically proven most successful.
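As an illustration of the metric-learning step shared by these approaches, the following minimal sketch compares two mid-level person descriptors under a learned Mahalanobis metric. It is written in Python with NumPy, and the matrix stands in for the output of whichever metric-learning procedure is used; it is not the method of any particular cited system.
\begin{verbatim}
import numpy as np

def mahalanobis_distance(x, y, M):
    """Distance between two mid-level person descriptors x and y.

    M: (d, d) positive semi-definite matrix produced by a metric-learning
       method (e.g., fit on labeled same/different person pairs).
    """
    diff = x - y
    return float(np.sqrt(diff @ M @ diff))

# With M set to the identity matrix this reduces to Euclidean distance.
\end{verbatim}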
%Therefore, additional work is needed to generalize to other species.
%We have performed initial experiments that support this claim.
\subsection{Discussion --- instance recognition}
Most instance recognition techniques use an indexing scheme based on a visual vocabulary~\cite{tolias_image_2015, jegou_hamming_2008, philbin_object_2007, cao_learning_2012, arandjelovic_all_2013, jegou_negative_2012, chum_fast_2012, gong_multi_scale_2014}. However, our baseline approach for animal identification does not use a visual vocabulary. This is because a visual vocabulary quantizes the raw features in the image and thus removes some of their discriminative ability~\cite{philbin_lost_2008, boiman_defense_2008}. We have found this quantization to cause a noticeable drop in performance. Many aspects of our baseline algorithm are similar to Lowe's recognition algorithm~\cite{lowe_distinctive_2004}, which does not quantize descriptors. The guiding principles of matching, filtering based on distinctiveness, filtering based on spatial consistency, and scoring are shared with our approach. However, our approach features several improvements to this algorithm. Furthermore, animal identification is a dynamic problem with specific domain-based concerns --- such as quality and viewpoint in natural images --- and requires innovation beyond Lowe's recognition algorithm.
% chktex-file 8
Even though we would prefer to retain the discriminative information contained in raw descriptors, quantized image search can scale well beyond the size of our current image collections~\cite{chum_fast_2012, perronnin_large_scale_2010_1, tolias_image_2015}. In the future, it may be necessary to investigate a VLAD-based SMK framework as a quantized alternative to our matching algorithm. Techniques such as soft-assignment~\cite{philbin_lost_2008} and learned vocabularies~\cite{mikulik_learning_2010} could be used to reduce quantization errors. It would also be necessary to update the vocabulary as new images are added to the system; this issue could be addressed using the vocabulary adaptation technique in~\cite{arandjelovic_all_2013}. However, in this research we are more focused on the problem of verifying identifications to reduce manual effort. As such, we leave the scalable search issue for future work.

Facial recognition is similar to the problem of animal identification.
%Technically is is a subset of the problem.
Both problems seek to identify individuals. Some techniques used for face verification, such as the Siamese network~\cite{chopra_learning_2005, taigman_deepface_2014}, can be extended to animal identification. However, there is a much more mature literature on face recognition that has resulted in easily accessible and specialized algorithms for facial feature detection and --- most importantly --- for face alignment. Individual animal identification does not have such a corpus of knowledge. We do not have access to highly specialized animal part detectors and alignment algorithms. Furthermore, we would like our algorithms to generalize to multiple species, so we would like to avoid over-specialized approaches. These are some reasons why convolutional neural networks will not make a prominent appearance in this \thesis{}. Other reasons involve the size of our datasets. The recent NetVLAD network was trained on datasets of $10,000$ to $90,000$ images~\cite{arandjelovic_netvlad_2016}. We simply do not have this much labeled data.
However, one goal of this \thesis{} is to develop techniques that will help bootstrap labeled datasets of this size. Future research should investigate these deep learning techniques so they can be used after enough data has been collected for a specific species.

While the problems of animal identification and person re-identification are conceptually similar --- sharing challenges such as lighting, pose, and viewpoint variation --- differences in data collection create the need for different solutions in practice. In contrast to the low-resolution images captured by surveillance cameras, the images used in animal identification are often manually captured by scientists in the field using high-resolution DSLR cameras, and the goal is to match individuals over longer periods of time (years). Furthermore, re-identification techniques commonly focus on aggregate features that emphasize clothing, color, texture, and the presence of objects such as coats and backpacks, while in the animal identification problem, it is often subtle localized variations in patterns on the skin and fur that distinguish individuals.

\section{CATEGORY RECOGNITION}\label{sec:cr}
Different types of image recognition lie at different points on a spectrum of specificity. If instance recognition is at one end of the spectrum, then category recognition is at the other. The goal of a \glossterm{category recognition} algorithm is to assign a categorical class label to a query image~\cite{everingham_pascal_2010, everingham_pascal_2015, russakovsky_imagenet_2015, deng_imagenet_2009, fei_fei_one_shot_2006, griffin_caltech_256_2007}. The categories often have visual appearances with a high degree of intra-class variance. \Eg{}, a recliner and a bench both belong to the chair category. Image representations and similarity measures are constructed to account for this. Despite this, techniques in category recognition have many similarities to instance recognition techniques. Until the neural network revolution~\cite{krizhevsky_imagenet_2012}, most category recognition techniques were based on vocabulary methods~\cite{csurka_visual_2004, yang_linear_2009, sanchez_compressed_2013, russakovsky_imagenet_2015, krizhevsky_imagenet_2012} similar to those discussed in~\cref{subsec:bow}. This section first provides a brief overview of this literature. Then, we discuss \naive{} Bayes classification techniques~\cite{boiman_defense_2008,mccann_local_2012} that play a large role in our baseline animal identification algorithms.

\subsection{Vocabulary-based methods for category recognition}
After vocabulary-based techniques demonstrated success in instance recognition, they were quickly adapted and applied to category recognition~\cite{csurka_visual_2004}. Thus, there are many similarities --- and some differences --- in the techniques used to address these two problems. One difference is the size of the visual vocabulary. Instance recognition tends to require huge vocabularies ($\OnTheOrderOf{5}$--$\OnTheOrderOf{7}$ words) to achieve a fine sampling of descriptor space~\cite{nister_scalable_2006, philbin_object_2007}. In contrast, category recognition uses smaller vocabulary sizes ($\OnTheOrderOf{4}$ words) to more coarsely sample descriptor space~\cite{zhang_local_2006}. However, the vocabularies used in instance recognition have decreased in size with the advent of aggregated representations like VLAD and the Fisher vector~\cite{arandjelovic_all_2013, sanchez_compressed_2013}.
A second difference is how similarity between images is computed. In instance recognition, the similarity between bag-of-words vectors is computed using a weighted cosine similarity. However, in category recognition, intra-class variation requires more sophisticated similarity measurements. Here, image similarity is computed using SVMs with either linear or non-linear kernels such as the $\chi^2$, earth mover's distance, Hellinger, and Jensen-Shannon kernels~\cite{zhang_local_2006, varma_learning_2007, vedaldi_efficient_2012}.

A third difference is the way that spatial information is used. Instead of filtering correspondences using spatial verification, spatial information is incorporated into category recognition algorithms using spatial pyramids~\cite{grauman_pyramid_2005, lazebnik_beyond_2006}. A spatial pyramid sub-divides an image into a hierarchy of grids. Max pooling is often used to select only the strongest features in each spatial region~\cite{boureau_theoretical_2010, boureau_learning_2010}. Each section of the image is encoded using the vocabulary, and images are scored based on matches in each region.

\subsubsection{Enhancements to category recognition}
There is a wide variety of extensions and enhancements for image classification techniques based on bag-of-words, such as soft assignment of visual words~\cite{liu_defense_2011} and vocabulary optimization~\cite{wang_locality_constrained_2010}. Numerous matching kernels --- both linear and non-linear --- have been developed, such as kernel PCA, histogram intersection, and SVMs applied to square-rooted bag-of-words vectors~\cite{vedaldi_multiple_2009, maji_classification_2008, perronnin_large_scale_2010}.
% chktex-file 8
Generalized coding schemes improve performance over a bag-of-words image encoding. Vocabularies can be seen as codebooks or dictionaries in coding-based image classification techniques such as sparse coding and locality-constrained linear coding~\cite{jurie_creating_2005, yang_linear_2009, yang_supervised_2010, yang_efficient_2010, wang_locality_constrained_2010}. Many coding schemes learn both the centroids and the function that quantizes a raw descriptor into a word~\cite{jurie_creating_2005, yang_linear_2009, yang_supervised_2010, yang_efficient_2010, wang_locality_constrained_2010, vedaldi_multiple_2009}. Techniques other than k-means, such as mean shift~\cite{jurie_creating_2005}, coordinate descent with the locality-constrained linear coding criterion~\cite{wang_locality_constrained_2010}, and random forests~\cite{perronnin_fisher_2007}, are also used to create vocabularies. Fisher vectors with linear classifiers have been found to outperform non-linear bag-of-words-based SVM classifiers by using an L1-based distance measure and careful L2 and power-law normalization of descriptors~\cite{perronnin_improving_2010, perronnin_large_scale_2010}.

\subsection{\Naive{} Bayes classification}\label{sec:nbnn}
The \naive{} Bayes nearest neighbor (NBNN) classifier is a simple non-parametric algorithm for category recognition that does not quantize descriptor vectors~\cite{boiman_defense_2008}. Boiman responds to the dominance of complex non-linear category recognition algorithms in the field~\cite{varma_learning_2007, marszalek_learning_2007} by showing that simple techniques can compete with these complex methods. Boiman's paper also provides insight into the magnitude of the information loss resulting from quantization.
Prior to~\cite{boiman_defense_2008}, nearest neighbor classifiers had shown underwhelming accuracy in category recognition~\cite{varma_unifying_2004, lazebnik_beyond_2006, marszalek_learning_2007}. This was shown to be a result of using image-to-image distance. To remedy this, NBNN aggregates the information from multiple images by swapping the image-to-image distance for an image-to-class distance. In NBNN, the features of each class are indexed for fast nearest neighbor search, typically with a kd-tree~\cite{bentley_multidimensional_1975}. For each feature, $\desc_i$, in a query image, the algorithm searches for the feature's nearest neighbor in each class, $\opname{NN}_C(\desc_i)$. The result of the algorithm is the class, $C$, that minimizes the image-to-class distance. In other words, the class of a query image is chosen by searching for the class that minimizes the total distance between each query descriptor and the nearest database descriptor in that class. This is expressed in the following equation:
\begin{equation}
C = \argmin{C} \sum_{i=1}^n ||\desc_i - \opname{NN}_C(\desc_i)||^2
\end{equation}
This formulation, in which each descriptor is assigned only to its single nearest neighbor, has been shown to be a good approximation to the minimum image-to-class Kullback-Leibler divergence~\cite{boiman_defense_2008} --- a measure of how much information is lost when the query image is used to model the entire class.

\subsection{Local \naive{} Bayes nearest neighbor}\label{sec:lnbnn}
Local \naive{} Bayes nearest neighbor (LNBNN) is an improved version of the NBNN algorithm in both accuracy and speed~\cite{mccann_local_2012}. In the original NBNN formulation, a search is executed to find each query descriptor's nearest neighbor in the database for each class separately. In contrast, the LNBNN modification searches all database descriptors simultaneously and ignores classes that do not return descriptor matches. Each descriptor $\desc_i$ in the query image searches for its nearest $K+1$ neighbors, %
$\{\desc_1, \ldots, \desc_K, \desc_{K + 1}\}$,
over all classes. The first $K$ neighbors are used as matches. The last neighbor is used as a normalizing term to weight the query descriptor's distinctiveness. Let $(\desc_i, \desc_j)$ be a matching descriptor pair, and let $C$ be the class of $\desc_j$. The score of each match is computed as the distance to the match subtracted from the distance to the normalizer:
\begin{equation}
s_{i, C} = \elltwo{\desc_i - \desc_{K + 1}} - \elltwo{\desc_i - \desc_j}
\end{equation}
The score of a class $C$ is the sum of the scores of all the descriptor matches assigned to it.

\subsection{Discussion --- class recognition}
Progress in category recognition is generally made using techniques that allow classes with high intra-class variance to have lower matching scores. This is of little value to an instance recognition application; therefore, we do not investigate most of the techniques in this section. However, the LNBNN~\cite{mccann_local_2012} approach is interesting to us because it is a simple algorithm that does not suffer from quantization artifacts. NBNN and LNBNN~\cite{boiman_defense_2008,mccann_local_2012} never achieved state-of-the-art performance in image classification; however, they produced competitive results using simple techniques. The simplicity of these techniques allowed the authors to gain insight into visual recognition.
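To make the LNBNN scoring of~\cref{sec:lnbnn} concrete, the following is a minimal sketch in Python using NumPy and SciPy's kd-tree. The database layout, the choice of $K$, and the use of unsquared distances are illustrative simplifications rather than the exact implementation of~\cite{mccann_local_2012}.
\begin{verbatim}
import numpy as np
from collections import defaultdict
from scipy.spatial import cKDTree

def lnbnn_scores(query_descs, db_descs, db_classes, K=5):
    """Score classes for one query image with LNBNN.

    query_descs: (n, d) query descriptors.
    db_descs:    (m, d) database descriptors over all classes.
    db_classes:  length-m sequence of class labels for db_descs.
    """
    tree = cKDTree(db_descs)
    scores = defaultdict(float)
    for q in query_descs:
        # K matches plus one extra neighbor used as the normalizer.
        dists, idxs = tree.query(q, k=K + 1)
        normalizer = dists[K]         # distance to the (K+1)-th neighbor
        for dist, idx in zip(dists[:K], idxs[:K]):
            scores[db_classes[idx]] += normalizer - dist
    return dict(scores)
\end{verbatim}
The class (or, in our setting, the individual) with the highest total score is returned as the top-ranked result.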
Due to its simplicity and the insight that quantization significantly reduces the descriptive power of SIFT features, we adopt LNBNN as the baseline algorithm for animal identification.

\section{FINE-GRAINED RECOGNITION}\label{sec:fgr}
Fine-grained recognition is a problem more general than instance recognition but more specific than category recognition~\cite{parkhi_cats_2012, berg_poof_2013, gavves_local_2014}. Given an object of a known category, such as a bird, the goal of fine-grained recognition is to sub-classify the object into a fine-grained category such as a blackbird or a raven~\cite{berg_how_2013}. Algorithms for fine-grained recognition typically start by localizing the object and its parts with a detection algorithm~\cite{dalal_histograms_2005} and parts-based models. Parts are segmented to remove background noise using algorithms like GrabCut~\cite{rother_grabcut_2004}. Classification is performed locally on aligned parts as well as globally on the entire body, and the results are aggregated to yield a final classification. Because fine-grained recognition lies on the same spectrum as instance recognition and category recognition, it is not surprising that many of the same techniques --- like Fisher vectors --- are used~\cite{gosselin_revisiting_2014}. Recently, convolutional models have been successfully applied to fine-grained recognition~\cite{catherine_wah_similarity_2014, branson_bird_2014, zhang_weakly_2015, xiao_application_2015}.

\subsection{Discussion --- fine-grained recognition}
The goal of fine-grained recognition is somewhat similar to that of animal identification. Fine-grained recognition localizes subtle information to distinguish between two similar species, whereas animal identification localizes subtle information to distinguish between two similar individuals. However, the domains of species and individuals are dissimilar enough that off-the-shelf techniques for fine-grained recognition would need to be adapted before identification could be performed. One interesting avenue of research would be to use a parts model~\cite{felzenszwalb_object_2010}, as in~\cite{gavves_local_2014}, to align individuals before they are compared.

\section{DEEP CONVOLUTIONAL NEURAL NETWORKS}\label{sec:dcnn}
Convolutional networks have been around for more than two decades~\cite{lecun_gradient_based_1998, fukushima_neocognitron_1988}. However, they did not receive major attention from computer vision researchers until $2012$, when a deep convolutional neural network (DCNN)~\cite{krizhevsky_imagenet_2012} outperformed the best support vector machines (SVMs)~\cite{vapnik_statistical_1998} by over $10\percent$ in the ImageNet category recognition challenge~\cite{russakovsky_imagenet_2015}. Since then, many successful category recognition techniques based on DCNNs have been published~\cite{simonyan_very_2015, chatfield_efficient_2014, chatfield_return_2014, oquab_learning_2014, szegedy_going_2015, long_convnets_2014, he_spatial_2014, dean_fast_2013}.
DCNNs have also been shown to produce excellent results when applied to other computer vision problems such as: %
instance recognition~\cite{razavian_cnn_2014, razavian_baseline_2015, liu_learning_2015, held_deep_2015,arandjelovic_netvlad_2016,radenovic_cnn_2016}, %
fine-grained recognition~\cite{branson_bird_2014, donahue_decaf_2014, catherine_wah_similarity_2014}, %
detection~\cite{girshick_rich_2014, sermanet_overfeat_2013, li_wan_end_end_2015}, %
face verification~\cite{huang_learning_2012, taigman_deepface_2014, sun_deep_2013}, %
and learning similarity between feature patches~\cite{osendorfer_convolutional_2013, han_matchnet_2015, ng_exploiting_2015, zagoruyko_learning_2015}.

The sudden success of deep nets has been attributed to (1) a larger volume of available training data and (2) implementations using faster GPUs~\cite{krizhevsky_imagenet_2012}. Several techniques are employed to increase accuracy, reduce over-fitting, and reduce training time. Data augmentation is used to artificially increase the amount of training data~\cite{ciresan_multi_column_2012, ciresan_high_performance_2011, simard_best_2003}. The dropout technique has been shown to reduce over-fitting~\cite{dahl_improving_2013, srivastava_dropout_2014}. At training time, the outputs of hidden units are randomly suppressed, which forces the network to learn a more robust representation. It has been shown that dropout can be viewed as a form of model averaging~\cite{hinton_improving_2012}. Rectified linear units (ReLUs) have been shown to be a faster alternative to the standard sigmoid activation functions~\cite{vinod_rectified_2010, dahl_improving_2013}. A ReLU is similar to a hinge function; it simply outputs the signal of a unit if it is positive and outputs zero otherwise. Leaky rectified linear units (LReLUs) further improve network accuracy by including a ``leakiness'' term while maintaining the speed of ReLUs~\cite{maas_rectifier_2013}. While a ReLU strictly suppresses a feature activation if it is negative, an LReLU instead returns a small negative signal (the activation multiplied by a small constant).

A deep neural network is constructed by stacking several layers of units (neurons) together. Data is used to initialize the activations of an input layer, and the information is forward-propagated through the network. The weights are chosen to optimize a loss function --- \eg{} categorical cross-entropy or triplet loss~\cite{schroff_facenet_2015} --- selected according to the application. Optimization of the loss function is performed using back-propagation~\cite{rumelhart_learning_1986} --- typically using mini-batches and stochastic gradient descent with momentum~\cite{sutskever_importance_2013}. Traditionally, each layer in a neural network is fully connected to the previous layer --- each pair of units between the two layers has its own edge weight. However, in computer vision, networks are constructed using convolutional layers. A DCNN connects the input layer to a stack of convolutional layers~\cite{krizhevsky_imagenet_2012}. A convolutional layer differs from a fully connected layer in that it is sparsely connected and most of the edge weights between layers are shared~\cite{lecun_gradient_based_1998, fukushima_neocognitron_1988, serre_robust_2007}. Each convolutional layer is broken into several channels, and each channel is given its own weight matrix with a fixed width and height.
This matrix of weights is convolved with the input layer to produce a feature activation map, one for each channel. Convolutional layers are often followed by pooling layers that aggregate information over a small area, reduce the size of the feature map, and increase robustness to transformations. Common pooling operations are max-pooling~\cite{serre_robust_2007, krizhevsky_imagenet_2012} and maxout~\cite{goodfellow_maxout_2013}. The convolutional layers may also be connected to a stack of fully connected layers. In this case, hierarchies of feature maps are built in the low-level convolutional layers, and then the fully connected layers learn decision boundaries between these features~\cite{zeiler_visualizing_2014}.

Because of weight sharing, convolutional networks must learn significantly fewer parameters than fully connected networks. This allows convolutional networks to be trained much faster. Having fewer weights also acts as a form of regularization for the network. Intuitively, learned convolutional filters are similar to Gabor filters~\cite{gabor_theory_1946}, which are naturally suited for extracting features from images. Even without learning weights, convolutions can be used to extract powerful features for matching~\cite{revaud_deep_2015}. The popular SIFT and HoG features~\cite{mahendran_understanding_2015} can even be implemented as convolutional networks. Despite the lack of hard theoretical insight into the inner workings of these networks, their empirical performance cannot be denied.

\subsection{Discussion --- deep convolutional neural networks}\label{subsec:dcnndiscuss}
Because of the astounding success of convolutional networks in almost every area of computer vision, we have investigated their use in animal identification. Specifically, we have investigated two approaches. The first approach used deep convolutional feature descriptors as a replacement for the SIFT~\cite{lowe_distinctive_2004} descriptor, following the patch-based scheme in~\cite{zagoruyko_learning_2015}. The basic idea is to feed two patches through the same (Siamese)~\cite{chopra_learning_2005} architecture and then compare their resulting encodings. This comparison can be as simple as a Euclidean distance or as complex as a learned distance measure. Training can be performed on pairs of patches, labeled as correct or incorrect, using a discriminative loss function~\cite{lecun_loss_2005}. Unfortunately, due to issues with the quality and quantity of our training data, our convolutional replacements for the SIFT descriptor have not been successful. The second approach aimed to use Siamese networks to directly compare two images of an animal to determine if they show the same or different individuals, similar to the method used in DeepFace~\cite{taigman_deepface_2014} for face verification. However, without the large training datasets and specialized alignment procedures used in DeepFace, we were unable to produce promising results.

Due to these issues, this \thesis{} does not further pursue techniques based on DCNNs. We include this discussion to note the potential of deep learning applied to animal identification and to strongly suggest further investigation of these techniques in future research. Of particular interest for future research is the matching technique presented in~\cite{rocco_convolutional_2017}. This method is particularly interesting because it learns to match and align images by mimicking a classic computer vision pipeline while using only synthetic training data.
This may be able to overcome the issues mentioned above; however, further investigation is needed.
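As a concrete pointer for that future work, the following is a minimal sketch of the Siamese patch-comparison setup discussed in~\cref{subsec:dcnndiscuss}, written in Python and assuming PyTorch is available. The architecture, patch size, and embedding dimension are illustrative choices, not the networks evaluated in this \thesis{}.
\begin{verbatim}
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Small convolutional encoder for 64x64 grayscale patches."""
    def __init__(self, dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                             # 64 -> 32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                             # 32 -> 16
        )
        self.embed = nn.Linear(64 * 16 * 16, dim)

    def forward(self, x):
        return self.embed(self.features(x).flatten(1))

def siamese_distance(encoder, patch_a, patch_b):
    """Both patches share the same encoder weights (the Siamese property)."""
    return torch.norm(encoder(patch_a) - encoder(patch_b), dim=1)

# Usage: lower distance suggests a correct (same-individual) patch match.
encoder = PatchEncoder()
a, b = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
print(siamese_distance(encoder, a, b))
\end{verbatim}
Training such an encoder on labeled correct and incorrect patch pairs (\eg{} with a contrastive or triplet loss) is the step that, as discussed above, requires more labeled data than is currently available.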