Recommendations for acoustic recognizer performance assessment with application to five common automated signal recognition programs

Automated signal recognition software is increasingly used to extract species detection data from acoustic recordings collected using autonomous recording units (ARUs), but there is little practical guidance available for ecologists on the application of this technology. Performance evaluation is an important part of employing automated acoustic recognition technology because the resulting data quality can vary with a variety of factors. We reviewed the bioacoustic literature to summarize performance evaluation and found little consistency in evaluation, metrics employed, or terminology used. We also found that few studies examined how score threshold, i.e., cut-off for the level of confidence in target species classification, affected performance, but those that did showed a strong impact of score threshold on performance. We used the lessons learned from our literature review and best practices from the field of machine learning to evaluate the performance of five readily-available automated signal recognition programs. We used the Common Nighthawk (Chordeiles minor) as our model species because it has simple, consistent, and frequent vocalizations. We found that automated signal recognition was effective for determining Common Nighthawk presence-absence and call rate, particularly at low score thresholds, but that occupancy estimates from the data processed with recognizers were consistently lower than from data generated by human listening and became unstable at high score thresholds. Of the five programs evaluated, our convolutional neural network (CNN) recognizer performed best, with recognizers built in Song Scope and MonitoR also performing well. The RavenPro and Kaleidoscope recognizers were moderately effective, but produced more false positives than the other recognizers. 
Finally, we synthesized six general recommendations for ecologists who employ automated signal recognition software, including what to use as a test benchmark, how to incorporate score threshold, what metrics to use, and how to evaluate efficiency. Future studies should consider our recommendations to build a body of literature on the effectiveness of this technology for avian research and monitoring.


INTRODUCTION
Autonomous acoustic sampling is a popular method of data collection for ecological research and monitoring because many species use sound as a primary method of communication (Catchpole and Slater 2008, Shonfield and Bayne 2017). In avian research, autonomous recording units (ARUs) are used to collect acoustic recordings, which can then be used for monitoring population trends (Furnas and Callas 2015), behavioral studies (Ehnes and Foote 2014), modeling habitat associations (Campos-Cerqueira and Aide 2016), and detecting rare or inconspicuous species (Homes et al. 2014, Sidie-Slettedahl et al. 2015). ARUs provide a variety of benefits over traditional human point counts, including the ability to collect data over repeat visits (Drake et al. 2016) and the flexibility to collect data at any time of day or year (Goyette et al. 2011). Additionally, recordings provide a permanent record that can reduce observer bias (Haselmayer and Quinn 2000, Campbell and Francis 2012), be used to verify identification of rare species (Swiston and Mennill 2009, Holmes et al. 2015), and be analyzed later for other objectives (Luther and Derryberry 2012). ARU technology has also been widely used to study marine mammals, bats, insects, and frogs.
One of the challenges of using ARUs for ecological research and monitoring is the time required to extract target species detections from recordings (Shonfield and Bayne 2017). In response, automated signal recognition programs have been developed (e.g., de Oliveira et al. 2015, Katz et al. 2016, Nicholson 2016). Automated acoustic species recognition is the process of training a computer to detect, recognize, and evaluate the acoustic signature of a target species' vocalization. The computer runs the resultant detection algorithm (hereafter "recognizer") on recordings and evaluates the fit of the acoustic signal in the recording using a moving window. Some programs employ a single step process that runs the algorithm against every window (hereafter "moving window recognizer") while others use a two-step process that first conducts signal detection with a moving window, and then runs the algorithm only on detected signals (hereafter "signal detection recognizer"). For each window or signal evaluated, the recognizer assigns a score value, which can be interpreted as a measure of confidence that a target vocalization match has been found. The recognizer then registers a "hit" for each signal with a score above a user-defined threshold. Choosing a high score threshold will minimize false positives, i.e., false identifications, but also results in false negatives, i.e., missed detections. If the score threshold is set low by the user, this will minimize false negatives, but create many false positives. Choosing a score threshold is generally a subjective process based on the priorities of the user (Katz et al. 2016). The results of automated signal recognition are often manually validated by the user to separate true positives from false positives. Many approaches to automated acoustic species recognition or classification have been employed including random forest (Aide et al. 2013, Campos-Cerqueira and Aide 2016), Hidden Markov models (HMM; Skowronski and Harris 2006, Potamitis et al. 2014, de Oliveira et al. 2015) and/or Gaussian mixture models (GMM; Ganchev et al. 2015, Heinicke et al. 2015), binary point matching (Katz et al. 2016), spectrogram cross-correlation (Katz et al. 2016), artificial neural networks (Jennings et al. 2008, Tachibana et al. 2014, Nicholson 2016), decision trees (Digby et al. 2013), and band-pass filters (Charif et al. 2010). There are annual and one-time machine learning competitions that drive the development of new birdsong recognizer methods (Stowell et al. 2016, Goëau et al. 2017), with current state-of-the-art approaches using deep machine learning models such as convolutional neural networks to recognize multiple species from soundscape recordings (Koops et al. 2014, Joly et al. 2016, Salamon and Bello 2017). Some of these approaches are commercially or freely available, while others have been custom-built for specific research projects.
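The trade-off between false positives and false negatives described above can be illustrated with a minimal Python sketch. The scores and truth labels are invented for illustration, the function names are hypothetical, and a hit is assumed to be any window whose score meets or exceeds the threshold.

```python
def apply_threshold(scores, threshold):
    """Return indices of windows registered as hits (score >= threshold)."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def count_errors(scores, is_target, threshold):
    """Count false positives and false negatives at one score threshold."""
    hits = set(apply_threshold(scores, threshold))
    false_pos = sum(1 for i in hits if not is_target[i])
    false_neg = sum(1 for i, t in enumerate(is_target) if t and i not in hits)
    return false_pos, false_neg


# Hypothetical recognizer scores and human-validated labels (illustrative only).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
is_target = [True, True, False, True, False, False]

low_threshold_errors = count_errors(scores, is_target, 0.2)   # many FP, no FN
high_threshold_errors = count_errors(scores, is_target, 0.9)  # no FP, more FN
```

With these illustrative values, the low threshold yields two false positives and no false negatives, whereas the high threshold yields no false positives and two false negatives.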
The number of tools available for automated signal recognition is rapidly increasing, yet there remains a need for a set of general recommendations for recognizer development and performance evaluation in ecology (Blumstein et al. 2011). Many authors have compared individual automated signal recognition programs to human processing to substantiate their use in ecological monitoring and research; however, authors have used a variety of metrics for evaluation, making it difficult to compare across studies. In other acoustic signal processing disciplines such as music analysis, speech classification, and machine learning, there are established best practices that ecologists can draw on to develop standardized evaluation methods (Salzberg 1997, Sokolova and Lapalme 2009, Raffel et al. 2014, Mesaros et al. 2016). Recognizer evaluation is particularly important because the quality of the species detection data produced can depend on a variety of factors including score threshold (Brauer et al. 2016), signal complexity of target species, quality of training data, spectrogram conversion, e.g., FFT size (Crump and Houlahan 2017), and recognition approach (Stowell et al. 2016). Ultimately, the appropriateness of automated acoustic species recognition will depend on the objective of the research or monitoring.
In response to this need for guidance, our goal was to provide general recommendations for recognizer performance comparison and evaluation. First, we review the literature for bioacoustic recognizer evaluation studies to confirm the need for such recommendations and identify the most commonly used metrics. Next, we conduct a recognizer evaluation based on the different approaches used in the literature to compare five Common Nighthawk (Chordeiles minor) recognizers: MonitoR (Katz et al. 2016), convolutional neural networks (CNN; Abadi et al. 2015), Song Scope (Wildlife Acoustics 2011), Kaleidoscope (Wildlife Acoustics 2016), and RavenPro (Charif et al. 2010). Finally, we use our literature review, results from our evaluation, and best practices from other disciplines to synthesize general evaluation recommendations for ecologists who want to use automated acoustic recognition for data processing.

Methods
We searched for ecological journal articles, technical reports, and conference proceedings that have evaluated the performance of automated signal recognition software to scan audio recordings for species detections. We searched the literature using Web of Science and combinations of the keywords "acoustic," "classif*," "recogn*," "autom*," and "song." We found and reviewed 68 papers that used computers to automatically scan audio recordings and identify detections of target species, including birds, frogs, and mammals (Appendix 1). We performed an initial review of these papers to determine recognizer type (single or multiple species), and evaluation data type (clip or recording; Table 1). We excluded multispecies recognizers from further review because multiclass evaluation generally employs a different set of metrics than single species evaluation (Sokolova and Lapalme 2009). We also excluded papers that did not use a test dataset of unedited field recordings (see Potamitis et al. 2014) to evaluate their recognizer. The final subset included 12 single-species recognizer papers with a real-world evaluation (Appendix 1, Table 1).
Table 1. Recognizer performance metrics used in single-species recognizer studies that assessed recognizer performance on real-field recordings. TP = true positive; FP = false positive; TN = true negative; FN = false negative; β = weighting factor used to balance the weighted average of precision and recall.

Benchmark
Eleven papers used human data processing as the benchmark for recognizer evaluation, and one was unclear about the benchmark used. Of the 11 that specified the benchmark, 8 used detections that had been annotated during human listening, 2 used events that had been annotated during visual spectrogram scanning, and 1 used events that had been annotated during listening and visual spectrogram scanning, i.e., two benchmarks. One paper also included a decibel level threshold as part of their benchmark (Katz et al. 2016).

Score threshold
Score threshold is a user-selected parameter that is the minimum score of any given hit reported by the recognizer. Of the 12 papers reviewed, 7 described the score threshold selected. Of those seven, four papers reported selecting a single score threshold after tests such as Youden's J statistic (Youden 1950, Swiston and Mennill 2009, Ganchev et al. 2015, Ulloa et al. 2016, Crump and Houlahan 2017), two reported choosing low thresholds that allowed for analysis of metrics across score values (Digby et al. 2013, Katz et al. 2016), and one reported a comparison of three score thresholds (Brauer et al. 2016). Two of those seven papers also reported receiver operating characteristic (ROC) metrics (Katz et al. 2016, Ulloa et al. 2016), which incorporate scores from 0 to 1 implicitly.
Of the other five papers that did not report score threshold, four mentioned score but did not report threshold used (Waddle et al. 2009, Bardeli et al. 2010, Potamitis et al. 2014, Jahn et al. 2017) and one did not mention score at all (Duan et al. 2013).
All papers that examined the performance of the recognizer across score values reported that performance varied strongly with score. Digby et al. (2013) found that precision varied from nearly 100% at high scores to 0% at low scores. Similarly, Katz et al. (2016) showed that recall and specificity (the true negative rate) ranged from 0 to 1 depending on the chosen score threshold. Brauer et al. (2016) compared three different score thresholds, "low" (minimized false negatives), "medium" (balanced false negatives and positives), and "high" (minimized false positives), and found that the total error of the recognizer ranged from 30% for the low threshold to 18% for the high threshold.
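As an illustration of how a single score threshold might be selected with Youden's J statistic (sensitivity + specificity - 1), the following Python sketch evaluates J over a set of candidate thresholds. The data and function names are hypothetical, and the sketch assumes every evaluated window has been labeled as target or nontarget, so that true negatives can be counted.

```python
def youden_j(scores, is_target, threshold):
    """Youden's J = sensitivity + specificity - 1 at one score threshold."""
    tp = fn = tn = fp = 0
    for score, target in zip(scores, is_target):
        hit = score >= threshold
        if target and hit:
            tp += 1
        elif target:
            fn += 1
        elif hit:
            fp += 1
        else:
            tn += 1
    # Assumes at least one target and one nontarget window are present.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def best_threshold(scores, is_target, candidates):
    """Return the candidate threshold with the largest J (first one on ties)."""
    return max(candidates, key=lambda th: youden_j(scores, is_target, th))


# Hypothetical scores and labels (illustrative only).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
is_target = [True, True, False, True, False, False]
chosen = best_threshold(scores, is_target, [0.2, 0.5, 0.7, 0.9])
```

In this toy example the threshold of 0.7 maximizes J because it removes all false positives while retaining two of the three target windows.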

Metrics
In total, 11 different metrics were used across the 12 papers reviewed (Table 1). The most frequently used metrics were recall and precision. Among the metrics used, we found a lack of standardization and clarity in the 12 papers reviewed. There was variation in the terminology used for the metrics, with synonyms for 4 of the 11 metrics, and up to 4 synonyms per metric. In particular, the term "accuracy" was used to describe precision and accuracy; however, the formula for accuracy used in the papers we reviewed differs from the formula defined in the classifier evaluation literature (Sokolova et al. 2006, Sokolova and Lapalme 2009). Furthermore, "accuracy" was undefined in one of the papers reviewed (Duan et al. 2013), so we assigned it the same mathematical formula as the other two papers that did define accuracy. Two of the papers reviewed (Bardeli et al. 2010, Brauer et al. 2016) did not cite or define the metrics used, including "total error," which is not a widely used classifier metric, so we back-calculated the mathematical formula or assigned the metric to the common name used in the paper. The remaining nine papers either provided the mathematical formula for the metrics used, explained the metric in plain language, or provided a citation for the metric formula.

Methods
We used a standardized training dataset to allow for a comparison of four commercially or freely available recognizer programs. We also included one custom recognizer program to compare the other programs to the current state-of-the-art. To make this comparison useful to ecologists with minimal bioacoustic experience, we used an "out-of-the-box" approach by relying on the advice given by the program developer for recognizer construction and allowed ourselves 8-12 hours of learning time for each program. The exception was the custom CNN recognizer, which required us to write a Python script to carry out model training and evaluation.

Species
We used the Common Nighthawk as a model species to test single-species automated acoustic recognition software because this species has simple and consistent calls that have minimal acoustic masking from other species because nighthawks vocalize primarily at dusk and before dawn (Fig. 1). Further, the Common Nighthawk vocalizes frequently, making it an ideal candidate with which to evaluate recognizer error rates in detectability and calling rate. The development of a high quality Common Nighthawk recognizer is also a conservation priority because this species is listed as Threatened under Canada's Species at Risk Act, and there are limited data for the species because of its crepuscular nature (Environment Canada 2016).

Training dataset
We built Common Nighthawk recognizers for five automated signal recognition programs using vocalizations from a standardized training dataset.

Song Scope recognizer
Song Scope is a signal detection recognizer that uses Hidden Markov models (HMMs) to maximize the probability of the arrangement of individual syllables, based on the spectral feature vectors of those syllables. We built the Song Scope recognizer iteratively, following advice available in the software manual (Wildlife Acoustics 2011). First, we extracted 100 "high-quality" calls evenly distributed across 11 locations (9-10 calls from each location). We defined "high-quality" calls as calls that were produced near the microphone, i.e., had little attenuation, and were not masked by any other acoustic signals, e.g., other birds or weather. We included approximately 0.1 seconds of silence preceding and following the vocalization. We then converted the clips to Song Scope annotations and loaded them into the Song Scope software as a single class. Common Nighthawk calls have frequencies below 8 kHz, so we set the sample rate at 20 kHz to exceed the Nyquist rate (double the highest frequency of interest in the signal) with some headroom. We set the frequency minimum, range, max syllable, max syllable gap, max song, and dynamic range at values that maximized the detection of the 100 training annotations in the logarithmic scale with signal detection view (Appendix 2 Table A2.1). All other settings were left at default values. We reviewed each of the 100 training annotations to determine how much of each annotation was detected by Song Scope and removed any annotations where the full call was not completely detected. We replaced annotations with new annotations from the same location and reviewed those for detection completeness without adjusting the settings. We repeated this process until all 100 calls were completely detected in the logarithmic scale with signal detection view, and then generated the recognizer with the Song Scope software. The resultant recognizer had a cross training value of 77.32 ± 5.87% (mean ± SD) and a total training value of 77.22 ± 4.87% (Wildlife Acoustics 2011).

Kaleidoscope recognizer
Similar to Song Scope, Kaleidoscope is a signal detection recognizer that builds a classification algorithm by running individual call syllables through HMMs that maximize the probability of detecting the entire call structure. Kaleidoscope differs from Song Scope in that it uses K-means clustering of Fisher scores from a 12-state HMM to cluster all the signals detected into different classes, as opposed to only identifying the signals that match the algorithm above a user-set score threshold. We built the Kaleidoscope recognizer using the cluster analysis function following the tutorial video available from the software manufacturer for "Converting Song Scope Recognizers to Kaleidoscope Cluster-based Classifiers" (Wildlife Acoustics 2016). We exported the annotation information from the 100 Song Scope annotations into a text file as presence training data. Because Kaleidoscope performs cluster analysis, it requires at least two classes to build a recognizer, so we created an absence training class by scanning our 200-minute absence dataset with Song Scope and exporting the highest scored 100 detections into the same text file. As per the training video, we then used the Kaleidoscope software to rescan the training dataset with the training clips to create a Kaleidoscope recognizer. We set maximum cluster distance to the maximum allowable value to simulate a minimum score threshold (Appendix 2 Table A2.2).
We adjusted the clustering parameters to create a two-cluster recognizer with a presence class and an absence class (Appendix 2 Table A2.2). We then processed the test dataset with the Kaleidoscope recognizer using similar signal detection parameters to the Song Scope recognizer (Appendix 2 Table A2.2). We validated only those detections that were classified as presence by the Kaleidoscope recognizer and used only hits from channel 1 to prevent duplicate hits.
http://www.ace-eco.org/vol12/iss2/art14/

MonitoR recognizer
We used the binary-point matching function in MonitoR instead of the cross-correlation approach because our initial tests suggested it was more effective for Common Nighthawk calls. The binary-point matching function in MonitoR is a template-based approach, where the program converts each cell of the spectrogram of a clip to a 1 or 0 using an amplitude cut-off. As a moving window recognizer, MonitoR then processes audio data by comparing this single-call template to each moving window of the data and scores how many cells the window has in common with the template. Multiple calls can be used to train MonitoR recognizers, but the program creates a template for each training call and scans the data once with each template, as opposed to other programs that aggregate the training calls and scan the data only once. We built the MonitoR recognizer following the training vignette (Hafner and Katz 2017). We used the MakeBinTemplate function to inspect each of the 100 training clips from the Song Scope training dataset, and adjusted the time limit, frequency limits, and amplitude cut-off manually for each template to ensure each call was completely detected (Appendix 2 Table A2.3).
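The template-matching idea described above can be sketched in a few lines of Python. This is a simplified illustration of the concept only, not MonitoR's actual scoring code: the function names, the toy spectrogram, and the choice to score a window as the fraction of cells that agree with the template are all assumptions made for the example.

```python
def binarize(spectrogram, cutoff_db):
    """Convert each spectrogram cell to 1 (at or above cut-off) or 0 (below).

    spectrogram is a list of frequency-bin rows, each a list of amplitudes
    (one value per time frame).
    """
    return [[1 if cell >= cutoff_db else 0 for cell in row] for row in spectrogram]


def match_scores(binary_spec, template):
    """Slide the template along the time axis and score every window.

    The score is the fraction of cells in the window that equal the
    corresponding template cell (1.0 = perfect match).
    """
    n_freq = len(template)
    width = len(template[0])
    n_frames = len(binary_spec[0])
    n_cells = n_freq * width
    scores = []
    for start in range(n_frames - width + 1):
        agree = sum(
            1
            for f in range(n_freq)
            for t in range(width)
            if binary_spec[f][start + t] == template[f][t]
        )
        scores.append(agree / n_cells)
    return scores


# Toy example: 2 frequency bins x 4 time frames, amplitudes in dB.
spectrogram = [
    [-60, -20, -60, -60],
    [-60, -60, -20, -60],
]
template = [[1, 0], [0, 1]]  # hypothetical 2 x 2 call template
window_scores = match_scores(binarize(spectrogram, cutoff_db=-30), template)
```

Here the second window matches the template exactly, so it receives the highest score; a score threshold would then decide which windows are reported as hits.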

CNN recognizer
Convolutional neural networks (CNNs) are a class of machine learning models that have been successfully applied in a range of domains including speech recognition and visual object recognition (LeCun et al. 2015). CNNs are a type of artificial neural network (ANN) that use moving window convolutional layers to extract features from their inputs, which makes CNNs particularly suited to acoustic detection as they can be applied directly to variable length raw audio, spectrogram inputs, or other representations of sound. ANNs have previously been used for automated acoustic signal recognition, but require that call parameters are first extracted from each acoustic signal before being passed to the ANNs for classification (e.g., Jennings et al. 2008), whereas CNNs can scan and classify the spectrograms directly. In general, the filters in convolutional layers are used to detect acoustic features while sliding over the spectrogram, or other visual input. To train a CNN as a moving window recognizer, we used a simple architecture that had multiple convolutional layers, but output a single convolutional feature map (detection function) in the final layer (Appendix 2 Table A2.4). During model training we presented short clips to the network, typically with a single Common Nighthawk call either present or absent. We used the maximum value of the detection function to classify presence/absence, which forced the model to learn a discriminative detection function. We used the TensorFlow framework and the Python API to define and train our CNN model (Abadi et al. 2015). As input to our model, we used log-power mel-scaled spectrograms calculated using librosa (McFee et al. 2017). We used rectified linear units (ReLUs) as the activation function in all layers of the network except the last, which used a sigmoid function. We trained the network for 100 epochs with a cross-entropy cost function, using minibatch stochastic gradient descent with batch size 64 and Adam optimization (Kingma and Ba 2014) with learning rate of 0.001. During model evaluation on continuous recordings, the full timeseries output of the detection function was used as the recognizer score. A simple threshold-based peak-picking method was then used to extract a list of discrete detections. The CNN model required fixed length inputs during training, so we created a dataset by manually extracting 100 clips of 2-s duration from across the presence dataset and the same number from the absence dataset.
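The step from a continuous detection function to discrete detections can be sketched as follows. The exact peak-picking rule is not specified in the text, so this example assumes one plausible version: a frame is a detection if its value meets the threshold and is a local maximum. The function name and the detection values are hypothetical.

```python
def pick_peaks(detection, threshold):
    """Return (frame_index, score) for local maxima at or above the threshold.

    A frame counts as a peak if it is strictly greater than the previous
    frame and at least as large as the next one, which reports a flat-topped
    peak only once. Frame indices can be converted to seconds by multiplying
    by the hop duration of the detection function.
    """
    hits = []
    for i in range(1, len(detection) - 1):
        value = detection[i]
        if value >= threshold and value > detection[i - 1] and value >= detection[i + 1]:
            hits.append((i, value))
    return hits


# Hypothetical sigmoid outputs of a detection function, one value per frame.
detection = [0.1, 0.2, 0.9, 0.3, 0.1, 0.6, 0.7, 0.2]
detections = pick_peaks(detection, threshold=0.5)
```

Frame 5 exceeds the threshold but is not reported because it lies on the rising flank of the peak at frame 6.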

RavenPro recognizer
RavenPro uses band-pass filters, a band-limited energy detector, and an amplitude detector, to perform signal detection and identify calls of the appropriate duration within the frequency range of the target species. We followed the RavenPro 1.4 manual to configure our RavenPro recognizer (Charif et al. 2010). We extracted 100 high-quality calls (defined as above) and measured target signal parameters, i.e., frequency, duration, and separation, for each Common Nighthawk vocalization. We used the default setting for most noise estimation parameters, with adjustments made to those that increased the true positive rate (Appendix 2 Table A2.5).

Test dataset
To

Automated processing
The test dataset was processed with each recognizer. We chose low score thresholds for each of the recognizers so that we could evaluate performance across a gradient of score thresholds (Appendix 2). We set the score threshold at 0 for the signal detection recognizers (Song Scope, Kaleidoscope, RavenPro) to allow for full analysis of the score threshold gradient. We then ran the moving window recognizers (MonitoR and CNN) with a similarly low threshold and selected the highest scored 6750 hits, which was the maximum number of hits detected by any of the signal detection recognizers (Song Scope). Without this hit threshold, both moving window recognizers would have produced as many hits as moving windows, i.e., hundreds of thousands (Fig. 2), because they have no signal detection process. We ran each recognizer on the same MacBook Pro (late 2013) with a 2.3 GHz Intel Core i7 and 16 GB of 1600 MHz DDR3 RAM. We timed the processing duration of the test dataset while no other software was running.

Benchmark development
We compared our recognizers to human listening and used the maximum number of true detections by any method as our benchmark because the recognizers detected the presence of Common Nighthawks in several recordings that human listeners had missed. Using a human listening benchmark would have decreased the presence-absence recall of those recognizers because the comparison would have been to a benchmark that included false negatives. To develop the human listening dataset, two human observers viewed and simultaneously listened to each 5-min recording in its entirety using sound visualization software

Statistical analysis
We referred to existing best practices in the machine learning literature and other acoustic signal detection disciplines to develop our evaluation approach (Davis and Goadrich 2006, Sokolova and Lapalme 2009, Raffel et al. 2014). We evaluated the overall performance of each of the five Common Nighthawk recognizers relative to the benchmark. We also evaluated the applied performance of each of the recognizers including presence-absence recall, occupancy modeling, and call rate correlation. All analyses were conducted in R version 3.3.1 (R Core Team 2016) with the base package, the PRROC package (Grau et al. 2015), and the ROCR package (Sing et al. 2005).
Prior to analysis, we standardized the score of each hit for each recognizer on a scale from 0 (lowest score) to 1 (highest score) to enable comparison between recognizers. We standardized the score of each hit by subtracting the minimum score for that recognizer and dividing the result by the maximum score for that recognizer minus the minimum score for that recognizer. Kaleidoscope does not directly report a score, but instead uses a clustering approach to report distance between detections, so we used the inverse of the distance to cluster center as a surrogate for score. We included score threshold in our evaluation by applying a score threshold in 0.01 increments to the dataset for each recognizer before calculating each metric.
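A minimal Python sketch of this preprocessing is given below. It assumes min-max scaling (subtract the recognizer's minimum score, then divide by its score range) so that scores fall between 0 and 1, and it builds the grid of thresholds in 0.01 increments. The function names and raw scores are hypothetical.

```python
def standardize(scores):
    """Min-max scale one recognizer's scores to the range 0 to 1."""
    lowest, highest = min(scores), max(scores)
    # Assumes the recognizer produced at least two distinct score values.
    return [(s - lowest) / (highest - lowest) for s in scores]


def hits_at_threshold(standardized_scores, threshold):
    """Keep only the hits whose standardized score meets the threshold."""
    return [s for s in standardized_scores if s >= threshold]


# Score thresholds from 0.00 to 1.00 in 0.01 increments.
thresholds = [i / 100 for i in range(101)]

# Hypothetical raw scores from one recognizer (illustrative only).
raw_scores = [20, 50, 80]
scaled = standardize(raw_scores)
hits_per_threshold = [len(hits_at_threshold(scaled, th)) for th in thresholds]
```

Each metric can then be recalculated once per threshold on the hits that remain, which gives the performance curve across the full score gradient.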
To evaluate overall performance of the recognizers, we calculated precision, recall, F-score, and area under the curve (AUC) because these metrics are suitable for one-class classifiers (recognizers trained only with examples of the target species, e.g., Song Scope, MonitoR, RavenPro) and binary classifiers (recognizers trained with examples of both the target species and nontarget species, e.g., CNN, Kaleidoscope; Sokolova et al. 2006). Precision is the proportion of recognizer hits that are true detections of the target species (Table 1). Recall is the proportion of target species vocalizations detected as hits by a recognizer (Table 1). F-score incorporates precision and recall, and allows the user to weight the relative importance of precision versus recall by setting the β value (Table 1). For AUC, we plotted precision-recall as well as ROC curves for each of the recognizers because some authors suggest precision-recall is more appropriate for recognizer performance evaluation (Davis and Goadrich 2006). We did not apply a score threshold for this evaluation because AUC incorporates score implicitly. We did not include human listening in AUC calculation because human listening detections do not have score values.
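The three threshold-dependent metrics can be computed directly from counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below uses the standard definitions from the classifier evaluation literature; the counts are invented for illustration and the function names are hypothetical.

```python
def precision(tp, fp):
    """Proportion of recognizer hits that are true detections."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Proportion of benchmark vocalizations that were detected as hits."""
    return tp / (tp + fn)


def f_score(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    """
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts at one score threshold (illustrative only).
tp, fp, fn = 80, 20, 120
p = precision(tp, fp)        # 0.8
r = recall(tp, fn)           # 0.4
f1 = f_score(p, r)           # balanced weighting
f2 = f_score(p, r, beta=2)   # favors recall
```

Because recall is lower than precision in this example, the recall-weighted score (β = 2) is lower than the balanced score (β = 1).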
We then evaluated the applied performance of the recognizers and human listening in a presence-absence study because presence-absence data are used for a variety of applications in ecological research and monitoring. To simulate a presence-absence study and to balance sampling effort across study sites, we subsampled our test recording dataset to the first recording for each of the 45 study sites. We then determined whether the recognizer or listener accurately determined the presence or absence of a Common Nighthawk for each score threshold increment of 0.01, and then modeled this presence-absence recall with a binomial logistic regression for each processing approach.
For each approach, we constructed null, first-order, second-order, and third-order polynomial models with score threshold as the covariate. We compared the four models for each approach using Akaike's Information Criterion (AIC; Burnham and Anderson 2002) and selected the model with the lowest AIC score.
We also evaluated the performance of the recognizers and human listening for occupancy modeling. Occupancy modeling is a widely used application of presence/absence data that uses repeated visits to account for imperfect detection of the target species (MacKenzie et al. 2002). ARU data are particularly well-suited for occupancy modeling because they collect multiple time-
We also evaluated the performance of each recognizer and human listening for measuring call rate. Call rate ARU data have been used for behavioral studies (Ehnes and Foote 2014), and can be used as a proxy for abundance of some species if baseline patterns in call rates or song frequency are well known, which can in turn be used for monitoring population trends (Jeliazkov et al. 2016).
We calculated the Spearman correlation coefficient between the benchmark and the call rate for each score threshold increment using the individual recording as the sampling unit.
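As an illustration of this calculation, the following sketch computes a Spearman correlation as the Pearson correlation of rank-transformed call counts, with tied values given their average rank. The call counts are hypothetical, the function names are our own, and the sketch assumes that neither series is constant.

```python
def average_ranks(values):
    """Rank values from 1 to n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical calls per recording: benchmark vs. recognizer hits at one score threshold
benchmark_calls = [0, 12, 40, 150, 300, 7]
recognizer_calls = [0, 9, 35, 160, 210, 3]
rho = spearman(benchmark_calls, recognizer_calls)
```

Repeating the calculation at each score threshold increment gives the correlation curve across the score threshold gradient.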
Finally, we compared the efficiency of each of the five automated acoustic recognition programs and human listening as the time required to learn the software, build the recognizer, scan the test audio dataset, and validate the recognizer results as true or false positives. We limited learning time to 8-12 hours to develop a functional aptitude for each of the programs using our "out-of-the-box" approach. We quantified the time spent to build each recognizer, including a standardized four hours of training dataset compilation time because we used a single compiled training dataset for all five recognizers. We quantified the time required to scan by timing the computer processing of our test dataset. We quantified the time to validate by timing the validation of each of the recognizer hits and taking the mean validation time per hit. To compare the efficiency of the five recognition programs to human listening, we calculated processing time in hours per hour of audio data for a 10 hour audio dataset and a 1000 hour audio dataset. We calculated processing time as the time required to learn and build the recognizer plus time to validate the recognizer results. We did not include scanning time in our efficiency calculation because this part of the process does not require human supervision. For time to validate, we calculated the time it would take to validate the recognizer when run with a score threshold for the peak of the precision-recall curve, i.e., the maximum value of precision + recall. We then calculated the audio dataset size at which recognizer processing becomes faster than human listening, assuming 1 hour of listening per 1 hour of audio data and 1 hour of initial learning.
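The efficiency comparison can be sketched as follows, assuming that validation time scales linearly with the amount of audio (mean validation time per hit multiplied by the number of hits per hour of audio). All input values are hypothetical rather than taken from Table 2, and the function names are our own.

```python
def hours_per_audio_hour(learn_h, build_h, validate_h_per_audio_h, audio_h):
    """Human processing time per hour of audio data (scanning time excluded)."""
    return (learn_h + build_h) / audio_h + validate_h_per_audio_h

def break_even_audio_hours(learn_h, build_h, validate_h_per_audio_h,
                           human_learn_h=1.0, human_h_per_audio_h=1.0):
    """Dataset size (hours of audio) above which the recognizer is faster than listening."""
    saved_per_hour = human_h_per_audio_h - validate_h_per_audio_h
    if saved_per_hour <= 0:
        return float("inf")  # validation alone is at least as slow as listening
    extra_fixed = (learn_h + build_h) - human_learn_h
    return max(extra_fixed / saved_per_hour, 0.0)

# Hypothetical recognizer: 10 h to learn, 6 h to build, 0.2 h of validation per audio hour
small = hours_per_audio_hour(10, 6, 0.2, audio_h=10)
large = hours_per_audio_hour(10, 6, 0.2, audio_h=1000)
threshold_h = break_even_audio_hours(10, 6, 0.2)
```

With these hypothetical inputs, the fixed cost of learning and building dominates for the 10 hour dataset, whereas validation time dominates for the 1000 hour dataset.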

Results
A total of 5556 Common Nighthawk calls were detected across the 117 five-minute recordings (mean = 152 per recording, SD = 196), and these detections were used as the benchmark for recognizer evaluation. Common Nighthawks were detected in 85 of the 117 recordings, and at 38 of 45 sites from northwestern Ontario, Canada.

Precision, recall, and F-score
As expected, recall and F-score decreased and precision increased with increasing score threshold for all recognizers (Fig. 3). Score threshold had a minimal impact on precision and recall of the RavenPro recognizer, with impacts seen only at score thresholds above 0.7.

Area under the curve
The CNN recognizer had the highest precision-recall curve AUC (0.94), followed by MonitoR (0.88), Song Scope (0.87), RavenPro (0.82), and Kaleidoscope (0.77; Fig. 4). The ranking of the top two recognizers by ROC curve AUC differed from the ranking by precision-recall curve AUC; the Song Scope recognizer had an ROC AUC of 0.90, while the CNN had an ROC AUC of 0.88. The ranking of the other recognizers was the same between the two AUC measures; however, the ROC AUC of the Kaleidoscope recognizer (0.53) was much lower than its precision-recall AUC (0.77).

Presence-absence
At low score thresholds, the CNN, Song Scope, and MonitoR recognizers determined Common Nighthawk presence-absence with recall similar to that of a human listener (95.4%; Fig. 5). At high score thresholds, only the CNN and RavenPro recognizers detected Common Nighthawk presence-absence with greater than 50% recall. As with precision and recall, score threshold had little impact on the presence-absence recall of the RavenPro recognizer.
The CNN recognizer had the highest presence-absence recall of the five programs across the score threshold gradient. The CNN (w_i = 0.95), Kaleidoscope (w_i = 0.92), and Song Scope (w_i = 0.97) recognizers were modeled as third-order polynomials, and the MonitoR recognizer (w_i = 0.69) was modeled as a second-order polynomial (Appendix 3 Table A3.1). The model with the lowest AIC score for the RavenPro recognizer was the null model (w_i = 0.43), suggesting that score threshold had no effect on presence-absence recall.

Occupancy
Naive occupancy of the 110 visits, i.e., recordings, included in occupancy modeling was 0.89 (34 of 38 sites). The occupancy estimate from human listening was 0.87 (SE = 0.06; Fig. 6). In general, the occupancy estimates from recognizer data were lower than the estimate from human listening, although the occupancy estimate from the CNN recognizer (0.80) was not significantly so.
The occupancy estimates from the Kaleidoscope, MonitoR, and Song Scope recognizers decreased with increasing score threshold as detection also decreased, and at high score thresholds, the estimates became unstable, varying between 0 and 1. The occupancy estimates from the CNN and the RavenPro recognizers were more stable across score thresholds, although the RavenPro estimate was much lower (0.60).

Call rate
At low score thresholds, the CNN and MonitoR call rate correlation was similar to human listening (0.96 and 0.91, respectively); however, call rate correlation of the MonitoR recognizer decreased rapidly and linearly to near 0 with increasing score threshold, while the CNN recognizer call rate correlation decreased slowly before dropping steeply at a score threshold of 0.9 (Fig. 7). The Song Scope recognizer call rate correlation was between 0.7 and 0.8 at moderate score thresholds. Call rate correlation for the RavenPro recognizer varied minimally across score thresholds (max = 0.56, min = 0.48). The Kaleidoscope call rate correlation was 0.7 and decreased steadily but irregularly after a score threshold of approximately 0.3.

Efficiency
All five of the automated signal recognition programs became faster than human listening for datasets larger than 36 hours of audio (Table 2). The CNN recognizer had the largest initial time investment, and thus had the highest processing time per hour of audio data for a small dataset (10 hours audio). For a large audio dataset (1000 hours audio), the differences between the recognizers were due primarily to differences in the number of hits at maximum precision-recall. The Song Scope recognizer was the most efficient, while the Kaleidoscope recognizer was the slowest. Although not included in the processing time calculations, scanning time should also be included in efficiency considerations. The CNN and Kaleidoscope recognizers were the fastest to scan our test dataset, while the MonitoR recognizer was two orders of magnitude slower because this program scanned the audio dataset separately through each of the 100 templates.

EVALUATION RECOMMENDATIONS
Based on our analysis, we suggest that ecologists who use automated acoustic recognition for processing acoustic recordings follow six general recommendations. These suggestions are drawn largely from best practices in machine learning and other acoustic signal processing disciplines (Salzberg 1997, Sokolova et al. 2006, Sokolova and Lapalme 2009, Raffel et al. 2014), as well as our literature review of evaluation methods in ecology and lessons learned during our Common Nighthawk recognizer evaluation. We also suggest that ecologists familiarize themselves with general machine learning practices because there is great potential for interdisciplinary research, but a known lack of communication between the two disciplines (Thessen 2016).

Recommendation 1: Benchmark
Recognizer evaluation should employ a test dataset that differs from the training dataset to avoid "overly optimistic" results (Salzberg 1997). Within the test dataset, it is important to establish a benchmark of known target species detections to evaluate recognizer performance. We recommend human listening as a comparison benchmark; however, we remind readers that human listening is also subject to error (Bart and Schoultz 1984, McClintock et al. 2010, Brauer et al. 2016). If any false negatives in human detections are discovered during the process of reviewing recognizer detections, we recommend instead using the maximum number of target species detections found by any method, i.e., human processing or a recognizer, as the benchmark.
In our performance evaluation, there were 146 Common Nighthawk calls (2.63% of total) detected by a recognizer that were missed by human listeners. Brauer et al. (2016) also reported a 2% error rate in human identification of anuran calls, while Rydell et al. (2017) found error rates ranging from 9-22% for bat species identified by human listeners. If the target species vocalizations are susceptible to false positive identification by human observers, we recommend using a dependent double observer method when developing the benchmark to reduce the probability of misidentification (Forcey et al. 2006). Acoustic signals at farther distances (Skowronski and Brock Fenton 2009), lower sound pressure (Jahn et al. 2017), or with low signal-to-noise ratios, i.e., high levels of background noise, will be difficult to detect for both humans and recognizers, and therefore should not be excluded when preparing a benchmark (Skowronski and Harris 2006). Human listening can also be subject to observer bias (Sauer et al. 1994). Jennings et al. (2008) found that human observers with less than a single year of experience performed worse at classification than recognizers. Human annotation error can also be reduced by using the consensus from multiple observers as the benchmark dataset (e.g., Drake et al. 2016).

Recommendation 2: Score threshold
We strongly recommend that the influence of score be included in recognizer evaluation because our review showed it has a fundamental impact on recognizer performance, no matter what metric was used. Following Katz et al. (2016), we further recommend the use of score threshold instead of the reported raw scores of each detection in recognizer evaluation so that ecologists can use their evaluation results to select an optimal score threshold for data processing. We found in both our own recognizer evaluation and in our review of the literature that performance varied widely with score threshold. Furthermore, not all papers that used recognizers reported how they selected their score threshold despite the importance of this decision. Factors such as project objective, recording quality, call complexity, and signal clarity influence the choice of score threshold and the subsequent performance metrics. In our evaluation, the exception was the RavenPro recognizer, whose performance was largely unaffected by score threshold, perhaps because RavenPro is a band-limited energy detector that identifies signals based only on a frequency range specification. It is possible that score threshold may be particularly important for programs with more complex classification approaches. Inclusion of a gradient of score thresholds in evaluation will facilitate selection of an appropriate score threshold for further analysis, which can be chosen based on the objectives of the project (Katz et al. 2016). We also found that some papers did not report score threshold, and we argue that it is crucial that score thresholds are explicitly reported within papers that use automated signal recognition.

Recommendation 3: Metrics
We suggest ecologists use metrics that are considered best practice in other signal processing disciplines (Sokolova and Lapalme 2009). Specifically, we suggest that four metrics always be reported for single species recognizer evaluation: (1) precision, (2) recall, (3) F-score, and (4) area under the curve (AUC). These metrics are regularly reported during classifier evaluation in other disciplines and will also allow ecologists to compare evaluation results with state-of-the-art studies in machine learning and elsewhere. Ecologists can also calculate these statistics across multiple datasets or partitioned datasets so that variance in metrics can be evaluated (Salzberg 1997) and statistical tests to compare recognizer performance can be applied (Dietterich 1998, Demšar 2006).

Precision and Recall
Precision is the proportion of recognizer hits that are true detections of the target species and is calculated as

$$\text{precision} = \frac{tp}{tp + fp} \qquad (1)$$

where tp is the number of true positives (detections of target species) and fp is the number of false positives (recognizer hits that were mislabelled as the target species).
Recall is the proportion of target species vocalizations detected as hits by a recognizer and is calculated as

$$\text{recall} = \frac{tp}{tp + fn} \qquad (2)$$

where fn is the number of false negatives (detections of the target species in the benchmark dataset that the recognizer missed). Precision and recall were the most commonly used metrics in our literature review and in the classification literature (Raghavan et al. 1989, Provost et al. 1998, Davis et al. 2006). Precision and recall are appropriate for signal recognition evaluation because unlike some metrics, they do not require quantification of true negatives, i.e., other species, which are not reported in single-class recognizers such as Song Scope and MonitoR. In contrast, accuracy focuses on true and false negatives and assumes that false negative and positive errors are equally likely and consequential, which is often a poor assumption in signal recognition (Provost et al. 1998). Precision and recall are also particularly appropriate when the target species is rare because a recognizer can have a high accuracy by simply predicting the target species is always absent, and the accuracy of a recognizer can be inflated by adding more negative examples to the dataset. Using precision and recall allows for direct comparison of recognizer performance with other published studies. Across the studies we reviewed, the mean recall was 0.60 and the mean precision was 0.71 (Swiston and Mennill 2009, Bardeli et al. 2010, Digby et al. 2013, Duan et al. 2013, Potamitis et al. 2014, Ganchev et al. 2015, Jahn et al. 2017). With the exception of the Kaleidoscope recognizer and the Song Scope recognizer at low score thresholds, the precision of our Common Nighthawk recognizers was above 0.71. The recall of our MonitoR and CNN recognizers reached 0.60 at low score thresholds, but the other recognizers did not.

F-score
F-score combines precision and recall into a single metric and is calculated as

$$F_\beta = \frac{(\beta^2 + 1) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{recall} + \text{precision}} \qquad (3)$$

where β is a user-defined weight that allows for prioritization of precision over recall, or vice versa. Precision and recall are evenly balanced when β = 1, precision is favored when β > 1, and recall is favored when β < 1 (Sokolova et al. 2006). Many software implementations instead apply β² to precision in the denominator, which reverses the direction of the weighting, so the convention in use should be checked. We recommend that if ecologists choose to use a value for β other than 1, they also report F-score with β = 1 to allow for comparison across studies. Situations where ecologists might consider using β < 1 include detection of rare species or situations with legal implications.
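A minimal sketch of these three metrics follows. It assumes a form of the F-score in which β² multiplies recall in the denominator, chosen so that β > 1 favors precision as described in the text; this form is our assumption, and the counts and function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts of true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def f_score(precision, recall, beta=1.0):
    """F-score with beta**2 applied to recall in the denominator (assumed form),
    so that beta > 1 favors precision and beta < 1 favors recall."""
    denom = precision + beta ** 2 * recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

# Hypothetical counts at one score threshold: 60 true hits, 20 false hits, 40 missed calls
p, r = precision_recall(tp=60, fp=20, fn=40)
f1 = f_score(p, r)                            # balanced
f_precision = f_score(p, r, beta=2.0)         # pulled toward precision
f_recall = f_score(p, r, beta=0.5)            # pulled toward recall
```

Because precision (0.75) exceeds recall (0.60) in this example, the F-score rises when precision is favored and falls when recall is favored.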

Area under the curve (AUC)
Following other acoustic signal processing disciplines, we recommend reporting the AUC of a precision-recall curve as a univariate method for comparing recognizers. Receiver operating characteristic (ROC) curve AUC is more commonly used in the classifier evaluation literature; however, precision-recall curves are more appropriate for cases with class imbalance such as recognizer evaluation (Davis and Goadrich 2006). In other words, a large quantity of false positives, as is the case for many recognizers at low score thresholds, is more accurately reflected in the AUC of a precision-recall curve than an ROC curve, and our comparison of the two approaches supports this. We therefore recommend a precision-recall AUC; however, ecologists may also want to calculate an ROC AUC for comparison with other published studies.
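The following sketch shows one way a precision-recall AUC could be computed from validated recognizer hits by sweeping the score threshold from high to low. It assumes each hit has been labelled as a true or false positive against the benchmark, uses simple trapezoidal integration (an approximation; other interpolation schemes exist), and extends the first precision value back to a recall of 0. The hits and function names are hypothetical.

```python
def precision_recall_curve(hits, n_benchmark):
    """hits: list of (score, is_true_positive); n_benchmark: target calls in the benchmark.
    Returns (recall, precision) points in order of decreasing score threshold."""
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Add every hit tied at this score before recording a curve point
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / n_benchmark, tp / (tp + fp)))
    return points

def pr_auc(points):
    """Trapezoidal area under the precision-recall curve."""
    area = 0.0
    prev_r, prev_p = 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

# Hypothetical validated hits; the benchmark holds 4 calls, one of which is never detected
hits = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False)]
curve = precision_recall_curve(hits, n_benchmark=4)
auc = pr_auc(curve)
```

In this sketch the curve stops at the maximum recall the recognizer achieves, so benchmark calls that are never detected reduce the area.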

Recommendation 4: Application evaluation
Although overall recognizer evaluation is important, the influence of the metrics chosen can depend on the intended application for the data (Stowell et al. 2016). We therefore also recommend evaluation be done for the intended application of the resultant species detection data. Recognizer evaluation for occupancy modeling purposes is particularly important because our results suggest that occupancy modeling becomes unreliable for recognizer data with low recall, where species detection probability is too low for reliable occupancy estimates (MacKenzie et al. 2002). We also found that the shape of the curve across the score threshold gradient for all three response variables we examined (presence-absence recall, occupancy estimate, and call rate correlation) was similar to the shape of the recall curve. Future work should investigate whether the shape of the score-recall curve is an adequate proxy for all response variables, or whether it varies depending on the detectability, call rate, and occupancy of the target species.

Recommendation 5: Regional generalizability
Geographic variation in acoustic signal is demonstrated in many bird species (Slabbekoorn and Smith 2002) and other animals that produce sound (Pröhl et al. 2006, Campbell et al. 2010, Sun et al. 2013), which is important to consider during recognizer evaluation (Gillespie et al. 2013, Russo and Voigt 2016). For simplicity, we evaluated the regional generalizability of our Common Nighthawk recognizer with a test dataset from a different region than the training data; however, in best practice, ecologists should test recognizers across multiple geographic regions. Evaluating with multiple test datasets will help ecologists determine whether a single recognizer is effective or whether regionally specific recognizers are required for their target species. For example, marine mammal classifiers have been shown to be 14.4% less accurate when tested with data from a different region than the training data (Erbs et al. 2017). For ecologists that plan to use recognizers for a single region, training and test data should be sourced from the region of interest.

Recommendation 6: Efficiency evaluation
For many ecologists, the purpose of employing an automated signal recognition approach is to increase the efficiency of audio data processing; therefore, we recommend collecting data on time spent to build and run a recognizer and validate the output. The time per hour of audio data can then be compared to other data processing approaches, including human listening. For our recognizers, we found that human listening became less efficient with datasets larger than 36 hours of audio; however, we note that using a visual scanning approach, i.e., viewing the spectrogram, instead of listening may have improved the efficiency of our human processing approach. If the automated recognizer used performs poorly, however, the manual postprocessing time required may outweigh the advantages of automation because of the time required to validate the results (Stowell et al. 2016). Digby et al. (2013) found that automated recognition (2 minutes per hour of recording) could be at least as efficient as manual scanning (2-5 minutes per hour of recording). Joshi et al. (2017) found that manual scanning was more time-efficient than automated signal recognition for four species of forest birds, but noted that the efficiency of a recognizer will depend on the species' vocalization characteristics, call rate, and the quality of the recognizer. Indeed, human listening may be more efficient than single-species recognizers if multiple species data are needed from audio recordings; however, there are also many multispecies recognizer approaches currently under development (Stowell et al. 2016, Goëau et al. 2017). Ultimately, relative efficiency will depend on a variety of factors including score threshold, with more time required to validate recognizer output if a low score threshold is chosen to prioritize recall over precision.

DISCUSSION
Autonomous recording units (ARUs) are important tools for ecological monitoring and research because they are portable, collect data over extended periods, can be used in remote locations, are not restricted to a particular season, and the data they collect can be archived as a permanent record (Shonfield and Bayne 2017). The use of automated signal recognition for processing ARU data is growing because it can reduce the time required to process the large amounts of data; however, best practices are needed (Blumstein et al. 2011). In particular, recognizer performance evaluation is a critical step for projects that employ automated signal recognition. All recognizers misclassify detections to some extent, which can have implications for study results and may lead to poor management decisions if the results are not validated (Russo and Voigt 2016, Rydell et al. 2017). In our review of the bioacoustics literature, we found little similarity in recognizer performance evaluation between studies. Some studies reported minimal performance evaluation results, which renders the ecological results of these studies difficult to interpret. In papers that did report performance evaluation, we found an inconsistency in the evaluation terminology used and a lack of reference to the classification literature (Salzberg 1997, Davis and Goadrich 2006, Sokolova and Lapalme 2009). Given the increasing use of recognizers by ecologists, these deficiencies suggest a need for guidance on performance evaluation. We used best practices from other acoustic signal processing disciplines and our own evaluation of automated signal recognition software to provide recommendations for recognizer evaluation.
Using the Common Nighthawk as a model species, we found that a convolutional neural network (CNN) recognizer outperformed the other recognizers across all evaluations. The Song Scope and MonitoR recognizers had similar precision and recall rates to the CNN recognizer at some score thresholds. Currently, the construction of CNN recognizers requires programming expertise, but an increasing number of authors have reported success with this method for automated signal recognition (Koops et al. 2014, Salamon and Bello 2017, Salamon et al. 2017). Using our "out-of-the-box" approach, we found MonitoR and Song Scope had similar learning curves, assuming the operator is already familiar with the R programming language. At the time of writing, however, Song Scope was no longer under development or supported by the manufacturer. As the simplest automated signal recognition program, RavenPro was the easiest to learn, but the simplicity of its band-width delimitation classification approach limited its performance.
2016). Although ARUs can be as effective as human surveyors at detecting occurrences (Holmes et al. 2014, Kalan et al. 2015), the greater number of false negatives from an automated analysis (Brauer et al. 2016) reduces the apparent occupancy estimate for an organism at a location (MacKenzie et al. 2002). It has been suggested that the difference in recall between automated signal recognition and human listening is caused by a smaller detection radius of the recognizer relative to the human listener (Jahn et al. 2017; Knight and Bayne, unpublished data), which could be due to both the signal detection and classification components of the recognizer and would explain our reduced occupancy estimates. This may not be an error per se but may instead reflect the fact that more standardization is needed when using ARUs to determine the effective area being sampled (Yip et al. 2017). We also found that occupancy estimates became unstable at high score thresholds with low recall, and therefore caution against the use of occupancy models built on data from low-recall recognizers because low recall contributes to low detectability, which biases occupancy estimates (MacKenzie et al. 2002). Future research should investigate the sensitivity of occupancy modeling to this new data type.
Although automated signal recognition is effective for Common Nighthawks, there is little consensus to date on the overall effectiveness of the existing technology for avian ecological research and monitoring. Future application of our recommendations would be most useful for taxa with more complex acoustic signals, different calling rates, and in environments with varying levels of ambient noise. Thorough performance evaluation in recognizer studies following our general recommendations will contribute to building a body of literature for future meta-analysis on the overall effectiveness of automated signal recognition for wildlife monitoring and research.
Responses to this article can be read online at: http://www.ace-eco.org/issues/responses.php/1114

Acknowledgments: Our sincere thanks to the Subject Editor and three reviewers for their insightful comments, which greatly improved the clarity and accuracy of the manuscript.

Fig. 2. Distribution of true positive and false positive recognizer hits relative to score for Common Nighthawk (Chordeiles minor) recognizers in five different programs. The top row programs are signal detection recognizers and the bottom row programs are moving window recognizers. Recognizer scores are the raw scores reported by the programs and are unstandardized. Kaleidoscope score is the inverse of the distance metric.

Fig. 3. Precision, recall, and F-score of Common Nighthawk (Chordeiles minor) call detection for automated acoustic recognition programs at varying score thresholds. Precision, recall, and F-score of human listening are provided for comparison. Precision is the proportion of recognizer hits that are true detections of the target species. Recall is the proportion of target species vocalizations detected by the recognizer. F-score combines precision and recall into a single evaluation metric.

Fig. 4. Precision-recall curve (left) and receiver operating characteristic (ROC; right) curve of Common Nighthawk (Chordeiles minor) call detection for automated acoustic recognition programs. AUC is area under the curve for each program.

Fig. 5. Recall of five automated acoustic recognition programs for detecting Common Nighthawk (Chordeiles minor) presence per recording at varying score thresholds. Recall of human listening is provided for comparison. Shaded areas indicate 95% confidence intervals.

Fig. 7. Spearman correlation of Common Nighthawk (Chordeiles minor) call rate between automated acoustic recognition programs across varying score thresholds. Correlation of call rate from human listening is provided for comparison.

The standardized training dataset consisted of 400 minutes of audio data processed by human listeners: 200 minutes of audio data with Common Nighthawk detections and 200 minutes of audio data with no Common Nighthawks. The data were collected from 11 locations in south central British Columbia, Canada during the breeding season from 12 June to 14 July 2014 and 2015 at dawn or dusk. The absence data were collected from the same locations, but during times of year and day when Common Nighthawks are not active.

Table 2. Time in hours spent to learn each of the automated acoustic recognition programs, build a recognizer, scan audio recordings with the recognizer, and validate the recognizer output. Total times and dataset size were calculated using the number of hits produced by each recognizer when the score threshold is set to maximize accuracy.
We gratefully acknowledge funding provided for the bioacoustic component of this work from the Natural Sciences and Engineering Research Council of Canada, the Alberta Biodiversity Monitoring Institute, the Ecological Monitoring Committee for the Lower Athabasca, and the Joint Oil Sands Monitoring Program. Funding for Common Nighthawk research was provided by the Natural Sciences and Engineering Research Council of Canada, Environment and Climate Change Canada, Mitacs, the Baillie Fund, Science Horizons, the Public Conservation Assistance Fund, and the TD Friends of the Environment Foundation.