With the increasing adoption of face recognition systems, it is important to ensure adequate performance of these technologies across demographic groups, such as race, age, and gender. Recently, phenotypes, such as skin tone, have been proposed as superior alternatives to traditional race categories when exploring performance differentials. However, there is little consensus regarding how to appropriately measure skin tone in evaluations of biometric performance or in AI more broadly. Biometric researchers have estimated skin tone most notably through face area lightness measures (FALMs), computed using automated color analysis, or through Fitzpatrick Skin Types (FST). These estimates have generally been based on the same images used to assess biometric performance, which are often collected using unknown and varied devices, at unknown and varied times, and under unknown and varied environmental conditions. In this study, we explore the relationship between FALMs estimated from images and ground-truth skin readings collected using a colorimeter device specifically designed to measure human skin. FALMs estimated from different images of the same individual varied significantly relative to ground-truth FALMs. This variation was reduced only by greater control of acquisition (camera, background, and environmental conditions). Next, we compare ground-truth FALMs to FST categories obtained using the standard, in-person, medical survey. We found that there was relatively little change in ground-truth FALMs across different FST categories and that FST correlated more with self-reported race than with ground-truth FALMs. These findings show that FST is poorly predictive of skin tone and should not be used as such in evaluations of computer vision applications. Finally, using modeling, we show that when face recognition performance is driven by FALMs and independent of race, noisy FALM estimates can lead to erroneous selection of race as a key correlate of biometric performance.
These results demonstrate that measures of skin type for biometric performance evaluations must come from objective, characterized, and controlled sources. Further, despite this being a currently practiced approach, estimating FST categories and FALMs from uncontrolled imagery does not provide an appropriate measure of skin tone.
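The modeling result summarized above, that noisy FALM estimates can make race appear to be a correlate of performance even when it is not, can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the model used in this study: the two groups, the lightness distributions, the score model, and every numeric constant are hypothetical values chosen only for illustration. Scores depend on true lightness alone, and a linear regression of score on measured FALM and race is fit with and without measurement noise.

```python
import random


def fit_two_predictors(y, x1, x2):
    """Ordinary least squares slopes for y ~ intercept + b1*x1 + b2*x2."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    # Centered sums of squares and cross-products (normal equations).
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (v - my) for a, v in zip(x1, y))
    s2y = sum((b - m2) * (v - my) for b, v in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def simulate(falm_noise_sd, n=20000, seed=0):
    """Return (FALM coefficient, race coefficient) for one simulated study.

    All constants below are hypothetical and chosen for illustration only.
    """
    rng = random.Random(seed)
    scores, measured_falm, race = [], [], []
    for _ in range(n):
        group = 1 if rng.random() < 0.5 else 0
        # True lightness differs between groups on average (assumed values).
        true_falm = 40.0 + 20.0 * group + rng.gauss(0.0, 8.0)
        # Score depends on true lightness only; race has no direct effect.
        score = 0.5 * true_falm + rng.gauss(0.0, 5.0)
        # Image-based FALM estimate = true lightness + acquisition noise.
        measured = true_falm + rng.gauss(0.0, falm_noise_sd)
        scores.append(score)
        measured_falm.append(measured)
        race.append(float(group))
    return fit_two_predictors(scores, measured_falm, race)


if __name__ == "__main__":
    for noise_sd in (0.0, 12.0):
        b_falm, b_race = simulate(noise_sd)
        print(f"noise sd={noise_sd:5.1f}  FALM coef={b_falm:.3f}  race coef={b_race:.3f}")
```

In this sketch, the race coefficient stays near zero when FALM is measured without noise and grows large once noise is added, because the noisy FALM no longer accounts for the lightness difference between groups and the regression attributes the remainder to race.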