Human evaluation is a critical component in the development of machine translation systems and has received much attention in text translation research. However, there is little prior work on human evaluation for speech translation, which introduces additional challenges such as noisy data and segmentation discrepancies. We take the first steps to fill this gap by conducting a comprehensive human evaluation of the results of several shared tasks from the latest International Workshop on Spoken Language Translation (IWSLT 2023). We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context. Our analysis reveals that: 1) the proposed evaluation strategy is robust and correlates well with other types of human judgments; 2) automatic metrics are often, but not always, well correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF, despite the segmentation noise introduced by the resegmentation step. We publish the collected human-annotated data to encourage further research.