Development and performance of a sleep estimation algorithm using a single accelerometer placed on the thigh: an evaluation against polysomnography

Summary
Accelerometers placed on the thigh provide accurate measures of daily physical activity types, postures and sedentary behaviours, over 24 h and across consecutive days. However, the ability to estimate sleep duration or quality from thigh-worn accelerometers is uncertain and has not been evaluated in comparison with polysomnography, the 'gold-standard' measurement of sleep. This study aimed to develop an algorithm for sleep estimation using the raw data from a thigh-worn accelerometer and to evaluate it in comparison with polysomnography. The algorithm was developed and optimised on a dataset consisting of 23 single-night polysomnography recordings, collected in a laboratory, from 15 asymptomatic adults. This optimised algorithm was then applied to a separate evaluation dataset, in which 71 adult males (mean [SD] age 57 [11] years, height 181 [6] cm, weight 82 [13] kg) wore ambulatory polysomnography equipment and a thigh-worn accelerometer simultaneously whilst sleeping at home. Compared with polysomnography, the algorithm had a sensitivity of 0.84 and a specificity of 0.55 when estimating sleep periods. Sleep intervals were underestimated by 21 min (limits of agreement range [LoAR] 130 min). Total sleep time was underestimated by 32 min (LoAR 233 min). Our results describe the performance of a new algorithm for estimating sleep and outline its limitations. Based on these results, we conclude that a single device can provide estimates of the sleep interval and total sleep time with sufficient accuracy for the measurement of daily physical activity, sedentary behaviour, and sleep, on a group level in free-living settings.


INTRODUCTION
comes from self-reported data (i.e., data reported by the participants themselves; Bull et al., 2020). Self-report data on physical activity, sedentary behaviour, and sleep are influenced by bias, which can lead to inaccurate estimates of duration (Cespedes et al., 2016; Ekblom et al., 2015; Shephard, 2003; Troiano et al., 2020).
Body-worn accelerometers are emerging as a more objective alternative to self-reported data. In particular, the use of a single accelerometer placed on the thigh is increasing, as demonstrated by its adoption in large cohort studies and consortiums globally (Stevens et al., 2020). A single thigh-worn accelerometer can provide accurate estimates of a wide range of daily physical behaviours, from sitting, standing, and walking (Skotte et al., 2014; Stemland et al., 2015; Stevens et al., 2020) to stair climbing, running, and cycling (Migueles et al., 2017; Skotte et al., 2014; Stevens et al., 2020), and lying down (Hettiarachchi et al., 2021; Lyden et al., 2016). The use of more objective measures of a wide range of activities is increasing as recognition of the inherent interdependence between daily physical behaviours across 24 h grows (Troiano et al., 2020). Interdependence demands that we measure a range of physical behaviours across 24 h, including physical activity, sedentary behaviour, and sleep, so that we can understand how they interact and relate to health. There are established algorithms for deriving physical activity and sedentary behaviour from thigh-worn accelerometers, which have been evaluated against the 'gold standard' (Skotte et al., 2014; Stemland et al., 2015; Stevens et al., 2020; Migueles et al., 2017; Hettiarachchi et al., 2021; Lyden et al., 2016). However, we lack algorithms for deriving sleep duration, or other dimensions of sleep that could indicate sleep quality, from thigh-worn accelerometers that have also been evaluated against the gold standard of polysomnography (PSG). To the best of our knowledge, there are just two published non-proprietary algorithms for deriving 'waking time' and a single proprietary algorithm that estimates 'bedtime' using the data from thigh-worn accelerometers (Carlson et al., 2021; van der Berg et al., 2016; Winkler et al., 2016).
None of these algorithms have been evaluated against the gold standard (Carlson et al., 2021; Inan-Eroglu et al., 2021; van der Berg et al., 2016; Winkler et al., 2016). Similarly, although numerous studies have evaluated the use of hip- and wrist-worn accelerometers to measure several dimensions of sleep (Conley et al., 2019), we are not aware of any evaluations that have used thigh-worn accelerometers to measure anything other than 'bedtime' or 'waking time'.
Our primary aim was to develop a non-proprietary algorithm for the estimation of sleep duration, derived using the raw data from a tri-axial thigh-worn accelerometer, and to evaluate this algorithm against PSG. Our secondary aim was to evaluate the performance of the algorithm when used to estimate other sleep quality variables such as: sleep latency, wake after sleep onset (WASO), sleep efficiency, and awakening index.

SUBJECTS AND METHODS
The algorithm was developed using a two-step process. In step one, the algorithm was optimised to maximise the sensitivity and specificity of sleep estimation in comparison with PSG recordings collected in a sleep laboratory (see below and Table 1). We refer to this dataset as the optimisation dataset. In step two, the optimised algorithm was tested on a new dataset, consisting of ambulatory PSG recordings collected from a new sample of participants, at their homes (see below and Table 1). We refer to this dataset as the evaluation dataset.
Step two was performed because the performance of the algorithm may differ from laboratory conditions to ambulatory/free-living conditions. As our algorithm is intended for use in free-living settings, the results section will focus on the performance of the algorithm when applied to the evaluation dataset.
Informed consent was provided by all participants prior to inclusion, in accordance with the Helsinki Declaration. For the optimisation dataset, data collection was approved by the Regional Ethics Review

Optimisation dataset
A total of 23 overnight PSG recordings were obtained from 15 asymptomatic adults between the ages of 22 and 38 years (Table 1). Further exclusions included participants who reported heavy snoring, a current physical or mental health problem, heavy consumption of alcohol or drugs, shift work, travel across more than one time zone in the last 3 weeks, pregnancy, fever, or allergy to adhesive bandages. All participants in the optimisation dataset underwent a preliminary overnight recording to allow for habituation to the sleep laboratory and study protocol. This recording was also used to screen for sleep disorders and restless legs syndrome. Participants were monitored for at least 1 additional night in a controlled laboratory environment. A further eight participants completed 2 non-consecutive nights of PSG recordings. To increase the data available for algorithm optimisation, all recordings were included. For the full PSG assessment (TEMEC Technologies, VitaPort 3, Heerlen, the Netherlands), PSG electrodes were placed according to the American Academy of Sleep Medicine (AASM) guidelines (Berry et al., 2015) and included two electroencephalography (EEG) leads (C3-A2 and C4-A1), two electro-oculogram (EOG) leads, and a submental electromyogram (EMG). In addition, a single accelerometer (Axivity AX3, Axivity Ltd, Newcastle Upon Tyne, UK) was attached with adhesive tape to the participant's thigh, midway between the patellar ligament and the anterior superior iliac spine. Acceleration data were sampled at a frequency of 100 Hz, with a range of ±8 g. To ensure synchronisation between devices, a 'synchronisation event' was included at the start and end of each registration. In short, when participants were equipped with all devices and recording was active, they were asked to step out of bed and stand still in an upright position for 15 s.
Then, they were asked to bite their teeth together three times, perform a single hop, and bite their teeth together a further three times, before again standing still for 10-15 s. This procedure provided a clearly identifiable event in the signals of the various devices, providing a reference point for time synchronisation.

Evaluation dataset
The evaluation dataset consisted of a single night of ambulatory PSG registration from 71 males recruited from the 'Men in Uppsala; a Study of sleep, Apnea and Cardiometabolic Health' (MUSTACHE) study (Table 1). The MUSTACHE study is a population study initiated in 2016 and aimed at reaching 400 male participants within the age range of 35-65 years. The 71 males in the present study were recruited as a convenience sample from the last round of recruitment in the MUSTACHE study. Participants who were not expected to manage the ambulatory recordings due to self-reported severe somatic or psychiatric disease were excluded. All participants wore ambulatory PSG equipment (Embla Flaga Inc., Reykjavik, Iceland). The PSG recording included EEG (C3-A2, C4-A1), EOG and submental EMG. Additional sensors used were an electrocardiogram (V5), airflow with a three-port oronasal thermistor and a nasal flow pressure sensor, respiratory effort from piezo-electric belts (Resp-Ez, EPM Systems, Midlothian, VA, USA), bilateral anterior tibialis muscles, finger pulse oximetry (Embla A10 flex sensor), a piezo vibration sensor for snoring, and a body position sensor. In addition, a single accelerometer (Axivity AX3, Axivity Ltd) was attached with adhesive tape to the participant's thigh, midway between the patellar ligament and the anterior superior iliac spine. Acceleration data were sampled at a frequency of 25 Hz, with a range of ±8 g. Participants were required to attend an afternoon appointment on the first day, where trained personnel ensured correct sensor placement. Thereafter, participants returned home and wore the sensors continuously until the following morning. PSG recordings began once the participants went to bed and stopped when they awoke the following morning. Each accelerometer and PSG recording was synchronised and then visually inspected to ensure synchronisation was correct. Data collection was carried out between July 2018 and May 2019.

The sleep algorithm
A simple algorithm was developed to estimate sleep from raw accelerometer data, based on the Cole-Kripke algorithm (Cole et al., 1992). Wake and sleep thresholds were set for each second of lying periods >15 min, based on a constantly changing variable called the 'sleep index' ($S_n$). A sleep index >1 was considered as 'awake' and a sleep index <1 was considered as 'asleep' (see Formula 1). Thus, thigh movement would increase the value of the sleep index, and time without thigh movement would decrease the value of the sleep index (Figure 1).
Raw accelerometer data were re-sampled to 30 Hz. Data were then band-pass filtered at 0.5-10 Hz, and further background noise was removed from the signal using a cut-off value of 0.02 g. The sleep index was then calculated according to Formula 1, in which $A_n$ is the mean band-pass-filtered vector magnitude in the $n$th second, τ is the time constant, and k is the gain parameter.
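The per-second input $A_n$ described above can be illustrated with a minimal sketch. It assumes the three axes have already been re-sampled to 30 Hz and band-pass filtered, that the vector magnitude is computed sample by sample, and that the 0.02 g cut-off is applied by zeroing magnitudes below it. These choices, and the function name `per_second_activity`, are assumptions for illustration and are not taken from the study's implementation.

```python
import math

def per_second_activity(x, y, z, fs=30, noise_floor=0.02):
    """Mean vector magnitude per second from filtered axis data (in g).

    Assumes x, y, z are equal-length sequences that were already
    re-sampled to `fs` Hz and band-pass filtered (0.5-10 Hz).
    """
    n_seconds = len(x) // fs
    activity = []
    for s in range(n_seconds):
        total = 0.0
        for i in range(s * fs, (s + 1) * fs):
            vm = math.sqrt(x[i] ** 2 + y[i] ** 2 + z[i] ** 2)
            # Assumption: magnitudes below the cut-off are treated as noise.
            total += vm if vm >= noise_floor else 0.0
        activity.append(total / fs)
    return activity
```

Any trailing samples that do not fill a whole second are ignored in this sketch.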
Both τ and k are held constant and were optimised through iterative comparison between the algorithm output and PSG recordings, as described in paragraph 2.5 (Figure 2). An upper limit of exp(1) ≈ 2.72 was set for $S_n$, so that when $A_n = 0$ the value of $S_n$ decreases exponentially in line with the time constant τ (i.e., starting from the upper limit, sleep is only defined after τ seconds if no further movements are detected) (Figure 1). Sleep or wake-state bouts that lasted <10 s were removed using a median filter. Furthermore, to account for the time taken for the sleep index ($S_n$) to rise above the movement detection threshold during awakening, each awakening was considered to occur 2 min earlier than the time indicated by the sleep index.
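Formula 1 is not reproduced in this text, so the following is only a sketch of one recursive form that is consistent with the description: $S_n = \min(e,\; S_{n-1} e^{-1/\tau} + k A_n)$. With this form, an index at the upper limit $e$ decays to the threshold of 1 after exactly τ seconds without movement. The recursion itself, the default τ of 1110 s (18.5 min, inferred from the detection delay mentioned in the discussion), and the value of `k` are assumptions, not the optimised values from the study.

```python
import math

def sleep_index(activity, tau=1110.0, k=1.0, s0=math.e):
    """Per-second sleep index under an assumed first-order recursion.

    activity: per-second mean filtered vector magnitude (A_n).
    tau: time constant in seconds (assumed 18.5 min).
    k: gain parameter (hypothetical value).
    s0: starting index; defaults to the upper limit (fully awake).
    """
    cap = math.e                      # upper limit exp(1)
    decay = math.exp(-1.0 / tau)      # per-second exponential decay
    s = s0
    index = []
    for a in activity:
        s = min(cap, s * decay + k * a)
        index.append(s)
    return index

def classify(index):
    """Label each second: index > 1 is wake, otherwise sleep."""
    return ["wake" if s > 1.0 else "sleep" for s in index]
```

The median filter for bouts shorter than 10 s and the 2 min shift of awakenings are omitted here to keep the sketch short.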

Data analysis
The PSG recordings were scored in 30-s epochs according to standard criteria to detect sleep (Ancoli-Israel et al., 2007;Rechtschaffen & Kales, 1968). This was performed separately by trained specialists for both the optimisation and evaluation datasets.
The output from the accelerometer algorithm was down-sampled with mode-filtering to fit the same 30-s epoch length as PSG. The thigh accelerometer data and PSG recordings were then synchronised and manually inspected to ensure that the time synchronisation was correct. Epoch-by-epoch comparisons were then made to calculate the sensitivity, specificity, and accuracy of the sleep algorithm in estimating sleep with respect to PSG detection.
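As a hedged illustration of the epoch-by-epoch comparison, the sketch below uses the standard confusion-matrix definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / all epochs. It assumes that sleep (coded 1) is the positive class and wake (coded 0) the negative class, and that ties in the mode filter are scored as wake. Both conventions, and the function names, are assumptions for illustration.

```python
from collections import Counter

def to_epochs(per_second, epoch_s=30):
    """Down-sample per-second 0/1 states to epochs by taking the mode."""
    epochs = []
    for start in range(0, len(per_second) - epoch_s + 1, epoch_s):
        counts = Counter(per_second[start:start + epoch_s])
        # Assumption: ties are scored as wake (0).
        epochs.append(1 if counts[1] > counts[0] else 0)
    return epochs

def _ratio(num, den):
    """Return num/den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")

def epoch_metrics(psg, alg):
    """Sensitivity, specificity and accuracy with sleep (1) as positive."""
    pairs = list(zip(psg, alg))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fp = sum(1 for p, a in pairs if p == 0 and a == 1)
    fn = sum(1 for p, a in pairs if p == 1 and a == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, len(pairs)),
    }
```

Under these conventions, sensitivity is the share of PSG-scored sleep epochs that the algorithm also scores as sleep, and specificity is the corresponding share for wake epochs.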

Performance evaluation
To assess the performance of the optimised algorithm, comparison between sleep defined by ambulatory PSG recordings and sleep defined by the sleep algorithm was performed in the evaluation dataset.
The mean and standard deviation (SD) of sensitivity, specificity and accuracy were calculated for all participants. In addition, the following variables were derived from both the PSG recordings and the sleep algorithm according to Ibáñez et al. (2018): sleep interval, defined as the time between the onset of the first sleep period and the last awakening; total sleep time, defined as the total amount of time the participants slept between the start of the PSG recording and the last awakening, identified by PSG; sleep latency, defined as the time from the start of the PSG recording until the participant fell into a stage of sleep for the first time; WASO, defined as the total time awake after the first sleep onset until the last awakening; sleep efficiency, defined as the percentage of recorded time asleep until the last awakening; and awakening index, defined as the number of awakenings >10 s per hour.
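The definitions above can be sketched for a single night scored in 30-s epochs (1 = sleep, 0 = wake, starting at the beginning of the recording). In this sketch, the last awakening is taken as the end of the final sleep epoch, and the awakening index is expressed per hour of the sleep interval. Both choices are assumptions, as is the function name `sleep_variables`.

```python
def sleep_variables(epochs, epoch_s=30):
    """Derive summary sleep variables (in minutes) from 0/1 epochs."""
    sleep_idx = [i for i, e in enumerate(epochs) if e == 1]
    if not sleep_idx:
        return None  # no sleep detected in this recording
    first = sleep_idx[0]
    end = sleep_idx[-1] + 1  # assumed position of the last awakening
    minutes = epoch_s / 60.0
    tst = sum(epochs[:end]) * minutes
    interval = (end - first) * minutes
    # Count sleep-to-wake transitions inside the sleep interval.
    awakenings = sum(
        1 for i in range(first + 1, end)
        if epochs[i - 1] == 1 and epochs[i] == 0
    )
    return {
        "sleep_interval": interval,
        "total_sleep_time": tst,
        "sleep_latency": first * minutes,
        "waso": interval - tst,
        "sleep_efficiency": 100.0 * tst / (end * minutes),
        # Assumption: awakenings per hour of the sleep interval.
        "awakening_index": awakenings / (interval / 60.0),
    }
```

Because every wake bout in epoch data lasts at least one 30-s epoch, each counted transition already satisfies the >10 s criterion.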
The difference between the sleep variables derived from the PSG recordings and those derived from the algorithm was calculated. The upper and lower 95% limits of agreement (LoA) of the sleep interval and total sleep time were also calculated by taking the mean difference ±1.96 × SD of the differences (Bland & Altman, 1999) and presented in Bland-Altman plots (Figure 3 and Table 3). To assess whether outliers (i.e., those who had very short or very long sleep duration relative to the norm) affected the results, a sensitivity analysis was performed excluding participants whose sleep time, according to the PSG recordings, was more than 2 SDs shorter or longer than the mean sleep time. All analyses were performed with MATLAB 2020b (Windows version) and RStudio 2021.09.1.
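The limits of agreement calculation can be sketched as follows. The sketch assumes that differences are taken as algorithm minus PSG (so a negative bias means underestimation), that the sample SD is used, and that the limits of agreement range (LoAR) is the upper limit minus the lower limit. The function name and the sign convention are assumptions for illustration.

```python
import statistics

def limits_of_agreement(psg, alg):
    """Bias and 95% limits of agreement between paired measurements."""
    # Assumption: difference = algorithm minus PSG.
    diffs = [a - p for p, a in zip(psg, alg)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the differences
    lower = bias - 1.96 * sd
    upper = bias + 1.96 * sd
    return {"bias": bias, "lower": lower, "upper": upper,
            "loar": upper - lower}
```

For example, `limits_of_agreement([400, 420, 380, 450], [380, 400, 370, 410])` gives a bias of -22.5 min for these made-up values, which are not study data.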

RESULTS
A summary of the sleep parameters derived from the PSG and the sleep algorithm in the evaluation dataset is presented in Table 2 (see also Table 3).
Using threshold values of ±2 SDs from the mean sleep time according to PSG recordings, three outliers were identified. A subsequent sensitivity analysis excluding these outliers indicated that the correlations between accelerometer-measured and PSG-measured    (Conley et al., 2019).
The variable that corresponded best between the algorithm and PSG was sleep interval, which was underestimated by 21 min (LoAR 130 min), equating to 5% of the total sleep interval registered by PSG.
Our algorithm underestimated total sleep time by 32 min, whereas the previous trend for wrist-worn accelerometer estimates of sleep has been towards an overestimation of approximately 11 min (Conley et al., 2019). An explanation for this discrepancy may lie in the algorithm design, where each awakening was considered to occur 2 min prior to the time indicated by the sleep index ($S_n$). Another reason for the underestimation could be that the algorithm always takes at least 18.5 min, due to the time constant τ, to detect sleep onset from a fully awake state, even though in reality people can wake up and go back to sleep very quickly. These parameters were selected to optimise specificity but may have inadvertently contributed to the underestimation observed. Alternatively, the observed underestimation may indicate that thigh movement during sleep is more prevalent than the movement of other body parts, although further research is needed to confirm this.

FIGURE 4 Performance of diary and different accelerometer placements in measuring total sleep time (TST) compared with polysomnography. Differences in minutes, bias and 95% limits of agreement ±2 SD (limits of agreement range); n = number of nights. Only studies with healthy adult populations and studies that have presented the mean differences (bias) and SD of differences of TST between the two methods are presented.

Relatively low precision (high LoAR) was observed for all variables, indicating that the algorithm is mainly suited to studies where groups of individuals are evaluated. We also observed poor correlations between PSG- and algorithm-derived sleep quality variables, including sleep latency, sleep efficiency, WASO, and awakening index. This suggests that the algorithm is perhaps less suited to the measurement of sleep quality. However, the performance is still comparable with many observations using wrist-worn accelerometers (Conley et al., 2019).
The poor correlation across these variables is likely a result of the fact that the sleep state is a multifactorial physiological process. Thus, short awakenings or sleep episodes may not be related to any particular changes in thigh movement. Therefore, if the estimation of these sleep quality variables with high precision is of priority, alternative measurement methods should be considered.
When interpreting the results and comparisons above, the following three points are important to consider. Firstly, the comparison of sensitivity and specificity statistics across studies should not be considered as definitive, because differences in the total recorded sleep time, and prevalence of sleep between the studies can affect the statistic.
Secondly, the comparison of findings using laboratory-based PSG with those using ambulatory PSG should not be considered as a one-to-one comparison, because ambulatory PSG is likely to contain much more natural variation than laboratory-based PSG. Thirdly, as is evident from Figure 3, short sleepers (<6 h) appear to have introduced greater disagreement between PSG and accelerometer estimations than those who slept longer (>6 h). This was also shown in the sensitivity analysis where the precision slightly increased (i.e., LoAR decreased) for total sleep time when very short sleepers were excluded from the analysis.
Therefore, it is important to consider that the performance of the algorithm might vary depending on the population measured. The precision will most likely be lower amongst groups of individuals with sleep durations that are considerably shorter than the norm. It has also been shown, in other studies, that accelerometer estimates of sleep are better when applied to healthy populations than to populations with chronic disorders (Conley et al., 2019).