PrimSeq: A deep learning-based pipeline to quantitate rehabilitation training [1]

Avinash Parnandi (Department of Neurology, New York University Langone Health, New York, United States of America), Aakash Kaku (Center for Data Science, New York University), Anita Venkatesan, Natasha Pandit, et al.

Date: 2022-08
[1] Url: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000044

Stroke rehabilitation seeks to accelerate motor recovery by training functional activities, but may have minimal impact because of insufficient training doses. In animals, training hundreds of functional motions in the first weeks after stroke can substantially boost upper extremity recovery. The optimal quantity of functional motions to boost recovery in humans is currently unknown, however, because no practical tools exist to measure them during rehabilitation training. Here, we present PrimSeq, a pipeline to classify and count functional motions trained in stroke rehabilitation. Our approach integrates wearable sensors to capture upper-body motion, a deep learning model to predict motion sequences, and an algorithm to tally motions. The trained model accurately decomposes rehabilitation activities into elemental functional motions, outperforming competitive machine learning methods. PrimSeq furthermore quantifies these motions at a fraction of the time and labor costs of human experts. We demonstrate the capabilities of PrimSeq in previously unseen stroke patients with a range of upper extremity motor impairment. We expect that our methodological advances will support the rigorous measurement required for quantitative dosing trials in stroke rehabilitation.

Stroke commonly damages motor function in the upper extremity (UE), leading to long-term disability and loss of independence in a majority of individuals. Rehabilitation seeks to restore function by training daily activities, which deliver repeated UE functional motions. The optimal number of functional motions necessary to boost recovery is unknown. This gap stems from the lack of measurement tools that can feasibly count functional motions. We thus developed the PrimSeq pipeline to enable the accurate and rapid counting of building-block functional motions, called primitives. PrimSeq uses wearable sensors to capture rich motion information from the upper body, and custom-built algorithms to detect and count functional primitives in this motion data. We showed that our deep learning algorithm precisely counts functional primitives performed by stroke patients and outperforms other benchmark algorithms. We also showed that patients tolerated the wearable sensors and that the approach is 366 times faster at counting primitives than humans. PrimSeq thus provides a precise and practical means of quantifying functional primitives, which promises to advance stroke research and clinical care and to improve the outcomes of individuals with stroke.

Funding: This work was funded by the American Heart Association/Amazon Web Service postdoctoral fellowship 19AMTG35210398 (A.P.), NIH R01 LM013316 (C.F.G., H.S.), NIH K02 NS104207 (H.S.), NIH NCATS UL1TR001445 (H.S.), and NSF NRT-HDR 1922658 (A.K., C.F.G.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2022 Parnandi et al.
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Here, we present the Primitive Sequencing pipeline (PrimSeq), a deep learning-based framework to automatically identify and count functional primitives in rehabilitation training. Inspired by deep learning methods for speech recognition, our approach uses a sequence-to-sequence model to generate sequences of functional primitives, which are then counted [36]. PrimSeq encompasses three main steps: (1) capture of upper-body motion during rehabilitation with wearable inertial measurement units (IMUs), (2) generation of primitive sequences from the IMU data with the trained deep learning model, and (3) tallying of primitives with a counting algorithm. We developed PrimSeq in chronic stroke patients performing a battery of rehabilitation activities. We show that in previously unseen stroke patients with a range of motor impairment, PrimSeq robustly identifies primitives in various rehabilitation activities and outperforms other activity-recognition methods. IMU-based motion capture is also well tolerated by patients, and PrimSeq considerably diminishes the time and burden of quantifying functional motions.

To overcome these limitations, we seek to directly identify and measure individual functional motions in rehabilitation. We previously reported that rehabilitation activities can be entirely decomposed into elemental functional motions, which we call functional primitives [5, 31–33]. There are five main classes of functional primitives: reach (UE motion to make contact with a target object), reposition (UE motion to move into proximity of a target object), transport (UE motion to convey a target object in space), stabilization (minimal UE motion to hold a target object still), and idle (minimal UE motion to stand at the ready near a target object). Much like words in a paragraph, primitives are strung together to execute an activity. Primitives typically have a short duration and a single goal, resulting in simple motion phenotypes. These phenotypes are surprisingly consistent even across species, activities, and motor impairment [5, 34, 35], indicating that it is possible to identify primitives regardless of individual or context. We thus use primitives as units of motion that are readily identifiable, quantifiable, and replicable.

Some rehabilitation researchers have instead opted for precision by manually counting functional motions [17, 27–30]. This approach meticulously identifies training content and quantity, but it has practical challenges. Functional motions are fluid, fast, and difficult to disambiguate in real time. Using treating therapists to count these motions would distract them from clinical remediation, and using independent observers would likely outstrip the financial and labor resources of most institutions. Even with offline video analysis, the identification and counting of functional motions is prohibitively time- and personnel-intensive [5]. The impracticality of manual tallying is thus a major obstacle to its widespread adoption in research and clinical settings.

A different pragmatic approach has been to focus on UE joint motions (e.g., forearm rotation, elbow flexion, shoulder flexion) made by stroke patients.
This approach uses wearable sensors and deep learning to identify or grade UE joint motions, with the goal of remotely monitoring rehabilitation training or ADL execution at home [21, 22]. However, identifying joint displacement does not disentangle functional from nonfunctional UE motions. The identification of functional motions is important because their repeated practice is key for inducing activity-dependent neuroplasticity, engaging stroke-induced neuroplasticity, and promoting behavioral recovery [6, 7, 23–26].

Most rehabilitation researchers and clinicians opt for pragmatism, using time-in-therapy to estimate how much training has occurred [9–15]. Problematically, time metrics do not directly read out motion content or quantity, as the delivery of therapy is not standardized [11]. For a given rehabilitation session, the number of functional motions can vary considerably across subjects and institutions, or training may focus on nonfunctional motions unrelated to ADL execution, such as stretching, weight-bearing, or strengthening exercises [16, 17]. Even if a time-based intervention is successful [9, 13], its lack of content detail—precisely what and how much was trained—hinders replication by other researchers or clinicians. Some investigators have sought to automatically detect functional versus nonfunctional UE motion using inertial measurement units (IMUs) and machine learning [18–20], but this approach still outputs time spent in functional motion rather than the quantity of functional motions trained.

In animal models of rehabilitation, when hundreds of UE functional motions are trained per day in the early weeks after stroke, recovery can be potentiated [6, 7]. In humans, the number of UE functional motions needed to boost recovery is less clear, primarily because a feasible method to count UE functional motions does not currently exist. To date, the quantitation of UE functional motions—a critical parameter for establishing the effective rehabilitation dose [8]—has been challenged by the trade-off between pragmatism and precision.

Most individuals with stroke have persistent motor deficits in an upper extremity (UE) [1], resulting in a significant loss of function and independence in activities of daily living (ADLs) [1–3]. Stroke rehabilitation seeks to promote UE motor recovery and restore UE function. To this end, a major focus of rehabilitation is the repeated practice of ADLs, which are composed of functional motions made by the UE [4]. Functional motions are goal-directed and purposeful, typically targeting objects in the context of executing an ADL [4, 5].

From the primitive identification standpoint, the trained Seq2Seq model and counting algorithm identified and tallied primitives 370 times faster than human coders. For example, to process 6.4 hours of recorded activities, Seq2Seq had an execution time of 1.4 hours (13 s per minute of recording), whereas trained human coders required 513.6 hours, or 12.8 workweeks (4,815 s per minute of recording). Furthermore, if inference is made on a high-performance computer with an advanced graphics processing unit, processing lags between incoming IMU data and primitive prediction are estimated at ~15 seconds. These observations indicate that PrimSeq could provide near-immediate feedback about primitive counts, which would help patients and therapists meet training goals during rehabilitation sessions [41].

Finally, we examined the practicality of using the PrimSeq pipeline.
From the motion capture standpoint, most patients reported minimal difficulty with donning and calibrating the IMU array and minimal discomfort with wearing the sensors (median scores of 0 and 1, respectively, on an 11-point visual analog scale; S3 Fig). IMU setup time was minimal, requiring on average 12.9 ± 4.1 minutes. Electromagnetic sensor drift was also late to emerge during recording, occurring after 63.7 ± 28.1 minutes, with recalibration taking 1–2 minutes. Collectively, these observations indicate that an IMU system could support motion capture that is well tolerated, efficient, and stable.

Seq2Seq had a significantly higher sensitivity than the other models (Seq2Seq, 0.767; CNN, 0.727; ASRF, 0.720; RF, 0.497; all t(498) > 20.7, all p < 0.0001). Seq2Seq had a significantly lower FDR than CNN and RF (Seq2Seq, 0.166; CNN, 0.196; RF, 0.213; all t(498) > 22.8, all p < 0.0001), but was outperformed by ASRF (0.160; t(498) = 7.5, p < 0.0001). Seq2Seq also had a significantly higher F1 score than the other models (Seq2Seq, 0.799; CNN, 0.763; ASRF, 0.775; RF, 0.609; all t(498) > 18.4, all p < 0.0001), indicating the best overall classification performance of the models tested.

Prediction outcomes were aggregated from primitives, activities, and patients. Shown are the sensitivity, false discovery rate (FDR), and F1 score for Seq2Seq, convolutional neural network (CNN), action segment refinement framework (ASRF), and random forest (RF). Seq2Seq had the highest sensitivity (Seq2Seq, 0.767; CNN, 0.727; ASRF, 0.720; RF, 0.497). Seq2Seq FDR was lower than that of CNN and RF but marginally higher than that of ASRF (Seq2Seq, 0.166; CNN, 0.196; ASRF, 0.160; RF, 0.213). Seq2Seq had the highest F1 score (Seq2Seq, 0.799; CNN, 0.763; ASRF, 0.775; RF, 0.609).

Finally, we benchmarked Seq2Seq against competitive models used in human action recognition: convolutional neural network (CNN), action segment refinement framework (ASRF), and random forest (RF; Fig 6 and S2 Fig) [31, 38, 39]. The CNN and RF made predictions at each 10-ms time point, which were smoothed to generate primitive sequences [40]. ASRF is a state-of-the-art action recognition method that directly generates primitive sequences. We aggregated patients, primitive classes, and activities to examine the overall sensitivity and FDR of each model. We also examined the F1 score, the harmonic mean of sensitivity and precision (1 − FDR), which reflects global classification performance. We used bootstrapping and unpaired, two-tailed t-tests to statistically compare the classification performance of Seq2Seq against the other models.

We also examined whether Seq2Seq performance was affected by impairment level (Fig 5C). Seq2Seq had a stable sensitivity (0.71–0.82) that did not vary with UE-FMA score (Spearman's correlation: ρ(6) = 0.54, p = 0.171, 95% confidence interval (CI) [-0.12, 0.87]). The FDR ranged more widely (0.31–0.64) and showed a trend toward decreasing as UE-FMA scores increased (ρ(6) = -0.69, p = 0.069, 95% CI [-0.91, -0.11]). This trend was driven by one patient (UE-FMA 37), whose stabilizations and transports were excessively overcounted by Seq2Seq.
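The impairment-level analysis above is a rank correlation between per-patient metrics and UE-FMA scores. A minimal sketch with scipy follows, using placeholder values in place of the study data:

```python
# Sketch of the Fig 5C analysis: rank-correlate a per-patient metric with
# UE-FMA scores. The eight values below are hypothetical, not the study data.
from scipy.stats import spearmanr

ue_fma = [22, 26, 35, 37, 48, 53, 60, 63]                   # hypothetical UE-FMA scores
fdr    = [0.64, 0.55, 0.48, 0.60, 0.40, 0.35, 0.33, 0.31]   # hypothetical per-patient FDRs

rho, p = spearmanr(ue_fma, fdr)   # Spearman's rho; with n = 8 patients, df = 6
print(f"rho(6) = {rho:.2f}, p = {p:.3f}")
```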
We next examined Seq2Seq classification performance for each activity (Fig 5B). The model had excellent performance with structured activities, such as moving an object to fixed locations on a shelf (mean sensitivity 0.92, FDR 0.07) and a tabletop (mean sensitivity 0.91, FDR 0.06). Its performance declined with more naturalistic activities, such as drinking (mean sensitivity 0.74, FDR 0.16) and feeding (mean sensitivity 0.74, FDR 0.19). Seq2Seq had its lowest performance on the tooth-brushing activity (mean sensitivity 0.62, FDR 0.21).

We first examined Seq2Seq classification performance for each primitive class (Fig 5A). The model had high mean sensitivities for most primitives (0.76–0.81), except stabilizations (0.64). The model also had low mean FDRs for most primitives (0.11–0.16), except stabilizations (0.23). For stabilizations, distinct prediction errors drove the modest classification performance: their spurious removal (false negatives) decreased sensitivity, whereas their spurious addition (false positives) increased overcounting.

(A) Classification performance per primitive class. Prediction outcomes (FP, FN, TP) for activities were combined. Shown are the mean ± standard deviation of Seq2Seq sensitivity and FDR for patients in the test set; each dot represents a single subject. Mean sensitivity was high and mean FDR was low for most primitives, except stabilizations. (B) Classification performance per activity. Prediction outcomes for primitive classes were combined. Shown are the mean ± standard deviation of Seq2Seq sensitivity and FDR for patients in the test set; each dot represents a single subject. Mean sensitivity was high and mean FDR was low for structured activities such as the shelf and tabletop tasks, but both metrics were more modest for more complex activities. (C) Classification performance per patient. Prediction outcomes for activities and primitives were combined. Shown are sensitivity and FDR values per patient with respect to their upper extremity Fugl-Meyer Assessment (UE-FMA) score. Seq2Seq sensitivity was not affected by impairment level (p = 0.171), but there was a trend toward reduced FDRs with higher UE-FMA scores (p = 0.069), driven by one patient.

To assess the overall classification performance of Seq2Seq in predicting primitives, we computed sensitivity and false discovery rate (FDR). Sensitivity represents the proportion of true primitives that were correctly predicted and included in the count. The FDR, a measure of overcounting, represents the proportion of predicted primitives that were incorrectly predicted and included in the count. We assessed Seq2Seq classification performance for separate primitive classes, activities, and patient impairment levels (Fig 5).

Overall, Seq2Seq had a tolerable error rate but had the most difficulty classifying stabilizations. The motion phenotype of stabilizations, which allows for some minimal motion in the UE [5], may account for the different classification errors. Seq2Seq could blend the minimal motion of stabilizations into the beginning or end of an adjacent motion-based primitive (i.e., reach, reposition, transport), leading to its deletion. Conversely, the model could mistake periods of diminished motion in adjacent primitives for a stabilization, leading to an insertion or swapping-in. In addition, the model had an increased frequency of swapping-in stabilizations for idles (S1 Fig). This error could be attributed to the lack of IMU finger data necessary to identify grasp, which is a major phenotypic distinction between these two minimal-motion primitives [5].

We examined the frequency of these error types with respect to true counts in each primitive class, which adjusts for differences in the number of primitives performed (Fig 4B).
Seq2Seq had a modest frequency of deletion errors for most primitives (10.6–13.6%), except stabilizations (25.5%), and a modest frequency of swap-out errors (5.9–11.2%) for all primitives. Seq2Seq also had a modest frequency of insertion errors (4.6–13.3%) and swap-in errors (4.0–13.4%) for all primitives.

(A) Sequence matching. Shown is a schematic depicting the different types of prediction outcomes, comparing the ground truth sequence (top row) against the predicted sequence (bottom row). Seq2Seq could produce a true positive (correct prediction; green), a false negative (deletion or swap-out error; pink), or a false positive (insertion or swap-in error; blue). In the example shown, transport was correctly predicted; stabilize was incorrectly deleted; reach was incorrectly swapped out while idle was incorrectly swapped in; and reach was incorrectly inserted. (B) Frequency of prediction errors per primitive class. Shown are the mean frequency ± standard deviation of prediction errors for patients in the test set; each dot represents a single subject. Activities were combined, and erroneous counts were normalized to ground truth counts in each primitive class. Deletion errors, in which primitives were incorrectly removed from the prediction, occurred with modestly low frequency, except for stabilizations. Swap-out errors, in which primitives were incorrectly removed from the prediction and another class was predicted instead, occurred with modestly low frequency. Insertion errors, in which primitives were incorrectly added to the prediction, occurred with low frequency, except for stabilizations. Swap-in errors, in which an incorrect primitive class was predicted in place of the actual primitive, occurred with modestly low frequency.

PrimSeq generated primitive counts that closely approximate true counts, but gross tallies do not reveal whether the Seq2Seq model identified primitives that were actually performed. For example, the model may fail to predict a reach that happened but may later predict a reach that did not happen; the net result of these two errors is that a reach is spuriously credited to the count. We thus examined the nature of the predictions made by Seq2Seq (Fig 4). We compared the predicted and ground truth sequences using the Levenshtein algorithm [37], identifying two types of prediction errors: false negatives and false positives (Fig 4A). A false negative occurred when Seq2Seq did not predict a primitive that actually happened, either because it erroneously missed the primitive (deletion error) or because it erroneously predicted another primitive class (swap-out error). A false positive occurred when Seq2Seq predicted a primitive that did not actually happen, either because it erroneously added the primitive (insertion error) or because it erroneously predicted this primitive class in place of the actual one (swap-in error).
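The sequence matching described above can be sketched with a standard edit-distance alignment. In this illustrative implementation (not the authors' code), alignment operations map onto the error types in Fig 4A: a deletion is a false negative, an insertion is a false positive, and a substitution corresponds to a swap-out plus a swap-in.

```python
# Levenshtein-style alignment of ground truth and predicted primitive
# sequences; function and variable names are illustrative.

def align(truth, pred):
    """Return edit operations aligning truth to pred via dynamic programming."""
    n, m = len(truth), len(pred)
    dp = [[0] * (m + 1) for _ in range(n + 1)]   # dp[i][j] = edit distance of prefixes
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if truth[i - 1] == pred[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match or substitution (swap)
    ops, i, j = [], n, m                              # backtrace to recover operations
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (truth[i - 1] != pred[j - 1]):
            ops.append(("match" if truth[i - 1] == pred[j - 1] else "swap",
                        truth[i - 1], pred[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("deletion", truth[i - 1], None))    # false negative
            i -= 1
        else:
            ops.append(("insertion", None, pred[j - 1]))    # false positive
            j -= 1
    return ops[::-1]

truth = ["reach", "transport", "stabilize", "idle"]
pred = ["reach", "transport", "idle"]
print(align(truth, pred))  # stabilize is deleted (a false negative); the rest match
```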
Shown are the mean ± standard deviation of the ground truth and predicted counts for patients in the test set; each dot represents a single subject. (A) Counts per primitive class. Activities were combined. The pipeline generated counts that were similar to true counts for each primitive class (mean percent of true counts: reach, 92.2%; reposition, 86.6%; transport, 99.5%; stabilization, 91.1%; and idle, 93.3%). (B) Counts per activity. Primitive classes were combined. The pipeline generated counts that were similar to true counts for each activity (mean percent of true counts: shelf task, 109.1%; tabletop task, 102.5%; feeding, 97.6%; drinking, 93.7%; combing hair, 89.8%; donning glasses, 85.2%; applying deodorant, 81.1%; washing face, 79.1%; and brushing teeth, 81.4%).

We first examined the counting performance of PrimSeq (Fig 3). For the separate primitive classes, the approach counted on average 282 reaches, 206 repositions, 408 transports, 291 stabilizations, and 258 idles across combined activities (Fig 3A). For the separate rehabilitation activities, the approach counted on average 40–308 primitives across combined classes (Fig 3B). To assess the similarity of PrimSeq counts to the actual number of primitives performed, we compared predicted versus ground truth counts per primitive class and activity. PrimSeq generated primitive counts that were 86.1–99.6% of true counts for the separate primitive classes (Fig 3A) and 79.1–109.1% of true counts for the separate activities (Fig 3B). We also examined counting errors at the single-subject level, finding that they were consistently low for primitive classes (reach 7.2 ± 9.9%, reposition 13.3 ± 9.2%, transport 0.4 ± 17.7%, stabilization 8.0 ± 42.7%, and idle 6.6 ± 10.5%) and for activities (shelf task -10.7 ± 26.0%, tabletop task -2.3 ± 13.4%, feeding 5.9 ± 16.7%, drinking 5.5 ± 25.0%, combing 11.8 ± 7.8%, donning glasses 14.5 ± 23.7%, applying deodorant 11.5 ± 26.2%, face washing 12.2 ± 30.1%, and tooth-brushing 6.3 ± 31.2%). In most cases, the variance in predicted counts could be attributed to inter-individual variance in true counts. For stabilizations, however, an excess variance in predicted counts could be explained by increased prediction errors, which we discuss further below.

(A) Motion capture. Upper-body motion data are captured with IMUs during performance of rehabilitation activities. The IMU data are divided into six-second windows. (B) Primitive prediction. The windowed IMU data are fed into the sequence-to-sequence (Seq2Seq) deep learning model. Seq2Seq uses a Gated Recurrent Unit (GRU) to sequentially encode the IMU data into a feature vector, which provides a condensed representation of relevant motion information. A second GRU then sequentially decodes the feature vector to generate the primitive sequence. (C) Primitive count. A counting algorithm then removes primitive duplicates at window boundaries and tallies the predicted primitives.

The PrimSeq pipeline has three main steps (Fig 2). First, IMU data from rehabilitation activities are recorded and divided into 6-second windows (Fig 2A). Second, the windowed data are fed to the Seq2Seq model. The encoder GRU generates a single feature vector capturing relevant motion information, which is processed by the decoder GRU to estimate a sequence of primitives (Fig 2B). Third, a counting algorithm removes any duplicates at window boundaries and tallies the primitives from the sequences (Fig 2C).

We designed a sequence-to-sequence (Seq2Seq) encoder-decoder deep learning model that predicts primitives from IMU data patterns. Its encoder module is a three-layer, bi-directional Gated Recurrent Unit (GRU) with 3,072 hidden representations. Its decoder module is a single-layer GRU with 6,144 hidden representations. Using this architecture, Seq2Seq maps a window of motion data to a sequence of primitives (see Materials and methods for architecture design and training details).
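A minimal PyTorch sketch of this encoder-decoder design follows the dimensions stated above (three-layer bi-directional GRU encoder with 3,072 hidden units; single-layer GRU decoder with 6,144 hidden units). The token embedding, greedy decoding loop, and output vocabulary (five primitives plus an assumed end-of-sequence token) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a Seq2Seq primitive predictor; assumed details are marked below.
import torch
import torch.nn as nn

N_FEATURES, N_CLASSES = 77, 6        # 77-dim IMU input; 5 primitives + <eos> (assumed)
WINDOW = 600                         # 6 s of IMU data at 100 Hz

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(N_FEATURES, 3072, num_layers=3,
                              bidirectional=True, batch_first=True)
        self.embed = nn.Embedding(N_CLASSES, 6144)       # previous prediction -> decoder input
        self.decoder = nn.GRU(6144, 6144, batch_first=True)
        self.classify = nn.Linear(6144, N_CLASSES)

    def forward(self, x, max_len=12):
        # x: (batch, WINDOW, N_FEATURES). Encode to one 6,144-element feature
        # vector by concatenating the last forward and backward hidden states.
        _, h = self.encoder(x)                           # h: (6, batch, 3072)
        feat = torch.cat([h[-2], h[-1]], dim=-1)         # (batch, 6144)
        hidden = feat.unsqueeze(0)                       # decoder initial state
        token = torch.zeros(x.size(0), dtype=torch.long) # assumed start token
        logits = []
        for _ in range(max_len):                         # greedy decoding loop
            out, hidden = self.decoder(self.embed(token).unsqueeze(1), hidden)
            step = self.classify(out.squeeze(1))
            logits.append(step)
            token = step.argmax(dim=-1)
        return torch.stack(logits, dim=1)                # (batch, max_len, N_CLASSES)

model = Seq2Seq()
pred = model(torch.randn(2, WINDOW, N_FEATURES))          # two example windows
```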
(A) Activity sampling. As patients performed rehabilitation activities, functional motion was synchronously captured with two video cameras (dark green arrow) placed orthogonal to the workspace and nine inertial measurement units (IMUs, light green arrow) affixed to the upper body. (B) Data recording. The video cameras generated 2-view, high-resolution data. The IMU system generated 76-dimensional kinematic data (accelerations, quaternions, and joint angles). A skeletal avatar of patient motion and joint angle offsets were monitored for electromagnetic sensor drift. (C) Primitive labeling. Trained coders used the video recordings to identify and annotate functional primitives (dotted vertical lines). These annotations labeled and segmented the corresponding IMU data. Interrater reliability was high between the coders and an expert (Cohen's κ for reach, 0.96; reposition, 0.97; transport, 0.97; stabilization, 0.98; idle, 0.96).

To collect a variety of functional primitives for training the deep learning model, we recorded and labeled functional motion from 41 chronic stroke patients with UE paresis (Fig 1 and Table 1). While patients performed common rehabilitation activities (Fig 1A and S1 Table), we captured upper-body motion with an array of IMUs and video cameras (Fig 1B). To generate ground truth labels, trained human coders viewed the videotaped activities and segmented them into functional primitives (Fig 1C). The coders annotated the beginning and end of each primitive on the video, which applied a primitive label to the corresponding segment of IMU data. Interrater reliability of primitive labeling was high between the coders and an expert (Cohen's κ ≥ 0.96). We split the labeled IMU data into a training set (n = 33 patients; 51,616 primitives; see the primitive class distribution in Materials and methods) and an independent test set (n = 8 patients; 12,545 primitives). Data splits were balanced for impairment level and paretic side (Table 1). We note that the number and variety of primitives, not the number of subjects, makes this dataset robust for the development of deep learning approaches. We used the labeled training set for model fitting, parameter adjustment, and hyperparameter tuning. We used the previously unseen test set for an unbiased model evaluation, employing ground truth labels to assess primitive counting and classification performance.

In conclusion, we present a novel pipeline that measures functional motion repetitions in UE rehabilitation activities. PrimSeq is a foundational step toward the precise and pragmatic quantification of rehabilitation dose, and it overcomes considerable time, personnel, and financial barriers. Our approach has the potential to support rigorous rehabilitation research and quantitative clinical delivery, which are vitally needed to improve stroke outcomes.

Our approach has some limitations to consider. PrimSeq identifies functional motions but does not measure how normally they are performed. This information is important for tracking recovery and tailoring rehabilitation. Future work could characterize and reference the normative kinematics of the primitive classes, which could generate continuous measurements of abnormal primitive performance. In addition, PrimSeq was trained on motion from chronic stroke patients performing a circumscribed battery of rehabilitation activities. To increase clinical utility, additional model training and refinement could be undertaken "in the wild" with subacute stroke patients undergoing inpatient rehabilitation. A more extensive sampling of primitives could be expected to boost classification performance on unstructured activities and make PrimSeq robust for application in different recovery stages.
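The train/test split described above is made at the subject level and balanced for impairment and paretic side. The sketch below shows one way to construct such a split; the strategy (sorting by UE-FMA within each paretic side and sampling test patients across the impairment range) is an assumption for illustration, not the authors' exact procedure.

```python
# Illustrative subject-level split balanced for impairment and paretic side.
import random

def split_patients(patients, test_fraction=0.2):
    """patients: dicts with 'id', 'ue_fma' (impairment score), 'paretic_side'."""
    train, test = [], []
    for side in ("left", "right"):
        group = sorted((p for p in patients if p["paretic_side"] == side),
                       key=lambda p: p["ue_fma"])
        n_test = max(1, round(test_fraction * len(group)))
        stride = max(1, len(group) // n_test)
        # pick test patients spread across the impairment range
        idxs = list(range(stride // 2, len(group), stride))[:n_test]
        test_ids = {group[i]["id"] for i in idxs}
        for p in group:
            (test if p["id"] in test_ids else train).append(p)
    return train, test

rng = random.Random(1)
patients = [{"id": i, "ue_fma": rng.randint(20, 66),
             "paretic_side": "left" if i % 2 else "right"} for i in range(41)]
train, test = split_patients(patients)
print(len(train), len(test))  # 33 8, mirroring the paper's split sizes
```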
Finally, the classification performance of Seq2Seq was limited in some cases (e.g., stabilizations and the tooth-brushing activity). Future work could employ alternative deep learning models with explainable artificial intelligence to identify sources of confusion, which could then be targeted to improve classification performance [45, 46]. Alternatively, the motion capture setup could be expanded to generate information that is currently lacking but may be critical for more precise classification. For example, the grasp of an object is a major phenotypic feature that distinguishes stabilizations (grasp present) from idles (grasp absent) [5], but our current IMU array does not capture finger motion. The incorporation of videography data and computer vision [47–49] or the addition of an instrumented glove [50] could provide grasp data, delivering key motion details that are necessary for classification.

The generalizability of our approach depends in part on the adoption of functional primitives as units of measure. As with any classification in machine learning, users must confirm that the predicted classes are relevant to their question or application. We reason that, for the purpose of objectively measuring rehabilitation training, primitives are reliable and effective units of measure. As singular motion and minimal-motion events, primitives directly transcribe the series of actions taken by individuals to execute more complex activities [5]. Because primitives are circumscribed and mutually exclusive, they avoid lumping together a variable quantity or series of motions under one class designation, as can occur with functional motion detection [18–20] or human activity recognition [42–44]. We have consistently found that rehabilitation activities can be entirely broken down into constituent functional primitives [5, 31, 32]. Primitives thus provide a metric by which rehabilitation training can be consistently measured, enabling the critical appraisal and replication of dosed rehabilitation interventions.

To date, the measurement of functional motions in UE stroke rehabilitation has been an elusive technical challenge, hampered by imprecision or impracticality. To address these obstacles, we developed PrimSeq, an approach that integrates wearable IMU data, a trained deep learning model, and a counting algorithm to quantify functional primitives performed during rehabilitation. We found that PrimSeq generated accurate primitive counts across primitive classes and activities. We also found that the Seq2Seq model had moderate-to-high classification performance across primitive classes, rehabilitation activities, and impairment levels, outperforming state-of-the-art action recognition models. Finally, we found that PrimSeq was a practical approach for UE motion capture and processing. These results indicate that PrimSeq provides a pragmatic solution for the accurate identification of motion content and quantity during rehabilitation.

Materials and methods

Subjects

We studied 41 chronic stroke patients with upper extremity (UE) paresis. Patients gave written informed consent to participate. This study was approved by the Institutional Review Board at New York University Langone Health, in accordance with the Declaration of Helsinki.
Patient demographics and clinical characteristics are reported in Table 1.

Enrollment criteria

Eligibility was determined by electronic medical records, patient self-report, and physical examination. Patients were included if they were ≥ 18 years old, premorbidly right-handed, able to give informed consent, and had unilateral motor stroke with contralateral UE weakness scoring < 5/5 in any major muscle group [51]. Patients were excluded if they had: hemorrhagic stroke with mass effect, or subarachnoid or intraventricular hemorrhage; traumatic brain injury; a musculoskeletal, major medical, or non-stroke neurological condition that interferes with motor function; contracture at the shoulder, elbow, or wrist; moderate UE dysmetria or truncal ataxia; apraxia; visuospatial neglect; global inattention; or legal blindness. Stroke was confirmed by radiographic report. Lesions in non-motor areas or the opposite hemisphere were allowed barring bilateral weakness. Both ischemic and hemorrhagic strokes were included, as motor deficits do not substantially differ between the two types [52]. Patients were in the chronic phase (> 6 months post-stroke), except for two patients (3.1 and 4.6 months post-stroke).

Primitive dataset generation

Patients participated in two to three sessions lasting ~2.5 hours each. Sessions were typically one to three days apart (average 2.6 days). At the first session, we recorded patient height and measured UE impairment level with the UE Fugl-Meyer Assessment (FMA), in which a higher score (maximum 66) indicates less impairment [53]. Patients then performed five trials of nine rehabilitation activities while their upper-body motion was recorded (S1 Table and Fig 1A). We identified activities using a standardized manual of occupational therapy (OT) [54]. From these, we identified activities commonly practiced during inpatient stroke rehabilitation through a survey of seven OTs with expertise in stroke rehabilitation. Patients were seated in front of a workspace (table or sink counter) at a distance that allowed the nonparetic UE to reach the furthest target object without trunk flexion. Workspaces were adjusted to standard heights (table, 76 cm; counter, 91 cm). We placed the target objects at fixed locations using marked, laminated cardboard mats (table) or measured distances (counter). We used standardized instructions that outlined the major goals of each activity. Because most activities in the battery are bimanual, we instructed patients to use their paretic UE to the best of their ability.

Motion capture

To record patient motion, we affixed nine inertial measurement units (IMUs; Noraxon, USA) to the C7 and T10 spine, pelvis, and both hands using Tegaderm tape (3M, USA), and to both arms and forearms using Velcro straps (Fig 1A). Each IMU is small (length, 3.8 cm; width, 5.2 cm; height, 1.8 cm), lightweight (34 g), and captures 3D linear acceleration, 3D angular velocity, and 3D magnetic heading at 100 Hz. The motion capture software (myoMotion, Noraxon, USA) applies a Kalman filter to the linear accelerations, angular velocities, and magnetic headings to generate 3D unit quaternions for each sensor [55]. We used coordinate transformation matrices to transform the generated quaternions to a sensor-centric framework, which represents the rotation of each sensor around its own axes. The motion capture software also applies a proprietary height-scaled skeletal model to the IMU data to generate 22 anatomical angles of the upper body (S2 Table). The motion capture system thus generates a 76-dimensional dataset every 10 ms, consisting of the following: 27 dimensions of accelerations (9 IMUs × 3D accelerations per IMU), 27 dimensions of quaternions (9 IMUs × 3D quaternions per IMU), and 22 joint angles. These data are displayed on a software interface alongside an avatar of the patient (Fig 1B). As an additional feature, we added the side of the patient's paretic UE (left or right) to each 10-ms time step, resulting in a 77-dimensional dataset.
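Putting these pieces together, each 10-ms sample can be assembled into one 77-dimensional vector. A minimal numpy sketch follows, with illustrative array names (the real pipeline reads these quantities from the myoMotion export) and an assumed 0/1 encoding of the paretic side:

```python
# Assemble one 77-dim feature vector: 27 accelerations + 27 quaternion
# values + 22 joint angles + 1 paretic-side flag.
import numpy as np

N_IMUS = 9

def make_frame(accels, quats, joint_angles, paretic_side):
    """accels, quats: (9, 3) arrays; joint_angles: (22,); paretic_side: 'left'/'right'."""
    assert accels.shape == (N_IMUS, 3) and quats.shape == (N_IMUS, 3)
    assert joint_angles.shape == (22,)
    side_flag = np.array([0.0 if paretic_side == "left" else 1.0])  # assumed encoding
    return np.concatenate([accels.ravel(), quats.ravel(), joint_angles, side_flag])

frame = make_frame(np.zeros((9, 3)), np.zeros((9, 3)), np.zeros(22), "left")
print(frame.shape)  # (77,)
```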
The motion capture system records patient motion with high precision (accelerometry accuracy, ± 0.001 g; gyroscopic accuracy, ± 1.25°; anatomical angle accuracy, ± 2°), performing as well as the gold-standard optical system [56]. We monitored online for electromagnetic sensor drift by visually inspecting the joint angle data for baseline shifts and ensuring that the avatar's motions matched those of the patient. Patients were immediately recalibrated if drift was observed. Recalibration required standing with the UEs straight at the sides (arms, forearms, and wrists in neutral position with elbows extended) and took less than two minutes.

We recorded UE motion with two high-speed (60 Hz), high-definition (1088 × 704 resolution) cameras (Ninox, Noraxon) positioned orthogonally less than two meters from the patient (Fig 1A). The cameras have a focal length of 4.0 mm and a large viewing window (length, 2.5 m; width, 2.5 m; Fig 1B). The cameras ran on the same clock as the IMUs, and the video and IMU recordings were synchronized.

Data labeling

Human coders identified the functional primitives performed in the rehabilitation activities. The five classes of functional primitives are reach (UE motion to make contact with a target object), reposition (UE motion to move proximate to a target object), transport (UE motion to convey a target object in space), stabilization (minimal UE motion to hold a target object still), and idle (minimal UE motion to stand at the ready near a target object). Coders were trained on a functional motion taxonomy that operationalizes primitive identification [5]. The coders used the video recordings to identify and label the start and end of each primitive, which simultaneously segmented and labeled the synchronously recorded IMU data. To ensure the reliable labeling of primitives, an expert (A.P.) inspected one-third of all coded videos. Interrater reliability between the coders and the expert was high, with Cohen's κ coefficients ≥ 0.96. Coders took on average 79.8 minutes to annotate one minute of recording. We split the resulting ground truth dataset into a training set (n = 33 patients; 51,616 primitives: 9,840 reaches, 8,028 repositions, 12,471 transports, 11,445 stabilizations, and 9,832 idles) and a test set (n = 8 patients; 12,545 primitives: 2,510 reaches, 1,948 repositions, 3,331 transports, 2,475 stabilizations, and 2,281 idles) to independently train and test the deep learning model. Patient selection was random but constrained to balance impairment level and paretic side. The IMU dataset, including its data splits, is available at https://simtk.org/projects/primseq.
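Because the video and IMU streams share a clock, a coder's start/end annotations on the video segment the IMU stream directly. A minimal sketch, under the assumption that annotations arrive as (start, end, class) triples in seconds:

```python
# Map video annotations onto the 100-Hz IMU stream; the annotation format
# shown here is an assumption for illustration.
import numpy as np

IMU_HZ = 100

def label_imu(imu_data, annotations):
    """imu_data: (T, 77) array; annotations: list of (start_s, end_s, class)."""
    labels = np.full(len(imu_data), "unlabeled", dtype=object)
    for start_s, end_s, primitive in annotations:
        i0, i1 = int(start_s * IMU_HZ), int(end_s * IMU_HZ)
        labels[i0:i1] = primitive            # one label per 10-ms sample
    return labels

imu_data = np.zeros((1200, 77))              # 12 s of recording
annotations = [(0.0, 1.5, "reach"), (1.5, 4.2, "transport"), (4.2, 6.0, "stabilization")]
labels = label_imu(imu_data, annotations)
print(labels[140], labels[200])              # reach transport
```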
Deep learning model development

Inspired by speech recognition models [57], we used a sequence-to-sequence (Seq2Seq) deep learning model to perform the task of predicting primitive sequences. To handle the higher dimensionality of the IMU data, we increased the model's input nodes to 77 (from 40 for speech) and increased the hidden dimensionality to 3,072 (from 512 for speech). To provide sufficient temporal context given the lower sampling rate of the motion data (100 Hz, versus 16,000 Hz for speech), we expanded the window size of the input data to 6 s (from 10 ms for speech).

The architecture of Seq2Seq has two modules: a feature encoder consisting of a three-layer, bi-directional Gated Recurrent Unit (GRU) with 3,072 hidden representations, and a feature decoder consisting of a single-layer GRU with 6,144 hidden representations. Seq2Seq performs primitive identification in two steps (Fig 2). The encoder GRU first encodes the data window to generate a 6,144-element feature vector. This step reduces the dimensionality of the high-dimensional IMU data, enabling the model to learn features relevant to the downstream task of sequence prediction. The decoder GRU then decodes this feature vector to generate the sequence of primitives. The generated sequence of primitives is then passed through a counting algorithm that tallies the functional motion repetitions while removing duplicate primitives at the window boundaries. Additional model details are presented in recent work [36].

We trained Seq2Seq by minimizing a loss function based on the cross-entropy between the predicted and ground truth primitive sequences, using the Adam optimizer [58] with a learning rate of 5 × 10⁻⁴. Because primitive overcounting may lead to accidental under-training of patients in rehabilitation, we prioritized keeping the average false discovery rate (FDR) < 20% while maximizing the average sensitivity during model training. We ensured this balance by stopping the model training early based on the Action Error Rate (AER), computed as the total number of changes needed for the predicted sequence to match the ground truth, normalized to the length of the ground truth sequence.

We used a window size of 6 s for primitive prediction with Seq2Seq. During model training, the middle 4 s of the window was predicted, with the flanking 1 s of data on each side providing the model additional temporal context. We further maximized the training data by using a window slide of 0.5 s, which also helped the model learn primitive boundaries. During model testing, the window size was 6 s and the middle 4 s of the window was predicted. To enable the flanking during model testing, we set the slide to 4 s. A preliminary experiment was performed with window sizes of 2 s, 4 s, and 8 s; the window size of 6 s resulted in the lowest validation AER.

We selected and cross-validated the hyperparameters for Seq2Seq with four different splits of the training set. In each split, 24 or 25 patients were used for training and 9 or 8 patients were used for validation. We selected the hyperparameters for each of the four models based on their validation AERs. Each split yields a separate model that generates independent prediction probabilities per primitive. The prediction probabilities from the four models were averaged, or ensembled, and the primitive with the highest probability was taken as the Seq2Seq prediction. The ensembled prediction was also fed back into the four models to inform the next prediction in the data window. After Seq2Seq training and hyperparameter estimation were completed on the training set, we applied the trained Seq2Seq model to the test set to assess its counting and classification performance on data from previously unseen patients. The test set was not used for feature selection, preprocessing steps, or parameter tuning.
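A minimal sketch of the test-time windowing and the boundary-deduplicating count follows. The merge rule shown (collapse identical primitives that abut a window boundary) is an illustrative stand-in for the authors' counting algorithm:

```python
# Test-time windowing (6-s windows sliding by 4 s at 100 Hz) and a simple
# duplicate-removing tally of predicted primitive sequences.
HZ, WIN_S, SLIDE_S = 100, 6, 4

def windows(imu_data):
    """Yield overlapping (600, 77) windows for test-time inference."""
    win, slide = WIN_S * HZ, SLIDE_S * HZ
    for start in range(0, len(imu_data) - win + 1, slide):
        yield imu_data[start:start + win]

def count_primitives(window_sequences):
    """Merge duplicate primitives at window boundaries, then tally."""
    merged = []
    for seq in window_sequences:
        seq = list(seq)
        if merged and seq and merged[-1] == seq[0]:
            seq = seq[1:]          # same primitive spans the boundary: count once
        merged.extend(seq)
    counts = {}
    for p in merged:
        counts[p] = counts.get(p, 0) + 1
    return counts

window_sequences = [["reach", "transport"], ["transport", "stabilization", "reposition"]]
print(count_primitives(window_sequences))
# {'reach': 1, 'transport': 1, 'stabilization': 1, 'reposition': 1}
```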
Code implementing the model, including instructions for training and hyperparameter selection and comparisons with other action-recognition methods on benchmark datasets, is available at https://github.com/aakashrkaku/seq2seq_hrar. An overlay of model predictions with respect to ground truth primitives is demonstrated on a patient video (S1 Video).

Analysis of counting performance

To visualize the ability of PrimSeq to correctly count primitives, we tallied the predicted and ground truth primitive counts per subject in the test set. These counts are displayed as means and standard deviations in Fig 3A and 3B. To analyze PrimSeq counting performance for each primitive class, we combined all activities and normalized the predicted counts to the ground truth counts. To analyze PrimSeq counting performance for each activity, we combined the primitive classes and normalized the predicted counts to the ground truth counts. Counting performance is reported as the mean percent and standard deviation of true counts. We also examined counting errors at the single-subject level, because mean tallies may obscure erroneous counting (e.g., an average of under-counts and over-counts would wash out errors). We calculated the counting error per subject as the difference between true and predicted counts, normalized to the true count. Single-subject counting errors are reported as mean percent and standard deviation.

Analysis of prediction outcomes and error frequency

To examine the nature of the predictions that Seq2Seq made per primitive class, we combined activities and compared predicted against ground truth sequences for each patient in the test set. We used the Levenshtein sequence-comparison algorithm to match the predicted and ground truth primitive sequences [37]. This step generated the prediction outcomes of false negative (FN), false positive (FP), or true positive (TP; Fig 4A). False negatives, or primitives spuriously removed from the prediction, could arise from a deletion error (the model did not predict a primitive that actually happened) or a swap-out error (the model did not predict the actual primitive but instead predicted an incorrect class). False positives, or primitives spuriously added to the prediction, could arise from an insertion error (the model predicted a primitive that did not actually happen) or a swap-in error (the model predicted an incorrect primitive class instead of the actual primitive). True positives were primitives that were correctly predicted. We examined the frequency and type of the prediction errors (FN, FP) made by Seq2Seq, normalizing prediction errors to ground truth counts to adjust for different quantities of primitives. The frequencies of prediction errors are presented as mean and standard deviation for the test set patients. To further assess which classes of primitives were mistaken for each other by Seq2Seq, we generated a confusion matrix to examine swap-out and swap-in errors (S1 Fig).

Analysis of classification performance

To examine the classification performance of Seq2Seq, we computed the classification performance metrics of sensitivity and false discovery rate (FDR) with respect to primitive class, activity, and impairment level. Sensitivity, also known as true positive rate or recall, represents the proportion of ground truth primitives that were correctly predicted. It is calculated as:

sensitivity = TP / (TP + FN)

FDR, a type of overcount, represents the proportion of predicted primitives that were incorrectly predicted. It is calculated as:

FDR = FP / (FP + TP)
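These definitions, together with the F1 score used under Model benchmarking below, reduce to a few lines of code. The outcome counts in this example are hypothetical, but they are chosen to reproduce the aggregate Seq2Seq metrics reported in the Results:

```python
# Classification metrics from aggregated prediction outcomes (TP, FN, FP).
def sensitivity(tp, fn):
    return tp / (tp + fn)

def fdr(tp, fp):
    return fp / (fp + tp)

def f1(tp, fn, fp):
    # harmonic mean of sensitivity and precision (1 - FDR)
    return 2 * tp / (2 * tp + fn + fp)

tp, fn, fp = 767, 233, 153            # hypothetical aggregated outcomes
print(round(sensitivity(tp, fn), 3))  # 0.767
print(round(fdr(tp, fp), 3))          # 0.166
print(round(f1(tp, fn, fp), 3))       # 0.799
```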
To calculate sensitivity and FDR per primitive class, we combined the prediction outcomes (i.e., TP, FN, and FP) from all activities for each test set patient (Fig 5A). To calculate sensitivity and FDR per activity, we combined the prediction outcomes from all primitives for each test set patient (Fig 5B). Sensitivity and FDR are reported as means and standard deviations across test set patients. Finally, to assess classification performance at different levels of UE impairment, we combined the prediction outcomes from all primitives and activities for each test set patient and calculated sensitivity and FDR (Fig 5C). We examined whether these performance metrics varied with ordinal UE-FMA scores using Spearman's correlation (ρ).

Model benchmarking

We compared Seq2Seq against three benchmark models used in human action recognition: convolutional neural network (CNN), random forest (RF), and action segment refinement framework (ASRF).

We examined the classification performance of a CNN that we previously developed to predict primitives from IMU data [31]. Each layer in the CNN computes linear combinations of the outputs of the previous layer, weighted by the coefficients of convolutional filters. The model includes an initial module that helps to map the different physical quantities captured by the IMU system (e.g., accelerations, joint angles, and quaternions) to a common representation space. The model also uses adaptive feature normalization to increase its robustness to shifts in the distribution of the data, which can occur when the model is applied to new patients.

We also examined the classification performance of RF, a conventional machine learning model that has previously been used for human activity recognition [38], including distinguishing functional from nonfunctional motion using wrist-worn sensors [18–20]. RF uses a number of decision trees built on randomly selected sub-samples of the dataset to make predictions. We input into the model a set of statistical features for each data dimension, including its mean, maximum, minimum, standard deviation, and root mean square. These features capture useful information for motion identification, such as the energy and variance of the motion.

The CNN and RF generate primitive predictions at each 10-ms time point. To generate primitive sequences, we smoothed the pointwise predictions of these models using a weighted running-average approach. To perform the smoothing, we used a Kaiser window [40] whose parameters (window size and relative sidelobe attenuation) were selected based on the best validation performance.

We also examined the classification performance of ASRF, a state-of-the-art deep learning model for action recognition [39]. ASRF is composed of two CNN modules: a segmentation module and a boundary detection module. The segmentation module performs the pointwise prediction of the primitives, and the boundary detection module detects the boundaries of the primitives. The pointwise primitive predictions are combined with the detected boundaries for smoothing and final sequence generation. During smoothing, the model takes the most frequent pointwise prediction between two detected boundaries as the final prediction for that segment.

To benchmark Seq2Seq against these alternative models, we combined the prediction outcomes from all patients, primitives, and activities (confusion matrices are shown in S2 Fig). In addition to calculating each model's overall sensitivity and FDR, we also calculated its F1 score. The F1 score, a balance between sensitivity and FDR, captures the global classification performance of a model. It ranges between 0 and 1, and a value of 1 indicates perfect classification. It is calculated as:

F1 = 2TP / (2TP + FP + FN)
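Two of the benchmark components above are straightforward to sketch: the per-dimension statistical features fed to the RF, and Kaiser-window smoothing of pointwise class probabilities into a label sequence. The window length and beta (which controls relative sidelobe attenuation) are placeholders; the paper selected these parameters on validation performance:

```python
# RF feature extraction and Kaiser-window smoothing of pointwise predictions.
import numpy as np
from scipy.signal.windows import kaiser

def rf_features(window):
    """window: (n_samples, n_dims) IMU window -> flat statistical feature vector."""
    feats = [window.mean(axis=0), window.max(axis=0), window.min(axis=0),
             window.std(axis=0), np.sqrt((window ** 2).mean(axis=0))]  # RMS
    return np.concatenate(feats)

def smooth_predictions(probs, m=101, beta=8.0):
    """probs: (T, n_classes) pointwise probabilities -> smoothed class labels."""
    w = kaiser(m, beta)
    w /= w.sum()                               # weighted running average
    smoothed = np.column_stack([np.convolve(probs[:, c], w, mode="same")
                                for c in range(probs.shape[1])])
    return smoothed.argmax(axis=1)             # one label per 10-ms time point

probs = np.random.default_rng(0).dirichlet(np.ones(5), size=2000)  # fake pointwise output
labels = smooth_predictions(probs)
print(rf_features(np.zeros((600, 77))).shape)  # (385,) = 5 features x 77 dims
```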
Statistical examination of model performance

We report the sensitivities, FDRs, and F1 scores for each model on the test set. To examine whether the models significantly differed in their classification performance, we bootstrapped the test set, which consisted of 324 single ADL trials aggregated across test subjects. From this test set, we randomly subsampled 81 trials with replacement to create a bootstrap set. We created 250 such bootstrapped sets, which were independently fed into each model to generate a distribution of performance metrics (sensitivities, FDRs, and F1 scores) for each model. We used unpaired, two-tailed Student's t-tests to compare the performance metrics of Seq2Seq against those of each other model, with Bonferroni correction for multiple comparisons. We performed all statistical analyses in Python. Significance was set at α = 0.05.
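A minimal sketch of this bootstrap comparison follows, with simulated per-trial scores standing in for the real per-trial prediction outcomes. Note that 250 bootstrap sets per model give the t-tests 2 × 250 − 2 = 498 degrees of freedom, matching the t(498) statistics reported in the Results:

```python
# Bootstrap comparison of two models' metric distributions with unpaired,
# two-tailed t-tests and Bonferroni correction; per-trial scores are simulated.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
N_TRIALS, N_BOOT, SUBSAMPLE = 324, 250, 81

def bootstrap_metric(per_trial_scores):
    """Return N_BOOT bootstrap estimates of the mean metric."""
    scores = np.asarray(per_trial_scores)
    return np.array([scores[rng.integers(0, N_TRIALS, SUBSAMPLE)].mean()
                     for _ in range(N_BOOT)])

# hypothetical per-trial F1 scores for two models
seq2seq_f1 = rng.normal(0.80, 0.10, N_TRIALS)
cnn_f1 = rng.normal(0.76, 0.10, N_TRIALS)

boot_a, boot_b = bootstrap_metric(seq2seq_f1), bootstrap_metric(cnn_f1)
t, p = ttest_ind(boot_a, boot_b)           # unpaired, two-tailed by default
n_comparisons = 3                          # Seq2Seq vs CNN, ASRF, RF
print(f"t({2 * N_BOOT - 2}) = {t:.1f}, "
      f"Bonferroni-corrected p = {min(1.0, p * n_comparisons):.4f}")
```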