Sleep Medicine Epidemiology
Brigham and Women's Hospital · Harvard Medical School

Enough is enough

When too many variables is a problem

I have parried a number of data requests this week, which has led to more explanations about the nature of our polysomnography (PSG) datasets. Everyone’s eyes widen in disbelief when I mention that the dataset contains “around 1,200 variables” (the exact number is 1,235). For MESA Sleep, our actigraphy dataset comprised more than 1,300 variables. I cringe a little each time we share our 2,500+ variable Excel workbook that describes the contents of our MESA datasets, which amounts to 226 printed pages of variable descriptions. Thankfully, we have started with only a subset of those variables for our “official” MESA Data Dictionary.

The data checking and cleaning processes are also complicated by the sheer number of variables. Since we import and keep nearly every variable that the PSG and actigraphy software offers in its exports, we become reliant on the software itself to make sure that the data outputs are valid. Sad to say, we have found instances over the years where this isn’t entirely the case. For instance, we recently discovered that an older version of the PSG software wrote “-1” instead of a system-missing value for a number of variables. These values have persisted in analytic datasets for 15 years without being scrubbed. Thankfully, the affected variables are quite obscure and unlikely to have ever been part of an actual analysis. Having such data hang around and be put into the hands of the analyst, however, is a dreadful thought for a data manager.
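A check for this kind of problem can be quite small. The Python sketch below is only an illustration and not the process we actually use: it assumes the export has been read into a list of dictionaries (one per recording), and the variable names "ahi" and "obscure_idx" are made up for the example. It counts how often the sentinel value appears in each variable, then blanks it out only in the variables a person has reviewed, since -1 could be a legitimate value elsewhere.

```python
from collections import Counter


def is_sentinel(value, sentinel=-1.0):
    """True if value parses as a number equal to the sentinel (e.g. "-1" or "-1.0")."""
    try:
        return float(value) == sentinel
    except (TypeError, ValueError):
        # Blank cells, None, and non-numeric text are not sentinels.
        return False


def count_sentinels(rows, sentinel=-1.0):
    """Count, per variable, how many records hold the sentinel value."""
    counts = Counter()
    for row in rows:
        for name, value in row.items():
            if is_sentinel(value, sentinel):
                counts[name] += 1
    return dict(counts)


def scrub_sentinels(rows, variables, sentinel=-1.0):
    """Return copies of rows with the sentinel set to None, in the named variables only."""
    targets = set(variables)
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for name in targets:
            if name in new_row and is_sentinel(new_row[name], sentinel):
                new_row[name] = None
        cleaned.append(new_row)
    return cleaned


# Hypothetical records; the variable names are illustrative only.
sample = [
    {"id": "1", "ahi": "12.4", "obscure_idx": "-1"},
    {"id": "2", "ahi": "3.1", "obscure_idx": "-1.0"},
    {"id": "3", "ahi": "7.9", "obscure_idx": "0.8"},
]
report = count_sentinels(sample)                   # variables that need review
cleaned = scrub_sentinels(sample, ["obscure_idx"])  # scrub only the reviewed ones
```

The report step is the important one with 1,200+ variables: it turns "is there a -1 anywhere?" into a short list that someone can compare against the variable descriptions before anything is changed.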

We have a project starting soon that seeks to pull together many of our polysomnography databases into a sleep data resource that is available to the public for novel analyses. Given that wider exposure, I worry more that someone may choose a variable from these “obscure” corners of the datasets without realizing that it may contain implausible, illogical, or invalid values that were never caught by those who generated the analytic files. With MESA Sleep, we have started compiling and publishing our documentation of the PSG and actigraphy data collection and scoring processes. We intend to do this for other studies that may eventually be a part of this public resource, and at the same time I hope we can finally brave the depths of our complex datasets in full and root out any and all issues that remain.