Preparation for starting the analysis
- WMO
- Non-WMO
Data cleaning
After the last participant has completed the last study procedure, the collected data undergo a cleaning process to produce a final dataset for statistical analysis or content review. The data collected in the Case Report Form (CRF/eCRF) are examined for errors and missing information. This process, known as data cleaning, is outlined in the Data Management Plan (DMP) that was established before the start of the study.
The aim of data cleaning is to produce a dataset as accurate and complete as possible by identifying and correcting errors. The following steps are involved, in this order:
- Check for duplicate entries (i.e. the same respondent appearing more than once);
- Identify ghost patients (i.e. invalid or non-existent respondent numbers occurring in the dataset);
- Ensure that mandatory variables are completed (e.g. the respondent number must always be filled in);
- Detect out-of-range values (e.g. impossible entries such as a height of 3 metres);
- Find logical inconsistencies between variables (e.g. cases like “adult children”);
- Conduct longitudinal data cleaning, if applicable.
Note: It’s important to address out-of-range values before assessing logical inconsistencies, as resolving the former often reduces the number of inconsistencies detected later.
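The checks listed above can be sketched in a short script. The example below is a minimal illustration in Python using only the standard library; the field names (`respondent_id`, `height_m`, `age`, `n_children`), the enrolment list, the height limits and the consistency rule are all hypothetical assumptions, not part of any DMP. Longitudinal cleaning is left out because it depends on the study design.

```python
from collections import Counter

# Hypothetical example records; every field name and value is an assumption.
records = [
    {"respondent_id": 101, "height_m": 1.72, "age": 34, "n_children": 2},
    {"respondent_id": 102, "height_m": 3.00, "age": 51, "n_children": 0},
    {"respondent_id": 102, "height_m": 1.65, "age": 51, "n_children": 0},
    {"respondent_id": None, "height_m": 1.80, "age": 45, "n_children": 1},
    {"respondent_id": 999, "height_m": 1.60, "age": 8, "n_children": 3},
]
enrolled_ids = {101, 102, 103}  # assumed list of valid respondent numbers


def find_duplicates(rows):
    """Return respondent numbers that occur more than once."""
    counts = Counter(
        r["respondent_id"] for r in rows if r["respondent_id"] is not None
    )
    return sorted(rid for rid, n in counts.items() if n > 1)


def find_ghost_ids(rows, valid_ids):
    """Return respondent numbers that are not on the enrolment list."""
    present = {r["respondent_id"] for r in rows if r["respondent_id"] is not None}
    return sorted(present - set(valid_ids))


def find_missing(rows, mandatory_fields):
    """Return (row index, field) pairs where a mandatory field is empty."""
    return [
        (idx, field)
        for idx, r in enumerate(rows)
        for field in mandatory_fields
        if r.get(field) is None
    ]


def find_out_of_range(rows, limits):
    """Return (row index, field, value) for values outside [low, high]."""
    issues = []
    for idx, r in enumerate(rows):
        for field, (low, high) in limits.items():
            value = r.get(field)
            if value is not None and not (low <= value <= high):
                issues.append((idx, field, value))
    return issues


def find_inconsistencies(rows):
    """Assumed rule: a respondent younger than 12 cannot have children."""
    return [
        idx
        for idx, r in enumerate(rows)
        if r.get("age") is not None
        and r.get("n_children") is not None
        and r["age"] < 12
        and r["n_children"] > 0
    ]


# Run the checks in the order given in the list above.
report = {
    "duplicates": find_duplicates(records),
    "ghosts": find_ghost_ids(records, enrolled_ids),
    "missing": find_missing(records, ["respondent_id"]),
    "out_of_range": find_out_of_range(records, {"height_m": (0.4, 2.5)}),
    "inconsistent": find_inconsistencies(records),
}
for check, findings in report.items():
    print(check, findings)
```

Each function only reports findings and changes nothing, so corrections can still be made in the CRF/eCRF, where they are tracked.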
Locking the database
Once all data have been entered and any errors corrected, data cleaning is officially concluded. Before statistical analysis can begin, the database must be locked to prevent further changes.
The F07 Lock study data form can be used to document the authorization, timepoint and reason for locking or unlocking the database. For a hard-copy database, the principal investigator must initial and date each page. This locking procedure was established in the DMP before the start of the study.
In some cases, further data cleaning is required after the database has been locked. These changes can no longer be made in the eCRF, where every modification is automatically tracked in an audit trail. To keep the process transparent, document each such modification precisely in the SPSS syntax or R script.
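One possible way to keep post-lock modifications traceable in a script is to list every correction explicitly, apply it to a copy of the locked data, and record the old and new value with a reason. The sketch below uses Python for illustration only; the same pattern can be written in SPSS syntax or R. All names, values, dates and reasons are hypothetical.

```python
import copy

# Hypothetical extract of the locked dataset; it is never modified in place.
locked_data = [
    {"respondent_id": 101, "height_m": 1.72},
    {"respondent_id": 102, "height_m": 3.00},
]

# Every post-lock correction is written out explicitly, with its reason.
corrections = [
    {
        "respondent_id": 102,
        "field": "height_m",
        "new_value": 1.70,
        "reason": "Entry error confirmed against source document",  # assumed
        "date": "2024-01-15",  # hypothetical date
    },
]


def apply_corrections(rows, corrections):
    """Apply corrections to a copy of the data and return (cleaned, log)."""
    cleaned = copy.deepcopy(rows)
    log = []
    for c in corrections:
        matches = [r for r in cleaned if r["respondent_id"] == c["respondent_id"]]
        if len(matches) != 1:
            raise ValueError(
                f"Expected exactly one record for {c['respondent_id']}, "
                f"found {len(matches)}"
            )
        row = matches[0]
        # Store the old value before overwriting it, like an audit trail entry.
        log.append(
            {
                "respondent_id": c["respondent_id"],
                "field": c["field"],
                "old_value": row[c["field"]],
                "new_value": c["new_value"],
                "reason": c["reason"],
                "date": c["date"],
            }
        )
        row[c["field"]] = c["new_value"]
    return cleaned, log


cleaned_data, change_log = apply_corrections(locked_data, corrections)
for entry in change_log:
    print(entry)
```

Because the locked data stay untouched and the script is rerun from the start each time, the analysis dataset can always be reproduced from the locked export plus the documented corrections.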
For additional information and support with data cleaning or database locking, contact the Research Data Management Helpdesk.