Preparation for starting the analysis

Data cleaning

After the last visit of the last subject, the data are cleaned in order to arrive at a final dataset on which the statistical analyses or content review can be performed. The data that have been collected in the CRF/eCRF are inspected for errors or missing information. This is known as data cleaning. This process was established in the data management plan before the start of the study.

The aim of the data cleaning process is to obtain a data file that is as clean as possible, i.e. one from which as many errors as possible have been removed. Data cleaning involves checking the following (in this order):

  • Presence of duplicates in the file (the same respondent occurring more than once);
  • Presence of ghost patients (non-existent respondent numbers occurring in the file);
  • Compulsory completion of a variable (it is, for instance, essential that the respondent number is always filled in);
  • Out-of-range values (impossible variable values, for instance, a height of 3 metres);
  • Logical inconsistencies between variables, for instance, “pregnant men”;
  • If applicable, longitudinal data cleaning, carried out after the checks above.

    Work in the right order: first deal with the “out-of-range” issues, and only then assess inconsistencies, because fewer inconsistencies remain to be found once the out-of-range values have been corrected.

    Additional information and support with regard to data cleaning can be obtained from the department of Research Data Management.
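The checks listed above can be sketched in code. The example below is a minimal illustration in Python using only the standard library. It is not a prescribed procedure, and the same logic can be written in SPSS syntax or R. The field names (subject_id, sex, height_cm, pregnant), the plausibility limits and the enrolment list are all assumptions made for the example.

```python
from collections import Counter

# All names and limits below are hypothetical, chosen for illustration only.
ENROLLED_IDS = {"001", "002", "003", "004"}    # assumed enrolment log
REQUIRED = ("subject_id", "sex", "height_cm")  # assumed compulsory variables
RANGES = {"height_cm": (100, 250)}             # assumed plausibility limits


def check_records(records):
    """Return (check, subject_id, detail) findings, in the cleaning order."""
    findings = []

    # 1. Duplicates: the same subject number occurring more than once.
    counts = Counter(r.get("subject_id") for r in records)
    for sid, n in counts.items():
        if sid is not None and n > 1:
            findings.append(("duplicate", sid, f"occurs {n} times"))

    # 2. Ghost patients: subject numbers that do not exist in the enrolment log.
    for sid in counts:
        if sid is not None and sid not in ENROLLED_IDS:
            findings.append(("ghost", sid, "not in enrolment log"))

    # 3. Compulsory variables that have been left empty.
    for r in records:
        for field in REQUIRED:
            if r.get(field) in (None, ""):
                findings.append(("missing", r.get("subject_id"), field))

    # 4. Out-of-range values, checked before any inconsistency assessment.
    for r in records:
        for field, (low, high) in RANGES.items():
            value = r.get(field)
            if value not in (None, "") and not low <= value <= high:
                findings.append(
                    ("out_of_range", r.get("subject_id"), f"{field}={value}")
                )

    # 5. Logical inconsistencies between variables.
    for r in records:
        if r.get("sex") == "M" and r.get("pregnant") is True:
            findings.append(
                ("inconsistent", r.get("subject_id"), "male and pregnant")
            )

    return findings


records = [
    {"subject_id": "001", "sex": "F", "height_cm": 168, "pregnant": False},
    {"subject_id": "001", "sex": "F", "height_cm": 168, "pregnant": False},
    {"subject_id": "002", "sex": "M", "height_cm": 300, "pregnant": False},
    {"subject_id": "003", "sex": "M", "height_cm": 181, "pregnant": True},
    {"subject_id": "999", "sex": "F", "height_cm": None, "pregnant": False},
]

for finding in check_records(records):
    print(finding)
```

The function only reports findings and does not change any values. Each finding would normally be resolved through a query in the CRF/eCRF, so that the correction is captured there.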

    Locking the database

    Once all research data have been entered and any input errors have been corrected, the monitoring is officially closed. Before the statistical analysis of the data can begin, the database must first be locked so that no further changes can be made. The Research Data Management website provides additional information on this matter. For a hard-copy database, the principal investigator must initial and date the pages. This process was established in the data management plan before the start of the study.

    In some cases, it may still be necessary to ‘clean’ the data even after the database has been locked. Because this can no longer be done in the eCRF, in which all changes are automatically registered (audit trail), it is important to record every such change very accurately in the SPSS syntax or R script.
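One way to keep such post-lock changes traceable is to write each correction down as data and let the script apply it. The sketch below shows the idea in Python, and the same pattern carries over to SPSS syntax or an R script. The record layout, the correction fields (old, new, reason, by, on) and the example values are assumptions made for the illustration.

```python
from datetime import date

# Hypothetical export of the locked database; it is never modified in place.
locked = [
    {"subject_id": "002", "height_cm": 300},
    {"subject_id": "003", "height_cm": 181},
]

# Hypothetical post-lock corrections: each entry states what changed,
# why, by whom and when, so the script itself documents the change.
CORRECTIONS = [
    {
        "subject_id": "002",
        "variable": "height_cm",
        "old": 300,
        "new": None,
        "reason": "implausible value, set to missing",
        "by": "JD",
        "on": date(2024, 1, 15),
    },
]


def apply_corrections(records, corrections):
    """Apply corrections to a copy of the records; return (cleaned, log)."""
    cleaned = [dict(r) for r in records]  # work on a copy of the locked data
    log = []
    for c in corrections:
        matches = [r for r in cleaned if r["subject_id"] == c["subject_id"]]
        if len(matches) != 1:
            raise ValueError(
                f"expected one record for {c['subject_id']}, found {len(matches)}"
            )
        record = matches[0]
        # Refuse to overwrite a value that differs from the documented one.
        if record[c["variable"]] != c["old"]:
            raise ValueError(
                f"{c['subject_id']}.{c['variable']}: expected {c['old']!r}, "
                f"found {record[c['variable']]!r}"
            )
        record[c["variable"]] = c["new"]
        log.append(c)
    return cleaned, log


cleaned, log = apply_corrections(locked, CORRECTIONS)
for entry in log:
    print(entry["on"], entry["subject_id"], entry["variable"],
          entry["old"], "->", entry["new"], "|", entry["reason"])
```

Checking the old value before overwriting guards against applying a correction to the wrong record or applying it twice. Keeping the locked export unchanged means the cleaned dataset can always be regenerated from the locked data and the script.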