Implementing statistical analysis

  • WMO
  • Non-WMO

The necessary techniques and working methods relating to the statistical analyses have already been described in the statistical analysis plan (SAP).

To refine your knowledge of statistics, the methodologists and statisticians of the EDS department have developed an e-learning module on Practical Biostatistics. The EpidM website offers a variety of advanced and refresher courses on statistical methods and techniques for clinical research.

In addition, a Biostatistics Wiki environment has been set up, containing short and accessible explanations of statistical techniques that are commonly used in medical research.

Several other practical suggestions that can help to conduct statistical analyses in an organised manner are listed below.

Suggestions for conducting statistical analyses in an organised manner

It is extremely important for the final, cleaned dataset to be retained in its original state. For this reason, the first step should always be to make a working copy of the final dataset. Save the final, original dataset as a read-only file with a new name and in a separate folder.

The working copy will then serve as the basis for all further statistical analyses.
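As a minimal illustration, this step might look as follows in SPSS syntax; the file names and folder paths are hypothetical and should be replaced by your own.

    * Note: file names and paths below are purely illustrative.
    * Open the final, cleaned dataset (the original; do not overwrite this file).
    GET FILE='P:\myproject\data\final\study_final.sav'.

    * Mark the original file as read-only so it cannot be changed by accident.
    PERMISSIONS FILE='P:\myproject\data\final\study_final.sav'
      /PERMISSIONS READONLY.

    * Save a working copy under a new name in a separate folder.
    * All further analyses are run on this working copy.
    SAVE OUTFILE='P:\myproject\data\analysis\study_workingcopy.sav'.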

Always use a syntax (in SPSS) or a script (in R) when performing data operations (e.g. recoding or merging variables) and statistical analyses. In addition to helping you work in a systematic and organised manner, working with syntaxes makes it possible to replicate your analyses at a later point in time, or to have them repeated and checked by other researchers.

Moreover, cumbersome tasks (e.g. recoding) can be repeated easily and efficiently by copying a syntax from a previous syntax file.

Add descriptions and comments to the syntax in comprehensible language, such that the purpose and result of each processing operation are clear (also to other people).
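For example, a small piece of annotated SPSS syntax for a recoding step might look like the sketch below; the variable names and cut-off value are made up for illustration.

    * Note: variable names and the cut-off value are purely illustrative.
    * Recode age (in years) into a dichotomous variable for Table 1.
    RECODE age (LO THRU 64=0) (65 THRU HI=1) INTO age65.
    EXECUTE.

    * Label the new variable so the output remains readable for others.
    VARIABLE LABELS age65 'Age dichotomised (0 = younger than 65, 1 = 65 or older)'.
    VALUE LABELS age65 0 'Younger than 65' 1 '65 or older'.

    * Check the result of the recoding.
    FREQUENCIES VARIABLES=age65.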

Save the syntax files with a meaningful file name (.sps files in SPSS), and do the same for the corresponding output files containing the results of the statistical analyses (.spv files in SPSS).

Data analysis documentation

Clear documentation of the data analysis is important for its reproducibility and efficiency. This can be done by creating a log file for all the relevant analyses. Such a file should start with the research question to be answered and the date of the analysis, and should end with a (provisional) answer to that question.

A log file (e.g. an SPSS syntax file) can be used to document your analyses (e.g. for an article), so that you and others can easily retrieve and reproduce them. Don’t forget to always include the name and location of the data file (e.g. ‘get file’ in SPSS), so you know which file relates to your analysis (and where it is stored). Log files should include the code for all statistical tests conducted, so that they serve as an analysis logbook. Place your code in a logical order and distinguish between variable definitions and analyses (e.g. first all variable definitions, then the analyses for table 1, then table 2, etc.).
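Set up along these lines, a log file could look roughly like the sketch below; the research question, file name and variable names are purely illustrative.

    * Research question: effect of the intervention on the pain score at 12 weeks.
    * Analysis date: [date]; analyst: [name].
    * Note: the file name and variable names below are purely illustrative.

    * Data file used for this analysis.
    GET FILE='P:\myproject\data\analysis\study_workingcopy.sav'.

    * --- Variable definitions ---.
    COMPUTE pain_change = pain_12wk - pain_baseline.
    EXECUTE.

    * --- Analyses for Table 1: baseline characteristics ---.
    DESCRIPTIVES VARIABLES=age pain_baseline /STATISTICS=MEAN STDDEV.
    FREQUENCIES VARIABLES=sex treatment.

    * --- Analyses for Table 2: primary comparison ---.
    T-TEST GROUPS=treatment(0 1) /VARIABLES=pain_change.

    * Provisional answer: [summarise the answer to the research question here].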

Tip: annotate your log files (e.g. by using * followed by text in SPSS syntax). Annotations are an important part of the documentation of your data analyses and facilitate both reproduction of your results and recycling of your code.

Handling missing data

Missing data are a common problem in all kinds of research. How you deal with them depends on how much data is missing, the kind of missing data (single items, a full questionnaire, a measurement wave), why the data are missing, and how they are missing (at random or not at random). Handling missing data is an important step in several phases of your study.

The default option in standard software packages such as SPSS, Stata or SAS is that cases with missing values are not included in the analyses. Deleting cases (persons) results in a smaller sample size and larger standard errors. As a result, the power to detect a significant effect decreases, and the chance of correctly accepting the alternative hypothesis of an effect (over the null hypothesis of no effect) becomes smaller. In addition, you introduce bias in the effect estimates when the characteristics of responders and non-responders differ: if the group of non-responders is large and you delete them, the characteristics of your sample will differ from those of your original sample and of the population you are studying. Therefore, always inspect the missing data in your dataset before starting your analyses, and never simply delete persons with missing values.
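A first inspection of the missing data can be done with a few lines of syntax. The sketch below (with hypothetical variable names) counts the missing values per variable, flags participants with a missing outcome and checks whether they differ from participants with an observed outcome.

    * Note: variable names below are purely illustrative.
    * Count valid and missing values per variable.
    FREQUENCIES VARIABLES=age sex treatment pain_12wk
      /FORMAT=NOTABLE
      /STATISTICS=MEAN.

    * Flag participants with a missing outcome (1 = missing, 0 = observed).
    COMPUTE miss_pain = MISSING(pain_12wk).
    EXECUTE.

    * Compare characteristics of participants with and without a missing outcome.
    CROSSTABS /TABLES=miss_pain BY sex /CELLS=COUNT COLUMN.
    T-TEST GROUPS=miss_pain(0 1) /VARIABLES=age.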