how to prepare data

I’ve been answering questions about how to deal with raw data for the last couple of years, so I thought I might as well make an easy-how-to blog about how to prepare data for analyses. This is by no means a substitute for taking courses that will address the mathematics behind the statistical assumptions we need to meet (or indeed the assumptions themselves). I am assuming that readers will have at some point studied statistics, but that they may not have dealt with their own data.

A brief aside on learning (and teaching) statistics…

Unfortunately, many statistics classes focus on theory. Students are often given data to run analyses, and those data are usually already cleaned, or generated to provide practice (e.g., missing data).  I have never seen a professor give students raw data. Although collecting data in a research methods course is often an important part of undergraduate psychology degree requirements, graduate students may get through a PhD program without ever being reminded of what they (perhaps) learned as undergrads. They may never be taught (outside of their own lab) how to deal with real data… even in a department that includes an excellent quantitative area.

Equally unfortunately, people–including me–have a hard time learning when the information doesn’t matter to their research goals; when it’s not relevant to them, now.  I know that transforming variables to meet assumptions is covered in the required series of stats courses here at OU, and yet many people seem to get through the same courses without really learning how to deal with non-normally distributed variables.  The simple truth is that all skills require practice. Until you sit down with your own dataset and try to turn it into something you can use to test hypotheses, you are probably not really going to learn. And if you do not keep using that knowledge, you’re going to forget it… especially if you are a graduate student and have to shove new information into your head all the time.

In my experience, shoving new information into my head requires shoving old information out of it. There’s only so much room in my head 😉

I am assuming here that the data have already been entered, either manually or via an online survey.

I am also going to give instructions for SPSS, because I use SPSS to clean and recode data. Although other programs are necessary for many analyses (e.g., SEM, IRT…), none is as easy to work with as SPSS when it comes to data cleaning and coding (excluding big data, aka gigantic datasets). Of course, everything I describe here can be done in other programs, but you may have to get creative (for example, in SPSS you can recode into different variables, or take the maximum value of a series of variables, very easily, whereas it’s much more complicated in SAS).

Those of you who use Qualtrics or Survey Monkey can simply download an SPSS file.

Step 0: Store the raw data.

Save a copy of your raw dataset and NEVER EVER change it.
I save a new version of my dataset after every major change. At the very least, I will have a new version after each and every one of the steps I list below.

Step 1: Make sure there are no data entry mistakes.

For example, if the range of possible values is 1-5 (a Likert scale) and a 55 appears, then with manual data entry it was clearly a mistake. This won’t happen with an online survey, but you might have (will almost always have, unless you restrict the range on Qualtrics) someone who enters their year of birth (e.g., 1999) instead of their age in years (19). You might also get “19 years” if you haven’t limited input to numeric.

The simplest way to do this is to run a frequency analysis on all the variables that might be subject to error (again, I am assuming a normal dataset, not thousands of cases and/or variables; that would be another blog post).  If you identify a variable with an impossible value, sort it from largest to smallest (right-click the variable name in the data view in SPSS and select Sort Descending).  Change “1999” to “19”, “55” to “5”, or “19 years” to “19”. If you have a case like the last one, where the variable would be classified as string rather than numeric (because it contains words), you will need to go to variable view and change the variable type to numeric.
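
If you prefer syntax to clicking through the menus, something like the following will do the same thing (the variable names here are just examples):

FREQUENCIES VARIABLES=Age Item1 Item2.
SORT CASES BY Age (D).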

WRITE DOWN ALL ANOMALIES FROM HERE ON OUT.

I have a Word document called “data cleaning” for every study.  You will regret it sooner or later if you do not keep such records.

Step 2: Check for Bad Data (round 1).

Age.


If you are collecting data from adults, and you’ve posted an online survey, you will have some underage participants. You might also have them in your university’s subject pool. Either way, you’ll need to delete them unless you’ve got IRB permission to use them.  Sort the age variable (ascending) and delete all the cases who are under 18.
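
A minimal syntax sketch, assuming the variable is simply called Age. Remember that SELECT IF permanently drops cases from the working file, which is another reason Step 0 matters:

SORT CASES BY Age (A).
SELECT IF (Age >= 18).
EXECUTE.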

Time.

One of the first things I do is create a variable that tells me how much time overall a person has taken (for online surveys, in lab or not).  Subtract the start date from the end date and divide by 60 (if you want it in minutes).  Sort the new variable by size (ascending, to check for short times).
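
SPSS stores date-time variables in seconds, so the subtraction above is a single COMPUTE; StartDate and EndDate stand in for whatever your survey software calls those variables:

COMPUTE Total_Minutes = (EndDate - StartDate) / 60.
EXECUTE.
SORT CASES BY Total_Minutes (A).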

Cut-offs for discarding cases are somewhat arbitrary and depend on the survey.

For example, if I’ve got a survey I believe will take around 30 minutes, I’m probably going to delete all cases that took fewer than 10 minutes (I usually state this in pre-registration). Using outliers as a decision rule rarely works, because 3.5 or 3 SDs below the mean will probably be negative numbers, but you could try it.
Of course, there are always people who take hours or days.  I sometimes delete these too; it depends on how important it is that people complete the survey in one sitting.
I also time each task and questionnaire and discard data accordingly. You can use Qualtrics to record the time spent on each page of the survey.

Flatliners.

I call people who select the same response on every item of a multi-item questionnaire flatliners.  I check for them visually whenever possible. To do so, I go to the data view in SPSS, select all the variables on a questionnaire, and make the columns narrow (see image below). I then scroll down and look for flatliners. If I find a suspicious case, I flag it (I make a variable called “Bad_Data” and set values to indicate various things) and inspect all the other responses for that case (person).

It’s important to do this before you reverse-score any variables.
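
If you want a quick numeric check to back up the visual one, the standard deviation of a person’s responses across a questionnaire’s items will be zero for a flatliner. A sketch, assuming ten consecutive items named item1 to item10 (run it before any reverse-scoring):

COMPUTE Bad_Data = 0.
IF (SD(item1 TO item10) = 0) Bad_Data = 1.
EXECUTE.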

People who select impossible answers.

For example, I show participants in the lab a variety of TV shows.  I ask each participant to indicate which show they watched… and sometimes they select a show I know is impossible because I wasn’t using it. When this happens, I have to go back to my hard-copy records in the lab and sort out which show was used that day.

Step 3: Change the data into what you need for computation and analyses.

Below are a few examples of the many things you might have to do to prepare your data for analyses. More complex operations are needed for repeated-measures designs (such as changing a dataset from wide to long format), but those will have to wait for another blog entry.

Recoding string variables.

If you are going to need to make groups (e.g., you randomly assign participants to one of two conditions, but there are three possible options in each condition), it is easier to first recode string variables into numeric ones. You can use SPSS’s Transform–Automatic Recode to do this. Make sure you tell SPSS to treat empty string cells as missing data.
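
The equivalent syntax, assuming a string variable called ShowName, looks like this; /BLANK=MISSING is the part that treats empty strings as missing:

AUTORECODE VARIABLES=ShowName
  /INTO ShowName_num
  /BLANK=MISSING
  /PRINT.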

Making a condition variable (an example).

If you have three possible stimuli per condition, you will need to make a new variable.  For example, I might want to show participants TV shows that are either science fiction, documentaries, or sit-coms. To maximize generalizability, I would need as many possible stimuli within each condition as possible.  So if participants are randomly assigned to one of nine shows (three per condition), I will need to make a new variable where everyone who saw one of the three sci-fi shows is coded as sci-fi, and so on.
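
A sketch of that recode, assuming the numeric show variable from the automatic recode above and that shows 1-3 are sci-fi, 4-6 are documentaries, and 7-9 are sit-coms:

RECODE ShowName_num (1 THRU 3=1) (4 THRU 6=2) (7 THRU 9=3) INTO Condition.
VALUE LABELS Condition 1 'Sci-fi' 2 'Documentary' 3 'Sit-com'.
EXECUTE.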

Such a design is properly analyzed with a mixed model, nesting show within condition, but as long as there are no differences between shows within a condition, you can get away with using simple ANOVAs.  You will need to test for mean differences in all relevant variables, as well as for differences in correlations if you are using any covariates (these would be interactions).  If there are any such differences, you should use a mixed model.

Reverse coding.

Any items within a questionnaire that are reverse coded (e.g., “I am not happy” when the scale as a whole is coded in the direction of more happiness) will need to be recoded (e.g., 5=1, 4=2, 3=3, 2=4, 1=5; SPSS provides a drop-down menu for this).  Ideally, you will read the literature–the original article describing the measures you are using–and know which items should be reverse-coded. If you don’t have that information, you can run a reliability analysis (a drop-down menu in SPSS and almost any other statistical software). Make sure you select “descriptives for scale if item deleted” in the options menu. Any items that need reverse-coding will have a negative item-total correlation, BUT do not use that information alone, because if there are many reversed items, the values will not tell you much. Look at the items and apply logic.
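
In syntax, with made-up item names, reverse-scoring looks like this; recode INTO new variables so the original responses survive:

RECODE hap2 hap5 (1=5) (2=4) (3=3) (4=2) (5=1) INTO hap2_r hap5_r.
EXECUTE.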

Scale reliability.

Those of us in psychology often use questionnaires in the form of multi-item scales (where participants respond by selecting strongly disagree, disagree, neither agree nor disagree, agree, or strongly agree). This is not a blog about measurement, so I will assume readers know what I am talking about. If they don’t, they should probably take a class or read a textbook or many papers 😉 Perhaps later I will add sources.

In the meantime: before proceeding any further with scales, you need to make sure that your measures are sufficiently reliable.  You can use alpha, omega, or whatever (again, take a class or read a book for further information), but it is important to verify reliability.  If you have poor reliability, you may be able to improve it by discarding items; this is the point at which to do so (and write it down; it will need to be explicitly stated in any paper submitted for publication).
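
The point-and-click reliability analysis corresponds to syntax along these lines (the item and scale names are invented); /SUMMARY=TOTAL is what produces the item-total correlations and the “alpha if item deleted” column:

RELIABILITY
  /VARIABLES=hap1 hap2_r hap3 hap4 hap5_r
  /SCALE('Happiness') ALL
  /MODEL=ALPHA
  /SUMMARY=TOTAL.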

Creating scale totals.

Using only the items you have decided are good, sum or take the mean of all items on a given scale (or subscale).  It doesn’t matter whether you use sums or means, except for interpretation.  Lately I’ve started using means, but for years I always summed all items for a total score.
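
Either way it is one COMPUTE (again, hypothetical item names). Note that MEAN() and SUM() use whatever items are non-missing; if you want to require a minimum number of valid answers, SPSS lets you write, for example, MEAN.4 instead of MEAN:

COMPUTE Happiness = MEAN(hap1, hap2_r, hap3, hap4, hap5_r).
COMPUTE Happiness_Sum = SUM(hap1, hap2_r, hap3, hap4, hap5_r).
EXECUTE.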

Step 4: Correcting skew.

This is what got me started on this blog: questions about transforming variables.

Correlational analyses (this includes zero-order correlations, regression, ANCOVA–which is really regression) assume normally distributed continuous variables.  This is a very simplistic way to put this! If you want to (and you should) understand statistics and what assumptions are really about, what it means to have something that is or is not normally distributed at the population level, you’ll need to take a statistics course or three.  Here, I am just going to describe a few ways to deal with skewed variables in a dataset. Kurtosis is another violation of normality, but you probably won’t have to worry about it until you know enough not to need this blog.

  1. How to check for skew.

    You can get skew from various analyses (e.g., frequencies), but in SPSS I prefer to use Explore. I select histogram for the plot (and not stem-and-leaf) because I like to see histograms (they are oriented the right way for visualizing normal distribution curves).  You can also get a Q-Q plot. That said, I only use the Skewness statistic in the table to assess skew.  Divide it by its standard error; if the absolute value of the result is greater than 3, you have a skewed variable.

    Why the plots? I like to see them, particularly because it may be that a single outlier is causing the skew.  If that person has the lowest possible score (or highest), s/he has selected all the same responses for the items (if no reverse-scoring).  I will take a closer look at that case.

    The box plot also tells you if there are outliers. I do not discard univariate outliers unless I think the data is bad.
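
    The Explore menu corresponds to the EXAMINE command; a sketch, with a hypothetical variable name:

    EXAMINE VARIABLES=Happiness
      /PLOT BOXPLOT HISTOGRAM
      /STATISTICS DESCRIPTIVES.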

  2. What to do about skew.

    There are several transformations that can correct for skew. I will discuss a few common ones: square root, log, and inverse.  Rank order transformations are also good with some data.  You can google that.
    Start with the least drastic (square root). If that corrects the skew, stop there. If it does not, keep trying.  Often more extreme transformations will only make it worse, so you must compare all of them (including the original variable).

    Example syntax:

    Square root:
    Compute New_Var_name = SQRT(Original_name).
    Execute.
    Log:
    Compute New_Var_name = LG10(Original_name).
    Execute.
    Inverse:
    Compute New_Var_name = 1/(Original_name).
    Execute.
    Remember, taking the inverse means high values will now be low values.  Correlations will therefore have the opposite sign. You can multiply by (-1) if you want to make it easier to understand.
    I name the transformed variables such that I know what I have done (Original_name_SQRT, for example).

    Some mathematics:

    Also remember that you cannot have zero in the denominator and you cannot take the log of zero or anything less than zero; SPSS sets such cases to missing.  You might not know your data have zeros or negative values, so when you check the distribution of your new variables, it is always a good idea to make sure your sample size has not inexplicably shrunk 😉

    Negatively skewed variables:

    These must be reflected when transforming.  You can also use an exponential transformation, but that will really change the variable and I don’t recommend it.

    To reflect a variable, simply add 1 to the maximum score and subtract the variable from it. ((Maxscore +1) – variable)
    Example:
    Compute Var_Name_Sq_Rf = SQRT(Max + 1 - Var_Name).
    As with inverse transformations, this will make high scores low and vice versa, so remember to reverse the sign of all relationships. You can also multiply by -1.
    Compute Var_Name_Sq_Rf = SQRT(Max + 1 - Var_Name)*-1.

    When to use transformed variables?

    Anytime they are part of a regression analysis. This includes zero-order correlations and covariates in ANCOVAs. (ANOVA is just regression with categorical predictors, where the software does all the work and you don’t get slopes unless you ask for them; you can do everything ANOVA and ANCOVA do with regression.)

    You do not need to transform variables to compare mean differences in a t-test, or in an ANOVA. You do need to transform the outcome variable in ANCOVA.

    What about interpretation?

    For regression analyses, this has never been a problem for me, because for my research I am interested in the proportion of variance explained.  I use squared semi-partial correlations. The slope tells me the direction of the relationship (be careful with reflected or inverse transformations). If you want to use the regression equation to know exactly how much a one-unit change in x changes y, you will need to convert back to the original units (and that just depends on what you’ve done and a lot of other things, so you can figure it out for yourself 😉).

    For mean comparisons, this only really matters if you have had to transform your DV.  I have reported statistics and effect sizes from the analyses using the transformed DV, along with (adjusted) DV means per condition from analyses using the untransformed DV. I am not convinced this is necessary, because the effect size (d, calculated by dividing the difference between the means of interest by the model’s standard error of the estimate, i.e., the root mean squared error) tells you all you need to know.  But some readers (and, more importantly, editors and reviewers) like to see raw means.

Step 5: Checking for outliers.

This should come before looking at the statistics and means from the analyses you will use to test your hypotheses.  It is part of preparing your data. I’m not going to give a thorough explanation of outliers and how to deal with them here. I will say that you need to worry about multivariate outliers. The best way to check for them is to run a regression, save the case diagnostics, and apply a set of previously established criteria to discard cases.

My rule is to discard any case that is an outlier on two or more criteria (studentized deleted residuals [the most important, really], Cook’s D, leverage, Mahalanobis distance, dfbetas, etc.; yes, some are redundant).  SPSS or any other program will give you these values if you ask for them. You can then sort the columns (e.g., residuals) to find the outliers.  You can use my Statistics Calculator page to figure out cut-offs by plugging in numbers.
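
A sketch of that regression with saved diagnostics (the outcome and predictors are placeholders); SPSS appends the saved values as new columns (named something like SDR_1, MAH_1, COO_1) that you can then sort:

REGRESSION
  /DEPENDENT Outcome
  /METHOD=ENTER Pred1 Pred2 Pred3
  /SAVE SDRESID MAHAL COOK LEVER DFBETA.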

I know that MV outliers are covered in advanced regression courses. You probably need to take such a course before analyzing your data.  This little section is just a reminder that your data is not ready until you have checked for outliers. Remember to annotate your process, and make an MV_Outliers variable to identify them.

Step 6: Run preliminary, primary, secondary, and exploratory analyses.

It is possible (and more common than people like to admit) to have only exploratory analyses. But that’s a topic for another blog post.
By this time, you should have multiple copies of your dataset stored. My final copy is usually a “short” version with only the variables I will use in my analyses.

Caveat

I have probably forgotten things; do not hesitate to comment and point them out.  I am sure there are also people who do this differently.  There are other good ways to prepare raw data for analysis. You can comment about better ways to clean data too 😉

This is a very basic step-by-step process that I go through every time I have a new dataset. However, I have written it entirely from memory in a short time, so please do not use it as the Data Cleaning Bible.  There are other things you may have to do. A CliffsNotes for data cleaning such as this blog post is no substitute for taking courses, finding mentors, asking questions, and reading a lot of articles.  Most of what I know I learned by doing, starting with the notes I took in class, augmented by talking to a lot of mentors and professors, and through innumerable mistakes and trial and error.
 
