This contribution to the series of methodology columns addresses frequently encountered issues in analyzing large data sets for research, including ensuring that the data are both usable and reliable. The authors welcome comments, suggestions, and questions about the content and presentation of this column from all readers. This column is not comprehensive and does not replace information found in textbooks or peer-reviewed articles. The authors have used textbooks and articles, in addition to their experience, for reference and recommend that readers also refer to these sources for further explanation of the content in this column.

Scenario: Allan, the Director of Facilities of Hospital XYZ, and Gwen, the senior researcher at a healthcare facilities design research consulting firm, have been discussing methods for conducting effective surveys to determine the relationship between newly constructed buildings and patient/staff attitudes and behaviors. After previously discussing pretesting, Gwen has called Allan to begin planning the data analysis. Readers can access the data file that Allan is using online at http://www.herdjournal.com/Media/DocumentLibrary/HERD_datasetwitherrors.xlsx. Their conversation follows.

Gwen: Hi, Allan, it sounds like pretesting went well, and you've administered your final survey.

Allan: We have! I got the surveys back and entered the results into Microsoft Excel. I think we're now ready to analyze the data.

Gwen: That's great, but we need to be careful before we jump into the data analysis. There are a few more steps to make sure the data are ready.

Allan: OK, where do we start?

Gwen: The first step before data analysis is to ensure that your data are properly formatted. Your records should be in a spreadsheet format such that each column corresponds to a single variable and each row corresponds to a single unit of analysis. In our case, the variables are the questions from the survey, and the units of analysis are the survey participants.
Each variable should have a name that is simple and unique, does not have spaces, and does not begin with a number. In our data set, we had each participant answer three questions about how satisfied they were with their working conditions before and after the move. We could refer to these variables as SATISFACTION1A, SATISFACTION2A, and SATISFACTION3A for the responses before the move and SATISFACTION1B, SATISFACTION2B, and SATISFACTION3B for the responses after the move. With these responses separated into different variables, it will be easier to perform analyses comparing them.

You'll also want to make sure that each participant has a unique numeric identifier in the first column. Participants from all populations should be entered as rows in the same spreadsheet, and you should add an indicator variable to note each participant's population. In our data set, for example, participants from the PICU on the fifth floor of the hospital would have a 1 in the POPULATION column, and participants from the PICU on the seventh floor would have a 2 in the POPULATION column. When using numeric codes for text values, you should create a separate key file that explains the value for each code.

When entering your data, you should format all or most of the data as numbers to allow computer programs to more easily describe and perform tests on your data. For example, even though the participants entered M or F when we asked them to report their gender, you would enter them as 1 or 2 in the GENDER column, indicating the corresponding values in your key file. Another consideration that you might encounter is collapsing continuous variables, those that have a range of values between two points, to categorical or ordinal variables, which contain a prespecified number of groups. For example, you might be tempted to group ages when you enter your data.
It is best to always retain the continuous variables in your spreadsheet, in case you decide you don't want the data collapsed at a later point. …
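
The formatting steps Gwen describes can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the column's own procedure: the function names, the sample rows, the ID, AGE, and AGE_GROUP column names, the age cut points, and the key dictionaries (standing in for the separate key file) are all hypothetical. Only the POPULATION and GENDER codes and the naming rule come from the conversation above.

```python
import re

# Hypothetical stand-ins for the separate key file: numeric codes for text values.
GENDER_KEY = {"M": 1, "F": 2}
POPULATION_KEY = {"PICU fifth floor": 1, "PICU seventh floor": 2}

# Naming rule from the text: no spaces and no leading digit.
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def invalid_names(names):
    """Return names that break the naming rule or repeat an earlier name."""
    seen, bad = set(), []
    for name in names:
        if not VALID_NAME.fullmatch(name) or name in seen:
            bad.append(name)
        seen.add(name)
    return bad


def duplicate_ids(rows):
    """Return participant IDs that occur in more than one row."""
    seen, dupes = set(), []
    for row in rows:
        if row["ID"] in seen:
            dupes.append(row["ID"])
        seen.add(row["ID"])
    return dupes


def age_group(age):
    """Collapse age into bands; the cut points are assumptions for illustration."""
    if age < 30:
        return 1
    if age < 50:
        return 2
    return 3


def prepare(raw_rows):
    """Code text values as numbers and add AGE_GROUP while keeping AGE."""
    prepared = []
    for raw in raw_rows:
        row = dict(raw)  # copy so the raw entry is left unchanged
        row["GENDER"] = GENDER_KEY[raw["GENDER"]]
        row["POPULATION"] = POPULATION_KEY[raw["POPULATION"]]
        row["AGE_GROUP"] = age_group(raw["AGE"])  # continuous AGE is retained
        prepared.append(row)
    return prepared


# Made-up sample rows: one row per participant, one column per variable.
raw_rows = [
    {"ID": 1, "POPULATION": "PICU fifth floor", "GENDER": "F", "AGE": 34},
    {"ID": 2, "POPULATION": "PICU seventh floor", "GENDER": "M", "AGE": 52},
]
rows = prepare(raw_rows)
```

Adding the grouped value as a new column, rather than overwriting AGE, keeps the continuous variable available if the grouping is later reconsidered.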