Lesson 26: Working with Date and Longitudinal Data
Lesson 26: Working with Date and Longitudinal DataOverview
In this lesson, we return to a number of topics that we already learned about in Part II of this course. Such topics include handling date variables, selecting first and/or last observations, and using the MEANS and SUMMARY procedures to calculate summary statistics. Here, however, our focus will be on using the techniques to analyze longitudinal data, that is, data that are collected over time.
Objectives
Upon completing this lesson, you should be able to do the following:
- read date values into SAS using appropriate informats
- tell SAS to display your date values using appropriate formats
- use dates to calculate new values
- use the YRDIF function to calculate age
- understand how longitudinal data can be stored in one observation
- understand how longitudinal data can be stored in multiple observations
- use the FIRST.variable and LAST.variable to identify first and last observations, respectively
- use the LAG function to return the value of a variable from the previous observation
- use the DIF function to calculate the difference between a variable's value in the current observation and the variable's value in the previous observation
- use the RETAIN statement and the FIRST.variable and LAST.variable to calculate the difference between the first and last observation for each subject
- use the FIRST.variable and/or LAST.variable, in conjunction with the FREQ procedure, to compute frequencies on longitudinal data sets
- create summary data sets using the MEANS or SUMMARY procedure
Textbook Reference
Chapter 4 of the textbook.
26.1 - Lesson Notes
26.1 - Lesson NotesB. Processing Date Variables
If you are pressed for time, and you feel really comfortable with handling dates in SAS, you might opt to skip reading this section. It truly should be reviewed for you. On the other hand, it might not hurt to see the review of the YRDIF function, among other things.
Page 127. Don't forget that the form of the date constant is always 'DDMONYYYY'd — '01JAN2005'd for example — regardless of the formats that you are using with your dates.
Page 128. I'll emphasize just what the authors say at the bottom of the page.... the formats of your dates need not match the informats that you used to read in your dates. That is, you could use the mmddyy8. informat to read in your dates, but the date9. format to display your dates.
D. Longitudinal Data
Page 130. Unless you really study this program and review the contents of the resulting data set, it might not be obvious what is going on with the structure of the data set. If you launch and run the program:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA HOSP_PATIENTS;
INPUT #1
@1 ID $3.
@4 DATE1 MMDDYY8.
@12 HR1 3.
@15 SBP1 3.
@18 DBP1 3.
@21 DX1 3.
@24 DOCFEE1 4.
@28 LABFEE1 4.
#2
@4 DATE2 MMDDYY8.
@12 HR2 3.
@15 SBP2 3.
@18 DBP2 3.
@21 DX2 3.
@24 DOCFEE2 4.
@28 LABFEE2 4.
#3
@4 DATE3 MMDDYY8.
@12 HR3 3.
@15 SBP3 3.
@18 DBP3 3.
@21 DX3 3.
@24 DOCFEE3 4.
@28 LABFEE3 4.
#4
@4 DATE4 MMDDYY8.
@12 HR4 3.
@15 SBP4 3.
@18 DBP4 3.
@21 DX4 3.
@24 DOCFEE4 4.
@28 LABFEE4 4.;
FORMAT DATE1-DATE4 MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
007
007
0090903198306611007013700300000
009
009
009
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;
PROC PRINT data = HOSP_PATIENTS;
RUN;
and review the output from the PRINT procedure:
Obs | ID | DATE1 | HR1 | SBP1 | DBP1 | DX1 | DOCFEE1 | LABFEE1 | DATE2 | HR2 | SBP2 | DBP2 | DX2 | DOCFEE2 | LABFEE2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 007 | 10/21/1983 | 70 | 120 | 80 | 14 | 40 | 150 | 12/01/1983 | 72 | 130 | 90 | 20 | 50 | 200 |
2 | 009 | 09/03/1983 | 66 | 110 | 70 | 137 | 30 | 0 | . | . | . | . | . | . | . |
3 | 005 | 07/05/1983 | 74 | 140 | 82 | 13 | 90 | 0 | 01/15/1982 | 80 | 180 | 96 | 14 | 200 | 1500 |
Obs | DATE3 | HR3 | SBP3 | DBP3 | DX3 | DOCFEE3 | LABFEE3 | DATE4 | HR4 | SBP4 | DBP4 | DX4 | DOCFEE4 | LABFEE4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | . | . | . | . | . | . | . | . | . | . | . | . | . | |
2 | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
3 | 06/18/1982 | 70 | 170 | 84 | 14 | 80 | 400 | 07/03/1983 | 64 | 140 | 84 | 14 | 80 | 200 |
you can see that the program reads multiple lines of data to create one observation for each subject. That is, the program creates a fat data set, in which all of the data for one subject is contained in one observation.
By the way, you should know by now that it drives me nuts not to see a DATA step closed off with a RUN; statement. Therefore, I've dutifully added the RUN; statement to the program. I also added the PRINT procedure so that we could view the contents of the resulting data set. In general, you should always PRINT your data set after you read it in so that you can verify that SAS read your data in as expected.
G. Computing the Difference Between the First and Last Observation for each Subject
Page 139. If you launch and run the program:
DATA PATIENTS;
INPUT @1 ID $3.
@4 DATE MMDDYY8.
@12 HR 3.
@15 SBP 3.
@18 DBP 3.
@21 DX 3.
@24 DOCFEE 4.
@28 LABFEE 4.;
FORMAT DATE MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
0090903198306611007013700300000
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;
PROC SORT DATA=PATIENTS;
BY ID DATE;
RUN;
DATA FIRST_LAST;
SET PATIENTS;
BY ID;
***Data set PATIENTS is sorted by ID and DATE;
RETAIN FIRST_HR FIRST_SBP FIRST_DBP;
***Omit patients with only one visit;
IF FIRST.ID AND LAST.ID THEN DELETE;
***If it is the first visit assign values to the
retained variables;
IF FIRST.ID THEN DO;
FIRST_HR = HR;
FIRST_SBP = SBP;
FIRST_DBP = DBP;
END;
IF LAST.ID THEN DO;
D_HR = HR - FIRST_HR;
D_SBP = SBP - FIRST_SBP;
D_DBP = DBP - FIRST_DBP;
OUTPUT;
END;
RUN;
PROC PRINT DATA= FIRST_LAST NOOBS;
title 'The first approach';
RUN;
you can see that the output displayed in the textbook is incorrect. Here's what the first_last data set really looks like:
The first approach
ID | DATE | HR | SBP | DBP | DX | DOCFEE | LABFEE | FIRST HR | FIRST SBP | FIRST DBP | D HR | D SBP | D DBP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
005 | 07/05/1983 | 74 | 140 | 82 | 13 | 90 | 0 | 80 | 180 | 96 | -6 | -40 | -14 |
007 | 12/01/1983 | 72 | 130 | 90 | 20 | 50 | 200 | 70 | 120 | 80 | 2 | 10 | 10 |
Page 140. The authors should have left well enough alone as their second program doesn't produce the correct differences! Their error is a simple typo, though, that is easy to correct. The assignment statement for D_DBP:
D_DBP = DBP - LAG(SBP);
should instead be:
D_DBP = DBP - LAG(DBP);
If you launch and run the corrected program:
DATA PATIENTS;
INPUT @1 ID $3.
@4 DATE MMDDYY8.
@12 HR 3.
@15 SBP 3.
@18 DBP 3.
@21 DX 3.
@24 DOCFEE 4.
@28 LABFEE 4.;
FORMAT DATE MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
0090903198306611007013700300000
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;
PROC SORT DATA=PATIENTS;
BY ID DATE;
RUN;
DATA FIRST_LAST2;
SET PATIENTS;
BY ID;
***Data set PATIENTS is sorted by ID and DATE;
***Omit patients with only one visit;
IF FIRST.ID AND LAST.ID THEN DELETE;
***If it is the first or last visit execute the LAG
function;
IF FIRST.ID OR LAST.ID THEN DO;
D_HR = HR - LAG(HR);
D_SBP = SBP - LAG(SBP);
D_DBP = DBP - LAG(DBP);
END;
IF LAST.ID THEN OUTPUT;
RUN;
PROC PRINT DATA = FIRST_LAST2 NOOBS;
title 'The second approach';
RUN
you can see that their second approach produces the same first_last data set:
The second approach
ID | DATE | HR | SBP | DBP | DX | DOCFEE | LABFEE | D_HR | D_SBP | D_DBP |
---|---|---|---|---|---|---|---|---|---|---|
005 | 07/05/183 | 74 | 140 | 82 | 13 | 90 | 0 | -6 | -40 | -14 |
007 | 12/01/1983 | 72 | 130 | 90 | 20 | 50 | 200 | 2 | 10 | 10 |
H. Computing Frequencies on Longitudinal Data Sets
Page 140. If you launch and run the program:
PROC FREQ DATA=PATIENTS ORDER=FREQ;
TITLE "Diagnoses in Decreasing Frequency Order";
TABLES DX;
RUN;
you can see that indeed the diagnoses are listed in frequency order from the most common diagnosis to the least:
Diagnoses in Decreasing Frequency Order The FREQ Procedure
The FREQ Procedure
DX | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
14 | 4 | 57.14 | 4 | 57.14 |
13 | 1 | 14.29 | 5 | 71.43 |
20 | 1 | 14.29 | 6 | 85.71 |
137 | 1 | 14.25 | 7 | 100.00 |
Page 141. Here's what the output looks like when you run the code to count a diagnosis only once for a given patient:
Diagnoses in Decreasing Frequency Order The FREQ Procedure
The FREQ Procedure
DX | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
14 | 2 | 40.00 | 2 | 40.00 |
13 | 1 | 20.00 | 3 | 60.00 |
20 | 1 | 20.00 | 4 | 80.00 |
137 | 1 | 20.00 | 5 | 100.00 |
I. Creating Summary Data Sets with PROC MEANS or PROC SUMMARY
Page 152. In the last paragraph, the authors suggest that it would be okay not to specify variable names in the OUTPUT statement. I wholeheartedly agree with the authors that it is probably a bad idea. (Soapbox alert!) Your number one job as a SAS programmer should be to write code that is clear not only to you but to anyone else who reads your code... or the output that your code generates.
J. Outputting Statistics Other than Means
Page 153. I like the use of the AUTONAME option as a way of minimizing the potential confusion caused by not specifying variable names in the OUTPUT statement.
26.2 - Summary
26.2 - SummaryIn this lesson, we revisited a number of techniques that we previously learned so that we can use them when working with longitudinal data.
The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work. Then, you can use the techniques to analyze your own longitudinal data!