Lesson 26: Working with Date and Longitudinal Data

Lesson 26: Working with Date and Longitudinal Data

Overview

In this lesson, we return to a number of topics that we already learned about in Part II of this course. Such topics include handling date variables, selecting first and/or last observations, and using the MEANS and SUMMARY procedures to calculate summary statistics. Here, however, our focus will be on using the techniques to analyze longitudinal data, that is, data that are collected over time.

Objectives

Upon completion of this lesson, you should be able to:

Upon completing this lesson, you should be able to do the following:

  • read date values into SAS using appropriate informats
  • tell SAS to display your date values using appropriate formats
  • use dates to calculate new values
  • use the YRDIF function to calculate age
  • understand how longitudinal data can be stored in one observation
  • understand how longitudinal data can be stored in multiple observations
  • use the FIRST.variable and LAST.variable to identify first and last observations, respectively
  • use the LAG function to return the value of a variable from the previous observation
  • use the DIF function to calculate the difference between a variable's value in the current observation and the variable's value in the previous observation
  • use the RETAIN statement and the FIRST.variable and LAST.variable to calculate the difference between the first and last observation for each subject
  • use the FIRST.variable and/or LAST.variable, in conjunction with the FREQ procedure, to compute frequencies on longitudinal data sets
  • create summary data sets using the MEANS or SUMMARY procedure

Textbook Reference

 Chapter 4 of the textbook.


26.1 - Lesson Notes

26.1 - Lesson Notes

B. Processing Date Variables

If you are pressed for time, and you feel really comfortable with handling dates in SAS, you might opt to skip reading this section. It truly should be reviewed for you. On the other hand, it might not hurt to see the review of the YRDIF function, among other things.

Page 127. Don't forget that the form of the date constant is always 'DDMONYYYY'd — '01JAN2005'd for example — regardless of the formats that you are using with your dates.

Page 128. I'll emphasize just what the authors say at the bottom of the page.... the formats of your dates need not match the informats that you used to read in your dates. That is, you could use the mmddyy8. informat to read in your dates, but the date9. format to display your dates.

D. Longitudinal Data

Page 130. Unless you really study this program and review the contents of the resulting data set, it might not be obvious what is going on with the structure of the data set. If you launch and run the program:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 

DATA HOSP_PATIENTS;
   INPUT #1 
      @1  ID           $3.
      @4  DATE1   MMDDYY8. 
      @12 HR1           3.
      @15 SBP1          3.
      @18 DBP1          3.
      @21 DX1           3.
      @24 DOCFEE1       4.
      @28 LABFEE1       4.
         #2
      @4  DATE2   MMDDYY8. 
      @12 HR2           3.
      @15 SBP2          3.
      @18 DBP2          3.
      @21 DX2           3.
      @24 DOCFEE2       4.
      @28 LABFEE2       4.
         #3
      @4  DATE3   MMDDYY8. 
      @12 HR3           3.
      @15 SBP3          3.
      @18 DBP3          3.
      @21 DX3           3.
      @24 DOCFEE3       4.
      @28 LABFEE3       4.
         #4
      @4  DATE4   MMDDYY8. 
      @12 HR4           3.
      @15 SBP4          3.
      @18 DBP4          3.
      @21 DX4           3.
      @24 DOCFEE4       4.
      @28 LABFEE4       4.;
   FORMAT DATE1-DATE4 MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
007
007
0090903198306611007013700300000
009
009
009
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;

PROC PRINT data = HOSP_PATIENTS;
RUN;

and review the output from the PRINT procedure:

Obs ID DATE1 HR1 SBP1 DBP1 DX1 DOCFEE1 LABFEE1 DATE2 HR2 SBP2 DBP2 DX2 DOCFEE2 LABFEE2
1 007 10/21/1983 70 120 80 14 40 150 12/01/1983 72 130 90 20 50 200
2 009 09/03/1983 66 110 70 137 30 0 . . . . . . .
3 005 07/05/1983 74 140 82 13 90 0 01/15/1982 80 180 96 14 200 1500
Obs DATE3 HR3 SBP3 DBP3 DX3 DOCFEE3 LABFEE3 DATE4 HR4 SBP4 DBP4 DX4 DOCFEE4 LABFEE4
1 . . . .   . . . . . . . . .
2 . . . . . . . . . . . . . .
3 06/18/1982 70 170 84 14 80 400 07/03/1983 64 140 84 14 80 200

you can see that the program reads multiple lines of data to create one observation for each subject. That is, the program creates a fat data set, in which all of the data for one subject is contained in one observation.

By the way, you should know by now that it drives me nuts not to see a DATA step closed off with a RUN; statement. Therefore, I've dutifully added the RUN; statement to the program. I also added the PRINT procedure so that we could view the contents of the resulting data set. In general, you should always PRINT your data set after you read it in so that you can verify that SAS read your data in as expected.

G. Computing the Difference Between the First and Last Observation for each Subject

Page 139. If you launch and run the program:

  DATA PATIENTS;
INPUT @1  ID          $3.
		@4  DATE   MMDDYY8. 
		@12 HR           3.
		@15 SBP          3.
		@18 DBP          3.
		@21 DX           3.
		@24 DOCFEE       4.
		@28 LABFEE       4.;
FORMAT DATE MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
0090903198306611007013700300000
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;

PROC SORT DATA=PATIENTS;
BY ID DATE;
RUN;

DATA FIRST_LAST;
SET PATIENTS;
BY ID;
***Data set PATIENTS is sorted by ID and DATE;
RETAIN FIRST_HR FIRST_SBP FIRST_DBP;
***Omit patients with only one visit;
IF FIRST.ID AND LAST.ID THEN DELETE;
***If it is the first visit assign values to the
	retained variables;
IF FIRST.ID THEN DO;
	FIRST_HR = HR;
	FIRST_SBP = SBP;
	FIRST_DBP = DBP;
END;
IF LAST.ID THEN DO;
	D_HR = HR - FIRST_HR;
	D_SBP = SBP - FIRST_SBP;
	D_DBP = DBP - FIRST_DBP;
	OUTPUT;
END;
RUN;

PROC PRINT DATA= FIRST_LAST NOOBS;
title 'The first approach';
RUN;

you can see that the output displayed in the textbook is incorrect. Here's what the first_last data set really looks like:

The first approach

ID DATE HR SBP DBP DX DOCFEE LABFEE FIRST HR FIRST SBP FIRST DBP D HR D SBP D DBP
005 07/05/1983 74 140 82 13 90 0 80 180 96 -6 -40 -14
007 12/01/1983 72 130 90 20 50 200 70 120 80 2 10 10

Page 140. The authors should have left well enough alone as their second program doesn't produce the correct differences! Their error is a simple typo, though, that is easy to correct. The assignment statement for D_DBP:

D_DBP = DBP - LAG(SBP);

should instead be:

D_DBP = DBP - LAG(DBP);

If you launch and run the corrected program:

  DATA PATIENTS;
	INPUT @1  ID          $3.
		  @4  DATE   MMDDYY8. 
		  @12 HR           3.
		  @15 SBP          3.
		  @18 DBP          3.
		  @21 DX           3.
		  @24 DOCFEE       4.
		  @28 LABFEE       4.;
	FORMAT DATE MMDDYY10.;
 DATALINES;
 0071021198307012008001400400150
 0071201198307213009002000500200
 0090903198306611007013700300000
 0050705198307414008201300900000
 0050115198208018009601402001500
 0050618198207017008401400800400
 0050703198306414008401400800200
 ;
 RUN;
 
 PROC SORT DATA=PATIENTS;
	BY ID DATE;
 RUN;
 
 DATA FIRST_LAST2;
	SET PATIENTS;
	BY ID;
	***Data set PATIENTS is sorted by ID and DATE;
	***Omit patients with only one visit;
	IF FIRST.ID AND LAST.ID THEN DELETE;
	***If it is the first or last visit execute the LAG
	   function;
	IF FIRST.ID OR LAST.ID THEN DO;
	   D_HR = HR - LAG(HR);
	   D_SBP = SBP - LAG(SBP);
	   D_DBP = DBP - LAG(DBP);
	END;
	IF LAST.ID THEN OUTPUT;
 RUN;

 PROC PRINT DATA = FIRST_LAST2 NOOBS;
	 title 'The second approach';
 RUN

you can see that their second approach produces the same first_last data set:

The second approach

ID DATE HR SBP DBP DX DOCFEE LABFEE D_HR D_SBP D_DBP
005 07/05/183 74 140 82 13 90 0 -6 -40 -14
007 12/01/1983 72 130 90 20 50 200 2 10 10

H. Computing Frequencies on Longitudinal Data Sets

Page 140. If you launch and run the program:

  PROC FREQ DATA=PATIENTS ORDER=FREQ;
	TITLE "Diagnoses in Decreasing Frequency Order";
	TABLES DX;
 RUN;

you can see that indeed the diagnoses are listed in frequency order from the most common diagnosis to the least:

Diagnoses in Decreasing Frequency Order The FREQ Procedure

The FREQ Procedure

DX Frequency Percent Cumulative Frequency Cumulative Percent
14 4 57.14 4 57.14
13 1 14.29 5 71.43
20 1 14.29 6 85.71
137 1 14.25 7 100.00

Page 141. Here's what the output looks like when you run the code to count a diagnosis only once for a given patient:

Diagnoses in Decreasing Frequency Order The FREQ Procedure

The FREQ Procedure

DX Frequency Percent Cumulative Frequency Cumulative Percent
14 2 40.00 2 40.00
13 1 20.00 3 60.00
20 1 20.00 4 80.00
137 1 20.00 5 100.00

I. Creating Summary Data Sets with PROC MEANS or PROC SUMMARY

Page 152. In the last paragraph, the authors suggest that it would be okay not to specify variable names in the OUTPUT statement. I wholeheartedly agree with the authors that it is probably a bad idea. (Soapbox alert!) Your number one job as a SAS programmer should be to write code that is clear not only to you but to anyone else who reads your code... or the output that your code generates.

J. Outputting Statistics Other than Means

Page 153. I like the use of the AUTONAME option as a way of minimizing the potential confusion caused by not specifying variable names in the OUTPUT statement.


26.2 - Summary

26.2 - Summary

In this lesson, we revisited a number of techniques that we previously learned so that we can use them when working with longitudinal data.

The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work. Then, you can use the techniques to analyze your own longitudinal data!


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility