Lesson 26: Working with Date and Longitudinal Data

Overview

In this lesson, we return to a number of topics that we already learned about in Part II of this course. Such topics include handling date variables, selecting first and/or last observations, and using the MEANS and SUMMARY procedures to calculate summary statistics. Here, however, our focus will be on using the techniques to analyze longitudinal data, that is, data that are collected over time.

Objectives

Upon completion of this lesson, you should be able to:

Upon completing this lesson, you should be able to do the following:

read date values into SAS using appropriate informats
tell SAS to display your date values using appropriate formats
use dates to calculate new values
use the YRDIF function to calculate age
understand how longitudinal data can be stored in one observation
understand how longitudinal data can be stored in multiple observations
use the FIRST.variable and LAST.variable to identify first and last observations, respectively
use the LAG function to return the value of a variable from the previous observation
use the DIF function to calculate the difference between a variable's value in the current observation and the variable's value in the previous observation
use the RETAIN statement and the FIRST.variable and LAST.variable to calculate the difference between the first and last observation for each subject
use the FIRST.variable and/or LAST.variable, in conjunction with the FREQ procedure, to compute frequencies on longitudinal data sets
create summary data sets using the MEANS or SUMMARY procedure

Textbook Reference

Chapter 4 of the textbook.

26.1 - Lesson Notes

B. Processing Date Variables

If you are pressed for time, and you feel really comfortable with handling dates in SAS, you might opt to skip reading this section. It truly should be reviewed for you. On the other hand, it might not hurt to see the review of the YRDIF function, among other things.

Page 127. Don't forget that the form of the date constant is always 'DDMONYYYY'd — '01JAN2005'd for example — regardless of the formats that you are using with your dates.

Page 128. I'll emphasize just what the authors say at the bottom of the page.... the formats of your dates need not match the informats that you used to read in your dates. That is, you could use the mmddyy8. informat to read in your dates, but the date9. format to display your dates.

D. Longitudinal Data

Page 130. Unless you really study this program and review the contents of the resulting data set, it might not be obvious what is going on with the structure of the data set. If you launch and run the program:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 

DATA HOSP_PATIENTS;
   INPUT #1 
      @1  ID           $3.
      @4  DATE1   MMDDYY8. 
      @12 HR1           3.
      @15 SBP1          3.
      @18 DBP1          3.
      @21 DX1           3.
      @24 DOCFEE1       4.
      @28 LABFEE1       4.
         #2
      @4  DATE2   MMDDYY8. 
      @12 HR2           3.
      @15 SBP2          3.
      @18 DBP2          3.
      @21 DX2           3.
      @24 DOCFEE2       4.
      @28 LABFEE2       4.
         #3
      @4  DATE3   MMDDYY8. 
      @12 HR3           3.
      @15 SBP3          3.
      @18 DBP3          3.
      @21 DX3           3.
      @24 DOCFEE3       4.
      @28 LABFEE3       4.
         #4
      @4  DATE4   MMDDYY8. 
      @12 HR4           3.
      @15 SBP4          3.
      @18 DBP4          3.
      @21 DX4           3.
      @24 DOCFEE4       4.
      @28 LABFEE4       4.;
   FORMAT DATE1-DATE4 MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
007
007
0090903198306611007013700300000
009
009
009
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;

PROC PRINT data = HOSP_PATIENTS;
RUN;

and review the output from the PRINT procedure:

Obs	ID	DATE1	HR1	SBP1	DBP1	DX1	DOCFEE1	LABFEE1	DATE2	HR2	SBP2	DBP2	DX2	DOCFEE2	LABFEE2
1	007	10/21/1983	70	120	80	14	40	150	12/01/1983	72	130	90	20	50	200
2	009	09/03/1983	66	110	70	137	30	0	.	.	.	.	.	.	.
3	005	07/05/1983	74	140	82	13	90	0	01/15/1982	80	180	96	14	200	1500

Obs	DATE3	HR3	SBP3	DBP3	DX3	DOCFEE3	LABFEE3	DATE4	HR4	SBP4	DBP4	DX4	DOCFEE4	LABFEE4
1	.	.	.	.		.	.	.	.	.	.	.	.	.
2	.	.	.	.	.	.	.	.	.	.	.	.	.	.
3	06/18/1982	70	170	84	14	80	400	07/03/1983	64	140	84	14	80	200

you can see that the program reads multiple lines of data to create one observation for each subject. That is, the program creates a fat data set, in which all of the data for one subject is contained in one observation.

By the way, you should know by now that it drives me nuts not to see a DATA step closed off with a RUN; statement. Therefore, I've dutifully added the RUN; statement to the program. I also added the PRINT procedure so that we could view the contents of the resulting data set. In general, you should always PRINT your data set after you read it in so that you can verify that SAS read your data in as expected.

G. Computing the Difference Between the First and Last Observation for each Subject

Page 139. If you launch and run the program:

  DATA PATIENTS;
INPUT @1  ID          $3.
		@4  DATE   MMDDYY8. 
		@12 HR           3.
		@15 SBP          3.
		@18 DBP          3.
		@21 DX           3.
		@24 DOCFEE       4.
		@28 LABFEE       4.;
FORMAT DATE MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
0090903198306611007013700300000
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;

PROC SORT DATA=PATIENTS;
BY ID DATE;
RUN;

DATA FIRST_LAST;
SET PATIENTS;
BY ID;
***Data set PATIENTS is sorted by ID and DATE;
RETAIN FIRST_HR FIRST_SBP FIRST_DBP;
***Omit patients with only one visit;
IF FIRST.ID AND LAST.ID THEN DELETE;
***If it is the first visit assign values to the
	retained variables;
IF FIRST.ID THEN DO;
	FIRST_HR = HR;
	FIRST_SBP = SBP;
	FIRST_DBP = DBP;
END;
IF LAST.ID THEN DO;
	D_HR = HR - FIRST_HR;
	D_SBP = SBP - FIRST_SBP;
	D_DBP = DBP - FIRST_DBP;
	OUTPUT;
END;
RUN;

PROC PRINT DATA= FIRST_LAST NOOBS;
title 'The first approach';
RUN;

you can see that the output displayed in the textbook is incorrect. Here's what the first_last data set really looks like:

The first approach

ID	DATE	HR	SBP	DBP	DX	DOCFEE	LABFEE	FIRST HR	FIRST SBP	FIRST DBP	D HR	D SBP	D DBP
005	07/05/1983	74	140	82	13	90	0	80	180	96	-6	-40	-14
007	12/01/1983	72	130	90	20	50	200	70	120	80	2	10	10

Page 140. The authors should have left well enough alone as their second program doesn't produce the correct differences! Their error is a simple typo, though, that is easy to correct. The assignment statement for D_DBP:

D_DBP = DBP - LAG(SBP);

should instead be:

D_DBP = DBP - LAG(DBP);

If you launch and run the corrected program:

  DATA PATIENTS;
	INPUT @1  ID          $3.
		  @4  DATE   MMDDYY8. 
		  @12 HR           3.
		  @15 SBP          3.
		  @18 DBP          3.
		  @21 DX           3.
		  @24 DOCFEE       4.
		  @28 LABFEE       4.;
	FORMAT DATE MMDDYY10.;
 DATALINES;
 0071021198307012008001400400150
 0071201198307213009002000500200
 0090903198306611007013700300000
 0050705198307414008201300900000
 0050115198208018009601402001500
 0050618198207017008401400800400
 0050703198306414008401400800200
 ;
 RUN;
 
 PROC SORT DATA=PATIENTS;
	BY ID DATE;
 RUN;
 
 DATA FIRST_LAST2;
	SET PATIENTS;
	BY ID;
	***Data set PATIENTS is sorted by ID and DATE;
	***Omit patients with only one visit;
	IF FIRST.ID AND LAST.ID THEN DELETE;
	***If it is the first or last visit execute the LAG
	   function;
	IF FIRST.ID OR LAST.ID THEN DO;
	   D_HR = HR - LAG(HR);
	   D_SBP = SBP - LAG(SBP);
	   D_DBP = DBP - LAG(DBP);
	END;
	IF LAST.ID THEN OUTPUT;
 RUN;

 PROC PRINT DATA = FIRST_LAST2 NOOBS;
	 title 'The second approach';
 RUN

you can see that their second approach produces the same first_last data set:

The second approach

ID	DATE	HR	SBP	DBP	DX	DOCFEE	LABFEE	D_HR	D_SBP	D_DBP
005	07/05/183	74	140	82	13	90	0	-6	-40	-14
007	12/01/1983	72	130	90	20	50	200	2	10	10

H. Computing Frequencies on Longitudinal Data Sets

Page 140. If you launch and run the program:

  PROC FREQ DATA=PATIENTS ORDER=FREQ;
	TITLE "Diagnoses in Decreasing Frequency Order";
	TABLES DX;
 RUN;

you can see that indeed the diagnoses are listed in frequency order from the most common diagnosis to the least:

Diagnoses in Decreasing Frequency Order The FREQ Procedure

The FREQ Procedure

DX	Frequency	Percent	Cumulative Frequency	Cumulative Percent
14	4	57.14	4	57.14
13	1	14.29	5	71.43
20	1	14.29	6	85.71
137	1	14.25	7	100.00

Page 141. Here's what the output looks like when you run the code to count a diagnosis only once for a given patient:

Diagnoses in Decreasing Frequency Order The FREQ Procedure

The FREQ Procedure

DX	Frequency	Percent	Cumulative Frequency	Cumulative Percent
14	2	40.00	2	40.00
13	1	20.00	3	60.00
20	1	20.00	4	80.00
137	1	20.00	5	100.00

I. Creating Summary Data Sets with PROC MEANS or PROC SUMMARY

Page 152. In the last paragraph, the authors suggest that it would be okay not to specify variable names in the OUTPUT statement. I wholeheartedly agree with the authors that it is probably a bad idea. (Soapbox alert!) Your number one job as a SAS programmer should be to write code that is clear not only to you but to anyone else who reads your code... or the output that your code generates.

J. Outputting Statistics Other than Means

Page 153. I like the use of the AUTONAME option as a way of minimizing the potential confusion caused by not specifying variable names in the OUTPUT statement.

26.2 - Summary

In this lesson, we revisited a number of techniques that we previously learned so that we can use them when working with longitudinal data.

The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work. Then, you can use the techniques to analyze your own longitudinal data!

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility