26.1 - Lesson Notes

B. Processing Date Variables

If you are pressed for time, and you feel really comfortable with handling dates in SAS, you might opt to skip reading this section. It truly should be review for you. On the other hand, it might not hurt to see the review of the YRDIF function, among other things.

Page 127. Don't forget that the form of the date constant is always 'DDMONYYYY'd — '01JAN2005'd for example — regardless of the formats that you are using with your dates.
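
As a quick refresher, here is a minimal sketch of a date constant and the YRDIF function at work. The variable names START, STOP, DAYS and YEARS are my own, chosen just for illustration; they do not come from the textbook.

```sas
DATA _NULL_;
   /* A date constant: quoted DDMONYYYY followed by the letter d */
   START = '01JAN2005'd;
   STOP  = '01JAN2006'd;
   /* SAS stores dates as day counts, so subtraction gives days */
   DAYS  = STOP - START;
   /* YRDIF with the 'ACTUAL' basis gives the difference in years */
   YEARS = YRDIF(START, STOP, 'ACTUAL');
   /* The formats only change how the dates are displayed in the log */
   PUT START= DATE9. STOP= MMDDYY10. DAYS= YEARS=;
RUN;
```

If you run it, the log should show the same kind of date displayed two different ways, which is the point: the constant is always written as 'DDMONYYYY'd no matter which format you later attach.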

Page 128. I'll emphasize just what the authors say at the bottom of the page.... the formats of your dates need not match the informats that you used to read in your dates. That is, you could use the mmddyy8. informat to read in your dates, but the date9. format to display your dates.
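
Here is a small sketch of that idea, assuming a made-up data set DATE_DEMO with a single date variable VISIT (neither name is from the textbook). The dates are read with the mmddyy8. informat but displayed with the date9. format.

```sas
DATA DATE_DEMO;
   /* Read the raw dates with the MMDDYY8. informat */
   INPUT @1 VISIT MMDDYY8.;
   /* Display them with a different format, DATE9. */
   FORMAT VISIT DATE9.;
DATALINES;
10211983
12011983
;
RUN;

PROC PRINT DATA=DATE_DEMO;
RUN;
```

The printed dates should appear as 21OCT1983 and 01DEC1983, even though they were typed in as 10211983 and 12011983.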

D. Longitudinal Data

Page 130. Unless you really study this program and review the contents of the resulting data set, it might not be obvious what is going on with the structure of the data set. If you launch and run the program:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 

DATA HOSP_PATIENTS;
   INPUT #1 
      @1  ID           $3.
      @4  DATE1   MMDDYY8. 
      @12 HR1           3.
      @15 SBP1          3.
      @18 DBP1          3.
      @21 DX1           3.
      @24 DOCFEE1       4.
      @28 LABFEE1       4.
         #2
      @4  DATE2   MMDDYY8. 
      @12 HR2           3.
      @15 SBP2          3.
      @18 DBP2          3.
      @21 DX2           3.
      @24 DOCFEE2       4.
      @28 LABFEE2       4.
         #3
      @4  DATE3   MMDDYY8. 
      @12 HR3           3.
      @15 SBP3          3.
      @18 DBP3          3.
      @21 DX3           3.
      @24 DOCFEE3       4.
      @28 LABFEE3       4.
         #4
      @4  DATE4   MMDDYY8. 
      @12 HR4           3.
      @15 SBP4          3.
      @18 DBP4          3.
      @21 DX4           3.
      @24 DOCFEE4       4.
      @28 LABFEE4       4.;
   FORMAT DATE1-DATE4 MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
007
007
0090903198306611007013700300000
009
009
009
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;

PROC PRINT data = HOSP_PATIENTS;
RUN;

and review the output from the PRINT procedure:

Obs ID DATE1 HR1 SBP1 DBP1 DX1 DOCFEE1 LABFEE1 DATE2 HR2 SBP2 DBP2 DX2 DOCFEE2 LABFEE2
1 007 10/21/1983 70 120 80 14 40 150 12/01/1983 72 130 90 20 50 200
2 009 09/03/1983 66 110 70 137 30 0 . . . . . . .
3 005 07/05/1983 74 140 82 13 90 0 01/15/1982 80 180 96 14 200 1500
Obs DATE3 HR3 SBP3 DBP3 DX3 DOCFEE3 LABFEE3 DATE4 HR4 SBP4 DBP4 DX4 DOCFEE4 LABFEE4
1 . . . . . . . . . . . . . .
2 . . . . . . . . . . . . . .
3 06/18/1982 70 170 84 14 80 400 07/03/1983 64 140 84 14 80 200

you can see that the program reads multiple lines of data to create one observation for each subject. That is, the program creates a fat data set, in which all of the data for one subject is contained in one observation.

By the way, you should know by now that it drives me nuts not to see a DATA step closed off with a RUN; statement. Therefore, I've dutifully added the RUN; statement to the program. I also added the PRINT procedure so that we could view the contents of the resulting data set. In general, you should always PRINT your data set after you read it in so that you can verify that SAS read your data in as expected.

G. Computing the Difference Between the First and Last Observation for each Subject

Page 139. If you launch and run the program:

DATA PATIENTS;
   INPUT @1  ID          $3.
         @4  DATE   MMDDYY8.
         @12 HR           3.
         @15 SBP          3.
         @18 DBP          3.
         @21 DX           3.
         @24 DOCFEE       4.
         @28 LABFEE       4.;
   FORMAT DATE MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
0090903198306611007013700300000
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;

PROC SORT DATA=PATIENTS;
   BY ID DATE;
RUN;

DATA FIRST_LAST;
   SET PATIENTS;
   BY ID;
   ***Data set PATIENTS is sorted by ID and DATE;
   RETAIN FIRST_HR FIRST_SBP FIRST_DBP;
   ***Omit patients with only one visit;
   IF FIRST.ID AND LAST.ID THEN DELETE;
   ***If it is the first visit assign values to the
      retained variables;
   IF FIRST.ID THEN DO;
      FIRST_HR = HR;
      FIRST_SBP = SBP;
      FIRST_DBP = DBP;
   END;
   IF LAST.ID THEN DO;
      D_HR = HR - FIRST_HR;
      D_SBP = SBP - FIRST_SBP;
      D_DBP = DBP - FIRST_DBP;
      OUTPUT;
   END;
RUN;

PROC PRINT DATA=FIRST_LAST NOOBS;
   TITLE 'The first approach';
RUN;

you can see that the output displayed in the textbook is incorrect. Here's what the first_last data set really looks like:

The first approach

ID DATE HR SBP DBP DX DOCFEE LABFEE FIRST_HR FIRST_SBP FIRST_DBP D_HR D_SBP D_DBP
005 07/05/1983 74 140 82 13 90 0 80 180 96 -6 -40 -14
007 12/01/1983 72 130 90 20 50 200 70 120 80 2 10 10

Page 140. The authors should have left well enough alone as their second program doesn't produce the correct differences! Their error is a simple typo, though, that is easy to correct. The assignment statement for D_DBP:

D_DBP = DBP - LAG(SBP);

should instead be:

D_DBP = DBP - LAG(DBP);

If you launch and run the corrected program:

DATA PATIENTS;
   INPUT @1  ID          $3.
         @4  DATE   MMDDYY8.
         @12 HR           3.
         @15 SBP          3.
         @18 DBP          3.
         @21 DX           3.
         @24 DOCFEE       4.
         @28 LABFEE       4.;
   FORMAT DATE MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
0090903198306611007013700300000
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;
 
PROC SORT DATA=PATIENTS;
   BY ID DATE;
RUN;

DATA FIRST_LAST2;
   SET PATIENTS;
   BY ID;
   ***Data set PATIENTS is sorted by ID and DATE;
   ***Omit patients with only one visit;
   IF FIRST.ID AND LAST.ID THEN DELETE;
   ***If it is the first or last visit execute the LAG
      function;
   IF FIRST.ID OR LAST.ID THEN DO;
      D_HR = HR - LAG(HR);
      D_SBP = SBP - LAG(SBP);
      D_DBP = DBP - LAG(DBP);
   END;
   IF LAST.ID THEN OUTPUT;
RUN;

PROC PRINT DATA=FIRST_LAST2 NOOBS;
   TITLE 'The second approach';
RUN;

you can see that their second approach produces the same differences as the first approach (the data set simply lacks the retained FIRST_ variables):

The second approach

ID DATE HR SBP DBP DX DOCFEE LABFEE D_HR D_SBP D_DBP
005 07/05/1983 74 140 82 13 90 0 -6 -40 -14
007 12/01/1983 72 130 90 20 50 200 2 10 10
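
If it isn't obvious why the second approach works, it may help to remember that the LAG function returns the value from the last time the function was executed, not necessarily the value from the previous observation. Here is a minimal sketch of that behavior. The data set LAG_DEMO and the variables X and PREV are my own invention, not the authors'.

```sas
DATA LAG_DEMO;
   INPUT X @@;
   /* Execute LAG only on the odd-numbered observations */
   IF MOD(_N_, 2) = 1 THEN PREV = LAG(X);
DATALINES;
1 2 3 4 5
;
RUN;

PROC PRINT DATA=LAG_DEMO NOOBS;
   TITLE 'LAG executed conditionally';
RUN;
```

PREV should be missing for the first observation, 1 for the third observation, and 3 for the fifth observation, because each call to LAG hands back the X that was stored at the previous call. That is exactly what happens in the authors' program: when LAG executes at a patient's last visit, it returns the value stored at that patient's first visit.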

H. Computing Frequencies on Longitudinal Data Sets

Page 140. If you launch and run the program:

  PROC FREQ DATA=PATIENTS ORDER=FREQ;
	TITLE "Diagnoses in Decreasing Frequency Order";
	TABLES DX;
 RUN;

you can see that indeed the diagnoses are listed in frequency order from the most common diagnosis to the least:

Diagnoses in Decreasing Frequency Order

The FREQ Procedure

DX Frequency Percent Cumulative Frequency Cumulative Percent
14 4 57.14 4 57.14
13 1 14.29 5 71.43
20 1 14.29 6 85.71
137 1 14.29 7 100.00

Page 141. Here's what the output looks like when you run the code to count a diagnosis only once for a given patient:

Diagnoses in Decreasing Frequency Order

The FREQ Procedure

DX Frequency Percent Cumulative Frequency Cumulative Percent
14 2 40.00 2 40.00
13 1 20.00 3 60.00
20 1 20.00 4 80.00
137 1 20.00 5 100.00
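
The textbook's code isn't reproduced in these notes, but here is one way you might get such a table. This is just a sketch that assumes the PATIENTS data set from Section G is available; the data set name DIAG and the use of the NODUPKEY option are my own choices and may differ from what the authors do.

```sas
/* Keep just one record for each patient-diagnosis combination */
PROC SORT DATA=PATIENTS OUT=DIAG NODUPKEY;
   BY ID DX;
RUN;

/* Now each diagnosis is counted at most once per patient */
PROC FREQ DATA=DIAG ORDER=FREQ;
   TITLE "Diagnoses in Decreasing Frequency Order";
   TABLES DX;
RUN;
```

Writing the de-duplicated records to a new data set with OUT= leaves the original PATIENTS data set untouched.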

I. Creating Summary Data Sets with PROC MEANS or PROC SUMMARY

Page 152. In the last paragraph, the authors point out that it is possible to omit the variable names in the OUTPUT statement. I wholeheartedly agree with the authors that doing so is probably a bad idea. (Soapbox alert!) Your number one job as a SAS programmer should be to write code that is clear not only to you but to anyone else who reads your code... or the output that your code generates.
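
To make the point concrete, here is a sketch of an OUTPUT statement that names every variable it creates. It assumes the PATIENTS data set from Section G; the output data set name PATIENT_MEANS and the variable names MEAN_HR, MEAN_SBP and MEAN_DBP are hypothetical.

```sas
PROC MEANS DATA=PATIENTS NOPRINT NWAY;
   CLASS ID;
   VAR HR SBP DBP;
   /* Name each mean explicitly, in the same order as the VAR list */
   OUTPUT OUT=PATIENT_MEANS MEAN=MEAN_HR MEAN_SBP MEAN_DBP;
RUN;

PROC PRINT DATA=PATIENT_MEANS;
RUN;
```

Anyone reading the resulting data set can tell at a glance that MEAN_HR is a mean and not the original HR value.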

J. Outputting Statistics Other than Means

Page 153. I like the use of the AUTONAME option as a way of minimizing the potential confusion caused by not specifying variable names in the OUTPUT statement.
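
Here is a sketch of what that might look like, again assuming the PATIENTS data set from Section G. The output data set name PATIENT_STATS is hypothetical.

```sas
PROC MEANS DATA=PATIENTS NOPRINT NWAY;
   CLASS ID;
   VAR HR SBP DBP;
   /* AUTONAME builds names of the form variable_statistic */
   OUTPUT OUT=PATIENT_STATS MEAN= MEDIAN= / AUTONAME;
RUN;

PROC PRINT DATA=PATIENT_STATS;
RUN;
```

The resulting data set should contain variables such as HR_Mean and HR_Median, so you get self-documenting names without having to type each one yourself.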