26.1 - Lesson Notes

B. Processing Date Variables

If you are pressed for time, and you feel really comfortable with handling dates in SAS, you might opt to skip reading this section. It truly should be reviewed for you. On the other hand, it might not hurt to see the review of the YRDIF function, among other things.

Page 127. Don't forget that the form of the date constant is always 'DDMONYYYY'd — '01JAN2005'd for example — regardless of the formats that you are using with your dates.

Page 128. I'll emphasize just what the authors say at the bottom of the page.... the formats of your dates need not match the informats that you used to read in your dates. That is, you could use the mmddyy8. informat to read in your dates, but the date9. format to display your dates.

D. Longitudinal Data

Page 130. Unless you really study this program and review the contents of the resulting data set, it might not be obvious what is going on with the structure of the data set. If you launch and run the program:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 

DATA HOSP_PATIENTS;
   INPUT #1 
      @1  ID           $3.
      @4  DATE1   MMDDYY8. 
      @12 HR1           3.
      @15 SBP1          3.
      @18 DBP1          3.
      @21 DX1           3.
      @24 DOCFEE1       4.
      @28 LABFEE1       4.
         #2
      @4  DATE2   MMDDYY8. 
      @12 HR2           3.
      @15 SBP2          3.
      @18 DBP2          3.
      @21 DX2           3.
      @24 DOCFEE2       4.
      @28 LABFEE2       4.
         #3
      @4  DATE3   MMDDYY8. 
      @12 HR3           3.
      @15 SBP3          3.
      @18 DBP3          3.
      @21 DX3           3.
      @24 DOCFEE3       4.
      @28 LABFEE3       4.
         #4
      @4  DATE4   MMDDYY8. 
      @12 HR4           3.
      @15 SBP4          3.
      @18 DBP4          3.
      @21 DX4           3.
      @24 DOCFEE4       4.
      @28 LABFEE4       4.;
   FORMAT DATE1-DATE4 MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
007
007
0090903198306611007013700300000
009
009
009
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;

PROC PRINT data = HOSP_PATIENTS;
RUN;

and review the output from the PRINT procedure:

Obs	ID	DATE1	HR1	SBP1	DBP1	DX1	DOCFEE1	LABFEE1	DATE2	HR2	SBP2	DBP2	DX2	DOCFEE2	LABFEE2
1	007	10/21/1983	70	120	80	14	40	150	12/01/1983	72	130	90	20	50	200
2	009	09/03/1983	66	110	70	137	30	0	.	.	.	.	.	.	.
3	005	07/05/1983	74	140	82	13	90	0	01/15/1982	80	180	96	14	200	1500

Obs	DATE3	HR3	SBP3	DBP3	DX3	DOCFEE3	LABFEE3	DATE4	HR4	SBP4	DBP4	DX4	DOCFEE4	LABFEE4
1	.	.	.	.		.	.	.	.	.	.	.	.	.
2	.	.	.	.	.	.	.	.	.	.	.	.	.	.
3	06/18/1982	70	170	84	14	80	400	07/03/1983	64	140	84	14	80	200

you can see that the program reads multiple lines of data to create one observation for each subject. That is, the program creates a fat data set, in which all of the data for one subject is contained in one observation.

By the way, you should know by now that it drives me nuts not to see a DATA step closed off with a RUN; statement. Therefore, I've dutifully added the RUN; statement to the program. I also added the PRINT procedure so that we could view the contents of the resulting data set. In general, you should always PRINT your data set after you read it in so that you can verify that SAS read your data in as expected.

G. Computing the Difference Between the First and Last Observation for each Subject

Page 139. If you launch and run the program:

  DATA PATIENTS;
INPUT @1  ID          $3.
		@4  DATE   MMDDYY8. 
		@12 HR           3.
		@15 SBP          3.
		@18 DBP          3.
		@21 DX           3.
		@24 DOCFEE       4.
		@28 LABFEE       4.;
FORMAT DATE MMDDYY10.;
DATALINES;
0071021198307012008001400400150
0071201198307213009002000500200
0090903198306611007013700300000
0050705198307414008201300900000
0050115198208018009601402001500
0050618198207017008401400800400
0050703198306414008401400800200
;
RUN;

PROC SORT DATA=PATIENTS;
BY ID DATE;
RUN;

DATA FIRST_LAST;
SET PATIENTS;
BY ID;
***Data set PATIENTS is sorted by ID and DATE;
RETAIN FIRST_HR FIRST_SBP FIRST_DBP;
***Omit patients with only one visit;
IF FIRST.ID AND LAST.ID THEN DELETE;
***If it is the first visit assign values to the
	retained variables;
IF FIRST.ID THEN DO;
	FIRST_HR = HR;
	FIRST_SBP = SBP;
	FIRST_DBP = DBP;
END;
IF LAST.ID THEN DO;
	D_HR = HR - FIRST_HR;
	D_SBP = SBP - FIRST_SBP;
	D_DBP = DBP - FIRST_DBP;
	OUTPUT;
END;
RUN;

PROC PRINT DATA= FIRST_LAST NOOBS;
title 'The first approach';
RUN;

you can see that the output displayed in the textbook is incorrect. Here's what the first_last data set really looks like:

The first approach

ID	DATE	HR	SBP	DBP	DX	DOCFEE	LABFEE	FIRST HR	FIRST SBP	FIRST DBP	D HR	D SBP	D DBP
005	07/05/1983	74	140	82	13	90	0	80	180	96	-6	-40	-14
007	12/01/1983	72	130	90	20	50	200	70	120	80	2	10	10

Page 140. The authors should have left well enough alone as their second program doesn't produce the correct differences! Their error is a simple typo, though, that is easy to correct. The assignment statement for D_DBP:

D_DBP = DBP - LAG(SBP);

should instead be:

D_DBP = DBP - LAG(DBP);

If you launch and run the corrected program:

  DATA PATIENTS;
	INPUT @1  ID          $3.
		  @4  DATE   MMDDYY8. 
		  @12 HR           3.
		  @15 SBP          3.
		  @18 DBP          3.
		  @21 DX           3.
		  @24 DOCFEE       4.
		  @28 LABFEE       4.;
	FORMAT DATE MMDDYY10.;
 DATALINES;
 0071021198307012008001400400150
 0071201198307213009002000500200
 0090903198306611007013700300000
 0050705198307414008201300900000
 0050115198208018009601402001500
 0050618198207017008401400800400
 0050703198306414008401400800200
 ;
 RUN;
 
 PROC SORT DATA=PATIENTS;
	BY ID DATE;
 RUN;
 
 DATA FIRST_LAST2;
	SET PATIENTS;
	BY ID;
	***Data set PATIENTS is sorted by ID and DATE;
	***Omit patients with only one visit;
	IF FIRST.ID AND LAST.ID THEN DELETE;
	***If it is the first or last visit execute the LAG
	   function;
	IF FIRST.ID OR LAST.ID THEN DO;
	   D_HR = HR - LAG(HR);
	   D_SBP = SBP - LAG(SBP);
	   D_DBP = DBP - LAG(DBP);
	END;
	IF LAST.ID THEN OUTPUT;
 RUN;

 PROC PRINT DATA = FIRST_LAST2 NOOBS;
	 title 'The second approach';
 RUN

you can see that their second approach produces the same first_last data set:

The second approach

ID	DATE	HR	SBP	DBP	DX	DOCFEE	LABFEE	D_HR	D_SBP	D_DBP
005	07/05/183	74	140	82	13	90	0	-6	-40	-14
007	12/01/1983	72	130	90	20	50	200	2	10	10

H. Computing Frequencies on Longitudinal Data Sets

Page 140. If you launch and run the program:

  PROC FREQ DATA=PATIENTS ORDER=FREQ;
	TITLE "Diagnoses in Decreasing Frequency Order";
	TABLES DX;
 RUN;

you can see that indeed the diagnoses are listed in frequency order from the most common diagnosis to the least:

Diagnoses in Decreasing Frequency Order The FREQ Procedure

The FREQ Procedure

DX	Frequency	Percent	Cumulative Frequency	Cumulative Percent
14	4	57.14	4	57.14
13	1	14.29	5	71.43
20	1	14.29	6	85.71
137	1	14.25	7	100.00

Page 141. Here's what the output looks like when you run the code to count a diagnosis only once for a given patient:

Diagnoses in Decreasing Frequency Order The FREQ Procedure

The FREQ Procedure

DX	Frequency	Percent	Cumulative Frequency	Cumulative Percent
14	2	40.00	2	40.00
13	1	20.00	3	60.00
20	1	20.00	4	80.00
137	1	20.00	5	100.00

I. Creating Summary Data Sets with PROC MEANS or PROC SUMMARY

Page 152. In the last paragraph, the authors suggest that it would be okay not to specify variable names in the OUTPUT statement. I wholeheartedly agree with the authors that it is probably a bad idea. (Soapbox alert!) Your number one job as a SAS programmer should be to write code that is clear not only to you but to anyone else who reads your code... or the output that your code generates.

J. Outputting Statistics Other than Means

Page 153. I like the use of the AUTONAME option as a way of minimizing the potential confusion caused by not specifying variable names in the OUTPUT statement.

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility