18.5 - Creating Samples

Because a DO loop executes statements iteratively, it provides an easy way to select a sample of observations from a large data set. Let's take a look at an example!

Example 18.12

The following program uses an iterative DO loop and the SET statement's POINT= option to select every 100th observation from the permanent data set called stat481.log11, which contains 8,624 observations:

OPTIONS LS = 72 PS = 34 NODATE NONUMBER;
LIBNAME stat481 'C:\yourdrivename\Stat481WC\06doloops\sasndata';
 
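DATA sample;
    DO i = 100 to 8600 by 100;           /* i = 100, 200, ..., 8600            */
        set stat481.log11 point = i;     /* read observation number i directly */
        output;                          /* write it to the sample data set    */
    END;
    stop;                                /* required with POINT= to end step   */
RUN;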
PROC PRINT data = sample NOOBS;
    title 'Subset of Logged Observations for Hospital 11';
RUN;

Subset of Logged Observations for Hospital 11
SUBJ	V_TYPE	V_DATE	FORM_CD
110004	0	04/22/93	prior
110027	3	01/25/94	med
110027	36	08/27/96	cmed
110029	12	09/27/94	purg
110029	42	04/01/97	sympts
110039	18	06/06/95	void
110040	1	01/24/94	void
110040	39	02/18/97	cmed
110045	15	05/09/95	symph
110049	0	01/25/94	sympts
110049	30	07/23/96	phytrt
110051	12	12/13/94	void
110052	3	05/10/94	void
110052	55	05/06/97	close
110053	24	02/06/96	cmed
110055	6	08/30/94	sympts
110057	0	03/15/94	preg
110057	27	06/26/96	symph
110058	12	04/11/95	med
110059	0	03/18/94	phs
110059	24	03/19/96	void
110062	0	03/31/94	preg
110062	24	05/14/96	purg
110066	0	04/12/94	purg
110067	3	08/04/94	purg
110068	3	08/30/94	void
110070	3	08/30/94	phytrt
110074	0	06/16/94	urod
110075	15	10/31/95	med
110076	3	10/04/94	void
110077	21	05/10/96	med
110078	12	10/03/95	diet
110080	0	07/07/94	sympts
110080	24	06/25/96	sympts
110081	18	02/09/96	void
110082	12	08/22/95	phytrt
110083	0	02/10/95	ucult
110085	0	10/11/94	phytrt
110086	18	04/30/96	diet
110087	3	05/30/95	phytrt
110088	0	03/07/95	excl2
110091	12	04/02/96	void
110092	6	09/19/95	void
110093	12	03/05/96	med
110094	9	03/26/96	purg
110095	6	12/05/95	phytrt
110096	6	12/19/95	urn
110097	21	03/18/97	med
110100	0	07/14/95	def1
110100	21	04/22/97	void
110104	1	10/23/95	symph
110107	0	09/22/95	urod
110110	0	11/10/95	prior
110111	0	10/17/95	prior
110112	15	01/21/97	phytrt
110114	0	11/10/95	diet
110115	0	12/01/95	preg
110117	0	12/11/95	void
110118	12	01/21/97	purg
110120	0	01/09/96	excl2
110121	9	09/03/96	cmed
110123	0	01/23/96	back
110124	0	02/05/96	urn
110125	9	12/10/96	phytrt
110127	1	03/27/96	purg
110128	6	09/17/96	void
110131	3	06/04/96	med
110134	0	04/15/96	hem
110135	1	05/16/96	med
110136	9	01/21/97	void
110138	6	12/03/96	med
110140	0	05/21/96	void
110142	0	06/04/96	prior
110144	0	06/07/96	hmrpt
110145	6	01/14/97	void
110147	3	09/17/96	void
110149	0	06/28/96	urod
110152	0	07/19/96	incl
110154	0	07/22/96	void
110155	6	01/28/97	qul
110158	0	08/26/96	cmed
110161	0	10/01/96	prior
110163	6	03/18/97	diet
110165	3	01/14/97	cmed
110167	0	11/19/96	purg
110171	0	01/21/97	hem

Let's work our way through the code. The DO statement tells SAS to start at 100, increase i by 100 each time, and end at 8600. That is, SAS will execute the DO loop when the index variable i equals 100, 200, 300, ..., 8600.
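If you'd like to convince yourself of how the index variable behaves, the following minimal sketch runs the same DO statement without reading any data. The counter variable n is our own hypothetical addition for illustration. We'd expect the log to report 86 iterations, and the value of i after the loop to be 8700, since SAS increments the index one last time before finding that it exceeds the stop value:

```sas
DATA _NULL_;
    n = 0;
    DO i = 100 to 8600 by 100;
        n = n + 1;                 /* count each pass through the loop */
    END;
    put 'Number of iterations: ' n;
    put 'Value of i after the loop: ' i;
RUN;
```

Because the DATA step is named _NULL_, no data set is created. The PUT statements simply write the two values to the log.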

Now the SET statement contains an option that we've not seen before, namely the POINT= option. The POINT= option tells SAS not to read the stat481.log11 data set sequentially, as is done by default, but rather to read directly the observation whose number is stored in the variable named in the POINT= option. For example, when i = 100, SAS reads the 100th observation in the stat481.log11 data set. And when i = 3200, SAS reads the 3200th observation in the stat481.log11 data set. Note that the POINT= variable is temporary: SAS does not write it to the output data set, which is why i does not appear in the printed output.
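To see the POINT= option on its own, without a DO loop, here is a minimal sketch that fetches just the 3200th observation. It assumes the stat481 libref has already been assigned as in the program above. The data set name one_obs and the variable name obsnum are hypothetical names chosen for illustration:

```sas
DATA one_obs;
    obsnum = 3200;                       /* observation number to fetch      */
    set stat481.log11 point = obsnum;    /* read that observation directly   */
    output;                              /* write it to one_obs              */
    stop;                                /* end the DATA step explicitly     */
RUN;
```

Note that the POINT= option must name a variable, not a number, so we assign the observation number to obsnum before the SET statement executes.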

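The OUTPUT statement, of course, tells SAS to write the selected observation to the output data set. If we moved the OUTPUT statement out of the DO loop, placing it after the END statement, the resulting data set would contain only one observation, that is, the last observation read into the program data vector. (If we omitted the OUTPUT statement altogether, the data set would contain no observations at all, because the STOP statement ends the DATA step before SAS reaches the automatic output at the bottom of the step.)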
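To see why the placement of the OUTPUT statement matters, consider this sketch, which is the same program with the OUTPUT statement moved after the END statement. It assumes the stat481 libref is already assigned, and the data set name last_only is a hypothetical name chosen for illustration:

```sas
DATA last_only;
    DO i = 100 to 8600 by 100;
        set stat481.log11 point = i;     /* each read overwrites the last */
    END;
    output;                              /* executes once, after the loop */
    stop;
RUN;
```

Here the loop still reads all 86 observations, but each one replaces the previous one in the program data vector. Only the 8600th observation is still there when the OUTPUT statement finally executes, so we'd expect last_only to contain just that one observation.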

The STOP statement, which is new to us, is necessary because we are using the POINT= option. As you know, the DATA step by default continues to read observations until it reaches the end-of-file marker in the input data. Because the POINT= option reads only specified observations, SAS cannot read an end-of-file marker as it would if the file were being read sequentially. The STOP statement tells SAS to stop processing the current DATA step immediately and to resume processing statements after the end of the current DATA step. It is the use of the STOP statement, therefore, that keeps us from sending SAS into the no man's land of continuous looping.

Now, download and save the stat481.log11 data set in a convenient location on your computer. Launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Then, run the program and review the output from the PRINT procedure to see the selected observations. You shouldn't be surprised to see that the sample data set contains 86 observations:

PROC PRINT data = sample NOOBS;
NOTE: Writing HTML Body file: sashtml1.htm
       title 'Subset of Logged Observations for Hospital 11';
RUN;
NOTE: There were 86 observations read from the data set WORK.SAMPLE.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.64 seconds
      cpu time            0.29 seconds

That makes sense, as the iterative DO loop executes 8600 divided by 100, or 86, times.
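The stop value of 8600 works only because we happen to know that the data set contains 8,624 observations. As a sketch of one way to avoid hardcoding it, the SET statement's NOBS= option, which this lesson has not covered, stores the total number of observations in a temporary variable. The names sample2 and total are hypothetical, and the stat481 libref is assumed to be assigned already:

```sas
DATA sample2;
    DO i = 100 to total by 100;                    /* total is known at compile time  */
        set stat481.log11 point = i nobs = total;  /* NOBS= supplies the obs count    */
        output;
    END;
    stop;
RUN;
```

SAS assigns the NOBS= variable its value when the DATA step is compiled, so total is available to the DO statement even though the SET statement appears later in the step. With 8,624 observations, the last index value that does not exceed total is 8600, so we'd again expect 86 observations.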

Note! It is important to emphasize that the method we illustrated here for selecting a sample from a large data set has nothing random about it. That is, we selected a patterned sample, not a random sample, from a large data set. That's why this section is called Creating Samples, not Creating Random Samples. We'll learn how to select a random sample from a large data set in Stat 482.