18.5 - Creating Samples

Because a DO loop executes statements iteratively, it provides an easy way to select a sample of observations from a large data set. Let's take a look at an example!

Example 18.12

The following program uses an iterative DO loop and the SET statement's POINT= option to select every 100th observation from the permanent data set called stat481.log11, which contains 8,624 observations:

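OPTIONS LS = 72 PS = 34 NODATE NONUMBER;
LIBNAME stat481 'C:\yourdrivename\Stat481WC\06doloops\sasndata';

/* Select every 100th observation from stat481.log11 */
DATA sample;
    DO i = 100 to 8600 by 100;
        set stat481.log11 point = i;   /* read observation number i directly */
        output;                        /* write that observation to sample   */
    END;
    stop;                              /* end the step; POINT= never hits end-of-file */
RUN;

PROC PRINT data = sample NOOBS;
    title 'Subset of Logged Observations for Hospital 11';
RUN;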

Subset of Logged Observations for Hospital 11
  SUBJ    V_TYPE    V_DATE      FORM_CD
110004         0    04/22/93    prior
110027         3    01/25/94    med
110027        36    08/27/96    cmed
110029        12    09/27/94    purg
110029        42    04/01/97    sympts
110039        18    06/06/95    void
110040         1    01/24/94    void
110040        39    02/18/97    cmed
110045        15    05/09/95    symph
110049         0    01/25/94    sympts
110049        30    07/23/96    phytrt
110051        12    12/13/94    void
110052         3    05/10/94    void
110052        55    05/06/97    close
110053        24    02/06/96    cmed
110055         6    08/30/94    sympts
110057         0    03/15/94    preg
110057        27    06/26/96    symph
110058        12    04/11/95    med
110059         0    03/18/94    phs
110059        24    03/19/96    void
110062         0    03/31/94    preg
110062        24    05/14/96    purg
110066         0    04/12/94    purg
110067         3    08/04/94    purg
110068         3    08/30/94    void
110070         3    08/30/94    phytrt
110074         0    06/16/94    urod
110075        15    10/31/95    med
110076         3    10/04/94    void
110077        21    05/10/96    med
110078        12    10/03/95    diet
110080         0    07/07/94    sympts
110080        24    06/25/96    sympts
110081        18    02/09/96    void
110082        12    08/22/95    phytrt
110083         0    02/10/95    ucult
110085         0    10/11/94    phytrt
110086        18    04/30/96    diet
110087         3    05/30/95    phytrt
110088         0    03/07/95    excl2
110091        12    04/02/96    void
110092         6    09/19/95    void
110093        12    03/05/96    med
110094         9    03/26/96    purg
110095         6    12/05/95    phytrt
110096         6    12/19/95    urn
110097        21    03/18/97    med
110100         0    07/14/95    def1
110100        21    04/22/97    void
110104         1    10/23/95    symph
110107         0    09/22/95    urod
110110         0    11/10/95    prior
110111         0    10/17/95    prior
110112        15    01/21/97    phytrt
110114         0    11/10/95    diet
110115         0    12/01/95    preg
110117         0    12/11/95    void
110118        12    01/21/97    purg
110120         0    01/09/96    excl2
110121         9    09/03/96    cmed
110123         0    01/23/96    back
110124         0    02/05/96    urn
110125         9    12/10/96    phytrt
110127         1    03/27/96    purg
110128         6    09/17/96    void
110131         3    06/04/96    med
110134         0    04/15/96    hem
110135         1    05/16/96    med
110136         9    01/21/97    void
110138         6    12/03/96    med
110140         0    05/21/96    void
110142         0    06/04/96    prior
110144         0    06/07/96    hmrpt
110145         6    01/14/97    void
110147         3    09/17/96    void
110149         0    06/28/96    urod
110152         0    07/19/96    incl
110154         0    07/22/96    void
110155         6    01/28/97    qul
110158         0    08/26/96    cmed
110161         0    10/01/96    prior
110163         6    03/18/97    diet
110165         3    01/14/97    cmed
110167         0    11/19/96    purg
110171         0    01/21/97    hem

Let's work our way through the code. The DO statement tells SAS to start at 100, increase i by 100 each time, and end at 8600. That is, SAS will execute the DO loop when the index variable i equals 100, 200, 300, ..., 8600.
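If you'd like to convince yourself of how many times this DO statement iterates, a small sketch like the following can help. It reads no data at all; the counter variable n is a name introduced here purely for illustration, and the result is written to the log rather than to a data set:

```sas
DATA _NULL_;
    n = 0;
    DO i = 100 to 8600 by 100;
        n = n + 1;          /* count each pass through the loop */
    END;
    /* After the loop ends, i has been incremented one step past 8600 */
    put n= i=;              /* writes n=86 i=8700 to the log */
RUN;
```

Notice that the index variable ends up at 8700, not 8600: SAS increments i at the bottom of each iteration and stops only once the new value exceeds the stop value.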

Now the SET statement contains an option that we've not seen before, namely the POINT= option. The POINT= option names a numeric variable (here, i) whose value is the number of the observation to read. It tells SAS not to read the stat481.log11 data set sequentially as is done by default, but rather to read the specified observation directly from the data set. For example, when i = 100, SAS reads the 100th observation in the stat481.log11 data set. And when i = 3200, SAS reads the 3200th observation in the stat481.log11 data set.
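To see direct access in isolation, here is a minimal sketch that pulls just one observation. The data set name one_obs and the variable name obsnum are made up for this illustration, and the sketch assumes the stat481 libref has already been assigned as in the program above:

```sas
DATA one_obs;
    obsnum = 3200;                         /* observation number to fetch */
    set stat481.log11 point = obsnum;      /* jump straight to observation 3200 */
    output;                                /* write it to one_obs */
    stop;                                  /* end the step; no end-of-file is reached */
RUN;
```

One detail worth knowing: the variable named in the POINT= option is treated as temporary, so obsnum (like i in the main example) is not written to the output data set.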

The OUTPUT statement, of course, tells SAS to write to the output data set the observation that has been selected. If we placed the OUTPUT statement after the END statement rather than within the DO loop, the resulting data set would contain only one observation, that is, the last observation read into the program data vector.
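As a quick experiment, you could move the OUTPUT statement so that it sits after the loop. The data set name last_only below is invented for this sketch; everything else mirrors the original program:

```sas
DATA last_only;
    DO i = 100 to 8600 by 100;
        set stat481.log11 point = i;   /* each read overwrites the previous one */
    END;
    output;                            /* executes once, after the loop finishes */
    stop;
RUN;
```

Here the loop still reads all 86 observations, but only the final one (observation 8600) is still sitting in the program data vector when OUTPUT executes, so last_only contains a single observation.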

The STOP statement, which is new to us, is necessary because we are using the POINT= option. As you know, the DATA step by default continues to read observations until it reaches the end-of-file marker in the input data. Because the POINT= option reads only specified observations, SAS cannot read an end-of-file marker as it would if the file were being read sequentially. The STOP statement tells SAS to stop processing the current DATA step immediately and to resume processing statements after the end of the current DATA step. It is the use of the STOP statement, therefore, that keeps us from sending SAS into the no man's land of continuous looping.

Now, download and save the stat481.log11 data set in a convenient location on your computer. Launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Then, run the program and review the output from the PRINT procedure to see the selected observations. You shouldn't be surprised to see that the sample data set contains 86 observations:

PROC PRINT data = sample NOOBS;
NOTE: Writing HTML Body file: sashtml1.htm
       title 'Subset of Logged Observations for Hospital 11';
RUN;
NOTE: There were 86 observations read from the data set WORK.SAMPLE.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.64 seconds
      cpu time            0.29 seconds

as the iterative DO loop executes 8600 divided by 100, or 86 times.
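The stop value of 8600 was hard-coded because we happened to know that the data set contains 8,624 observations. One possible variation, shown here only as a sketch, uses the SET statement's NOBS= option to let SAS supply the number of observations. The variable name total is an assumption made for this illustration:

```sas
DATA sample;
    /* NOBS= is assigned at compile time, so total already holds the */
    /* number of observations when the DO statement first executes  */
    DO i = 100 to total by 100;
        set stat481.log11 point = i nobs = total;
        output;
    END;
    stop;
RUN;
```

With 8,624 observations, the last value of i that does not exceed total is 8600, so this version should again select 86 observations. Like the POINT= variable, the NOBS= variable is temporary and is not written to the sample data set.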

Note! It is important to emphasize that the method we illustrated here for selecting a sample from a large data set has nothing random about it. That is, we selected a patterned sample, not a random sample, from a large data set. That's why this section is called Creating Samples, not Creating Random Samples. We'll learn how to select a random sample from a large data set in Stat 482.