34.2 - Random Sampling with Replacement
In the previous section, all of the samples that we selected were without replacement. That is, once an observation was selected from the data set, it could not be selected again. Now, we'll investigate how to take random samples with replacement. That is, if an observation is selected once, it does not prevent it from being selected again.
The following code illustrates how to use the DATA step to randomly select an exact-sized random sample with replacement. Specifically, the program uses the ranuni function in conjunction with the POINT= option of the SET statement to tell SAS to randomly sample exactly 15 of the 50 observations from the permanent SAS data set mailing:
    DATA sample4A;
       choose=int(ranuni(58)*n)+1;
       set stat482.mailing point=choose nobs=n;
       i+1;
       if i > 15 then stop;
    RUN;

    PROC PRINT data=sample4A;
       title1 'Sample4A: Exact-Sized Unrestricted Random Sample';
       title2 'Selects units with equal probabilities & with replacement';
    RUN;
Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample of 15 observations from the mailing data set.
The key to understanding how this code works is to understand what the expression:
choose = int(ranuni(58)*n) + 1
accomplishes. As you know, ranuni(58) tells SAS to use an initial seed of 58 to generate a uniform random number between 0 and 1. For the sake of example, suppose SAS generates the number 0.99. Then, the value of choose becomes 50 as calculated here:
choose = int(0.99*50) + 1 = int(49.5) + 1 = 49 + 1 = 50
And, if SAS generates the number 0.01, the value of choose becomes 1 as calculated here:
choose = int(0.01*50) + 1 = int(0.5) + 1 = 0 + 1 = 1
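In this way, you can see how the expression always generates a positive integer 1, 2, 3, ..., up to n, the number of observations in your data set. Because ranuni returns a value strictly between 0 and 1, the product ranuni(58)*n is always less than n, so int() returns an integer from 0 to n-1, and adding 1 shifts the result to the range 1 to n. All we need to do then is to tell SAS to generate such a random integer over and over again until we reach our desired sample size.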
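If you'd like to check that claim empirically, here is a minimal sketch. It assumes n = 50 (the size of the mailing data set) and uses a hypothetical data set name, check, and a hypothetical counter, draw, neither of which appears in the lesson's programs. It generates 10,000 values of choose and reports the smallest and largest ones:

```sas
/* Sketch only: check and draw are hypothetical names, not part of the lesson's code. */
DATA check;
   n = 50;                               /* assumed number of observations in mailing */
   do draw = 1 to 10000;
      choose = int(ranuni(58)*n) + 1;    /* same expression as in the sampling program */
      output;                            /* keep every generated value */
   end;
RUN;

/* Report how many values were generated, plus the smallest and largest. */
PROC MEANS data=check n min max;
   var choose;
RUN;
```

The minimum should never fall below 1 and the maximum should never exceed 50, whatever seed you use.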
Here's a summary of the approach:
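- Use the NOBS= option of the SET statement to determine n, the number of observations in the original data set. SAS assigns the value of n when it compiles the DATA step, so n is already available when the choose= assignment statement executes, even though that statement appears before the SET statement.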
- Use the above choose= assignment statement to generate a random integer between 1 and n. (Note that the choose= assignment statement must be placed before the SET statement. If it is not, SAS would not know which observation to read first.)
- Use the POINT= option of the SET statement to select the choose'th observation from the original data set. The POINT= option tells SAS to read the SAS data set using direct access by observation number. In general, with the POINT= option, you name a temporary variable (here, choose) whose value is the number of the observation you want the SET statement to read.
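- Perform the above two steps repeatedly, keeping count of the number of observations selected. The sum statement i + 1 takes care of the counting for us: SAS initializes i to 0 before the first iteration of the DATA step, retains its value from one iteration to the next, and increases i by 1 each time the statement executes. So, i equals 1 after the first observation is read, 2 after the second, and so on.
- Once you've selected the number of observations desired (15, here), tell SAS to STOP. The condition i > 15 first becomes true on the 16th iteration, and the STOP statement ends the DATA step before that 16th observation is written to the output data set, leaving exactly 15. Note that when using the POINT= option, you must use a STOP statement to tell SAS when to stop processing the DATA step, because with direct access SAS never encounters an end-of-file condition.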
That's all there is to it! Again, you might want to change the seed (the 58) and the sample size (the 15) a few times to see how it affects the sample.
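The steps summarized above can also be written with an explicit DO loop, which some readers find easier to follow than the sum statement and the conditional STOP. This is only a sketch of an alternative, not the lesson's program: the data set name sample4C and the variable obsnum are hypothetical. Because SAS does not write the POINT= variable (choose) to the output data set, the sketch copies its value into obsnum so that you can see which observation numbers were drawn, and spot any repeats:

```sas
/* Sketch only: sample4C and obsnum are hypothetical names. */
DATA sample4C;
   do i = 1 to 15;                              /* one pass per selection */
      choose = int(ranuni(58)*n) + 1;           /* random integer from 1 to n */
      set stat482.mailing point=choose nobs=n;  /* read that observation directly */
      obsnum = choose;                          /* keep a copy; choose itself is not output */
      output;                                   /* write the selected observation */
   end;
   stop;                                        /* required with POINT= to end the step */
RUN;

PROC PRINT data=sample4C;
   title1 'Sample4C: DO-loop version (sketch)';
RUN;
```

Here the loop bound plays the role of the counter, and the single unconditional STOP after the loop ends the DATA step once all 15 selections have been written.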
The following code illustrates an alternative way of randomly selecting an exact-sized random sample with replacement. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly 15 of the 50 observations from the permanent SAS data set mailing:
    PROC SURVEYSELECT data = stat482.mailing
                      out = sample4B
                      method = URS
                      seed = 12345
                      sampsize = 15;
       title;
    RUN;

    PROC PRINT data = sample4B;
       title1 'Sample4B: Exact-Sized Unrestricted Random Sample';
       title2 'Selects units with equal probabilities & with replacement';
       title3 '(using PROC SURVEYSELECT)';
    RUN;
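Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample of size 15 from the mailing data set. Because an observation can be selected more than once, the output data set by default contains one row for each distinct observation selected, along with a NumberHits variable that counts how many times it was selected. So, you may see fewer than 15 rows, but the NumberHits values should sum to 15. Note that the only difference between this code and the previous SURVEYSELECT code is that the method = URS option here replaces the method = SRS option there. Here, URS tells SAS to use the unrestricted random sampling method to select observations, that is, with equal probability and with replacement. (Oh, yeah, I guess the specified seed differs from the previous code, too, but that's no matter.)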
Again, you might want to change the seed (seed) and sample size (sampsize) a few times to see how it affects the sample.
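When sampling with replacement, PROC SURVEYSELECT ordinarily lists a repeatedly selected observation just once and tallies its selections. If you would rather have one row per selection, so that the output looks like the DATA step sample above, the OUTHITS option of the PROC SURVEYSELECT statement should do it. Here's a minimal sketch, in which the output data set name sample4B_hits is hypothetical:

```sas
/* Sketch only: sample4B_hits is a hypothetical data set name. */
PROC SURVEYSELECT data = stat482.mailing
                  out = sample4B_hits
                  method = URS        /* unrestricted random sampling: with replacement */
                  seed = 12345
                  sampsize = 15
                  outhits;            /* write a separate row for every selection */
   title;
RUN;

PROC PRINT data = sample4B_hits;
   title1 'Sample4B with OUTHITS: one row per selection (sketch)';
RUN;
```

With OUTHITS specified, an observation that is selected twice appears as two rows in the output data set.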